NEWS AT SEI
This article was originally published in News at SEI on: June 1, 2006
In Issue 4 of 2005, news@sei presented a feature article about a study that the SEI was conducting on behalf of the Office of the Assistant Secretary of the Army (Acquisition, Logistics, & Technology). The study brought together experts from within and outside the field of software engineering and from a variety of institutions and organizations to consider the challenges of ultra-large-scale (ULS) systems: given the issues with today’s software engineering, how can we build the systems of the future that are likely to have billions of lines of code? Although a billion lines of code was the initial challenge, increased code size brings with it increased scale in many dimensions, posing challenges that strain current software foundations.
Ultra-Large-Scale Systems: The Software Challenge of the Future (ISBN 0-9786956-0-7) is the product of this 12-month study. The report, available on the Web, details a broad, multi-disciplinary research agenda for developing the ultra-large-scale systems of the future. The principal team of authors who wrote the report consists of Linda Northrop (study lead), Peter Feiler, John Goodenough, Rick Linger, Tom Longstaff, Rick Kazman, Mark Klein, and Kurt Wallnau from the SEI, along with Richard P. Gabriel, Sun Microsystems, Inc.; Douglas Schmidt, Vanderbilt University; and Kevin Sullivan, University of Virginia.
“The DoD has a goal of information dominance,” says Northrop, director of the Product Line Systems Program at the SEI. “Achieving this goal depends on the availability of increasingly complex systems characterized by thousands of platforms, sensors, decision nodes, weapons, and users, connected through heterogeneous wired and wireless networks. These systems will be ULS systems. Although they will comprise far more than just software,” says Northrop, “it is software that fundamentally will make possible the achievement of the DoD’s goal.
“Yet software is the least well understood and the most problematic element of our largest systems today. Our current understanding of software and our software-development practices will not meet the demands of the future. To make significant progress in the size and complexity of systems that can be built and deployed successfully, we require a culture shift. In this report, we identify the kinds of research that will effect such a culture shift. The United States needs a program that will fund this software research required to sustain ongoing transformations in national defense and global interdependence. The report provides the starting point for the path ahead.”
In this article, five members of the author team discuss the report and its significance. Participants are
- Richard P. Gabriel, distinguished engineer, Sun Microsystems, Inc.
- John B. Goodenough, SEI fellow and senior member of the technical staff, SEI
- Thomas A. Longstaff, deputy director for technology, CERT Program, SEI
- Douglas Schmidt, associate chair of computer science and engineering, Vanderbilt University
- Kevin Sullivan, associate professor and Virginia Engineering Foundation Faculty Fellow in Computer Science, University of Virginia
Pollak: Please describe the problems that the team considered and why these problems are important or significant.
Longstaff: The problem we addressed is the need to improve the science of scaling up our ability to develop ULS systems. Scale in itself is a difficult concept to define. Our initial investigation indicated that more work was needed by a broader group of people to think this problem through and define what science is needed. What eventually came out are the seven research areas that are described in the report. (Note: These are briefly described in the executive summary.)
Schmidt: The basic problem we have today in the DoD is that systems are increasingly getting integrated out of many parts provided by many different suppliers—the common term for this is “systems of systems.” This problem has been challenging for the past five years or so and is likely to get harder and harder in the future. Examples of systems that already have this problem are Future Combat Systems (FCS, Army), Global Information Grid (GIG, Air Force), and FORCENet (which is largely Navy driven). There is also an effort to wire all the services together, which is even harder.
The challenge that we were given is this: what do we do when we have even more ambitious requirements in the future? Even FCS consists of no more than 30-35 million lines of code. We were given the charter to understand what the software-technology issues would be for systems that were a billion lines of code or larger. Lines of code per se are not so important; the idea is that these systems are much larger than systems we have to deal with today, when even today’s systems are stretching what we can do from a technology and management standpoint. Our report is an effort to look ahead to what might happen in the next five years when the systems we are dealing with will get larger by an order of magnitude or two.
Goodenough: We’re talking about a change in the way of thinking about how systems are developed. We try to build current large-scale systems today in the same way that we did smaller systems. The ULS study focused on the fact that techniques that were appropriate for smaller scale systems are not appropriate for larger systems—we need a new way of thinking about things. It’s like being out hiking in the desert and bringing along some water, but not knowing how much water you need, because you are just thinking of it as being a hot day. But the low humidity in the desert means that you have new conditions to deal with, so you have to take a different approach or you won’t be successful.
The point we’re making is not simply that ULS systems are different, it’s that the assumptions that we usually make when we create systems don’t apply. The mental model of how we build systems today is that, if we work hard enough, we can create a system that is close enough to perfection that it’s good. But in ULS systems, we have to account for the idea that the systems are so large that failures are normal, so we have to plan ahead for dealing with the failures that will occur. With a ULS system, we’ll never get it close to perfection, because when we get it to where it’s working, we’ll find out that it needs to do something different, and it will have to evolve in another direction. Or it will have certain failure characteristics that we’ll have to cope with in different ways.
We don’t think of systems today in which the evolution of the system is guided by the people who are actually using it. That’s what happens in the Internet, but DoD systems are not procured in the way that the Internet was developed. For ULS systems, though, we need to have more of an Internet-development philosophy. Different units will find different ways of using the system and will somehow define their own capabilities, which will fall within the general flow of the ULS system. We talk today about “building” systems. In the world of ULS systems, we’ll talk about “evolving” systems. We’ll never start with a new system, we’ll start with an existing system that moves on in another direction. This is not our current mental model. When we start talking about evolving, the nature of the problems we need to solve and the appropriate solution techniques change. That’s what this study is all about: how we need to change what we think the problems are and how we need to change our solution approaches, because the old solution approaches are applicable to the old problems but not to the new problems. And we don’t know what the new solution approaches are. When we decide that we can’t muddle through any longer, that requires a new set of software engineering techniques that only a few groups of people here and there in the industry are thinking about. We need to stimulate and nourish these groups of people so that these ways of dealing with systems become more vibrant and produce results that can be applied on a broader scale.
Gabriel: The problem we addressed was how to build things at much larger scale. We build medium-to-large systems today through planning and trying to define and understand requirements. For ULS systems, it’s obvious that we can’t do that. So the challenge is, how do you get people who don’t know each other to collaborate to put something together that’s coherent, and does that imply that some parts of the system are so large that they can’t be designed by people but must be designed by other software? This is important because, as we learn more about how to build software and have more capabilities we need to build in, we’ll be making these systems whether we want to or not. What we do now is the Dorothy technique: clicking our heels together and hoping for the best. If you look at systems that are built today, they don’t exactly do what people wanted; once people see and experience the system’s capabilities, they say, “Oh, that’s not what I meant.” Or “Gee, I wish it would do this other thing.” So we end up iterating. Software companies today are always doing some sort of iterative, incremental evolution of requirements, design, and implementation. The industry does that, and military contractors do that.
Sullivan: Computer science has enabled the development of truly remarkable kinds of systems, from music downloads to global information search to weather prediction to business management on a global scale. At the same time, the revolution in computing has only begun. We’ve seen the revolution in desktop computing and networking of desktop machines, but we’re now looking at a new world of computing systems that are much more deeply embedded into the fabric of our lives. Our experience to date with the kinds of systems we’ve built until now has shown us that the primary impediment to the realization of these new systems is our inadequate understanding of how to construct the software components of these systems so that they do what they’re supposed to do in a way that is secure and dependable. But many decades of research in software engineering and computer science haven’t yet resolved some of the fundamental difficulties that we face in building, validating, and verifying complex pieces of software.
We believe that the most important impediment to building next-generation systems is the software. We observed in the report that we have taken a traditional engineering viewpoint on software for a number of decades. In considering the question of what are the big breakthroughs that are needed, we decided that it might be necessary to go beyond the traditional engineering perspective—from centralized systems to highly decentralized systems. We see some of this in the Internet today, with emerging classes of systems. So we looked at alternative metaphors for the construction of these systems—not just an engineering metaphor anymore, but an industry-structure metaphor or an economic metaphor or a complex-adaptive-systems metaphor or a biologically-inspired-systems metaphor. We’re also moving from a metaphor of control to a metaphor of enablement, from a Tayloresque notion of control of engineering processes to a decentralized notion of regulation and inducement. We’re thinking about producing the kinds of systems that are needed through a much more exploratory and emergent kind of process than has been common in the past.
Pollak: What approach did you use to collaborate in investigating the challenges of ULS systems?
Longstaff: We took a unique approach, which was not to pull together a group of like-minded computer scientists, but to bring together people with a variety of backgrounds and disciplines. So for example, we brought in people who understand the behavioral side of the use of computer systems, the history of the security of systems, the formal development and structuring of software algorithms, systems of systems and DoD work in this area, fundamental language issues in developing software, and the culture behind the development of computer science projects and systems.
The risk in bringing this diverse group together was that we wouldn’t have a common basis for communication. But the group did an excellent job of defining the challenges and looking at the implications of the challenges in each of their independent areas. By keeping a relatively neutral focus in the writing team, we were able to create a consistent report written by a few individuals while still retaining the broad perspective of all the people who were in the original meeting.
Schmidt: What was different in this group was its multidisciplinary nature. We had people who provided an economics and management perspective as well as people with expertise in other parts of computing such as human-computer interaction, quality issues and statistics, programming languages, system modularity, design, and software engineering. There was an attempt from the beginning to be interdisciplinary and to include both people from traditional software engineering and people with a different perspective.
Typically with these kinds of studies, people come up with solutions that are purely technology driven, recommending, for example, a better network, operating system, distributed system, middleware, or language. We have these kinds of recommendations in our conclusions and certainly don’t ignore them, but the results are much broader. They include management-policy issues, human-computer interactions, the human dimension of computing, how to get stakeholders and users involved, and the economic dimensions. And these issues are not only identified in their own right, but they are woven through parts of the more conventional technology dimensions.
Goodenough: The assembly included people who are not traditionally invited into software engineering events. It was a diversity reflecting the idea that we need new perspectives, and if you need new perspectives, you can’t go to the same people you always go to. This was part of the original charge to the SEI: to go outside the usual boundaries and to think innovatively. It was Linda’s idea to identify people who were outside the box but who at least knew what the box was so they could contribute in a reasonable way. We got lots of good ideas from such people in the initial workshop, and then the writing team helped to pull things together. We needed people who were able to reduce innovative ideas to understandable and believable prose. We had some incredible writers, and we threw away some incredible amounts of good prose in order to get a report that was succinct.
Gabriel: I was very happy after the first meeting. I knew right then that this could be a significant event in software. I was surprised at how well run the first meeting was—it was very well planned and facilitated. The SEI team didn’t have preconceptions or try to impose expectations about what the ideas would be. The set of people who participated in the first meeting was good, and the subset of people on the writing team was outstanding.
Sullivan: The first important thing that happened was that Linda found a group of people who tend to think beyond the current state of the art: people who were not primarily incremental thinkers. The multidisciplinary nature of the team was important too. We had people with expertise in ethnographics, reliability, software architecture, and financial economics (a variety of different perspectives), as well as people within the computer science field who are sympathetic to the need to reach out and exploit, import, adapt, and leverage theories and thinking from other disciplines.
Once we got those people together, the next step was to explore, to the extent possible, the range of issues, possibilities, and dimensions in which we might formulate a new kind of research agenda. The challenge for the writing team was then to reconcile, integrate, refine, reduce, and focus the key intellectual contributions made by the broader group and bring them together into a coherent presentation. In doing that, some additional insights were reached on how to put an overarching structure on the story.
Pollak: What key insights about ULS systems emerged from the study?
Longstaff: The most groundbreaking insight is the prominence of human issues with regard to ULS systems. This captures up-front attention in the report and glues together what is said later about the underlying computer science. Normally, you tend to shunt those issues off until after you develop the underlying technology. A key insight of the whole report is that ULS systems can’t be successful without thinking of the human aspects not only of the computer system but also of the surrounding infrastructure of that system: how people will interact with the development of the system, with specification and design, with evolution. That’s different from other reports that you might read.
Schmidt: A key insight is the need to be able to engage various stakeholders up and down different levels of the echelon—to get, for example, the people in the government policy-making realm to work with the people involved with the actual production of the software, those doing the R&D, and those doing the actual deployment and operation of the system. Also key is the ability to come up with a research agenda that spans different levels of the chain of command and different sectors of the industry. We characterize this in the report as a “socio-technical ecosystem,” and this represents a novel insight about how to organize the research agenda.
Another important result was the mapping of the DoD mission and capabilities called out in the Quadrennial Defense Review Report (February 2006) to research breakthroughs, needs, and approaches in the research tracks. In the report, we provide mappings and a way forward that would be relevant both to policy makers in the DoD and to technologists; that combination is rare. It’s a multidimensional roadmap that enables different stakeholders and different people who need work done to take different paths and to use this report for their own purposes.
Goodenough: The key insight for me is the idea of thinking of ULS systems as ecosystems in which the behavior of participants is governed by rules, similar to the way zoning laws in a city help to shape the way the city grows. We were able to consider what the equivalents of zoning laws and codes of behavior, which make life in a city acceptable, are in the ULS system domain. With ULS systems, we’re going to talk about metrics of effectiveness that are more like the gross national product: not something that is an aggregate of a lot of individual test-point probes, but a sampling of what’s going on in the system that gives insight into whether the overall system is tending toward viability or degradation. People who are running Google or the AOL mail system today use these kinds of probes to evaluate the health of their systems. The kinds of problems they are solving are perhaps more focused than the kinds of problems that we want the ULS systems to solve, but these techniques are hints of how we will need to deal with the ULS systems of the future. So we’re not talking about radically inventing something that hasn’t existed somewhere in some shape before; we’re talking about focusing attention on those solutions that already exist and that can be evolved into more powerful solutions that can be generalized to work at the ULS system scale.
Gabriel: The key insights for me are: identification of different viewpoints for looking at these systems and how to operate them; the various metaphors that we used to understand ULS systems, such as the city metaphor and the ecosystem metaphor; the realization that the development process has to be more like a complex adaptive system than like a great brain trust that designs the system; and the realization that other disciplines can teach us things that will help us build and understand these systems and understand the organizations and the people that put them together, in order to be more effective.
Sullivan: The key insights correspond to the research areas that we identified in the report: human interaction; computational emergence; design; computational engineering; adaptive system infrastructure; adaptable and predictable system quality; and policy, acquisition, and management. These are the crucial areas in which we need much better understanding. I also think it is absolutely true that we need to consider people and organizations and economics, and that whole social aspect of computing systems more thoroughly. The challenges of ULS systems are difficult to handle entirely within the computer science field, and this insight dictates new kinds of collaborations between computer scientists and social scientists.
Pollak: After the release of the report, what do you hope will happen in the future?
Longstaff: I hope that senior leadership in the DoD will see what is possible with these ULS systems and will understand that in order to really achieve what we want, significant effort will have to be put into developing a different approach to assembling large-scale systems. I hope leadership will see the vision of the report and be captured by its possibilities, and that the implementation of any science agenda will be aligned with that vision.
Schmidt: Strategically, I hope that we can use this report as a way to help revitalize and create incentives for more technical work in these areas, as the starting point for getting things moving. From a more tactical point of view, the report makes it possible to identify the most promising or interesting research to investigate if it’s possible to make progress in these areas and to help continue to flesh out the research agenda defined in the report. The research areas that we identified need further prototyping and experimentation so that we can have confidence that proposed solutions will be technically sound.
Goodenough: My hope is that this study will stimulate a whole different focus of thinking about software engineering problems. Our challenge of assumptions in the report will begin to enter other people’s thinking. What happens, for example, if we don’t know exactly what a system is supposed to do, but we still need to validate the system and ensure that it’s safe? How do we reason that it’s sufficiently safe or functional when we don’t know as much about the system as we usually expect to know? We may appear to be asking a lot of questions that seem almost impossible or silly, but by asking them we begin to find new approaches that begin to address problems. My hope is that the solution that emerges from this work has some of the characteristics of a new paradigm of software engineering.
Gabriel: I hope that researchers in universities and industrial labs will take this seriously and that funding will come through to focus on it. A realistic hope is that it will cause enough of a stir that research in the software world moves enough in this direction that we’ll be able to make some progress. People are already researching these ideas, but these people are today considered on the fringe. I hope that such people will now be considered slightly more toward the mainstream, more legitimate.
Sullivan: One of the most important statements in the report is that software is both the key enabler and the key impediment to progress. There simply is no way that industry is going to solve these fundamental problems in basic science and engineering on its own given its short-term and relatively low-technical-risk approach. It’s also unlikely, in my judgment, that the software engineering research community by itself will be able to make the progress that’s needed, especially under the present conditions of woefully inadequate research funding and with its relatively stable perspective on how to proceed. We need to do something really big, really innovative, and well informed. There really is a crying need for a visionary and substantial new emphasis on research in fundamental issues in software. I would like the DoD and other agencies to get that message loud and clear. My hope would be that the right people at the executive level in government and industry get this message and understand that if they do something about it, it will make a world of difference in their own futures.