NEWS AT SEI
This article was originally published in News at SEI on: June 1, 2007
As systems become loosely coupled groups of software modules that function independently and are assembled dynamically to exchange information and perform shared services, establishing control boundaries for system and component certification and accreditation becomes increasingly difficult. The same set of interconnected software modules could be implemented within a single operating system or distributed internationally, and connectivity could be peer-to-peer, wireless, or any number of combinations. These varying implementation choices represent drastically different threat environments. Software, however, is not currently built to consider this range of variability.
Increasing system complexity, required system adaptability, and new operational mission needs are driving software and system developers to the expanded use of technologies such as Web services and design approaches such as service-oriented architectures. The new technologies and changing operational environment raise software risks that are not addressed by existing operational risk approaches. Newly developed components join an existing operational environment and must be connected with older technology for operational effectiveness. Too often we see a patchwork of software components stitched together with home-grown solutions. It might work, but with very little or no assurance. It is likely inflexible and potentially unpredictable.
Operational support and sustainment efforts must recognize that the operational environment will grow increasingly fragile: the need for greater flexibility and for solutions that meet immediate needs drives development to field solutions built on newer, less proven tools and techniques that must nonetheless integrate seamlessly with legacy systems. Changes in the operational environment also occur independent of development activities. Continuous hardware and operating system upgrades take place as support for older versions expires. Vulnerability monitoring and incident mitigation introduce changes to infrastructure configurations and components such as firewalls and routers. The growing complexity and increasing interdependency of operational components can overwhelm problem analysis and response mechanisms. As new development joins the implemented environment, limited consideration of operational needs will accentuate these challenges.
The survivability of an operational process (mission thread) depends on the availability of a specific set of functions. In the systems-of-systems environment, those functions can span multiple systems [Maier 96]. Those systems must support multiple mission threads and provide functionality for local processes (local threads), which the stand-alone system was originally designed to address.
Survivability focuses on what can go wrong (the risks), the consequences of those risks on the operational process (mission thread), and the possible mitigation options for those risks. Survivability emphasizes continued operations in the presence of error and recovery following a failure. A response to a failure is not necessarily a system response but may be a change in procedures or an acceptable modification of the mission.
Survivability must account for the complexity of a networked operations environment, which can be distributed globally. Figure 1 presents a simple view of the global infrastructure environment. Figure 2 captures the variances among levels of connectivity, which can be described as strategic, operational, and tactical (see the descriptions in Table 1). Administrative controls are distributed across multiple systems, and there is limited visibility into system behavior beyond the boundaries of each local administration. Global guidance of systems has a role at the strategic level, but at the tactical level, systems must be able to act autonomously because communications and any associated controls are less reliable.
Survivability solutions cannot assume homogeneous configurations. In reality, operational capabilities (including bandwidth and latency) will vary across the strategic, operational, and tactical levels of the organizational infrastructure. Risk-mitigation tactics must also allow for variances in software architecture induced by these differences. Existing systems and components are expected to evolve toward greater interoperability and dependency on shared infrastructure capabilities, reducing their functional independence and increasing the potential for reliability and dependability concerns.
Table 1: Characteristics of the three connectivity levels

Strategic: Relatively dependable network. Web-centric operation is an option (i.e., thin clients with applications executing on the server).

Operational: Noticeable variations in bandwidth and reliability. Modes: may need to shift from thin to thick client if operating conditions deteriorate.

Tactical: May have to resort to ad hoc and peer-to-peer configurations. Thick clients: some applications must run locally with limited or no network access.
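The shift between thin and thick client modes described for the operational and tactical levels can be pictured as a simple selection policy. This is a minimal sketch, not part of any SAF artifact; the threshold values, mode names, and `NetworkConditions` fields are assumptions made for illustration.

```python
from dataclasses import dataclass

# Illustrative thresholds; a real system would tune these against
# measured operational conditions.
LATENCY_LIMIT_MS = 250
BANDWIDTH_FLOOR_KBPS = 64

@dataclass
class NetworkConditions:
    latency_ms: float
    bandwidth_kbps: float
    reachable: bool

def select_client_mode(net: NetworkConditions) -> str:
    """Choose an execution mode as operating conditions deteriorate."""
    if not net.reachable:
        # No network at all: thick client running everything locally.
        return "thick-local"
    if net.latency_ms > LATENCY_LIMIT_MS or net.bandwidth_kbps < BANDWIDTH_FLOOR_KBPS:
        # Degraded link: shift work from the server to the client.
        return "thick"
    # Dependable network: thin client, applications execute on the server.
    return "thin"
```

The point of the sketch is that the mode decision must be made locally and continuously, since tactical-level nodes cannot rely on a remote controller to make it for them.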
Mission-thread requirements crossing system boundaries (joint among systems) will compete for resources with system-provided functions. SoS mission-thread execution will compete with local threads and other multi-system threads assigned to the same system for priority, bandwidth, and other resources (as shown in Figure 3).
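The contention described above can be illustrated with a toy allocation policy: a fixed resource (say, bandwidth on one system) is divided among an SoS mission thread, local threads, and other joint threads according to priority. The priority-proportional policy, thread names, and numbers are illustrative assumptions, not a mechanism taken from the text.

```python
def allocate(capacity, demands):
    """Divide a fixed capacity among competing threads.

    demands maps thread name -> (priority, requested amount);
    returns thread name -> granted amount. Each thread gets at most
    its priority-proportional fair share of the capacity.
    """
    total_priority = sum(priority for priority, _ in demands.values())
    grants = {}
    for thread, (priority, requested) in demands.items():
        fair_share = capacity * priority / total_priority
        # A thread never receives more than it asked for.
        grants[thread] = min(requested, fair_share)
    return grants
```

Even this toy policy shows the tension: raising the SoS thread's priority to protect a joint mission necessarily squeezes the local threads the system was originally designed to serve.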
Software-intensive, complex, human-machine systems require distributed decision making across both physical and organizational boundaries. The combination of expanding interoperability, multilevel requirements, and multiple control points creates a highly complex operational environment in which system behavior can be difficult to predict. An SoS mission thread requires reliability and dependability for individual components and solid end-to-end engineering so that the integration of those components ensures the survivability of the thread.
Seamless integration among participating systems in cyberspace will require operational support and sustainment efforts that can recognize unacceptable conditions, resource contention among participants, and changes in resources, and that have options for responding to the potential bottlenecks affecting mission success. Both monitoring and response needs must be considered as systems and software components are designed and built, since the controlling mechanisms of the network communication structure provide only a limited range of integration control in an SoS environment.
There is ample evidence that we are already reaching the limits of our engineering and testing practices, even in today's less dynamic environments [Howard 06]. Unfortunately, systems engineered to work together commonly fail to produce the desired joint outcome because all circumstances in the operational environment could not be anticipated. In fact, even system upgrades can break existing interoperability among partnering systems.
Operational survivability, both today and in the future, can also be affected in the following ways:
Thus, the future systems-of-systems infrastructure will fundamentally alter the relationship between system components. Each component will know far less about the time, reason, and environmental conditions in which it is invoked. Components must assume that errors are occurring. To protect itself, and to continue to execute its missions, a component within the infrastructure must adopt a defensive posture against a wide range of potential complications (or stresses) that were most likely not predicted during its development. Survivability of the mission threads will depend on how components manage these stresses.
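The defensive posture described above can be sketched as a wrapper that refuses to trust its inputs or its environment and degrades gracefully when an unpredicted stress occurs. This is a minimal illustration; the function names and the validate-then-fall-back policy are assumptions, not a prescription from SAF.

```python
def defensive_invoke(step_fn, payload, *, validator, fallback):
    """Invoke a component function while assuming errors are occurring.

    The component knows little about when, why, or by whom it is
    invoked, so it validates what it receives and supplies a degraded
    result rather than propagating a failure downstream.
    """
    try:
        if not validator(payload):
            # Input from an unknown caller failed validation: degrade
            # gracefully instead of passing bad data along the thread.
            return fallback(payload, reason="invalid-input")
        return step_fn(payload)
    except Exception as exc:
        # An unpredicted stress arose during execution; survive it.
        return fallback(payload, reason=f"execution-error: {exc}")
```

The design choice worth noting is that the response to failure lives with the component itself, mirroring the article's point that tactical-level components cannot count on external controls.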
The Survivability Analysis Framework (SAF)1 was developed to help organizations analyze and understand threats to, and gaps in, the survivability of operational mission threads within an SoS. A report describing the details of SAF, along with examples, will be published later this year. In the meantime, this article provides a summary to prepare readers for that document.
SAF is designed to address the following:
Each critical step in a mission thread is tasked to fulfill some portion of mission-thread functionality. This tasking represents a contract of interaction between the mission-thread step and prior and subsequent steps. Pre-conditions establish the information provided to the step. Pre-conditions may trigger the execution of a step (e.g., data or a human command), or the process may be continually executed (e.g., a sensor). Each step will have outcomes (post-conditions) that may interact with subsequent steps. However, the contract with prior and subsequent steps is not necessarily static and may have to be negotiated at run time to reflect the current situation. Even the identity of prior and subsequent steps may vary across executions of the mission thread.2
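One way to picture the contract of interaction is as a pair of condition sets: what a step requires (its pre-conditions) and what it provides (its post-conditions), with run-time renegotiation when a prior step cannot meet a requirement. The sketch below, including its class and field names and the renegotiation policy, is an illustrative assumption rather than an SAF artifact.

```python
from dataclasses import dataclass, field

@dataclass
class StepContract:
    provides: set = field(default_factory=set)  # post-conditions of this step
    requires: set = field(default_factory=set)  # pre-conditions of this step

def compatible(prior: StepContract, nxt: StepContract) -> bool:
    """A step can execute if its pre-conditions are satisfied by the
    prior step's post-conditions."""
    return nxt.requires <= prior.provides

def renegotiate(prior: StepContract, nxt: StepContract) -> StepContract:
    """Adjust the contract to the current situation at run time: drop
    requirements the prior step cannot meet, leaving them to be handled
    in a degraded mode."""
    unmet = nxt.requires - prior.provides
    return StepContract(provides=nxt.provides, requires=nxt.requires - unmet)
```

The renegotiated contract makes explicit that what a step receives on one execution of the thread may differ from what it receives on the next.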
Environmental, data, process, and interaction limitations can lead to potential breakdown of a step. Each limitation represents a source or type of stress on the step and, consequently, on the mission thread. However, such stress does not necessarily cause failure. Steps can be designed to manage a range of stresses and still respond appropriately or degrade gracefully. Additionally, the failure of any specific step may not necessarily doom a mission, because subsequent steps may continue to execute the thread.
Linkages among steps are driven by three primary components: people, resources (technology, systems, connectivity, etc.), and interactions (e.g., data exchange). The behavior of the linkages, coupled with the activities to be addressed in each step, can lead to stresses. Unmanaged stresses can potentially lead to complete failure. The mission is also likely to fail if a step manages a stress in a manner incompatible with subsequent steps. For example, consider a step that receives some data as input. If the value received by the step is out of the expected range, the step can respond in a variety of ways; for instance, it might substitute a default value in place of the out-of-range value. This substitution, however, may have dire consequences if the decision to manage the stress by substituting a default value is inconsistent with a subsequent step's expectation of a highly accurate value.
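The default-substitution example can be made concrete: the upstream step manages an out-of-range reading locally, while the downstream step's accuracy expectation is silently violated. The range, default, and tolerance values below are illustrative assumptions.

```python
# Hypothetical sensor range and default, chosen only for this example.
VALID_RANGE = (0.0, 100.0)
DEFAULT_VALUE = 50.0

def upstream_step(reading: float) -> float:
    """Manage the stress of an out-of-range reading by substitution."""
    low, high = VALID_RANGE
    if low <= reading <= high:
        return reading
    # Locally reasonable stress handling -- but it hides the error
    # from every subsequent step in the thread.
    return DEFAULT_VALUE

def downstream_step(value: float, true_value: float, tolerance: float = 1.0) -> float:
    """Expects a highly accurate value; substitution upstream may
    silently violate this expectation."""
    if abs(value - true_value) > tolerance:
        raise ValueError("accuracy contract violated by upstream stress handling")
    return value
```

When the reading is in range the thread succeeds; when the substituted default reaches the accuracy-sensitive step, the mission fails even though each step behaved "correctly" by its own local rules.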
SAF captures for analysis the ways in which selected stresses are handled at critical mission thread steps. It also analyzes whether the stress-handling approaches adopted by a step are compatible with subsequent mission thread steps. SAF consists of two component groups: (1) three matrices that capture stresses on a step and potential mechanisms for managing these stresses; and (2) a process for applying the matrices to a joint mission thread.
To begin the process, select a business-process (mission-thread) scenario with sufficient complexity to cross a range of organizational and technical options useful for analysis. Establish appropriate completion criteria for the scenario to represent organizational success. Decompose the scenario into a specific sequence of end-to-end steps (unique activities) that must be performed to reach the success goals.
Across the range of mission steps, assemble a matrix for each of the following:
Using the three matrices, apply failure analysis techniques to identify potential points of failure that would critically impact successful completion of the mission thread [Alberts 05, Stamatis 03, Woody 07].
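One familiar instance of such failure analysis is FMEA-style scoring [Stamatis 03], in which each (step, stress) cell of a matrix is rated for severity, likelihood, and detectability, and high-scoring cells are flagged as potential points of failure. The sketch below assumes that style of analysis; the step names, stress names, scores, and threshold are invented for illustration.

```python
def risk_priority(cell):
    """FMEA-style risk priority number: severity x likelihood x detectability."""
    return cell["severity"] * cell["likelihood"] * cell["detectability"]

def critical_cells(matrix, threshold=100):
    """Return (step, stress) pairs whose score critically threatens the
    mission thread, highest risk first."""
    return sorted(
        (key for key, cell in matrix.items() if risk_priority(cell) >= threshold),
        key=lambda key: -risk_priority(matrix[key]))

# Hypothetical matrix for a time-sensitive-targeting-like thread.
matrix = {
    ("receive-tasking", "link-loss"):  {"severity": 9, "likelihood": 4, "detectability": 5},
    ("fuse-tracks", "stale-data"):     {"severity": 7, "likelihood": 3, "detectability": 2},
    ("release-order", "auth-timeout"): {"severity": 10, "likelihood": 2, "detectability": 6},
}
```

In an actual SAF application the cells would come from the assembled matrices for the mission thread, not from invented numbers, and the flagged cells would drive the choice of mitigation options.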
For SAF to begin to realize its potential, a broader range of operational mission threads must be analyzed with it. From a larger body of mission-thread and development-assessment examples, patterns of effectiveness can be identified and the framework refined.
Alberts, C. & Dorofee, A. Managing Information Security Risks: The OCTAVE Approach. Boston, MA: Addison-Wesley, 2003.
Alberts, C. & Dorofee, A. Mission Assurance Analysis Protocol (MAAP): Assessing Risk in Complex Environments (CMU/SEI-2005-TN-032). Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, 2005.
Howard, M. & Lipner, S. The Security Development Lifecycle (SDL): A Process for Developing Demonstrably More Secure Software. Redmond, WA: Microsoft Press, 2006.
Leveson, N. "A Systems-Theoretic Approach to Safety in Software-Intensive Systems." IEEE Transactions on Dependable and Secure Computing 1, 1 (January-March 2004): 66-86.
Maier, M. "Architecting Principles for Systems of Systems," 567-574. Proceedings of the Sixth Annual International Symposium of the International Council on Systems Engineering. Boston, MA, 1996. www.infoed.com/Open/PAPERS/systems.htm.
Stamatis, D. H. Failure Mode and Effect Analysis: FMEA from Theory to Execution, 2nd ed. Milwaukee, WI: ASQ Quality Press, 2003.
Woody, C. & Alberts, C. "Considering Operational Security Risk during System Development." IEEE Security & Privacy 5, 1 (January/February 2007): 30-35.
1 SAF was piloted for the U.S. Department of Defense, Joint Battle Mission Command and Control (JBMC2) in analysis of a time-sensitive-targeting mission thread for the Office of the Undersecretary of Defense Acquisition & Logistics (OUSD/AT&L). A second pilot analysis was completed for time-sensitive-targeting information assurance for the U.S. Department of Defense, Electronic Systems Center, Cryptologic Systems Group, and Network Systems Division (ESC/CPSG NIS).
2 Mission threads are expected to be dynamic in content because each specific mission is unique.
The views expressed in this article are the author's only and do not represent directly or imply any official position or view of the Software Engineering Institute or Carnegie Mellon University. This article is intended to stimulate further discussion about this topic.