Why do system-level failures still occur even when fault tolerance techniques are deployed?
From a development perspective, the tight integration of a large number of components creates many potential failure modes that arise from component interactions and that unit testing cannot uncover. This project focuses on identifying system-wide design rules that must be satisfied to limit the propagation of seemingly minor faults throughout the system.
Our objectives in this project are to
- develop a system fault containment and stability management framework
- identify categories of potentially unmanaged faults and their root causes
- develop an analytical approach for examining fault propagation that can lead to system failures
- develop effective system-level fault containment strategies
- specify and validate architecture patterns conducive to robustness and stability in systems
Our approach is to build architectural models using the Architecture Analysis and Design Language (AADL) to identify system fault behaviors that are not addressed by component-fault containment techniques, to develop a formalized analysis framework for system fault containment and stability management, and to validate system architectures in the context of this framework.
Our model-based analytic framework for this investigation includes
- root cause analysis of system-level faults
- analytic exploration of unmanaged faults
- fault-impact analysis and system-level fault containment strategies
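
As a rough illustration of what fault-impact analysis with containment can look like, the sketch below treats an architecture as a directed graph and computes which components a single fault can reach. It is plain Python, not AADL, and it is not the project's framework. The function name `impacted_components`, the `containing` parameter, and the example component names are all hypothetical, chosen only for this illustration.

```python
from collections import deque


def impacted_components(connections, source, containing=frozenset()):
    """Return the set of components a fault at `source` can reach.

    connections: dict mapping a component to the components it feeds.
    containing:  components assumed to mask incoming faults; they are
                 affected themselves but do not pass the fault on.
    """
    reached = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        # A containing component absorbs the fault, unless it is the origin.
        if current in containing and current != source:
            continue
        for neighbor in connections.get(current, ()):
            if neighbor not in reached:
                reached.add(neighbor)
                queue.append(neighbor)
    return reached


# Hypothetical architecture, for illustration only.
connections = {
    "sensor": ["filter"],
    "filter": ["controller", "logger"],
    "controller": ["actuator"],
}

unmanaged = impacted_components(connections, "sensor")
contained = impacted_components(connections, "sensor", containing={"filter"})
```

Comparing `unmanaged` with `contained` shows how a single containment point changes the set of components exposed to a seemingly minor fault. A real analysis would also need to model fault types and timing, which this sketch ignores.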
Read a report (PDF, 688 KB) or a presentation (PDF, 883 KB) on fault propagation and error modeling.