Software Engineering Institute Carnegie Mellon

William G. Wood

RSS

Technical Report
CMU/SEI-2007-TR-005

February 2007

 

Software Architecture Technology Initiative
Unlimited distribution subject to the copyright.

 

 

Abstract   1 Introduction   2 System Definition   3 Applying ADD   4 ADD Second Iteration
5 Summary of the Architecture   6 Comments on the Method   7 Conclusion
Glossary   Bibliography   PDF Download

1 Introduction

This report describes the practical application of the Attribute-Driven Design (ADD) method developed by the Carnegie Mellon Software Engineering Institute (SEI). The ADD method is an approach to defining a software architecture in which the design process is based on the quality attribute requirements the software must fulfill. ADD follows a recursive process that decomposes a system or system element by applying architectural tactics and patterns that satisfy its driving quality attribute requirements.

The example in this report applies ADD to a client-server system to satisfy several architectural drivers, such as functional requirements, design constraints, and quality attribute requirements. In particular, this example focuses on selecting patterns to satisfy typical availability requirements for fault tolerance. The design concerns and patterns presented in this report--as well as the models used to determine whether the architecture satisfies the architectural drivers--can be applied in general to include fault tolerance in a system. Most of the reasoning used throughout the design process is pragmatic and models how an experienced architect works.

1.1 Summary of ADD

This example follows the most current version of the ADD method as described in the companion technical report, Attribute-Driven Design (ADD) Version 2.0 [Wojcik 2006].1 The eight steps of the ADD method are shown in Figure 1.

Each step in the method is described in Section 4 of this report, "ADD Second Iteration." However, the method can be summarized as follows:

Figure 1: Steps of ADD2

1.2 Example Process

In the design of large-scale systems, many design activities occur in parallel with each parallel activity developing a portion of the architecture. These separate design pieces must be merged (and possibly changed) to create a consistent and complete architecture that meets all the stakeholder requirements.

In our example, we assume that the architecture team has already conducted the first iteration of ADD and developed an overview of the architecture. The results of that first iteration include placeholders for "fault-tolerance services" and other services such as "start-up services" and "persistent storage services." In Section 2 of this report, we sketch out the first iteration results to help explain the resulting architecture.

For the second iteration, the architecture team assigns the "fault-tolerance services" design requirement to a fault-tolerance expert for further refinement. This iteration, which focuses on the "fault-tolerance services" system element, is discussed in detail in Section 4. In that section, we also review the different fault-tolerance concerns, alternative design patterns, and the reasoning used to select among the alternatives to satisfy the architectural drivers. In this selection process, we must satisfy an end-to-end timing deadline starting with the failure and ending with a fully recovered system. To this end, we build a timing model to use as an analysis tool. Recall that we are drafting an architecture that satisfies the requirement that an end-to-end timing scenario be met when a fault occurs; however, we are not trying to build a complete architecture at this stage. The results of this iteration will be merged with the designs of other parallel activities.

The fault-tolerance expert is given some leeway in not satisfying all the architectural drivers. The expert may return to the architecture team and request relief from specific architectural drivers, if they force the solution space to use complex patterns that are difficult to implement. The architecture team has the opportunity to revisit some of the requirements given to the fault-tolerance expert, make changes, and allow the expert to find a more reasonable solution.

When fault tolerance is being addressed, existing services may be used in some cases, such as a health monitoring or warm restart service using a proxy. These on-hand services can become design constraints and cause the architect to make fewer design choices. In our example, however, we include the full range of design choices.

Overall, our approach to introducing fault tolerance is general and may be used as a template for designing fault-tolerance architectures.

 

 

Abstract   1 Introduction   2 System Definition   3 Applying ADD   4 ADD Second Iteration
5 Summary of the Architecture   6 Comments on the Method   7 Conclusion
Glossary   Bibliography   PDF Download

2 System Definition

This section describes the basic client-server system in our example. We are designing its architecture in terms of three architecture requirements: functional requirements, design constraints, and quality attribute requirements.

2.1 Functional Requirements

Figure 2 depicts a functional overview of our client-server example.

Figure 2: Functional Overview

The Track Manager provides a tracking service for two types of clients:

2.2 Design Constraints

Three design constraints are required:

  1. capacity restrictions: The provided processors shall have 50% spare processor and memory capacity on delivery, and the local area network (LAN) has 50% spare throughput capability. There are 100 update clients and 25 query clients. For the purposes of timing estimates, assume that there are 100 updates and 5 queries per second.
  2. persistent storage service: This service will maintain a copy of state that is checked at least once per minute by the Track Manager. If all replicas of the Track Manager fail, a restart can begin from the checkpoint file.
  3. two replicas: To satisfy the availability and reliability requirements, a Reliability, Availability, and Maintainability (RMA) study has been conducted, and the Track Manager and persistent storage elements shall all have two replicas operating during normal circumstances.

2.3 Quality Attribute Requirements

The system stakeholders agree on three quality attribute scenarios that describe the various system responses to failures. These scenarios are described in Tables 1-3.

Table 1: Quality Attribute Scenario 1: Quick Recovery

Element

Statement

Stimulus

A Track Manager software or hardware component fails.

Stimulus source

A fault occurs in a Track Manager software or hardware component.

Environment

Many software clients are using this service. At the time of failure, the component may be servicing a number of clients concurrently with other queued requests.

Artifact

Track Manager

Response

All query requests made by clients before and during the failure must be honored.

Update service requests can be ignored for up to two seconds without noticeable loss of accuracy.

Response measure

The secondary replica must be promoted to primary and start processing update requests within two seconds of the occurrence of a fault.

Any query responses that are underway (or made near the failure time) must be responded to within three seconds of additional time (on average).

Table 2: Quality Attribute Scenario 2: Slow Recovery

Element

Statement

Stimulus

A Track Manager hardware or software component fails when no backup service is available.

Stimulus source

An error occurs in a Track Manager software or hardware component.

Environment

A single copy of the Track Manager is providing services and it fails.

A spare processor is available that does not contain a copy of this component.

A copy of the component is available on persistent storage and can be transferred to the spare processor via the LAN.

Artifact

Track Manager

Response

The clients are informed that the service has become
unavailable.

A new copy of the service is started and becomes operational. The state of the component on restart may differ from that of the failed component but by no more than one minute.

The clients are informed that the service is available to receive update signals.

For some tracks, the new updates can be automatically correlated to the old tracks. For others, an administrator assists in this correlation. New tracks are started when necessary.

The clients are then informed that the service is available for new queries.

Response measure

The new copy is available within three minutes.

Table 3: Quality Attribute Scenario 3: Restart

Element

Statement

Stimulus

A new replica is started as the standby.

Stimulus source

The system resource manager starts the standby.

Environment

A single replica is servicing requests for service under
normal conditions. No other replica is present.

Artifact

New replica of the Track Manager

Response

The initialization of the new replica has a transient impact on service requests that lasts for less than two seconds.

Response measure

The initialization of the new replica has a transient impact on service requests that lasts for less than two seconds.

 

 

Abstract   1 Introduction   2 System Definition   3 Applying ADD   4 ADD Second Iteration
5 Summary of the Architecture   6 Comments on the Method   7 Conclusion
Glossary   Bibliography   PDF Download

3 Applying ADD

It takes at least two iterations through the ADD process to develop an architecture that satisfies the architectural requirements of a proposed system. As shown in Figure 3, Step 1 (described below) is conducted only once to ensure that the information you have about the requirements is sufficient. We do not discuss the design steps in the first iteration, since our primary interest is in fault tolerance. The architecture team created the architectural views shown in Figure 3 and outlined in Section 3.2. A "fault-tolerance services" element is included in this view. This element is assigned to a fault-tolerance expert for design in parallel with some other designs (for example, start-up services), which we do not describe here. The fault-tolerance expert proceeds with the second iteration of the ADD method and decomposes different aspects of the fault-tolerance services.

3.1 Step 1 of the ADD Method: Confirm There Is Sufficient Requirements Information

Section 2 lists the requirements for the example, which consist of functional requirements, design constraints, and quality attribute requirements.

3.2 Results of the First ADD Iteration

The architecture team conducts the first iteration. This iteration uses a set of architectural drivers consisting of the highest priority requirements, scenarios, and their associated design constraints. These architectural drivers, the details of the team's reasoning, and the requirements gleaned during Step 1 are not included here. The resulting architecture is shown in Figure 3 and further described in the rest of this section. In addition, the software elements are summarized in Table 6.

  1. Our design uses a client-server model where the Track Manager provides services to the update and query clients. Only the primary connectors are shown in the diagram; to simplify it, some secondary interfaces are not shown (for example, all clients, Track Manager elements, and most services would have connectors to the naming service).
  2. The Track Manager has been broken into two elements: A and B. This decomposition allows two deployment strategies to be considered:
    • Strategy 1: Both elements (A and B) operate in a single processor, P1. A and B together consume 50% of the processor duty cycle to handle 100 updates and 30 queries. This strategy satisfies the system performance requirements.
    • Strategy 2: Element A is in processor P1, and element B is in processor P2. Together, they can handle 150 update clients and 50 query clients. This strategy exceeds the system performance requirements.

Figure 3: Software Element Primary Connectivity

The results of analyzing these design strategies are shown in . Communication system bandwidth increases by 2% when the components are placed in different processors.

Table 4: Deployment Characteristics

 

P1

P2

#Updates

#Queries

P1 Load

P2 Load

Strategy 1

A, B

 

100

30

50%

N/A

Strategy 2

A

B

150

50

50%

30%

  1. The communication mechanisms between the update and query clients and the Track Manager differ:
  1. Elements A and B both contain state data that must be saved as a checkpoint in persistent storage. The elapsed times taken to copy the state to and recover the state from persistent storage are identical (see Table 5).

Table 5: Persistent Storage Elapsed Time

Table 6: Elements After Iteration 1

#

Element

Fault-Tolerance Class (Yes/No)

Allocation of
Architectural Drivers

1

Track Manager

Yes

N/A

2

Query clients

No

N/A

3

Update clients

No

N/A

4

Persistent storage

Yes

N/A

5

Track Manager A

Yes

Requirement 1, 3

6

Track Manager B

Yes

Requirement 1, 3

7

Synchronous communication

Yes

N/A

8

Asynchronous communication

Yes

N/A

9

Naming service

Yes

N/A

10

Registration service

Yes

N/A

PH1

Fault-tolerance service
elements

Unknown

Requirement 5
Scenario 1, 2, 3
ADD iteration 1: #1, #3

 

The fault-tolerance expert is told to concentrate on the fault-tolerance service elements as they apply to the Track Manager elements. After this task has been completed and approved by the architecture team, the fault-tolerance considerations for the other elements, such as synchronous communications, can proceed. These elements may or may not use the same services as the Track Manager. The design of making the other elements fault tolerant is not considered here.

3.3 Organizational Considerations

The architecture team decides to consider how to make the Track Manager fault tolerant before creating a general approach to fault tolerance. The team asks an architect with experience in fault tolerance to take this placeholder and develop a fault-tolerance architecture using these five guidelines:

  • Use the requirements, the existing design of the critical element, and the scenarios as the architectural drivers for adding fault tolerance to the Track Manager.
  • If, according to the fault-tolerance expert, the architectural drivers are forcing an overly complex solution, return to the architecture team with proposals to relax one or more of those drivers. The team will make the tradeoff decisions needed to achieve a simpler solution.
  • Capture the rationale for the fault-tolerance architecture and the alternatives that were considered. Details about each alternative are not necessary--just the rationale used when choosing between the options.
  • Don't try to address the start-up concerns. Another design team is tackling that problem. The start-up and fault-tolerance solutions will be merged at a later stage.
  • Important: Remember that your design is preliminary and will be merged with other designs proceeding in parallel. Do not develop a complete design. Stop when you are confident that your approach will satisfy the architectural drivers; for example, do not build a complete set of sequence diagrams or other UML diagrams.

 

 

Abstract   1 Introduction   2 System Definition   3 Applying ADD   4 ADD Second Iteration
5 Summary of the Architecture   6 Comments on the Method   7 Conclusion
Glossary   Bibliography   PDF Download

4 ADD Second Iteration

4.1 Step 1 of ADD: Confirm There Is Sufficient Requirements Information

This step is not necessary during each iteration. It was done once at the beginning of the ADD process.

4.2 Step 2 of ADD: Choose an Element of the System to Decompose

The fault-tolerance services element is chosen as the system element to decompose. Specifically, the Track Manager is targeted, since it is the system's primary element. As you can see in Table 7, other elements in the system must also be fault tolerant; however, the design team wanted to know the architectural impact of making the Track Manager fault tolerant before considering the other elements. This decision, of course, could lead to backtracking in later ADD iterations if a different scheme is needed to add fault tolerance to the other elements.

4.3 Step 3 of ADD: Identify Candidate Architectural Drivers

Ten drivers and their priorities are listed below in Table 7. Seven drivers are identified from the initial pool of architecture requirements. Three are identified from the design constraints resulting from the first iteration of ADD.

Consider the following points as you read Table 7:

  • Drivers labeled (high, high) bear directly on the end-to-end timing requirement of two seconds in scenario 1. This condition is the most difficult to satisfy and has the highest priority drivers.
  • Drivers labeled (medium, medium) are associated with the timing when a single copy of the Track Manager is operating, and restoration should occur within two minutes.
  • The restart scenario is least important, and a separate "start-up" design effort is considering its details. Hence, #3 drivers do not impact the design and are crossed out in the table. As a result, only nine architectural drivers should be considered.

Table 7: Architectural Driver Priorities

#

Architectural
Drivers

Section
Discussed In

Importance

Difficulty

1

Scenario 1
Quick Recovery

Section 2.3

high

high

2

Scenario 2
Slow Recovery

Section 2.3

medium

medium

3

Scenario 3
Restart

Section 2.3

low

low Section 2.2

4

Requirement 1
Track Manager Functionality

Section 2.1

high

high

5

Design Constraint 1
Capacity Restrictions

Section 2.2

high

high

6

Design Constraint 2
Persistent Storage Service

Section 2.2

medium

low

7

Design Constraint 3
Two Replicas

Section 2.2

high

high

8

ADD Step 1, #2
Deployment Characteristics

Section 3.2

high

high

9

ADD Step 1, #3
Communication Mechanisms

Section 3.2

high

low

10

ADD Step 1, #4
Checkpoint Timing

Section 3.2

high

high

4.4 Step 4 of ADD: Choose a Design Concept that Satisfies the Architectural Drivers

This step is the first design step in the ADD method.

Note: This section cross-references Section 7.1 of the SEI technical report titled Attribute-Driven Design (ADD), Version 2.0 [Wojcik 2006]. This step is the heart of that document; it's where most of the design alternatives are listed, the preferred patterns are selected, an evaluation is done to validate the design, and changes are made to correct for detected deficiencies. Within Section 7.1, there are six enumerated paragraphs. To simplify your cross-referencing, each of those paragraphs is referred to in the headings of this report as ADD substep 1, 2...and so forth.

4.4.1 Step 4, Substep 1 of ADD: Identify Design Concerns

The three design concerns associated with fault-tolerance services are3

Table 8 shows these concerns and their breakdown into subordinate concerns, the sections in this report where the alternate patterns are listed and the selections are made.

Table 8: Design Concerns

Design Concerns

Subordinate
Concerns

Alternative
Patterns Section

Selecting Design Pattern Section

Fault Preparation

Restart

Section 4.4.2.1

Section 4.4.3.1

Deployment

Section 4.4.2.2

Section 4.4.3.2

Data integrity

Section 4.4.2.3

Section 4.4.3.3

Fault Detection

Health monitoring

Section 4.4.2.4

Section 4.4.3.4

Fault Recovery

Transparency to
clients

Section 4.4.2.5

Section 4.4.3.5

Start new replica

Section 4.4.2.6

Section 4.4.3.6

Update client
behavior after transient failure

Section 4.4.2.7

Section 4.4.3.7

Update client
behavior after hard
failure

Section 4.4.2.8

Section 4.4.3.8

Query client behavior after transient failure

Section 4.4.2.9

Section 4.4.3.9

Query client behavior after hard failure

Section 4.4.2.10

Section 4.4.3.10

4.4.2 Step 4, Substep 2 of ADD: List Alternative Patterns for Subordinate Concerns

4.4.2.1 Restart

Four design alternatives for restarting a failed component are shown below in Table 9. Two discriminating parameters are related to these patterns:

Table 9 also lists "reasonable" downtime estimates (based on experience) of these discriminating parameters.

Table 9: Restart Patterns

#

Pattern Name

Replica Type

Downtime
Estimates

Loss of
Services

1

Cold Restart

Passive

> 2 minutes

Yes

2

Warm Standby

Passive

> 0.3 seconds

Perhaps

3

Master/Master

Active

> 50 milliseconds

No

4

Load Sharing

Active

> 50 milliseconds

No

4.4.2.2 Deployment

The two components can be deployed with (1) both primaries on one processor and both secondaries on the second processor or (2) each primary on a different processor. The primaries are denoted by A and B; the secondaries by A' and B'. The failure condition for B mimics that of A and is not recorded in the table.

The two discriminating parameters are

Table 10: Deployment Patterns

#

Pattern Name

P #1

P #2

A Fails

# Updates

# Queries

State
Recovery Time

1

Together

A, B

A', B'

A', B'

100

30

1.4

2

Apart

A, B'

A', B

A', B

150

50

0.8

4.4.2.3 Data Integrity

The data integrity tactic ensures that when a failure occurs, the secondary has sufficient state information to proceed correctly. The patterns are shown in Table 11.

Table 11: Data Integrity Patterns

#

Pattern Name

Communication
Loading

Standby
Processor Loading

1

Slow Checkpoint

1.2 seconds every
minute

None

2

Fast Checkpoint

1.2 seconds every 2
seconds

None

3

Checkpoint + Log Changes

1.2 seconds per minute + 100 messages per second

None

4

Checkpoint + Bundled Log Changes

1.2 seconds per minute + 1 message per x seconds

None

5

Checkpoint + Synchronize Primary and Backup

1.2 seconds every
minute
+ 1 message per x seconds

Execute to keep an updated copy of the state

4.4.2.4 Health Monitoring

A single health monitoring tactic should be considered for fault detection. Table 12 lists the patterns to consider and their discriminating parameters.

Table 12: Fault Detection Patterns

#

Pattern Name

Communication Line Loading

1

Heartbeat

4 messages ( for A, A', B, B')

2

Ping/Echo

8 messages
(ping and echo for A, A', B, B')

3

Update Client Detects Failure

0 messages

4

Query Client Detects Failure

0 messages

4.4.2.5 Transparency to Clients

We list three alternatives to make faults transparent to the clients in Table 13. Pattern 1 has no transparency, but patterns 2 and 3 provide transparency.

Table 13: Transparency Patterns

#

Pattern Name

Protocol Required

Timeout Location

1

Client Handles Failure

Unicast

Client

2

Handles Failure Proxy

Unicast

Proxy

3

Infrastructure Handles Failure

Multicast

Within the infrastructure

4.4.2.6 Start New Replica

This step is postponed, since it is closely related to the start-up mechanism that is being explored by another team.

4.4.2.7 Update Client Behavior After Transient Failure

The operation of the proxy service when a transient failure occurs has already been defined: The health monitor informs the proxy service of the failure. Then, this service sends a new secondary access code to each asynchronous communication mechanism. This access code will be used for the next update request. Essentially, this mechanism promotes the secondary to the primary.

4.4.2.8 Update Client Behavior After a Hard Failure

When a primary fails and no secondary is available, one of the design patterns in Table 14 could be used.

Table 14: Patterns for Update Client Behavior After a Hard Failure

#

Pattern Name

Impact

1

Continue to Send Updates

Unusable data is sent.

2

Stop Sending Updates

The communication line loading during downtime is saved.

3

Save Updates in a File

Larger messages are loaded on start-up.

4.4.2.9 Query Client Behavior After Transient Failure

The operation of the proxy service when such a failure occurs has already been defined: The health monitor informs the proxy service of the failure. Then, this service sends a new secondary access code to each synchronous communication mechanism. If no outstanding service request is underway, the mechanism will use this access on the next request. If a service request is underway, a new request will be issued to the new access code. It is possible for the synchronous communication to receive multiple replies (a delayed one from the primary and one from the promoted secondary). It must be able to discard the second reply.

4.4.2.10 Query Client Behavior After a Hard Failure

When a primary fails and no secondary is available, the query clients will be informed and can adjust their behavior appropriately. In that case, one of the design patterns in Table 15 could be used.

Table 15: Patterns for Query Client Behavior After a Hard Failure

#

Pattern Name

Impact

1

Continue to Send Queries

Unusable data is sent.

2

Stop Sending Queries

The communication line loading during downtime is saved.

3

Save Queries in a File

Larger messages are loaded on start-up.

4.4.3 Step 4, Substep 3 of ADD: Select Patterns from the List

This activity involves selecting a pattern from the list for each set of alternative patterns. When making your selection, you reason about which alternative is most suitable. In our example, the selections were made independently. In some cases, reasonable values were chosen as design parameters, such as heartbeat and checkpoint frequencies. In the rest of this section, we consider restart, deployment, data integrity, fault detection, transparency to client, start new replica, and client behavior after transient and hard failures. For each item, we record our reasoning, decision, and the implications of that decision.

The ADD method calls for the development of a matrix showing the interrelationships between patterns and their pros and cons on each architectural driver. It assumes that there will be a reasonable number of possible patterns and that a table is a good way to show the alternatives. Unfortunately, the inclusion of all fault tolerances as a single Step-4 design decision creates a total of 23 patterns--too many to represent in a single table. Hence, each alternative (restart, deployment, etc.) is considered separately. The pros and cons in the table are considered in separate sections below. Each section has three parts: (1) a reasoning paragraph describing the pros and cons for each pattern, (2) a decision statement emphasizing the pattern chosen, and (3) an implication statement showing the impact of this decision, including any obvious restrictions on choices not yet made.

4.4.3.1 Restart

Reasoning

Both scenario 1 and requirement 1 indicate that the restart time must be less than two seconds; thus, Cold Restart pattern is inappropriate (see Table 9). The Warm Standby pattern seems to easily satisfy the timing requirement described in scenario 1. Hence it is chosen, since it is simpler to implement than the Master/Master or Load Sharing patterns.

Decision

Use the Warm Standby pattern.

Implications

  1. A primary Track Manager for each component (A and B) receives all requests and responds to them.
  2. A secondary (standby) Track Manager for each component (A' and B') is loaded on another processor and takes up memory.

4.4.3.2 Deployment

Reasoning

The architect is familiar with having a single failover scheme for recovery from a software or hardware failure. Hence, he chooses the first Together pattern (see Table 10), even though it has a slower recovery time since the states for both A and B must be read from persistent storage, rather than just A. This pattern meets the processing requirements, although it can perform less processing. Note that the granularity of recovery differs from the granularity of failure, in that A and B must both recover when either one fails.

Decision

Use the Together pattern with both primary components that share a processor. Clearly, this option is suboptimal, since it offers reduced capability and increased recovery time. However, it was chosen for reasons of familiarity.

Implications

  1. The primary components (A and B) share a processor, as do the secondary components (A' and B').
  2. The system will never be operational with the primary components in different processors.

4.4.3.3 Data Integrity

Refer to Table 11.

Reasoning

  1. Clearly a checkpoint of state every minute is needed to satisfy scenario 2. However, a state that is one minute old cannot satisfy scenario 1, since one minute's worth of upgrades will be ignored if only the checkpoint is used on restart. Pattern 1 is rejected.
  2. Pattern 2 would satisfy the upgrade requirements of scenarios 1 and 2; however, it places an unacceptable load on the communication system. Pattern 2 is rejected.
  3. Pattern 3 would satisfy scenarios 1 and 2, but--like pattern 2--it places a significant burden on the communication system. Pattern 3 is rejected.
  4. Pattern 4 satisfies scenarios 1 and 2 if x is less than two seconds. It also puts a more reasonable load on the communication system. Having a bundled upgrade periodicity of two seconds appears to be satisfactory, though a more detailed check can be made later (see Section 5). Pattern 4 is ultimately selected.
  5. Pattern 5 also satisfies the scenarios but is more complex, since the secondary must execute every x seconds to update its copy of the state. Recovery would be faster, though, since it would not need to read in a checkpoint of the state. Pattern 5 is rejected due to its complexity.

Decision

Use the Checkpoint + Bundled Log Changes pattern. The log files will be used as the basis for promoting the new primary.

Implications

  1. The primary replica saves the state to a persistent CheckpointFile every minute.
  2. The primary keeps a local bundled file of all state changes for two seconds. The primary sends it as a LogFile every two seconds.
  3. The promoted primary reads in the CheckpointFile after it is promoted. Then it reads the LogFile and updates each state change as it is read.
  4. Next, the promoted secondary writes the newly updated state to persistent storage.
  5. The promoted secondary can now start processing updates and queries without waiting until the persistent state update has been completed.

4.4.3.4 Fault Detection

Reasoning

An approach where the clients do not detect failure is preferable, since it implies that the application developers must understand the fault-tolerance timing requirements. In comparing the two approaches (see Table 12), the ping/echo fault detection is more complex than the heartbeat detection and requires twice the bandwidth.

Decision

Use the Heartbeat pattern. We set the heartbeat at 0.25 seconds, which yields four communication messages per second.

Implications

  1. The heartbeat must be fast enough to allow the secondary to become initialized and start processing within two seconds after a failure occurs. Initializing the two checkpoint files takes 1.2 seconds. The heartbeat adds an additional 0.25 seconds, leaving 0.55 seconds spare, which seems reasonable.
  2. A health monitoring element checks for the heartbeat every 0.25 seconds. When a heartbeat is not detected, the health monitor informs all the necessary elements.
  3. If a primary Track Manager component detects an internal failure, the mechanism for communicating the failure is to not issue the heartbeat.

4.4.3.5 Transparency to Client

Reasoning

It is undesirable to have the clients handle failure, since this approach requires the programmer writing the client to understand the failover mechanism. The failover could be misinterpreted easily and render it less than robust.

The infrastructure has no built-in multicast capability, and adding this feature would be expensive. You can mimic a multicast with multiple unicasts, but this approach doubles the usage of the communication system, and is therefore undesirable. (To review the pattern options, see Table 13.)

Decision

Use the Proxy Handles Failure pattern.

Implications

  1. The proxy service registers the service methods (for example, A.a, A.b, B.c, B.d) with the name server.
  2. The proxy service starts the first components, registering them under different names (AA.a, AA.b, BB.c, and BB.d) and does likewise for the secondary components (AA'.a, AA'.b, BB'.c, and BB'.d).
  3. The client requests a service (A.a). This request causes the naming service to be invoked and to return the access code for A.a, designated as access(A.a). Next, the client invokes access(A.a).
  4. The proxy service (A.a) determines that AA is the primary replica and returns access(AA.a) to the client as a "forward request to."
  5. The client invokes access(AA.a) and continues to do so until AA fails.
  6. When the health monitor detects a heartbeat failure in AA, it informs the proxy service.
  7. The proxy informs the synchronous and asynchronous elements of the failure. These elements send their query and update requests to the newly promoted primary.

4.4.3.6 Start New Replica

This step is postponed, since, in this example, it is part of the start-up mechanism being explored by another team.

4.4.3.7 Update Client Behavior After Transient Failure

A transient failure occurs when the primary fails and a backup is scheduled to take over. In our case, the health monitor detects the failure and informs the proxy service. The proxy sends a forward-request access code to the Synchronous Communication Service (SCS). If no requests are underway, the SCS simply uses the new access code for all future requests. If a request is underway, the SCS executes a forward request with the new access code to the new Track Manager. It is possible to get two replies: one reply from the failed Track Manager component, which was inexplicably delayed beyond the failure notification, and one from the new Track Manager. If two replies are received, the second one is discarded.

4.4.3.8 Update Client Behavior After Hard Failure

Reasoning

Scenario 2 lays the foundation for this choice (see Table 14). We are willing to accept degraded behavior and restart; therefore, pattern 3 is unnecessary and complicated. There is no point in continuing to send updates without having a Track Manager available to receive them.

Decision

We chose the Stop Sending Updates pattern, which, when there is no Track Manager, stops sending updates until a new Track Manager becomes available.

Implications

The clients must be able to do two things: (1) accept an input informing them that the Track Manager has failed and (2) stop sending updates.

4.4.3.9 Query Client Behavior After Transient Failure

We chose the same pattern as the update client (see Section 4.4.3.7) for simplicity's sake.

4.4.3.10 Query Client Behavior After Hard Failure

We chose the same pattern as the update client (see Section 4.4.3.8) for simplicity's sake.

4.4.4 Step 4, Substep 4 of ADD: Determine Relationship Between Patterns and
Drivers

A summary of the selected patterns is shown below in Table 16. In the table heading

Table 16: Pattern/Driver Mapping

#

Pattern Types

Pattern Selected

Architectural Driver

0

# Replicas

Two Replicas

Two Replicas (DC#3)

1

Restart

Warm Standby

Two Replicas (DC#3)

Quick Recovery (SC#1)

2

Deployment

Distributed

Capacity Restriction (DC#1)

3

Data Integrity

Checkpoint +
Bundled Log Changes

Persistent Storage Service (DC#2)

Capacity Restrictions (DC#1)

Quick Recovery (SC#1)

Slow Recovery (SC#2)

4

Fault Detection

Heartbeat

Capacity Restriction (DC#1)

Quick Recovery (SC#1)

Other- see note below

5

Transparency to Clients

Proxy Handles Failure

Capacity Restriction (DC#1)

Other--see note below

6

New Replica

N/A

N/A

7

Update Client Behavior-
Transient

Proxy Handles Failure

N/A

8

Update Client Behavior- Hard

Stop Sending
Updates

Capacity Restriction (DC#1)

9

Query Client Behavior-
Transient

Proxy Handles Failure

N/A

10

Query Client Behavior- Hard

Stop Sending
Queries

Capacity Restriction (DC#1)

Note: There are numerous examples of decisions being made based on the architect's experience and preference rather than on a specific architectural driver. For example, in the fault detection selection in Section 4.4.3.4 the architect considered it inappropriate for clients to detect failure.

4.4.5 Step 4, Substep 5 of ADD: Capture Preliminary Architectural Views

In this section, we present preliminary architectural views including

4.4.5.1 List of the Elements

Table 17 lists the system elements and the ADD iteration in which they're developed.

Table 17: System Elements and the ADD Iteration in Which They're Developed

#

This Element

Is Developed in This ADD Iteration

1

Track Manager

Requirement

2

Query clients

Requirement

3

Update clients

Requirement

4

Persistent storage

Requirement

5

Track Manager A

1

6

Track Manager B

1

7

Synchronous communications

1

8

Asynchronous communications

1

9

Naming service

1

10

Registration service

1

11

Health monitor

2

12

Proxy server

2

13

CheckpointFileA

2

14

CheckpointFileB

2

15

LogFileA

2

16

LogFileB

2

4.4.5.2 A Software Element View of the Architecture

Figure 4 shows a functional view of the software elements in the architecture and their relationships.

Figure 4: Software Element View of the Architecture

4.4.5.3 Sequence Diagram

The sequence diagram for a query client's access to data from client A is shown in . The figure depicts two sequences:

  1. For the first request, the synchronous communication service sends the service request to the proxy. The proxy returns a "forward request to A" message. The synchronous communication service caches the forward request to A and uses it for all future requests.
  2. If A fails to issue a heartbeat to the health monitor, the latter informs the proxy that A has failed. The proxy sends a "forward request to A' " message to the synchronous communication service. The service then forwards the request to A', caches the request, and continues to send messages to A'.

Figure 5: A Sequence Diagram of Failover from A to A'

4.4.6 Step 4.6 of ADD: Evaluate and Resolve Inconsistencies

In an architecture evaluation, the architect builds models to describe the system's behavior. The architect then analyzes these models to ensure that they satisfy the architectural drivers. In our example, we develop a timeline showing the operation around the time of failure.

Figure 6 models the operation of the system over a time period that includes a failure.

Figure 6: Timing Model

The following nine events, which occur in this order, are depicted in .

  1. A save is made of state updates to the persistent LogFile.
  2. A heartbeat is detected a number of times after the state save.
  3. A crash failure occurs in the Track Manager.
  4. The health monitor detects the failure when a timeout occurs before the heartbeat.
  5. The secondary Track Manager is promoted to primary.
  6. The secondary service starts to respond to client requests, working off the backlog of requests and giving slower response times.
  7. The service returns to normal when the transient period of slow responses ends.
  8. A new replica completes initialization and is ready to synchronize with the current primary and become the secondary.
  9. The new replica has completed any needed state updates, and the process of restoring the service is completed.

Six important timing aspects of the system are shown in Figure 6:

The worst-case total time (T1) until the Track Manager recovery occurs when the failure is just after a heartbeat and just before the next write of the updates to the LogFile. In this case, the time would be

T1 = Tps + Th + TrA + TrB + TrL + Tus

T1 = 2 + 0.25 + 0.8 + 0.6 + 0.2 + 0.1 = 3.95

The result is an unacceptable time of 3.95 seconds.

4.4.6.1 Resolve Timing Inconsistencies

We can improve our models in several ways, and we must make tradeoffs among these proposed improvements. Our main objective is to reduce the restart time from 3.95 seconds to less than 2 seconds, while ensuring that the communication load remains reasonable. We can modify the important timing aspects in these ways:

Reasoning

  1. The fault-tolerance designer is reluctant to choose alternative 2, 5, or 6, since they entail requesting changes to the previous design. The expert will only do so if there is no other reasonable way to reduce the time within his or her direct control. The expert would propose such tradeoffs only after reviewing other alternatives.
  2. The LogFile save to persistent storage can occur every second and be synchronized to occur just before every fourth heartbeat. This scheme reduces the terms (Tps + Th) from 2.25 seconds to 1 second. This measure is a gain of 1.25 seconds, which would reduce the response to 2.7 seconds, which is still not good enough. This measure could be further improved by reducing the periodicity to 0.5 seconds, but that option is rejected. It would cause too much additional load on the communication mechanism.
  3. Access to persistent storage for all three files takes (TrA + TrB + TrL) or 1.6 seconds, if done sequentially. However, if accesses are concurrent using asynchronous communications, in theory they will take only 0.8 seconds, which is the time required to get the persistent state for component A. However, a detailed analysis of the persistent storage shows that the three concurrent requests will share some resources and take 1.0 seconds. This reduction is still 0.6 seconds, which leaves us with a 2.1-second response--still not good enough.
  4. The deployment decision is changed to the second option of having each component A and B in a separate processor. Hence, the worst-case access to persistent storage occurs for component A and is 0.8 seconds. If this is still done concurrently with the Logfile access, the total time for both will be .85 seconds. The savings in the previous step are now invalidated, and the 1.6 seconds now takes 0.85 seconds, which yields a 1.95-second response. This response does not provide quite enough of a margin.
  5. 5. The only way within the architect's control to further resolve this problem is to select alternative D and change the data integrity style. Taking this approach assumes that the primary and secondary states for A and B will not diverge in any way, which is outside of the architect's control. The architect then approaches those responsible for the previous design and explains the problem and the options. The designer team agrees to reduce the state upgrade time for component A to 0.6 seconds and the concurrent access with the Logfile to 0.65 seconds This change represents a further savings of 0.2 from the previous result and creates a response time of 1.75 seconds, which is within a reasonable margin.

4.4.6.2 Summary of Timing Decisions

Table 18 summarizes the timing decisions.

Table 18: Summary of Timing Decisions

#

Description

Initial Time Interval

Final Time Interval

Tps

Save FileCheckPoint

2.0

1.0 (see Note 1)

Th

Heartbeat

0.25

0.25

TrB

Recover Checkpoint for B

0.6

0.0 (see Note 2)

TrA

Recover Checkpoint for A

0.8

0.65 (see Note 3)

TrL

Recover LogFile

0.2

0.2

Tus

Update state from LogFile

0.1

0.1

T

Recovery time

3.95

1.75

Checkpoint State

60

60

Notes:

  1. The heartbeat and checkpoint save are synchronized together (reasoning point 2 in Section 4.4.6.1).
  2. Since A and B are in separate processors, we only have to recover the state of one of them for a single failure (reasoning point 4 in Section 4.4.6.1).
  3. The state recovery and checkpoint recovery are performed concurrently (reasoning point 4 and 5 in Section 4.4.6.1).

4.5 Step 5 of ADD: Instantiate Architectural Elements and Allocate Responsibilities

4.5.1 Primary A and B

The primary and backup elements of both A and B have the same behavior. The behavior of A alone is described here.

4.5.2 Persistent Storage

There are four persistent storage files: CheckpointA, CheckpointB, LogFileA, and LogFileB. All new values of these files overwrite the old values.

4.5.3 Health Monitor

The health monitor uses a timer to check whether it has received the heartbeat from A, B, A', and B'. If it fails to receive a heartbeat before the timer expires, it notifies the proxy.

4.5.4 Asynchronous Communication

The asynchronous communication mechanism receives a request from the update clients to a method (for example, A.a), and directs the request to the appropriate element.

  1. The mechanism sends the name server the method A.a and receives the access code to the proxy element for A.a.
  2. The mechanism sends the update message to the proxy element A.a.
  3. When the mechanism receives the forward request for A.a to send the message to AA.a, it sends the request to AA.a and caches the handle for AA.a.
  4. Any subsequent requests are made directly to the AA.a handle.
  5. When a failure occurs, the mechanism receives the forward request to AA'.a and uses that handle for subsequent requests.
  6. If AA.a fails and there is no standby, the mechanism informs the update client to stop sending updates.

4.5.5 Synchronous Communication

The synchronous communication element receives requests from the query clients and has almost the same behavior as the asynchronous communication element. The only difference is that it blocks the query client until it receives the answer to the query, which it then sends to the query client.

4.5.6 Proxy

The proxy element does most of the work in causing a smooth transition to the backup when the primary fails. It does the following:

  1. The proxy service registers all the methods associated with both A and B with the naming service.
  2. The proxy service starts AA, AA', BB, and BB' and registers all their methods with the naming service. It creates a cache by mapping the names used by the clients (e.g., A.a) and the names created by the elements (e.g., AA.a and AA'.a). It determines which element is primary and which is secondary.
  3. The proxy service is called by either the synchronous or asynchronous communication element when a client requests a service; for example, A.a. It replies with a "forward request" to AA.a if AA is the primary.
  4. When the health monitor signals the proxy that the primary (e.g., AA) has failed, it sends a forward request to both the synchronous and asynchronous communication elements to access all the standby methods (e.g., AA'.a), thus promoting AA' to be primary.

4.5.7 Update Clients

The failure of a primary component (e.g., A) and the switchover to A' are transparent to the update clients. Any updates sent during the window between failure and restoration are lost. But the timing window has been analyzed to be small enough for the Track Manager to continue working even with these lost messages. When the primary component fails and there is no backup, the update client will be notified and will stop sending updates until the service is restarted.

4.5.8 Query Client

The failure of the primary component when there is a backup is once again transparent to the update clients. They will have to wait slightly longer for an answer to their query, but that time has been evaluated as acceptable. When the primary component fails and there is no backup, the query client will be notified and will stop requesting queries.

The interfaces have been defined throughout Step 4 in Section 4 but are captured here for consistency and convenience. Note that some of the interfaces that were defined in the first iteration (see Section 3.2) are not repeated in Table 19.

Table 19: Summary of Interfaces

From
Element

To Element

Interface

Timing
Conditions

Descriptive Sections

Primary A

CheckpointA

Update state

60 seconds

Section 4.5.1

Primary A

LogFileA

Log changes

1 second

Section 4.5.1

Primary A

Health Monitor

Heartbeat

0.25
seconds

Section 4.5.1

Primary B

CheckpointB

Update state

60 seconds

Section 4.5.1

Primary B

LogFileB

Log changes

1 second

Section 4.5.1

Primary B

Health Monitor

Heartbeat

-

Section 4.5.1

CheckpointA

Primary A

Update state

During
recovery

Section 4.5.1

LogFileA

Primary A

Log changes

During
recovery

Section 4.5.1

CheckpointB

Primary B

Update state

During
recovery

Section 4.5.1

LogFileB

Primary B

Log changes

During
recovery

Section 4.5.1

Health
Monitor

Proxy

Primary failure

Within 1 second of detection

Section 4.5.3

Query Client

Synchronous Communication

Request for service

5 per second

Section 4.5.8

Proxy

Naming

Registration of A, B, A', B' services

During
start-up

Section 4.5.6

Proxy

Synchronous Communication

Primary failed (A or B)

During
recovery

Section 4.5.6

Proxy

Asynchronous Communication

Primary failed (A or B)

During
recovery

Section 4.5.6

4.6 Step 7 of ADD: Verify and Refine Requirements and Make Them Constraints

In Step 7, we verify that the decomposition of the fault-tolerance services supporting the Track Manager element meets the functional requirements, quality attribute requirements, and design constraints, and we show how those requirements and constraints also constrain the instantiated elements.

The architectural drivers are shown once more in Table 20 and were used in one or more pattern selections (except for scenario 3). The restart scenario was not explicitly used, since the restart design was being done in parallel and a later merging was anticipated.

Table 20: Architectural Drivers

#

Architectural
Drivers

Defined in Section

Applies to Pattern Choices

1

Scenario 1
Quick Recovery

Section 2.3

Restart, Deployment, Data Integrity, Fault Detection,

2

Scenario 2
Slow Recovery

Section 2.3

Data Integrity

3

Scenario 3
Restart

Section 2.3

Not used

4

Requirement 1
Track Manager
Functionality

Section 2.1

Restart

5

Requirement 2
Checkpoint to
Persistent Storage

Section 2.1

Deployment

6

Design Constraint 1
Spare Capacity

Section 2.2

Fault Detection

7

Design Constraint 2
Two Replicas

Section 2.2

Restart, Deployment

8

ADD Step 1 #1
Deployment
Characteristics

Section 3.2

Restart, Deployment, Data Integrity

9

ADD Step 1 #2
Communication Mechanisms

Section 3.2

Update Client
Behavior

Query Client Behavior

10

ADD Step 1 #3
Checkpoint Timing

Section 3.2

Data Integrity

Notes:

  1. The breakdown of the timing requirements allocation derived from scenario 1 is shown in Table 18.
  2. The additional capabilities required by the elements defined prior to this step (Track Manager, query clients, update clients, persistent storage, synchronous communications, and asynchronous communications) are all defined in Section 4.5. The naming service and registration service required no extensions.
  3. The responsibilities of the two new elements (proxy service) and (health monitor) are fully described in Section 4.5.

4.7 Step 8 of ADD: Repeat Steps 2 through 7 for the Next Element of the System You Wish to Decompose

Now that we have completed Steps 1 through 7, we have a decomposition of the fault-tolerance service (specifically, the Track Manager system element). We generated a collection of responsibilities, each having an interface description, functional requirements, quality attribute requirements, and design constraints. You can return to the decomposition process in Step 2 where you select the next element to decompose. In our case, we do not have child elements to further decompose.

 

 

Abstract   1 Introduction   2 System Definition   3 Applying ADD   4 ADD Second Iteration
5 Summary of the Architecture   6 Comments on the Method   7 Conclusion
Glossary   Bibliography   PDF Download

5 Summary of the Architecture

The architecture developed up to this point is described in this section. Also included are reminders about the parallel designs underway that this architecture must be resolved with and the issues that must still be tackled.

5.1 Architecture Summary

A summary of the design is given below.

5.2 Design Issues Being Resolved Elsewhere

Some designs are being resolved elsewhere:

5.3 Remaining Design Issues

This report provides an overview of how to make the Track Manager fault tolerant and to satisfy the architectural drivers, especially the end-to-end timing requirement. Some views of the software have been captured, but views are missing, such as class diagrams, layered views, container views, use cases, and many sequence diagrams detailing the information embedded in the various "Reasoning" and "Implications" sections of Section 4.4.3. In particular, a state transition diagram that shows the details of how the software elements respond together to provide the desired behavior is missing. This model is the only way to ensure that there are no "timing windows" in the behavior.

In addition, four design issues need to be addressed:

 

 

Abstract   1 Introduction   2 System Definition   3 Applying ADD   4 ADD Second Iteration
5 Summary of the Architecture   6 Comments on the Method   7 Conclusion
Glossary   Bibliography   PDF Download

6 Comments on the Method

While working with the example application described in this report, we made the following observations:

  1. The person doing the design was familiar with the fault-tolerance concerns and alternative patterns used in the example. This designer was also familiar with the ways of reasoning about selecting between alternatives and the timing model needed to evaluate the effectiveness of the choices.</