Certifying Worst Case Execution Time Estimates

The ability to validate the LambdaWBA reasoning framework depends on knowing, a priori, the worst-case execution time of the software components from which software assemblies are composed. For the initial industrial application of LambdaWBA, an ad hoc, measurement-based technique was sufficient (although not optimal), since laboratory testing of the equipment already included a five-minute stress period at the highest supported data rates. To validate the LambdaWBA reasoning framework, however, the ad hoc technique used in the trial was replaced with a technique that satisfies two key characteristics:

  • principled: the technique had to be grounded in sound theory
  • practical: the technique had to be repeatable in an industrial setting.

Estimating Number of Samples Required

In our first approach to estimating the worst-case execution time, we ran test components as stand-alone periodic tasks and measured the CPU execution time of each test component in isolation. The longest observed execution time was then used as the Worst-Case Execution Time (WCET)1. The key question for this approach is, "How many execution time samples do we need before we can be confident that we have seen the worst-case execution time?" We assume that the execution time samples are normally distributed, but with a hidden upper bound. We attempt to deduce the bound by looking for deviations between an unbounded Normal distribution model and the observed data.

The simplest approach is to collect samples until we predict that an unbounded Normal distribution would have generated a large number of values at or above the actual highest observed sample xmax. We do this by first collecting an initial set of 30 sample execution times. We then compute the sample mean m and standard deviation s and use them to estimate the number of samples n we would need before seeing the highest value observed so far, xmax (or a higher value), at least M times. This can be computed as:

Equation 1:  n = M / (1 - F((xmax - m) / s))

where F() is the cumulative distribution function (CDF) of the standard Normal distribution. In our estimation procedure we use M = 10. The assumption is that if we have collected enough samples that we should have seen 10 instances of the highest value, but have seen only one, then that value must be at or near the upper bound. Once the initial set of 30 sample execution times has been recorded, we continuously update the mean, the standard deviation, and the sample count requirement n with each new measurement. When the number of samples exceeds the requirement n, we stop2.
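
As a concrete illustration, the sketch below (in Python) shows one way this stopping rule could be implemented. It is a minimal sketch under our assumptions, not the tooling used in our experiments: measure() stands in for whatever instrumentation returns a single execution time sample, and the 15,000-sample cap mirrors the practical limit mentioned in the footnote.

    import statistics
    from statistics import NormalDist

    def required_samples(x_max, m, s, M=10):
        """Equation 1: samples an unbounded Normal(m, s) would need before
        producing M or more values at or above x_max."""
        tail = 1.0 - NormalDist().cdf((x_max - m) / s)
        return float("inf") if tail == 0.0 else M / tail

    def collect_until_bounded(measure, M=10, initial=30, cap=15000):
        """Collect execution time samples until the sample count exceeds the
        Equation 1 requirement, or until the practical cap is reached."""
        samples = [measure() for _ in range(initial)]
        while len(samples) < cap:
            m = statistics.mean(samples)
            s = statistics.stdev(samples)
            if s > 0 and len(samples) >= required_samples(max(samples), m, s, M):
                break
            samples.append(measure())
        return max(samples), len(samples)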

An incremental improvement to this estimate of n is based on computing the probability that we could have seen a maximum value of xmax by chance. Having computed the mean and standard deviation as above, we can compute the probability p that all of n samples drawn from an unbounded Normal distribution would fall at or below xmax as:

Equation 2:  p = [F((xmax - m) / s)]^n

Solving for n we get the required number of samples for a target p value as:

Equation 3:  n = log(p) / log(F((xmax - m) / s))

Choosing a small value for p (e.g., 0.001) reflects the assumption that if the chance of an unconstrained Normal distribution producing no value larger than xmax in n samples is only 0.001, then it is more likely than not that xmax represents a true bound.
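
A short Python sketch of this refinement follows; the numbers in the usage line are illustrative only and are not drawn from our measurements.

    import math
    from statistics import NormalDist

    def required_samples_for_p(x_max, m, s, p=0.001):
        """Equation 3: the number of samples n at which the probability that an
        unbounded Normal(m, s) keeps every sample at or below x_max
        (Equation 2: p = F(z)**n) falls to the target p."""
        F = NormalDist().cdf((x_max - m) / s)
        return math.log(p) / math.log(F)

    # Illustrative values only: mean 1.0, std dev 0.05, highest observation 1.15.
    print(round(required_samples_for_p(1.15, 1.0, 0.05)))  # roughly 5,100 samples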

One problem with the approaches given above is the assumption that execution times follow a Normal distribution. An analysis of execution times measured from a test component (a component designed to burn a fixed amount of CPU execution time) produced a non-symmetric distribution weighted toward higher execution times. An additional problem is that, because execution time measurements are independent, any invocation of a task is equally likely to produce the maximum, whether it occurs in initial testing or in the live system. If n invocations are made in testing, and N invocations are made over the course of execution in the live system, the probability that the maximum value was observed in testing is:

Equation 4:  n / (n + N)

Setting n / (n + N) = 0.999 and solving for n gives n = 999N. This implies that, in order to be 99.9% sure of seeing the maximum execution time in testing, the testing period (in number of invocations) must be roughly 1,000 times as long as the period over which the system is actually deployed. This is clearly impractical, and so other techniques must be employed to get safer estimates of the WCET.
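
The same arithmetic can be written as a one-line helper; the deployment figure below is purely illustrative.

    def test_invocations_needed(deployment_invocations, confidence=0.999):
        """Solve Equation 4, confidence = n / (n + N), for the number of test
        invocations n needed to see the true maximum with the given confidence."""
        return deployment_invocations * confidence / (1.0 - confidence)

    # If the deployed system will make one million invocations, 99.9% confidence
    # requires roughly 999 million test invocations.
    print(test_invocations_needed(1_000_000))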

WCET Estimation Based on Extreme Value Theory

We have recently begun investigating Extreme Value Theory as a means of determining a certifiable WCET estimate. Extreme Value Theory is a branch of statistics for analyzing the tail behaviors of a distribution. It is used frequently in the insurance industry for predicting the likelihood of floods based on water level measurements, and in other applications where the extreme values of a random variable have consequences.

Extreme value theory models random variables that are formed as the maximum (or minimum) of a set of measurements of some other random variable. It has been shown [1] that as the number of measurements is increased, the distribution of the maximum values approaches one of three possible distributions in a manner analogous to the central limit theorem. For distributions like execution time measurements that have an exponential tail (i.e., the probability of observing a value decreases exponentially with higher values), the resulting extreme value distribution is the Gumbel [2].
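
For reference, the Gumbel distribution for maxima, with location parameter μ and scale parameter β, has the cumulative distribution function P(X ≤ x) = exp(-exp(-(x - μ)/β)); its percent-point (quantile) function is what the procedure below uses to turn a target probability into a WCET estimate.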

To estimate the WCET of a component using Extreme Value Theory, we currently take the following approach:
  • Measure 15,000 execution time samples.
  • Segment the measurements into blocks of 100, and compute the maximum value of each block.
  • Compute the mean, m, and standard deviation, s, of the maximum values.
  • Compute the parameters of the Gumbel distribution from m and s as:
    β = s·√6/π,   μ = m - 0.5772·β   (0.5772 is Euler's constant)
  • Use the percent-point (quantile) formula for the Gumbel distribution,
    xwcet = μ - β·log(log(1/p)),
    to compute the execution time xwcet for which there is a probability p that the maximum value in any set of 100 invocations will be xwcet or less.

Using a block size of 100 gives enough samples per block for the block maximum to converge toward a Gumbel distribution, and using a total of 15,000 samples gives 150 block maxima from which to estimate the Gumbel parameters.
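
The sketch below (Python, standard library only) shows one way this procedure could be coded; the function name and the assumption that the raw measurements are already held in a list are ours, not part of the original tooling.

    import math
    import statistics

    EULER_GAMMA = 0.5772  # Euler-Mascheroni constant, to the precision used above

    def gumbel_wcet(samples, block_size=100, p=0.999):
        """Estimate a WCET bound: block maxima -> method-of-moments Gumbel fit
        -> percent-point (quantile) function evaluated at probability p."""
        # Segment the measurements into full blocks and take each block's maximum.
        maxima = [max(samples[i:i + block_size])
                  for i in range(0, len(samples) - block_size + 1, block_size)]
        # Mean and standard deviation of the block maxima.
        m = statistics.mean(maxima)
        s = statistics.stdev(maxima)
        # Gumbel parameters from the moments of the block maxima.
        beta = s * math.sqrt(6) / math.pi
        mu = m - EULER_GAMMA * beta
        # Execution time that the maximum of any block of block_size invocations
        # stays at or below with probability p.
        return mu - beta * math.log(math.log(1.0 / p))

Called with the 15,000 raw execution time samples and p = 0.999, this returns the execution time that the maximum of any 100-invocation block is predicted to stay at or below with probability 0.999.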

Figure 1 is a sample graph showing the WCET needed to achieve a target probability that there will be no WCET violations in the deployed system. The estimated WCET increases linearly with the number of 9s of reliability required. In the example plotted here, the actual maximum observed execution time was 1.387, the mean of the block maxima was 1.179, and their standard deviation was 0.0710. Unlike the approaches above that directly use the highest execution time observed in testing, the Gumbel-based approach probabilistically estimates a WCET from a limited number of observations.

Figure 1: Worst Case Execution Time Prediction

While Extreme Value Theory appears to be the best approach to providing a certifiable WCET, there is still work to be done. Among the questions that need to be addressed are the interpretation of the success probability and the effect of the number of execution time measurements made during testing. There are also alternative approaches in the literature [3-5] to estimating WCET using the Gumbel distribution and Extreme Value Theory that need to be evaluated against our current approach.

Another issue that needs to be addressed is that Extreme Value Theory is likely to overestimate the required WCET. The Gumbel distribution assumes the measurements come from an unbounded distribution; it therefore assigns a small but non-zero probability that a task that normally takes 1 ms might take 100 ms, when in fact such a measurement might be possible in a real system only in the event of a hardware problem. It may be necessary to run some extremely long series of actual measurements to evaluate the extent to which this is a problem.

1  Tasks created to execute these components were run at the highest priority to reduce perturbation by host OS processing as much as possible. Future work will look at reducing this interference.

2  We also stop when an upper limit of 15,000 samples is reached as a means to place a practical limit on the total number of samples used for this approach.

References

[1] Gumbel, E.J. (1958). Statistics of Extremes. Columbia University Press.

[2] NIST/SEMATECH e-Handbook of Statistical Methods.

[3] K. Leahy. Efficient Estimation of Tighter Bounds for Worst Case Execution Time of Programs. Technical Report WUCSE-2005-27, Washington University in St. Louis, 2005.

[4] S. M. Petters. How Much Worst Case is Needed in WCET Estimation? In 2nd International Workshop on Worst Case Execution Time Analysis, Vienna, Austria, June 2002.

[5] S. Edgar and A. Burns. Statistical Analysis of WCET for Scheduling. In Proceedings of the 22nd IEEE Real-Time Systems Symposium (RTSS’01), London, England, December 2001. IEEE Computer Society.