2021 Research Review / DAY 3

Predicting Inference Degradation in Production ML Systems

After machine learning (ML) systems are deployed, their models need to be retrained to account for differences between characteristics of training and production data. These differences over time lead to inference degradation—negative changes in the quality of ML inferences—which eventually reduces the trustworthiness of systems [DSB 2016; Gil 2019]. In DoD systems, failure to recognize inference degradation can lead to costly reengineering, system decommissioning, and misinformed decisions.

Figure caption: The models we are developing to study inference quality are convolutional neural networks (CNNs) for object detection, tested against a publicly available satellite image data set under frequency, recurrence, and abruptness data drift.

Ideally, inference degradation would be identified quickly and reliably in production ML systems, allowing appropriate action to be taken (e.g., retraining, cautioning users, or taking a capability offline). The state of engineering practice in industry instead relies on periodic retraining and model redeployment to stay ahead of data drift, without monitoring for inference degradation. Without an analytic basis for choosing the retraining interval, this strategy risks correcting for inference degradation too slowly (i.e., bad inferences may become the basis for actions) or redeploying models too frequently (overconsuming potentially limited bandwidth in tactical deployments and increasing the risk of taking a capability offline due to redeployment errors) [Diethe 2018; Manning 2018; Tarraf 2019].
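
To make the contrast with schedule-based retraining concrete, the sketch below shows one common way to monitor a deployed model for data drift. It is an illustrative example only, not the metrics proposed in this project; the choice of detection confidence scores as the monitored statistic and the significance level are assumptions.

```python
# A minimal sketch (not this project's proposed metric) of monitoring for
# data drift instead of retraining on a fixed schedule. It compares the
# distribution of a model output statistic (here, detection confidence
# scores) observed in production against a reference window captured at
# deployment time, using a two-sample Kolmogorov-Smirnov test.

import numpy as np
from scipy.stats import ks_2samp

def drift_detected(reference_scores, production_scores, alpha=0.01):
    """Flag drift when the production score distribution differs
    significantly from the reference distribution."""
    statistic, p_value = ks_2samp(reference_scores, production_scores)
    return p_value < alpha, statistic

# Reference window collected right after deployment (synthetic data here).
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.8, scale=0.1, size=5_000)

# Recent production window; a shifted mean stands in for drifting inputs.
production = rng.normal(loc=0.7, scale=0.15, size=5_000)

drifted, ks_stat = drift_detected(reference, production)
if drifted:
    print(f"Drift detected (KS statistic {ks_stat:.3f}); consider retraining.")
else:
    print("No significant drift detected; retraining can be deferred.")
```

Drift in inputs or outputs does not always translate into inference degradation, which is one motivation for metrics that target inference quality directly.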

We propose to develop novel metrics that predict when a model's inference quality (e.g., positive predictive value [PPV], accuracy) will degrade below a threshold. The expected benefit of these metrics is that they will determine (1) when a model actually needs to be retrained, avoiding resources spent on unnecessary retraining, and (2) when a model needs to be retrained ahead of its scheduled retraining time, minimizing the time that the model produces suboptimal results.
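
As a rough illustration of the threshold idea (not the predictive metrics themselves, which are the subject of this research), the following sketch tracks PPV over a sliding window of spot-checked predictions and flags when it falls below an assumed threshold; the window size and threshold values are placeholders.

```python
# A rough illustration of threshold-based quality monitoring; the window
# size and threshold below are assumed placeholders, not project values.
from collections import deque

class PPVMonitor:
    """Tracks positive predictive value (PPV) over a sliding window of
    spot-checked predictions and flags when it drops below a threshold."""

    def __init__(self, window_size=500, threshold=0.85):
        # Each entry is True if a predicted positive was actually positive.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, predicted_positive, actually_positive):
        # PPV only considers predicted positives (true and false positives).
        if predicted_positive:
            self.window.append(bool(actually_positive))

    def ppv(self):
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def needs_retraining(self):
        current = self.ppv()
        return current is not None and current < self.threshold

# Usage: record spot-checked detections as ground truth becomes available.
monitor = PPVMonitor(window_size=200, threshold=0.85)
monitor.record(predicted_positive=True, actually_positive=True)
monitor.record(predicted_positive=True, actually_positive=False)
print(monitor.ppv(), monitor.needs_retraining())
```

A monitor like this reacts only after quality has already dropped; the proposed predictive metrics aim to close that gap by signaling an impending drop before the model operates below threshold.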

We will focus on models based on convolutional neural networks (CNNs) for object detection, using a publicly available satellite image data set as the source of test data. To further scope the study, we will limit our attention to inference degradation stemming from data drift (frequency, recurrence, and abruptness drift).
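
The three drift patterns named above can be made concrete with a small simulation. The sketch below generates label streams exhibiting frequency, abrupt, and recurring drift; the class names, probabilities, and batch sizes are assumptions for illustration and do not reflect the project's test bed or data set.

```python
# A simplified, assumed illustration of three drift patterns applied to a
# stream of class labels for synthetic "scenes" (the actual test bed uses
# satellite imagery and CNN object detectors). Frequency drift changes how
# often a class appears, abrupt drift switches the distribution at a single
# point, and recurring drift alternates between distributions.

import numpy as np

rng = np.random.default_rng(42)
CLASSES = ["ship", "airplane", "background"]

def sample_labels(probabilities, n):
    return rng.choice(CLASSES, size=n, p=probabilities)

def frequency_drift(n_steps=10, batch=100):
    """Gradually increase the prevalence of 'ship' over time."""
    for step in range(n_steps):
        p_ship = 0.1 + 0.05 * step          # drifts from 0.10 to 0.55
        p_rest = (1.0 - p_ship) / 2.0
        yield sample_labels([p_ship, p_rest, p_rest], batch)

def abrupt_drift(n_steps=10, batch=100, change_point=5):
    """Switch distributions entirely at a single change point."""
    before, after = [0.1, 0.1, 0.8], [0.6, 0.2, 0.2]
    for step in range(n_steps):
        yield sample_labels(before if step < change_point else after, batch)

def recurring_drift(n_steps=10, batch=100, period=3):
    """Alternate between two distributions, e.g., seasonal conditions."""
    regimes = ([0.1, 0.1, 0.8], [0.5, 0.3, 0.2])
    for step in range(n_steps):
        yield sample_labels(regimes[(step // period) % 2], batch)

# Example: observe the ship fraction drifting upward over the stream.
for i, labels in enumerate(frequency_drift()):
    print(f"step {i}: ship fraction = {np.mean(labels == 'ship'):.2f}")
```

Controlled streams like these let a test bed dictate exactly when and how production data departs from the training distribution, so candidate metrics can be scored on how early and how reliably they signal degradation.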

Our vision for this work is that (1) our new metrics are incorporated into the model development pipelines of ML systems to provide better information on what actions to take in response to inference degradation, including starting the retraining process early enough to sustain continuous operation within accuracy thresholds, and (2) the community begins developing metrics and leveraging our test bed for models beyond those based on CNNs, and looks beyond drift metrics as the sole predictor of inference degradation in its model development pipelines and model monitoring infrastructure.

In Context

This FY2021 project

  • aligns with the CMU SEI technical objectives to bring capabilities that make new missions possible or improve the likelihood of success of existing ones, and to be timely so that the DoD can field new software-enabled systems and upgrades faster than our adversaries
Mentioned in this Article

[DSB 2016]

Defense Science Board. Summer Study on Autonomy. Office of the Under Secretary of Defense for Acquisition, Technology and Logistics. June 2016. https://www.hsdl.org/?view&did=794641

[Diethe 2018]

Diethe, T.; Borchert, T.; Thereska, E.; Balle, B.; & Lawrence, N. Continual Learning in Practice. Presented at NeurIPS 2018 Workshop on Continual Learning. December 2018. https://arxiv.org/pdf/1903.05202.pdf

[Gil 2019]

Gil, Y. & Selman, B. A 20-Year Community Roadmap for Artificial Intelligence Research in the US. Computing Community Consortium (CCC) and the Association for the Advancement of Artificial Intelligence (AAAI). August 2019. https://arxiv.org/abs/1908.02624

[Manning 2018]

Manning, J.; Langerman, D.; Ramesh, B.; Gretok, E.; Wilson, C.; George, A.; & Crum, G. Machine-Learning Space Applications on SmallSat Platforms with TensorFlow. Presented at the 32nd AIAA/USU Conference on Small Satellites. 2018. https://digitalcommons.usu.edu/smallsat/2018/all2018/458/

[Tarraf 2019]

Tarraf, D. C. et al. The Department of Defense Posture for Artificial Intelligence: Assessment and Recommendations. RAND Corporation. 2019. https://www.rand.org/pubs/research_reports/RR4229.html

More from Day 3 of Research Review 2021

Train but Verify: Towards Practical AI Robustness

Principal Investigator Nathan VanHoudnos

AI Engineering in an Uncertain World

Principal Investigator Eric Heim