Knowing When You Don’t Know: Engineering AI Systems in an Uncertain World

Created February 2021

The DoD increasingly deploys artificial intelligence (AI) systems to perform mission-critical tasks. The backbone of many of these AI systems consists of machine learning (ML) models that make predictions about their environment. Unfortunately, even state-of-the-art ML models can make inaccurate inferences in scenarios that involve uncertainty. Many common models do not accurately estimate uncertainty for their predictions. This can produce incorrect results; worse, it makes it harder to predict when ML systems will be wrong.

To solve this problem, ML systems must be able to quantify, reason about, and rectify uncertainty in their predictions. This project will benchmark methods for quantifying uncertainty. It will also create techniques to identify changes in deployment environments that cause uncertainty. Finally, it will develop methodologies, tools, and best practices for rectifying the causes of uncertainty in models that become uncertain. This work will improve the robustness, reliability, and maintainability of ML systems.

Why Does Uncertainty Matter?

Artificial intelligence (AI) systems are widely used for critical DoD missions, including intelligence, surveillance, and reconnaissance (ISR), logistics, autonomous operations, and cybersecurity. Modern AI systems commonly use machine learning (ML) models to make important, domain-relevant inferences about their environment. However, these models are often used in the presence of uncertainty. This can cause inaccurate results in scenarios where humans would reasonably expect high accuracy. In addition, some commonly used ML models do not accurately estimate the uncertainty of their own predictions, so a wrong prediction can look just as trustworthy as a correct one. Worse, these inaccurate estimates of uncertainty make it difficult to predict when a model’s predictions will be wrong. AI system components and humans who use the model’s output are then forced to reason with incorrect inferences that they think are correct.

What causes this problem? Modern neural networks often make predictions with high confidence regardless of their accuracy. In other words, they are overconfident.

For example, finding a car in aerial imagery is a common analysis task. Since neural networks are very good at detecting objects in images, a human intelligence analyst might use a neural network-based tool to find a car in overhead photos. The upper part of the example figure shows two possible matches for the car in question. The neural network is very confident in its prediction! It computed a raw confidence of 0.9773 (or 97.73%) that the car on the left is a match and a raw confidence of 0.9834 (or 98.34%) that the car on the right is a match.

There is just one problem: the car can only be in one place at a time! Both cars cannot be matches, despite their high confidence scores. How can our intelligence analyst tell which one is correct? By themselves, the raw confidence scores do not indicate whether the neural network correctly identified the car.

Calibrated uncertainty is more useful to our intelligence analyst than raw uncertainty. It factors in the likelihood that a model will make correct predictions. The calibrated confidence scores in the lower half of the example figure show a much higher confidence for the car on the right (0.9834, or 98.34%) than the car on the left (0.2463, or 24.63%). This tells our intelligence analyst to focus attention on the car on the right. Calibrating the uncertainty in a neural network’s prediction led to a better outcome.

(Images from Cars Overhead with Context Data Set, Lawrence Livermore National Laboratory, https://gdo152.llnl.gov/cowc/)
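One way to make the gap between raw confidence and actual correctness measurable is the expected calibration error (ECE), which compares average confidence with observed accuracy within confidence bins. The following is a minimal sketch, not the project's tooling: the function name, the bin count, and the four example predictions are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin; 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: the model reports ~98% confidence every time,
# but only half of the predictions are actually correct.
confidences = [0.98, 0.98, 0.97, 0.99]
correct = [True, False, True, False]

ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.2f}")
```

An ECE near zero means confidence scores track accuracy; a large value, as in this overconfident toy case, means they do not.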

Collaborators

Machine Learning Department, Carnegie Mellon University

Tepper School of Business, Carnegie Mellon University

Managing Uncertainty: Finding Out When You Don’t Know

To solve the problems caused by uncertainty in machine learning (ML), we need to find out when our ML systems do not know things. We can manage uncertainty by quantifying it, examining its sources, and fixing its causes.

From an AI Engineering perspective, many AI systems incorporate ML models as system components. Calibrating uncertainty informs downstream AI system components that a model’s inferences may not be correct and they should therefore use contingencies. Detecting and rectifying uncertainty stand as best practices for iterating on ML models. This makes the process of maintaining ML models more rigorous, which in turn leads to more reliable and accurate AI applications.

Quantifying Uncertainty: How Much Don’t You Know?

The first step is to quantify uncertainty in ML predictions. Approaches to this task include (but are not limited to) the following:

Post-training Calibration (Naeini et al., AAAI 2015) (Guo et al., ICML 2017) (Hein et al., CVPR 2019)
Bayesian Neural Networks (Blundell et al., ICML 2015) (Gal and Ghahramani, ICML 2016)
Deep Ensembles (Lakshminarayanan et al., NeurIPS 2017) (Andrey et al., NeurIPS 2018)

How do these and other state-of-the-art techniques perform on a practical level? We will benchmark and compare these methods according to their computational runtime, data efficiency, and accuracy in quantifying uncertainty.
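As an illustration of post-training calibration, the sketch below applies temperature scaling, the technique described by Guo et al. (ICML 2017): a single scalar T divides the logits, and T is chosen to minimize negative log-likelihood on held-out validation data. This is a simplified sketch under stated assumptions. It uses a grid search rather than a gradient-based optimizer, and the validation logits and labels are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    loss = 0.0
    for logits, label in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[label])
    return loss / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature with the lowest validation NLL (grid search)."""
    if candidates is None:
        candidates = [0.25 + 0.05 * i for i in range(196)]  # 0.25 to 10.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Hypothetical validation set: the model is ~98% confident in class 0
# every time, yet it is right only three times out of four.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
raw = max(softmax(val_logits[0]))
calibrated = max(softmax(val_logits[0], T))
print(f"T={T:.2f}, raw={raw:.3f}, calibrated={calibrated:.3f}")
```

Temperature scaling leaves the predicted class unchanged and only rescales confidence, so it changes how far a prediction is trusted, not what the prediction is.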

Finding the Causes of Uncertainty: Why Don’t You Know It?

The second step is to identify the sources of uncertainty in ML predictions. Why did a deployed model become uncertain in its predictions? We will focus on two (of potentially many) factors: dataset shift and emergence of novel classes.

Dataset shift

Dataset shift occurs when the data fed into a deployed ML model differs from the data that was used to train it. For example, a model that was trained on images of a lush, leafy summer landscape may falter with images of the same area taken during the winter, when the landscape is snow-covered and leafless.

To address dataset shift, we will detect when input data differs significantly from training data. We will identify the causes of uncertainty and update the model accordingly. (Rabanser, Günnemann, and Lipton, NeurIPS 2019)
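One simple way to check whether input data differs significantly from training data is a two-sample Kolmogorov-Smirnov (KS) test on a single feature. The sketch below is an illustration only and is not the method of the work cited above. It assumes one scalar feature per image (a hypothetical "mean brightness"), uses synthetic data, and relies on the standard large-sample critical value for the KS statistic.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def shift_detected(reference, incoming, alpha=0.05):
    """Flag a shift when the KS statistic exceeds the large-sample threshold."""
    n, m = len(reference), len(incoming)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    d = ks_statistic(reference, incoming)
    return d > critical, d


# Synthetic stand-in data (an assumption): darker summer scenes in training,
# brighter snow-covered scenes after deployment.
rng = random.Random(0)
summer_train = [rng.gauss(0.4, 0.1) for _ in range(200)]
winter_live = [rng.gauss(0.8, 0.1) for _ in range(200)]

flag, d = shift_detected(summer_train, winter_live)
print(f"shift detected: {flag}, KS statistic: {d:.2f}")
```

With many features, each test would need a multiple-comparison correction (for example, dividing alpha by the number of features), which this single-feature sketch omits.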

Emergence of novel classes

After deployment, an ML model could encounter objects that it has never seen before. For example, an ML classifier might incorrectly identify new types of vehicles that were not part of its training dataset.

To address the emergence of novel classes, we will extend the concept of open-world models to find relationships between new types of objects. These models assume that they have incomplete information about a domain and can reason about situations where they lack training data. (Cortes et al., COLT 2016) (Rudd et al., TPAMI 2017) (Oza et al., CVPR 2019)
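To make the idea of a model that can say "I have not seen this before" concrete, the sketch below shows a very simple open-set baseline: classify by the nearest class centroid, and reject the input as unknown when it lies too far from every known class. This is not one of the cited methods. The feature vectors, class names, and distance threshold are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def fit_centroids(examples_by_class):
    """Compute one centroid per known class."""
    return {label: centroid(pts) for label, pts in examples_by_class.items()}


def classify_open_set(x, centroids, max_distance):
    """Return the nearest known class, or 'unknown' if nothing is close."""
    label, dist = min(
        ((lbl, math.dist(x, c)) for lbl, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (label if dist <= max_distance else "unknown"), dist


# Hypothetical 2-D features for two vehicle classes seen during training.
training = {
    "car": [[1.0, 1.0], [1.2, 0.8]],
    "truck": [[5.0, 5.0], [5.2, 4.8]],
}
centroids = fit_centroids(training)

known, _ = classify_open_set([1.0, 1.1], centroids, max_distance=1.0)
novel, _ = classify_open_set([9.0, 0.5], centroids, max_distance=1.0)
print(known, novel)
```

In practice the threshold would be tuned on held-out data, and inputs rejected as unknown would become candidates for expert labeling.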

Rectifying Uncertainty: How Can You Fix It?

The third step is to rectify uncertainty. Now that we have detected and quantified uncertainty in our ML models, how can we make them more confident in uncertain cases in the future?

One idea is to label the uncertain cases and retrain the model. This approach poses the following challenges:

Labeling all potentially uncertain cases can be very labor-intensive. We use active learning (Settles, 1995) to identify instances of uncertainty, have an expert label them, and update the model with the new labeled data to be more confident in similar instances in the future.
Retraining, validating, and redeploying a model can be time-intensive. We are developing best practices for verifying and validating models, using online learning whenever possible (Bottou, 1998).
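The active learning step in the first challenge above can be sketched as uncertainty sampling: rank unlabeled instances by the entropy of the model's predicted class probabilities and send only the most uncertain ones to an expert. This is a minimal sketch of one common selection strategy, not necessarily the one used in this project. The instance identifiers, probabilities, and labeling budget are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` most uncertain unlabeled instances."""
    ranked = sorted(
        predictions, key=lambda key: entropy(predictions[key]), reverse=True
    )
    return ranked[:budget]


# Hypothetical pool of unlabeled images with predicted class probabilities.
pool = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.70, 0.30],
    "img_004": [0.50, 0.50],
}

to_label = select_for_labeling(pool, budget=2)
print(to_label)
```

Because the ranking depends on the predicted probabilities, it is most useful when those probabilities have been calibrated first; an overconfident model would rank nearly every instance as certain.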

Looking Ahead

Our three-pronged approach to managing and mitigating uncertainty will speed the development of robust, useful, and accurate AI systems:

ML models will become more transparent about uncertainty, resulting in safer and more reliable systems.
Deployed AI systems can adapt to changes in their environments more quickly and efficiently.
AI could potentially be used for missions where it’s currently believed to be too unreliable or opaque.

First, we will evaluate uncertainty modeling techniques on relevant data for Department of Defense and Intelligence Community (DoD/IC) missions. By the end of the project, we will deliver uncertainty modeling software to DoD/IC collaborators for use in a relevant environment or even an operational setting.