2023 Year in Review

Pathfinding Project Explores Large Language Models for the Intelligence Mission

Large language model (LLM) applications such as ChatGPT seem a natural fit for the data-heavy intelligence community (IC). But IC agencies cannot expose their sensitive information to public models, and LLM output cannot always be trusted. A 2023 SEI study explored how IC organizations might establish their own trustworthy LLM.

In early 2023, the Office of the Director of National Intelligence (ODNI) began investigating use cases for LLMs within the IC. “Technologies like LLMs have the potential to greatly enhance current mission workflows but can also reveal new insights in our existing and future data sets that can’t necessarily be derived with legacy approaches,” said Bob Lawton, chief of mission capabilities in the ODNI Office of Science and Technology.

ODNI turned to the SEI, which has been researching AI engineering for the agency since 2020. The resulting project asked, for the first time, how the IC might set up a baseline, stand-alone LLM; customize LLMs for intelligence use cases; and evaluate the trustworthiness of LLMs across use cases.

If intelligence analysts are to use these tools, they have to trust them, or at least know their limitations.

Shannon Gallagher
AI Engineering Team Lead, SEI AI Division

SEI researchers focused on two hallmark LLM use cases: question answering with source attribution and document summarization. “Intelligence analysts frequently need to query data sets, review large corpora of documents, accurately distill the important information, and report it out for different audiences,” said Shannon Gallagher, the SEI’s AI engineering team lead and the project’s principal investigator.

The most cost-effective method of building a domain-specific LLM is to adapt an existing foundation model. One way is to augment the model with external tools at inference time. Another way, more permanent but costlier, is fine-tuning, which further trains the foundation model on custom data.
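To make the distinction concrete, the sketch below illustrates the augmentation path under stated assumptions: a deliberately naive keyword retriever pulls a few relevant documents from a mission corpus and prepends them to the prompt, and query_model is a placeholder for whatever local or cloud inference endpoint an agency stands up. It is a minimal illustration, not the retrieval or tooling used in the SEI study.

```python
# Sketch of inference-time augmentation: supply mission-specific context to an
# existing foundation model in the prompt rather than retraining the model.
# The retrieval step is intentionally simplistic (keyword overlap only).

def retrieve_passages(question: str, corpus: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    """Return the k documents sharing the most words with the question."""
    terms = set(question.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda item: len(terms & set(item[1].lower().split())),
                    reverse=True)
    return scored[:k]

def build_augmented_prompt(question: str, corpus: dict[str, str]) -> str:
    """Assemble a prompt that asks the model to answer only from cited sources."""
    passages = retrieve_passages(question, corpus)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer the question using only the sources below, "
        "and cite the source IDs you relied on.\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# answer = query_model(build_augmented_prompt(question, corpus))  # placeholder model call
```

Fine-tuning, by contrast, would bake the custom data into the model's weights through additional training, which is why it is more permanent but demands far more data, compute, and time.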

The SEI researchers tried both approaches. The solution had to be scalable, one of the SEI’s three pillars of AI engineering, so they stood up LLMs of four sizes in both on-premises and cloud environments and fine-tuned them on a custom set of documents. “We benchmarked actual resources that would be needed, like cost, data, compute cycles, and time,” said Gallagher.
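A harness for that kind of resource accounting can be quite small. The sketch below is a hypothetical illustration, not the SEI's instrumentation: run_finetune stands in for an agency's fine-tuning job, and the model names and hourly rates are invented for the example.

```python
import time
import json

# Illustrative configurations to benchmark; values are placeholders,
# not the models or prices reported in the SEI study.
CONFIGS = [
    {"model": "small-7b",  "env": "on-prem", "usd_per_hour": 0.0},
    {"model": "small-7b",  "env": "cloud",   "usd_per_hour": 4.10},
    {"model": "large-70b", "env": "cloud",   "usd_per_hour": 32.80},
]

def benchmark(config: dict, run_finetune) -> dict:
    """Record wall-clock time and estimated cost for one fine-tuning run."""
    start = time.time()
    result = run_finetune(config["model"], config["env"])  # placeholder job launcher, returns a dict
    elapsed_hours = (time.time() - start) / 3600
    return {
        **config,
        "wall_clock_hours": round(elapsed_hours, 2),
        "estimated_cost_usd": round(elapsed_hours * config["usd_per_hour"], 2),
        "training_tokens": result.get("tokens_seen"),
    }

# records = [benchmark(c, run_finetune) for c in CONFIGS]
# print(json.dumps(records, indent=2))
```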

The results, detailed in a September report, showed that using unclassified infrastructure for the LLM could be affordable if the fine-tuning data set is small and unclassified. The mix of fine-tuning and augmentation would vary across intelligence agencies, though the report recommends using the quicker, cheaper augmentation until models can be fairly compared.

Assessing LLM performance is an open area of research. “There’s a limited set of metrics for evaluating LLM performance, especially for national security applications,” said Gallagher. The SEI is starting to develop quantitative metrics for LLM trustworthiness, security, and reliability. “We need to know these attributes before LLM systems are deployed in any automated function, even with humans in the loop. If intelligence analysts are to use these tools, they have to trust them, or at least know their limitations.”
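One simple quantitative measure for the question-answering use case is token-overlap F1 between a model's answer and a reference answer, as used in extractive QA benchmarks. The sketch below implements it from scratch as an example of the kind of metric involved; by itself it says nothing about trustworthiness, security, or bias.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the treaty was signed in 1994", "signed in 1994"))  # ~0.67
```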

Attributing answers to sources is a human-centered AI principle the SEI followed to help users trust the responses of the project’s test LLM. But the system’s hallucinations, biased data sources, and high sensitivity to prompt wording led the researchers to conclude that, for high-stakes intelligence tasks, LLM output is not trustworthy without expert review.

ODNI plans to use the project results to inform IC senior leadership about the potential uses, limitations, and implementation considerations of LLMs, and to inform forthcoming AI policies and standards for the IC, including those prescribed in the recent Executive Order on Safe, Secure, and Trustworthy AI.

Download the SEI report A Retrospective in Engineering Large Language Models for National Security.