Applying Causal Learning to Evaluate Large Language Models (LLMs)

SEI Report
This report describes how the SEI applied causal discovery to LLM summarization, exemplifying a way to address bias when evaluating complex new technology.
Publisher

Software Engineering Institute

DOI (Digital Object Identifier)
10.1184/R1/30251989

Abstract

As the SEI’s body of causal work has evolved into an end-to-end causal discovery and inference method and tool suitable for detecting bias in ML and AI models, SEI researchers are beginning to investigate whether the first step of the method, causal discovery, can also be applied to LLMs. The SEI’s approach to exploring this question comprises three steps: (1) obtain a dataset of story/summary pairs to use as ground truth, (2) design prompt styles (e.g., purpose, tone) with which to prompt a Summarizer LLM to summarize a story from one of those pairs, and (3) design a set of summarization-quality features that an Evaluator LLM uses to score the summaries generated by the Summarizer LLM. In this way, SEI researchers created a dataset of higher-level features as input to causal discovery. The resulting causal graph demonstrates that a causal relationship between the focus of a prompt style and summary quality is often discoverable when the prompt’s focus and the quality feature overlap. This overall approach may benefit software engineering and LLM research by providing a more formal methodology for assessing the nuanced cause-and-effect relationships unique to a given LLM while reducing confounding.
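To make the three-step setup above concrete, here is a minimal sketch (ours, not the report's) of how such a pipeline might be wired together. The prompt-style labels, quality-feature names, and the summarize/score_summary placeholders are all hypothetical, and the discovery step uses the open-source causal-learn package's PC algorithm purely for illustration; the report does not say which discovery algorithm the SEI's tool applies.

import numpy as np
from causallearn.search.ConstraintBased.PC import pc  # PC causal discovery

# Step 2: hypothetical prompt styles (e.g., purpose, tone) used to prompt
# the Summarizer LLM; each row of the dataset records the style used.
PROMPT_STYLES = ["neutral", "persuasive", "technical"]  # assumed labels

def summarize(story: str, style: str) -> str:
    """Placeholder for a call to the Summarizer LLM."""
    raise NotImplementedError

def score_summary(story: str, summary: str) -> dict:
    """Placeholder for the Evaluator LLM, returning hypothetical
    summarization-quality features (e.g., coverage, conciseness)."""
    raise NotImplementedError

# Steps 1 and 3: turn ground-truth stories into a table of higher-level
# features -- one prompt-style column plus one column per quality score.
def build_feature_table(stories: list[str]) -> np.ndarray:
    rows = []
    for story in stories:
        for style in PROMPT_STYLES:
            summary = summarize(story, style)
            scores = score_summary(story, summary)
            rows.append([PROMPT_STYLES.index(style)]
                        + [scores[k] for k in sorted(scores)])
    return np.array(rows, dtype=float)

# Causal discovery over the feature table; alpha is the significance
# level for the (default Fisher-z) conditional-independence tests.
# data = build_feature_table(stories)
# cg = pc(data, alpha=0.05)
# for edge in cg.G.get_graph_edges():
#     print(edge)  # inspect edges from prompt style to quality features

In a graph recovered this way, an edge from the prompt-style column into a quality column would correspond to the kind of discoverable prompt-focus-to-quality relationship the abstract describes.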

Cite This SEI Report

Konrad, M., Mellinger, A., Gates, L. P., Shepard, D., & Testa, N. (2026, March 2). Applying Causal Learning to Evaluate Large Language Models (LLMs). Retrieved March 7, 2026, from https://doi.org/10.1184/R1/30251989.

@techreport{konrad_2026,
author={Konrad, Michael and Mellinger, Andrew and Gates, Linda Parker and Shepard, David and Testa, Nicholas},
title={Applying Causal Learning to Evaluate Large Language Models (LLMs)},
month={Mar},
year={2026},
howpublished={Carnegie Mellon University, Software Engineering Institute's Digital Library},
url={https://doi.org/10.1184/R1/30251989},
note={Accessed: 2026-Mar-7}
}

Konrad, Michael, Andrew Mellinger, Linda Parker Gates, David Shepard, and Nicholas Testa. "Applying Causal Learning to Evaluate Large Language Models (LLMs)." Carnegie Mellon University, Software Engineering Institute's Digital Library. Software Engineering Institute, March 2, 2026. https://doi.org/10.1184/R1/30251989.

M. Konrad, A. Mellinger, L. P. Gates, D. Shepard, and N. Testa, "Applying Causal Learning to Evaluate Large Language Models (LLMs)," Carnegie Mellon University, Software Engineering Institute's Digital Library. Software Engineering Institute, 2-Mar-2026 [Online]. Available: https://doi.org/10.1184/R1/30251989. [Accessed: 7-Mar-2026].

Konrad, Michael, Andrew Mellinger, Linda Parker Gates, David Shepard, and Nicholas Testa. "Applying Causal Learning to Evaluate Large Language Models (LLMs)." Carnegie Mellon University, Software Engineering Institute's Digital Library, Software Engineering Institute, 2 Mar. 2026. https://doi.org/10.1184/R1/30251989. Accessed 7 Mar. 2026.

Konrad, Michael; Mellinger, Andrew; Gates, Linda Parker; Shepard, David; & Testa, Nicholas. Applying Causal Learning to Evaluate Large Language Models (LLMs). Software Engineering Institute. 2026. https://doi.org/10.1184/R1/30251989