Applying Causal Learning to Evaluate Large Language Models (LLMs)

SEI Report
This report describes how the SEI applied causal discovery to LLM summarization, exemplifying a way to address bias when evaluating complex new technology.
Publisher

Software Engineering Institute

DOI (Digital Object Identifier)
10.1184/R1/30251989

Abstract

As the SEI’s body of causal work has evolved into an end-to-end causal discovery and inference method and tool suitable for detecting bias in ML and AI models, SEI researchers are beginning to investigate whether the first step of the method, causal discovery, can also be applied to LLMs. The SEI’s approach to exploring this question comprises three steps: (1) obtain a dataset of story/summary pairs to use as ground truth, (2) design prompt styles (e.g., purpose, tone) with which to prompt a Summarizer LLM to summarize a story from one of those pairs, and (3) design a set of summarization-quality features that an Evaluator LLM uses to score the summaries generated by the Summarizer LLM. In this way, SEI researchers created a dataset of higher-level features as input to causal discovery. The resulting causal graph demonstrates that a causal relationship between the focus of a prompt style and summary quality is often discoverable when the prompt’s focus and the quality feature overlap. This overall approach may benefit software engineering and LLM research by providing a more formal methodology for assessing the nuanced cause-and-effect relationships unique to a given LLM while reducing confounding.
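To make the three-step setup above concrete, here is a minimal sketch (ours, not the report's) of how such a pipeline might be wired together. The prompt-style labels, quality-feature names, and the summarize/score_summary placeholders are all hypothetical, and the discovery step uses the open-source causal-learn package's PC algorithm purely for illustration; the report does not say which discovery algorithm the SEI's tool applies.

import numpy as np
from causallearn.search.ConstraintBased.PC import pc  # PC causal discovery

# Step 2: hypothetical prompt styles (e.g., purpose, tone) used to prompt
# the Summarizer LLM; each row of the dataset records the style used.
PROMPT_STYLES = ["neutral", "persuasive", "technical"]  # assumed labels

def summarize(story: str, style: str) -> str:
    """Placeholder for a call to the Summarizer LLM."""
    raise NotImplementedError

def score_summary(story: str, summary: str) -> dict:
    """Placeholder for the Evaluator LLM, returning hypothetical
    summarization-quality features (e.g., coverage, conciseness)."""
    raise NotImplementedError

# Steps 1 and 3: turn ground-truth stories into a table of higher-level
# features -- one prompt-style column plus one column per quality score.
def build_feature_table(stories: list[str]) -> np.ndarray:
    rows = []
    for story in stories:
        for style in PROMPT_STYLES:
            summary = summarize(story, style)
            scores = score_summary(story, summary)
            rows.append([PROMPT_STYLES.index(style)]
                        + [scores[k] for k in sorted(scores)])
    return np.array(rows, dtype=float)

# Causal discovery over the feature table; alpha is the significance
# level for the (default Fisher-z) conditional-independence tests.
# data = build_feature_table(stories)
# cg = pc(data, alpha=0.05)
# for edge in cg.G.get_graph_edges():
#     print(edge)  # inspect edges from prompt style to quality features

In a graph recovered this way, an edge from the prompt-style column into a quality column would correspond to the kind of discoverable prompt-focus-to-quality relationship the abstract describes.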

Cite This SEI Report

Konrad, M., Mellinger, A., Gates, L. P., Shepard, D., & Testa, N. (2026, March 2). Applying Causal Learning to Evaluate Large Language Models (LLMs). Retrieved March 7, 2026, from https://doi.org/10.1184/R1/30251989.

@techreport{konrad_2026,
author={Konrad, Michael and Mellinger, Andrew and Gates, Linda Parker and Shepard, David and Testa, Nicholas},
title={Applying Causal Learning to Evaluate Large Language Models (LLMs)},
month={Mar},
year={2026},
howpublished={Carnegie Mellon University, Software Engineering Institute's Digital Library},
url={https://doi.org/10.1184/R1/30251989},
note={Accessed: 2026-Mar-7}
}

Konrad, Michael, Andrew Mellinger, Linda Parker Gates, David Shepard, and Nicholas Testa. "Applying Causal Learning to Evaluate Large Language Models (LLMs)." Carnegie Mellon University, Software Engineering Institute's Digital Library. Software Engineering Institute, March 2, 2026. https://doi.org/10.1184/R1/30251989.

M. Konrad, A. Mellinger, L. P. Gates, D. Shepard, and N. Testa, "Applying Causal Learning to Evaluate Large Language Models (LLMs)," Carnegie Mellon University, Software Engineering Institute's Digital Library. Software Engineering Institute, 2-Mar-2026 [Online]. Available: https://doi.org/10.1184/R1/30251989. [Accessed: 7-Mar-2026].

Konrad, Michael, Andrew Mellinger, Linda Parker Gates, David Shepard, and Nicholas Testa. "Applying Causal Learning to Evaluate Large Language Models (LLMs)." Carnegie Mellon University, Software Engineering Institute's Digital Library, Software Engineering Institute, 2 Mar. 2026. https://doi.org/10.1184/R1/30251989. Accessed 7 Mar. 2026.

Konrad, Michael; Mellinger, Andrew; Gates, Linda Parker; Shepard, David; & Testa, Nicholas. Applying Causal Learning to Evaluate Large Language Models (LLMs). Software Engineering Institute. 2026. https://doi.org/10.1184/R1/30251989