Detection of Malicious Code Using Static Taint Analysis

Detecting malicious code is a challenge, particularly when it’s implanted in otherwise legitimate software. If undetected, malware injected into legitimate software can result in costly compromises of computer systems, such as in the SolarWinds incident of 2020. This project aims to produce a tool to detect such malicious code to prevent system compromises. Our goal is to enable analysts throughout the U.S. Department of Defense (DoD) to detect malicious code before it can compromise DoD systems.

Our tool will be able to detect two types of malicious code: (1) exfiltration of sensitive data and (2) time bombs, logic bombs, remote access Trojans (RATs), and other malicious code that calls a potentially sensitive system application programming interface (API) function (e.g., starting a new process) in response to a potentially suspicious trigger (e.g., on a specific date or in response to an incoming network packet). We use only static analysis, not dynamic analysis, so we do not need to execute potentially malicious code. While our initial focus is on C/C++ codebases, we perform all analysis at the level of LLVM intermediate representation (IR), so we will also have some support for binaries via lifting to LLVM IR.
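To make the second category concrete, the following is a deliberately harmless sketch of the "sensitive call guarded by a suspicious trigger" shape. It is written in Python only for brevity; the tool itself targets C/C++ and LLVM IR. The names `start_process`, `nightly_maintenance`, and the trigger date are hypothetical, and the "sensitive API" here only records the request instead of launching anything.

```python
from datetime import date
from typing import List

# Records requested commands instead of executing them (illustration only).
calls: List[str] = []


def start_process(command: str) -> None:
    """Harmless stand-in for a sensitive system API such as process creation."""
    calls.append(command)


def nightly_maintenance(today: date) -> None:
    """Looks like routine housekeeping but hides a date-triggered call."""
    # Suspicious trigger: comparison against a hard-coded date.
    if today >= date(2030, 1, 1):
        # Sensitive operation that runs only once the trigger fires.
        start_process("payload")


nightly_maintenance(date(2029, 12, 31))  # before the trigger: nothing happens
nightly_maintenance(date(2030, 1, 1))    # on the trigger date: call is made
print(calls)
```

The point of interest for an analyst is the pairing: a sensitive call that is reachable only through a branch on a suspicious condition.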

Will Klieber

Software Security Engineer

Our tool uses taint analysis based on the Interprocedural, Finite, Distributive, Subset (IFDS) algorithm. A limitation of traditional taint analysis is that it conflates all flow paths from a given source to a given sink, so a malicious flow path can be “hidden” by a benign flow path. In contrast, in this project we will identify the conditions under which the flows happen, especially conditions that might indicate malicious code, so that a benign flow does not mask a malicious flow.
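The masking problem can be illustrated with a toy example. The sketch below is not IFDS and not the tool's implementation: it simply enumerates paths in a small, acyclic, hand-written control-flow graph and treats reachability as flow. The graph, node names, and branch labels are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Edge = Tuple[str, Optional[str]]  # (successor, branch label or None)

# Hypothetical CFG: a password is read, then sent either on a user-requested
# sync (benign) or when a hard-coded date is reached (suspicious).
CFG: Dict[str, List[Edge]] = {
    "read_password": [("check_sync", None)],
    "check_sync": [
        ("send_to_server", "user_requested_sync"),
        ("check_date", "not user_requested_sync"),
    ],
    "check_date": [
        ("send_to_server", "today == hardcoded_date"),
        ("exit", "today != hardcoded_date"),
    ],
    "send_to_server": [("exit", None)],
    "exit": [],
}


def flow_paths(cfg: Dict[str, List[Edge]], node: str, sink: str,
               conditions: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the branch labels along every path from node to sink (acyclic CFG assumed)."""
    if node == sink:
        yield conditions
        return
    for successor, label in cfg[node]:
        extended = conditions + ((label,) if label else ())
        yield from flow_paths(cfg, successor, sink, extended)


def traditional_taint(cfg, source: str, sink: str) -> Set[Tuple[str, str]]:
    """One (source, sink) pair, no matter how many paths exist."""
    has_path = any(True for _ in flow_paths(cfg, source, sink))
    return {(source, sink)} if has_path else set()


def condition_aware_taint(cfg, source: str, sink: str) -> Set[Tuple[str, str, str]]:
    """One (source, sink, condition) tuple per branch edge that a flow path depends on."""
    return {(source, sink, cond)
            for conds in flow_paths(cfg, source, sink)
            for cond in conds}


print(traditional_taint(CFG, "read_password", "send_to_server"))
for finding in sorted(condition_aware_taint(CFG, "read_password", "send_to_server")):
    print(finding)
```

The traditional result collapses both paths into a single pair, so the date-triggered flow is indistinguishable from the user-requested one. The condition-aware result keeps the `today == hardcoded_date` edge visible as its own tuple.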

Currently, our tool’s output is a set of tuples of the form (source, sink, conditional_edge). Each such tuple indicates that there is a flow path from source to sink that depends on conditional_edge, where conditional_edge is an outgoing edge (in the control-flow graph) of a conditional jump. (For sensitive operations without a source-to-sink flow, the source field is NULL and the sink field is the sensitive API call.) This output isn’t easy for a human analyst to use, so we are working on improving the tool to produce output that concisely and precisely characterizes the potentially malicious behaviors of the codebase. This upgraded output will allow a human analyst to quickly and accurately determine whether the behavior is benign or malicious. We aim to filter out obvious false positives and provide enough information to enable an analyst to quickly dismiss almost all remaining false positives without manually examining the source code or decompiled binary.
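As a rough illustration of the tuple format described above, the sketch below models each tuple as a small record and groups the conditional edges by (source, sink) pair. This is only one possible way to make the raw output easier to scan; it is not the tool's actual output format. The `Finding` class, the `group_by_flow` helper, and the sample locations are hypothetical, and `None` stands in for the NULL source field.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    source: Optional[str]   # None models the NULL source (no source-to-sink flow)
    sink: str               # sink, or the sensitive API call when source is None
    conditional_edge: str   # outgoing CFG edge of a conditional jump


def group_by_flow(findings: List[Finding]) -> Dict[Tuple[Optional[str], str], List[str]]:
    """Collect the conditional edges reported for each (source, sink) pair."""
    grouped: Dict[Tuple[Optional[str], str], List[str]] = defaultdict(list)
    for finding in findings:
        grouped[(finding.source, finding.sink)].append(finding.conditional_edge)
    return {key: sorted(edges) for key, edges in grouped.items()}


# Hypothetical sample findings.
findings = [
    Finding("read_credentials", "send_packet", "sync.c:12 -> true branch"),
    Finding("read_credentials", "send_packet", "auth.c:88 -> true branch"),
    Finding(None, "start_process", "timer.c:40 -> true branch"),
]

for (source, sink), edges in group_by_flow(findings).items():
    print(f"{source or 'NULL'} -> {sink}: {edges}")
```

Grouping in this way puts every condition that guards a given flow side by side, which is the kind of view an analyst would need to spot a single unusual trigger among otherwise routine ones.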

Our project addresses the DoD’s operational need for better capabilities for detecting malicious code inserted through supply-chain attacks. While some tools support this task, they still require time-consuming and expensive manual analysis. If successful, our tool will produce a level of assurance (with respect to applicable classes of malicious code) with 10–100 times less manual effort than existing tools require.

In Context: This FY2023-24 Project

builds on the CMU SEI’s diverse expertise and experience in software analysis and malware analysis
aligns with the CMU SEI technical objective to be trustworthy in construction and implementation and resilient in the face of operational uncertainties, including known and yet unseen adversary capabilities
aligns with the OUSD(R&E) critical technology priority of leveraging advanced computing and software technologies

Software Engineering Institute

Research Review 2023