Software Engineering Institute Carnegie Mellon

State of the Practice of Intrusion Detection Technologies

Appendix E Related Efforts

In addition to the commercial vendor and research communities, several organizations are tackling the wider issues of IDS interoperability, evaluation, and user education. We describe some of these efforts below.  

 

Lincoln Laboratory (LL)

The following is quoted directly from a project-specific Lincoln Laboratory Web site.

The Information Systems Technology Group of MIT Lincoln Laboratory, under Defense Advanced Research Projects Agency (DARPA ITO) and Air Force Research Laboratory (AFRL/SNHS) sponsorship, is collecting and distributing the first standard corpora for evaluation of computer network intrusion detection systems. LL is also coordinating, with AFRL, the first formal, repeatable, and statistically-significant evaluations of intrusion detection systems. These evaluations measure probability of detection and probability of false-alarm for each system under test.

These evaluations are contributing to the intrusion detection research field by providing direction for research efforts and an objective calibration of the current technical state-of-the-art. They are of interest to all researchers working on the general problem of workstation and network intrusion detection. The evaluation is designed to be simple, to focus on core technology issues, and to encourage the widest possible participation by eliminating security and privacy concerns, and by providing data types that are used commonly by the majority of intrusion detection systems.

This work represents the most comprehensive evaluation of research ID systems that has been performed to date. While the work is flawed in many respects, it does provide a basis for making a rough comparison of existing systems under a common set of circumstances. Only limited descriptions of the experiment have appeared in print and it may be that a more detailed exposition of the work will alleviate some of the criticisms.

The most detailed description available at the time of this report is Kristopher Kendall's BS/MS Thesis [B119]. The Lincoln Labs team has made presentations on the experiment at various meetings attended by one of the authors of this report.

These include the August, 1999 DARPA PI meeting in Phoenix, AZ and the Recent Advances in Intrusion Detection Workshop (RAID 99) at Purdue University in September, 1999.

Presentations [B120], [B121] similar to ones given at those meetings appear at the Lincoln Labs experiment site, http://www.mit.edu (see footnote 1). A paper describing some aspects of the 1998 experiment [B144] was scheduled to be published in January, 2000.

For a variety of political, legal, and technical reasons, it was determined that actual network data could not be used for the evaluation; instead, synthetic data was generated, and the data used in the evaluation was derived from it. A network was set up that generated traffic said to be similar to that seen at the network boundary of a typical Air Force base. This traffic consisted of synthetic background traffic to which attacks on a few machines inside the base were added. The data consisted of tcpdump data sniffed just outside the base, Sun Solaris BSM audit data from an attack victim host inside the base, and file system data from the host (for integrity checking). The data was made available to those who were evaluated in 1998. In 1999, additional tcpdump data from inside the base and audit data from a victim running Windows NT were provided.

There are a number of unanswered questions concerning the background data. Although it is claimed to be similar to actual data in terms of a variety of statistical characteristics, there is no evidence to demonstrate that the false alarm behavior that it elicits from the systems being evaluated is similar to the behavior they would demonstrate using real data. Since false alarms are a key measure of the evaluation, this is a critical failing. The traffic load or data rate used for the evaluation is not specified (see footnote 2). Although this is not a real-time evaluation, data is tagged with times, and it is possible that the simulated data rate may affect the performance of some systems for both attack detection and false alarm behavior. The risks associated with the use of synthetic data in experimental settings are well known. The customary risk mitigation methods, such as the performance of pilot studies and targeted comparisons between the responses to synthetic and natural data, appear not to have been applied in this case.

A common problem in the conduct of any experimental evaluation is the choice of an appropriate unit of analysis that can be used to express the results. This needs to be chosen with care to avoid experimental bias. In 1998, the unit of analysis was chosen to be a "session," that is, a run (see footnote 3) of one of the protocols represented in the tcpdump data. The evaluatees were provided with some weeks of "training" data in which all sessions were identified and sessions containing an attack were labeled with the attack they contained.
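
The session record described here (and in footnote 3) can be pictured as a small data structure. The following is a minimal sketch, not the format actually distributed by Lincoln Laboratory: the field names, the time units, the host names, and the attack label string are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    """One unit of analysis: a single run of a protocol in the tcpdump data."""
    start_time: float  # seconds from the start of the capture (assumed unit)
    duration: float    # seconds (assumed unit)
    protocol: str      # for example "telnet" or "smtp"
    src_host: str
    src_port: int
    dst_host: str
    dst_port: int
    attack: Optional[str] = None  # training label; None means no attack


def attack_sessions(sessions):
    """Return only the sessions labeled as containing an attack."""
    return [s for s in sessions if s.attack is not None]


# Two hypothetical training records, one clean and one labeled.
training = [
    Session(0.0, 12.5, "telnet", "outside-a", 1042, "inside-victim", 23),
    Session(30.0, 2.0, "smtp", "outside-b", 2011, "inside-mail", 25,
            attack="hypothetical-attack"),
]
labeled = attack_sessions(training)
```

In the evaluation data the attack field would be withheld, and the system under test would supply a confidence value for each session instead.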

Evaluation data was similarly divided into sessions and evaluatees were asked to label each session with a measure that indicated the degree of confidence that they had in the detection of an attack in that session. Whether a measure other than 0 (NO) or 1 (YES) is appropriate depends to a large extent on the detection and evaluation approach used in the system under test.

The use of "session" as the unit of analysis may be questionable. For the attacks that are inserted into the test data, the attack is likely to be embedded in a session, although complex, multistage attacks can be spread over multiple sessions involving multiple sources and protocols. It is not the case that every message or packet that is part of an attack-containing session is part of the attack. Similarly, it is possible for a system to raise a false alarm based on examination of messages from several distinct sessions. This is a hard problem and it is not clear what unit of analysis is appropriate, but whatever choice is made should be justified as a part of the experiment design.

The results of the evaluation are presented as Receiver Operating Characteristic (ROC) curves, which may or may not be appropriate for all systems. The ROC plots percentage of attacks detected against percentage of false alarms. Under the proper circumstances, it is a powerful method for presenting system performance in a way that separates system behavior from environmental factors. If a binary measure of detection confidence is used, the ROC will consist of a single operating point. Lines are usually drawn from the (0,0) point to the operating point and from the operating point to the (1,1) point on the graph. This is based on the assumption that the decision criteria that produced the point could be changed to move the decision towards the point of universal rejection (no sessions are intrusions) with 0% detection and 0% false alarms or towards the point of universal acceptance (all sessions are intrusions) with 100% detection and 100% false alarm rate. Intuitively, this assumption does not seem applicable to, for example, rule-based systems that have no memory. If the measure varies, then a curve can be obtained by varying a threshold and counting as detections only those sessions whose measures exceed the threshold. These curves are meaningful if the measure is based on some knowledge of the distribution of attacks in the background. It was the intent of the "training data" to provide this knowledge, but this approach raises additional questions about the applicability of training based on artificial data to the real world (see footnote 4) and about the feasibility of performing such training in the real world.
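
The threshold construction described above can be sketched in a few lines. This is an illustrative sketch only, not the scoring procedure used by Lincoln Laboratory: the function name, the choice of thresholds (each distinct score, with a session flagged when its score is at or above the threshold), and the sample scores and labels are assumptions.

```python
def roc_points(scores, is_attack):
    """Return (false_alarm_rate, detection_rate) pairs, one per threshold.

    scores:    per-session confidence that the session contains an attack
    is_attack: ground-truth booleans in the same order as scores
    """
    n_attack = sum(is_attack)
    n_normal = len(is_attack) - n_attack
    points = [(0.0, 0.0)]  # universal rejection: nothing is flagged
    # Lower the threshold step by step; the lowest one flags every session,
    # which yields the universal acceptance point (1.0, 1.0).
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        detections = sum(f and a for f, a in zip(flagged, is_attack))
        false_alarms = sum(f and not a for f, a in zip(flagged, is_attack))
        points.append((false_alarms / n_normal, detections / n_attack))
    return points


truth = [True, True, False, False, False, False]

# A binary (0 or 1) measure gives one operating point between the corners.
binary = roc_points([1, 0, 0, 1, 0, 0], truth)

# A graded measure gives one point per distinct confidence value.
graded = roc_points([0.9, 0.4, 0.1, 0.6, 0.2, 0.1], truth)
```

With the binary scores, the result is the three points (0, 0), (0.25, 0.5), and (1, 1), which is exactly the two-segment line the text describes. Whether the segments joining them mean anything depends on the assumption questioned above, namely that the system's decision criteria could actually be moved toward either corner.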

For 1999, no unit of analysis is specified, and ROC-like curves with an X axis giving false alarms per day (see footnote 5), rather than percent false alarms, are used. Detections are still reported as percent detections based on the number of attacks injected. Curves of this form appear in the keyword recognition area of speech perception, but we have been unable to discover a rationale for their use. Unless there is a relatively constant relationship between errors per unit time and percent errors, this presentation introduces substantial environmental bias into the results (see footnote 6).
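
The rate dependence can be made concrete with a small calculation. The sketch below assumes a system with no rate-dependent sensitivity, so that replaying the same data faster or slower produces exactly the same alarms; it also assumes that sessions are the unit for percent false alarms, which the 1999 evaluation does not specify. All counts and durations are hypothetical.

```python
def summarize(n_sessions, n_attacks, n_detected, n_false_alarms, days):
    """Return (percent false alarms, false alarms per day, percent detected)."""
    n_normal = n_sessions - n_attacks
    pct_false = 100.0 * n_false_alarms / n_normal
    per_day = n_false_alarms / days
    pct_detected = 100.0 * n_detected / n_attacks
    return pct_false, per_day, pct_detected


# The same hypothetical data set and the same alarms, presented over
# 100 days, 10 days, and 1 day (each step is ten times faster).
counts = dict(n_sessions=10000, n_attacks=20, n_detected=12, n_false_alarms=50)
slow = summarize(days=100, **counts)
base = summarize(days=10, **counts)
fast = summarize(days=1, **counts)

# Ratio of false alarms per day between the fastest and slowest replay.
spread = fast[1] / slow[1]
```

Here the percent false alarm and percent detection figures are identical in all three cases, while false alarms per day move from 0.5 to 5 to 50. Under this reading, the spread between the slowest and fastest presentation is a factor of 100, even though the system and the data have not changed.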

In summary, the Lincoln Labs evaluation represents a monumental but incomplete effort. Many questions about the details of the evaluation design and its implementation are not answered in the published literature. The way in which the results are presented appears to be questionable, and it is not clear that the training methodology used with systems that require training can be replicated in actual deployment.  

 

International Computer Security Association (ICSA)

The ICSA Intrusion Detection Systems Consortium (IDSC) was established in 1998 to provide an open forum in which ID product developers could work toward common goals such as educating end users, creating industry standards, achieving product interoperability, and maintaining product and marketing integrity. Any commercial vendor of intrusion detection or vulnerability assessment products and services is welcome to join this association.

The mission of the IDSC is to facilitate the adoption of intrusion detection products by defining common terminology, increasing market awareness, maintaining product integrity, and influencing industry standards. [B23, http://www.icsa.net/services.consortia/intrusion]  

 

System Administration, Networking, and Security (SANS) Organization

SANS has established ID'Net, a test environment within which IDS product developers can demonstrate their products. The goal of ID'Net is to showcase all of the available ID systems and their abilities in a real-time, controlled environment to a target audience, collect valuable data, and be an effective learning and teaching tool for all who participate. ID'Net consists of a DNS server and an SMTP server running Linux and a Windows NT-based web server. Other boxes on the net include vendor and SANS ID systems.

Tests performed within ID'Net include scanning for vulnerabilities and inviting other systems to launch attacks against the environment in which the products are installed. At the May, 1999 SANS Conference, seven vendors participated by installing their ID products and then monitoring how successfully those products scanned for eleven types of vulnerabilities and detected nineteen types of attacks. SANS plans to enhance the test environment's capabilities and invite other vendors to participate at future SANS conferences.

For further description of the next three efforts, refer to Network Intrusion Detection by Stephen Northcutt [B76].  

 

Open Platform for Secure Enterprise Connectivity (OPSEC, Checkpoint)

OPSEC has been around for over a year and is stable and widely used as an application programming interface (API). The API is published and available as a software development kit.  

 

Common Content Inspection (CCI, Checkpoint)

CCI is also an API. Once the firewall or IDS sensor has grabbed the packet, file, or communication stream and realized that it needs additional inspection, it redirects it to an inspection engine within CCI. The types of inspections that are performed include intrusion signatures, prohibited URLs, viruses, hostile applets, and scanning for content that is company proprietary.  

 

Adaptive Network Security Alliance (ANSA, ISS)

ANSA allows vendors to create systems that work with and enhance the ISS family of products. This interoperability specification will support four functional areas: automated response, lockdown, decision support, and security management. Further information is available at http://ansa.iss.net.  

 

Emerging Standards

Within the past few years, the ID community has taken action to create standards for the communication of intrusion data between different ID components.

Having standards to communicate intrusion data would enable ID systems from multiple vendors to combine and interoperate, thus forming a more complete and comprehensive IDS. There are two significant efforts underway to define standards for the communication and exchange of intrusion data.
 

 

Common Intrusion Detection Framework (CIDF) (Research Community Pursuit)

The CIDF project, which is sponsored by DARPA, is developing protocols and APIs to enable ID research projects to share data and resources. CIDF is not intended as a standard that will influence the commercial marketplace; it is a research project. In addition to defining protocols for communication between ID systems, this project has also created a Common Intrusion Specification Language (CISL) that defines a standard way to represent intrusion data. Further information about CIDF is available at http://gost.isi.edu/cidf.  

 

Intrusion Detection Working Group (IDWG) (Primarily Vendor Community Pursuit)

The IDWG is an Internet Engineering Task Force working group that was formed by vendors in the ID community who did not like some of the work done by the CIDF. As stated in its charter, this working group intends to "define data formats and exchange procedures for sharing information of interest to ID and response systems, and to management systems that may need to interact with them." The charter and mailing list archive for the IDWG are available at http://www.ietf.org/html.charters/idwg-charter.html and at http://www.semper.org/idwg-public.  

 

 

 

1 This site is password protected. For information concerning access, contact intrusion@sst.ll.mit.edu.

2 We were given raw file sizes for 2 days of inside and 1 day of outside data for 1999. These correspond to average data rates of 34 to 48 kilobits per second or less, depending on the tcpdump header overhead. We do not know how this compares to average rates for the typical AFB.

3 In the tcpdump data, a session is represented by a start time, duration, protocol, source (host and port) and destination (host and port).

4 Both the frequency and distribution of attack types contained in the evaluation data seem to be substantially different from what we have observed in the wild. We know of no published attack data for the sites on which the evaluation data was modeled, but suspect that the evaluation data is not typical in this respect.

5 False alarms per day appear overlaid on the percent false alarm scales for some of the ROCs presented for 1999. Even here, this usage should be accompanied by caveats indicating that these figures are a function of the data rate and its composition as well as the system under evaluation.

6 Simply varying the rate at which the data is presented to the system by a factor of 10 in either direction would change the false alarm rate by a factor of 100 in the absence of rate dependent sensitivities in the evaluated system. Under the same circumstances, the true detection percentage would remain constant.
 

 

