Runtime Assurance for Big Data Systems

Created June 2018

Rapid growth of data has accelerated evolution in the scale of software systems, but design-time analysis cannot assure their runtime performance. To address this challenge, we designed a reference architecture to automatically generate and insert monitors and aggregate the metric streams. It monitored 20 metrics from a 10,000-node system, with minimal impact on CPU and network utilization.

How Do You Assure Runtime Performance of a System You Can’t Fully Control?

The exponential growth of data in the last decade has fueled rapid evolution in the scale of software systems. From modern commercial airlines, to health-care systems, to Internet giants Google and Facebook, to the 14,000 million "things" that make up the Internet of Things, big data systems have become ubiquitous. After these systems are deployed to production, there is typically little control over changes to the number of input sources and rates, the number of users and external applications, and the quality of service on a shared cloud infrastructure.

For these systems, design-time analysis cannot assure runtime performance in production as the production environment continues to evolve. These systems must include sophisticated runtime observability and measurement of a system’s health and status, feeding into historical data repositories for forensic analysis, capacity planning, and predictive analytics. But challenges include economical creation and insertion of monitors into hundreds or thousands of computation and storage nodes; efficient, low-overhead collection and storage of measurements; and application-aware aggregation and visualization.

Our Collaborators

We worked with a team of students in the Master of Software Engineering program at Carnegie Mellon University. The students developed and tested the prototype implementation of the reference architecture.

Our Model-Driven Architecture Provides Observability for Big Data Storage

We built a reference architecture to address these challenges. We used some open source components—including the Eclipse Modeling Framework, Cassandra, and a daemon called collectd—to streamline our development effort and provide advanced capabilities right out of the box. The architecture uses a model-driven approach to facilitate rapid customization of a framework to different systems' observability requirements without the need for custom code for each deployment, so it reduces both cost and effort. The architecture, tool set, and runtime framework allow a designer to describe a heterogeneous big data storage system as a model and deploy it automatically to configure an observability framework.

To assess the performance and scalability of our observability architecture, we performed a series of tests using the Amazon Web Services cloud platform. We configured a pool of test daemons to simulate metrics collection from 100 to 10,000 database nodes. Our scalability tests demonstrated that the implementation can monitor 20 different metrics from 10,000 database nodes with a sampling interval of 20 seconds. In shorter collection intervals, we lost metrics due to the sustained write load required in the metrics database. This indicates that observability at scale must be able to support very high write loads in a metrics collection database.

Looking Ahead

Our current implementation provides observability at the database layer. Future work will extend these model-driven capabilities to other layers in a big data system and improve the scalability of our framework. We will also investigate the potential of automated resource discovery to compose an observability system dynamically.

Software Engineering Institute