Date of Award

Spring 1-1-2019

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

First Advisor

Dirk Grunwald

Second Advisor

Sangtae Ha

Third Advisor

Qin Lv

Fourth Advisor

Daniel Massey

Fifth Advisor

Eric Keller

Abstract

We provide a domain-specific language called the Streaming Analytics Language (SAL) for writing concise but expressive analyses of streaming temporal graphs. We target problems where the data arrives as an infinite stream whose volume is prohibitive, requiring a single pass over the data and tight space and time complexity constraints. Each item in the stream can be thought of as an edge in a graph, and each edge has an associated timestamp and duration.

Cyber security data is a real-world example of a streaming temporal graph: machines communicate with each other within a network, forming a streaming sequence of edges with temporal information. We therefore demonstrate the value of SAL by applying it to a wide range of cyber-related problems. By combining vertex-centric computations that create features per vertex with subgraph matching that finds communication patterns of interest, we cover a wide spectrum of important cyber use cases. As an example, we discuss Verizon’s Data Breach Investigations Report and show how SAL can be used to capture most of the nine different categories of cyber breaches. We also apply SAL to discovering botnet activity within network traffic in 13 different scenarios, with an average area under the curve (AUC) of the receiver operating characteristic (ROC) of 0.87.

In addition to the SAL language, we contribute an implementation called the Streaming Analytics Machine (SAM). With SAM, we can run SAL programs in parallel on a cluster, achieving rates of a million netflows per second and scaling to 128 nodes, or 2560 cores. We compare SAM to another streaming framework, Apache Flink, and find that Flink cannot scale past 32 nodes for the problem of finding triangles (subgraphs of three interconnected nodes) within the streaming graph. SAM also excels when the subgraphs are frequent, continuing to find the expected number of subgraphs, while Flink's performance degrades and it under-reports. Together, SAL and SAM provide an expressive and scalable infrastructure for performing analyses on streaming temporal graphs.
