A simple approach to anomaly detection in periodic big data streams
By Simon Kern and Richard Zinck.
One of the signature traits of big data is that large volumes are created in short periods of time. This data often comes from connected devices, such as mobile phones, vehicle fleets or industrial machinery. The reasons for generating and observing this data are many, yet a common problem is the detection of anomalous behavior. This may be a machine in a factory that is on the verge of malfunctioning, say due to the imminent breaking of some part, or a member of a vehicle fleet that has experienced unusual or hazardous environmental conditions.
Monitoring data is one way to detect such problems early by identifying irregular behavior. Early detection allows for intervention, often reducing the cost of a malfunction. The entire proposition hinges on the idea that it is possible to separate normal from irregular behavior with a high degree of confidence. As in any classification problem, errors can be made in at least two ways: a normal behavior can be identified as irregular or an irregular behavior not detected.
This is important in the assessment of the quality of the system as well as in the operational setting. The key question in operations is when to alert a person and start a potentially costly process of investigation. There is a high potential for frustration if done incorrectly. On the other hand, the system should reliably detect issues so that they can be addressed early. We solve this problem by introducing different error levels. The detection of a rare event, and potential anomaly, raises an alert of lowest intensity. As such alerts accumulate, an alert of higher degree is triggered that leads to investigation.
Here, we use a simple algorithm that can detect rare and potentially irregular behavior in a time series that has periodic patterns (inspired by ). The pattern may arise from differences in utilization of a service during the day, as well as patterns depending on the weekday. Recently, Twitter released an open source implementation for detecting anomalies in such time series using GNU R. The algorithm combines a series of sophisticated procedures to model periodicity and smoothing. The periodicity is extracted and rare events are identified against a normalized distribution. 
Our solution uses structural knowledge about the system, combined with the data, to identify timespans where similar behavior is expected. This is used as a basis for aggregating data within comparable periods, consisting e.g. of business days. The aggregation is discretized into time slices that are small enough to capture a gradient, yet not so small that we do not have enough data per time slot. The initial discretization scheme is based on an educated guess. Nevertheless, an optimal discretization provided the data can be determined e.g. using simulated annealing or a Markov Chain Monte Carlo procedure.
In the case presented here, we segmented the week into weekdays and weekends. Each segment is then broken down into an hourly statistic for our observable. In each such tuple of (type of day, hour) we have a number of measurements from previous weeks. The distribution within these bins can then be used to identify rare events. With our data, the distributions have a dominant mode and we hence use a normal distribution as a model to identify rare events. This can be achieved efficiently by calculating a running mean and standard deviation for each tuple. If a new data point is considered a potential anomaly if it lies outside of the range of a specified multiple of the standard deviation within its bin. The periodicity is captured by the segmentation and discretization. Global trends can be captured by restricting the memory in the calculation of the running statistics. Popular choices are exponentially smoothed statistics, such that the value of an older data point enters the aggregated statistic with a lesser weight.
We tested this simple anomaly detection algorithm against the data provided in the open source Twitter library and the exact same points were classified as anomalies ( abs(value – mean) >= 2.5 * standard deviation, see Figure 1). In the twitter example, the data are binned in minutes and every weekday is assumed to have the same pattern. We modified our algorithm to work with these assumptions.
The question remains valid why not to use the library provided by Twitter. And we are using it – as one of our benchmarks. Nevertheless, we want our solution to work in a big data environment, and libraries in GNU R do not scale well. Hence, we used Apache Spark/Spark streaming to implement our solution. Furthermore, Twitters approach combines several algorithms that introduce a high degree of complexity, such as fitting a polygon to the data to model periodicity. Each of these procedures introduces potential sources of errors and makes the result more difficult to interpret. So, as long as we get no advantage from the more complex algorithm, we will stick to the simple, scalable solution.
Next, we plan to use methods from unsupervised learning to identify anomalous combinations of a set of observables. A problem in the operation of machinery or a fleet may result in an odd combination of signals. By using unsupervised learning techniques, we can identify vectors of observables that are outside of the boundary of clusters identified in normal behavior.
It is our intention to make the resulting software available under an open source license, so stay tuned.