In the first part of this primer, we explored what it means for data to be anomalous and how difficult it can be to categorise both anomalies and normal states of behaviour.

The data we collect can fluctuate periodically and the state of a system can grow and develop over time, all the while staying “normal”. Trends and fluctuations that would be considered normal in one environment can be anomalous in another.

In this part of the primer, we take a look at how different machine learning techniques address these issues and how the shape and makeup of the data to be analysed guides the choice of the algorithm to be used.

**There’s more than one way to do it**

Before we get into specific algorithms, we need to cover some foundations. For all machine learning approaches, there are the three base classes of training:

- Supervised – completely labelled training data
- Semi-supervised – some labelled, mostly unlabelled data
- Unsupervised – no labelled data

The following graphic shows the training classes in detail:

*Three approaches to training – credit: Goldstein et. al. 2016*

Supervised training requires a complete set of data classified by cases and categories. This allows for a clear distinction between cases but can make adaptation and maintenance of the detector difficult. Semi-supervised approaches usually have a small amount of labelled data and a much larger pool of unlabelled data. These approaches become useful when labelling the data is expensive or otherwise difficult. Unsupervised learning relies on the algorithm to find an internal structure in the dataset without any prior classification. Which approach to choose depends on the specific task and data available. While a supervised training approach appears more desirable for ease of diagnosis, unsupervised algorithms can more easily adapt to and incorporate new sources.

**The right tool for the job**

There are many techniques and algorithms that have been employed in anomaly detection. Each approach has certain base assumptions about the data used to evaluate the system state that is key to choosing the right technique. These methods to evaluate the system state fall into five main categories – classification, distance-based techniques, statistical methods, information theoretical approaches and spectral decomposition techniques.

The following overview describes the different approaches, gives examples for applications, and outlines what criteria determine the suitability of the algorithm for the data at hand.

**1. Classification**

*One-class anomaly detection. The colour does not matter if the ball can be detected as different from all others. Credit: Gyaani Guy on Blogspot*

An example of the application of a classification based anomaly detection method is the use of host-based intrusion detection systems. A neural network learns the normal sequences of system calls and states and flags unusual sequences to a human supervisor who can investigate the anomaly and intervene if necessary.

Classification requires labelled data, it falls into the supervised training class as described above. To train the neural network in the example above, abnormal system states can be part of the training data as historical events or simulated intrusions.

The simplest way to use a classification method for anomaly detection is called one-class anomaly detection. The system is either in a normal state, or it is not. A blue ball in a sea of otherwise red balls does not need to be recognised as “blue”, nor to the other elements need to be classified as “red”. Particular colours do not play a role, only the redness or non-redness of the balls. The blue ball is just outside the normal, expected system state. Further classifications can be made distinguishing various normal states or exceptional states, but these classifications depend on the available training data and method or algorithm used to build the detection system. Algorithms typically used for these are multilayer perceptron neural networks, support vector machines, Naive Bayes algorithms or a set of rules that describe the desired normal system state such that any deviation can be used to raise an alert.

**2. Distance-based**

*Credit: Improved Outcomes*

Distance-based algorithms fall into two categories, the nearest neighbour type algorithms and the clustering based algorithms. The nearest neighbour algorithm focuses on the data points in relation to each other while the clustering algorithm tests for central points and structures the data according to the location of these points.

Both types are unsupervised, they determine the structure of the data on their own without preexisting labels. The algorithms require a specific measure to find the distances between the data points. Here, the distance can mean more than a physical dimension: Penne, tagliatelle and fettuccine are close to each other in terms of information while ice cream is far away from all of them. In the case of fixed categories like pasta and desserts, a grid distance like the Manhattan-distance is a useful measure. For other types of data, distance measures such as the Euclidean (geometrical distance, such as between two points on a sheet of paper), Hamming (for example the number of non-equal letters in words of equal length) or Mahalanobis (a multi-dimensional version of distance in multiples of the standard deviation of a distribution, or how unlikely a data point is in a distribution) distance could be the best choice. The right way to measure these distances is part of what needs to be extracted from the data before the training stage to get the best output from the algorithm.

**2.1 Nearest Neighbour**

An industrial application of nearest neighbour based anomaly detection is monitoring the financial status of companies to prevent financial distress. The system monitors the financial status of a company with respect to the normal transactions and behaviour in its field and can flag a warning to the company leadership if it detects a trajectory that could lead to significant financial consequences.

The nearest neighbour method is based on the assumption that normal data points exist in dense neighbourhoods of regular observations. Anomalies occur outside that neighbourhood or are points or patterns that do not fit the neighbourhood’s pattern, such as its density, the number of data points per area covered by the cluster. The neighbourhood itself is only defined by its members, not by any particular point in the scale of the data. In the example above, the accounting data is made up of dollar values. Looking at positive and negative values as forwards and backwards directions, the account balances become like distances on a map. This is an example of how the Euclidean distance could be employed when training a distance-based algorithm.

**2.2 Clustering**

An example of a clustering based anomaly detection application is the ADMIT network intrusion detection system. It profiles the users of the network by their activity and if a new cluster of activity, i.e. a new, unusual user is detected, it flags this cluster as an anomalous user or intruder to the system administrator. A useful distance measure here could be the Manhattan distance for categories like the protocols or commands used by the users in the network. Another interesting current example is the use of clustering algorithms by researchers in the SETI (Search for ExtraTerrestrial Intelligence) group to identify unusual behaviour in the lightcurves of stars.

Clustering methods find clusters of similar data and employ various ways of determining the centre of the cluster the individual observations belong to. An anomaly detected by a clustering algorithm would be determined by its distance to the central point of all clusters. An observation that is far from the centres of all clusters is anomalous.

A common clustering algorithm used for anomaly detection is DBSCAN.

In comparison, one could say that nearest neighbour algorithms track data like a herder, finding the stray depending on the current location of the flock. Clustering algorithms find settlements of data, the fixed locations around which the data gathers. Another way of looking at it is a frame of reference, similar to the idea used in physics: The reference frame in the nearest neighbour algorithm is a co-moving reference frame, it moves with the data. The reference frame in the clustering algorithm is fixed by the reference points for the data, the data moves relative to it.

**3. Statistical**

*Joint multivariate normal distribution – the individual factors of the sample set are normally distributed but have a different individual distribution shape due to a difference in variance. Credit: IkamusumeFan, Wikimedia Commons, CC3.0*

Statistical methods of anomaly detection are used e.g. to prevent credit card fraud. The purchases are assumed to follow a normal distribution of spending amount and location. For each transaction, a multivariate version of Grubb’s test for outliers can be applied to establish a probability for the transaction to be legitimate. Multiple transactions that spend large amounts or are in places an unlikely distance away from previous transactions are considered to have a low probability to occur. If the probability becomes smaller than a set threshold, the system flags the transaction to a supervisor for the account who can evaluate the transaction and decide to contact the customer.

For this to work, the data has to follow a discernible distribution. A fair, 6-sided die has an equal probability to land on any side. Counting multiple rolls, we would not expect any side to show up significantly more often than any other (if it did, *that* would be anomalous!). Such an equal distribution is called *uniform*. However, roll two dice and add the values on the faces and the resulting numbers are no longer uniformly distributed. The distribution of the sum of the dice is called the normal distribution and follows the famous bell curve shape. For the sum of two dice, a *uniform* distribution would be considered anomalous. Neither of these distributions is better than the other – they correspond to the different ways the numbers were produced by rolling the dice. Similarly, the appropriate distribution that describing the data has to be found before any modelling approach. If the distribution of a system’s states can be modelled with a defined probability for the “normal” states, anomalies can be identified as low-probability system states through statistical methods. Statistical approaches to anomaly detection can be broadly separated into one of two categories: Parametric and non-parametric. Parametric methods assume that the distribution of normal states can be described by a function consisting of the observation and a set of parameters, such as a Gaussian distribution described by its mean and variance. Another popular parametric method is regression analysis such as the AutoRegressive Integrated Moving Average (ARIMA) model for time series analysis. Non-parametric anomaly detection techniques include histograms -directly counting and binning the observations- and kernel functions such as the Parzen window or kernel density estimate where the population is modelled as a sum of normal distributions.

**4. Information theoretical**

*This is easy to describe exactly – a black dot at the centre of a white screen of a certain size.*

*This is less easy. The exact position of each black or white dot in an image of noise is hard to capture since it follows no obvious patterns.*

An example of the application of the information theoretical approach is the detection of insider trading as an anomaly in stock market data. The market data that shows unexpected changes in information content is flagged as questionable and can be investigated in detail to expose possible misconduct.

Using information theoretical methods to detect anomalies has the advantage of being largely data agnostic and focussing on the informational content of the data itself through concepts such as the Kolmogorov complexity of the data or the entropy of the dataset. The underlying assumption of this approach is the idea that the information content of the system can be determined and its least informative, or highly chaotic, states can be calculated and identified as anomalous. One example of this idea is the compressibility of a dataset. A well-ordered dataset has a low entropy. It can be described and easily reproduced. A disordered dataset has a high entropy and is hard to describe and reproduce exactly. In other words, a compression algorithm produces a smaller file for the image with the single dot than for the image with the noise.

While being very powerful in its universality, this approach can be very computationally expensive with exponential orders of complexity depending on the complexity of the data. Text can be compressed very easily, while video data or pure numbers are already very concentrated and dense in their information content and can make the calculations harder. This limits the use of this method to smaller datasets or applications that do not require a real-time response.

**5. Spectral**

*The spectral composition of a string’s harmonic response changes when it has a mass attached to the string. Credit: UNSW ‘How Harmonic are Harmonics?’*

A proposed application for anomaly detection by spectral decomposition is the analysis of footage taken by unmanned aerial drones in search and rescue operations. Due to its difference in structure, colour and texture to the environment, man-made materials would stand out against the composition of the background and the image could be flagged to an investigator to verify its contents and make rescue operations faster.

Spectral techniques rely on lower-dimensional representations of the data such as principal component analysis. Transformations into these representations make the difference between normal and anomalous system states stand out more explicitly than in the raw data, like a discordant frequency in a musical chord or a buzz in the sound of a musical instrument. The image above compares the normal spectrum of an open string with its spectrum when a small piece of tape has been wrapped around it as an additional mass. The changes in the spectrum are representative of the anomalous mass. While the spectrum changes in different ways depending on how the string is made to vibrate, the anomaly is clearly visible in the spectral analysis.

This example also shows that the spectral approach is most useful for data with high dimensionality (such as having many columns in a spreadsheet). A sound with a limited spectrum might not show the anomalies as well as a sound composed of multiple overtones.

Since spectral analysis techniques rely on a reduction of dimensionality, they can also be used as a preprocessing step for other anomaly detection algorithms to make the data easier to handle.

**Summary**

We have seen five main categories of algorithms used for anomaly detection. The following table shows their key attributes to provide a quick guide for the choice of algorithm:

Algorithm |
Training |
Size of dataset |
Pros |
Cons |

Classification | Supervised/unsupervised(one-class SVM) | Large | Trains fast, tests fast | More classes need a larger training set, sensitive to class imbalance, may only provide a label, not the probability of an anomaly |

Distance-based | Semi-supervised/Unsupervised | Any, as long as clusters are large enough | Very flexible with respect to data, can use semi-supervised assistance | Needs distance measure, may require extensive feature engineering, often computationally complex |

Statistical | Supervised/Unsupervised | Any | Gives statistically sound anomaly scores (confidence intervals), implementation can be simple | Choice of statistic can be hard, vulnerable to the assumption of distribution |

Information theoretical | Unsupervised | Small | Data agnostic | Requires information theoretical measure, anomaly scores can be hard to obtain/justify, can be very computationally complex |

Spectral | Unsupervised | Small/medium; High dimensional | Can be used as a preprocessing step for other algorithms | Usually high computational complexity, anomalies must be visible in lower dimensional representation |

In conclusion, anomaly detection is a challenge well suited for machine learning. As we have seen, the answers to those challenges with machine learning are as varied as the problems. In a world of increasing data-driven applications, from automated processing to the internet of things, anomaly detection becomes an invaluable tool to assist with the control and maintenance of modern technology. The unusual becomes valuable when seized upon to prevent harm or gain novel insights. While its current strength lies in identifying undesirable behaviour, anomaly detection may also become the tool to reveal the unexpected, the next stepping stone on the path of human progress.