In some of my previous posts, I talked about anomaly detection — ways of finding the odd one out, the unexpected little thing that’s rare, mostly undefined and often unwanted.
What if you were looking for a specific, rare thing? What if you wanted to find the nugget of gold in a pile of river mud, the needle in a haystack?
In a situation where we have lots of examples for one or more cases, and few examples of others, we call the dataset “imbalanced”.
For machine learning, these datasets pose a particular challenge. It is hard to learn from rare examples, especially considering that machine learning thrives on repeated exposure to large quantities of data.
Now, even with imbalanced datasets your ML model can still be successful and extremely precise. If the distinction between the common examples and the rare types is strong enough (think one is a Zebra while the other is a garden hose), the right model will be able to distinguish the different types of data very well. But often, the data is noisy or the common and rare cases are so similar that they are hard to distinguish even for human experts.
Medical or financial applications come to mind: We want to be able to diagnose rare diseases with the same certainty as common ones and businesses want certainty before they accuse their customers of illicit dealings.
So how imbalanced is “imbalanced”? While much of that depends on the dataset itself, as a rule of thumb any class ratio greater than 1:10 can be considered imbalanced.
But the difference in the number of examples is not really the clincher. It’s how well your model can deal with it. And that’s where imbalance can trip you up:
Accuracy is a very common metric for machine learning models. It’s a single number that tells us how many of its predictions the model got “right” — and that sounds like exactly what you want. But for imbalanced data, it can be very deceiving. If you wanted to build a “vintage-car-detector” and only 1 in 1000 cars is a vintage model, then a model that outputs “modern car” by default, without any learning at all, would be 99.9% accurate, but useless. Not only would you miss out on all the vintage cars; because their number is limited, every misclassified vintage car could be lost irrevocably.
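To see this accuracy paradox in code, here is a minimal sketch of the vintage-car scenario. The labels are made up to match the 1-in-1000 ratio from the text:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels: 1 vintage car per 1000, as in the example above.
y_true = np.array([1] + [0] * 999)   # 1 = vintage, 0 = modern
y_pred = np.zeros_like(y_true)       # "model" always says "modern car"

# 99.9% accurate, yet it finds zero vintage cars.
print(accuracy_score(y_true, y_pred))  # 0.999
```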
Another way that a badly calibrated model could result in a bad business outcome is misclassifying a modern car as vintage and upon attempting to sell it causing the customer to accuse you of fraud. This can be costly in both money for possible lawsuits as well as immeasurable damage to your reputation.
What we need is a way to express the needs of the business in terms of getting the answer right or wrong.
If this feels daunting or just annoying because it looks mathy, you can have a look at this example of a hypothetical use case that uses the metrics I’m introducing in a practical setting: Data Science in Business — Precision and Recall.
The easiest way to show how confused a model is about its results is called a “confusion matrix”. For a binary/yes-no case, it’s a simple 2×2 grid comparing the real true (let’s call it 1) and false (let’s call this one 0) values against the true and false values given by the model. That results in four possibilities:
- The true value is 1, the model says it’s 1.
- The true value is 0, the model says it’s 0.
- The true value is 1, the model says it’s 0.
- The true value is 0, the model says it’s 1.
The technical terms for these cases are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). The 2×2 grid has one field for each of these:
Now we can distinguish the business critical cases of missing an event or falsely predicting one.
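As a sketch of how these four values come out of a model in practice, scikit-learn’s `confusion_matrix` returns exactly this grid. The toy labels below are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 0, 1, 0, 0, 1]  # real values
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]  # model output

# sklearn orders the grid by label: rows are true classes,
# columns are predicted classes, so ravel() gives TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```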
We can take these values and their ratios to create new metrics that give more detailed information about the model’s performance, especially when the number of cases per class is not equal. The two most important metrics in this regard are called “Recall” and “Precision”.

Recall is also called the True Positive Rate or just Hit Rate, because it represents the fraction of positive target values that have been identified correctly. Precision, often also called the Positive Predictive Value, is the fraction of the model’s proposed positive values that are actually positive, which makes it an indicator of a model’s reliability.

While they might sound very similar, they describe very different behaviours of the model. Imagine our example from before, the vintage-classifier, in reverse: if it classified every car as vintage, it would have perfect recall, but its precision would be horrible. And it would get worse the more data points you add!
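The reversed vintage-classifier fits in a few lines. With made-up labels at a 1-in-10 ratio, flagging everything as vintage gives perfect recall and dismal precision:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # one vintage car in ten
y_pred = [1] * 10                        # reversed classifier: everything is "vintage"

# Recall = TP / (TP + FN): no vintage car is missed.
print(recall_score(y_true, y_pred))     # 1.0
# Precision = TP / (TP + FP): 9 of 10 "vintage" calls are wrong.
print(precision_score(y_true, y_pred))  # 0.1
```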
These two metrics are much better suited to capture a model’s performance in relation to the priorities of the business. If you were evaluating vital machine parts for possible faults, it might be more acceptable to discard a few false positives than to risk machine failure. In other investigations, it might be more important to reduce the number of interventions due to false classifications, be it to minimise cost or customer inconvenience.
In the former case, we would tune the model to prioritise recall; in the latter, we would prioritise precision. For the examples we look at below, we will compare precision and recall against accuracy.
There is a plethora of other combinations that give further insight into a model’s behaviour, of course. If you want to dive deeper into this, a good start is the wikipedia page on the confusion matrix.
I’ve taken a dataset from Kaggle, Daniel Perico’s Earthquake data, and added a column with a boolean for whether the event is an earthquake or not (it replaced the “type” column that has various other possibilities, such as quarry explosions or sonic booms). About 3% of the dataset are non-earthquake events. If this was a predictive model, we could imagine that we would have to decide between options such as expensive evacuations in case of false alarms and injury or loss of life in case of missed events.
I dropped any columns that had a time component or a unique identifier of the event, scaled the continuous data and converted the categorical columns to one-hot encoded features. All experiments except where specified were run using a default Random Forest algorithm, because it is popular, easy to use and has cross validation “built in” due to its bootstrapping functionality.
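As a rough sketch of that preprocessing and baseline, here is one way to wire it up with scikit-learn. The column names are only assumed to resemble the Kaggle schema and are not guaranteed to match it exactly:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column names, not the exact earthquake dataset schema.
numeric_cols = ["latitude", "longitude", "depth", "mag"]
categorical_cols = ["magType", "status"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),                          # scale continuous data
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot encode categories
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),  # default Random Forest baseline
])
# model.fit(X_train, y_train) once the frame is split into features and target
```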
Most of the examples here are using the imbalanced-learn library which has a collection of utilities to optimise imbalanced datasets for machine learning.
The jupyter notebooks with the examples discussed here (and many more) can be found on github.
There are various ways to address the imbalance in a dataset that differ in the amount of change made to the dataset. The simplest way to address imbalance is of course to do nothing, and this will form our baseline results to compare against other measures.
The next least destructive measure is to salvage incomplete records by training a model on the complete part of the dataset to “guess” missing values and fill them in. This option is called imputation. (This is, of course, not limited to imbalanced data. However, I feel it is important to include it as a possible comparison point for what effect “more data” would have on the results.)
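A minimal imputation sketch using scikit-learn’s `IterativeImputer`, which models each feature from the others to fill in the gaps. The toy matrix here is invented for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

# Tiny feature matrix with gaps; the imputer fits a regressor per column
# to predict missing entries from the observed ones.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)  # same shape, no missing values left
```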
The next degree of change to the training data is undersampling. This refers to using fewer samples of the majority class than are available in the dataset to achieve a better ratio between the classes. The options range from randomly discarding majority-class samples to excluding specific data points chosen to sharpen the distinction between the classes. The latter is generally referred to as cleaning undersampling.
The opposite of undersampling is oversampling. In the simplest form of oversampling, the available data points of the minority class get multiplied in a copy-paste fashion to balance out the class ratio. In more elaborate methods, an algorithm determines specific characteristics of the minority class to generate new, similar but not equal data points to bolster the minority class. Examples of these types of algorithms include SMOTE or AdaSYN. The field of synthetic data is developing all the time, so it is worth looking for the latest findings and experiments — a competition to find the best oversampler is currently underway as part of the smote-variants package!
A significant difference between imputation and resampling methods is that imputation is, in principle, an attempt at getting more data overall. While the imputation could be limited to fix only minority cases, it can also serve to give more majority examples. There can be very little control over the ratio of minority to majority cases with this method, and it can never produce more minority cases than were part of the dataset to begin with.
The resampling methods let you control and aim for a specific ratio of majority to minority cases, and depending on the modelling algorithm used, the ratio might not need to be 1:1 to produce a good model.
Of course there are ways to mix and match between these options as well. However, keep in mind that these steps are not independent — trying to e.g. create synthetic data on top of imputed values requires these values to be highly accurate, else the synthetic data will propagate the errors made by the imputation.
The table below shows results for some of the tests run with the simplified earthquake data:
These results show some important features of tools that can be used to address imbalanced data. The first thing to notice is that the only metric that could be significantly increased here is the recall.
This makes sense, since capturing a rare case in general is easier than distinguishing it correctly from another case. That is why all increases in recall come at a cost in precision. Still, this could be a boon: a detailed look at the cases the model keeps getting wrong could give some clues for e.g. feature engineering to improve the classifier. I won’t go further into this topic because feature engineering is a general practice in machine learning and not limited to imbalanced data.
Some of the treatments were quite fast, while others required a lot of computation or were not compatible with the dataset altogether. It depends on the business case what balance is required, how much data is available and how much computing power and time are allocated for model development.
(NB: There is a combined measure, called the F1-score, that evaluates precision and recall together and allows for weighting the importance of one against the other.)
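A short sketch of the F1-score and its weighted sibling, the F-beta score, with toy labels invented for illustration:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # precision 2/3, recall 1/2

# F1 weighs precision and recall equally; beta > 1 favours recall,
# beta < 1 favours precision.
print(f1_score(y_true, y_pred))                 # 4/7 ≈ 0.571
print(fbeta_score(y_true, y_pred, beta=2.0))    # recall-heavy, lower here
print(fbeta_score(y_true, y_pred, beta=0.5))    # precision-heavy, higher here
```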
So far, the interventions we’ve looked at have dealt with the imbalance at a preprocessing level. Couldn’t we just tell the model that these minority cases are more important? Expressing the priority of classes for the modelling algorithm is generally called cost-sensitive modelling and can be done in a variety of ways. Many algorithms, such as Random Forest, have built-in support for imbalanced data in that way. The classes receive an associated weight to penalise the algorithm more strongly for missing more important cases.
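As a sketch of that built-in support, scikit-learn’s Random Forest accepts a `class_weight` parameter. The synthetic data below stands in for a real imbalanced set; how much the weighting moves recall varies by dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" scales each class's weight inversely to its
# frequency, so missing a rare positive costs the forest far more.
plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, weighted.predict(X_te)))
```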
If a desired technique is not built into the modelling algorithm, many implementations provide hooks to add it by hand. A custom loss function can make an algorithm cost-sensitive, or a measure of the difference between classes can be implemented as a new decision criterion. One such example is the Hellinger distance criterion, which can replace the built-in split criteria in Random Forest.
The imbalanced-learn library contains a few standard algorithms that have been enhanced through built-in balancing measures.
Getting it straight
The most important aspect of any method to address your imbalance is the same as in all business propositions — how do we measure success?
We have seen that the methods to address data imbalance can change the dataset substantially, so you have to make sure that your validation set is untouched by those methods.
Algorithms with built-in methods to address imbalanced training data make this very easy, of course. Here, you can just keep a validation set as usual.
However, with oversampling methods such as AdaSYN or SMOTE, you are creating new data. On the one hand, you want to give these algorithms the whole dataset as a basis to produce the best results. On the other hand, you need to make sure that you can still separate out a section of unmodified data points to create your validation set.
If you don’t, your validation set won’t represent naturally occurring data. It’ll represent the natural data plus the interpretation of the imbalanced data algorithm. Depending on your situation, that algorithm might have used a lot more compute and data than your model will have, but it’s still an approximation of the world and any misconceptions it has will translate into the data it generates.
This can be annoying, because many oversampling algorithms have no easy way to separate out the synthetic data from the original data. It’s tempting to manipulate the entire dataset and split the complete result into train, test and validation data, but that is misleading and dangerous. Only when your model is tested on what made your dataset so challenging in the first place are you getting an impression of its real performance.
So make sure that you always test your model on a validation dataset that is purely natural data, most importantly, that the ratio between classes is similar to what you would expect your model to have to deal with when in production.
We haven’t discussed time series here, but this is especially true for them. Even if you are training a simple classifier, make sure that you are validating your model on more recent data than your training data. This ensures that your model is tested on data that most closely resembles the real world data it will work on eventually. One reason why this is important is concept drift — a change in the underlying patterns in your data that your model won’t recognise because it is trained only on data that doesn’t contain the new patterns. To find out more about this, read this blog post by my colleague Patrick Robotham:
Keeping it level
To sum up, imbalanced data can be a tough challenge for a machine learning project. And while there are many options to address the issue, there is no one-size-fits-all solution to the problem.
One of the reasons for that is that the most important aspect to identify when dealing with imbalanced datasets is where the priorities of the underlying business questions are:
- Can you deal with a few more falsely classified items as long as you get most of your targets? Put the emphasis on recall.
- Would misidentifying your subjects be more harmful than missing a few targets? Focus on precision.
Depending on the makeup of your dataset, the restrictions on your model and the requirements of your business case, different approaches can give you a different balance between the two.
I have illustrated a variety of algorithmic approaches with a special focus on a hierarchical view of manipulations on the dataset, ordered by the changes made to the original dataset to obtain a training set. That is, of course, only one way to view these techniques. To get a deeper understanding, I invite you to download the repository of examples from github which includes details and options not discussed here.
If you have any further questions or would like to discuss a specific business problem, get in touch with us at Eliiza.
We have experience with imbalanced data across a variety of industries and applications from anomaly detection to sentiment analysis or computer vision and we’d love to help.
An overview of classification algorithms for imbalanced datasets, V. Ganganwar 2012
Oversampling for imbalanced learning based on K-means and SMOTE, F. Last et al., 2017
Learning from imbalanced data, H. He & E. A. Garcia, 2009
Data Science and Business — Precision and Recall (a hypothetical case study), Eliiza AI, Medium
Confusion Matrix, Wikipedia