Techniques to handle imbalanced data in machine learning

Understanding Imbalanced Data

The classical data imbalance problem is recognized as one of the major problems in data mining and machine learning, since most machine learning algorithms assume that data is evenly distributed. With imbalanced data, the majority classes dominate the minority classes, biasing machine learning classifiers towards the majority and causing poor classification of minority classes; classifiers may even predict all the test data as the majority class. Real-world examples include oil-spill detection, network intrusion detection, fraud detection, and rare diseases[1][2][3].

Take the example of rare diseases: a machine learning model may suffer from the accuracy paradox, which makes it difficult to control false positives (Type I errors) and false negatives (Type II errors). A patient may suffer from a rare disease, but the model will not predict so, since the majority of the data comes from patients without the disease. In fraud detection, the goal is to identify whether a transaction is fraudulent. Because most transactions are not fraudulent, the model tends to predict fraudulent transactions as valid.
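The accuracy paradox is easy to reproduce. In this toy sketch (the prevalence and sample size are made up for illustration), a model that always predicts "no disease" on data with 1% disease prevalence reaches 99% accuracy while detecting no sick patients at all:

labels <- c(rep(0, 990), rep(1, 10))  # 1% of patients have the disease
preds  <- rep(0, 1000)                # always predict "no disease"

accuracy <- mean(preds == labels)     # 0.99: looks excellent
# recall on the disease class: every sick patient is missed
recall <- sum(preds[labels == 1] == 1) / sum(labels == 1)  # 0

This is why accuracy alone is a misleading metric on imbalanced data.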

To overcome these challenges, several approaches have been developed that can be applied during the pre-processing stage. One commonly used strategy is resampling, which includes undersampling and oversampling techniques. If one balances the dataset by removing instances from the overrepresented class, it is called undersampling; oversampling balances the skewed class ratio by adding similar instances of the underrepresented class. Resampling can be done with or without replacement. These two approaches are depicted in the image below and explained in detail in the following sections.



Coding example

The code below gives a simple example of how majority samples dominate minority samples, causing more false positive predictions. In this example, I used a Naïve Bayes model to classify the data. Although the accuracy seems good, you can see from the precision that the model predicts mainly majority classes, and the confusion matrix shows many false positives for the minority class.

#Abalone (Imbalanced: 19) data set

#An imbalanced version of the Abalone data set,
#where the positive examples belong to class 19 and the negative examples belong to the rest.
# R Version 3.5.0
#2: Type.                 Imbalanced
#3: Origin.               Real world
#4: Instances.            4174
#5: Features.             8
#6: Classes.              2
#7: Missing values.       No
#8: IR:                   128.87

# @attribute Sex {M, F, I}
# @attribute Length real [0.075, 0.815]
# @attribute Diameter real [0.055, 0.65]
# @attribute Height real [0.0, 1.13]
# @attribute Whole_weight real [0.0020, 2.8255]
# @attribute Shucked_weight real [0.0010, 1.488]
# @attribute Viscera_weight real [5.0E-4, 0.76]
# @attribute Shell_weight real [0.0015, 1.005]
# @attribute Class {positive, negative}
# @inputs Sex, Length, Diameter, Height, Whole_weight, Shucked_weight, Viscera_weight, Shell_weight
# @outputs Class
# @data
# load the required libraries (install them first if missing):
# caret for data partitioning, e1071 for naiveBayes,
# dplyr for data manipulation, MLmetrics for the evaluation metrics
library(caret)
library(e1071)
library(dplyr)
library(MLmetrics)

temp <- tempfile()
# download the KEEL abalone19 zip archive into temp before reading, e.g.
# download.file(<URL of the KEEL abalone19 zip>, temp)
data <- read.table(unz(temp, "abalone19.dat"),
                   sep = ",", comment.char = "@", strip.white = TRUE)
colnames(data) <- c("Sex", "Length", "Diameter", "Height", "Whole_weight",
                    "Shucked_weight", "Viscera_weight", "Shell_weight", "Class")

# recode a categorical column as integer levels starting from zero
factorFromZero <- function(dataframe, parameter) {
  dataframe[, parameter] <- as.integer(as.factor(dataframe[, parameter])) - 1
  return(dataframe)
}

# converting strings into factors
data <- factorFromZero(data, "Sex")
data$Sex <- as.factor(data$Sex)
# converting the categorical column into N-dimensional binary features
data <- cbind(model.matrix(~ Sex + 0, data = data), data)
data <- data %>% select(-Sex)
data$Class <- ifelse(data$Class == 'negative', 1, 0)

# class distribution
data_class_distribution <- data %>% group_by(Class) %>% summarize(class_count = n())

# 80/20 train/test split
index <- createDataPartition(data$Class, p = 0.8, list = FALSE)
train <- data[index, ]
test  <- data[-index, ]

train_features <- train[, -ncol(train)]
train_labels <- train[, ncol(train)]
test_features <- test[, -ncol(test)]
test_labels <- test[, ncol(test)]

# train the model
model <- naiveBayes(x = train_features, y = as.factor(train_labels))

# predicting the labels of the unseen data
predicted_labels <- predict(model, newdata = test_features)
Accuracy(predicted_labels, test_labels)

# generating the confusion matrix to identify false positive and false negative output
conf_matrix <- table(predicted_labels, test_labels)
# this shows how the majority samples dominate the minority samples
(precision <- diag(conf_matrix) / rowSums(conf_matrix))
# AUC score
AUC(as.numeric(as.character(predicted_labels)), test_labels)
# F1 score
F1_Score(test_labels, predicted_labels, positive = NULL)

> Accuracy(predicted_labels, test_labels)
[1] 0.8189448

> conf_matrix
predicted_labels   0   1
               0   3 146
               1   5 680
> (precision <- diag(conf_matrix) / rowSums(conf_matrix))
         0          1
0.02013423 0.99270073
> AUC(as.numeric(as.character(predicted_labels)), test_labels)
[1] 0.5991223
> F1_Score(test_labels, predicted_labels, positive = NULL)
[1] 0.03821656


Resampling

Resampling is the process of reconstructing data samples from the actual data set by either non-statistical or statistical estimation. In non-statistical estimation, we randomly draw samples from the actual population, hoping that their distribution is similar to that of the actual population. Statistical estimation, however, involves estimating the parameters of the actual population and then drawing subsamples, so that the extracted samples carry most of the information in the actual population. These resampling techniques help us draw samples when the data is highly imbalanced.
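The non-statistical case is simply random draws from the observed data, with or without replacement. A toy sketch (the population here is simulated for illustration):

set.seed(10)
population <- rnorm(1000, mean = 5, sd = 2)

# resample without replacement: each observation is used at most once
sub_no_repl <- sample(population, 200, replace = FALSE)
# resample with replacement: bootstrap-style draws, duplicates allowed
sub_repl    <- sample(population, 200, replace = TRUE)

# both subsamples should roughly preserve the population mean
c(mean(population), mean(sub_no_repl), mean(sub_repl))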




Undersampling

Random undersampling is a method in which we randomly select a subset of samples from the majority class and discard the rest. It is a naïve approach because it assumes that any random sample accurately reflects the distribution of the data. This classical method aims to balance class distributions through the random elimination of majority class examples, but it risks discarding potentially useful data that could be important for classifiers.
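Random undersampling is a few lines of base R. A minimal sketch on a made-up data frame with a 95:5 class ratio:

set.seed(42)
# toy imbalanced data frame: 95 majority rows vs 5 minority rows
df <- data.frame(x = rnorm(100),
                 Class = c(rep("negative", 95), rep("positive", 5)))

majority <- df[df$Class == "negative", ]
minority <- df[df$Class == "positive", ]

# keep a random subset of the majority class, sized to match the minority
undersampled <- rbind(majority[sample(nrow(majority), nrow(minority)), ],
                      minority)
table(undersampled$Class)  # 5 negative, 5 positive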

The most commonly used approaches are based on k-nearest neighbors (k-NN). These approaches select a sample set, exhaustively search the entire dataset for the k nearest neighbors, and discard the rest of the data, under the assumption that the k-NN carry all the information we need about those classes[7][8].

Several other undersampling techniques are based on two different noise model hypotheses. In one noise model, samples near the decision boundary are assumed to be noise and are discarded in order to obtain maximum accuracy [7].

In another noise model, it is assumed that majority class samples concentrated in the same location as minority class samples are noise. Discarding these samples from the data creates a clear boundary that can assist in classification [7][8].
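A minimal sketch of this second noise model in base R (an edited-nearest-neighbours-style rule on made-up 2-D data): a majority sample whose k nearest neighbours are mostly minority points is treated as noise inside the minority region and removed.

set.seed(1)
# hypothetical 2-D data: 30 majority points, 6 overlapping minority points
X <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(12, mean = 0.5), ncol = 2))
y <- c(rep("majority", 30), rep("minority", 6))

k <- 3
D <- as.matrix(dist(X))
diag(D) <- Inf  # a point is not its own neighbour

# drop majority samples whose k nearest neighbours are mostly minority
keep <- vapply(seq_along(y), function(i) {
  if (y[i] == "minority") return(TRUE)   # minority samples are always kept
  nn <- order(D[i, ])[seq_len(k)]
  mean(y[nn] == "minority") < 0.5
}, logical(1))

X_clean <- X[keep, , drop = FALSE]
y_clean <- y[keep]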




Oversampling

While undersampling aims to achieve an equal distribution by eliminating majority class samples, oversampling does so by replicating minority samples until the distribution is balanced. Naïve oversampling has a few shortcomings, however. It increases the probability of overfitting, as it makes exact replicas of the minority samples rather than sampling from the distribution of the minority class. Another problem is that as the number of samples increases, the complexity of the model increases, which in turn increases the running time of the models[9][10].
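Naïve (random) oversampling is the mirror image of the undersampling sketch above, again on a made-up 95:5 data frame:

set.seed(42)
# toy imbalanced data frame: 95 majority rows vs 5 minority rows
df <- data.frame(x = rnorm(100),
                 Class = c(rep("negative", 95), rep("positive", 5)))

majority <- df[df$Class == "negative", ]
minority <- df[df$Class == "positive", ]

# replicate minority rows (sampling with replacement) up to the majority size
oversampled <- rbind(majority,
                     minority[sample(nrow(minority), nrow(majority),
                                     replace = TRUE), ])
table(oversampled$Class)  # 95 negative, 95 positive

Note that the replicated rows are exact copies, which is precisely the overfitting risk described above.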

One commonly used oversampling method that helps overcome these issues is SMOTE. It creates new samples by interpolating between a minority point and its nearest neighbors. SMOTE computes these interpolations for minority samples near the decision boundary, and the generated samples push the decision boundary further away from the majority classes, which helps avoid the overfitting problem[9][10].

Synthetic samples

Synthetic samples are artificially generated from the original data when it is scarce, and they mirror the distribution of the original sample. The most commonly used algorithms for generating synthetic data are SMOTE [14] and ADASYN [15]. SMOTE generates synthetic data from the minority samples only, as described in the section above. ADASYN uses a weighted distribution of the minority samples that are not well separated from the majority samples; in this way, it reduces the bias in the minority samples and helps shift the decision boundary towards the minority samples that are hard to classify.
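A minimal base-R sketch of the SMOTE interpolation step (not the full algorithm; it assumes numeric features only, and the minority matrix below is made up): for each synthetic point, pick a random minority sample, pick one of its k nearest minority neighbours, and place the new point at a random position on the line segment between them.

set.seed(7)
# hypothetical minority-class feature matrix (10 samples, 2 numeric features)
minority <- matrix(rnorm(10 * 2), ncol = 2)

smote_samples <- function(X, k = 3, n_new = 20) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf                             # exclude self-distances
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(X))
  for (s in seq_len(n_new)) {
    i   <- sample(nrow(X), 1)                # a random minority point
    nn  <- order(D[i, ])[seq_len(k)]         # its k nearest minority neighbours
    j   <- nn[sample(k, 1)]                  # one neighbour at random
    gap <- runif(1)                          # interpolation factor in [0, 1]
    synthetic[s, ] <- X[i, ] + gap * (X[j, ] - X[i, ])
  }
  synthetic
}

new_points <- smote_samples(minority, k = 3, n_new = 20)

Because each synthetic point is a convex combination of two minority points, it always lies inside the region spanned by the minority class rather than being an exact copy.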

Feature selection

To tackle the imbalance problem with feature selection, we compute either a one-sided metric such as the correlation coefficient (CC) or odds ratio (OR), or a two-sided metric such as information gain (IG) or chi-square (CHI), on both the positive and negative classes. Based on the scores, we identify the significant features for each class and take the union of these per-class features to obtain the final feature set, which we then use to train the classifier.
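A hedged sketch of the per-class selection and union step, using the correlation coefficient as the one-sided metric (the data, feature names, and top-k threshold below are all made up for illustration):

set.seed(3)
# hypothetical data: 6 features, binary class (1 = positive, 0 = negative)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(X) <- paste0("f", 1:6)
y <- rbinom(n, 1, 0.2)
X$f1 <- X$f1 + 2 * y  # f1 is strongly associated with the positive class
X$f2 <- X$f2 - 2 * y  # f2 is strongly associated with the negative class

# one-sided metric: correlation of each feature with the class label
cc <- sapply(X, function(f) cor(f, y))

top_k <- 2
pos_features <- names(sort(cc, decreasing = TRUE))[1:top_k]  # positive class
neg_features <- names(sort(cc))[1:top_k]                     # negative class

# final feature set: union of the per-class selections
selected <- union(pos_features, neg_features)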

Identifying these features helps generate a clear decision boundary with respect to each class, which helps the models classify the data more accurately. In [13], the authors use the features of the negative class to discard all documents highly associated with those features; this acts as a form of intelligent subsampling and potentially helps reduce the imbalance problem.

This technique is mainly used in the text classification and web categorization domains [11][12], which deal with very large numbers of features. The problem is that with a small number of features, taking the union of the significant positive-class and negative-class features may give back the majority of the features. In that case feature selection achieves little, since we end up with almost all the features and the imbalance problem remains unaddressed.


Imbalanced data is one of the important problems in data mining and machine learning, and it can be approached by properly analyzing the data. Approaches that tackle the problem at the data level include undersampling, oversampling, and feature selection. Moving forward, there is still a lot of research needed to handle the data imbalance problem more efficiently.