Similarity Majority Under-Sampling Technique for Easing Imbalanced Classification Problem
Jinyan Li¹(✉), Simon Fong¹(✉), Shimin Hu¹, Raymond K. Wong², and Sabah Mohammed³

¹ Department of Computer and Information Science, University of Macau, Taipa, Macau SAR, China
{yb47432,ccfong,yb72021}@umac.mo
² School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
[email protected]
³ Department of Computer Science, Lakehead University, Thunder Bay, Canada
[email protected]
Abstract. The imbalanced classification problem is an active topic in the fields of data mining, machine learning and pattern recognition. The imbalanced distributions of the different class samples cause the classifier to be over-fitted by learning too many majority class samples and under-fitted in recognizing minority class samples. Prior methods attempt to ease the imbalanced problem through sampling techniques that re-assign and rebalance the distributions of the imbalanced dataset. In this paper, we propose a novel method that under-samples the majority class in order to adjust the original imbalanced class distributions. This method is called the Similarity Majority Under-sampling Technique (SMUTE). By calculating the similarity between each majority class sample and its surrounding minority class samples, SMUTE effectively separates the majority and minority class samples, increasing the recognition power for each class. The experimental results show that SMUTE outperforms current under-sampling methods when the same under-sampling rate is used.
Keywords: Imbalanced classification · Under-sampling · Similarity measure · SMUTE
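To make the idea in the abstract concrete, the following minimal sketch illustrates similarity-based majority under-sampling. It is not the authors' exact procedure: the choice of cosine similarity, the parameter k, the scoring by mean similarity to the k most similar minority neighbours, and the rule of keeping the least similar majority samples are all assumptions made for illustration only.

# Minimal sketch of a similarity-based majority under-sampler in the spirit
# of SMUTE as described in the abstract. Assumptions (not from the paper):
# cosine similarity, k nearest minority neighbours, and retention of the
# majority samples that are least similar to the minority class.
import numpy as np


def similarity_under_sample(X_maj, X_min, rate=0.5, k=5):
    """Return roughly rate * len(X_maj) majority samples."""
    # Cosine similarity between every majority and every minority sample.
    maj = X_maj / np.linalg.norm(X_maj, axis=1, keepdims=True)
    mino = X_min / np.linalg.norm(X_min, axis=1, keepdims=True)
    sim = maj @ mino.T                      # shape: (n_maj, n_min)

    # Score each majority sample by its mean similarity to its k most
    # similar minority neighbours.
    k = min(k, sim.shape[1])
    top_k = np.sort(sim, axis=1)[:, -k:]
    scores = top_k.mean(axis=1)

    # Keep the majority samples with the lowest scores, i.e. those farthest
    # from the minority region, to sharpen the class separation.
    n_keep = max(1, int(rate * len(X_maj)))
    keep_idx = np.argsort(scores)[:n_keep]
    return X_maj[keep_idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_majority = rng.normal(0.0, 1.0, size=(200, 2))
    X_minority = rng.normal(2.0, 0.5, size=(20, 2))
    X_reduced = similarity_under_sample(X_majority, X_minority, rate=0.3)
    print(X_reduced.shape)  # (60, 2): 30% of the majority class retained

The under-sampling rate plays the same role as in the experiments mentioned in the abstract: it fixes how many majority samples survive, so different under-sampling methods can be compared at the same rate.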
1 Introduction
Classification is a popular data mining task. A trained classifier is a classification model inferred from training data, which predicts the category of unknown samples. However, most current classifiers assume that the class distribution of the dataset is balanced. In practice, most datasets found in real life are imbalanced. This weakens the recognition power of the classifier with respect to the minority class, and may overfit the model with too many training samples from the majority class.
In essence, the imbalanced problem that degrades classification accuracy is rooted in the imbalanced dataset, where majority class samples outnumber those of the minority class.