Class imbalance problem by Dr Meenakshi (Data Mining)
Added: Jun 23, 2024
Slides: 27 pages
UNIT-5 CLASS IMBALANCE PROBLEM
What is the Class Imbalance Problem?
It is the problem in machine learning where the number of examples of one class (the positive class) is far smaller than the number of examples of another class (the negative class). This problem is extremely common in practice and can be observed in various disciplines, including fraud detection, anomaly detection, medical diagnosis, oil spillage detection and facial recognition.

ABOUT PROBLEM
Given a dataset of transaction data, we would like to find out which transactions are fraudulent and which are genuine. It is highly costly to an e-commerce company if a fraudulent transaction goes through, as this damages customers' trust and costs the company money. So we want to catch as many fraudulent transactions as possible. If a dataset consists of 10000 genuine and 10 fraudulent transactions, the classifier will tend to classify fraudulent transactions as genuine. The reason can be explained by the numbers. Suppose the machine learning algorithm has two possible outputs:
Model 1 classified 7 out of 10 fraudulent transactions as genuine and 10 out of 10000 genuine transactions as fraudulent.
Model 2 classified 2 out of 10 fraudulent transactions as genuine and 100 out of 10000 genuine transactions as fraudulent.
Counted by total mistakes, Model 1 (17 errors) looks better than Model 2 (102 errors), even though Model 1 misses 7 of the 10 fraudulent transactions.
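The pull toward the majority class can be checked with the numbers from this example. The following is a minimal sketch, not part of the original slides: the `accuracy` helper and the "always genuine" baseline are illustrative assumptions.

```python
# Counts taken from the example: 10000 genuine and 10 fraudulent transactions.
GENUINE = 10000
FRAUD = 10


def accuracy(missed_fraud, false_alarms):
    """Fraction of all transactions classified correctly (hypothetical helper)."""
    total = GENUINE + FRAUD
    errors = missed_fraud + false_alarms
    return (total - errors) / total


# Hypothetical baseline: label every transaction as genuine, catching no fraud.
baseline = accuracy(missed_fraud=10, false_alarms=0)
model_1 = accuracy(missed_fraud=7, false_alarms=10)
model_2 = accuracy(missed_fraud=2, false_alarms=100)

print(f"always genuine: {baseline:.4f}")
print(f"Model 1:        {model_1:.4f}")
print(f"Model 2:        {model_2:.4f}")
```

Under these counts the baseline that catches no fraud at all has the highest accuracy, which is why a classifier trained to minimise the number of mistakes drifts toward predicting "genuine".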
How do we decide which model is the better solution, M1 or M2?
To tell the machine learning algorithm (or the researcher) that Model 2 is better than Model 1, we need better metrics than just counting the number of mistakes made. For this, the concepts of True Positive, True Negative, False Positive and False Negative are introduced:
True Positive (TP) – An example that is positive and is correctly classified as positive
True Negative (TN) – An example that is negative and is correctly classified as negative
False Positive (FP) – An example that is negative but is wrongly classified as positive
False Negative (FN) – An example that is positive but is wrongly classified as negative
Based on these, we have the True Positive Rate, True Negative Rate, False Positive Rate and False Negative Rate:
True Positive Rate (TPR) = TP / (TP + FN)
True Negative Rate (TNR) = TN / (TN + FP)
False Positive Rate (FPR) = FP / (FP + TN)
False Negative Rate (FNR) = FN / (FN + TP)
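Applying the four rates to Model 1 and Model 2 gives a worked example. This is a minimal sketch, assuming fraud is the positive class and the standard rate definitions (written out in the code); the `rates` helper is a hypothetical name, not something from the slides.

```python
def rates(tp, tn, fp, fn):
    """Return the four rates computed from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # share of fraud that is caught
        "TNR": tn / (tn + fp),  # share of genuine transactions left alone
        "FPR": fp / (fp + tn),  # share of genuine transactions flagged as fraud
        "FNR": fn / (fn + tp),  # share of fraud that slips through
    }


# Counts derived from the example: 10 fraudulent and 10000 genuine transactions.
model_1 = rates(tp=3, tn=9990, fp=10, fn=7)
model_2 = rates(tp=8, tn=9900, fp=100, fn=2)

for name, r in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Model 2 catches 8 of the 10 fraudulent transactions (TPR 0.8) against 3 for Model 1 (TPR 0.3), at the price of a higher false positive rate (0.01 against 0.001). Since the goal stated earlier is to catch as many fraudulent transactions as possible, these rates favour Model 2 even though it makes more mistakes in total.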
Graph Mining
Graphs have become increasingly important in modelling complicated structures, such as circuits, images, chemical compounds, protein structures, biological networks, social networks, the Web, workflows, and XML documents. Many graph search algorithms have been developed in chemical informatics, computer vision, video indexing, and text retrieval. With the increasing demand for the analysis of large amounts of structured data, graph mining has become an active and important theme in data mining. Among the various kinds of graph patterns, frequent substructures are the most basic patterns that can be discovered in a collection of graphs. They are useful for characterizing graph sets, discriminating between different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating similarity search in graph databases. Recent studies have developed several graph mining methods and applied them to the discovery of interesting patterns in various applications.
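The idea of a frequent substructure can be illustrated in its simplest form: one-edge patterns, identified by the labels of their two endpoints, that appear in at least a minimum number of graphs. This is only a sketch under that assumption. The function name `frequent_edges`, the `min_support` parameter and the toy data are hypothetical, and full frequent-subgraph miners also grow larger patterns and test subgraph isomorphism, which is omitted here.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return labelled edges that occur in at least `min_support` graphs."""
    support = Counter()
    for edges in graphs:
        # Normalise each undirected edge and count it at most once per graph.
        unique = {tuple(sorted(edge)) for edge in edges}
        support.update(unique)
    return {edge: n for edge, n in support.items() if n >= min_support}


# Hypothetical toy data: each graph is a list of edges between atom labels.
graphs = [
    [("C", "O"), ("C", "H"), ("C", "C")],
    [("O", "C"), ("C", "N")],
    [("C", "H"), ("C", "O"), ("N", "H")],
]

print(frequent_edges(graphs, min_support=2))
```

With `min_support=2`, only the C-O edge (present in all three graphs) and the C-H edge (present in two) are kept.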
Social Network
A social network can be defined as the set of relationships between individuals, where each individual is a social entity. It represents both the collection of ties between people and the strength of those ties. In general, a social network is used as a measure of social "connectedness": it allows the quality and quantity of information flow between individuals, and within groups, to be observed and measured.
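One common way to hold both the ties and their strength is a weighted, undirected graph. The sketch below assumes that representation; the names and strength values are invented for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical ties: (person, person, strength), strength on an arbitrary scale.
ties = [
    ("Asha", "Ben", 3),
    ("Asha", "Chen", 1),
    ("Ben", "Chen", 2),
    ("Chen", "Dev", 5),
]

network = defaultdict(dict)
for a, b, strength in ties:
    # Ties are treated as undirected, so store them in both directions.
    network[a][b] = strength
    network[b][a] = strength

for person, neighbours in sorted(network.items()):
    print(person, "ties:", len(neighbours), "total strength:", sum(neighbours.values()))
```

The number of ties and their total strength per person are two simple measures of connectedness that can be read from this structure.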