Ppt on CLASS IMBALANCE PROBLEM in Data Mining

bhdbd061 12 views 27 slides Jun 23, 2024

About This Presentation

Class imbalance problem by Dr Meenakshi (Data Mining)


Slide Content

UNIT-5 CLASS IMBALANCE PROBLEM

What is the Class Imbalance Problem?

It is the problem in machine learning where the total number of examples of one class (positive) is far smaller than the total number of examples of another class (negative). This problem is extremely common in practice and can be observed in various disciplines, including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, facial recognition, etc.

ABOUT PROBLEM

Given a dataset of transaction data, we would like to find out which transactions are fraudulent and which are genuine. It is highly costly to the e-commerce company if a fraudulent transaction goes through, as this damages our customers' trust in us and costs us money. So we want to catch as many fraudulent transactions as possible.

If a dataset consists of 10000 genuine and 10 fraudulent transactions, the classifier will tend to classify fraudulent transactions as genuine transactions. The reason can be easily explained by the numbers. Suppose the machine learning algorithm has two possible outputs, as follows:

Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of 10000 genuine transactions as fraudulent transactions.

Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out of 10000 genuine transactions as fraudulent transactions.

Counting mistakes alone, Model 1 makes 17 errors and Model 2 makes 102, so Model 1 looks better even though it misses 7 of the 10 fraudulent transactions.
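The arithmetic behind the two models can be checked with a short sketch. It uses only the counts given on the slide; the function and variable names are illustrative assumptions, not part of the original presentation.

```python
# Counts from the slide: 10 fraudulent (positive) and 10000 genuine (negative) transactions.
N_FRAUD = 10
N_GENUINE = 10_000


def mistakes_and_accuracy(missed_fraud, flagged_genuine):
    """Return (number of mistakes, accuracy) for one model.

    missed_fraud: fraudulent transactions classified as genuine.
    flagged_genuine: genuine transactions classified as fraudulent.
    """
    total = N_FRAUD + N_GENUINE
    mistakes = missed_fraud + flagged_genuine
    return mistakes, (total - mistakes) / total


model_1 = mistakes_and_accuracy(missed_fraud=7, flagged_genuine=10)
model_2 = mistakes_and_accuracy(missed_fraud=2, flagged_genuine=100)

print(f"Model 1: {model_1[0]} mistakes, accuracy {model_1[1]:.4f}")
print(f"Model 2: {model_2[0]} mistakes, accuracy {model_2[1]:.4f}")
```

Model 1 scores higher on plain accuracy, although it lets most of the fraud through. This is why accuracy alone can mislead on imbalanced data.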

How to decide which model is the better solution: M1 or M2?

To tell the machine learning algorithm (or the researcher) that Model 2 is better than Model 1, we need a way to show it. For that, we need better metrics than just counting the number of mistakes made. The concepts of True Positive, True Negative, False Positive and False Negative are introduced for this purpose:

True Positive (TP) – An example that is positive and is classified correctly as positive
True Negative (TN) – An example that is negative and is classified correctly as negative
False Positive (FP) – An example that is negative but is classified wrongly as positive
False Negative (FN) – An example that is positive but is classified wrongly as negative

Based on these counts, we have the True Positive Rate, True Negative Rate, False Positive Rate and False Negative Rate.
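The four rates can be sketched for the two models as follows. This assumes the standard definitions (TPR = TP / (TP + FN), TNR = TN / (TN + FP), FPR = FP / (FP + TN), FNR = FN / (FN + TP)) and treats fraudulent transactions as the positive class. The helper name `rates` is hypothetical.

```python
def rates(tp, tn, fp, fn):
    """Compute the four rates from confusion-matrix counts (standard definitions assumed)."""
    return {
        "TPR": tp / (tp + fn),  # share of positives that were caught
        "TNR": tn / (tn + fp),  # share of negatives correctly passed
        "FPR": fp / (fp + tn),  # share of negatives wrongly flagged
        "FNR": fn / (fn + tp),  # share of positives that were missed
    }


# Model 1: 7 of 10 fraudulent missed, 10 of 10000 genuine flagged.
model_1 = rates(tp=3, tn=9990, fp=10, fn=7)
# Model 2: 2 of 10 fraudulent missed, 100 of 10000 genuine flagged.
model_2 = rates(tp=8, tn=9900, fp=100, fn=2)

for name, result in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Under these definitions, Model 2 catches 8 of the 10 fraudulent transactions (TPR 0.8) against 3 for Model 1 (TPR 0.3). The price is a false positive rate of 0.01 instead of 0.001.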

Graph Mining

Graphs have become increasingly important in modelling complicated structures, such as circuits, images, chemical compounds, protein structures, biological networks, social networks, the Web, workflows, and XML documents. Many graph search algorithms have been developed in chemical informatics, computer vision, video indexing, and text retrieval. With the increasing demand for the analysis of large amounts of structured data, graph mining has become an active and important theme in data mining.

Among the various kinds of graph patterns, frequent substructures are the most basic patterns that can be discovered in a collection of graphs. They are useful for characterizing graph sets, discriminating between different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating similarity search in graph databases. Recent studies have developed several graph mining methods and applied them to the discovery of interesting patterns in various applications.
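The idea of a frequent substructure can be illustrated in its simplest form: single labeled edges that appear in at least a minimum number of graphs. This is a toy sketch under stated assumptions, not one of the graph mining methods the slide refers to. The data, the `frequent_edges` helper and the `min_support` parameter are hypothetical, and real frequent-subgraph miners also grow larger patterns and must test subgraph isomorphism.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return labeled-edge patterns that occur in at least min_support graphs.

    Each graph is a list of undirected edges given as (label_a, label_b) pairs.
    """
    support = Counter()
    for edges in graphs:
        # Sort the labels so (a, b) and (b, a) are the same pattern, and use a
        # set so a pattern is counted at most once per graph.
        patterns = {tuple(sorted(edge)) for edge in edges}
        support.update(patterns)
    return {pattern: count for pattern, count in support.items() if count >= min_support}


# Hypothetical toy collection: three small graphs with atom-like vertex labels.
graphs = [
    [("C", "O"), ("C", "H"), ("C", "N")],
    [("O", "C"), ("C", "H")],
    [("C", "H"), ("N", "H")],
]

result = frequent_edges(graphs, min_support=2)
print(result)
```

Here the support of a pattern is the number of graphs that contain it, so an edge repeated inside one graph still counts once.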

Social network

A social network can be defined as the set of relationships between individuals, where each individual is a social entity. It represents both the collection of ties between people and the strength of those ties. In general, a social network is used as a measure of social “connectedness”, for observing and calculating the quality and quantity of information flow between individuals and within groups.