Data mining approaches and methods

About This Presentation

This presentation covers data mining tasks and techniques. It includes the major mining task algorithms with their respective examples.


Slide Content

Data mining techniques and tasks (Prepared by: Prabesh Pradhan)

DATA MINING: The process of collecting, searching through, and analyzing a large amount of data in a database, so as to discover patterns or relationships. It is the extraction of useful patterns from data sources, e.g., databases, data warehouses, the web. Patterns must be valid, novel, potentially useful, and understandable.

DATA MINING MODELS AND TASKS

Predictive: It makes predictions about the values of data using known results from different data, or based on historical data. Descriptive: It identifies patterns or relationships in data; it serves as a way to explore the properties of the data.

Data mining tasks: the main tasks covered here are classification, time series analysis, clustering, and summarization.

CLASSIFICATION: Classification derives a model to determine the class of an object based on its attributes. Given a collection of records, each record contains a set of attributes, and one of the attributes is the class. E.g., pattern recognition.

TIME SERIES ANALYSIS: The value of an attribute is examined as it varies over time. A time series plot is used to visualize the time series. Ex: stock exchange data.

CLUSTERING: Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. The most similar data are grouped into clusters. Ex: bank customer segmentation.

SUMMARIZATION: Abstraction or generalization of data, resulting in a smaller set which gives a general overview of the data. Alternatively, summary-type information can be derived from the data.

Data mining techniques: Association, Classification, Clustering, Prediction, Sequential Patterns, Decision Trees.

Association: The pattern is discovered based on a relationship between items in the same transaction. The association technique is also known as the relation technique. It is used in market basket analysis to identify a set of products that customers frequently purchase together.

Classification: It classifies each item in a set of data into one of a predefined set of classes or groups. It makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. We develop software that can learn how to classify data items into groups. For example, we can apply classification in an application that, given all records of employees who left the company, predicts who will probably leave the company in a future period.
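As a minimal sketch of that employee example, assuming scikit-learn is available; the feature columns (years at company, salary band, number of projects) and all records here are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical records: [years_at_company, salary_band, num_projects]
X = [[1, 1, 2], [2, 1, 3], [7, 3, 4], [5, 2, 5], [1, 2, 1],
     [8, 3, 3], [3, 1, 6], [6, 2, 2], [2, 2, 6], [9, 3, 2]]
y = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # 1 = left the company, 0 = stayed

# Learn a classification model from the labeled records, then classify
# a current employee into one of the two predefined groups.
model = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(model.predict([[2, 1, 5]]))   # [1] or [0]
```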

Clustering: It makes meaningful or useful clusters of objects that have similar characteristics, using an automatic technique. The clustering technique defines the classes and puts objects in each class, while in classification techniques objects are assigned to predefined classes.

Prediction: The prediction analysis technique can be used in sales to predict future profit: if we consider sales as the independent variable, profit could be the dependent variable. Based on historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.

Sequential patterns: This seeks to discover or identify similar patterns, regular events, or trends in transaction data over a business period. E.g., in sales, with historical transaction data, businesses can identify sets of items that customers buy together at different times of the year. Businesses can then use this information to recommend these items with better deals, based on customers' past purchasing frequency.

Classification and prediction (Prepared by: Sonahang Rai)

Decision Tree: It is a classification scheme which generates a tree and a set of rules from a given data set. A decision tree consists of a root node, edges, and leaves. Each internal node denotes a test on an attribute. Each branch denotes an outcome of the test. Each leaf node holds a class label.

Decision tree example

ID3 algorithm: 1. Calculate the entropy of every attribute using the dataset. 2. Split the set into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum). 3. Make a decision tree node containing that attribute. 4. Recurse on the subsets using the remaining attributes.

Entropy: Info(D) = −Σ_{i=1..m} p_i · log2(p_i), where p_i is the proportion of tuples in D that belong to class i. Entropy after splitting on attribute A: Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · Info(D_j), where A partitions D into D_1, ..., D_v. Information gain: Gain(A) = Info(D) − Info_A(D).

ID3 example

Finding entropy and gain. Finding Info(D): Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits. Finding Info_age(D): Info_age(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × (−(4/4) log2(4/4)) + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits. Thus, Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits. Similarly we can find the gain of all the other attributes: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048. Among all the gains, the gain of age is the maximum, so age is taken as the root node.
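These figures can be reproduced with a short script. A minimal sketch, assuming the classic 14-row buys_computer table that the numbers above come from:

```python
from collections import Counter
from math import log2

# The 14-row "buys_computer" training set; each row is
# (age, income, student, credit_rating, buys_computer).
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(A) = Info(D) - sum(|D_j|/|D| * Info(D_j)) over A's values."""
    labels = [r[-1] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a

print(f"Info(D) = {entropy([r[-1] for r in rows]):.3f} bits")  # 0.940
for i, a in enumerate(ATTRS):
    print(f"Gain({a}) = {gain(rows, i):.3f}")  # age 0.246, income 0.029, ...
```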

Rule Based Classification: It uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form IF condition THEN conclusion. The IF part is called the rule antecedent or precondition; it can consist of one or more attribute tests. The THEN part is called the rule consequent; it consists of a class prediction. Ex: R1: IF age=youth AND student=yes THEN buys_computer=yes, or equivalently R1: (age=youth) ∧ (student=yes) => (buys_computer=yes).

A rule R can be assessed by its coverage and accuracy. Given tuples from a data set D, let n_cover = the number of tuples covered by R, n_correct = the number of tuples correctly classified by R, and |D| = the number of tuples in D. Then Coverage(R) = n_cover / |D| and Accuracy(R) = n_correct / n_cover.

R: IF age=youth AND student=yes THEN buys_computer=yes. With |D| = 14, n_cover = 2, n_correct = 2: Coverage(R) = 2/14 ≈ 14.28%, Accuracy(R) = 2/2 = 100%.

Rules extraction from the decision tree: One rule is created for each path from the root to a leaf node. Each splitting criterion is logically ANDed to form the rule antecedent (IF part). The leaf node holds the class prediction for the rule consequent (THEN part). A logical OR is implied between the extracted rules.

Example of rules extracted from the decision tree: (age=young) ∧ (student=no) => (buys_computer=no); (age=young) ∧ (student=yes) => (buys_computer=yes); (age=middle-aged) => (buys_computer=yes); (age=senior) ∧ (credit_rating=fair) => (buys_computer=yes); (age=senior) ∧ (credit_rating=excellent) => (buys_computer=no).

Genetic algorithm: It is used for finding optimized solutions to search problems. It is based on the theory of natural selection and the evolution of biology. Selection: survival of the fittest. Evolution: origin of species from a common ancestor. It is excellent at searching large and complex data sets.

Gene: a part of a chromosome. Chromosome: a set of genes. Population: the number of individuals present, each with the same length of chromosome. Fitness: the value assigned to an individual. Fitness function: the function which assigns the fitness value. Selection: selecting individuals for the next generation. Mutation: changing a random gene.

Algorithm: 1. Generate a random population of n chromosomes. 2. Evaluate the fitness of each chromosome in the population. 3. Select two parent chromosomes from the population according to their fitness. 4. With a crossover probability, cross over the parents to form new offspring. 5. With a mutation probability, mutate the new offspring. 6. If the end condition is satisfied, stop and return the best solution in the current population; otherwise repeat from step 3.
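A minimal sketch of this loop, using an assumed toy fitness function (the count of 1-bits in a binary chromosome) so the mechanics stay visible; a real application would plug in a problem-specific fitness function and usually keep the best individuals across generations:

```python
import random

GENES, POP, GENERATIONS = 20, 30, 100
CROSSOVER_P, MUTATION_P = 0.9, 0.02

def fitness(chrom):
    return sum(chrom)  # toy objective: maximize the number of 1-bits

def select(pop):
    # Fitness-proportionate (roulette-wheel) selection of one parent.
    return random.choices(pop, weights=[fitness(c) + 1 for c in pop])[0]

def crossover(a, b):
    if random.random() < CROSSOVER_P:
        cut = random.randrange(1, GENES)       # single-point crossover
        return a[:cut] + b[cut:]
    return a[:]

def mutate(chrom):
    # Flip each gene independently with probability MUTATION_P.
    return [g ^ 1 if random.random() < MUTATION_P else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    if max(map(fitness, pop)) == GENES:        # end condition satisfied
        break
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]
print(max(pop, key=fitness), max(map(fitness, pop)))
```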

Linear Regression: It is a data mining technique used to predict a range of numeric values (also called continuous values), given a particular dataset. It is used to estimate the relationship between two variables. It involves a response/dependent variable y and a single predictor/independent variable x: y = b0 + b1·x, where b0 (the y-intercept) and b1 (the slope) are the regression coefficients.

X (years experience)   Y (salary, in $1000s)
3    30
8    57
9    64
13   72
3    36
6    43
11   59
21   90
1    20
16   83
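Fitting this table by least squares gives a line close to y = 23.2 + 3.5x. A minimal sketch in plain Python using the standard closed-form slope and intercept:

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary in $1000s

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2); b0 = y_bar - b1*x_bar
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
print(f"y = {b0:.2f} + {b1:.2f}x")                       # y = 23.21 + 3.54x
print(f"predicted salary at x = 10: {b0 + b1 * 10:.1f}") # about 58.6 ($1000s)
```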

Non-linear regression, association and frequent patterns (Prepared by: Biplap Bhattarai)

Non-Linear Regression: Often the relationship between x and y cannot be approximated with a straight line. In this case, a nonlinear regression technique may be used. Alternatively, the data could be preprocessed to make the relationship linear. Many non-linear models can be expressed as polynomial regression models, which can be transformed into linear regression models. For example: y = w0 + w1·x + w2·x² + w3·x³ + w4·x⁴.

For example, y = w0 + w1·x + w2·x² + w3·x³ + w4·x⁴ is convertible to linear form with the new variables x1 = x, x2 = x², x3 = x³, x4 = x⁴, giving y = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4, which is easily solved by the method of least squares using regression analysis software.
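A minimal sketch of this substitution with NumPy; the data here are generated from assumed coefficients purely to show that least squares recovers them:

```python
import numpy as np

# Model y = w0 + w1*x + w2*x^2 + w3*x^3 + w4*x^4, treated as linear
# regression on the new variables x1 = x, x2 = x^2, x3 = x^3, x4 = x^4.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x - 3 * x**2 + 0.5 * x**4 + rng.normal(0, 0.2, x.size)

X = np.vander(x, N=5, increasing=True)     # columns: 1, x, x^2, x^3, x^4
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # method of least squares
print(np.round(w, 2))                      # approximately [1, 2, -3, 0, 0.5]
```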

Data Mining Techniques. Association Rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both). Frequent Pattern: a frequent pattern is a pattern (a set of items, subsequences) that occurs frequently in a dataset. For example, a set of items such as milk and bread that appears frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent) sequential pattern if it occurs frequently in a shopping history database.

Association Rule Mining: Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures in datasets found in various kinds of databases (relational, transactional).

DEFINITION OF APRIORI ALGORITHM. The Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if a subset is infrequent, then all of its supersets must be infrequent too. Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).

The Apriori principle holds due to the following property of the support measure: ∀X, Y: (X ⊂ Y) → s(X) ≥ s(Y). That is, if X is a subset of Y, then the support of X is greater than or equal to the support of Y.

KEY CONCEPTS. Frequent itemsets: all the itemsets that meet the minimum support (the set of frequent i-itemsets is denoted L_i). Apriori property: any subset of a frequent itemset must be frequent. Join operation: to find L_k, a set of candidate k-itemsets is generated by joining L_{k-1} with itself.

The Apriori Algorithm. Given minimum required support s: 1. Search for all individual elements (1-element itemsets) that have a minimum support of s. 2. Repeat: 2.1 From the result of the previous search for i-element itemsets, search for all (i+1)-element itemsets that have a minimum support of s. 2.2 This becomes the set of all frequent (i+1)-element itemsets that are interesting. 3. Until the itemset size reaches the maximum.
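A minimal sketch of this level-wise search; the transactions and the absolute minimum support count are invented for illustration:

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "cookies"},
    {"milk", "bread"},
    {"milk", "cookies"},
    {"bread", "butter"},
    {"milk", "bread", "cookies", "butter"},
]
min_support = 2  # absolute count, made up for this example

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent = list(level)

k = 2
while level:
    # Join step: combine L_{k-1} itemsets, keeping only size-k candidates.
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    # Prune step (Apriori property): every (k-1)-subset must be frequent.
    prev = set(level)
    candidates = [c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))]
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level
    k += 1

for f in frequent:
    print(set(f), support(f))
```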

The frequent itemsets are the items in L1, L2, L3, and so on.

Pros of the Apriori algorithm: it is an easy-to-implement and easy-to-understand algorithm, and it can be used on large itemsets. Cons of the Apriori algorithm: sometimes it may need to generate a large number of candidate rules, which can be computationally expensive, and calculating support is also expensive because it has to go through the entire database.

Frequent Pattern Tree (FP-Tree). The FP-tree algorithm is used to identify frequent patterns in data mining. The frequent pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database.

The tree has one root labeled "null", with a set of item-prefix subtrees as children. Each node in an item-prefix subtree consists of three fields: item-name: registers which item the node represents; count: the number of transactions represented by the portion of the path reaching the node; node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
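A minimal sketch of this node structure in Python (class and method names are hypothetical, and the node-link field is left unmaintained for brevity):

```python
class FPNode:
    """One FP-tree node with the three fields described above."""
    def __init__(self, item, parent=None):
        self.item = item        # item-name ("null" at the root)
        self.count = 0          # transactions whose path reaches this node
        self.parent = parent
        self.children = {}      # item-name -> child FPNode
        self.node_link = None   # next node with the same item-name (unset here)

    def insert(self, ordered_items):
        """Insert one transaction (items already in priority order)."""
        if not ordered_items:
            return
        head, *rest = ordered_items
        if head not in self.children:
            self.children[head] = FPNode(head, parent=self)
        child = self.children[head]
        child.count += 1        # revisiting a node bumps its count
        child.insert(rest)

root = FPNode("null")
root.insert(["B", "D", "A", "E"])       # e.g. row 1 of the worked example below
root.insert(["B", "D", "A", "E", "C"])  # row 2 reuses the same branch
```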

Question: Find all frequent itemsets (frequent patterns) in the following database using the FP-growth algorithm. Take the minimum support as 30%.

Step 1 - Calculate minimum support. First we calculate the minimum support count. The question says the minimum support should be 30%, which is calculated as follows: minimum support count = 30/100 × 8 = 2.4. To keep the calculation simple, this is rounded up to the ceiling value, so the minimum support count is ceiling(30/100 × 8) = 3.

Step 2 - Find the frequency of occurrence. Now we find the frequency of occurrence of each item in the table. For example, item A occurs in rows 1, 2, 3, 4, and 7, i.e., 5 times in total in the database table. The counted frequency of occurrence of each item is shown in the table below.

Step 3 - Prioritize the items. In Table 2, the numbers written in red are the priority of each item according to its frequency of occurrence. Item B got the highest priority (1) due to its highest number of occurrences. At this point we also have the opportunity to drop any items which do not fulfill the minimum support requirement. For instance, if the table contained an item F with frequency 1, we could drop it.

Step 4 - Order the items according to priority. A new column is added to the previous table: in the Ordered Items column, all the items are queued according to their priority, as marked in red in the table. For example, when ordering row 1, the highest-priority item is B, followed by D, A, and E respectively.

Step 5 - Build the FP-tree row by row. Row 1: Note that all FP-trees have a 'null' node as the root node. So draw the root node first and attach the items of row 1 one by one, in order, writing their occurrence counts next to them.

Row 2: Then update the above tree by entering the items of row 2. Without creating another branch, you can follow the previous branch up to E, and then you have to create a new node after it for C. (When you go through a branch a second time, erase one and write two, to indicate that you have visited that node twice; if you visit it a third time, erase two and write three.) Figure 2 shows the FP-tree after adding rows 1 and 2.

Row 3: In row 3 you have to visit B, A, E, and C respectively. You might think you can follow the same branch again, simply updating the counts of B, A, E, and C, but you cannot: you can come in through B, but you cannot connect B to the existing A while skipping over D. As a result, you should draw a new A connected to B, then connect a new E to that A, and a new C to the new E.

Row 4: Row 4 contains B, D, A, so we can just update the frequencies of occurrence in the existing branch: B:4, D:3, A:3. Row 5: The fifth row has only item D, so now we draw a new branch from the 'null' node.

Row 6: B and D appear in row 6, so just change B:4 to B:5 and D:3 to D:4. Row 7: Attach two new nodes A and E to the D node which hangs off the null node, then mark them as D:2, A:1, and E:1. Row 8: Attach a new node C to B and update the traversal counts (B:6, C:1).

Step 6 - Validation. After the five steps, the final FP-tree is as follows. How do we know it is correct? Count the frequency of occurrence of each item in the FP-tree and compare it with the table: if both counts are equal, that is a good indication that your tree is correct.

FP-Growth: Stands for frequent pattern growth. It is a scalable technique for mining frequent patterns in a database. After constructing the FP-tree it is possible to mine it to find the complete set of frequent patterns. FP-growth improves on Apriori to a big extent: frequent itemset mining is possible without candidate generation, and only two scans of the database are needed.

FP-Growth Algorithm Process: Simply a two-step procedure. Step 1: build a compact data structure called the FP-tree, using 2 passes over the data set. Step 2: extract frequent itemsets directly from the FP-tree.

FP-Growth Example: With minimum support 3, build the FP-tree.

FP-Growth Example: For T, we start by drawing the tree whose end nodes are T's, keeping only the support of T.

FP-Growth Example: We take out T one by one, and as we do so, we push its support up to every node on the chain to the root that was part of the same transaction in which T appeared:

FP-Growth Example: Since D is infrequent (min support = 3), we remove D. The frequent itemsets involving T are {C,T}, {W,T}, {A,T}, {C,W,T}, {C,A,T}, {W,A,T}, {T}.

FP-Growth Example: Then, starting from the main FP-tree, we consider each item that appears in the tree and create new FP-trees for C, W, A, and D in the same way.

Market Basket Analysis: Market basket analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in the US and purchase diapers and milk, then you are likely to buy beer. The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases.

Market Basket Analysis: Typically the relationship takes the form of a rule: IF {milk, diaper} THEN {beer}. The fraction of transactions that contain all of the rule's items (milk, diaper, and beer) is referred to as the support of the rule. The conditional probability that a customer who buys milk and diapers will also purchase beer is referred to as the confidence.
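A small worked computation of these two measures on made-up baskets:

```python
# Support and confidence of IF {milk, diaper} THEN {beer}.
baskets = [
    {"milk", "diaper", "beer"},
    {"milk", "diaper"},
    {"milk", "bread"},
    {"diaper", "beer", "cola"},
    {"milk", "diaper", "beer", "bread"},
]
antecedent, consequent = {"milk", "diaper"}, {"beer"}

both = sum(antecedent | consequent <= b for b in baskets)  # rule items: 2
ante = sum(antecedent <= b for b in baskets)               # antecedent: 3
print(f"support    = {both / len(baskets):.0%}")  # 40%
print(f"confidence = {both / ante:.0%}")          # 67%
```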

How is it used? In retailing, most purchases are bought on impulse. Market basket analysis gives clues as to what a customer might have bought if the idea had occurred to them. Market basket analysis can be used in deciding the location and promotion of goods inside a store.

Types: There are two main types of MBA. Predictive MBA is used to classify cliques of item purchases, events, and services that largely occur in sequence. Differential MBA removes a high volume of insignificant results and can lead to very in-depth results; it compares information between different stores, demographics, seasons of the year, days of the week, and other factors.

Clustering (Prepared by: Trilok Pratap Kafle)

Clustering. Cluster: a cluster is a group of objects that belong to the same class; similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster. Clustering is the process of grouping abstract objects into classes of similar objects.

Application of cluster analysis: Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns. Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. Clustering also helps in classifying documents on the web for information discovery.

Clustering methods: the partitioning method and the hierarchical method.

Partitioning method: Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data. Each partition will represent a cluster, with k ≤ n. That is, it classifies the data into k groups which satisfy the following requirements: each group contains at least one object, and each object must belong to exactly one group. Types of partitioning methods: k-means and k-medoids.

K-means: K-means is one of the most popular clustering algorithms. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.

Steps in k-means: 1. Partition the objects into k nonempty subsets. 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster). 3. Assign each object to the cluster with the nearest seed point. 4. Go back to step 2; stop when there are no more new assignments.
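A minimal sketch of these four steps on made-up 2-D points (k = 2); the initial seeds are simply a random sample of the data:

```python
import random

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
k = 2
centroids = random.sample(points, k)   # step 1: arbitrary initial seeds

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

while True:
    # Step 3: assign each point to the nearest centroid.
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
    # Step 2: recompute centroids as the mean of each cluster.
    new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
           for i, pts in enumerate(clusters)]
    if new == centroids:   # step 4: stop when nothing moves any more
        break
    centroids = new

print(centroids, clusters)
```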

Example: cluster 2, 3, 6, 8, 9, 12, 15, 18, 22 into 3 clusters.
Initial partition: Cluster 1 = {2, 12, 18}, mean = 10.67; Cluster 2 = {6, 9, 22}, mean = 12.33; Cluster 3 = {3, 8, 15}, mean = 8.67.
Re-assign: Cluster 1 = {} (empty, mean taken as 0); Cluster 2 = {12, 15, 18, 22}, mean = 16.75; Cluster 3 = {2, 3, 6, 8, 9}, mean = 5.6.
Re-assign: Cluster 1 = {2}, mean = 2; Cluster 2 = {12, 15, 18, 22}, mean = 16.75; Cluster 3 = {3, 6, 8, 9}, mean = 6.5.
Re-assign: Cluster 1 = {2, 3}, mean = 2.5; Cluster 2 = {12, 15, 18, 22}, mean = 16.75; Cluster 3 = {6, 8, 9}, mean = 7.67.
Re-assign: 12 is now closer to 7.67 than to 16.75, so Cluster 1 = {2, 3}, mean = 2.5; Cluster 2 = {15, 18, 22}, mean = 18.33; Cluster 3 = {6, 8, 9, 12}, mean = 8.75.
Re-assign: no change, so we're done.

K-medoids: It is also an algorithm which breaks the data set into groups. It attempts to minimize the distance between the points in a cluster and a point designated as the center of that cluster: an actual data point is taken as the center, and the method works with a generalization of the Manhattan norm.

Hierarchical Method: Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other. For example, all files and folders on a hard disk are organized in a hierarchy. There are two types of hierarchical clustering: 1. Divisive, 2. Agglomerative.

Divisive Method: In divisive or top-down clustering we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but divisive clustering is conceptually more complex.

Agglomerative Method: In agglomerative or bottom-up clustering we assign each observation to its own cluster, then compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters. The process repeats until only a single cluster is left. Before any clustering is performed, we determine the proximity matrix containing the distance between each pair of points, using a distance function. The matrix is then updated to record the distance between each pair of clusters. The following three methods differ in how the distance between clusters is measured.

Single Linkage: In single-linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster. For example, the distance between clusters "r" and "s" is equal to the length of the arrow between their two closest points. Complete Linkage: In complete-linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster. For example, the distance between clusters "r" and "s" is equal to the length of the arrow between their two furthest points.

Average Linkage: In average-linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" is equal to the average length of the arrows connecting the points of one cluster to the points of the other.
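A minimal sketch comparing the three linkage criteria, assuming SciPy is installed; the 1-D items from the k-means example are reused as data:

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Each observation is a 1-D point; linkage() builds the bottom-up merge tree.
data = [[2], [3], [6], [8], [9], [12], [15], [18], [22]]

for method in ("single", "complete", "average"):
    Z = linkage(data, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```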