Association Rule Mining in Data Mining.pptx


About This Presentation

This presentation is about Association Rule Mining in Data Mining.


Slide Content

Association Rule Mining

Association Rule Mining
Proposed by Agrawal et al. in 1993.
Definition: a technique in data mining for discovering interesting relationships between variables in large datasets.
Purpose: identifies patterns, correlations, or causal structures among sets of items.
Applications: market basket analysis, cross-selling strategies, customer segmentation.

Market Basket Analysis
Objective: to understand the purchase behavior of customers by identifying the sets of products bought together.
Association rule: A ⇒ D (customers who purchase product A also tend to buy product D at the same time).

Key Concepts
Itemset: a set of items (e.g., X = {milk, bread, cereal} is an itemset).
k-itemset: an itemset with k items (e.g., X = {milk, bread, cereal} is a 3-itemset).
Support (s): the fraction of transactions that contain both X and Y.
Confidence (c): measures how often items in Y appear in transactions that contain X.
Process:
Step 1: Collect and preprocess the data (sales transactions).
Step 2: Identify frequent itemsets using an algorithm (e.g., Apriori, FP-Growth, Eclat).
Step 3: Generate association rules from these itemsets.
Step 4: Evaluate the rules using metrics (support, confidence, lift).

Market Basket Analysis - Example
Rule: {Shoes, Trouser} ⇒ Shirt
s = freq(Shoes, Trouser, Shirt) / N = 2/4 = 0.5
c = freq(Shoes, Trouser, Shirt) / freq(Shoes, Trouser) = 2/3 ≈ 0.67
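A minimal Python sketch of this calculation. The four-transaction dataset below is an assumption chosen so that it reproduces the slide's figures (the original slide showed its own transaction table, which is not part of the extracted text):

```python
# Support and confidence for the rule {Shoes, Trouser} => {Shirt}.
# The transactions are illustrative; they yield the slide's 2/4 and 2/3.
transactions = [
    {"Shoes", "Trouser", "Shirt"},
    {"Shoes", "Trouser", "Shirt"},
    {"Shoes", "Trouser"},
    {"Shirt"},
]

antecedent = {"Shoes", "Trouser"}   # X
consequent = {"Shirt"}              # Y

n = len(transactions)
freq_xy = sum(1 for t in transactions if antecedent | consequent <= t)  # X and Y together
freq_x = sum(1 for t in transactions if antecedent <= t)                # X alone

support = freq_xy / n          # fraction of all transactions containing X and Y
confidence = freq_xy / freq_x  # fraction of X-transactions that also contain Y

print(f"support = {support:.2f}")       # 0.50
print(f"confidence = {confidence:.2f}") # 0.67
```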

Apriori Algorithm
The Apriori algorithm is a technique for finding frequent itemsets in transactional data.
Steps:
1. Define minimum support and minimum confidence thresholds.
2. Find frequent individual items.
3. Generate candidate itemsets of size 2.
4. Prune infrequent candidate itemsets.
5. Generate candidate itemsets of size 3.
6. Continue generating and pruning candidate itemsets until no more frequent itemsets are found.
7. Generate association rules from the frequent itemsets.
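A compact, illustrative Python sketch of the generate-and-prune loop described in these steps (the function and variable names are my own, not from the slides; rule generation is covered with the example that follows):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 2: frequent individual items (1-itemsets)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {i: support(i) for i in items if support(i) >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Steps 3/5: join frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Step 4: prune candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Keep the candidates that meet the minimum support threshold
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [{"milk", "bread"}, {"milk", "bread", "cereal"}, {"bread", "cereal"}, {"milk"}]
print(apriori(baskets, min_support=0.5))
```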

Apriori Algorithm - Example

Suppose min_confidence = 40%. If confidence >= min_confidence, we accept the rule.
Frequent itemset: {C1R, A2R, A2C}
Subsets: {C1R, A2R}, {C1R, A2C}, {A2R, A2C}, {C1R}, {A2R}, {A2C}
Rule 1: {C1R, A2R} ⇒ {A2C}
Confidence = support(C1R, A2R, A2C) / support(C1R, A2R) = (2/4) / (2/4) = 100% > 40%, so Rule 1 is accepted.
Rule 2: {C1R, A2C} ⇒ {A2R}
Confidence = support(C1R, A2R, A2C) / support(C1R, A2C) = (2/4) / (3/4) = 66.6% > 40%, so Rule 2 is accepted.
All remaining subsets are checked in the same way.
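A short sketch of the same acceptance check, using only the supports quoted on this slide (in practice the supports come from the frequent-itemset mining step):

```python
# Supports quoted on the slide (out of 4 transactions)
supports = {
    frozenset({"C1R", "A2R", "A2C"}): 2 / 4,
    frozenset({"C1R", "A2R"}): 2 / 4,
    frozenset({"C1R", "A2C"}): 3 / 4,
}

itemset = frozenset({"C1R", "A2R", "A2C"})
min_confidence = 0.40

# Check the two rules worked out on the slide
for antecedent in (frozenset({"C1R", "A2R"}), frozenset({"C1R", "A2C"})):
    confidence = supports[itemset] / supports[antecedent]
    verdict = "accepted" if confidence >= min_confidence else "rejected"
    print(f"{set(antecedent)} => {set(itemset - antecedent)}: "
          f"confidence = {confidence:.1%} ({verdict})")
```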

Apriori Advantages/Disadvantages
Advantages:
Uses the large-itemset property.
Easily parallelized.
Easy to implement.
Disadvantages:
Assumes the transaction database is memory-resident.
Requires many database scans.

Frequent Pattern Growth Approach
An efficient data mining algorithm for finding frequent itemsets without candidate generation.
Ideal for large datasets where the Apriori algorithm's performance degrades due to the generation of a large number of candidate sets.
Steps:
1. Calculate the minimum support.
2. Find the frequency of occurrence of each item.
3. Prioritize the items.
4. Order the items according to priority.
5. Validation.

Frequent Pattern Growth - Example

TID  Items
1    E, A, D, B
2    D, A, C, E, B
3    C, A, B, E
4    B, A, D
5    D
6    D, B
7    A, D, E
8    B, C

Assume minimum support = 30%.
Minimum support count = 30/100 * 8 = 2.4

Item  Frequency
A     5
B     6
C     3
D     6
E     4

Item  Frequency  Priority
A     5          3
B     6          1
C     3          5
D     6          2
E     4          4

TID  Items            Ordered items
1    E, A, D, B       B, D, A, E
2    D, A, C, E, B    B, D, A, E, C
3    C, A, B, E       B, A, E, C
4    B, A, D          B, D, A
5    D                D
6    D, B             B, D
7    A, D, E          D, A, E
8    B, C             B, C
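A small Python sketch of the preprocessing shown above: count item frequencies, assign priorities, and reorder each transaction. The tie between B and D (both occur six times) is assumed to be broken alphabetically so that the priorities match the table:

```python
from collections import Counter

transactions = [
    ["E", "A", "D", "B"],
    ["D", "A", "C", "E", "B"],
    ["C", "A", "B", "E"],
    ["B", "A", "D"],
    ["D"],
    ["D", "B"],
    ["A", "D", "E"],
    ["B", "C"],
]
min_support_count = 0.30 * len(transactions)   # 2.4

# Pass 1: frequency of every item; keep those meeting the support count
freq = Counter(item for t in transactions for item in t)
frequent = {item: f for item, f in freq.items() if f >= min_support_count}

# Priority: most frequent first (ties broken alphabetically)
priority = {item: rank for rank, item in
            enumerate(sorted(frequent, key=lambda i: (-frequent[i], i)), start=1)}

# Pass 2: reorder each transaction by priority, dropping infrequent items
ordered = [sorted((i for i in t if i in frequent), key=priority.get)
           for t in transactions]
for t in ordered:
    print(t)   # reproduces the "Ordered items" column
```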

Advantages of FP-Growth
Increased efficiency: requires only two passes over the dataset, minimizing the time and resources needed for data processing, and is significantly faster because it avoids the exhaustive candidate generation phase that Apriori undergoes.
Elimination of candidate generation: directly mines the frequent itemsets from the FP-tree without generating candidate itemsets.
Lower memory requirements: the FP-tree data structure is compact and often requires less memory than storing extensive candidate sets, which is especially beneficial for dense datasets.
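The two-pass behavior comes from inserting the priority-ordered transactions into a compact prefix tree (the FP-tree). A bare-bones sketch of that construction, with the header-table links and the recursive mining step omitted for brevity (node and function names are illustrative):

```python
class FPNode:
    """One FP-tree node: item name, occurrence count, and child links."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each priority-ordered transaction as a path; shared prefixes share nodes."""
    root = FPNode()
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def show(node, depth=0):
    """Print the tree as an indented item:count outline."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# With the `ordered` transactions from the previous sketch:
# show(build_fp_tree(ordered))
```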

Eclat Algorithm
Eclat stands for Equivalence Class Clustering and bottom-up Lattice Traversal. It is a depth-first search algorithm for mining frequent itemsets, introduced as an efficient alternative to the Apriori algorithm. It uses a vertical database format, in which each item is represented by the set of transaction IDs (its TID-set) in which it occurs.

Example for the Eclat Algorithm

TID  Itemset
T1   Bread, Butter, Jam
T2   Butter, Coke
T3   Butter, Milk
T4   Bread, Butter, Coke
T5   Bread, Milk
T6   Butter, Milk
T7   Bread, Milk
T8   Bread, Butter, Milk, Jam
T9   Bread, Butter, Milk

TID  Bread  Butter  Milk  Coke  Jam
T1   1      1       0     0     1
T2   0      1       0     1     0
T3   0      1       1     0     0
T4   1      1       0     1     0
T5   1      0       1     0     0
T6   0      1       1     0     0
T7   1      0       1     0     0
T8   1      1       1     0     1
T9   1      1       1     0     0

k = 1, minimum support = 2

Item    TID set
Bread   {T1, T4, T5, T7, T8, T9}
Butter  {T1, T2, T3, T4, T6, T8, T9}
Milk    {T3, T5, T6, T7, T8, T9}
Coke    {T2, T4}
Jam     {T1, T8}

k = 2

Itemset          TID set
{Bread, Butter}  {T1, T4, T8, T9}
{Bread, Milk}    {T5, T7, T8, T9}
{Bread, Coke}    {T4}
{Bread, Jam}     {T1, T8}
{Butter, Milk}   {T3, T6, T8, T9}
{Butter, Coke}   {T2, T4}
{Butter, Jam}    {T1, T8}
{Milk, Jam}      {T8}

k = 3

Itemset                TID set
{Bread, Butter, Milk}  {T8, T9}
{Bread, Butter, Jam}   {T1, T8}
{Bread, Butter, Coke}  {T4}
{Bread, Milk, Jam}     {T8}

k = 4

Itemset                     TID set
{Bread, Butter, Milk, Jam}  {T8}
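A compact sketch of Eclat's TID-set intersections for this dataset. Unlike the tables above, it keeps only itemsets whose TID-set meets the minimum support of 2, so the single-TID rows do not appear in its output:

```python
from itertools import combinations

transactions = {
    "T1": {"Bread", "Butter", "Jam"},
    "T2": {"Butter", "Coke"},
    "T3": {"Butter", "Milk"},
    "T4": {"Bread", "Butter", "Coke"},
    "T5": {"Bread", "Milk"},
    "T6": {"Butter", "Milk"},
    "T7": {"Bread", "Milk"},
    "T8": {"Bread", "Butter", "Milk", "Jam"},
    "T9": {"Bread", "Butter", "Milk"},
}
min_support = 2

# k = 1: vertical layout, item -> set of transaction IDs that contain it
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(frozenset([item]), set()).add(tid)

# Keep frequent itemsets, then grow them by intersecting TID-sets
level = {i: t for i, t in tidsets.items() if len(t) >= min_support}
k = 1
while level:
    for itemset, tids in sorted(level.items(), key=lambda kv: sorted(kv[0])):
        print(f"k={k}  {sorted(itemset)} -> {sorted(tids)}")
    next_level = {}
    for (a, ta), (b, tb) in combinations(level.items(), 2):
        union = a | b
        if len(union) == k + 1:    # combine k-itemsets into (k+1)-itemsets
            tids = ta & tb         # support of the union = size of the intersection
            if len(tids) >= min_support:
                next_level[union] = tids
    level = next_level
    k += 1
```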

We stop at k = 4 because there are no more item/TID-set pairs to combine. With minimum support = 2, we conclude the following rules from the given dataset:

Item(s) Bought    Recommended Product
Bread             Butter
Bread             Milk
Bread             Jam
Butter            Milk
Butter            Jam
Bread and Butter  Milk
Bread and Butter  Jam

Eclat Algorithm Advantages/Disadvantages
Advantages:
Depth-first search reduces memory requirements.
Usually (considerably) faster than Apriori.
No need to scan the database to find the support of (k+1)-itemsets, for k >= 1.
Disadvantages:
The TID-sets can be quite long, and hence expensive to manipulate.

Thank You!