Association and Correlation Analysis

567 views 61 slides Jun 04, 2024

UNIT 4: Association and Correlation analysis

Summary: What Is Data Mining? Data mining is the semi-automatic discovery of interesting patterns from large data sets. Knowledge discovery is a process: preprocessing, data mining, postprocessing.

Summary: Typical KDD Process [Diagram: raw data from an operational database is (1) selected (time-based selection) and cleaned, verified, and focused in preprocessing; (2) mined to produce results; (3) evaluated for interestingness in postprocessing, yielding selected usable patterns for utilization.]

Association Rules: Basics, Apriori Algorithm, FP Growth Algorithm, Examples, Applications

What Is an Itemset? A set of items together is called an itemset. An itemset containing k items is called a k-itemset. An itemset that occurs often in the data is called a frequent itemset. Frequent itemset mining is thus a data mining technique for identifying items that often occur together. For example: bread and butter, or a laptop and antivirus software.

What Is a Frequent Itemset? A set of items is called frequent if it satisfies minimum threshold values for support and confidence. Support measures how often the items are purchased together in a single transaction. Confidence measures how often the consequent items appear in transactions that already contain the antecedent items. Frequent itemset mining considers only those itemsets and rules that meet the minimum support and confidence thresholds. Insights from these mining algorithms offer many benefits, including cost-cutting and improved competitive advantage.

Frequent Pattern Mining (FPM) Frequent pattern mining is one of the most important data mining techniques for discovering relationships between different items in a dataset. FPM has many applications in data analysis, software bug detection, cross-marketing, sales campaign analysis, market basket analysis, and more. Frequent itemsets discovered through Apriori support many data mining tasks, such as finding interesting patterns in a database and finding sequences; mining association rules is the most important of them. Association rules are often applied to supermarket transaction data to examine customer behavior in terms of the products purchased; they describe how often items are purchased together.

Market Basket Analysis This is the most typical example of association mining. Data is collected using barcode scanners in most supermarkets. This database, known as the "market basket" database, consists of a large number of records of past transactions. A single record lists all the items bought by a customer in one sale. Knowing which customer groups are inclined towards which sets of items gives these shops the freedom to adjust the store layout and the store catalogue so that items are placed optimally with respect to one another.

When people buy green tea, they may also buy honey with it. This relationship can be written as a conditional rule: IF {green tea} THEN {honey}. It states that the item on the right is more likely to be bought together with the item on the left. Market basket analysis in data mining helps us understand such relationships and how useful it would be to alter our decisions based on the analysis.

Association Rules: Basics Typical representation formats for association rules: Green Tea => Honey [0.5%, 60%], or buys: Green Tea => buys: Honey [0.5%, 60%]. Reading: "Green Tea and Honey are bought together in 0.5% of the rows in the database" and "IF a customer buys Green Tea, THEN they buy Honey in 60% of the cases."

Association Rules: Basics Green Tea => Honey [0.5%, 60%]. (1) Antecedent: left-hand side (LHS), body. (2) Consequent: right-hand side (RHS), head. (3) Support, frequency: "in how big a part of the data the things on the left- and right-hand sides occur together". (4) Confidence, strength: "if the left-hand side occurs, how likely the right-hand side also occurs". Reading: "IF a customer buys Green Tea, THEN they buy Honey in 60% of the cases, in 0.5% of the rows."

Association Rules: Basics For a rule A => B: Support is a measure of how frequently an itemset appears in a dataset; it indicates the proportion of transactions or records in the dataset that contain the itemset. Confidence denotes the probability that a transaction containing A also contains B.

Association rule measures:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence(X => Y) = (Number of transactions containing X and Y) / (Number of transactions containing X)
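The two formulas above translate directly into code. The sketch below is a minimal illustration; the function names and the basket data are made up for the example:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)

baskets = [
    ["green tea", "honey"],
    ["green tea", "bread"],
    ["honey"],
    ["green tea", "honey", "bread"],
]
print(support(["green tea", "honey"], baskets))       # co-occur in 2 of 4 baskets
print(confidence(["green tea"], ["honey"], baskets))  # honey in 2 of 3 green-tea baskets
```

On this toy data, Green Tea => Honey has support 0.5 and confidence 2/3, which would be written Green Tea => Honey [50%, 67%].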

Apriori Algorithm The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It calculates the association rules between objects, i.e., how two or more objects are related to one another. The algorithm uses two steps, "join" and "prune", to reduce the search space, and takes an iterative approach to discover the most frequent itemsets. In other words, Apriori is an association rule learning algorithm that analyses patterns such as "people who bought product A also bought product B".

Apriori Algorithm Method: (1) Initially, scan the DB once to get the frequent 1-itemsets. (2) Generate length-(k+1) candidate itemsets from the length-k frequent itemsets. (3) Test the candidates against the minimum support. (4) Terminate when no frequent or candidate set can be generated.

Apriori Algorithm First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.
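The level-wise procedure above can be sketched as follows. This is a simplified illustration of the join/prune/count loop, not the authors' original pseudocode; the function name and data layout are assumptions made for the example:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: join, prune, count, repeat."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(candidate):
        # one pass over the database per level (done here per candidate
        # for simplicity; real implementations batch this)
        return sum(1 for t in transactions if candidate <= t)

    # L1: frequent 1-itemsets
    frequent = {}
    for i in items:
        c = frozenset([i])
        n = count(c)
        if n >= min_support:
            frequent[c] = n

    all_frequent = dict(frequent)
    k = 1
    while frequent:
        prev = list(frequent)
        # join step: merge frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune step: a candidate survives only if all its k-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {}
        for c in candidates:
            n = count(c)
            if n >= min_support:
                frequent[c] = n
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

For example, on `[["bread", "butter"], ["bread", "milk"], ["bread", "butter", "milk"], ["butter"]]` with a minimum support count of 2, the loop finds three frequent 1-itemsets and two frequent 2-itemsets, and {butter, milk} is rejected because it appears only once.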


Apriori Algorithm – Example 2 Find the frequent itemsets and generate association rules. Assume a minimum support threshold (s = 33.33%) and a minimum confidence threshold (c = 60%).

Apriori Algorithm – Example 2 Let's start. [The transaction table and the candidate/frequent itemset tables for this example were shown as images on the slides.]

There is only one itemset with minimum support 2. So only one itemset is frequent.

Same example with minimum support = 3: Frequent Itemset (I) = {Coke, Chips}

Apriori Algorithm Advantages of the Apriori algorithm: It can find large itemsets. It is simple to understand and apply. Disadvantages of the Apriori algorithm: It is an expensive method for finding support, since the calculation has to pass through the whole database. Sometimes a huge number of candidate rules is needed, so it becomes computationally more expensive.

Bottleneck of Frequent-Pattern Mining Multiple database scans are costly. Mining long patterns needs many passes of scanning and generates lots of candidates: to find the frequent itemset i1 i2 ... i100, the number of scans is 100. The bottleneck is candidate generation and test. Can we avoid candidate generation?

What Is the FP Growth Algorithm? The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.

The FP Growth algorithm in data mining is a popular method for frequent pattern mining, and is efficient for mining frequent itemsets in large datasets. It works by constructing a frequent-pattern tree (FP-tree) from the input dataset. The algorithm was developed by Han in 2000 and is a powerful tool for frequent pattern mining, widely used in applications such as market basket analysis, bioinformatics, and web usage mining.

Process of FP Growth FP-Growth allows frequent itemset discovery without candidate itemset generation, using a two-step approach: Step 1: Build a compact data structure called the FP-tree, using 2 passes over the dataset. Step 2: Extract frequent itemsets directly from the FP-tree.

Working of the FP Growth Algorithm The working of the FP Growth algorithm in data mining can be summarized in the following steps: Scan the database: the algorithm scans the input dataset to determine the frequency of each item. This determines the order in which items are added to the FP-tree, with the most frequent items added first. Sort items: the items in the dataset are sorted in descending order of frequency, and infrequent items that do not meet the minimum support threshold are removed. This reduces the dataset's size and improves the algorithm's efficiency.

Construct the FP-tree: In this step, the FP-tree is constructed. The FP-tree is a compact data structure that stores the frequent itemsets and their support counts. Generate frequent itemsets: Once the FP-tree has been constructed, frequent itemsets can be generated by recursively mining the tree. Starting at the bottom of the tree, the algorithm finds all combinations of frequent itemsets that satisfy the minimum support threshold. Generate association rules: Once all frequent itemsets have been generated, the algorithm post-processes them to generate association rules, which can be used to identify interesting relationships between the items in the dataset.
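The recursive, divide-and-conquer mining described above can be illustrated with a simplified sketch. Note this version projects conditional (reduced) databases directly rather than building an explicit FP-tree, so it demonstrates the recursion, not the tree data structure; the function name and data are assumptions for the example:

```python
from collections import defaultdict

def mine(db, min_support, prefix=()):
    """Depth-first frequent-itemset mining over conditional (projected) databases."""
    counts = defaultdict(int)
    for t in db:
        for item in t:              # transactions hold each item at most once
            counts[item] += 1
    # fix a processing order: descending frequency, ties broken by name
    order = sorted((i for i, c in counts.items() if c >= min_support),
                   key=lambda i: (-counts[i], i))
    result = {}
    for k, item in enumerate(order):
        itemset = prefix + (item,)
        result[frozenset(itemset)] = counts[item]
        # project: keep only transactions containing `item`, restricted to
        # items later in the order, so no itemset is generated twice
        later = set(order[k + 1:])
        cond_db = [[j for j in t if j in later] for t in db if item in t]
        result.update(mine([t for t in cond_db if t], min_support, itemset))
    return result
```

Each recursive call mines a smaller conditional database, which is the same divide-and-conquer idea the FP-tree makes memory-efficient by sharing prefixes between transactions.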

Now, for each item, the Conditional Pattern Base is computed: the path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. [The slide showed a table of the conditional pattern bases for the items Y, M and O, arranged in ascending order of their frequencies.]

Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements common to all paths in that item's Conditional Pattern Base and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item [as given in the table shown on the slide].
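The final rule-generation step works the same way regardless of whether the frequent itemsets came from Apriori or FP-Growth: split each frequent itemset into an antecedent and a consequent and keep the splits whose confidence meets the threshold. A minimal sketch, using made-up toy counts:

```python
from itertools import combinations

def rules_from_itemsets(freq, min_conf):
    """Derive A => B rules; confidence = count(A and B) / count(A)."""
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue                      # rules need a non-empty A and B
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / freq[lhs]  # support counts cancel the total
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset) - lhs, conf))
    return rules

# hypothetical support counts from a mining run
freq = {
    frozenset(["green tea"]): 3,
    frozenset(["honey"]): 3,
    frozenset(["green tea", "honey"]): 2,
}
for lhs, rhs, conf in rules_from_itemsets(freq, 0.6):
    print(lhs, "=>", rhs, round(conf, 2))
```

With these toy counts both {green tea} => {honey} and {honey} => {green tea} pass the 60% confidence threshold, each with confidence 2/3.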

Advantages of FP Growth Algorithm The FP Growth algorithm in data mining has several advantages over other frequent itemset mining algorithms, as mentioned below: Efficiency: FP Growth algorithm is faster and more memory-efficient than other frequent itemset mining algorithms such as Apriori , especially on large datasets with high dimensionality. This is because it generates frequent itemsets by constructing the FP-Tree, which compresses the database and requires only two scans. Scalability: FP Growth algorithm scales well with increasing database size and itemset dimensionality, making it suitable for mining frequent itemsets in large datasets. Resistant to noise: FP Growth algorithm is more resistant to noise in the data than other frequent itemset mining algorithms, as it generates only frequent itemsets and ignores infrequent itemsets that may be caused by noise. Parallelization: FP Growth algorithm can be easily parallelized, making it suitable for distributed computing environments and allowing it to take advantage of multi-core processors.

Disadvantages of FP Growth Algorithm While the FP Growth algorithm in data mining has several advantages, it also has some limitations and disadvantages, as mentioned below: Memory consumption: Although the FP Growth algorithm is more memory-efficient than other frequent itemset mining algorithms, storing the FP-Tree and the conditional pattern bases can still require a significant amount of memory, especially for large datasets. Complex implementation: The FP Growth algorithm is more complex than other frequent itemset mining algorithms, making it more difficult to understand and implement.

Applications of Association Rules Association rule learning is an unsupervised learning method that tests for the dependence of one data element on another. It tries to discover interesting relations among the variables of a dataset, relying on several rule measures to find them.

Applications of Association Rules There are various applications of association rules, such as the following: Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the next product that customers are likely to buy. Optional services purchased by telecom users (call waiting, call forwarding, DSL, speed call, etc.) help decide how to bundle these functions to maximize revenue.

Banking services used by retail users (money market accounts, CDs, investment services, car loans, etc.) help recognize users likely to need other services. Unusual groups of insurance claims can be an indication of fraud and can spark further investigation. Medical patient histories can support predictions of likely complications based on particular sets of treatments.