Frequent Pattern Analysis, Apriori and FP Growth Algorithm


Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No. 8275032712
[email protected]
Subject: Data Mining and Warehousing (CO314)
Unit V: Frequent Pattern Analysis

Content
Market Basket Analysis, frequent itemsets, closed itemsets, and association rules; mining multilevel association rules; constraint-based association rule mining
Generating association rules from frequent itemsets; Apriori algorithm; improving the efficiency of Apriori; FP-growth algorithm
Mining various kinds of association rules: mining multilevel association rules, constraint-based association rule mining, metarule-guided mining of association rules.

What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?—Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why is Frequent Pattern Analysis Important?
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications

Basic Concepts: Frequent Patterns
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

itemset: a set of one or more items
k-itemset: X = {x1, …, xk}
(absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X’s support is no less than a minsup threshold
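
The definitions above can be checked directly on the sample database. A minimal Python sketch (illustrative, not from the slides; the helper name is mine):

```python
# The five transactions from the table above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X = {"Beer", "Diaper"}
print(support_count(X, transactions))                      # 3 (absolute)
print(support_count(X, transactions) / len(transactions))  # 0.6 (relative)
# X is frequent for any minsup threshold of 60% or less.
```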

Basic Concepts: Association Rules
(Same transaction table and Venn diagram as on the previous slide.)

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (among many more):
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)

The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns:
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: three major approaches
Apriori (Agrawal & Srikant @ VLDB’94)
Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD’00)
Vertical data format approach (CHARM; Zaki & Hsiao @ SDM’02)
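
The property translates directly into a pruning test. A hypothetical helper (not from the slides) of the kind Apriori-style miners apply to each candidate:

```python
from itertools import combinations

def all_subsets_frequent(candidate, frequent_sets):
    """Downward closure check for a candidate k-itemset (k >= 2):
    every (k-1)-subset must already be known to be frequent."""
    return all(frozenset(s) in frequent_sets
               for s in combinations(candidate, len(candidate) - 1))
```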

Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: If there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB’94; Mannila et al. @ KDD’94)
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated

The Apriori Algorithm—An Example 1

The Apriori Algorithm—An Example 2

The Apriori Algorithm—An Example 3
A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all the strong association rules with support s and confidence c.

The Apriori Algorithm Pseudo-Code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
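
A runnable Python rendering of this pseudo-code (a sketch under the same notation; candidate generation is the self-join-and-prune step detailed on the next slide):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """transactions: list of sets; min_support: absolute count threshold.
    Returns {frozenset: absolute support} for all frequent itemsets."""
    def frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # one DB scan per level
            for c in counts:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {i for t in transactions for i in t}
    L = frequent([frozenset([i]) for i in items])   # L1
    result, k = dict(L), 1
    while L:
        prev = list(L)
        # self-join Lk: unions of two frequent k-itemsets of size k+1
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune by downward closure: every k-subset must be in Lk
        cands = [c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))]
        L = frequent(cands)
        result.update(L)
        k += 1
    return result
```

On the five-transaction table above, with an absolute min_support of 3, this returns Beer:3, Nuts:3, Diaper:4, Eggs:3, and {Beer, Diaper}:3, matching the earlier slide.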

Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
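
The same join-and-prune step as a standalone sketch (the helper name is hypothetical), reproducing the example above:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk with itself, then prune candidates that contain
    an infrequent k-subset (downward closure)."""
    k = len(next(iter(Lk)))
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(gen_candidates(L3))  # only abcd survives; acde is pruned (ade not in L3)
```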

Further Challenges and Improvement in Apriori
Major computational challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates

Improving the Efficiency of Apriori
Hash-based technique
Hashing itemsets into corresponding buckets
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1
Transaction reduction
Reducing the number of transactions scanned in future iterations
A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
Therefore, such a transaction can be marked or removed from further consideration, because subsequent database scans for j-itemsets, where j > k, will not need to consider it.
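
A sketch of the hash-based idea for pairs (the bucket count and hash function are my choices, in the spirit of the DHP technique; not from the slides): while performing the first scan for 1-itemsets, hash every 2-itemset of each transaction into a bucket; a pair whose bucket total falls below min_support cannot be frequent and need not enter C2.

```python
from itertools import combinations

def pair_bucket_counts(transactions, n_buckets=101):
    """Count all transaction pairs per hash bucket during the first scan.
    Any pair hashing to a bucket whose count < min_support can be
    excluded from the candidate 2-itemsets C2."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets
```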

Improving the Efficiency of Apriori
Partitioning
Partitioning the data to find candidate itemsets
Sampling
Mining on a subset of the given data
The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade some degree of accuracy for efficiency.
Dynamic itemset counting
Adding candidate itemsets at different points during a scan
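
The partitioning idea in miniature (assuming an apriori function like the one sketched earlier): any itemset frequent in the whole database must be frequent in at least one partition, so the union of local results is a candidate set that one final full scan can verify.

```python
def partition_candidates(transactions, n_parts, min_rel_support):
    """Phase 1 of partition-based mining: mine each partition locally.
    Returns a superset of the globally frequent itemsets."""
    candidates = set()
    for p in range(n_parts):
        part = transactions[p::n_parts]          # one partition
        local_min = max(1, int(min_rel_support * len(part)))
        candidates |= set(apriori(part, local_min))
    return candidates  # verify with one full database scan (phase 2)
```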

Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FP-Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project the DB on abc: DB|abc
If “d” is a local frequent item in DB|abc, then “abcd” is a frequent pattern
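
The grow-from-projections philosophy can be sketched without the FP-tree itself (illustrative only; a real FP-growth compresses the database into an FP-tree rather than materializing projected databases):

```python
def pattern_growth(db, min_support, suffix=frozenset()):
    """Recursively grow patterns: for each locally frequent item,
    extend the current suffix and recurse on the projected database.
    db: list of sets; min_support: absolute count."""
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    results = {}
    for item, cnt in counts.items():
        if cnt < min_support:
            continue
        pattern = suffix | {item}
        results[pattern] = cnt
        projected = [t - {item} for t in db if item in t]
        results.update(pattern_growth([t for t in projected if t],
                                      min_support, pattern))
    # (This sketch revisits some patterns; real implementations order
    #  items by frequency to avoid the redundant work.)
    return results
```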

Construct FP-tree from a Transaction Database Example 1

Construct FP-tree from a Transaction Database Example 1 cont…

Construct FP-tree from a Transaction Database Example 1 cont…

Construct FP-tree from a Transaction Database Example 1 cont…

Construct FP-tree from a Transaction Database Example 2

Mining Various Kinds of Association Rules: Mining
multilevel association rules
Pattern Mining in Multilevel, Multidimensional Space
Multilevel associations involve concepts at different abstraction levels.
Multidimensional associations involve more than one dimension or
predicate (e.g., rules that relate what a customer buys to his or her
age).
Quantitative association rules involve numeric attributes that have an
implicit ordering among values (e.g., age).

Mining Multilevel Associations
For many applications, strong associations discovered at high abstraction levels, though they have high support, may amount to commonsense knowledge.
We may want to drill down to find novel patterns at more detailed levels.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts
to a higher-level, more general concept set
Data can be generalized by replacing low-level concepts within the data by their
corresponding higher-level concepts, or ancestors, from a concept hierarchy.
Association rules generated from mining data at multiple abstraction levels are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.

Mining Multilevel Associations

Mining Multilevel Associations: Multilevel Mining with Uniform Support

Mining Multilevel Associations: Multilevel Mining with Reduced Support
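
A toy illustration of the two strategies shown in the figures above (the thresholds and helper are hypothetical): under uniform support, one minsup applies at every level of the concept hierarchy, while reduced support lowers the threshold at deeper, more specific levels.

```python
# Hypothetical minsup settings for a two-level concept hierarchy.
uniform = {1: 0.05, 2: 0.05}   # same minsup at every level
reduced = {1: 0.05, 2: 0.03}   # smaller minsup at the deeper level

def is_frequent(support, level, thresholds):
    return support >= thresholds[level]

# A general item like "milk" (level 1, support 10%) passes either way,
# but a specific one like "skim milk" (level 2, support 4%) survives
# only under reduced support.
print(is_frequent(0.04, 2, uniform))  # False
print(is_frequent(0.04, 2, reduced))  # True
```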

Constraint-Based Frequent Pattern Mining
Constraint-based association rule mining aims to develop a systematic method by which the user can find important associations among items in a database of transactions.
Users specify their intuition or expectations as constraints to confine the search space.
This strategy is known as constraint-based mining. The constraints can include the following:
Knowledge type constraints
Data constraints
Dimension/level constraints
Interestingness constraints
Rule constraints

Constraint-Based Frequent Pattern Mining
Knowledge type constraints: These specify the type of knowledge to be
mined, such as association, correlation, classification, or clustering.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or
attributes) of the data, the abstraction levels, or the level of the concept
hierarchies to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of
rule interestingness such as support, confidence, and correlation.

Constraint-Based Frequent Pattern Mining
Rule constraints: These specify the form of, or conditions on, the rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.
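
Rule constraints of this kind can be applied as a simple post-filter over mined rules (a hedged sketch; the rule representation and constraint values are assumptions, and real systems push such constraints deeper into the mining process):

```python
def satisfies_rule_constraints(antecedent, consequent,
                               max_antecedent_len=2,
                               required_consequent_item=None):
    """Example rule constraints: cap the number of items in the
    antecedent and optionally require an item in the consequent."""
    if len(antecedent) > max_antecedent_len:
        return False
    if (required_consequent_item is not None
            and required_consequent_item not in consequent):
        return False
    return True

# e.g., keep only rules whose consequent mentions "Diaper":
# satisfies_rule_constraints({"Beer"}, {"Diaper"},
#                            required_consequent_item="Diaper")  # True
```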

Reference
Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, ISBN: 9780123814791, 9780123814807.
https://onlinecourses.nptel.ac.in/noc24_cs22