Frequent Item Mining - Data Mining - Pattern Mining

JasonPulikkottil 30 views 41 slides Jun 22, 2024

About This Presentation

What is data mining?
Pattern Mining
What patterns
Why are they useful


Slide Content

Frequent Item Mining

What is data mining?
• Pattern Mining
• What patterns
• Why are they useful

Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
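Using the transaction table above, the two measures can be checked with a short Python sketch (the helper names `support_count` and `support` are my own, not from the slides):

```python
# A small sketch of the definitions above; the helper names are my own.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Support count sigma(X): number of transactions containing X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Support s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```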

Frequent Itemsets Mining

TID Transactions
100 { A, B, E }
200 { B, D }
300 { A, B, E }
400 { A, C }
500 { B, C }
600 { A, C }
700 { A, B }
800 { A, B, C, E }
900 { A, B, C }
1000 { A, C, E }

• Minimum support level: 50%
  – Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
• How to link this to a Data Cube?
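The 50% example can be verified by brute force (a sketch for checking the answer, not an efficient miner):

```python
from itertools import combinations

# Brute-force check of the minsup = 50% example: enumerate every
# candidate itemset and keep those whose support reaches the threshold.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    {"A", "C", "E"},
]
minsup = 0.5
items = sorted(set().union(*transactions))

frequent = set()
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count / len(transactions) >= minsup:
            frequent.add(frozenset(cand))

print(sorted(tuple(sorted(f)) for f in frequent))
# [('A',), ('A', 'B'), ('A', 'C'), ('B',), ('C',)]
```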

Three Different Views of FIM
• Transactional Database
  – How do we store a transactional database?
    • Horizontal, Vertical, Transaction-Item Pair
• Binary Matrix
• Bipartite Graph
• How is FIM formulated in these different settings?

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Frequent Itemset Generation

[Itemset lattice figure: null at the top; below it the 1-itemsets A, B,
C, D, E, then the 2-itemsets AB ... DE, the 3-itemsets ABC ... CDE, the
4-itemsets ABCD ... BCDE, and finally ABCDE]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database:
    match each of the N transactions against every one of the M
    candidates, where w is the maximum transaction width
  – Complexity ~ O(NMw) => expensive, since M = 2^d !!!

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:
  – ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
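The anti-monotone property can be sanity-checked exhaustively on the small market-basket example (an illustrative script, not part of the original slides):

```python
from itertools import combinations

# Exhaustively verify the anti-monotone property on the 5-transaction
# example: for every pair X ⊆ Y, s(X) >= s(Y) must hold.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
violations = 0
for k in range(1, len(items) + 1):
    for y in combinations(items, k):
        for j in range(1, k):
            for x in combinations(y, j):
                if support(set(x)) < support(set(y)):
                    violations += 1
print(violations)  # 0
```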

Illustrating Apriori Principle

[Itemset lattice figure: one itemset is found to be infrequent, and all
of its supersets in the lattice are pruned without being counted]

Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item   Count
Bread  4
Coke   2
Milk   4
Beer   3
Diaper 4
Eggs   1

Pairs (2-itemsets), no need to generate candidates involving Coke or Eggs:
Itemset        Count
{Bread,Milk}   3
{Bread,Beer}   2
{Bread,Diaper} 3
{Milk,Beer}    2
{Milk,Diaper}  3
{Beer,Diaper}  3

Triplets (3-itemsets):
Itemset             Count
{Bread,Milk,Diaper} 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41
With support-based pruning: 6 + 6 + 1 = 13
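The candidate counts above follow from simple binomial arithmetic, e.g. in Python:

```python
import math

# Without pruning: all 1-, 2- and 3-itemsets over 6 items are candidates.
total = math.comb(6, 1) + math.comb(6, 2) + math.comb(6, 3)
print(total)  # 41

# With support-based pruning: Coke and Eggs are dropped after the first
# pass, and only one triplet survives candidate generation.
pruned = 6 + 6 + 1
print(pruned)  # 13
```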

Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994

How to Generate Candidates?
• Suppose the items in L_{k-1} are listed in an order
• Step 1: self-joining L_{k-1}

  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2},
        p.item_{k-1} < q.item_{k-1}

• Step 2: pruning

  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k
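The two steps translate to Python roughly as follows (the function name `apriori_gen` and the tuple-based representation are my own choices, not from the slides):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    L_prev: set of frozensets of size k-1. Items are kept as sorted
    tuples so the join condition (equal first k-2 items, last item of
    p smaller than last item of q) is easy to express.
    """
    sorted_prev = sorted(tuple(sorted(s)) for s in L_prev)
    candidates = set()
    # Step 1: self-join L_{k-1} with itself.
    for i, p in enumerate(sorted_prev):
        for q in sorted_prev[i + 1:]:
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                candidates.add(frozenset(p) | frozenset(q))
    # Step 2: prune candidates that have an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = {frozenset(x) for x in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))  # [['A', 'B', 'C']]
```

Here {B,C,D} is produced by the join of {B,C} and {B,D} but pruned, because its subset {C,D} is not in L2.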

Challenges of Frequent Itemset Mining
•Challenges
–Multiple scans of transaction database
–Huge number of candidates
–Tedious workload of support counting for candidates
•Improving Apriori: general ideas
–Reduce passes of transaction database scans
–Shrink number of candidates
–Facilitate support counting of candidates

Alternative Methods for Frequent Itemset Generation
• Representation of Database
  – horizontal vs vertical data layout

Horizontal Data Layout
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B

Vertical Data Layout (one tid-list per item)
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

ECLAT
• For each item, store a list of transaction ids (tids): its TID-list

Horizontal Data Layout
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B

Vertical Data Layout (TID-lists)
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

ECLAT
• Determine support of any k-itemset by intersecting the tid-lists of
  two of its (k-1)-subsets
• 3 traversal approaches:
  – top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory

Example: A: {1,4,5,6,7,8,9} ∩ B: {1,2,5,7,8,10} → AB: {1,5,7,8}
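The intersection step from the example can be written directly with Python sets:

```python
# Tid-list intersection as in the ECLAT example above: the support of a
# k-itemset is the size of the intersection of the tid-lists of two of
# its (k-1)-subsets.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}
ab = tidlists["A"] & tidlists["B"]
print(sorted(ab))  # [1, 5, 7, 8]
print(len(ab))     # support count of {A,B} = 4
```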

FP-growth Algorithm
•Use a compressed representation of the
database using an FP-tree
•Once an FP-tree has been constructed, it uses
a recursive divide-and-conquer approach to
mine the frequent itemsets

FP-tree construction

Transaction Database
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  null → A:1 → B:1, plus a second branch
                      null → B:1 → C:1 → D:1

FP-Tree Construction

Reading the whole transaction database (the 10 transactions above)
yields the tree below. A header table (Item → Pointer, for A, B, C, D,
E) links the occurrences of each item across the tree; these pointers
are used to assist frequent itemset generation.

null
├─ A:7
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:3
   └─ C:3
      ├─ D:1
      └─ E:1

FP-growth

Conditional pattern base for D (the prefix paths of D in the FP-tree):
P = {(A:1,B:1,C:1),
     (A:1,B:1),
     (A:1,C:1),
     (A:1),
     (B:1,C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with sup > 1):
AD, BD, CD, ABD, ACD, BCD
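As a cross-check (a brute-force pass over the database rather than the recursive FP-growth itself), counting prefixes inside the transactions that contain D also surfaces ABD, whose support is 2:

```python
from itertools import combinations

# Restrict to the transactions containing D, then count every candidate
# prefix over {A, B, C}; a prefix with count > 1 gives a frequent
# itemset ending in D.
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]
d_trans = [t for t in transactions if "D" in t]

frequent_with_d = set()
for k in (1, 2, 3):
    for prefix in combinations("ABC", k):
        count = sum(1 for t in d_trans if set(prefix) <= t)
        if count > 1:
            frequent_with_d.add("".join(prefix) + "D")

print(sorted(frequent_with_d))
# ['ABD', 'ACD', 'AD', 'BCD', 'BD', 'CD']
```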

Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as
  their supersets
• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
• Need a compact representation

[Table: a 15 × 30 binary matrix; transactions 1-5 contain exactly the
items A1-A10, transactions 6-10 exactly B1-B10, and transactions 11-15
exactly C1-C10]
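The count works out as follows: each of the three blocks of 10 items contributes every non-empty subset as a frequent itemset, so the total is 3 × Σ C(10,k) = 3 × (2^10 − 1):

```python
import math

# Every non-empty subset of each 10-item block is frequent, and there
# are three disjoint blocks (A1-A10, B1-B10, C1-C10).
total = 3 * sum(math.comb(10, k) for k in range(1, 11))
print(total)  # 3069
```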

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is
frequent.

[Itemset lattice figure: a border separates the frequent itemsets from
the infrequent ones; the maximal itemsets are the frequent itemsets
lying directly on the frequent side of the border]

Closed Itemset
• An itemset is closed if none of its immediate supersets has the same
  support as the itemset

TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}

Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2

Maximal vs Closed Itemsets

TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE

[Itemset lattice figure: each itemset is annotated with the ids of the
transactions that contain it; itemsets not supported by any transaction
are marked]

Maximal vs Closed Frequent Itemsets

Minimum support = 2
# Closed = 9
# Maximal = 4

[Same lattice, with the closed itemsets highlighted; every maximal
itemset is also closed ("closed and maximal"), while the remaining
closed itemsets are "closed but not maximal"]
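The closed and maximal counts can be confirmed by brute force (an illustrative check, not an efficient algorithm):

```python
from itertools import combinations

# Brute-force check of the slide's counts with minimum support = 2:
# closed  = no immediate superset with the same support,
# maximal = no frequent immediate superset.
transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
items = sorted(set().union(*transactions))

def sup(x):
    return sum(1 for t in transactions if x <= t)

frequent = {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(set(c)) >= 2}

closed = {f for f in frequent
          if all(sup(f | {i}) < sup(f) for i in items if i not in f)}
maximal = {f for f in frequent
           if all(f | {i} not in frequent for i in items if i not in f)}

print(len(closed), len(maximal))  # 9 4
```

Note that every maximal itemset also appears in `closed`, matching the containment on the next slide.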

Maximal vs Closed Itemsets

Frequent Itemsets ⊇ Closed Frequent Itemsets ⊇ Maximal Frequent Itemsets

Beyond Itemsets
•Sequence Mining
–Finding frequent subsequences from a collection of sequences
•Graph Mining
–Finding frequent (connected) subgraphs from a collection of
graphs
•Tree Mining
–Finding frequent (embedded) subtrees from a set of
trees/graphs
•Geometric Structure Mining
–Finding frequent substructures from 3-D or 2-D geometric
graphs
•Among others…

Frequent Pattern Mining

[Figure: a collection of labeled graphs over vertices A-F, with a
recurring subgraph on the vertices A, B, C, D highlighted in each graph]

Why Is Frequent Pattern Mining So Important?
• Application Domains
  – Business, biology, chemistry, WWW, computer/network security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
  – Association rule mining
  – Classification
  – Clustering
  – Change Detection
  – etc…

Network motifs: recurring patterns that occur significantly more often
than in randomized nets
• Do motifs have specific roles in the network?
• Many possible distinct subgraphs

There are 13 three-node connected subgraphs and 199 4-node directed
connected subgraphs, and the count grows fast for larger subgraphs:
9,364 5-node subgraphs, 1,530,843 6-node…

Finding network motifs – an overview
• Generation of a suitable random ensemble (reference networks)
• Network motif detection process:
  – Count how many times each subgraph appears
  – Compute the statistical significance of each subgraph: the
    probability of appearing in the random ensemble as often as in the
    real network (P-value or Z-score)

Example: Real = 5, Rand = 0.5 ± 0.6 over the ensemble of networks;
Z-score (# standard deviations) = 7.5
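The Z-score in the example is just the number of standard deviations the real count lies above the ensemble mean:

```python
# Z-score for the motif example: real count vs random-ensemble statistics.
real = 5
rand_mean, rand_std = 0.5, 0.6
z = (real - rand_mean) / rand_std
print(z)  # 7.5
```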

References
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
  between sets of items in large databases. SIGMOD, 207-216, 1993.
• R. Agrawal and R. Srikant. Fast algorithms for mining association
  rules. VLDB, 487-499, 1994.
• R. J. Bayardo. Efficiently mining long patterns from databases.
  SIGMOD, 85-93, 1998.

References (continued):
• Christian Borgelt. Efficient Implementations of Apriori and Eclat.
  FIMI'03.
• Ferenc Bodon. A fast APRIORI implementation. FIMI'03.
• Ferenc Bodon. A Survey on Frequent Itemset Mining. Technical Report,
  Budapest University of Technology and Economics, 2006.

Important websites:
• FIMI workshop
  – Not only Apriori and FIM
    • FP-tree, ECLAT, Closed, Maximal
  – http://fimi.cs.helsinki.fi/
• Christian Borgelt's website
  – http://www.borgelt.net/software.html
• Ferenc Bodon's website
  – http://www.cs.bme.hu/~bodon/en/apriori/