Frequent Item Mining - Data Mining - Pattern Mining

JasonPulikkottil 30 views 41 slides Jun 22, 2024

About This Presentation

What is data mining?
Pattern Mining
What patterns
Why are they useful


Slide Content

Frequent Item Mining

What is data mining?
• Pattern Mining
• What patterns
• Why are they useful

Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
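Using the transaction table above, the two measures can be checked with a short Python sketch (the helper names `support_count` and `support` are my own, not from the slides):

```python
# A small sketch of the definitions above; the helper names are my own.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Support count sigma(X): number of transactions containing X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Support s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```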

Frequent Itemsets Mining

TID Transactions
100 { A, B, E }
200 { B, D }
300 { A, B, E }
400 { A, C }
500 { B, C }
600 { A, C }
700 { A, B }
800 { A, B, C, E }
900 { A, B, C }
1000 { A, C, E }

• Minimum support level: 50%
  – Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
• How to link this to a Data Cube?
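The 50% example can be verified by brute force (a sketch for checking the answer, not an efficient miner):

```python
from itertools import combinations

# Brute-force check of the minsup = 50% example: enumerate every
# candidate itemset and keep those whose support reaches the threshold.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    {"A", "C", "E"},
]
minsup = 0.5
items = sorted(set().union(*transactions))

frequent = set()
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count / len(transactions) >= minsup:
            frequent.add(frozenset(cand))

print(sorted(tuple(sorted(f)) for f in frequent))
# [('A',), ('A', 'B'), ('A', 'C'), ('B',), ('C',)]
```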

Three Different Views of FIM
• Transactional Database
  – How do we store a transactional database?
    • Horizontal, Vertical, Transaction-Item Pair
• Binary Matrix
• Bipartite Graph
• How is FIM formulated in these different settings?

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Frequent Itemset Generation

[Itemset lattice figure: null at the top; below it the 1-itemsets A, B,
C, D, E, then the 2-itemsets AB ... DE, the 3-itemsets ABC ... CDE, the
4-itemsets ABCD ... BCDE, and finally ABCDE]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database:
    match each of the N transactions against every one of the M
    candidates, where w is the maximum transaction width
  – Complexity ~ O(NMw) => expensive, since M = 2^d !!!

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:
  – ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
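The anti-monotone property can be sanity-checked exhaustively on the small market-basket example (an illustrative script, not part of the original slides):

```python
from itertools import combinations

# Exhaustively verify the anti-monotone property on the 5-transaction
# example: for every pair X ⊆ Y, s(X) >= s(Y) must hold.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
violations = 0
for k in range(1, len(items) + 1):
    for y in combinations(items, k):
        for j in range(1, k):
            for x in combinations(y, j):
                if support(set(x)) < support(set(y)):
                    violations += 1
print(violations)  # 0
```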

Illustrating Apriori Principle

[Itemset lattice figure: one itemset is found to be infrequent, and all
of its supersets in the lattice are pruned without being counted]

Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item   Count
Bread  4
Coke   2
Milk   4
Beer   3
Diaper 4
Eggs   1

Pairs (2-itemsets), no need to generate candidates involving Coke or Eggs:
Itemset        Count
{Bread,Milk}   3
{Bread,Beer}   2
{Bread,Diaper} 3
{Milk,Beer}    2
{Milk,Diaper}  3
{Beer,Diaper}  3

Triplets (3-itemsets):
Itemset             Count
{Bread,Milk,Diaper} 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41
With support-based pruning: 6 + 6 + 1 = 13
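The candidate counts above follow from simple binomial arithmetic, e.g. in Python:

```python
import math

# Without pruning: all 1-, 2- and 3-itemsets over 6 items are candidates.
total = math.comb(6, 1) + math.comb(6, 2) + math.comb(6, 3)
print(total)  # 41

# With support-based pruning: Coke and Eggs are dropped after the first
# pass, and only one triplet survives candidate generation.
pruned = 6 + 6 + 1
print(pruned)  # 13
```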

Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994

How to Generate Candidates?
• Suppose the items in L_{k-1} are listed in an order
• Step 1: self-joining L_{k-1}

  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2},
        p.item_{k-1} < q.item_{k-1}

• Step 2: pruning

  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k
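The two steps translate to Python roughly as follows (the function name `apriori_gen` and the tuple-based representation are my own choices, not from the slides):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    L_prev: set of frozensets of size k-1. Items are kept as sorted
    tuples so the join condition (equal first k-2 items, last item of
    p smaller than last item of q) is easy to express.
    """
    sorted_prev = sorted(tuple(sorted(s)) for s in L_prev)
    candidates = set()
    # Step 1: self-join L_{k-1} with itself.
    for i, p in enumerate(sorted_prev):
        for q in sorted_prev[i + 1:]:
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                candidates.add(frozenset(p) | frozenset(q))
    # Step 2: prune candidates that have an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = {frozenset(x) for x in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))  # [['A', 'B', 'C']]
```

Here {B,C,D} is produced by the join of {B,C} and {B,D} but pruned, because its subset {C,D} is not in L2.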

Challenges of Frequent Itemset Mining
•Challenges
–Multiple scans of transaction database
–Huge number of candidates
–Tedious workload of support counting for candidates
•Improving Apriori: general ideas
–Reduce passes of transaction database scans
–Shrink number of candidates
–Facilitate support counting of candidates

Alternative Methods for Frequent Itemset Generation
• Representation of Database
  – horizontal vs vertical data layout

Horizontal Data Layout
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B

Vertical Data Layout (one tid-list per item)
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

ECLAT
• For each item, store a list of transaction ids (tids): its TID-list

Horizontal Data Layout
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B

Vertical Data Layout (TID-lists)
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

ECLAT
• Determine support of any k-itemset by intersecting the tid-lists of
  two of its (k-1)-subsets
• 3 traversal approaches:
  – top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory

Example: A: {1,4,5,6,7,8,9} ∩ B: {1,2,5,7,8,10} → AB: {1,5,7,8}
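The intersection step from the example can be written directly with Python sets:

```python
# Tid-list intersection as in the ECLAT example above: the support of a
# k-itemset is the size of the intersection of the tid-lists of two of
# its (k-1)-subsets.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}
ab = tidlists["A"] & tidlists["B"]
print(sorted(ab))  # [1, 5, 7, 8]
print(len(ab))     # support count of {A,B} = 4
```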

FP-growth Algorithm
•Use a compressed representation of the
database using an FP-tree
•Once an FP-tree has been constructed, it uses
a recursive divide-and-conquer approach to
mine the frequent itemsets

FP-tree construction

Transaction Database
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  null → A:1 → B:1, plus a second branch
                      null → B:1 → C:1 → D:1

FP-Tree Construction

Reading the whole transaction database (the 10 transactions above)
yields the tree below. A header table (Item → Pointer, for A, B, C, D,
E) links the occurrences of each item across the tree; these pointers
are used to assist frequent itemset generation.

null
├─ A:7
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:3
   └─ C:3
      ├─ D:1
      └─ E:1

FP-growth

Conditional pattern base for D (the prefix paths of D in the FP-tree):
P = {(A:1,B:1,C:1),
     (A:1,B:1),
     (A:1,C:1),
     (A:1),
     (B:1,C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with sup > 1):
AD, BD, CD, ABD, ACD, BCD
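As a cross-check (a brute-force pass over the database rather than the recursive FP-growth itself), counting prefixes inside the transactions that contain D also surfaces ABD, whose support is 2:

```python
from itertools import combinations

# Restrict to the transactions containing D, then count every candidate
# prefix over {A, B, C}; a prefix with count > 1 gives a frequent
# itemset ending in D.
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]
d_trans = [t for t in transactions if "D" in t]

frequent_with_d = set()
for k in (1, 2, 3):
    for prefix in combinations("ABC", k):
        count = sum(1 for t in d_trans if set(prefix) <= t)
        if count > 1:
            frequent_with_d.add("".join(prefix) + "D")

print(sorted(frequent_with_d))
# ['ABD', 'ACD', 'AD', 'BCD', 'BD', 'CD']
```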

Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as
  their supersets
• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
• Need a compact representation

[Table: a 15 × 30 binary matrix; transactions 1-5 contain exactly the
items A1-A10, transactions 6-10 exactly B1-B10, and transactions 11-15
exactly C1-C10]
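The count works out as follows: each of the three blocks of 10 items contributes every non-empty subset as a frequent itemset, so the total is 3 × Σ C(10,k) = 3 × (2^10 − 1):

```python
import math

# Every non-empty subset of each 10-item block is frequent, and there
# are three disjoint blocks (A1-A10, B1-B10, C1-C10).
total = 3 * sum(math.comb(10, k) for k in range(1, 11))
print(total)  # 3069
```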

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is
frequent.

[Itemset lattice figure: a border separates the frequent itemsets from
the infrequent ones; the maximal itemsets are the frequent itemsets
lying directly on the frequent side of the border]

Closed Itemset
• An itemset is closed if none of its immediate supersets has the same
  support as the itemset

TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}

Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2

Maximal vs Closed Itemsets

TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE

[Itemset lattice figure: each itemset is annotated with the ids of the
transactions that contain it; itemsets not supported by any transaction
are marked]

Maximal vs Closed Frequent Itemsets

Minimum support = 2
# Closed = 9
# Maximal = 4

[Same lattice, with the closed itemsets highlighted; every maximal
itemset is also closed ("closed and maximal"), while the remaining
closed itemsets are "closed but not maximal"]
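The closed and maximal counts can be confirmed by brute force (an illustrative check, not an efficient algorithm):

```python
from itertools import combinations

# Brute-force check of the slide's counts with minimum support = 2:
# closed  = no immediate superset with the same support,
# maximal = no frequent immediate superset.
transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
items = sorted(set().union(*transactions))

def sup(x):
    return sum(1 for t in transactions if x <= t)

frequent = {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(set(c)) >= 2}

closed = {f for f in frequent
          if all(sup(f | {i}) < sup(f) for i in items if i not in f)}
maximal = {f for f in frequent
           if all(f | {i} not in frequent for i in items if i not in f)}

print(len(closed), len(maximal))  # 9 4
```

Note that every maximal itemset also appears in `closed`, matching the containment on the next slide.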

Maximal vs Closed Itemsets

Frequent Itemsets ⊇ Closed Frequent Itemsets ⊇ Maximal Frequent Itemsets

Beyond Itemsets
•Sequence Mining
–Finding frequent subsequences from a collection of sequences
•Graph Mining
–Finding frequent (connected) subgraphs from a collection of
graphs
•Tree Mining
–Finding frequent (embedded) subtrees from a set of
trees/graphs
•Geometric Structure Mining
–Finding frequent substructures from 3-D or 2-D geometric
graphs
•Among others…

Frequent Pattern Mining

[Figure: a collection of labeled graphs over vertices A-F, with a
recurring subgraph on the vertices A, B, C, D highlighted in each graph]

Why Is Frequent Pattern Mining So Important?
• Application Domains
  – Business, biology, chemistry, WWW, computer/network security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
  – Association rule mining
  – Classification
  – Clustering
  – Change Detection
  – etc…

Network motifs: recurring patterns that occur significantly more often
than in randomized nets
• Do motifs have specific roles in the network?
• Many possible distinct subgraphs

There are 13 three-node connected subgraphs and 199 4-node directed
connected subgraphs, and the count grows fast for larger subgraphs:
9,364 5-node subgraphs, 1,530,843 6-node…

Finding network motifs – an overview
• Generation of a suitable random ensemble (reference networks)
• Network motif detection process:
  – Count how many times each subgraph appears
  – Compute the statistical significance of each subgraph: the
    probability of appearing in the random ensemble as often as in the
    real network (P-value or Z-score)

Example: Real = 5, Rand = 0.5 ± 0.6 over the ensemble of networks;
Z-score (# standard deviations) = 7.5
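The Z-score in the example is just the number of standard deviations the real count lies above the ensemble mean:

```python
# Z-score for the motif example: real count vs random-ensemble statistics.
real = 5
rand_mean, rand_std = 0.5, 0.6
z = (real - rand_mean) / rand_std
print(z)  # 7.5
```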

References
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
  between sets of items in large databases. SIGMOD, 207-216, 1993.
• R. Agrawal and R. Srikant. Fast algorithms for mining association
  rules. VLDB, 487-499, 1994.
• R. J. Bayardo. Efficiently mining long patterns from databases.
  SIGMOD, 85-93, 1998.

References (continued):
• Christian Borgelt. Efficient Implementations of Apriori and Eclat.
  FIMI'03.
• Ferenc Bodon. A fast APRIORI implementation. FIMI'03.
• Ferenc Bodon. A Survey on Frequent Itemset Mining. Technical Report,
  Budapest University of Technology and Economics, 2006.

Important websites:
• FIMI workshop
  – Not only Apriori and FIM
    • FP-tree, ECLAT, Closed, Maximal
  – http://fimi.cs.helsinki.fi/
• Christian Borgelt's website
  – http://www.borgelt.net/software.html
• Ferenc Bodon's website
  – http://www.cs.bme.hu/~bodon/en/apriori/