Association rule mining

dama2211 · 54 slides · Mar 18, 2016

About This Presentation

Data mining and warehousing (jntuh syllabus)


Slide Content

Lecture-27
Association rule mining

What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Lecture-27 - Association rule mining

Association Mining
Rule form:
prediction (Boolean variables) => prediction (Boolean variables) [support, confidence]
computer => antivirus_software [support = 2%, confidence = 60%]
buys(x, "computer") => buys(x, "antivirus_software") [0.5%, 60%]

Association Rule: Basic Concepts
Given a database of transactions, where each transaction is a list of items (purchased by a customer in a visit):
Find all rules that correlate the presence of one set of items with that of another set of items
Find frequent patterns
An example of frequent itemset mining is market basket analysis.

Association rule performance measures
Confidence
Support
Minimum support threshold
Minimum confidence threshold

Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support
- support, s: probability that a transaction contains {X ∪ Y ∪ Z}
- confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
- A => C (50%, 66.6%)
- C => A (50%, 100%)

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)
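The support and confidence figures above can be checked with a short sketch (Python, not part of the slides; the helper functions are my own):

```python
# The four-transaction database from the table above.
transactions = [
    {"A", "B", "C"},  # TID 2000
    {"A", "C"},       # TID 1000
    {"A", "D"},       # TID 4000
    {"B", "E", "F"},  # TID 5000
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs present | lhs present) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # 0.5   -> 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666… -> 66.6%
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> 100%
```

This reproduces A => C (50%, 66.6%) and C => A (50%, 100%) exactly as on the slide.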

Market Basket Analysis
Shopping baskets:
Each item has a Boolean variable representing the presence or absence of that item.
Each basket can be represented by a Boolean vector of values assigned to these variables.
Identify patterns from the Boolean vectors:
Patterns can be represented by association rules.

Association Rule Mining: A Road Map
Boolean vs. quantitative associations
- Based on the types of values handled
buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
Single-dimensional vs. multidimensional associations
Single-level vs. multiple-level analysis

Lecture-28
Mining single-dimensional Boolean association rules from transactional databases

Apriori Algorithm
Single-dimensional, single-level, Boolean frequent itemsets
Finding frequent itemsets using candidate generation
Generating association rules from frequent itemsets
Lecture-28 - Mining single-dimensional Boolean association rules from transactional databases

Mining Association Rules: An Example

Min. support 50%
Min. confidence 50%

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A,C} | 50%

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key StepMining Frequent Itemsets: the Key Step
Find the Find the frequent itemsetsfrequent itemsets: the sets of items : the sets of items
that have minimum supportthat have minimum support

A subset of a frequent itemset must also be a A subset of a frequent itemset must also be a
frequent itemsetfrequent itemset
i.e., if {i.e., if {ABAB} is} is a frequent itemset, both {a frequent itemset, both {AA} and } and
{{BB} should be a frequent itemset} should be a frequent itemset

Iteratively find frequent itemsets with cardinality Iteratively find frequent itemsets with cardinality
from 1 to from 1 to k (k-k (k-itemsetitemset))
Use the frequent itemsets to generate Use the frequent itemsets to generate
association rules.association rules.

The Apriori Algorithm
Join Step
Ck is generated by joining Lk-1 with itself
Prune Step
Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;

The Apriori Algorithm: Example

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D to count C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (from L1 joined with L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:
itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

L2:
itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3 (from L2 joined with L2): {2 3 5}

Scan D to count C3:
itemset | sup
{2 3 5} | 2

L3:
itemset | sup
{2 3 5} | 2

How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
- The total number of candidates can be very large
- One transaction may contain many candidates
Method
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}

Methods to Improve Apriori's Efficiency
Hash-based itemset counting
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Transaction reduction
- A transaction that does not contain any frequent k-itemset is useless in subsequent scans
Partitioning
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Methods to Improve Apriori's Efficiency
Sampling
- mining on a subset of the given data; lower the support threshold + a method to determine the completeness
Dynamic itemset counting
- add new candidate itemsets only when all of their subsets are estimated to be frequent

Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
- highly condensed, but complete for frequent pattern mining
- avoids costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method
- A divide-and-conquer methodology: decompose mining tasks into smaller ones
- Avoids candidate generation: sub-database tests only

Lecture-29
Mining multilevel association rules from transactional databases

Mining various kinds of association rules
Mining multilevel association rules
- Concepts at different levels
Mining multidimensional association rules
- More than one dimension
Mining quantitative association rules
- Numeric attributes
Lecture-29 - Mining multilevel association rules from transactional databases

Multiple-Level Association Rules
Items often form a hierarchy.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
The transaction database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.

(Figure: concept hierarchy: food splits into bread and milk; milk into 2% and skim, with brands Sunset and Fraser; bread into white and wheat)

TID | Items
T1 | {111, 121, 211, 221}
T2 | {111, 211, 222, 323}
T3 | {112, 122, 221, 411}
T4 | {111, 121}
T5 | {111, 122, 211, 221, 413}

Multi-level Association
Uniform Support: the same minimum support for all levels
+ One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
– Lower-level items do not occur as frequently. If the support threshold is
  too high => miss low-level associations
  too low => generate too many high-level associations

Multi-level Association
Reduced Support: reduced minimum support at lower levels
There are 4 search strategies:
- Level-by-level independent
- Level-cross filtering by k-itemset
- Level-cross filtering by single item
- Controlled level-cross filtering by single item

Uniform Support
Multi-level mining with uniform support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced Support
Multi-level mining with reduced support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
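The contrast between the two figures can be reproduced with a small sketch (supports and thresholds are taken from the figures; the code itself is illustrative):

```python
# Supports per item and each item's level in the milk hierarchy.
supports = {"milk": 0.10, "2% milk": 0.06, "skim milk": 0.04}
levels = {"milk": 1, "2% milk": 2, "skim milk": 2}

def frequent(min_sup_by_level):
    # keep an item if it meets the minimum support for its level
    return [i for i, s in supports.items()
            if s >= min_sup_by_level[levels[i]]]

print(frequent({1: 0.05, 2: 0.05}))  # uniform: skim milk is lost
print(frequent({1: 0.05, 2: 0.03}))  # reduced: all three survive
```

With uniform support, skim milk (4%) falls below the 5% threshold; reducing the level-2 threshold to 3% keeps both low-level items.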

Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items.
Example
- milk => wheat bread [support = 8%, confidence = 70%]
- 2% milk => wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
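The redundancy test can be sketched as follows. The 25% market share of 2% milk and the tolerance are illustrative assumptions, not values from the slides:

```python
def is_redundant(desc_support, anc_support, item_share, tol=0.2):
    """A descendant rule is redundant if its support is close to what the
    ancestor rule predicts. `item_share` is the (assumed) fraction of the
    ancestor item's sales the descendant item accounts for, e.g. 2% milk
    making up 25% of milk sales."""
    expected = anc_support * item_share
    return abs(desc_support - expected) <= tol * expected

# milk => wheat bread [8%]; 2% milk => wheat bread [2%]
print(is_redundant(0.02, 0.08, 0.25))  # True: 2% is exactly 8% x 25%
```

Under these assumptions the 2%-milk rule carries no extra information beyond its ancestor, so it would be filtered out.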

Lecture-30
Mining multidimensional association rules from transactional databases and data warehouse

Multi-Dimensional Association
Single-dimensional rules
buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules
- Inter-dimension association rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Lecture-30 - Mining multidimensional association rules from transactional databases and data warehouse

Multi-Dimensional Association
Categorical attributes
- finite number of possible values, no ordering among values
Quantitative attributes
- numeric, implicit ordering among values

Techniques for Mining MD Associations
Search for frequent k-predicate sets:
- Example: {age, occupation, buys} is a 3-predicate set.
- Techniques can be categorized by how quantitative attributes such as age are treated.
1. Using static discretization of quantitative attributes
- Quantitative attributes are statically discretized by using predefined concept hierarchies.
2. Quantitative association rules
- Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
3. Distance-based association rules
- This is a dynamic discretization process that considers the distance between data points.

Static Discretization of Quantitative Attributes
Discretized prior to mining using a concept hierarchy.
Numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
A data cube is well suited for mining:
- The cells of an n-dimensional cuboid correspond to the predicate sets.
- Mining from data cubes can be much faster.

(Figure: lattice of cuboids over the dimensions age, income, buys: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys))

Quantitative Association Rules
Numeric attributes are dynamically discretized
- such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: Aquan1 ∧ Aquan2 => Acat
Cluster "adjacent" association rules to form general rules using a 2-D grid.
Example:
age(X, "30-34") ∧ income(X, "24K-48K") => buys(X, "high resolution TV")

Lecture-31
From association mining to correlation analysis

Interestingness Measurements
Objective measures
- Two popular measurements:
  support
  confidence
Subjective measures
A rule (pattern) is interesting if
- it is unexpected (surprising to the user); and/or
- it is actionable (the user can do something with it)
Lecture-31 - From association mining to correlation analysis

Criticism of Support and Confidence
Example
- Among 5000 students:
  3000 play basketball
  3750 eat cereal
  2000 both play basketball and eat cereal
- play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

           | basketball | not basketball | sum(row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum(col.)  | 3000       | 2000           | 5000
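The misleading rule can be quantified with lift = P(B|A) / P(B), a quick check of the numbers above:

```python
# Contingency-table totals from the slide.
n = 5000
basketball, cereal, both = 3000, 3750, 2000

conf = both / basketball   # confidence of: play basketball => eat cereal
p_cereal = cereal / n      # baseline probability of eating cereal
lift = conf / p_cereal

print(round(conf, 3))      # 0.667
print(round(lift, 3))      # 0.889: lift < 1, so playing basketball and
                           # eating cereal are negatively correlated
```

The 66.7% confidence looks strong in isolation, but lift < 1 reveals that basketball players are actually *less* likely than average to eat cereal.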

Criticism of Support and Confidence
Example
- X and Y: positively correlated
- X and Z: negatively related
- yet the support and confidence of X => Z dominate
We need a measure of dependent or correlated events:

corr(A,B) = P(A ∪ B) / (P(A) P(B))

P(B|A)/P(B) is also called the lift of rule A => B

Transaction | 1 2 3 4 5 6 7 8
X           | 1 1 1 1 0 0 0 0
Y           | 1 1 0 0 0 0 0 0
Z           | 0 1 1 1 1 1 1 1

Rule | Support | Confidence
X=>Y | 25%     | 50%
X=>Z | 37.50%  | 75%

Other Interestingness Measures: Interest
Interest (correlation, lift)

interest(A,B) = P(A ∧ B) / (P(A) P(B))

- takes both P(A) and P(B) into consideration
- P(A ∧ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

Transaction | 1 2 3 4 5 6 7 8
X           | 1 1 1 1 0 0 0 0
Y           | 1 1 0 0 0 0 0 0
Z           | 0 1 1 1 1 1 1 1

Itemset | Support | Interest
X,Y     | 25%     | 2
X,Z     | 37.50%  | 0.9
Y,Z     | 12.50%  | 0.57
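The interest values in the table can be recomputed from the eight-transaction 0/1 vectors (helper names are my own):

```python
# Presence/absence of each item across the 8 transactions above.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def p(v):                 # P(item present)
    return sum(v) / len(v)

def p_and(a, b):          # P(both present)
    return sum(x and y for x, y in zip(a, b)) / len(a)

def interest(a, b):       # P(A ∧ B) / (P(A) P(B))
    return p_and(a, b) / (p(a) * p(b))

print(round(interest(X, Y), 2))  # 2.0  -> positively correlated
print(round(interest(X, Z), 2))  # 0.86 (the table rounds this to 0.9)
print(round(interest(Y, Z), 2))  # 0.57 -> negatively correlated
```

This confirms the point of the slide: X and Y are positively correlated (interest > 1) while X and Z are negatively correlated (interest < 1), even though X => Z has the higher support and confidence.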

Lecture-32
Constraint-based association mining

Constraint-Based Mining
Interactive, exploratory mining
Kinds of constraints:
- Knowledge type constraint: classification, association, etc.
- Data constraint: SQL-like queries
- Dimension/level constraints
- Rule constraint
- Interestingness constraints
Lecture-32 - Constraint-based association mining

Rule Constraints in Association Mining
Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining.
  P(x, y) ^ Q(x, w) => takes(x, "database systems")
- Rule (content) constraints: constraint-based query optimization (Ng, et al., SIGMOD'98).
  sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
1-variable vs. 2-variable constraints
- 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
- 2-var: a constraint confining both sides (L and R).
  sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)

Constraint-Based Association Query
Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
A constrained association query (CAQ) is of the form {(S1, S2) | C},
- where C is a set of constraints on S1, S2, including a frequency constraint
A classification of (single-variable) constraints:
- Class constraint: S ⊂ A, e.g. S ⊂ Item
- Domain constraint:
  S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100
  v θ S, where θ is ∈ or ∉, e.g. snacks ∉ S.Type
  V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
    e.g. {snacks, sodas} ⊆ S.Type
- Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}
  e.g. count(S1.Type) = 1, avg(S2.Price) < 100

Constrained Association Query Optimization Problem
Given a CAQ = {(S1, S2) | C}, the algorithm should be:

sound: it finds only frequent sets that satisfy the given constraints C

complete: all frequent sets satisfying the given constraints C are found

A naïve solution:

Apply Apriori to find all frequent sets, then test them for constraint satisfaction one by one.

Our approach:

Analyze the properties of the constraints comprehensively and push them as deeply as possible into the frequent-set computation.
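A small Python sketch of the naïve solution (function names and data are illustrative): mine every frequent itemset first, then filter by the constraint afterwards — exactly the post-processing work that constraint pushing tries to avoid:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate all frequent itemsets by brute force (fine for tiny data)."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                result[frozenset(cand)] = support
    return result

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
freq = frequent_itemsets(transactions, min_support=2)

# Naïve post-filter with a constraint C, here count(S) >= 2.
answers = {S: s for S, s in freq.items() if len(S) >= 2}
```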

Anti-monotone and Monotone Constraints
A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca.
A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies Cm.
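These two definitions can be checked mechanically on toy constraints; in this hypothetical Python sketch, sum(S.price) ≤ 100 behaves anti-monotonically and "a ∈ S" behaves monotonically:

```python
from itertools import combinations

items = {"a": 30, "b": 80, "c": 20}   # hypothetical item prices

def sum_le_100(S):
    """sum(S.price) <= 100 -- an anti-monotone constraint."""
    return sum(items[i] for i in S) <= 100

def contains_a(S):
    """'a' in S -- a monotone constraint."""
    return "a" in S

def supersets(S):
    """All supersets of S within the item universe (S itself included)."""
    rest = set(items) - set(S)
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            yield set(S) | set(extra)

# {a, b} violates the sum constraint (110 > 100), and so does every superset.
assert all(not sum_le_100(T) for T in supersets({"a", "b"}))

# {a} satisfies the membership constraint, and so does every superset.
assert all(contains_a(T) for T in supersets({"a"}))
```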

Succinct Constraint
A subset of items Is is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator.
SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus.
A constraint Cs is succinct provided SATCs(I) is a succinct power set.

Convertible Constraint
Suppose all items in patterns are listed in a total order R.
A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C.
A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C.

Relationships Among Categories of Constraints
[Figure: diagram relating the constraint categories — succinctness, anti-monotonicity, monotonicity, convertible constraints, and inconvertible constraints.]

Property of Constraints: Anti-Monotone
Anti-monotonicity: if a set S violates the constraint, any superset of S also violates the constraint.
Examples:

sum(S.Price) ≤ v is anti-monotone

sum(S.Price) ≥ v is not anti-monotone

sum(S.Price) = v is partly anti-monotone

Application:

Push "sum(S.Price) ≤ 1000" deeply into the iterative frequent-set computation.
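A minimal sketch of that application, assuming a simplified Apriori-style level-wise search (names and data are illustrative, not the lecture's algorithm): candidates exceeding the budget are pruned before their support is ever counted, and anti-monotonicity guarantees no satisfying frequent set is lost.

```python
from itertools import combinations

prices = {"a": 200, "b": 900, "c": 300, "d": 150}   # hypothetical prices
transactions = [{"a", "c", "d"}, {"a", "b"}, {"a", "c"}, {"c", "d"}]
budget, min_support = 1000, 2

def frequent_with_constraint(transactions, min_support, budget):
    items = sorted(prices)
    frequent = []
    level = [frozenset([i]) for i in items]
    while level:
        next_seeds = []
        for cand in level:
            if sum(prices[i] for i in cand) > budget:
                continue  # pruned by the anti-monotone constraint: no
                          # superset of cand can come back under budget
            if sum(1 for t in transactions if cand <= t) >= min_support:
                frequent.append(cand)
                next_seeds.append(cand)
        # Grow surviving sets by one item (simplified candidate generation).
        level = {s | {i} for s in next_seeds for i in items if i not in s}
    return frequent

result = frequent_with_constraint(transactions, min_support, budget)
```

With these numbers, {a, b} is discarded for its price (1100 > 1000) without a database scan, while {a, c} and {c, d} survive both the constraint and the support test.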

Characterization of Anti-Monotonicity Constraints

Constraint                      Anti-monotone?
S θ v, θ ∈ {=, ≤, ≥}            yes
v ∈ S                           no
S ⊇ V                           no
S ⊆ V                           yes
S = V                           partly
min(S) ≤ v                      no
min(S) ≥ v                      yes
min(S) = v                      partly
max(S) ≤ v                      yes
max(S) ≥ v                      no
max(S) = v                      partly
count(S) ≤ v                    yes
count(S) ≥ v                    no
count(S) = v                    partly
sum(S) ≤ v                      yes
sum(S) ≥ v                      no
sum(S) = v                      partly
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
(frequent constraint)           (yes)

Example of Convertible Constraints: avg(S) θ v
Let R be the value-descending order over the set of items

E.g. I = {9, 8, 6, 4, 3, 1}

avg(S) ≥ v is convertible monotone w.r.t. R

If S is a suffix of S1, then avg(S1) ≥ avg(S)
{8, 4, 3} is a suffix of {9, 8, 4, 3}
avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
If S satisfies avg(S) ≥ v, so does S1
{8, 4, 3} satisfies avg(S) ≥ 4, and so does {9, 8, 4, 3}
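The arithmetic on this slide can be verified directly:

```python
# With items listed in value-descending order R, any pattern extends its
# suffix only with larger values, so the average can never decrease.
def avg(S):
    return sum(S) / len(S)

S1 = [9, 8, 4, 3]        # pattern in value-descending order
S = S1[1:]               # {8, 4, 3} is a suffix of S1 w.r.t. R

assert avg(S1) == 6 and avg(S) == 5
assert avg(S1) >= avg(S)

# Hence if the suffix satisfies avg(S) >= v, so does the longer pattern:
v = 4
assert avg(S) >= v and avg(S1) >= v
```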

Property of Constraints: Succinctness
Succinctness:

For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C

Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1

Examples:

sum(S.Price) ≥ v is not succinct

min(S.Price) ≤ v is succinct

Optimization:

If C is succinct, then C is pre-counting prunable: the satisfaction of the constraint alone is not affected by the iterative support counting.
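A small sketch (with hypothetical prices) of why min(S.Price) ≤ v is succinct: the satisfying itemsets can be generated directly from the selection σp(I) of items priced at most v — every satisfying set must contain at least one such item — with no support counting involved:

```python
from itertools import combinations

prices = {"a": 10, "b": 50, "c": 80}   # hypothetical item prices
v = 30
cheap = {i for i, p in prices.items() if p <= v}   # the selection σp(I)

def satisfies(S):
    """min(S.Price) <= v, checked directly."""
    return min(prices[i] for i in S) <= v

all_sets = [set(c) for k in range(1, 4) for c in combinations(prices, k)]

# Generated succinctly: any set containing at least one cheap item.
generated = [S for S in all_sets if S & cheap]
# Checked by evaluating the constraint on every set.
checked = [S for S in all_sets if satisfies(S)]

assert generated == checked
```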

Characterization of Constraints by Succinctness

Constraint                      Succinct?
S θ v, θ ∈ {=, ≤, ≥}            yes
v ∈ S                           yes
S ⊇ V                           yes
S ⊆ V                           yes
S = V                           yes
min(S) ≤ v                      yes
min(S) ≥ v                      yes
min(S) = v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      yes
max(S) = v                      yes
count(S) ≤ v                    weakly
count(S) ≥ v                    weakly
count(S) = v                    weakly
sum(S) ≤ v                      no
sum(S) ≥ v                      no
sum(S) = v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}       no
(frequent constraint)           (no)