Lecture 7: Association Analysis to Correlation Analysis

hktripathy 3,993 views 20 slides Feb 28, 2019

Slide Content

Association Analysis to Correlation Analysis

Pattern Evaluation

Association rule algorithms tend to produce too many rules, and many of them are uninteresting or redundant.
- Redundant: {A,B,C} → {D} and {A,B} → {D} have the same support & confidence.
- Interestingness measures can be used to prune or rank the derived patterns.
- In the original formulation of association rules, support & confidence are the only measures used.

Application of Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

              Y      ¬Y
  X          f11    f10    f1+
  ¬X         f01    f00    f0+
             f+1    f+0    |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
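As a sketch (not part of the slides), the basic measures can be computed directly from the four contingency counts; the function name `rule_measures` is a hypothetical helper.

```python
def rule_measures(f11, f10, f01, f00):
    """Support, confidence and lift of the rule X -> Y,
    given the four cells of the X/Y contingency table."""
    n = f11 + f10 + f01 + f00       # |T|, total transactions
    support = f11 / n               # P(X and Y)
    confidence = f11 / (f11 + f10)  # P(Y | X)
    p_y = (f11 + f01) / n           # P(Y), baseline probability of Y
    lift = confidence / p_y         # confidence relative to the baseline
    return support, confidence, lift
```

For example, `rule_measures(15, 5, 75, 5)` evaluates the Tea → Coffee table discussed on the next slide: support 0.15, confidence 0.75, lift 0.75/0.9.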

Drawback of Confidence

            Coffee   ¬Coffee
  Tea         15        5       20
  ¬Tea        75        5       80
              90       10      100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375, i.e. not drinking tea makes coffee even more likely.

Correlation Concepts

Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of B) iff
  P(A ∪ B) = P(A) × P(B)
Otherwise A and B are dependent and correlated. The measure of correlation between A and B is given by the formula:
  corr(A,B) = P(A ∪ B) / (P(A) × P(B))
(Here P(A ∪ B) denotes the probability that a transaction contains both itemsets, i.e. the support of the itemset A ∪ B.)

Correlation Concepts [Cont.]

corr(A,B) > 1 means that A and B are positively correlated, i.e. the occurrence of one encourages the occurrence of the other.
corr(A,B) < 1 means that the occurrence of A is negatively correlated with (discourages) the occurrence of B.
corr(A,B) = 1 means that A and B are independent and there is no correlation between them.

Statistical Independence

Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S,B)

P(S,B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S,B) = P(S) × P(B) => statistical independence
P(S,B) > P(S) × P(B) => positively correlated
P(S,B) < P(S) × P(B) => negatively correlated
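The three-way comparison above can be sketched as a small classifier; `classify_correlation` is a hypothetical name, and a tolerance is used because floating-point products are rarely exactly equal.

```python
def classify_correlation(p_ab, p_a, p_b, eps=1e-9):
    """Compare the joint probability with the product of the marginals."""
    product = p_a * p_b
    if abs(p_ab - product) < eps:
        return "independent"
    return "positively correlated" if p_ab > product else "negatively correlated"

# Swim/bike example: 600 swim, 700 bike, 420 both, out of 1000 students
result = classify_correlation(420/1000, 600/1000, 700/1000)
```

Here 0.42 equals 0.6 × 0.7, so the result is "independent", matching the slide.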

Association & Correlation

The correlation formula can be rewritten as
  corr(A,B) = P(B|A) / P(B)
We already know that
  support(A → B) = P(A ∪ B)
  confidence(A → B) = P(B|A)
It follows that
  confidence(A → B) = corr(A,B) × P(B)
So correlation, support and confidence are all different, but correlation provides extra information about the association rule A → B. The correlation corr(A,B) is called the LIFT of the association rule A → B: A is said to increase (or lift) the likelihood of B by the factor corr(A,B).
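The identity confidence(A → B) = corr(A,B) × P(B) can be checked numerically on the earlier Tea → Coffee counts; this is just a sketch of the arithmetic, not slide material.

```python
# Tea -> Coffee numbers from the earlier contingency table
p_b_given_a = 15 / 20        # P(Coffee | Tea), i.e. the confidence
p_b = 90 / 100               # P(Coffee)

corr = p_b_given_a / p_b     # corr(A,B) = P(B|A) / P(B), the lift of the rule
confidence = corr * p_b      # recovering confidence from corr and P(B)
```

`confidence` comes back to exactly 0.75, confirming that lift carries the extra information beyond confidence itself.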

Statistical-based Measures

Measures that take statistical dependence into account start from the product rule:
  P(A and B) = P(A) × P(B|A)
equivalently,
  P(B|A) = P(A and B) / P(A)

Interestingness Measure: Correlations (Lift)

play basketball → eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.

Measure of dependent/correlated events: lift

               Basketball   Not basketball   Sum (row)
  Cereal          2000          1750           3750
  Not cereal      1000           250           1250
  Sum (col.)      3000          2000           5000
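Computing the lift of both rules from the table above (a sketch of the arithmetic; variable names are illustrative) shows why the second rule is the better one:

```python
n = 5000
p_b = 3000 / n              # P(basketball)
p_c = 3750 / n              # P(cereal)
p_bc = 2000 / n             # P(basketball and cereal)
p_bnc = 1000 / n            # P(basketball and not cereal)

lift_bc = p_bc / (p_b * p_c)           # basketball -> cereal
lift_bnc = p_bnc / (p_b * (1 - p_c))   # basketball -> not cereal
```

`lift_bc` is 0.4/0.45 ≈ 0.89 < 1 (negatively correlated), while `lift_bnc` is 0.2/0.15 ≈ 1.33 > 1, confirming the slide's conclusion.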

Example: Lift/Interest

            Coffee   ¬Coffee
  Tea         15        5       20
  ¬Tea        75        5       80
              90       10      100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Example: -Coefficient -coefficient is analogous to correlation coefficient for continuous variables Y Y X 60 10 70 X 10 20 30 70 30 100 Y Y X 20 10 30 X 10 60 70 30 70 100  Coefficient is the same for both tables

There are many measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? And what about Apriori-style support-based pruning — how does it affect these measures?

The Frequent Pattern Mining Model (Summary)

Definition (Support): The support of an itemset I is defined as the fraction of the transactions in the database T = {T1, ..., Tn} that contain I as a subset. support(A → B) = P(A ∪ B).
Relative vs. absolute support: The itemset support defined above is sometimes referred to as relative support, whereas the raw occurrence frequency (count) is called the absolute support. If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset.
Definition (Frequent Itemset Mining): Given a set of transactions T = {T1, ..., Tn}, where each transaction Ti is a subset of items from U, determine all itemsets I that occur as a subset of at least a predefined fraction minsup of the transactions in T.
Definition (Maximal Frequent Itemset): A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent.

The Frequent Pattern Mining Model (Properties)

Property (Support Monotonicity): The support of every subset J of I is at least equal to the support of itemset I: sup(J) ≥ sup(I) for all J ⊆ I.
Property (Downward Closure): Every subset of a frequent itemset is also frequent.
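Support monotonicity can be demonstrated on a toy transaction database; this is a minimal sketch, with made-up transactions and a hypothetical `support` helper.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain the itemset as a subset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

I = {"bread", "milk", "butter"}
# Support monotonicity: every proper subset J of I has sup(J) >= sup(I)
monotone = all(support(set(J)) >= support(I)
               for r in range(1, len(I))
               for J in combinations(I, r))
```

Here sup(I) = 1/4, every single item has support 3/4 and every pair has support 2/4, so `monotone` is True; the downward closure property is the contrapositive used for Apriori pruning.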

Association Rule Generation Framework (Summary)

Definition (Confidence): Let X and Y be two sets of items. The confidence conf(X ⇒ Y) of the rule X ⇒ Y is the conditional probability of X ∪ Y occurring in a transaction, given that the transaction contains X:
  conf(X ⇒ Y) = sup(X ∪ Y) / sup(X)
Definition (Association Rule): Let X and Y be two sets of items. The rule X ⇒ Y is said to be an association rule at a minimum support of minsup and minimum confidence of minconf if it satisfies both of the following criteria:
  1. The support of the itemset X ∪ Y is at least minsup.
  2. The confidence of the rule X ⇒ Y is at least minconf.
Property 4.3.1 (Confidence Monotonicity): Let X1, X2 and I be itemsets such that X1 ⊂ X2 ⊂ I. Then the confidence of X2 ⇒ I − X2 is at least that of X1 ⇒ I − X1:
  conf(X2 ⇒ I − X2) ≥ conf(X1 ⇒ I − X1)
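The rule-generation criteria above can be sketched as follows; the toy transactions and the names `sup` and `rules_from_itemset` are illustrative, not from the slides.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def sup(itemset):
    """Relative support of an itemset over the toy database."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def rules_from_itemset(itemset, minconf):
    """All rules X => (itemset - X) with conf = sup(itemset)/sup(X) >= minconf."""
    rules = []
    items = sorted(itemset)
    for r in range(1, len(items)):
        for X in combinations(items, r):
            conf = sup(itemset) / sup(X)
            if conf >= minconf:
                rules.append((set(X), set(itemset) - set(X), conf))
    return rules

found = rules_from_itemset({"bread", "milk"}, minconf=0.6)
```

With sup({bread, milk}) = 0.5 and sup({bread}) = sup({milk}) = 0.75, both candidate rules have confidence 2/3 and pass the minconf = 0.6 threshold.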

Reference: P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining.