pattern_evaluaiton_methods.ppt

301 views 25 slides Aug 17, 2023



Slide Content

Topic for the class: Pattern Evaluation Methods
Unit 3 Title: Mining frequent patterns, associations and correlations
Date & Time : 14.3.2022 9.00 AM-9.50 AM
Dr. Bhramaramba Ravi
Professor
Department of Computer Science and Engineering
GITAM Institute of Technology (GIT)
Visakhapatnam –530045
Email: [email protected]
Department of CSE, GIT, EID 356 Data Mining and Data Warehousing, 17 August 2023

Course objectives
•To understand the importance of data mining and its applications
•To introduce various types of data and preprocessing techniques
•To learn various multi-dimensional data models and OLAP processing
•To study concepts of Association Analysis
•To learn various Classification methods
•To learn basics of Cluster Analysis

Learning Outcomes
•At the end of this lecture/session, students will be able to
•Understand the pattern evaluation methods

Module 3
•Mining frequent patterns, associations and correlations: Basic concepts, Applications of frequent pattern and associations, Frequent pattern and association mining: A road map, mining various kinds of association rules, Apriori algorithm, FP growth algorithm, Pattern evaluation methods

Which Patterns are Interesting? - Pattern Evaluation Methods
•Most association rule mining algorithms employ a support-confidence framework. Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many of the rules generated are still not interesting to the users. This is especially true when mining at low support thresholds or mining for long patterns. This has been a major bottleneck for successful application of association rule mining.

Strong rules are not necessarily interesting
•Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another. However, objective interestingness measures, based on the statistics behind the data, can be used as one step toward the goal of weeding out uninteresting rules that would otherwise be presented to the user.
•Example: A misleading strong association rule. Suppose we are interested in analyzing transactions at AllElectronics with respect to the purchase of computer games and videos.

Strong rules are not necessarily interesting contd.
•Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6000 of the customer transactions included computer games, while 7500 included videos and 4000 included both computer games and videos. Suppose that a data mining program for discovering association rules is run on the data, using a minimum support of, say, 30% and a minimum confidence of 60%. The following association rule is discovered:
buys(X, "computer games") => buys(X, "videos")
[support = 40%, confidence = 66%]
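The support and confidence quoted for this rule follow directly from the transaction counts. The Python sketch below is a minimal illustration, assuming only the counts stated in the example; the function and variable names are made up for this sketch and are not part of the lecture.

```python
def support(n_both: int, n_total: int) -> float:
    """Fraction of all transactions that contain both itemsets."""
    return n_both / n_total


def confidence(n_both: int, n_antecedent: int) -> float:
    """Fraction of antecedent transactions that also contain the consequent."""
    return n_both / n_antecedent


# Counts from the AllElectronics example (names are illustrative).
N_TOTAL, N_GAME, N_VIDEO, N_BOTH = 10_000, 6_000, 7_500, 4_000
MIN_SUP, MIN_CONF = 0.30, 0.60

sup = support(N_BOTH, N_TOTAL)      # 4000 / 10000
conf = confidence(N_BOTH, N_GAME)   # 4000 / 6000
is_strong = sup >= MIN_SUP and conf >= MIN_CONF
p_video = N_VIDEO / N_TOTAL         # baseline probability of buying a video

print(f"support={sup:.2%}  confidence={conf:.2%}  strong={is_strong}")
print(f"P(video)={p_video:.2%}")
```

The confidence is 4000/6000, about 66.7%, which the slide reports as 66%. The rule passes both thresholds, yet the baseline P(video) is higher than the confidence.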

Strong rules are not necessarily interesting contd.
•The earlier rule is a strong association rule since its support value of 4000/10000 = 40% and confidence value of 4000/6000 = 66% satisfy the minimum support and minimum confidence thresholds. However, the rule is misleading because the probability of purchasing videos is 75%, which is even larger than 66%. Computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other.

From Association Analysis to Correlation Analysis
•As we have seen so far, the support and confidence measures are insufficient at filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form
A => B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.
•Lift is a simple correlation measure that is given as follows:

From Association Analysis to Correlation Analysis contd.
•The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B))
•If the resulting value of the above equation is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other.
•If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
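The lift definition can be tried out on raw transactions. The sketch below is one possible Python rendering, assuming each transaction is a set of item names and using the game/video counts from the earlier example; `lift` and `interpret` are hypothetical helper names, not functions from any library.

```python
from typing import FrozenSet, List


def lift(transactions: List[FrozenSet[str]],
         a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """lift(A, B) = P(A u B) / (P(A) * P(B)), estimated from transaction counts."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # transactions containing A
    n_b = sum(1 for t in transactions if b <= t)         # transactions containing B
    n_ab = sum(1 for t in transactions if (a | b) <= t)  # transactions containing both
    if n == 0 or n_a == 0 or n_b == 0:
        raise ValueError("lift is undefined when A or B never occurs")
    return (n_ab / n) / ((n_a / n) * (n_b / n))


def interpret(value: float, tol: float = 1e-9) -> str:
    """Map a lift value to the three cases described on the slides."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "negatively correlated" if value < 1.0 else "positively correlated"


# Synthetic transactions matching the example counts (10,000 in total).
game, video = frozenset({"game"}), frozenset({"video"})
transactions = (
    [frozenset({"game", "video"})] * 4000
    + [frozenset({"game"})] * 2000
    + [frozenset({"video"})] * 3500
    + [frozenset()] * 500
)

value = lift(transactions, game, video)
print(round(value, 2), interpret(value))
```

The tolerance in `interpret` is an assumption of this sketch; on real data an exact value of 1 is rare, so a looser threshold or a significance test is usually a better fit.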

From Association Analysis to Correlation Analysis contd.
•If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. The earlier equation is equivalent to P(B|A)/P(B), or conf(A => B)/sup(B), which is also referred to as the lift of the association (or correlation) rule A => B.
•Example: Correlation analysis using lift. To help filter out misleading "strong" associations of the form A => B from the data of the earlier example, we need to study how the two itemsets, A and B, are correlated. Let ¬game refer to the transactions of the example that do not contain computer games, and ¬video refer to those that do not contain videos.

From Association Analysis to Correlation Analysis contd.
•The transactions can be summarized in a contingency table, as shown in the table.
Table: 2x2 contingency table summarizing the transactions with respect to game and video purchases

From Association Analysis to Correlation Analysis contd.
•From the table we can see that the probability of purchasing a computer game is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and the probability of purchasing both is P({game, video}) = 0.40.
•By the equation for lift, the lift is calculated as
P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89.
•Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
•The numerator is the likelihood of a customer purchasing both, while the denominator is what the likelihood would have been if the two purchases were completely independent. Such a negative correlation cannot be identified by a support-confidence framework.

From Association Analysis to Correlation Analysis contd.
•The second correlation measure that we study is the χ² measure.
•To compute the χ² value, we take the squared difference between the observed and expected value for a slot (A and B pair) in the contingency table, divided by the expected value. This amount is summed for all slots of the contingency table.

Table: Contingency table, now with the expected values (in parentheses)
              game           ¬game          Σ row
video         4000 (4500)    3500 (3000)    7500
¬video        2000 (1500)    500 (1000)     2500
Σ col         6000           4000           10000
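The expected values in parentheses are what each slot would hold if game and video purchases were independent: row total times column total, divided by the grand total. A minimal Python sketch of that calculation follows; the nested-list layout and the name `expected_counts` are assumptions made for illustration.

```python
def expected_counts(observed):
    """Expected count per slot under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: video, not video. Columns: game, not game.
observed = [[4000, 3500],
            [2000, 500]]

expected = expected_counts(observed)
for obs_row, exp_row in zip(observed, expected):
    print([f"{o} ({e:.0f})" for o, e in zip(obs_row, exp_row)])
```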


From Association Analysis to Correlation Analysis contd.
•To compute the correlation using χ² analysis for nominal data, we need the observed value and expected value for each slot of the contingency table.
•From the table we can compute the χ² value as follows:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(4000 - 4500)^2}{4500} + \frac{(3500 - 3000)^2}{3000} + \frac{(2000 - 1500)^2}{1500} + \frac{(500 - 1000)^2}{1000} = 555.6$$
•Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4000, which is less than the expected value of 4500, buying game and video are negatively correlated.
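The χ² sum above can be checked numerically. The Python sketch below takes the (observed, expected) pairs exactly as they appear in the contingency table; the helper name `chi_square` is illustrative only.

```python
def chi_square(cells):
    """Sum of (observed - expected)^2 / expected over all contingency-table slots."""
    return sum((obs - exp) ** 2 / exp for obs, exp in cells)


# (observed, expected) pairs, read row by row from the contingency table.
cells = [(4000, 4500), (3500, 3000), (2000, 1500), (500, 1000)]

chi2 = chi_square(cells)
obs_both, exp_both = cells[0]  # the (game, video) slot
direction = "negative" if obs_both < exp_both else "positive"

print(round(chi2, 1), direction)
```

The χ² value alone says only that the two purchases are dependent; the direction comes from comparing the observed and expected counts of the (game, video) slot, as the slide does.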

A comparison of pattern evaluation measures
•Instead of using the simple support-confidence framework to evaluate frequent patterns, other measures such as lift and χ² often disclose more intrinsic pattern relationships.
•Several other pattern evaluation measures exist.
•Four such measures are all_confidence, max_confidence, Kulczynski, and cosine.
•We'll then compare their effectiveness with respect to one another and with respect to the lift and χ² measures.
•Given two itemsets, A and B, the all_confidence measure of A and B is defined as
all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}
where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus all_conf(A, B) is also the minimum confidence of the two association rules related to A and B, namely "A => B" and "B => A".


A comparison of pattern evaluation measures contd.
•Given two itemsets, A and B, the max_confidence measure of A and B is defined as
max_conf(A, B) = max{P(A|B), P(B|A)}
•The max_conf measure is the maximum confidence of the two association rules, "A => B" and "B => A".
•Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as Kulc) is defined as
Kulc(A, B) = 1/2 (P(A|B) + P(B|A))
•It was proposed in 1927 by Polish mathematician S. Kulczynski. It can be viewed as the average of two confidence measures. That is, it is the average of two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B.

A comparison of pattern evaluation measures contd.
•Finally, given two itemsets, A and B, the cosine measure of A and B is defined as
cosine(A, B) = P(A ∪ B) / √(P(A) × P(B)) = sup(A ∪ B) / √(sup(A) × sup(B)) = √(P(A|B) × P(B|A))
•The cosine measure can be viewed as a harmonized lift measure. The two formulae are similar except that for cosine, the square root is taken on the product of the probabilities of A and B. This is an important difference, because by taking the square root, the cosine value is only influenced by the supports of A, B, and A ∪ B, and not by the total number of transactions.
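The four measures defined on these slides (all_confidence, max_confidence, Kulczynski, cosine) each need only three supports. The Python sketch below evaluates them on the game/video counts from the earlier example; the function names mirror the slide notation but are otherwise assumptions of this sketch.

```python
import math


def all_conf(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """min{P(A|B), P(B|A)}: divide by the larger of the two supports."""
    return sup_ab / max(sup_a, sup_b)


def max_conf(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """max{P(A|B), P(B|A)}: divide by the smaller of the two supports."""
    return sup_ab / min(sup_a, sup_b)


def kulc(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """Average of the two conditional probabilities P(B|A) and P(A|B)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def cosine(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """Geometric mean of the two conditional probabilities."""
    return sup_ab / math.sqrt(sup_a * sup_b)


# Support counts for A = {game}, B = {video} from the earlier example.
SUP_AB, SUP_A, SUP_B = 4000, 6000, 7500

results = {f.__name__: f(SUP_AB, SUP_A, SUP_B)
           for f in (all_conf, max_conf, kulc, cosine)}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Because every measure is a ratio of supports, passing absolute counts or relative frequencies gives the same result; the total number of transactions never enters the calculation.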

A comparison of pattern evaluation measures contd.
•Each of these four measures has the following property: its value is only influenced by the supports of A, B, and A ∪ B, or more exactly, by the conditional probabilities P(A|B) and P(B|A), but not by the total number of transactions.
•Another common property is that each measure ranges from 0 to 1, and the higher the value, the closer the relationship between A and B.


A comparison of pattern evaluation measures contd.
•A null transaction is a transaction that does not contain any of the itemsets being examined.
•A measure is null-invariant if its value is free from the influence of null transactions.
•Null invariance is an important property for measuring association patterns in large transaction databases.
•Among the six measures, lift and χ² are not null-invariant measures.
•Among the four null-invariant measures, namely all_confidence, max_confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the imbalance ratio.
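Null invariance can be seen by padding the game/video data with extra null transactions and recomputing lift and Kulc. The Python sketch below does this. The slides mention the imbalance ratio without defining it, so the formula used here, IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)), is taken from the course textbook and should be read as an assumption relative to the slides. All function names are illustrative.

```python
def lift(n_ab: int, n_a: int, n_b: int, n_total: int) -> float:
    """Lift depends on the total number of transactions."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))


def kulc(n_ab: int, n_a: int, n_b: int) -> float:
    """Kulc uses only the supports of A, B and A u B."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)


def imbalance_ratio(n_ab: int, n_a: int, n_b: int) -> float:
    """IR as defined in the textbook (not stated on the slides)."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)


N_AB, N_A, N_B = 4000, 6000, 7500
BASE_TOTAL = 10_000

# Pad the data set with transactions that contain neither games nor videos.
for extra_nulls in (0, 90_000):
    n_total = BASE_TOTAL + extra_nulls
    print(f"n={n_total:>7}  lift={lift(N_AB, N_A, N_B, n_total):.3f}  "
          f"kulc={kulc(N_AB, N_A, N_B):.3f}")

print(f"imbalance ratio={imbalance_ratio(N_AB, N_A, N_B):.3f}")
```

With the extra null transactions, lift moves from below 1 to above 1 even though nothing about the game and video purchases changed, while Kulc stays the same.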

Text Book(s)
1. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 2/e, Morgan Kaufmann Publishers, 2006.
References
1. Correlation Analysis mgaub.ac.in

Session Quiz
(5-10 MCQ/true-false questions; marks to be counted for continuous evaluation)

Recap - Summary
•Understood the pattern evaluation methods

THANK YOU