Characterization and Comparison

825 views 79 slides Aug 03, 2023

UNIT -III
Concept Description:
Characterization and Comparison
By Mrs. Chetana

UNIT -III
•Concept Description: Characterization and Comparison:
Data Generalization and Summarization-Based Characterization,
Analytical Characterization: Analysis of Attribute Relevance,
Mining Class Comparisons: Discriminating between Different
Classes, Mining Descriptive Statistical Measures in Large
Databases.
•Applications:
Telecommunication Industry, Social Network Analysis, Intrusion
Detection

Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between
different classes
Mining descriptive statistical measures in large databases
Discussion
Summary

What is Concept Description?
From a data analysis point of view, data mining can be classified into two categories: descriptive mining and predictive mining.
◦Descriptive mining: describes the data set in a concise and summarized manner and presents interesting general properties of the data
◦Predictive mining: analyzes the data in order to construct one or a set of models, and attempts to predict the behavior of new data sets

What is Concept Description?
Databases usually store large amounts of data in great detail.
However, users often like to view sets of summarized data in concise, descriptive terms.
Such data descriptions may provide an overall picture of a class of data or distinguish it from a set of comparative classes.
Such descriptive data mining is called concept description and forms an important component of data mining.

What is Concept Description?
The simplest kind of descriptive data mining is called concept description.
A concept usually refers to a collection of data such as frequent_buyers, graduate_students and so on.
As a data mining task, concept description is not a simple enumeration of the data. Instead, concept description generates descriptions for the characterization and comparison of the data.
It is sometimes called class description, when the concept to be described refers to a class of objects.
◦Characterization: provides a concise and brief summarization of the given collection of data
◦Comparison: provides descriptions comparing two or more collections of data

Concept Description vs. OLAP
OLAP:
◦Data warehouse and OLAP tools are based on a multidimensional data model that views data in the form of a data cube, consisting of dimensions (or attributes) and measures (aggregate functions)
◦Current OLAP systems confine dimensions to non-numeric data.
◦Similarly, measures such as count(), sum(), average() in current OLAP systems apply only to numeric data.
◦restricted to a small number of dimension and measure types
◦user-controlled process: the selection of dimensions and the application of OLAP operations such as drill-down, roll-up, slicing and dicing are controlled by the users
Concept description in large databases:
◦The database attributes can be of various types, including numeric, nonnumeric, spatial, text or image
◦can handle complex data types of the attributes and their aggregations
◦a more automated process

Concept Description vs. OLAP
Concept description:
◦can handle complex data types of the attributes and
their aggregations
◦a more automated process
OLAP:
◦restricted to a small number of dimension and measure
types
◦user-controlled process

Data Generalization and Summarization-
based Characterization
Data and objects in databases contain detailed information at the primitive concept level.
For example, the item relation in a sales database may contain attributes describing low-level item information such as item_ID, name, brand, category, supplier, place_made and price.
It is useful to be able to summarize a large set of data and present it at a high conceptual level.
For example, summarizing a large set of items relating to Christmas season sales provides a general description of such data, which can be very helpful for sales and marketing managers.
This requires an important functionality called data generalization.

Data Generalization and Summarization-
based Characterization
Data generalization
◦A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
◦Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach
[Figure: conceptual levels 1 to 5]

Characterization: Data Cube Approach
(without using AO-Induction)
Perform computations and store results in data cubes
Strength
◦An efficient implementation of data generalization
◦Computation of various kinds of measures
e.g., count(), sum(), average(), max()
◦Generalization and specialization can be performed on a data cube by roll-up and drill-down
Limitations
◦Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values.
◦Lack of intelligent analysis: it cannot tell which dimensions should be used and what levels the generalization should reach

Attribute-Oriented Induction (AOI)
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989 (KDD ‘89 workshop), a few years prior to the introduction of the data cube approach.
The data cube approach can be considered a data warehouse-based, precomputation-oriented, materialized approach.
It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, on-line data analysis technique.

Attribute-Oriented Induction (AOI)
However, there is no inherent barrier distinguishing the two approaches based on on-line aggregation versus off-line precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of multidimensional space can speed up attribute-oriented induction as well.

Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How is it done?
◦Collect the task-relevant data (initial relation) using a relational database query
◦Perform generalization by attribute removal or attribute generalization.
◦Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
◦This reduces the size of the generalized data set.
◦Interactive presentation with users.

Basic Principles of
Attribute-Oriented Induction
Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct values for A but
(1) there is no generalization operator on A, or
(2) A’s higher-level concepts are expressed in terms of other attributes.
Attribute-generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: typically 2-8, specified or default.
Generalized relation threshold control (10-30): controls the final relation/rule size.

Basic Algorithm for Attribute-Oriented Induction
InitialRel:
Query processing of task-relevant data, deriving the initial relation.
PreGen:
Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute: removal? or how high to
generalize?
PrimeGen:
Based on the PreGen plan, perform generalization to the right level to
derive a “prime generalized relation”, accumulating the counts.
Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping
into rules, cross tabs, visualization presentations.
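The steps above (collect the initial relation, generalize or remove attributes, merge identical tuples while accumulating counts) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept hierarchy `MAJOR_TO_FIELD`, the GPA cut-offs in `gpa_level`, and the function names are hypothetical and are not taken from the slides; the three input tuples follow the student sample used in this deck.

```python
from collections import Counter

# Hypothetical concept hierarchy for "major" (an assumption for illustration).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Chem": "Science"}

def country_of(birth_place):
    """Generalize 'City, Province, Country' to 'Canada' or 'Foreign'."""
    country = birth_place.split(",")[-1].strip()
    return "Canada" if country == "Canada" else "Foreign"

def gpa_level(gpa):
    """Hypothetical GPA bands; the cut-offs are an assumption."""
    if gpa >= 3.75:
        return "Excellent"
    if gpa >= 3.5:
        return "Very_good"
    return "Good"

def attribute_oriented_induction(initial_relation):
    """Generalize each tuple, then merge identical tuples and count them."""
    counts = Counter()
    for row in initial_relation:
        generalized = (
            row["gender"],                              # retained as-is
            MAJOR_TO_FIELD.get(row["major"], "Other"),  # attribute generalization
            country_of(row["birth_place"]),             # attribute generalization
            gpa_level(row["gpa"]),                      # attribute generalization
        )                                               # name and phone are removed
        counts[generalized] += 1
    return counts

initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "phone": "687-4598", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "phone": "253-9106", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "phone": "420-5232", "gpa": 3.83},
]

prime_relation = attribute_oriented_induction(initial_relation)
for generalized_tuple, count in prime_relation.items():
    print(generalized_tuple, count)
```

With these assumed hierarchies the two CS students collapse into one generalized tuple with count 2, which is the merging step the algorithm relies on to shrink the relation.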

Example
DMQL: Describe general characteristics of graduate
students in the Big_University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD”}

Class Characterization: An Example
Initial relation:
Name | Gender | Major | Birth_place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
Generalization plan per attribute:
Name: removed; Gender: retained; Major: generalized to Sci, Eng, Bus; Birth_place: generalized to country; Birth_date: generalized to age range; Residence: generalized to city; Phone #: removed; GPA: generalized to Excl, VG, ..
Prime generalized relation:
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …
Crosstab of Gender by Birth_Region:
Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62

Presentation of Generalized Results
Generalized relation:
◦Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.
Cross tabulation:
◦Mapping results into cross tabulation form (similar to contingency tables).
Visualization techniques:
◦Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
◦Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
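The t-weights in the rule above can be reproduced from the male row of the Gender by Birth_Region crosstab (16 born in Canada, 14 foreign-born). The sketch below is illustrative only; the function name `t_weights` is a hypothetical helper, not part of any mining system.

```python
def t_weights(counts):
    """Return each value's share of the class total (its t-weight)."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Male graduate students by birth region, from the crosstab.
male_grads_by_region = {"Canada": 16, "Foreign": 14}

weights = t_weights(male_grads_by_region)
for region, weight in weights.items():
    print(f"birth_region = {region}: t-weight = {weight:.0%}")
```

Rounded to whole percentages, the two shares are the 53% and 47% annotations attached to the disjuncts of the rule.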

Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining query
◦Facilitate efficient drill-down analysis
◦May increase the response time
◦A balanced solution: precomputation of “subprime” relation
Use a predefined & precomputed data cube
◦Construct a data cube beforehand
◦Facilitate not only the attribute-oriented induction, but also attribute
relevance analysis, dicing, slicing, roll-up and drill-down
◦Cost of cube computation and the nontrivial storage overhead

Analytical Characterization
Attribute Relevance Analysis
“What if I am not sure which attribute to include for class characterization and class comparison? I may end up specifying too many attributes, which could slow down the system considerably.”
Measures of attribute relevance analysis can be used to help identify irrelevant or weakly relevant attributes that can be excluded from the concept description process.
The incorporation of this processing step into class characterization or comparison is referred to as analytical characterization or analytical comparison.

Why Perform Attribute Relevance
Analysis??
The first limitation of OLAP tools is the handling of complex objects.
The second limitation is the lack of an automated generalization process: the user must explicitly tell the system which dimensions should be included in the class characterization and how high a level each dimension should be generalized to.
Actually, each step of generalization or specialization on any dimension must be specified by the user.
Usually, it is not difficult for a user to instruct a data mining system regarding how high a level each dimension should be generalized to.
For example, users can set attribute generalization thresholds for this, or specify which level a given dimension should reach, such as with the command “generalize dimension location to the country level”.

Why Perform Attribute Relevance
Analysis??
Even without explicit user instruction, a default value such as 2 to 8 can be set by the data mining system, which would allow each dimension to be generalized to a level that contains only 2 to 8 distinct values.
On the other hand, a user may include too few attributes in the analysis, causing incomplete mining results, or may introduce too many attributes for analysis, e.g. “in relevance to *”.
Methods should be introduced to perform attribute relevance analysis in order to filter out statistically irrelevant or weakly relevant attributes.
Class characterization that includes the analysis of attribute/dimension relevance is called analytical characterization.
Class comparison that includes such analysis is called analytical comparison.

Attribute Relevance Analysis
Why?
◦Which dimensions should be included?
◦How high level of generalization?
◦Automatic vs. interactive
◦Reduce number of attributes; easy to understand patterns
What?
◦statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
◦relevance related to dimensions and levels
◦analytical characterization, analytical comparison

Steps for Attribute relevance analysis
Data Collection :
Collect data for both the target class and the contrasting class by query processing
Preliminary relevance analysis using conservative AOI:
•This step identifies a set of dimensions and attributes on which the selected relevance
measure is to be applied.
•The relation obtained by such an application of AOI is called the candidate relation of
the mining task.
Remove irrelevant and weakly relevant attributes using the selected
relevance analysis:
•We evaluate each attribute in the candidate relation using the selected relevance
analysis measure.
•This step results in an initial target class working relation and initial contrasting class
working relation.
Generate the concept description using AOI:
•Perform AOI using a less conservative set of attribute generalization thresholds.
•If the descriptive mining task is
class characterization, only the ITCWR (Initial Target Class Working Relation) is included;
class comparison, both the ITCWR and the ICCWR (Initial Contrasting Class Working Relation) are included.

Relevance Measures
Quantitative relevance measure determines the
classifying power of an attribute within a set of data.
Methods
◦information gain (ID3)
◦gain ratio (C4.5)
◦gini index
◦χ² contingency table statistics
◦uncertainty coefficient

Entropy and Information Gain
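S contains s_i tuples of class C_i for i = 1, …, m, and s is the total number of tuples in S.
Information measures the info required to classify any arbitrary tuple:
$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Entropy of attribute A with values {a_1, a_2, …, a_v}, where s_ij is the number of tuples of class C_i with A = a_j:
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$
Information gained by branching on attribute A:
$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$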
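The three formulas above translate almost directly into code. The following is a minimal sketch, assuming class counts are passed as plain lists; the function names and the two toy splits are hypothetical and only serve to show the extreme cases (a split that separates the classes perfectly and one that tells nothing).

```python
from math import log2

def expected_info(class_counts):
    """I(s_1, ..., s_m): entropy of a class distribution given as counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def attribute_entropy(partitions):
    """E(A): partitions holds one list of class counts per value of A."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * expected_info(p) for p in partitions)

def information_gain(partitions):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    class_totals = [sum(column) for column in zip(*partitions)]
    return expected_info(class_totals) - attribute_entropy(partitions)

# Hypothetical two-class counts for an attribute with two values.
perfect_split = [[10, 0], [0, 10]]   # each value belongs to a single class
useless_split = [[5, 5], [5, 5]]     # each value mirrors the overall mix

print(information_gain(perfect_split))
print(information_gain(useless_split))
```

A perfectly separating attribute recovers the full one bit of class information, while an attribute whose values mirror the overall class mix gains nothing; relevance analysis ranks attributes between these two extremes.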

Example: Analytical Characterization
Task
◦Mine general characteristics describing graduate students using
analytical characterization
Given
◦Attributes :
name, gender, major, birth_place, birth_date, phone#, and gpa
◦Gen(a_i) = concept hierarchies on a_i
◦U_i = attribute analytical thresholds for a_i
◦T_i = attribute generalization thresholds for a_i
◦R = attribute relevance threshold

Eg: Analytical Characterization (cont’d)
1. Data collection
◦target class: graduate student
◦contrasting class: undergraduate student
2. Analytical generalization using U_i
◦attribute removal
remove name and phone#
◦attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
◦candidate relation (large attribute generalization threshold):
gender, major, birth_country, age_range and gpa

Example: Analytical characterization (2)
Candidate relation for Target class: Graduate students (total count = 120)
gender | major | birth_country | age_range | gpa | count
M | Science | Canada | 20-25 | Very_good | 16
F | Science | Foreign | 25-30 | Excellent | 22
M | Engineering | Foreign | 25-30 | Excellent | 18
F | Science | Foreign | 25-30 | Excellent | 25
M | Science | Canada | 20-25 | Excellent | 21
F | Engineering | Canada | 20-25 | Excellent | 18
Candidate relation for Contrasting class: Undergraduate students (total count = 130)
gender | major | birth_country | age_range | gpa | count
M | Science | Foreign | <20 | Very_good | 18
F | Business | Canada | <20 | Fair | 20
M | Business | Canada | <20 | Fair | 22
F | Science | Canada | 20-25 | Fair | 24
M | Engineering | Foreign | 20-25 | Very_good | 22
F | Engineering | Canada | <20 | Excellent | 24

Eg: Analytical characterization (3)
3. Relevance analysis
◦Calculate expected info required to classify an arbitrary tuple
$$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$
◦Calculate entropy of each attribute: e.g. major
For major = “Science”: s_11 = 84, s_21 = 42, I(s_11, s_21) = 0.9183
For major = “Engineering”: s_12 = 36, s_22 = 46, I(s_12, s_22) = 0.9892
For major = “Business”: s_13 = 0, s_23 = 42, I(s_13, s_23) = 0
Here s_13 is the number of grad students in “Business” and s_23 is the number of undergrad students in “Business”.

Example: Analytical Characterization (4)
Calculate expected info required to classify a given sample if S is partitioned according to the attribute
$$E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$$
Calculate information gain for each attribute
$$Gain(major) = I(s_1, s_2) - E(major) = 0.2115$$
◦Information gain for all attributes
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
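The figures for major can be checked by recomputing them from the two candidate relations of this example. The sketch below copies the twelve generalized tuples and their counts; the helper names `expected_info` and `gain` are hypothetical, and small differences in the last decimal place against the slide values come from rounding.

```python
from collections import defaultdict
from math import log2

# (gender, major, birth_country, age_range, gpa, count), copied from the
# candidate relations for the target and contrasting classes.
GRADUATE = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
UNDERGRADUATE = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]
ATTRIBUTES = ["gender", "major", "birth_country", "age_range", "gpa"]

def expected_info(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(attribute):
    """Information gain of one attribute over the two classes."""
    column = ATTRIBUTES.index(attribute)
    per_value = defaultdict(lambda: [0, 0])  # value -> [grad, undergrad]
    for class_index, relation in enumerate((GRADUATE, UNDERGRADUATE)):
        for row in relation:
            per_value[row[column]][class_index] += row[-1]
    class_totals = [sum(v[i] for v in per_value.values()) for i in (0, 1)]
    total = sum(class_totals)
    entropy = sum(sum(v) / total * expected_info(v) for v in per_value.values())
    return expected_info(class_totals) - entropy

print(f"I(s1, s2)   = {expected_info([120, 130]):.4f}")
print(f"Gain(major) = {gain('major'):.4f}")
```

Calling `gain` with any other name from `ATTRIBUTES` ranks the remaining attributes in the same way.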

Eg: Analytical characterization (5)
4. Initial working relation (W_0) derivation
◦R = 0.1 (attribute relevance threshold value)
◦remove irrelevant/weakly relevant attributes from candidate relation => drop gender, birth_country
◦remove contrasting class candidate relation
Initial target class working relation W_0: Graduate students
major | age_range | gpa | count
Science | 20-25 | Very_good | 16
Science | 25-30 | Excellent | 47
Science | 20-25 | Excellent | 21
Engineering | 20-25 | Excellent | 18
Engineering | 25-30 | Excellent | 18
5. Perform attribute-oriented induction on W_0 using T_i
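Step 4 is a simple filter: every attribute whose gain falls below the relevance threshold R is dropped. A minimal sketch, using the gains listed in the relevance analysis step and R = 0.1; the variable names are hypothetical.

```python
RELEVANCE_THRESHOLD = 0.1  # R from the example

# Information gains as listed in the relevance analysis step.
gains = {
    "gender": 0.0003,
    "birth_country": 0.0407,
    "major": 0.2115,
    "gpa": 0.4490,
    "age_range": 0.5971,
}

# Keep attributes at or above the threshold, most relevant first.
retained = sorted(
    (attr for attr, g in gains.items() if g >= RELEVANCE_THRESHOLD),
    key=gains.get,
    reverse=True,
)
dropped = [attr for attr in gains if gains[attr] < RELEVANCE_THRESHOLD]

print("retained:", retained)
print("dropped:", dropped)
```

The retained attributes are exactly the columns of the initial target class working relation: major, age_range and gpa.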

Analytical Characterization :
Example of Entropy & Information Gain

Class Comparisons Methods and Implementations
Data Collection: The set of relevant data in the database is collected by query processing and is partitioned into a target class and a contrasting class.
Dimension relevance analysis: If there are many dimensions and analytical comparison is desired, then dimension relevance analysis should be performed on these classes and only the highly relevant dimensions are included in the further analysis.
Synchronous Generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation/cuboid. The concepts in the contrasting class(es) are generalized to the same level as those in the prime target class relation/cuboid, forming the prime contrasting class relation/cuboid.
Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, graphs and rules. This presentation usually includes a “contrasting” measure (such as count%) that reflects the comparison between the target and contrasting classes.

Example: Analytical comparison
Task
◦Compare graduate and undergraduate students using a discriminant
rule.
◦DMQL query
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student

Example: Analytical comparison (2)
Given
◦attributes name, gender, major, birth_place,
birth_date, residence, phone# and gpa
◦Gen(a_i) = concept hierarchies on attributes a_i
◦U_i = attribute analytical thresholds for attributes a_i
◦T_i = attribute generalization thresholds for attributes a_i
◦R = attribute relevance threshold

Example: Analytical comparison (3)
1. Data collection
◦target and contrasting classes
2. Attribute relevance analysis
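◦remove irrelevant or weakly relevant attributes: name, gender, birth_place, residence, phone# (major, age_range and gpa are retained, as the prime generalized relations show)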
3. Synchronous generalization
◦controlled by user-specified dimension thresholds
◦prime target and contrasting class(es) relations/cuboids

Example: Analytical comparison (4)
4. Drill down, roll up and other OLAP operations on target and
contrasting classes to adjust levels of abstractions of resulting
description
5. Presentation
◦as generalized relations, crosstabs, bar charts, pie charts, or
rules
◦contrasting measures to reflect comparison between target
and contrasting classes
e.g. count%

Table 5.7 Initial target class working relation (graduate student)
Name | Gender | Major | Birth_place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83

Table 5.8 Initial contrasting class working relation (undergraduate student)
Name | Gender | Major | Birth_place | Birth_date | Residence | Phone # | GPA
Bob Schumann | M | Chem | Calgary, Alt, Canada | 10-1-78 | 2642 Halifax St, Burnaby | 294-4291 | 2.96
Ammy Eau | F | Bio | Golden, BC, Canada | 30-3-76 | 463 Sunset Cres, Vancouver | 681-5417 | 3.52
… | … | … | … | … | … | … | …

Example: Analytical comparison (5)
Prime generalized relation for the target class: Graduate students
Major | Age_range | Gpa | Count%
Science | 20-25 | Good | 5.53%
Science | 26-30 | Good | 2.32%
Science | Over_30 | Very_good | 5.86%
… | … | … | …
Business | Over_30 | Excellent | 4.68%
Prime generalized relation for the contrasting class: Undergraduate students
Major | Age_range | Gpa | Count%
Science | 15-20 | Fair | 5.53%
Science | 15-20 | Good | 4.53%
… | … | … | …
Science | 26-30 | Good | 5.02%
… | … | … | …
Business | Over_30 | Excellent | 0.68%

Presentation - Generalized Relation

Presentation - Crosstab

Quantitative Characteristic Rules
Presentation of Class Characterization Descriptions
A logic rule that is associated with quantitative information is called a quantitative rule. It associates an interestingness measure, the t-weight, with each generalized tuple.
C_j = target class
q_a = a generalized tuple that covers some tuples of the class
◦but can also cover some tuples of the contrasting class
t-weight
◦range: [0, 1] or [0%, 100%]
$$t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)}$$

grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]

Quantitative Discriminant Rules
Presentation of Class Comparison Descriptions
The discriminative features of the target and contrasting classes can be described as a discriminant rule.
It associates an interestingness measure, the d-weight, with each generalized tuple.
C_j = target class
q_a = a generalized tuple that covers some tuples of the class
◦but can also cover some tuples of the contrasting class
d-weight
◦range: [0, 1]
$$d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$$

Example: Quantitative Discriminant Rule
Status | Birth_country | Age_range | Gpa | Count
Graduate | Canada | 25-30 | Good | 90
Undergraduate | Canada | 25-30 | Good | 210
Count distribution between graduate and undergraduate students for a generalized tuple
In the above example, suppose that the count distribution for birth_country = “Canada” and age_range = “25-30” and gpa = “good” is as shown in the table.
The d-weight would be 90/(90+210) = 30% with respect to the target class and
210/(90+210) = 70% with respect to the contrasting class.
That is, if a student was born in Canada, is 25 to 30 years old and has a good gpa, then based on the data there is a 30% probability that she/he is a graduate student versus a 70% probability that she/he is an undergraduate student.
Similarly, the d-weights for other tuples can also be derived.
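The d-weight arithmetic for this generalized tuple can be written as a short function. This is a sketch only; `d_weights` is a hypothetical helper name, and the counts are the 90 graduate and 210 undergraduate students from the table.

```python
def d_weights(counts_by_class):
    """d-weight of one generalized tuple for each class it covers."""
    total = sum(counts_by_class.values())
    return {cls: count / total for cls, count in counts_by_class.items()}

# Counts of the same generalized tuple in the target and contrasting class.
counts = {"graduate": 90, "undergraduate": 210}

weights = d_weights(counts)
for cls, weight in weights.items():
    print(f"{cls}: d-weight = {weight:.0%}")
```

A d-weight close to 100% for the target class means the tuple mostly describes that class; here the tuple is more typical of undergraduates.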

Example: Quantitative Discriminant Rule
A quantitative discriminant rule for the target class of a given comparison
is written in the form
∀X, target_class(X) ⇐ condition(X) [d: d_weight]
Based on the above, a discriminant rule for the target class graduate_student
can be written as
∀X, graduate_student(X) ⇐ birth_country(X) = “Canada” ∧ age_range(X) = “25-30” ∧ gpa(X) = “good” [d: 30%]
Note: The discriminant rule provides a sufficient condition, but not a necessary one,
for an object to be in the target class.
For example, the rule implies that if X satisfies the condition, then the probability that X
is a graduate student is 30%.

Location / Item | TV | Computer | Both_items
Europe | 80 | 240 | 320
North America | 120 | 560 | 680
Both_regions | 200 | 800 | 1000

A crosstab for the total number (count) of TVs and computers sold in thousands in 1999
To calculate the t-weight (typicality weight), the formula is
$$t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)}$$
For TVs within each location:
1. 80 / (80+240) = 25%
2. 120 / (120+560) = 17.65%
3. 200 / (200+800) = 20%
To calculate the d-weight (discriminant weight), the formula is
$$d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$$
For TVs across locations:
1. 80/(80+120) = 40%
2. 120/(80+120) = 60%
3. 200/(80+120) = 100%
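Both weights can be computed for every cell of the crosstab at once: the t-weight normalizes along a row (within one location) and the d-weight normalizes along a column (within one item). The sketch below assumes the crosstab is held in a nested dictionary; the names `sales`, `t_weight` and `d_weight` are hypothetical.

```python
# Units sold in thousands, from the crosstab: location -> item -> count.
sales = {
    "Europe": {"TV": 80, "Computer": 240},
    "North America": {"TV": 120, "Computer": 560},
}

def t_weight(location, item):
    """Share of an item within one location (row-wise normalization)."""
    return sales[location][item] / sum(sales[location].values())

def d_weight(location, item):
    """Share of a location within one item (column-wise normalization)."""
    return sales[location][item] / sum(row[item] for row in sales.values())

for location in sales:
    for item in sales[location]:
        print(f"{location:13} {item:8} "
              f"t = {t_weight(location, item):6.2%}  "
              f"d = {d_weight(location, item):6.2%}")
```

Keeping the two normalizations as separate functions makes the direction of each measure explicit, which is the usual source of confusion between them.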

Class Description
Quantitative characteristic rule
∀X, target_class(X) ⇒ condition(X) [t: t_weight]
◦necessary
Quantitative discriminant rule
∀X, target_class(X) ⇐ condition(X) [d: d_weight]
◦sufficient
Quantitative description rule
∀X, target_class(X) ⇔ condition_1(X) [t: w_1, d: w'_1] ∨ … ∨ condition_n(X) [t: w_n, d: w'_n]
◦necessary and sufficient

Example: Quantitative Description Rule
Location/item | TV count | TV t-wt | TV d-wt | Computer count | Computer t-wt | Computer d-wt | Both_items count | Both_items t-wt | Both_items d-wt
Europe | 80 | 25% | 40% | 240 | 75% | 30% | 320 | 100% | 32%
N_Am | 120 | 17.65% | 60% | 560 | 82.35% | 70% | 680 | 100% | 68%
Both_regions | 200 | 20% | 100% | 800 | 80% | 100% | 1000 | 100% | 100%

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of
TVs and computers sold at AllElectronics in 1998
To define a quantitative characteristic rule, we introduce the t-weight as an interestingness
measure that describes the typicality of each disjunct in the rule.
$$t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)}$$
•Quantitative description rule for target class Europe
∀X, Europe(X) ⇔ (item(X) = “TV”) [t: 25%, d: 40%] ∨ (item(X) = “computer”) [t: 75%, d: 30%]

Measuring the Central Tendency
Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
◦Weighted arithmetic mean
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Median: A holistic measure
◦Middle value if odd number of values, or average of the middle two
values otherwise
◦estimated by interpolation
$$median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$$
Mode
◦Value that occurs most frequently in the data
◦Unimodal, bimodal, trimodal
◦Empirical formula:
$$mean - mode = 3 \times (mean - median)$$
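The measures above are available in Python's standard `statistics` module. The sketch below uses a small hypothetical data set and equal weights (both are assumptions made for illustration) to show the mean, the weighted mean, the median for an even number of values, and a bimodal case.

```python
from statistics import mean, median, multimode

# Hypothetical data, already in increasing order.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
weights = [1] * len(values)  # equal weights, an assumption

arithmetic_mean = mean(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
middle = median(values)      # average of the two middle values for even n
modes = multimode(values)    # every value tied for the highest frequency

print("mean:", arithmetic_mean)
print("weighted mean:", weighted_mean)
print("median:", middle)
print("modes:", modes)
```

With equal weights the weighted mean coincides with the plain mean, and the two most frequent values make this data set bimodal.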

Measuring the Dispersion of Data
Quartiles, outliers and boxplots
◦Quartiles: Q_1 (25th percentile), Q_3 (75th percentile)
◦Inter-quartile range: IQR = Q_3 – Q_1
◦Five number summary: min, Q_1, M, Q_3, max
◦Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added,
and outliers are plotted individually
◦Outlier: usually, a value at least 1.5 x IQR above Q_3 or below Q_1
Variance and standard deviation
◦Variance s^2: (algebraic, scalable computation)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
◦Standard deviation s is the square root of variance s^2
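A short sketch of these dispersion measures follows, using a hypothetical data set. The quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption, since several conventions exist and the slides do not fix one. The variance is computed in the second, single-pass form, which needs only n, the sum of the values and the sum of their squares; this is what makes it easy to compute over partitions of a large database.

```python
from statistics import quantiles, variance

# Hypothetical data set.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(values)

# Quartiles under the "inclusive" convention (an assumption).
q1, med, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(values), q1, med, q3, max(values))

# Values more than 1.5 x IQR beyond the quartiles are flagged as outliers.
outliers = [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Sample variance from running sums only.
sum_x = sum(values)
sum_x2 = sum(x * x for x in values)
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
print("outliers:", outliers)
print("variance:", s2, "library value:", variance(values))
```

The running-sum form agrees with the library's definition-based result up to floating-point rounding.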

Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
◦Data is represented with a box
◦The ends of the box are at the first and third quartiles,
i.e., the height of the box is the IQR
◦The median is marked by a line within the box
◦Whiskers: two lines outside the box extend to
Minimum and Maximum

A Boxplot
[Figure: a boxplot]

Visualization of Data
Dispersion: Boxplot Analysis

Mining Descriptive Statistical Measures in
Large Databases
Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]$$
Standard deviation: the square root of the
variance
◦Measures spread about the mean
◦It is zero if and only if all the values are equal
◦Both the deviation and the variance are algebraic

Histogram Analysis
Graph displays of basic statistical class descriptions
◦Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of
the classes present in the given data

Quantile Plot
Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
Plots quantile information
◦For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i% of the data are below or equal to the
value x_i
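The points of a quantile plot can be generated by pairing each sorted value with its fraction f_i. The slide does not give a formula for f_i; the sketch below assumes the common convention f_i = (i - 0.5) / n, and both the function name and the sample values are hypothetical.

```python
def quantile_plot_points(data):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([56, 30, 70, 52])  # hypothetical values
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f for every point gives the quantile plot; the same pairs computed for two data sets are the raw material for a q-q plot.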

Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
Allows the user to view whether there is a shift in going from
one distribution to another

Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Loess Curve
Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing
parameter, and the degree of the polynomials that are fitted by
the regression

Graphic Displays of Basic Statistical Descriptions
Histogram:(shown before)
Boxplot:(covered before)
Quantile plot: each value x_i is paired with f_i, indicating that
approximately 100 f_i% of the data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles of
another
Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter
plot to provide better perception of the pattern of dependence

AO Induction vs. Learning-from-example
Paradigm
Difference in philosophies and basic assumptions
◦Positive and negative samples in learning-from-example:
positive samples are used for generalization, negative ones for specialization
◦Positive samples only in data mining:
hence generalization-based; drill-down backtracks the
generalization to a previous state
Difference in methods of generalization
◦Machine learning generalizes on a tuple-by-tuple basis
◦Data mining generalizes on an attribute-by-attribute basis

Comparison of Entire vs. Factored Version
Space

Incremental and Parallel Mining of
Concept Description
Incremental mining: revision based on newly added data
ΔDB
◦Generalize ΔDB to the same level of abstraction as the generalized
relation R to derive ΔR
◦Union R ∪ ΔR, i.e., merge counts and other statistical information
to produce a new relation R’
A similar philosophy can be applied to data sampling,
parallel and/or distributed mining, etc.
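Because a generalized relation is a set of tuples with counts, the incremental update reduces to counting the new data at the same abstraction level and adding the counts. A minimal sketch with `collections.Counter` follows; the two starting tuples reuse counts from the working relation W_0, while the batch of new tuples is hypothetical and assumed to be already generalized.

```python
from collections import Counter

# Existing generalized relation R: generalized tuple -> count.
R = Counter({
    ("Science", "20-25", "Very_good"): 16,
    ("Science", "25-30", "Excellent"): 47,
})

# Hypothetical batch of newly added tuples, already generalized to the
# same abstraction level as R.
delta_db = [
    ("Science", "20-25", "Very_good"),
    ("Engineering", "20-25", "Excellent"),
    ("Science", "20-25", "Very_good"),
]

delta_R = Counter(delta_db)  # generalize-and-count the new data only
R_new = R + delta_R          # union: merge counts of identical tuples

for generalized_tuple, count in R_new.items():
    print(generalized_tuple, count)
```

The old data is never rescanned, and the same merge works for partial relations produced by parallel workers or by samples.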

Summary
Concept description: characterization and discrimination
OLAP-based vs. attribute-oriented induction
Efficient implementation of AOI
Analytical characterization and comparison
Mining descriptive statistical measures in large databases
Discussion
◦Incremental and parallel mining of description
◦Descriptive mining of complex types of data