UNIT -III
Concept Description:
Characterization and Comparison
By Mrs. Chetana
UNIT -III
•Concept Description: Characterization and Comparison:
Data Generalization and Summarization-Based Characterization,
Analytical Characterization: Analysis of Attribute Relevance,
Mining Class Comparisons: Discriminating between Different
Classes, Mining Descriptive Statistical Measures in Large
Databases.
•Applications:
Telecommunication Industry, Social Network Analysis, Intrusion
Detection
Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between
different classes
Mining descriptive statistical measures in large databases
Discussion
Summary
What is Concept Description?
From a data analysis point of view, data mining can be
classified into two categories:
descriptive mining and predictive mining
◦Descriptive mining: describes the data set in a concise and
summarative manner and presents interesting general
properties of the data
◦Predictive mining: analyzes the data in order to construct
one or a set of models, and attempts to predict the behavior
of new data sets
What is Concept Description?
Databases usually store large amounts of data in great
detail.
However, users often like to view sets of summarized
data in concise, descriptive terms.
Such data descriptions may provide an overall picture of
a class of data or distinguish it from a set of comparative
classes.
Such descriptive data mining is called
concept description and forms an important
component of data mining
What is Concept Description?
The simplest kind of descriptive data mining is called
concept description.
A concept usually refers to a collection of data such as
frequent_buyers, graduate_students and so on.
As a data mining task, concept description is not a simple
enumeration of the data. Instead, concept description
generates descriptions for characterization and
comparison of the data.
It is sometimes called class description, when the concept to be
described refers to a class of objects
◦Characterization: provides a concise and brief summarization of the
given collection of data
◦Comparison: provides descriptions comparing two or more
collections of data
Concept Description vs. OLAP
Concept description:
◦can handle complex data types of the attributes and
their aggregations
◦a more automated process
OLAP:
◦restricted to a small number of dimension and measure
types
◦user-controlled process
Data Generalization and Summarization-
based Characterization
Data and objects in databases contain detailed information at the primitive
concept level.
For example, the item relation in a sales database may contain attributes
describing low-level item information such as item_ID, name, brand,
category, supplier, place_made and price.
It is useful to be able to summarize a large set of data and present it at a
high conceptual level.
For example, summarizing a large set of items relating to Christmas season
sales provides a general description of such data, which can be very
helpful for sales and marketing managers.
This requires an important functionality called data generalization
Data Generalization and Summarization-
based Characterization
Data generalization
◦A process which abstracts a large set of task-relevant data
in a database from a low conceptual level to higher ones.
◦Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach
[Figure: conceptual levels 1 to 5]
Characterization: Data Cube Approach
(without using AO-Induction)
Perform computations and store results in data cubes
Strength
◦An efficient implementation of data generalization
◦Computation of various kinds of measures
e.g., count(), sum(), average(), max()
◦Generalization and specialization can be performed on a data cube
by roll-up and drill-down
Limitations
◦Handles only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
◦Lack of intelligent analysis: it cannot tell which dimensions should be
used and what levels the generalization should reach
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How is it done?
◦Collect the task-relevant data (initial relation) using a relational
database query.
◦Perform generalization by attribute removal or attribute
generalization.
◦Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
◦This reduces the size of the generalized data set.
◦Present the results interactively to users.
Basic Principles of
Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result
is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct
values for A but
(1) there is no generalization operator on A, or
(2) A’s higher level concepts are expressed in terms of other attributes.
Attribute generalization: if there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.
Attribute-threshold control: typical 2-8, specified/default.
Generalized relation threshold control (10-30): control the final
relation/rule size.
Basic Algorithm for Attribute-Oriented Induction
InitialRel:
Query processing of task-relevant data, deriving the initial relation.
PreGen:
Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute: removal? or how high to
generalize?
PrimeGen:
Based on the PreGen plan, perform generalization to the right level to
derive a “prime generalized relation”, accumulating the counts.
Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping
into rules, cross tabs, visualization presentations.
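The generalize-then-merge core of the algorithm can be sketched in Python. This is a minimal illustration, not the original implementation: the concept hierarchy `MAJOR_TO_FIELD`, the GPA cut-offs and the helper names are all assumptions made for the example, and the three rows are sample student tuples.

```python
from collections import Counter

# Hypothetical concept hierarchy for major (an assumption for illustration).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Chem": "Science"}

def generalize_gpa(gpa):
    """Map a numeric GPA to a coarse label; the cut-offs are assumed."""
    if gpa >= 3.75:
        return "Excellent"
    if gpa >= 3.5:
        return "Very_good"
    return "Good"

def birth_region(birth_place):
    """Generalize 'city, province, country' to Canada or Foreign."""
    country = birth_place.split(",")[-1].strip()
    return "Canada" if country == "Canada" else "Foreign"

def attribute_oriented_induction(initial_relation):
    """Generalize each tuple, then merge identical tuples and count them."""
    counts = Counter()
    for row in initial_relation:
        generalized = (
            row["gender"],                              # retained as-is
            MAJOR_TO_FIELD.get(row["major"], "Other"),  # attribute generalization
            birth_region(row["birth_place"]),           # attribute generalization
            generalize_gpa(row["gpa"]),                 # attribute generalization
        )                                               # name is removed
        counts[generalized] += 1
    return counts

initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "gpa": 3.83},
]

prime_relation = attribute_oriented_induction(initial_relation)
for generalized_tuple, count in prime_relation.items():
    print(generalized_tuple, count)
```

The two male CS students collapse into one generalized tuple with count 2, which is the merging step that shrinks the relation.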
Example
DMQL: Describe general characteristics of graduate
students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in ('Msc', 'MBA', 'PhD')
Class Characterization: An Example
Initial relation:
Name            Gender  Major    Birth_place            Birth_date  Residence                 Phone#    GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …
Generalization applied to each attribute:
Name: removed; Gender: retained; Major: generalized to Sci, Eng, Bus; Birth_place: generalized to country;
Birth_date: generalized to age range; Residence: generalized to city; Phone#: removed; GPA: generalized to Excl, VG, ..
Prime generalized relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …
Crosstab of Gender by Birth_region:
Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
Presentation of Generalized Results
Generalized relation:
◦Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.
Cross tabulation:
◦Mapping results into cross tabulation form (similar to contingency tables).
Visualization techniques:
◦Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
◦Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
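As a worked check, the two t-weights in this rule appear to come from the male row of the Gender by Birth_region crosstab in the example (16 born in Canada, 14 foreign, 30 in total). The subscripted names below are only labels chosen for this sketch:

```latex
\[
t_{\text{Canada}} = \frac{16}{16 + 14} \approx 53\%, \qquad
t_{\text{foreign}} = \frac{14}{16 + 14} \approx 47\%
\]
```

The t-weights of the disjuncts in a characteristic rule sum to 100% because together they cover every tuple of the target class.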
Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining query
◦Facilitate efficient drill-down analysis
◦May increase the response time
◦A balanced solution: precomputation of “subprime” relation
Use a predefined & precomputed data cube
◦Construct a data cube beforehand
◦Facilitate not only the attribute-oriented induction, but also attribute
relevance analysis, dicing, slicing, roll-up and drill-down
◦Cost of cube computation and the nontrivial storage overhead
Attribute Relevance Analysis
Why?
◦Which dimensions should be included?
◦How high level of generalization?
◦Automatic vs. interactive
◦Reduce number of attributes; easy to understand patterns
What?
◦statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
◦relevance related to dimensions and levels
◦analytical characterization, analytical comparison
Steps for Attribute relevance analysis
Data Collection :
Collect data for both the target class and the contrasting class by query processing
Preliminary relevance analysis using conservative AOI:
•This step identifies a set of dimensions and attributes on which the selected relevance
measure is to be applied.
•The relation obtained by such an application of AOI is called the candidate relation of
the mining task.
Remove irrelevant and weakly relevant attributes using the selected
relevance analysis:
•We evaluate each attribute in the candidate relation using the selected relevance
analysis measure.
•This step results in an initial target class working relation and initial contrasting class
working relation.
Generate the concept description using AOI:
•Perform AOI using a less conservative set of attribute generalization thresholds.
•If the descriptive mining task is
class characterization, only the ITCWR (Initial Target Class Working Relation) is included;
class comparison, both the ITCWR and the ICCWR (Initial Contrasting Class Working Relation) are
included
Relevance Measures
Quantitative relevance measure determines the
classifying power of an attribute within a set of data.
Methods
◦information gain (ID3)
◦gain ratio (C4.5)
◦gini index
◦χ² contingency table statistics
◦uncertainty coefficient
Entropy and Information Gain
S contains s_i tuples of class C_i, for i = 1, …, m
Information measure: the expected info required to classify any arbitrary
tuple
$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
Entropy of attribute A with values {a_1, a_2, …, a_v}
$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$
Information gained by branching on attribute A
$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
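The three quantities above translate directly into a few lines of Python. This is a sketch under stated assumptions: the function names and the toy counts are hypothetical, and each partition is given as a list of per-class counts for one attribute value a_j.

```python
from math import log2

def info(class_counts):
    """I(s_1, ..., s_m): expected information to classify one tuple."""
    total = sum(class_counts)
    # Classes with zero tuples contribute nothing (0 * log 0 is taken as 0).
    return -sum((s / total) * log2(s / total) for s in class_counts if s > 0)

def entropy(partitions):
    """E(A): info of each partition, weighted by the partition's share of S."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return info(class_counts) - entropy(partitions)

# Toy data (assumed): two classes of 8 tuples, attribute with two values.
separating = gain([8, 8], [[8, 0], [0, 8]])      # each value holds one class
uninformative = gain([8, 8], [[4, 4], [4, 4]])   # values mirror the class mix
print(separating, uninformative)
```

An attribute whose values split the classes cleanly recovers the full one bit of information, while one whose partitions mirror the overall class mix gains nothing.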
Example: Analytical Characterization
Task
◦Mine general characteristics describing graduate students using
analytical characterization
Given
◦Attributes :
name, gender, major, birth_place, birth_date, phone#, and gpa
◦Gen(a_i) = concept hierarchies on a_i
◦U_i = attribute analytical thresholds for a_i
◦T_i = attribute generalization thresholds for a_i
◦R = attribute relevance threshold
Eg: Analytical Characterization (cont’d)
1. Data collection
◦target class: graduate student
◦contrasting class: undergraduate student
2. Analytical generalization using U_i
◦attribute removal
remove name and phone#
◦attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
◦candidate relation (large attribute generalization threshold):
gender, major, birth_country, age_range and gpa
Example: Analytical characterization (2)
Candidate relation for Target class: Graduate students (=120)
gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18
Candidate relation for Contrasting class: Undergraduate students (=130)
gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24
Eg: Analytical characterization (3)
3. Relevance analysis
◦Calculate expected info required to classify an arbitrary tuple
$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$
◦Calculate entropy of each attribute: e.g. major
For major = “Science”: s_11 = 84, s_21 = 42, I(s_11, s_21) = 0.9183
For major = “Engineering”: s_12 = 36, s_22 = 46, I(s_12, s_22) = 0.9892
For major = “Business”: s_13 = 0, s_23 = 42, I(s_13, s_23) = 0
Here s_13 is the number of grad students in “Business” and s_23 is the number of undergrad
students in “Business”.
Example: Analytical Characterization (4)
Calculate expected info required to classify a given sample if S is
partitioned according to the attribute
$E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$
Calculate information gain for each attribute
$Gain(major) = I(s_1, s_2) - E(major) = 0.2115$
◦Information gain for all attributes
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
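The gain for major can be recomputed from the two candidate relations. The sketch below hard-codes the twelve rows of those tables; the tuple layout, the `MAJOR` index constant and the helper names are assumptions made for illustration.

```python
from collections import defaultdict
from math import log2

# Rows: (gender, major, birth_country, age_range, gpa, count)
GRAD = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
UNDERGRAD = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]
MAJOR = 1  # column index of the attribute under analysis

def info(counts):
    """Expected information for a list of per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(attr_index):
    """Information gain of one attribute over the grad/undergrad split."""
    per_value = defaultdict(lambda: [0, 0])   # value -> [grad, undergrad]
    for class_id, relation in enumerate((GRAD, UNDERGRAD)):
        for row in relation:
            per_value[row[attr_index]][class_id] += row[-1]
    class_totals = [sum(r[-1] for r in GRAD), sum(r[-1] for r in UNDERGRAD)]
    total = sum(class_totals)
    expected = sum(sum(c) / total * info(c) for c in per_value.values())
    return info(class_totals) - expected

gain_major = gain(MAJOR)
print(f"Gain(major) is about {gain_major:.3f}")
```

Up to rounding of intermediate values, this agrees with the Gain(major) figure on the slide; passing another column index gives the gain of the corresponding attribute.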
Eg: Analytical characterization (5)
4. Initial working relation (W_0) derivation
◦R = 0.1 (attribute relevance threshold value)
◦remove irrelevant/weakly relevant attributes from candidate relation =>
drop gender, birth_country
◦remove contrasting class candidate relation
5. Perform attribute-oriented induction on W_0 using T_i
Initial target class working relation W_0: Graduate students
major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18
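Step 4 can be sketched as a filter followed by a merge. The gains and the threshold R = 0.1 are the values quoted in this example; treating "relevant" as gain >= R and the variable names are assumptions of this sketch.

```python
from collections import Counter

ATTRIBUTES = ("gender", "major", "birth_country", "age_range", "gpa")
GAINS = {"gender": 0.0003, "birth_country": 0.0407, "major": 0.2115,
         "gpa": 0.4490, "age_range": 0.5971}
R = 0.1  # attribute relevance threshold

# Target class candidate relation; the last field is the count.
TARGET_CANDIDATE = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]

# Keep only attributes whose gain reaches the threshold (assumed: >= R).
kept = [a for a in ATTRIBUTES if GAINS[a] >= R]
kept_idx = [ATTRIBUTES.index(a) for a in kept]

# Project onto the kept attributes and merge tuples that became identical.
w0 = Counter()
for row in TARGET_CANDIDATE:
    w0[tuple(row[i] for i in kept_idx)] += row[-1]

print(kept)
for key, count in w0.items():
    print(key, count)
```

Dropping gender and birth_country makes the two Science, 25-30, Excellent tuples identical, so their counts 22 and 25 merge into 47.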
Analytical Characterization :
Example of Entropy & Information Gain
Class Comparisons Methods and Implementations
Data collection: The set of relevant data in the database is collected by query
processing and is partitioned into a target class and a contrasting class.
Dimension relevance analysis: If there are many dimensions and analytical
comparison is desired, then dimension relevance analysis should be performed on
these classes and only the highly relevant dimensions are included in the further
analysis.
Synchronous generalization: Generalization is performed on the target class to the
level controlled by a user- or expert-specified dimension threshold, which results in a
prime target class relation/cuboid. The concepts in the contrasting class(es) are
generalized to the same level as those in the prime target class relation/cuboid,
forming the prime contrasting class relation/cuboid.
Presentation of the derived comparison: The resulting class comparison
description can be visualized in the form of tables, graphs and rules. This
presentation usually includes a “contrasting” measure (such as count%) that reflects
the comparison between the target and contrasting classes.
Example: Analytical comparison
Task
◦Compare graduate and undergraduate students using discriminant
rule.
◦DMQL query
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Example: Analytical comparison (2)
Given
◦attributes name, gender, major, birth_place,
birth_date, residence, phone#and gpa
◦Gen(a_i) = concept hierarchies on attributes a_i
◦U_i = attribute analytical thresholds for attributes a_i
◦T_i = attribute generalization thresholds for
attributes a_i
◦R = attribute relevance threshold
Example: Analytical comparison (3)
1. Data collection
◦target and contrasting classes
2. Attribute relevance analysis
◦remove attributes name, gender, major, phone#
3. Synchronous generalization
◦controlled by user-specified dimension thresholds
◦prime target and contrasting class(es) relations/cuboids
Example: Analytical comparison (4)
4. Drill down, roll up and other OLAP operations on target and
contrasting classes to adjust levels of abstractions of resulting
description
5. Presentation
◦as generalized relations, crosstabs, bar charts, pie charts, or
rules
◦contrasting measures to reflect comparison between target
and contrasting classes
e.g. count%
Table 5.7 Initial target class working relation (graduate students)
Name            Gender  Major    Birth_place            Birth_date  Residence                    Phone#    GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond      687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond       253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby     420-5232  3.83
…               …       …        …                      …           …                            …         …
Table 5.8 Initial contrasting class working relation (undergraduate students)
Name            Gender  Major    Birth_place            Birth_date  Residence                    Phone#    GPA
Bob Schumann    M       Chem     Calgary, Alt, Canada   10-1-78     2642 Halifax St., Burnaby    294-4291  2.96
Ammy Eau        F       Bio      Golden, BC, Canada     30-3-76     463 Sunset Cres., Vancouver  681-5417  3.52
…               …       …        …                      …           …                            …         …
Example: Analytical comparison (5)
Prime generalized relation for the target class: Graduate students
Major     Age_range  Gpa        Count%
Science   20-25      Good       5.53%
Science   26-30      Good       2.32%
Science   Over_30    Very_good  5.86%
…         …          …          …
Business  Over_30    Excellent  4.68%
Prime generalized relation for the contrasting class: Undergraduate students
Major     Age_range  Gpa        Count%
Science   15-20      Fair       5.53%
Science   15-20      Good       4.53%
…         …          …          …
Science   26-30      Good       5.02%
…         …          …          …
Business  Over_30    Excellent  0.68%
Presentation-Generalized Relation
Presentation -Crosstab
Quantitative Characteristic Rules
Presentation of Class Characterization Descriptions
A logic rule that is associated with quantitative information is called a
quantitative rule. It associates an interestingness measure, the t-weight, with
each tuple.
C_j = target class
q_a = a generalized tuple that covers some tuples of the target class
◦but can also cover some tuples of the contrasting class
t-weight
◦range: [0, 1] or [0%, 100%]
Quantitative Discriminant Rules
Presentation of Class Comparison Descriptions
The discriminative features of the target and contrasting classes can
be described as a discriminant rule.
It associates an interestingness measure, the d-weight, with each tuple.
C_j = target class
q_a = a generalized tuple that covers some tuples of the target class
◦but can also cover some tuples of the contrasting class
d-weight
◦range: [0, 1]
$d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$
Example: Quantitative Discriminant Rule
Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210
Count distribution between graduate and undergraduate students for a generalized
tuple
In the above example, suppose that the count distribution for the generalized tuple
birth_country = “Canada” and age_range = “25-30” and gpa = “good” is as shown in the table.
The d-weight would be 90/(90+210) = 30% with respect to the target class and
the d-weight would be 210/(90+210) = 70% with respect to the contrasting class.
That is, if a student was born in Canada, is 25 to 30 years old and has a good GPA, then
based on the data, there is a 30% probability that she/he is a graduate student versus a
70% probability that she/he is an undergraduate student.
Similarly, the d-weights for the other tuples can be derived.
Example: Quantitative Discriminant Rule
A quantitative discriminant rule for the target class of a given comparison
is written in the form
∀X, target_class(X) ⇐ condition(X) [d: d_weight]
Based on the above, a discriminant rule for the target class graduate_student
can be written as
∀X, graduate_student(X) ⇐ birth_country(X) = “Canada” ∧ age_range(X) = “25-30” ∧ gpa(X) = “good” [d: 30%]
Note: The discriminant rule provides a sufficient condition, but not a necessary one,
for an object to be in the target class.
For example, the rule implies that if X satisfies the condition, then the probability that X
is a graduate student is 30%.
Location / Item  TV   Computer  Both_items
Europe           80   240       320
North America    120  560       680
Both_regions     200  800       1000
A crosstab for the total number (count) of TVs and computers sold in thousands in 1999
To calculate the t-weight (typicality weight)
The formula is
$t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)}$
For TVs:
1. Europe: 80 / (80+240) = 25%
2. North America: 120 / (120+560) = 17.65%
3. Both regions: 200 / (200+800) = 20%
To calculate the d-weight (discriminability weight)
The formula is
$d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$
For TVs:
1. Europe: 80/(80+120) = 40%
2. North America: 120/(80+120) = 60%
3. Both regions: 200/(80+120) = 100%
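Both weights can be computed for every cell of the crosstab with two small functions. This is a minimal sketch: the `SALES` dictionary layout and the function names are assumptions, with each region treated as a class and each item as a generalized tuple.

```python
# Counts in thousands, copied from the crosstab.
SALES = {
    "Europe": {"TV": 80, "Computer": 240},
    "North America": {"TV": 120, "Computer": 560},
}

def t_weight(region, item):
    """Share of the region's sales that this item accounts for (row-wise)."""
    return SALES[region][item] / sum(SALES[region].values())

def d_weight(region, item):
    """Share of the item's sales that fall in this region (column-wise)."""
    return SALES[region][item] / sum(row[item] for row in SALES.values())

for region in SALES:
    for item in SALES[region]:
        print(f"{region:13} {item:8} "
              f"t={t_weight(region, item):.2%} d={d_weight(region, item):.2%}")
```

The t-weight normalizes within a class (a row of the crosstab), while the d-weight normalizes one tuple across classes (a column).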
Crosstab showing associated t-weight, d-weight values and total number (in thousands) of
TVs and computers sold at AllElectronics in 1998
To define a quantitative characteristic rule, we introduce the t-weight as an interestingness
measure that describes the typicality of each disjunct in the rule.
$t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)}$
∀X, Europe(X) ⇔ (item(X) = “TV”) [t: 25%, d: 40%] ∨ (item(X) = “computer”) [t: 75%, d: 30%]
Measuring the Central Tendency
Mean
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
◦Weighted arithmetic mean
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: A holistic measure
◦Middle value if odd number of values, or average of the middle two
values otherwise
◦estimated by interpolation
$median = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{median}} \right) c$
Mode
◦Value that occurs most frequently in the data
◦Unimodal, bimodal, trimodal
◦Empirical formula:
$mean - mode = 3 \times (mean - median)$
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
◦Quartiles: Q_1 (25th percentile), Q_3 (75th percentile)
◦Inter-quartile range: IQR = Q_3 − Q_1
◦Five number summary: min, Q_1, M, Q_3, max
◦Boxplot: ends of the box are the quartiles, median is marked, whiskers are added,
and outliers are plotted individually
◦Outlier: usually, a value more than 1.5 × IQR above Q_3 or below Q_1
Variance and standard deviation
◦Variance s^2 (algebraic, scalable computation):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
◦Standard deviation s is the square root of the variance s^2
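The five-number summary and the 1.5 × IQR outlier rule can be sketched with the standard library. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles`, and the sample data and function names are invented for illustration.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]   # assumed sample with one extreme value
print(five_number_summary(data))
print(outliers(data))
```

A boxplot drawn from this summary would plot the flagged values as individual points.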
Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
◦Data is represented with a box
◦The ends of the box are at the first and third quartiles,
i.e., the height of the box is the IQR
◦The median is marked by a line within the box
◦Whiskers: two lines outside the box extend to
Minimum and Maximum
[Figure: a boxplot]
Visualization of Data
Dispersion: Boxplot Analysis
Mining Descriptive Statistical Measures in
Large Databases
Variance
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2 \right]$
Standard deviation: the square root of the
variance
◦Measures spread about the mean
◦It is zero if and only if all the values are equal
◦Both the standard deviation and the variance are algebraic
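Because the variance is algebraic, it can be assembled from three running totals (n, Σx, Σx²) that are computed per data partition and then added together. The sketch below illustrates this with two invented chunks; the function names are hypothetical. For values with a large mean relative to their spread, this formula can lose floating-point precision, so treat it as an illustration rather than a robust implementation.

```python
from statistics import variance

def partial_sums(chunk):
    """Per-partition summary: (n, sum of x, sum of x squared)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def merge(a, b):
    """Combine two partition summaries by component-wise addition."""
    return tuple(x + y for x, y in zip(a, b))

def sample_variance(n, sum_x, sum_x2):
    """s^2 = (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

chunk_a = [2.0, 4.0, 4.0, 4.0]   # assumed data held in one partition
chunk_b = [5.0, 5.0, 7.0, 9.0]   # assumed data held in another partition

n, sum_x, sum_x2 = merge(partial_sums(chunk_a), partial_sums(chunk_b))
s2 = sample_variance(n, sum_x, sum_x2)
print(s2, variance(chunk_a + chunk_b))
```

Only the three totals need to be exchanged between partitions, never the raw data.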
Histogram Analysis
Graph displays of basic statistical class descriptions
◦Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of
the classes present in the given data
Quantile Plot
Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
Plots quantile information
◦For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i% of the data are below or equal to the
value x_i
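The points of a quantile plot can be generated as (f_i, x_i) pairs. The slide does not say how f_i is computed; the sketch below assumes the common convention f_i = (i − 0.5)/n, and the function name and sample data are invented for illustration.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

points = quantile_points([40, 10, 30, 20])
for f, x in points:
    print(f"{f:.3f} {x}")
```

Plotting x_i against f_i gives the quantile plot; any plotting library can draw the resulting pairs.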
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
Allows the user to view whether there is a shift in going from
one distribution to another
Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Loess Curve
Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing
parameter, and the degree of the polynomials that are fitted by
the regression
Graphic Displays of Basic Statistical Descriptions
Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value x_i is paired with f_i indicating that
approximately 100 f_i% of data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles of
another
Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter
plot to provide better perception of the pattern of dependence
AO Induction vs. Learning-from-example
Paradigm
Difference in philosophies and basic assumptions
◦Positive and negative samples in learning-from-example:
positive used for generalization, negative -for specialization
◦Positive samples only in data mining:
hence generalization-based, to drill-down backtrack the
generalization to a previous state
Difference in methods of generalizations
◦Machine learning generalizes on a tuple by tuple basis
◦Data mining generalizes on an attribute by attribute basis
Comparison of Entire vs. Factored Version
Space
Incremental and Parallel Mining of
Concept Description
Incremental mining: revision based on newly added data
ΔDB
◦Generalize ΔDB to the same level of abstraction in the generalized
relation R to derive ΔR
◦Union R ∪ ΔR, i.e., merge counts and other statistical information
to produce a new relation R’
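The merge step of incremental mining can be sketched with `collections.Counter`, whose addition sums the counts of identical generalized tuples. The tuples and counts below are invented for illustration; `delta_R` stands for the relation obtained by generalizing only the newly added data to the same abstraction level as `R`.

```python
from collections import Counter

# Existing generalized relation R: generalized tuple -> count (assumed values).
R = Counter({
    ("Science", "20-25", "Excellent"): 21,
    ("Engineering", "25-30", "Excellent"): 18,
})

# Generalized relation derived from the newly added data only (assumed values).
delta_R = Counter({
    ("Science", "20-25", "Excellent"): 3,
    ("Business", "20-25", "Good"): 2,
})

# Union with count merging: identical tuples add up, new tuples are appended.
R_new = R + delta_R
for generalized_tuple, count in R_new.items():
    print(generalized_tuple, count)
```

The same merge applies when partial relations come from data samples or from parallel workers, since only counts need to be combined.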
Similar philosophy can be applied to data sampling,
parallel and/or distributed mining, etc.
Summary
Concept description: characterization and discrimination
OLAP-based vs. attribute-oriented induction
Efficient implementation of AOI
Analytical characterization and comparison
Mining descriptive statistical measures in large databases
Discussion
◦Incremental and parallel mining of description
◦Descriptive mining of complex types of data