Data Mining Lecture: Gini Index, Entropy, and Information Gain
SajidHussainS · 17 slides · May 04, 2024
Slide Content
Slide 1
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Decision Tree Construction
Slide 2
Continuous Attributes: Computing Gini Index
Use binary decisions based on one value.
Several choices for the splitting value:
–Number of possible splitting values = number of distinct values.
Each splitting value has a count matrix associated with it:
–Class counts in each of the partitions, A < v and A ≥ v.
Simple method to choose the best v:
–For each v, scan the database to gather the count matrix and compute its Gini index.
–Computationally inefficient! Repetition of work.

Example data from the slide:

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Candidate split illustrated on the slide: Taxable Income > 80K? (Yes / No)
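As a concrete illustration of this simple method, here is a minimal Python sketch (my own code, not from the slides; function and variable names are assumptions) that rescans the whole dataset once per candidate value, exactly the repeated work the slide warns about:

```python
# A sketch of the brute-force method: for every candidate value v,
# rescan the database to build the count matrix for the partitions
# A <= v and A > v, then compute the weighted Gini of the split.

def gini(counts):
    """Gini index of a node: 1 - sum_j p_j^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split_naive(values, labels):
    """One full pass over the data per candidate -- inefficient."""
    classes = sorted(set(labels))
    n = len(values)
    best_v, best_g = None, float("inf")
    for v in sorted(set(values)):      # one candidate per distinct value
        left  = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        g = (len(left) / n) * gini([left.count(c) for c in classes]) \
          + (len(right) / n) * gini([right.count(c) for c in classes])
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g

# Taxable Income and Cheat columns from the slide's table
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_naive(income, cheat))   # (95, 0.3)
```

The returned Gini of 0.300 matches the minimum in the table on the next slide.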
Slide 3
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
–Sort the records on the attribute's values.
–Linearly scan these values, each time updating the count matrix and computing the Gini index.
–Choose the split position that has the least Gini index.

Sorted values and candidate split positions from the slide:

Cheat:           No  No  No  Yes  Yes  Yes  No   No   No   No
Taxable Income:  60  70  75  85   90   95   100  120  125  220

Split position:   55     65     72     80     87     92     97     110    122    172    230
Yes (<= / >):     0/3    0/3    0/3    0/3    1/2    2/1    3/0    3/0    3/0    3/0    3/0
No  (<= / >):     0/7    1/6    2/5    3/4    3/4    3/4    3/4    4/3    5/2    6/1    7/0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The least Gini index, 0.300, occurs at split position 97, so that split is chosen.
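A sketch of this efficient version (again my own code, not the authors'): sort once, then sweep left to right, moving one record at a time across the boundary and updating the class counts incrementally. Candidate positions here are midpoints between consecutive distinct values, matching the split-positions row of the table:

```python
# Efficient scan: O(n log n) sort followed by a single linear pass.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split_sorted(values, labels):
    classes = sorted(set(labels))
    records = sorted(zip(values, labels))      # sort on attribute value
    n = len(records)
    left = [0] * len(classes)                  # class counts for A <= v
    right = [sum(1 for _, y in records if y == c) for c in classes]
    best_v, best_g = None, float("inf")
    for i, (x, y) in enumerate(records[:-1]):
        j = classes.index(y)
        left[j] += 1                           # shift one record across
        right[j] -= 1
        if x == records[i + 1][0]:
            continue                           # no split between equal values
        v = (x + records[i + 1][0]) / 2        # midpoint candidate
        g = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_sorted(income, cheat))        # (97.5, 0.3)
```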
Slide 4
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
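For comparison, a small sketch (the helper name `impurities` is mine, not from the slides) that computes all three measures for a node's class counts:

```python
import math

def impurities(counts):
    """Gini, entropy (base 2), and misclassification error of a node."""
    n = sum(counts)
    probs = [c / n for c in counts]
    g = 1.0 - sum(p ** 2 for p in probs)
    h = -sum(p * math.log2(p) for p in probs if p > 0)   # 0 * log 0 = 0
    e = 1.0 - max(probs)
    return g, h, e

# e.g. a node with 3 records of one class and 7 of the other
print(impurities([3, 7]))   # (0.42, 0.8813..., 0.3)
```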
Slide 5
Entropy
Information is measured in bits.
–Given a probability distribution, the information required to predict an event is the distribution's entropy.
–Entropy gives the information required in bits (this can involve fractions of bits!).

Formula for computing the entropy:

$\text{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$
Slide 6
Entropy
Entropy at a given node t:

$\text{Entropy}(t) = -\sum_j p(j \mid t) \log p(j \mid t)$

(NOTE: $p(j \mid t)$ is the relative frequency of class j at node t.)
–Measures homogeneity of a node:
–Maximum ($\log n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying least information.
–Minimum (0.0) when all records belong to one class, implying most information.
–Entropy-based computations are similar to the GINI index computations.
Slide 7
Examples for computing Entropy, using

$\text{Entropy}(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = $-0 \log 0 - 1 \log 1 = -0 - 0 = 0$

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = $-(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = $-(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92$
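These three examples are easy to verify in code. The sketch below (my own helper, not from the slides) applies the formula above to the class counts:

```python
import math

def node_entropy(counts):
    """Entropy of a node from its class counts, base 2; 0*log 0 = 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(node_entropy([0, 6]))   # 0.0
print(node_entropy([1, 5]))   # 0.650...
print(node_entropy([2, 4]))   # 0.918... (the slide rounds to 0.92)
```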
Slide 8
Information Gain
Information Gain:

$\text{GAIN}_{\text{split}} = \text{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, \text{Entropy}(i)$

Parent node p is split into k partitions; $n_i$ is the number of records in partition i.
–Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
–Used in ID3 and C4.5.
–Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
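A direct transcription of this formula into Python (a sketch under the assumption that each partition is given as a list of class counts; the names are mine):

```python
import math

def node_entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, partition_counts):
    """Entropy of the parent minus the weighted entropy of the k children."""
    n = sum(parent_counts)
    children = sum((sum(p) / n) * node_entropy(p) for p in partition_counts)
    return node_entropy(parent_counts) - children

# parent [9 yes, 5 no] split into three partitions (the weather example below)
print(gain_split([9, 5], [[2, 3], [4, 0], [3, 2]]))   # 0.2467... ~ 0.247
```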
Slide 9
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: "Outlook" is the weather forecast; it has no relation to the Microsoft email program.
Slide 10
Which attribute to select?
Slide 11
Example: attribute “Outlook”
"Outlook" = "Sunny":
$\text{info}([2,3]) = \text{entropy}(2/5, 3/5) = -\tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{3}{5}\log\tfrac{3}{5} = 0.971$ bits

"Outlook" = "Overcast":
$\text{info}([4,0]) = \text{entropy}(1, 0) = -1 \log 1 - 0 \log 0 = 0$ bits

"Outlook" = "Rainy":
$\text{info}([3,2]) = \text{entropy}(3/5, 2/5) = -\tfrac{3}{5}\log\tfrac{3}{5} - \tfrac{2}{5}\log\tfrac{2}{5} = 0.971$ bits

Expected information for the attribute:
$\text{info}([2,3], [4,0], [3,2]) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693$ bits

Note: log(0) is not defined, but we evaluate 0 × log(0) as zero.
Slide 12
Computing the information gain
Information gain = (information before split) − (information after split)

Information gain for the attributes from the weather data:
$\text{gain}(\text{Outlook}) = \text{info}([9,5]) - \text{info}([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247$ bits
$\text{gain}(\text{Temperature}) = 0.029$ bits
$\text{gain}(\text{Humidity}) = 0.152$ bits
$\text{gain}(\text{Windy}) = 0.048$ bits
witten&eibe
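All four gains can be reproduced from the weather table's class counts. The following sketch is my own code; the [yes, no] counts per attribute value are read off the table on slide 9:

```python
import math

# Same helpers as in the earlier sketches.
def node_entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, partition_counts):
    n = sum(parent_counts)
    info = sum((sum(p) / n) * node_entropy(p) for p in partition_counts)
    return node_entropy(parent_counts) - info

parent = [9, 5]                                   # 9 Yes, 5 No overall
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],      # sunny, overcast, rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],      # hot, mild, cool
    "Humidity":    [[3, 4], [6, 1]],              # high, normal
    "Windy":       [[6, 2], [3, 3]],              # false, true
}
for name, parts in splits.items():
    print(f"gain({name}) = {gain_split(parent, parts):.3f} bits")
# gain(Outlook) = 0.247, gain(Temperature) = 0.029,
# gain(Humidity) = 0.152, gain(Windy) = 0.048
```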
Slide 13
Continuing to split (within the Outlook = "sunny" branch, whose five records give the figures below):

$\text{gain}(\text{Temperature}) = 0.571$ bits
$\text{gain}(\text{Humidity}) = 0.971$ bits
$\text{gain}(\text{Windy}) = 0.020$ bits
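A quick check of these three numbers, restricted to the five sunny records ([2 yes, 3 no]); the counts per attribute value are mine, read off the weather table:

```python
import math

# Same helpers as in the earlier sketches.
def node_entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, partition_counts):
    n = sum(parent_counts)
    info = sum((sum(p) / n) * node_entropy(p) for p in partition_counts)
    return node_entropy(parent_counts) - info

sunny = [2, 3]                                               # 2 Yes, 3 No
print(f"{gain_split(sunny, [[0, 2], [1, 1], [1, 0]]):.3f}")  # Temperature: 0.571
print(f"{gain_split(sunny, [[0, 3], [2, 0]]):.3f}")          # Humidity:    0.971
print(f"{gain_split(sunny, [[1, 2], [1, 1]]):.3f}")          # Windy:       0.020
```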
Slide 14
The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data can't be split any further.
Slide 15
Highly-branching attributes
Problematic: attributes with a large number of values (extreme case: ID code).
–Subsets are more likely to be pure if there is a large number of values.
–Information gain is biased towards choosing attributes with a large number of values.
–This may result in overfitting (selection of an attribute that is non-optimal for prediction).
Slide 16
Weather Data with ID code
ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No
Slide 17
Split for ID Code Attribute
Entropy of the split = 0 (since each leaf node is "pure", containing only one case).
Information gain is therefore maximal for the ID code attribute.
witten&eibe
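The sketch below (my own illustration, not from the slides) makes the bias concrete: splitting on the ID code yields 14 pure single-record partitions, so the split entropy is 0 and the gain equals the full dataset entropy of 0.940 bits, the maximum possible:

```python
import math

# Same helpers as in the earlier sketches.
def node_entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, partition_counts):
    n = sum(parent_counts)
    info = sum((sum(p) / n) * node_entropy(p) for p in partition_counts)
    return node_entropy(parent_counts) - info

# One pure single-record partition per ID, A..N (9 Yes, 5 No overall)
id_partitions = [[1, 0]] * 9 + [[0, 1]] * 5
print(f"{gain_split([9, 5], id_partitions):.3f} bits")   # 0.940, the maximum
```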