Data mining lecture: Gini index, entropy, and decision tree construction

Slide Content

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Decision Tree Construction

Continuous Attributes: Computing Gini Index
Use binary decisions based on one splitting value v.
Several choices for the splitting value:
– Number of possible splitting values = number of distinct values.
Each splitting value has a count matrix associated with it:
– Class counts in each of the partitions, A < v and A >= v.
Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index (a code sketch follows after the example table below).
– Computationally inefficient! The same work is repeated for every candidate value.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Figure: binary split node "Taxable Income > 80K?" with Yes and No branches]
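A minimal Python sketch of this brute-force scan, assuming the example table is held in memory as parallel lists (the names gini, gini_of_split, incomes, and classes are illustrative, not from the slides):

from collections import Counter

def gini(counts):
    """Gini index of a class-count vector: 1 - sum of squared class proportions."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_of_split(values, labels, v):
    """Weighted Gini of the binary partition A < v versus A >= v."""
    left = Counter(l for x, l in zip(values, labels) if x < v)
    right = Counter(l for x, l in zip(values, labels) if x >= v)
    n = len(values)
    return (sum(left.values()) / n) * gini(left.values()) + \
           (sum(right.values()) / n) * gini(right.values())

# Taxable Income and Cheat columns from the example table
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
classes = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

# Brute force: every distinct value is a candidate split, and each candidate
# triggers a full pass over the data -- the inefficiency noted above.
best_v = min(set(incomes), key=lambda v: gini_of_split(incomes, classes, v))
print(best_v, round(gini_of_split(incomes, classes, best_v), 3))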

Continuous Attributes: Computing Gini Index...
For efficient computation, for each continuous attribute:
– Sort the attribute's values.
– Linearly scan these values, each time updating the count matrix and computing the Gini index (see the sketch after the table below).
– Choose the split position that has the least Gini index.

Sorted values (Taxable Income):  60   70   75   85   90   95   100  120  125  220
Class (Cheat):                   No   No   No   Yes  Yes  Yes  No   No   No   No

Split position v:  55     65     72     80     87     92     97     110    122    172    230
Yes (<= v | > v):  0|3    0|3    0|3    0|3    1|2    2|1    3|0    3|0    3|0    3|0    3|0
No  (<= v | > v):  0|7    1|6    2|5    3|4    3|4    3|4    3|4    4|3    5|2    6|1    7|0
Gini:              0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The best split position is v = 97, where the Gini index is smallest (0.300).
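A minimal sketch of this single-scan approach, under the same assumptions as the sketch above: sort once, then sweep the split positions between consecutive distinct values while updating the two count matrices incrementally.

from math import inf

def gini(counts):
    """Gini index of a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """Return (split position, Gini) found by a single scan over the sorted values."""
    data = sorted(zip(values, labels))
    class_names = sorted(set(labels))
    left = {c: 0 for c in class_names}       # counts for records with value <= v
    right = {c: 0 for c in class_names}      # counts for records with value > v
    for _, label in data:
        right[label] += 1
    n = len(data)
    best_v, best_g = None, inf
    for i, (x, label) in enumerate(data[:-1]):
        left[label] += 1                     # move record i to the <= side
        right[label] -= 1
        if data[i + 1][0] == x:
            continue                         # only split between distinct values
        v = (x + data[i + 1][0]) / 2         # midpoint split position
        g = ((i + 1) / n) * gini(left.values()) + ((n - i - 1) / n) * gini(right.values())
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
classes = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']
print(best_split(incomes, classes))   # best split around 97, Gini 0.300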

Measures of Node Impurity
Gini Index
Entropy
Misclassification error

Entropy
Information is measured in bits.
– Given a probability distribution, the information required to predict an event is the distribution's entropy.
– Entropy gives the required information in bits (this can involve fractions of bits!).
Formula for computing the entropy:

entropy(p_1, p_2, ..., p_n) = - p_1 log p_1 - p_2 log p_2 - ... - p_n log p_n

Entropy
Entropy at a given node t:

Entropy(t) = - Σ_j p(j | t) log_2 p(j | t)

(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Measures the homogeneity of a node.
  Maximum (log_2 n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least information.
  Minimum (0.0) when all records belong to one class, implying the most information.
– Entropy-based computations are similar to the Gini index computations.

Examples for computing Entropy

Entropy(t) = - Σ_j p(j | t) log_2 p(j | t)

C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0    P(C2) = 6/6 = 1
  Entropy = - 0 log_2 0 - 1 log_2 1 = - 0 - 0 = 0

C1 = 1, C2 = 5:
  P(C1) = 1/6    P(C2) = 5/6
  Entropy = - (1/6) log_2 (1/6) - (5/6) log_2 (5/6) = 0.65

C1 = 2, C2 = 4:
  P(C1) = 2/6    P(C2) = 4/6
  Entropy = - (2/6) log_2 (2/6) - (4/6) log_2 (4/6) = 0.92
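A minimal sketch that reproduces the three node entropies above (the helper name node_entropy is illustrative):

from math import log2

def node_entropy(counts):
    """Entropy of a node from its class counts, with 0 * log(0) treated as 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(node_entropy(counts), 2))   # 0.0, 0.65, 0.92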

Information Gain
Information Gain:

GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
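A minimal sketch of this formula; the counts in the example call are made up purely for illustration:

from math import log2

def node_entropy(counts):
    """Entropy of a node from its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition i)."""
    n = sum(parent_counts)
    weighted = sum((sum(part) / n) * node_entropy(part) for part in partition_counts)
    return node_entropy(parent_counts) - weighted

# A parent with 10 records of each class, split into two fairly pure partitions
print(round(information_gain([10, 10], [[8, 2], [2, 8]]), 3))   # ~0.278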

Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the weather forecast; it has no relation to the Microsoft email program.

Which attribute to select?

Example: attribute “Outlook”
“Outlook” = “Sunny”:
  info([2,3]) = entropy(2/5, 3/5) = - 2/5 log_2(2/5) - 3/5 log_2(3/5) = 0.971 bits
“Outlook” = “Overcast”:
  info([4,0]) = entropy(1, 0) = - 1 log_2(1) - 0 log_2(0) = 0 bits
“Outlook” = “Rainy”:
  info([3,2]) = entropy(3/5, 2/5) = - 3/5 log_2(3/5) - 2/5 log_2(2/5) = 0.971 bits
Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits

Note: log(0) is not defined, but we evaluate 0 * log(0) as zero.
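A short check of these numbers (a sketch reusing the partition counts above):

from math import log2

def info(counts):
    """Expected information (entropy) of a class-count vector, with 0 * log(0) = 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parts = [[2, 3], [4, 0], [3, 2]]                  # Sunny, Overcast, Rainy
print([round(info(p), 3) for p in parts])          # [0.971, 0.0, 0.971]
n = sum(sum(p) for p in parts)
print(round(sum(sum(p) / n * info(p) for p in parts), 3))   # 0.693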

Computing the information gain
Information gain: (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

Information gain for the attributes from the weather data:
  gain("Outlook")     = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity")    = 0.152 bits
  gain("Windy")       = 0.048 bits
(Witten & Eibe)
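A minimal sketch that recomputes these four gains directly from the 14-record weather table (the helper names entropy and gain are illustrative):

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target='Play?'):
    """Information gain of splitting the weather records on one attribute."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    info_after = 0.0
    for value in set(r[attr] for r in rows):
        part = [r[target] for r in rows if r[attr] == value]
        info_after += (len(part) / n) * entropy(part)
    return total - info_after

cols = ['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play?']
data = [
    ('sunny', 'hot', 'high', 'false', 'No'),       ('sunny', 'hot', 'high', 'true', 'No'),
    ('overcast', 'hot', 'high', 'false', 'Yes'),   ('rain', 'mild', 'high', 'false', 'Yes'),
    ('rain', 'cool', 'normal', 'false', 'Yes'),    ('rain', 'cool', 'normal', 'true', 'No'),
    ('overcast', 'cool', 'normal', 'true', 'Yes'), ('sunny', 'mild', 'high', 'false', 'No'),
    ('sunny', 'cool', 'normal', 'false', 'Yes'),   ('rain', 'mild', 'normal', 'false', 'Yes'),
    ('sunny', 'mild', 'normal', 'true', 'Yes'),    ('overcast', 'mild', 'high', 'true', 'Yes'),
    ('overcast', 'hot', 'normal', 'false', 'Yes'), ('rain', 'mild', 'high', 'true', 'No'),
]
rows = [dict(zip(cols, r)) for r in data]

for attr in cols[:-1]:
    print(attr, round(gain(rows, attr), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048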

Continuing to split (within the "Outlook" = "Sunny" subset):
  gain("Temperature") = 0.571 bits
  gain("Humidity")    = 0.971 bits
  gain("Windy")       = 0.020 bits

The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data cannot be split any further.

Highly-branching attributes
Problematic: attributes with a large number of values (extreme case: an ID code).
Subsets are more likely to be pure if there is a large number of values.
Information gain is biased towards choosing attributes with a large number of values.
This may result in overfitting (selection of an attribute that is non-optimal for prediction).

Weather Data with ID code
ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No

Split for ID Code Attribute
Entropy of the split = 0 (since each leaf node is "pure", having only one case).
Information gain is therefore maximal for the ID code.
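A minimal sketch of this extreme case, reusing an entropy helper on the Play? labels of the 14 weather records; each ID value isolates exactly one record, so every partition is pure:

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Play? labels of the 14 records, keyed by the unique ID codes A..N
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
ids = [chr(ord('A') + i) for i in range(len(play))]

partitions = {i: [p] for i, p in zip(ids, play)}    # one record per ID value
info_after = sum((len(v) / len(play)) * entropy(v) for v in partitions.values())
print(round(entropy(play) - info_after, 3))          # 0.94 bits: the maximum possible gain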
(Witten & Eibe)