CSC4316: DATA MINING
Algorithms – Decision Trees: Information Gain vs Gain Ratio
Lecturer: Abdullahi Ahamad Shehu (M.Sc. Data Science, M.Sc. Computer Science)
Office: Faculty of Computing Extension
Decision tree purity criteria: Information gain
Information is measured in bits
Entropy: the information required to predict the class, given a probability distribution
  entropy(p1, p2, …, pn) = -p1*log2(p1) - p2*log2(p2) - … - pn*log2(pn)
  where pi is the proportion of instances in class i
Information via the entropy of the class distribution
  info([c1, c2, …, cn]) = entropy(c1/c, c2/c, …, cn/c)
  where ci is the number of instances of class i and c = c1 + c2 + … + cn
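A minimal Python sketch of the two formulas above (not part of the original slides; the helper names entropy and info are my own):

```python
import math

def entropy(probabilities):
    """entropy(p1,...,pn) = -p1*log2(p1) - ... - pn*log2(pn); zero proportions contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def info(class_counts):
    """info([c1,...,cn]) = entropy(c1/c, ..., cn/c) with c = c1 + ... + cn."""
    total = sum(class_counts)
    return entropy(c / total for c in class_counts)

# Weather data used later in these slides: 9 "yes" and 5 "no" instances
print(round(info([9, 5]), 3))   # 0.94
```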
Decision tree purity criteria: Information gain
Information gain: the difference between the information contained in the dataset and the expected information after an attribute split
Expected information from splitting on an attribute: the weighted average of the information in each partition
  info([a1,…,an], [b1,…,bn], [c1,…,cn], …) = A/Σ * info([a1,…,an]) + B/Σ * info([b1,…,bn]) + C/Σ * info([c1,…,cn]) + …
  where xi is the number of instances of class i in partition X, X = x1 + x2 + … is the size of that partition, and Σ = A + B + C + … is the total number of instances
[Diagram: splitting on an attribute with values a, b, …, x for a two-class (T/F) classification – the root shows info([AT, AF]) and the branches show info([aT, aF]), info([bT, bF]), …, info([xT, xF])]
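A sketch of expected information and information gain under the same assumptions (helper names are mine; the class counts [[2,3],[4,0],[3,2]] are the outlook split used later in the example):

```python
import math

def info(class_counts):
    """info([c1,...,cn]) as in the previous sketch."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

def expected_info(partitions):
    """Weighted average of info() over the partitions produced by an attribute split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def information_gain(dataset_counts, partitions):
    """Information in the dataset minus the expected information after the split."""
    return info(dataset_counts) - expected_info(partitions)

# Outlook splits the weather data ([9 yes, 5 no]) into [2,3], [4,0] and [3,2]
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```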
Highly-branching attributes
Subsets are more likely to be pure if there is a large number of values
Attributes with a large number of values therefore tend to have higher gain
Information gain is biased towards choosing attributes with a large number of values
This may result in overfitting
Gain ratio compensates for this bias
Root Node: ID code
info(Weather Data) = 0.940 bits
info(ID split) = info([0,1], [0,1], [1,0], …, [1,0], [0,1])
               = 1/14*0 + 1/14*0 + 1/14*0 + … + 1/14*0 + 1/14*0 = 0 bits
Gain(ID) = 0.940 - 0 = 0.940 bits
Gain is maximal for the ID code
Not a useful decision node!
[Diagram: splitting on ID code gives one single-instance branch per value (a, b, c, …, m, n), each a pure leaf labelled Yes or No]
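A sketch of the calculation above, assuming the standard 14-instance weather data (9 yes, 5 no); it reproduces the maximal but useless gain of the ID-code attribute:

```python
import math

def info(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

# ID code creates 14 single-instance branches, each already pure (info = 0)
id_partitions = [[1, 0]] * 9 + [[0, 1]] * 5               # 9 "yes" leaves, 5 "no" leaves
expected = sum(1 / 14 * info(p) for p in id_partitions)   # 0.0
gain_id = info([9, 5]) - expected
print(round(gain_id, 3))   # 0.94 - maximal gain, yet useless for prediction
```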
Gain Ratio
Takes the number and size of branches into account when choosing an attribute
Information gain: entropy with respect to the classification
Gain ratio: normalises information gain by the entropy of the attribute partition (its intrinsic information)
  entropy with respect to the attribute partition – ignores the classification
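A sketch of gain ratio under the same assumptions (helper names are mine): the split information is the entropy of the partition sizes alone, ignoring the class labels.

```python
import math

def info(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

def gain_ratio(dataset_counts, partitions):
    total = sum(sum(p) for p in partitions)
    gain = info(dataset_counts) - sum(sum(p) / total * info(p) for p in partitions)
    split_info = info([sum(p) for p in partitions])   # entropy of partition sizes only
    return gain / split_info

# Outlook: partitions [2,3], [4,0], [3,2] -> sizes 5, 4, 5
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.156 (0.157 with the rounded figures on the next slide)
```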
Example: Gain Ratio – Outlook
Information as before (entropy w.r.t. classification):
  info(data) = info([9,5]) = 0.940
  info(outlook) = 0.693
Gain as before:
  gain(outlook) = 0.940 - 0.693 = 0.247
Information w.r.t. the outlook partition: 5 sunny, 4 cloudy, 5 rainy
  info(data wrt outlook) = info([5,4,5]) = -5/14*log2(5/14) - 4/14*log2(4/14) - 5/14*log2(5/14) = 1.577
GainRatio(outlook) = gain(outlook) / info(data wrt outlook) = 0.247/1.577 = 0.157
[Diagram: outlook split – root info([9,5]); sunny branch (instances 9,11 / 1,2,8) info([2,3]), weight 5/14; cloudy branch (3,7,12,13) info([4,0]), weight 4/14; rainy branch (4,5,10 / 6,14) info([3,2]), weight 5/14]
Example: Gain Ratios
GainRatio(outlook) = 0.247/1.577 = 0.157
GainRatio(temp) = 0.029/1.557 = 0.019
GainRatio(humidity) = 0.152/1 = 0.152
GainRatio(windy) = 0.048/0.985 = 0.049
Outlook is still selected as the root node
But humidity is now a much closer choice
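A sketch reproducing the ranking above; the per-branch [yes, no] counts are taken from the standard weather dataset (an assumption, since they are not shown on this slide) and the helper names are mine.

```python
import math

def info(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

def gain_and_ratio(dataset_counts, partitions):
    total = sum(sum(p) for p in partitions)
    gain = info(dataset_counts) - sum(sum(p) / total * info(p) for p in partitions)
    return gain, gain / info([sum(p) for p in partitions])

# [yes, no] counts per attribute value in the standard weather data
attributes = {
    "outlook":  [[2, 3], [4, 0], [3, 2]],   # sunny, cloudy, rainy
    "temp":     [[2, 2], [4, 2], [3, 1]],   # hot, mild, cool
    "humidity": [[3, 4], [6, 1]],           # high, normal
    "windy":    [[3, 3], [6, 2]],           # true, false
}
for name, parts in attributes.items():
    gain, ratio = gain_and_ratio([9, 5], parts)
    print(f"{name:9s} gain = {gain:.3f}   gain ratio = {ratio:.3f}")
# Outlook still ranks first on gain ratio, with humidity a close second
```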
Problems with Gain Ratio
In extreme scenarios (a very high number of values, e.g. an ID code), reject the attribute even if its gain or gain ratio is high
Gain ratio may overcompensate: it may choose an attribute simply because its intrinsic information is very low
Standard solution (sketched below): only consider attributes with greater-than-average information gain
  calculate the information gain for all attributes
  calculate the gain ratio only for those with above-average gain
  choose the attribute with the highest gain ratio
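A sketch of this standard solution (the dictionaries of gains and gain ratios are assumed inputs, e.g. produced by the earlier sketches):

```python
def choose_split_attribute(gains, gain_ratios):
    """Pick the highest gain ratio among attributes with above-average information gain."""
    average_gain = sum(gains.values()) / len(gains)
    candidates = [a for a in gains if gains[a] >= average_gain]
    return max(candidates, key=lambda a: gain_ratios[a])

# Weather data figures from the previous slides
gains  = {"outlook": 0.247, "temp": 0.029, "humidity": 0.152, "windy": 0.048}
ratios = {"outlook": 0.157, "temp": 0.019, "humidity": 0.152, "windy": 0.049}
print(choose_split_attribute(gains, ratios))   # outlook (temp and windy are filtered out)
```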
Exercises 3 – Information Gain and Gain Ratio
1. Use the information gain metric to choose a suitable attribute for the root node of the Credit Risk Dataset, which predicts credit risk given credit history, debt, collateral and income. The class attribute is risk.
2. Use gain ratio to choose the root node attribute for the Credit Risk Dataset.