The Curious Case of the Wisconsin Diagnostic Dataset

SabinBhatta · 22 slides · Oct 02, 2024

About This Presentation

This project aimed to build a well-performing classification model for breast cancer diagnosis on the Wisconsin Diagnostic Breast Cancer dataset using a minimum number of features. This was investigated by finding the strongest predictors and their contributions to the prediction using deci...


Slide Content

The Curious Case of the Wisconsin Diagnostic Dataset (WDBC)

Starring

Dataset Overview
No missing/null attribute values. 32 attributes: ID (only one image per ID), the diagnosis (benign or malignant), and 30 real-valued input features.

Class distribution:
Diagnosis    Count   Share
Benign       357     62.7%
Malignant    212     37.3%
Total        569     100%
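As a quick sanity check, the dataset and its class balance can be loaded with scikit-learn's bundled copy of WDBC (an assumption; the slides do not say which loader was used — in this copy the target is encoded 0 = malignant, 1 = benign):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target            # 569 rows, 30 real-valued features

# Count each class and report its share of the dataset
counts = y.value_counts().rename(index=dict(enumerate(data.target_names)))
print(counts)                            # benign 357, malignant 212
print((counts / counts.sum() * 100).round(1))  # 62.7% / 37.3%
```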

Multicollinear Features (Pearson coefficient >= 0.85):
area_se, perimeter_mean, texture_worst, perimeter_worst, area_mean, concavity_mean, perimeter_se, radius_worst, concave_points_worst, area_worst, compactness_worst
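A minimal sketch of how such pairs can be flagged, reusing X from the snippet above; note that scikit-learn's column names ('worst perimeter', 'area error', ...) differ from the UCI-style names on the slide (perimeter_worst, area_se, ...):

```python
import numpy as np

# Absolute pairwise Pearson correlations between all 30 features
corr = X.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Feature pairs at or above the slide's 0.85 cutoff
print(pairs[pairs >= 0.85].sort_values(ascending=False))
```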

What if I told you that using only 8 of the 30 attributes would give you a decent result? The choice is yours!

Just Eight!
perimeter_worst, area_worst, concave points_mean, perimeter_mean, concavity_worst, texture_worst, compactness_worst, radius_se

The mean accuracy of a Random Forest using only these 8 features was 95.575%.
Hyperparameters: max_features=0.5, max_samples=228, min_samples_leaf=5, n_estimators=40. Split: 80:20.
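A sketch of that model under the stated hyperparameters, continuing from the snippets above. The column names map the slide's UCI-style names onto scikit-learn's, and averaging over ten random 80:20 splits is an assumption about how the quoted mean accuracy was obtained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The slide's eight features, in scikit-learn's naming
eight = ["worst perimeter", "worst area", "mean concave points",
         "mean perimeter", "worst concavity", "worst texture",
         "worst compactness", "radius error"]

scores = []
for seed in range(10):                      # repeat over random 80:20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[eight], y, test_size=0.2, stratify=y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=40, max_features=0.5,
                                max_samples=228, min_samples_leaf=5,
                                random_state=seed)
    scores.append(rf.fit(X_tr, y_tr).score(X_te, y_te))

print(f"mean accuracy: {np.mean(scores):.4f}")   # ~0.95 on these 8 features
```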

Removing correlated features is useful when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data, this can lead to the incorrect conclusion that one of the variables in a correlated group is a strong predictor while the others are unimportant, when in fact they are very close in terms of their relationship with the response variable. (Datadive: http://blog.datadive.net/selecting-good-features-part-iii-random-forests/)

Random Forest (all 30 features)
Hyperparameters: max_features=0.5, max_samples=228, min_samples_leaf=5, n_estimators=40
Number of features used: 30
Mean accuracy: 97.34%
[Chart: top 10 feature importances]
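The full 30-feature model and its top-10 impurity-based importances (presumably what the "top 10 features" chart showed) can be sketched like this, again reusing X and y from above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
rf = RandomForestClassifier(n_estimators=40, max_features=0.5,
                            max_samples=228, min_samples_leaf=5,
                            random_state=42).fit(X_tr, y_tr)

# Rank features by impurity-based importance and keep the top 10
top10 = (pd.Series(rf.feature_importances_, index=X.columns)
           .sort_values(ascending=False).head(10))
print(top10)
print(f"test accuracy: {rf.score(X_te, y_te):.4f}")
```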

Removing low-importance features
Threshold: 0.005
Number of features used: 13
Mean accuracy: 97.34%
[Chart: top 10 feature importances]
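A sketch of the thresholding step using SelectFromModel on the fitted forest above (assuming a recent scikit-learn; the exact number of surviving features may differ slightly from the slide's 13 depending on the seed):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance meets the 0.005 threshold
selector = SelectFromModel(rf, threshold=0.005, prefit=True)
kept = X.columns[selector.get_support()]
print(len(kept), "features kept:", list(kept))

# Refit on the reduced feature set and check that accuracy holds up
rf_small = RandomForestClassifier(n_estimators=40, max_features=0.5,
                                  max_samples=228, min_samples_leaf=5,
                                  random_state=42).fit(X_tr[kept], y_tr)
print(f"test accuracy: {rf_small.score(X_te[kept], y_te):.4f}")
```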

Removing redundant features
OOB score: 0.9342 (baseline)
Candidate pairs:
texture_worst, texture_mean
radius_se, area_se
radius_worst, area_worst
concavity_worst, concavity_mean
concave points_worst, concave points_mean

Remove potentially redundant variables one at a time. Resulting OOB scores:
texture_worst: 0.9364035087719298
texture_mean: 0.9473684210526315
radius_se: 0.9407894736842105
area_se: 0.9473684210526315
radius_worst: 0.9407894736842105
area_worst: 0.9385964912280702
concavity_worst: 0.9451754385964912
concavity_mean: 0.9495614035087719
concave points_worst: 0.9429824561403509
concave points_mean: 0.9364035087719298
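A sketch of that procedure: fit with oob_score=True for a baseline, then drop each candidate column in turn and refit. The hyperparameters are carried over from the earlier slides, and the exact scores will vary with the random seed:

```python
from sklearn.ensemble import RandomForestClassifier

def oob(df):
    # Out-of-bag accuracy of a forest fit on the given feature subset
    rf = RandomForestClassifier(n_estimators=40, max_features=0.5,
                                min_samples_leaf=5, oob_score=True,
                                random_state=42).fit(df, y)
    return rf.oob_score_

print("baseline:", oob(X))                       # slide reports 0.9342

candidates = ["worst texture", "mean texture", "radius error", "area error",
              "worst radius", "worst area", "worst concavity",
              "mean concavity", "worst concave points", "mean concave points"]
for col in candidates:
    # Drop one candidate at a time; if OOB accuracy holds, it was redundant
    print(col, round(oob(X.drop(columns=[col])), 4))
```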

Partial Dependence (of top 3 features)
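These panels can be reproduced with scikit-learn's inspection module on the fitted forest, taking the three most important features from the ranking computed above:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the predicted class on each of the top 3 features
top3 = list(top10.index[:3])
PartialDependenceDisplay.from_estimator(rf, X_tr, features=top3)
plt.show()
```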

Waterfall plot
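The slide does not say which package produced the waterfall plot; a sketch with the shap library (an assumption) would look like this. For a scikit-learn classifier the explanation carries one slice per class, so a single test row and class are selected:

```python
import shap

explainer = shap.TreeExplainer(rf)     # explainer for tree ensembles
sv = explainer(X_te)                   # Explanation: (rows, features, classes)

# Waterfall for the first test row, class 1 (benign in sklearn's encoding)
shap.plots.waterfall(sv[0, :, 1])
```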

Model Interpretation: Questions
How confident are we in our predictions using a particular row of data?
For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
Which columns are the strongest predictors, and which can we ignore?
Which columns are effectively redundant with each other for purposes of prediction?
How do predictions vary as we vary these columns?

Source: Analytics Vidhya

The OOB (out-of-bag) score prevents leakage and gives a low-variance estimate of model performance, which is not necessarily the case with cross-validation. It also works well for small and medium-sized datasets.

Thank You!