Wisconsin Breast Cancer dataset.pptx

DaheeKim30 · 690 views · 46 slides · Aug 22, 2023

About This Presentation

This project aims to determine whether a tumor is malignant or benign by analyzing the Wisconsin breast cancer dataset.


Slide Content

Wisconsin Breast Cancer dataset GUGC Da Hee Kim Advised by Homin Park

Contents: Introduction; Loading & checking the data; Exploratory Data Analysis (EDA); Feature Engineering; Modeling; Interpretability / Explainable AI (XAI); Wrap-up; Future research

1. Importance

1. Introduction - The problem Christina Applegate Sharon Osbourne Angelina Jolie

1. Introduction - The problem The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea) (A) Estimated new cases

1. Introduction - The problem The 10 leading types of estimated new cancer cases and deaths in 2020 (South Korea). (A) Estimated new cases (B) Estimated deaths

1. Introduction - The problem Cancer ≠ Tumor: a tumor is an abnormal growth of cells causing a mass of tissue. Malignant tumors are cancerous and invade other sites; benign tumors stay in their primary location.

1. Introduction - The problem: Diagnosis

1. Introduction - The problem Benign: uniform nucleus size, symmetrical, homogeneous, areas within normal size. Malignant: non-uniform nucleus size, asymmetrical, non-homogeneous sizes, areas above normal size.

1. Introduction - The problem: Diagnosis Problem: depending on the type, painful to the patient; potential side effects (e.g., bruising); diagnosis can take time; tedious process. Imaging → Machine learning model → Malignant / Benign

1. Wisconsin Breast Cancer dataset - What? Has parameters measured from a fine needle aspirate of a breast mass. The parameters describe the cell nuclei (569 cells).

1. Wisconsin Breast Cancer dataset - What? The parameters include the following 10 features: Radius, Texture, Perimeter, Area, Smoothness, Compactness (perimeter^2 / area - 1.0), Concavity (severity of concave portions of the contour), Concave points (number of concave portions of the contour), Symmetry, Fractal dimension

1. Wisconsin Breast Cancer dataset - What? The parameters include the following 10 features. Texture: standard deviation of gray-scale values. Smoothness: local variation in radius lengths. Compactness: (perimeter^2 / area - 1.0). Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Symmetry: uses a nucleus deformation parameter to measure how non-spherical a nucleus is. Fractal dimension: ("coastline approximation" - 1). Diagram: radius, perimeter, and area of the nuclei.

1. Wisconsin Breast Cancer dataset - What? Diagram: the radius R of each tumor nucleus is measured; the dataset summarizes it as Mean, Standard error, and Worst.

1. Wisconsin Breast Cancer dataset - What?: Structure of dataset. 33 columns in total: ID, diagnosis, 30 features (the 10 features above, each as Mean, Standard error, and Worst), and an Unnamed column.
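The 569 × 30 structure described above can be checked directly. This is a minimal sketch using scikit-learn's built-in copy of the dataset, which ships the same 569 samples and 30 features as the Kaggle CSV but without the ID, diagnosis, and Unnamed columns (so it has 30 columns rather than 33); the CSV path and its exact column names are not shown on the slide, so they are not assumed here.

```python
# Load scikit-learn's copy of the Wisconsin Breast Cancer (Diagnostic) dataset
# and confirm the structure: 569 samples, 30 feature columns
# (10 base features x mean / standard error / worst).
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

print(df.shape)            # (569, 30)
print(df.columns[:3].tolist())
```

The Kaggle export adds three more columns (id, diagnosis, "Unnamed: 32"), which is where the 33-column count on the slide comes from.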

1. Wisconsin Breast Cancer dataset - Importance? Doctors will be able to determine whether a tumor is malignant or benign through imaging, without a biopsy. Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is difficult to conduct.

2. Machine learning process

1. Loading and checking data 1: Loading data 2: Checking for null values 3: Outlier detection 4: Summary and statistics

1. Loading and checking the data

1. Loading and checking the data ID & Diagnosis Mean Standard error Worst Unnamed

1. Loading and checking the data - 2: Checking for null values. All the data in the Unnamed column consists of null values, so we will remove this column later.
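The null check and column removal can be sketched as below. The toy DataFrame stands in for the real CSV (its column names, including "Unnamed: 32", are assumptions based on the structure described on the slides); `dropna(axis=1, how="all")` removes exactly those columns where every value is null.

```python
# Find all-null columns and drop them, as with the "Unnamed: 32" column.
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset's layout (assumed column names).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],  # every entry is null
})

print(df.isnull().sum())            # nulls appear only in "Unnamed: 32"
df = df.dropna(axis=1, how="all")   # drop columns that are entirely null
print(df.columns.tolist())
```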

1. Loading and checking the data - 3: Outlier Detection. Redefine “X” to include only the features. Output: outliers found based only on the feature columns (x_col); we will drop these later.
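The slide does not show the exact detection code, so this is one common approach as a hedged sketch: Tukey's IQR fences per feature, flagging rows that are outliers in more than a chosen number of feature columns. The function name `detect_outliers` and the `min_features` threshold are illustrative assumptions.

```python
# IQR-based (Tukey fence) outlier detection: a row counts as an outlier
# if it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in more than
# `min_features` of the given columns.
import pandas as pd

def detect_outliers(df, columns, min_features=2, k=1.5):
    counts = pd.Series(0, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        counts += ((df[col] < low) | (df[col] > high)).astype(int)
    return df.index[counts > min_features].tolist()

# Toy data: row 4 is extreme in both columns.
X = pd.DataFrame({"radius_mean": [12, 13, 14, 13, 80],
                  "area_mean": [500, 520, 540, 510, 9000]})
print(detect_outliers(X, X.columns, min_features=1))  # [4]
```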

1. Loading and checking the data - 4: Summary and statistics. Redefine “X” to include only the features. Output: we can observe the statistical values for each of the features.

1. Loading and checking the data - 4: Summary and statistics. Redefine “data_w_diag” to include the diagnosis and the 30 features. Output: we can observe the number of benign and malignant tumors. Number of Benign: 357 Number of Malignant: 212
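The class counts above can be reproduced with a simple value count. Sketch using scikit-learn's copy of the dataset, where the diagnosis is encoded as 0 = malignant, 1 = benign:

```python
# Count benign (B) vs malignant (M) cases; expect 357 B and 212 M.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
diagnosis = pd.Series(data.target).map({0: "M", 1: "B"})
print(diagnosis.value_counts())  # B: 357, M: 212
```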

2. Exploratory Data Analysis 1: Heat map, all features 2: Important features 2-1: Radius VS Perimeter VS Area 1: Heat map 2-2: Compactness VS Concavity VS Concave points 1: Heat map 2: Feature plotting: Histogram 3: Overall data distribution

2. EDA - 1: Heat Map We can see several relationships using the heat map. Within the mean or worst features, radius is highly correlated with the perimeter and the area, and compactness, concavity, and concave points are correlated with each other.
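The correlations the heat map visualizes can be computed directly with `DataFrame.corr()` (a plotting library such as seaborn would then render the matrix as a heat map). A sketch using scikit-learn's copy of the dataset, whose column names differ slightly from the Kaggle CSV (e.g. "mean radius" rather than "radius_mean"):

```python
# Compute the feature correlation matrix underlying the heat map and
# inspect the radius / perimeter / area block, which is nearly 1.0.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
corr = df.corr()

print(round(corr.loc["mean radius", "mean perimeter"], 3))
print(round(corr.loc["mean radius", "mean area"], 3))
```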

2. EDA - 2-1: Radius VS Perimeter VS Area 1. Heat map The different colors at 1.0 are due to rounding of the correlation values.

2. EDA - 2-1: Radius VS Perimeter VS Area 1. Heat map We will check the correlation of Area VS Perimeter VS Radius. The different colors at 1.0 are due to rounding of the correlation values.

2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting Mean Worst

2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting: Joint plot Mean

2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting Worst

2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting Worst Mean Thus, the Pearson r values represent the correlation between features. Values > 0.97 explain why the correlation value on the heatmap was 1.0: very high correlations rounded up. However, we are not going to remove any features, because we will see later on that their feature importance varies.

2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting Mean Worst Thus, again not perfectly linear: we can assume that the 1.0 correlation on the heatmap was due to rounding of the values. In the feature engineering step, we will choose one of the three features for dimension reduction.

2. EDA -2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following features Compactness & concavity & concave points High correlation Compactness mean VS compactness worst Concavity mean VS concavity worst Concave points mean VS concave points worst

2. EDA -2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following features Compactness VS concavity Concavity VS concave points Compactness mean VS compactness worst Concavity mean VS concavity worst Concave points mean VS concave points worst

2. EDA - 2-2: Compactness VS Concave points VS Concavity 2. Feature plotting Worst Mean Concave points Concavity Compactness Pearson r values: 0.86-0.92, so the feature pairs are highly correlated. Potential reason for the high correlation between concave points and concavity (0.92): both are morphological features related to the contour of the tumor nuclei, and tumors with more concave points might exhibit more complex and irregular shapes, leading to higher scores. Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Compactness: (perimeter^2 / area - 1.0)
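The quoted Pearson r range for this feature trio can be checked with `numpy.corrcoef`. A sketch on the mean features of scikit-learn's copy of the dataset:

```python
# Pearson correlations among compactness, concavity, and concave points,
# which the slide reports as roughly 0.86-0.92.
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)
conc = data.data[:, names.index("mean concavity")]
pts = data.data[:, names.index("mean concave points")]
comp = data.data[:, names.index("mean compactness")]

print(round(np.corrcoef(conc, pts)[0, 1], 2))   # concavity vs concave points
print(round(np.corrcoef(comp, conc)[0, 1], 2))  # compactness vs concavity
```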

2. EDA -2-2: Compactness VS Concave points VS Concavity 2. Feature plotting: Worst VS Mean Concave points concavity: severity of concave portions of the contour concave points: Number of concave portions of the contour Compactness: (perimeter^2 / area - 1.0) Concavity Compactness concavity & compactness: worst ≒ mean Concave points: worst ≠ mean Similarity of overall distribution

2. EDA - 3: Data distribution Violin Plot Worst Mean Standard Error

2. EDA - 3: Data distribution Violin Plot Standard Error No clear separation in distribution between malignant and benign, because the standard error value has no meaning by itself.

2. EDA - 3: Data distribution Violin Plot Worst Red box: Examples of features with good separation Blue box: Examples of features with bad separation Assume that features with good separation will have higher feature importance
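The "good vs bad separation" the violin plots show visually can be quantified with a simple metric; the metric below (absolute difference of class means in units of the overall standard deviation) is an assumption for illustration, not the slide's method. It reflects the pattern above: "worst" features tend to separate the classes well, standard-error features poorly.

```python
# Rough per-feature class-separation score: |mean(M) - mean(B)| / std.
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)
y = data.target  # 0 = malignant, 1 = benign

def separation(feature):
    x = data.data[:, names.index(feature)]
    return abs(x[y == 0].mean() - x[y == 1].mean()) / x.std()

print(round(separation("worst concave points"), 2))  # well separated
print(round(separation("texture error"), 2))         # poorly separated
```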

3. Feature Engineering 1: Standardization 2: Outlier detection

3. Feature engineering - 1. Standardization Before After
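The standardization step can be sketched with scikit-learn's `StandardScaler` (the slide does not show which scaler was used, so this is the standard choice, assumed): each feature is rescaled to zero mean and unit variance.

```python
# Standardize all 30 features to zero mean and unit variance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X)

print(np.allclose(X_std.mean(axis=0), 0))  # True: every column centered
print(np.allclose(X_std.std(axis=0), 1))   # True: every column unit variance
```

Standardization matters here because the raw features live on very different scales (e.g. area in the hundreds vs smoothness near 0.1), which would otherwise dominate distance-based models such as SVC.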

3. Feature engineering - 2. Outlier Deletion: Swarm Plot Before After

4. Modeling 1: Splitting data 2: Classification 2-1: ANN 2-2: SVC vs Decision Tree vs AdaBoost vs Random Forest vs Extra Trees vs GBC vs Logistic Regression 2-3: Ensemble model 3: Cross-validate models 4: Hyperparameter tuning 5: Evaluating models

4. Modeling 1: Splitting data 2: Classification 2-1: ANN 2-2: SVC vs Decision Tree vs AdaBoost vs Random Forest vs Extra Trees vs GBC vs Logistic Regression 3: Cross-validate models 4: Hyperparameter tuning 5: Evaluating models
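The classifier comparison and cross-validation steps listed above can be sketched with `cross_val_score`. This is a hedged sketch covering three of the listed models with default hyperparameters (the slide's actual model set and tuned parameters are not shown); scale-sensitive models are wrapped in a pipeline with `StandardScaler`.

```python
# Compare a few of the listed classifiers with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```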

4. Modeling -1: Splitting data
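The split step can be sketched with `train_test_split`; the 80/20 ratio, stratification by diagnosis, and the random seed are assumptions, since the slide shows only the step name.

```python
# Split the dataset into train and test sets, preserving the
# benign/malignant ratio in both via stratification.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```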