This project aims to determine whether a tumor is malignant or benign by analyzing the Wisconsin breast cancer dataset.
Added: Aug 22, 2023
Slides: 46 pages
Wisconsin Breast Cancer dataset
GUGC, Da Hee Kim
Advised by Homin Park
Contents: Introduction; Loading & checking the data; Exploratory Data Analysis (EDA); Feature Engineering; Modeling; Interpretability / Explainable AI (XAI); Wrap-up; Future research
1. Importance
1. Introduction - The problem Christina Applegate Sharon Osbourne Angelina Jolie
1. Introduction - The problem The 10 leading types of estimated new cancer cases and deaths in 2020 (South Korea). (A) Estimated new cases (B) Estimated deaths
1. Introduction - The problem Cancer ≠ Tumor: a tumor is an abnormal growth of cells forming a mass of tissue. Malignant tumors are cancerous and invade other sites; benign tumors stay in their primary location.
1. Introduction - The problem: Diagnosis
1. Introduction - The problem Benign: uniform nucleus size, symmetrical, homogeneous, areas within normal size. Malignant: non-uniform nucleus, asymmetrical, non-homogeneous sizes, areas above normal size.
1. Introduction - The problem: Diagnosis Problems with current diagnosis: depending on the type, painful to the patient; potential side effects (e.g. bruising); diagnosis can take time; tedious process. Idea: Imaging → Machine learning model → Malignant / Benign
1. Wisconsin Breast Cancer dataset - What? Contains parameters measured from a fine needle aspirate of a breast mass. The parameters describe the cell nuclei (569 samples).
1. Wisconsin Breast Cancer dataset - What? The parameters include the following 10 features: Radius, Texture, Perimeter, Area, Smoothness, Compactness (perimeter^2 / area - 1.0), Concavity (severity of concave portions of the contour), Concave points (number of concave portions of the contour), Symmetry, Fractal dimension
1. Wisconsin Breast Cancer dataset - What? The parameters include the following 10 features. Texture: standard deviation of gray-scale values. Smoothness: local variation in radius lengths. Compactness: perimeter^2 / area - 1.0. Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Symmetry: uses a nucleus deformation parameter to measure how non-spherical a nucleus is. Fractal dimension: ("coastline approximation" - 1). [Figure: radius, perimeter, and area illustrated on the nuclei]
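The compactness formula quoted above can be sanity-checked in a few lines; the circle example below is an added illustration, not from the slides.

```python
import math

# Compactness as defined on the slide: perimeter^2 / area - 1.0.
# Illustrative check: for a perfect circle of radius r,
# perimeter = 2*pi*r and area = pi*r**2, so compactness equals
# 4*pi - 1 regardless of r; more irregular contours with the same
# area score higher than a circle.
def compactness(perimeter, area):
    return perimeter ** 2 / area - 1.0

r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
print(round(circle, 3))  # 11.566  (= 4*pi - 1)
```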
1. Wisconsin Breast Cancer dataset - What? For each feature (e.g. radius R), three values are computed over the tumor nuclei of a sample: the mean, the standard error, and the worst (the mean of the three largest values). [Figure: radius R measured on tumor nuclei]
1. Wisconsin Breast Cancer dataset - What?: Structure of the dataset Each of the 10 features is recorded as a Mean, Standard error, and Worst value, giving 30 feature columns. Together with the ID, diagnosis, and an Unnamed column, the dataset has 33 columns.
1. Wisconsin Breast Cancer dataset - Importance? Doctors would be able to determine whether a tumor is malignant or benign through imaging, without a biopsy. Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is difficult to conduct. (Imaging → Model → Malignant / Benign)
2. Machine learning process
1. Loading and checking data 1: Loading data 2: Checking for null values 3: Outlier detection 4: Summary and statistics
1. Loading and checking the data
1. Loading and checking the data - 1: Loading data Column groups: ID & Diagnosis, Mean, Standard error, Worst, Unnamed
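The loading step is not shown as code on the slide; a minimal sketch follows. It uses scikit-learn's built-in copy of the same 569-sample dataset, since the CSV filename used in the slides is not shown (a Kaggle-style CSV would additionally carry the ID and empty Unnamed columns).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer data (569 samples, 30 features).
raw = load_breast_cancer()
data = pd.DataFrame(raw.data, columns=raw.feature_names)
# In scikit-learn's encoding, target 0 = malignant, 1 = benign.
data["diagnosis"] = pd.Series(raw.target).map({0: "M", 1: "B"})

print(data.shape)              # (569, 31): 30 features + diagnosis
print(list(data.columns[:3]))
```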
1. Loading and checking the data - 2: Checking for null values The Unnamed column consists entirely of null values, so we will remove this column later.
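A sketch of the null check; the two-row frame below is a hypothetical stand-in for the CSV, whose trailing Unnamed column is entirely null.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the CSV: the trailing "Unnamed: 32"
# column contains only nulls.
data = pd.DataFrame({"radius_mean": [17.99, 20.57],
                     "Unnamed: 32": [np.nan, np.nan]})

print(data.isnull().sum())                 # every "Unnamed: 32" value is null
data = data.drop(columns=["Unnamed: 32"])  # so we drop the column
print(list(data.columns))
```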
1. Loading and checking the data - 3: Outlier detection Redefine "X" to include only the features. Output: outliers are found based only on the feature columns (x_col); we will drop these rows later.
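The slides do not show the detection rule; the sketch below assumes the common 1.5×IQR criterion, applied per feature column (x_col), on scikit-learn's copy of the data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)  # features only

def iqr_outliers(col):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Rows flagged as an outlier in at least one feature column
mask = X.apply(iqr_outliers).any(axis=1)
print(mask.sum(), "rows contain at least one outlier")
```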
1. Loading and checking the data - 4: Summary and statistics Redefine "X" to include only the features. Output: we can observe the statistical values for each of the features.
1. Loading and checking the data - 4: Summary and statistics Redefine "data_w_diag" to include the diagnosis and the 30 features. Output: we can observe the number of benign and malignant tumors. Number of Benign: 357 Number of Malignant: 212
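A sketch of the summary step, again on scikit-learn's copy of the data (the slides build a data_w_diag frame from the CSV):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
data_w_diag = pd.DataFrame(raw.data, columns=raw.feature_names)
data_w_diag["diagnosis"] = pd.Series(raw.target).map({0: "M", 1: "B"})

print(data_w_diag.describe())  # per-feature count/mean/std/quartiles

counts = data_w_diag["diagnosis"].value_counts()
print("Number of Benign:", counts["B"])     # 357
print("Number of Malignant:", counts["M"])  # 212
```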
2. Exploratory Data Analysis 1: Heat map of all features 2: Important features 2-1: Radius VS Perimeter VS Area (1: Heat map, 2: Feature plotting) 2-2: Compactness VS Concavity VS Concave points (1: Heat map, 2: Feature plotting: Histogram) 3: Overall data distribution
2. EDA - 1: Heat map We can see a couple of relations using the heat map. Within the mean or worst features, radius is highly correlated with the perimeter and the area, and compactness, concavity, and concave points are correlated with each other.
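A sketch of the heat map, assuming the usual seaborn approach; the printed correlations confirm the two clusters the slide points out.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

corr = X.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.tight_layout()
plt.savefig("heatmap.png")

# Radius vs perimeter/area, and compactness vs concavity/concave
# points, are the strongly correlated groups visible in the map.
print(corr.loc["mean radius", "mean perimeter"])      # > 0.99
print(corr.loc["mean compactness", "mean concavity"])
```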
2. EDA - 2-1: Radius VS Perimeter VS Area 1. Heat map We will check the correlation of area VS perimeter VS radius. Although the cells have different colors, the displayed value of 1.0 is due to rounding of the correlation values.
2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting: Joint plots [Figures: pairwise joint plots of radius, perimeter, and area for the Mean features and for the Worst features]
2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting (Mean, Worst) The Pearson r values represent the correlation between features. Values above 0.97 explain why the heatmap showed a correlation of 1.0: it was due to rounding. Although the correlations are very high, we are not going to remove any features, because we will see later that their feature importance varies.
2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting (Mean, Worst) The relationships are again not perfectly linear, so we can assume that the 1.0 correlation on the heatmap was due to rounding of the values. In the feature engineering step, we will choose one of the three features for dimensionality reduction.
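The Pearson r values quoted for the three size features can be reproduced with scipy (a sketch, using scikit-learn's copy of the data):

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

# Pairwise Pearson r among the three size-related mean features
pairs = [("mean radius", "mean perimeter"),
         ("mean radius", "mean area"),
         ("mean perimeter", "mean area")]
rs = {f"{a} vs {b}": pearsonr(X[a], X[b])[0] for a, b in pairs}
for name, r in rs.items():
    print(f"{name}: r = {r:.3f}")  # all three exceed 0.97
```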
2. EDA - 2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following features: compactness & concavity & concave points (high correlation); compactness mean VS compactness worst; concavity mean VS concavity worst; concave points mean VS concave points worst.
2. EDA - 2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following pairs: compactness VS concavity; concavity VS concave points; compactness mean VS compactness worst; concavity mean VS concavity worst; concave points mean VS concave points worst.
2. EDA - 2-2: Compactness VS Concave points VS Concavity 2. Feature plotting (Mean, Worst) The Pearson r values are 0.86-0.92, so the feature pairs are highly correlated. A potential reason for the especially high correlation between concave points and concavity (r = 0.92) is morphological: both features describe the contour of the tumor nuclei, and tumors with more concave points may exhibit more complex and irregular shapes, leading to higher concavity scores. (Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Compactness: perimeter^2 / area - 1.0.)
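The r = 0.92 figure for concavity vs concave points can be checked the same way (a sketch on the mean features of scikit-learn's copy of the data):

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

r_cc, _ = pearsonr(X["mean concavity"], X["mean concave points"])
r_comp, _ = pearsonr(X["mean compactness"], X["mean concavity"])
print(f"concavity vs concave points: r = {r_cc:.2f}")  # ≈ 0.92
print(f"compactness vs concavity:    r = {r_comp:.2f}")
```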
2. EDA - 2-2: Compactness VS Concave points VS Concavity 2. Feature plotting: Worst VS Mean For concavity and compactness, the worst and mean values have a similar overall distribution (worst ≒ mean); for concave points, they do not (worst ≠ mean). (Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Compactness: perimeter^2 / area - 1.0.)
2. EDA - 3: Data distribution Violin Plot Worst Mean Standard Error
2. EDA - 3: Data distribution Violin Plot (Standard Error) There is no clear separation in distribution between malignant and benign, because the standard error value has little meaning on its own.
2. EDA - 3: Data distribution Violin Plot (Worst) Red box: examples of features with good separation. Blue box: examples of features with bad separation. We assume that features with good separation will have higher feature importance.
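A sketch of the violin plots, assuming the standard recipe (standardize, melt to long form, split each violin by diagnosis); shown here for the ten mean features.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)
diag = pd.Series(raw.target).map({0: "M", 1: "B"})

# Standardize, then melt to long form: one violin per feature,
# split by diagnosis.
mean_cols = [c for c in X.columns if c.startswith("mean")]
Z = (X[mean_cols] - X[mean_cols].mean()) / X[mean_cols].std()
long = pd.concat([diag.rename("diagnosis"), Z], axis=1).melt(
    id_vars="diagnosis", var_name="feature", value_name="value")

plt.figure(figsize=(12, 5))
sns.violinplot(data=long, x="feature", y="value",
               hue="diagnosis", split=True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("violin_mean.png")
```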
3. Feature engineering - 1. Standardization Before After
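The before/after panels come from standardization; a minimal sketch with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Before: feature scales differ wildly (e.g. area vs smoothness)
print(X.mean(axis=0)[:4])

# After: every feature has zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0)[:4].round(6))  # ≈ 0
print(X_std.std(axis=0)[:4].round(6))   # ≈ 1
```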
3. Feature engineering - 2. Outlier Deletion: Swarm Plot Before After
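A sketch of the before/after swarm plot for one standardized feature; the 1.5×IQR deletion rule is an assumption, since the slides do not show which rule or features were used.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)
z = (X["mean radius"] - X["mean radius"].mean()) / X["mean radius"].std()

# Drop points outside the 1.5*IQR fences (assumed rule)
q1, q3 = z.quantile([0.25, 0.75])
iqr = q3 - q1
kept = z[(z >= q1 - 1.5 * iqr) & (z <= q3 + 1.5 * iqr)]

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
sns.swarmplot(y=z, ax=axes[0]); axes[0].set_title("Before")
sns.swarmplot(y=kept, ax=axes[1]); axes[1].set_title("After")
plt.tight_layout()
plt.savefig("swarm.png")
print(len(z) - len(kept), "outliers removed")
```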
4. Modeling 1: Splitting data 2: Classification 2-1: ANN 2-2: SVC vs Decision Tree vs AdaBoost vs Random Forest vs Extra Trees vs GBC vs Logistic Regression 2-3: Ensemble model 3: Cross-validate models 4: Hyperparameter tuning 5: Evaluating models
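The split-and-compare workflow in steps 1-3 can be sketched as below (scikit-learn estimators with default hyperparameters in a StandardScaler pipeline, 5-fold cross-validation on the training split; the ANN and ensemble steps are omitted here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
    "GBC": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=5000),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```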