This project aims to determine whether a tumor is malignant or benign by analyzing the Wisconsin breast cancer dataset.
Added: Aug 22, 2023
Slides: 46 pages
Wisconsin Breast Cancer dataset
GUGC, Da Hee Kim
Advised by Homin Park
Contents: Introduction; Loading & checking the data; Exploratory Data Analysis (EDA); Feature Engineering; Modeling; Interpretability / Explainable AI (XAI); Wrap-up; Future research
1. Importance
1. Introduction - The problem Christina Applegate Sharon Osbourne Angelina Jolie
1. Introduction - The problem The 10 leading types of estimated new cancer cases and deaths in 2020 (South Korea). (A) Estimated new cases (B) Estimated deaths
1. Introduction - The problem Cancer ≠ Tumor: a tumor is an abnormal growth of cells forming a mass of tissue. Malignant tumors are cancerous and invade other sites; benign tumors stay in their primary location.
1. Introduction - The problem: Diagnosis
1. Introduction - The problem Benign: uniform nucleus size, symmetrical, homogeneous, areas within normal size. Malignant: non-uniform nucleus, asymmetrical, non-homogeneous sizes, areas above normal size.
1. Introduction - The problem: Diagnosis Problems with current diagnosis: depending on the type, painful to the patient; potential side effects (e.g. bruising); diagnosis can take time; tedious process. Idea: Imaging → Machine learning model → Malignant / Benign
1. Wisconsin Breast Cancer dataset - What? Contains parameters measured from a fine needle aspirate of a breast mass. The parameters describe the cell nuclei (569 samples).
1. Wisconsin Breast Cancer dataset - What? The parameters include the following 10 features: Radius, Texture, Perimeter, Area, Smoothness, Compactness (perimeter^2 / area - 1.0), Concavity (severity of concave portions of the contour), Concave points (number of concave portions of the contour), Symmetry, Fractal dimension
1. Wisconsin Breast Cancer dataset - What? The parameters include the following 10 features. Texture: standard deviation of gray-scale values. Smoothness: local variation in radius lengths. Compactness: perimeter^2 / area - 1.0. Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Symmetry: uses a nucleus deformation parameter to measure how non-spherical a nucleus is. Fractal dimension: ("coastline approximation" - 1). [Figure: radius, perimeter, and area illustrated on the nuclei]
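The compactness formula quoted above can be sanity-checked in a few lines; the circle example below is an added illustration, not from the slides.

```python
import math

# Compactness as defined on the slide: perimeter^2 / area - 1.0.
# Illustrative check: for a perfect circle of radius r,
# perimeter = 2*pi*r and area = pi*r**2, so compactness equals
# 4*pi - 1 regardless of r; more irregular contours with the same
# area score higher than a circle.
def compactness(perimeter, area):
    return perimeter ** 2 / area - 1.0

r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
print(round(circle, 3))  # 11.566  (= 4*pi - 1)
```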
1. Wisconsin Breast Cancer dataset - What? For each feature (e.g. radius R), three values are computed over the tumor nuclei of a sample: the mean, the standard error, and the worst (the mean of the three largest values). [Figure: radius R measured on tumor nuclei]
1. Wisconsin Breast Cancer dataset - What?: Structure of the dataset Each of the 10 features is recorded as a Mean, Standard error, and Worst value, giving 30 feature columns. Together with the ID, diagnosis, and an Unnamed column, the dataset has 33 columns.
1. Wisconsin Breast Cancer dataset - Importance? Doctors would be able to determine whether a tumor is malignant or benign through imaging, without a biopsy. Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is difficult to conduct. (Imaging → Model → Malignant / Benign)
2. Machine learning process
1. Loading and checking data 1: Loading data 2: Checking for null values 3: Outlier detection 4: Summary and statistics
1. Loading and checking the data
1. Loading and checking the data - 1: Loading data Column groups: ID & Diagnosis, Mean, Standard error, Worst, Unnamed
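The loading step is not shown as code on the slide; a minimal sketch follows. It uses scikit-learn's built-in copy of the same 569-sample dataset, since the CSV filename used in the slides is not shown (a Kaggle-style CSV would additionally carry the ID and empty Unnamed columns).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer data (569 samples, 30 features).
raw = load_breast_cancer()
data = pd.DataFrame(raw.data, columns=raw.feature_names)
# In scikit-learn's encoding, target 0 = malignant, 1 = benign.
data["diagnosis"] = pd.Series(raw.target).map({0: "M", 1: "B"})

print(data.shape)              # (569, 31): 30 features + diagnosis
print(list(data.columns[:3]))
```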
1. Loading and checking the data - 2: Checking for null values The Unnamed column consists entirely of null values, so we will remove this column later.
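A sketch of the null check; the two-row frame below is a hypothetical stand-in for the CSV, whose trailing Unnamed column is entirely null.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the CSV: the trailing "Unnamed: 32"
# column contains only nulls.
data = pd.DataFrame({"radius_mean": [17.99, 20.57],
                     "Unnamed: 32": [np.nan, np.nan]})

print(data.isnull().sum())                 # every "Unnamed: 32" value is null
data = data.drop(columns=["Unnamed: 32"])  # so we drop the column
print(list(data.columns))
```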
1. Loading and checking the data - 3: Outlier detection Redefine "X" to include only the features. Output: outliers are found based only on the feature columns (x_col); we will drop these rows later.
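The slides do not show the detection rule; the sketch below assumes the common 1.5×IQR criterion, applied per feature column (x_col), on scikit-learn's copy of the data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)  # features only

def iqr_outliers(col):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Rows flagged as an outlier in at least one feature column
mask = X.apply(iqr_outliers).any(axis=1)
print(mask.sum(), "rows contain at least one outlier")
```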
1. Loading and checking the data - 4: Summary and statistics Redefine "X" to include only the features. Output: we can observe the statistical values for each of the features.
1. Loading and checking the data - 4: Summary and statistics Redefine "data_w_diag" to include the diagnosis and the 30 features. Output: we can observe the number of benign and malignant tumors. Number of Benign: 357 Number of Malignant: 212
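A sketch of the summary step, again on scikit-learn's copy of the data (the slides build a data_w_diag frame from the CSV):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
data_w_diag = pd.DataFrame(raw.data, columns=raw.feature_names)
data_w_diag["diagnosis"] = pd.Series(raw.target).map({0: "M", 1: "B"})

print(data_w_diag.describe())  # per-feature count/mean/std/quartiles

counts = data_w_diag["diagnosis"].value_counts()
print("Number of Benign:", counts["B"])     # 357
print("Number of Malignant:", counts["M"])  # 212
```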
2. Exploratory Data Analysis 1: Heat map of all features 2: Important features 2-1: Radius VS Perimeter VS Area (1: Heat map, 2: Feature plotting) 2-2: Compactness VS Concavity VS Concave points (1: Heat map, 2: Feature plotting: Histogram) 3: Overall data distribution
2. EDA - 1: Heat map We can see a couple of relations using the heat map. Within the mean or worst features, radius is highly correlated with the perimeter and the area, and compactness, concavity, and concave points are correlated with each other.
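A sketch of the heat map, assuming the usual seaborn approach; the printed correlations confirm the two clusters the slide points out.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

corr = X.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.tight_layout()
plt.savefig("heatmap.png")

# Radius vs perimeter/area, and compactness vs concavity/concave
# points, are the strongly correlated groups visible in the map.
print(corr.loc["mean radius", "mean perimeter"])      # > 0.99
print(corr.loc["mean compactness", "mean concavity"])
```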
2. EDA - 2-1: Radius VS Perimeter VS Area 1. Heat map We will check the correlation of area VS perimeter VS radius. Although the cells have different colors, the displayed value of 1.0 is due to rounding of the correlation values.
2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting: Joint plots [Figures: pairwise joint plots of radius, perimeter, and area for the Mean features and for the Worst features]
2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting (Mean, Worst) The Pearson r values represent the correlation between features. Values above 0.97 explain why the heatmap showed a correlation of 1.0: it was due to rounding. Although the correlations are very high, we are not going to remove any features, because we will see later that their feature importance varies.
2. EDA - 2-1: Radius VS Perimeter VS Area 2. Feature plotting (Mean, Worst) The relationships are again not perfectly linear, so we can assume that the 1.0 correlation on the heatmap was due to rounding of the values. In the feature engineering step, we will choose one of the three features for dimensionality reduction.
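The Pearson r values quoted for the three size features can be reproduced with scipy (a sketch, using scikit-learn's copy of the data):

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

# Pairwise Pearson r among the three size-related mean features
pairs = [("mean radius", "mean perimeter"),
         ("mean radius", "mean area"),
         ("mean perimeter", "mean area")]
rs = {f"{a} vs {b}": pearsonr(X[a], X[b])[0] for a, b in pairs}
for name, r in rs.items():
    print(f"{name}: r = {r:.3f}")  # all three exceed 0.97
```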
2. EDA - 2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following features: compactness & concavity & concave points (high correlation); compactness mean VS compactness worst; concavity mean VS concavity worst; concave points mean VS concave points worst.
2. EDA - 2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following pairs: compactness VS concavity; concavity VS concave points; compactness mean VS compactness worst; concavity mean VS concavity worst; concave points mean VS concave points worst.
2. EDA - 2-2: Compactness VS Concave points VS Concavity 2. Feature plotting (Mean, Worst) The Pearson r values are 0.86-0.92, so the feature pairs are highly correlated. A potential reason for the especially high correlation between concave points and concavity (r = 0.92) is morphological: both features describe the contour of the tumor nuclei, and tumors with more concave points may exhibit more complex and irregular shapes, leading to higher concavity scores. (Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Compactness: perimeter^2 / area - 1.0.)
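The r = 0.92 figure for concavity vs concave points can be checked the same way (a sketch on the mean features of scikit-learn's copy of the data):

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

r_cc, _ = pearsonr(X["mean concavity"], X["mean concave points"])
r_comp, _ = pearsonr(X["mean compactness"], X["mean concavity"])
print(f"concavity vs concave points: r = {r_cc:.2f}")  # ≈ 0.92
print(f"compactness vs concavity:    r = {r_comp:.2f}")
```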
2. EDA - 2-2: Compactness VS Concave points VS Concavity 2. Feature plotting: Worst VS Mean For concavity and compactness, the worst and mean values have a similar overall distribution (worst ≒ mean); for concave points, they do not (worst ≠ mean). (Concavity: severity of concave portions of the contour. Concave points: number of concave portions of the contour. Compactness: perimeter^2 / area - 1.0.)
2. EDA - 3: Data distribution Violin Plot Worst Mean Standard Error
2. EDA - 3: Data distribution Violin Plot (Standard Error) There is no clear separation in distribution between malignant and benign, because the standard error value has little meaning on its own.
2. EDA - 3: Data distribution Violin Plot (Worst) Red box: examples of features with good separation. Blue box: examples of features with bad separation. We assume that features with good separation will have higher feature importance.
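A sketch of the violin plots, assuming the standard recipe (standardize, melt to long form, split each violin by diagnosis); shown here for the ten mean features.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)
diag = pd.Series(raw.target).map({0: "M", 1: "B"})

# Standardize, then melt to long form: one violin per feature,
# split by diagnosis.
mean_cols = [c for c in X.columns if c.startswith("mean")]
Z = (X[mean_cols] - X[mean_cols].mean()) / X[mean_cols].std()
long = pd.concat([diag.rename("diagnosis"), Z], axis=1).melt(
    id_vars="diagnosis", var_name="feature", value_name="value")

plt.figure(figsize=(12, 5))
sns.violinplot(data=long, x="feature", y="value",
               hue="diagnosis", split=True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("violin_mean.png")
```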
3. Feature engineering - 1. Standardization Before After
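The before/after panels come from standardization; a minimal sketch with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Before: feature scales differ wildly (e.g. area vs smoothness)
print(X.mean(axis=0)[:4])

# After: every feature has zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0)[:4].round(6))  # ≈ 0
print(X_std.std(axis=0)[:4].round(6))   # ≈ 1
```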
3. Feature engineering - 2. Outlier Deletion: Swarm Plot Before After
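A sketch of the before/after swarm plot for one standardized feature; the 1.5×IQR deletion rule is an assumption, since the slides do not show which rule or features were used.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()
X = pd.DataFrame(raw.data, columns=raw.feature_names)
z = (X["mean radius"] - X["mean radius"].mean()) / X["mean radius"].std()

# Drop points outside the 1.5*IQR fences (assumed rule)
q1, q3 = z.quantile([0.25, 0.75])
iqr = q3 - q1
kept = z[(z >= q1 - 1.5 * iqr) & (z <= q3 + 1.5 * iqr)]

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
sns.swarmplot(y=z, ax=axes[0]); axes[0].set_title("Before")
sns.swarmplot(y=kept, ax=axes[1]); axes[1].set_title("After")
plt.tight_layout()
plt.savefig("swarm.png")
print(len(z) - len(kept), "outliers removed")
```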
4. Modeling 1: Splitting data 2: Classification 2-1: ANN 2-2: SVC vs Decision Tree vs AdaBoost vs Random Forest vs Extra Trees vs GBC vs Logistic Regression 2-3: Ensemble model 3: Cross-validate models 4: Hyperparameter tuning 5: Evaluating models
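The split-and-compare workflow in steps 1-3 can be sketched as below (scikit-learn estimators with default hyperparameters in a StandardScaler pipeline, 5-fold cross-validation on the training split; the ANN and ensemble steps are omitted here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
    "GBC": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=5000),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```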