Predict diabetes using machine learning with several methods: KNN, Decision Tree, MLP, and Logistic Regression.
Size: 7.26 MB
Language: en
Added: Jun 27, 2024
Slides: 49 pages
Slide Content
21 May 2024
FI4002 Simulasi dan Pemodelan Sistem Fisis (Simulation and Modeling of Physical Systems)
10220003 Bernike Hernita Sofiana
10220027 Annisa Sri Wardifa
10220075 Alyssa Hanifa Dhiyani
Diabetes Prediction
Kharwal, A. (2020, October 23). Predict diabetes with machine learning. thecleverprogrammer. https://thecleverprogrammer.com/2020/07/13/predict-diabetes-with-machine-learning/
Research Based Learning:
Machine Learning
Diabetes
Definition: a chronic disease that occurs when the
pancreas does not produce enough insulin or
when the body cannot effectively use the
insulin it produces.
Number of people with diabetes: rose from
108 million in 1980 to 422 million in 2014.
Prevalence is rising more rapidly in low- and
middle-income countries.
Manifestations: thick skin, high blood pressure,
weight loss, etc.
Loke, A. (2023, April 5). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes
1
K-Nearest Neighbors
Definition: supervised machine
learning method employed to tackle
classification and regression
problems
Ability: adapt to different patterns
and make predictions based on the
local structure of the data.
Steps:
1. Selecting the optimal K value
2. Calculating distances
3. Finding the nearest neighbors
4. Voting for classification or taking the average for regression
K-Nearest Neighbor(KNN) algorithm. GeeksforGeeks. (2024, January 25). https://www.geeksforgeeks.org/k-nearest-neighbours/
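The four KNN steps above can be sketched with scikit-learn, which performs the distance computation, neighbor search, and voting internally in `fit`/`predict`. This is a minimal sketch on synthetic stand-in data (the real deck uses the Pima diabetes dataset):

```python
# Minimal KNN sketch; make_classification stands in for the diabetes data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 768 samples, 8 features, as in the slides.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: choose K; steps 2-4 (distances, neighbors, voting) happen in fit/predict.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of correct test predictions
```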
2
Import Data
3
Data Set Used
768 data points with 9 features each
4
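Loading the dataset might look like the sketch below; the file name `diabetes.csv` and the column names (from the Pima Indians Diabetes dataset) are assumptions, so a fallback keeps the sketch runnable without the file:

```python
import pandas as pd

# Assumed file name and column names (Pima Indians Diabetes dataset).
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
try:
    df = pd.read_csv("diabetes.csv")
except FileNotFoundError:
    # Fallback: empty frame with the expected 9 columns so the sketch runs.
    df = pd.DataFrame(columns=columns)

print(df.shape[1])  # 9 columns: 8 features plus the Outcome target
```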
Outcome
Outcome: the feature to be predicted -> 0 = no diabetes, 1 = diabetes
5
Outcome
Distribution
0: 500 counts
1: 268 counts
6
7
Confirm the connection
between model
complexity & accuracy
8
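The complexity-accuracy check above can be sketched by sweeping K and comparing training against test accuracy (small K means a more complex model). Synthetic stand-in data is used here:

```python
# Sweep K to relate model complexity and accuracy for KNN.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=66)

train_acc, test_acc = [], []
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

# K=1 memorizes the training set: each point is its own nearest neighbor.
print(train_acc[0])
```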
KNN Model
Confusion Matrix
Results
Accuracy of KNN on training set: 79%
Accuracy of KNN on test set: 78%
Accuracy of Decision Tree on training set: 100%
Accuracy of Decision Tree on test set: 71.4%
Overfitting -> apply pre-pruning
9
Decision Tree Classifier
Definition: Non-parametric supervised learning method used for classification and regression
Goal: To create a model that predicts the value of a target variable by learning simple decision rules
Advantage: Able to handle both numerical and categorical data
Disadvantage: Predictions of decision trees are neither smooth nor continuous, and trees are not good at extrapolation
10
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. https://scikit-learn.org/stable/modules/tree.html
Pre-Pruning
Set max_depth = 3 to decrease overfitting
Accuracy of Decision Tree on training set: 77.3%
Accuracy of Decision Tree on test set: 74%
11
Accuracy of Decision Tree on training set: 76.4%
Accuracy of Decision Tree on test set: 71.9%
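The pre-pruning step above can be sketched by capping `max_depth`, which stops the tree from memorizing the training set. Synthetic stand-in data is used:

```python
# Pre-pruning sketch: unrestricted vs depth-limited decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(full.score(X_train, y_train))  # unrestricted tree fits training data perfectly
print(pruned.get_depth())            # pruned tree is at most 3 levels deep
```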
Feature Importance
0: not used at all
1: perfectly predicts the target
How important each feature is to the decisions the decision tree classifier makes.
12
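Reading the importances off a fitted tree is a one-liner: `feature_importances_` gives one value per feature, 0 for unused features, and the values sum to 1. A sketch on synthetic stand-in data:

```python
# Feature importance sketch for a fitted decision tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

imp = tree.feature_importances_
print(np.round(imp, 3))  # one value per feature, each between 0 and 1
print(imp.sum())         # importances are normalized to sum to 1
```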
Feature Importance
Visualization
(without test size)
13
Feature Importance
Visualization
(with test size)
14
Feature Importance
Visualization Comparison
14
No test size vs. test size = 0.3
Correlation
Matrix
15
Correlation Matrix
16
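The correlation matrix shown on the slide is `df.corr()`; a heatmap (e.g. with matplotlib or seaborn) would render it as pictured. A sketch on a small synthetic stand-in frame (the column names are assumptions):

```python
# Correlation matrix sketch; random columns stand in for the diabetes features.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(768, 3)),
                  columns=["Glucose", "BMI", "Outcome"])

corr = df.corr()
print(corr.shape)                      # square: one row/column per feature
print(corr.loc["Glucose", "Glucose"])  # diagonal entries are 1.0
```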
Multi-Layer Perceptron
to Predict Diabetes
Accuracy on training set: 0.73
Accuracy on test set: 0.72
17
Re-scale the Data
Accuracy on training set: 0.823
Accuracy on test set: 0.802
17
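The re-scaling step above matters because MLPs are sensitive to feature scale; standardizing the inputs typically lifts accuracy, as the slide's numbers show. A minimal sketch on synthetic stand-in data:

```python
# MLP with standardized inputs, as in the re-scaling step.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler rescales each feature to zero mean and unit variance
# before the data reaches the MLP.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```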
Increasing the Number
of Parameters
Accuracy on training set: 0.806
Accuracy on test set: 0.797
18
Logistic Regression
19
Accuracy
Score
Logistic Regression Model Accuracy: 0.7359307359307359
20
Confusion Matrix
(cell values from the plot: 50, 30, 120, 31)
21
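The logistic-regression evaluation above (accuracy score plus confusion matrix on the held-out test set) can be sketched as follows, again on synthetic stand-in data:

```python
# Logistic regression: accuracy score and confusion matrix on the test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = logreg.predict(X_test)

print(accuracy_score(y_test, pred))
cm = confusion_matrix(y_test, pred)
print(cm)  # rows: true class, columns: predicted class
```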
Diabetes
with Equal
Outcome
22
Data Set Used
536 data points with 9 features each
23
Outcome
Distribution
0: 268 counts
1: 268 counts
24
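One way to build this balanced "equal outcome" set is to undersample the majority class (0) down to the 268 minority samples; the slides do not state the exact method, so this is an assumed sketch, with a synthetic frame standing in for the real data:

```python
# Undersampling sketch: trim class 0 from 500 to 268 rows to match class 1.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 negatives and 268 positives, as in the slides.
df = pd.DataFrame({"Glucose": rng.normal(size=768),
                   "Outcome": [0] * 500 + [1] * 268})

neg = df[df["Outcome"] == 0].sample(n=268, random_state=0)
pos = df[df["Outcome"] == 1]
balanced = pd.concat([neg, pos]).sample(frac=1, random_state=0)  # shuffle rows

print(balanced["Outcome"].value_counts())  # 268 of each class, 536 rows total
```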
Outcome: the feature to be predicted ->
0 = no diabetes,
1 = diabetes
Confirm the connection between model
complexity & accuracy
25
Results: Original Data vs Equal Outcome

Equal Outcome   KNN    Decision Tree
Training Set    87%    100%
Test Set        81%    70.8%
26
Pre-Pruning
Set max_depth = 3 to decrease overfitting

Equal Outcome: training set 82.4%, test set 85.7%
Original Data: training set 77.3%, test set 74%
27
Feature Importance
0: not used at all
1: perfectly predicts the target
How important each feature is to the decisions the decision tree classifier makes.
28
Original Data
Equal Outcome
Feature Importance
Visualization
29
Original Data Equal Outcome
Correlation Matrix
30
Results of MLP Classifier

               Original Accuracy   Re-scaled Data   Increased Number of Parameters
Training Set   83%                 84%              99.7%
Test Set       71%                 72.7%            69.6%
31
Confusion Matrix
Logistic Regression Model Accuracy: 0.7453416149068323
32
Summary: Original Data vs Equal Diabetes Outcome

                         KNN                 Decision Tree        MLP Classifier       Logistic      Most Important
                         Training   Test     Training   Test      Training   Test      Regression    Feature          Correlation
Original Data            79%        78%      100%       71.4%     73%        72%       73%           Glucose          Glucose
Equal Diabetes Outcome   87%        81%      82.4%      85.7%     83%        72.7%     74%           Insulin          Glucose
33
Diabetes without
Glucose Feature
34
Confirm the connection between model
complexity & accuracy
35
Results: Original Data vs Data without glucose

Data without glucose   KNN    Decision Tree
Training Set           74%    79%
Test Set               66%    72.7%

Original Data          KNN    Decision Tree
Training Set           79%    100%
Test Set               78%    71.4%
36
Pre-Pruning
Set max_depth = 3 to decrease overfitting

Data without glucose: training set 79%, test set 72.7%
Original Data: training set 77.3%, test set 74%
37
Feature Importance
How important each feature is to the decisions the decision tree classifier makes.
38
Original Data
Data without glucose
Feature Importance
Visualization
39
Correlation Matrix: Original Data vs Data without glucose
41
Confusion Matrix: Data without glucose

Logistic Regression Model Accuracy:
Original Data: 0.7359307359307359
Data without glucose: 0.670995670995671
42
Summary: Original Data vs Data without Glucose

                       KNN                 Decision Tree        MLP Classifier       Logistic      Feature
                       Training   Test     Training   Test      Training   Test      Regression    Importance   Correlation
Original Data          79%        78%      100%       71.4%     73%        72%       73%           Glucose      Glucose
Data without glucose   74%        76.6%    100%       71.9%     72%        69.3%     73%           Age          BMI
43