Diabetes Prediction by Using Machine Learning

Annisa Sri Wardifa · 49 slides · Jun 27, 2024

About This Presentation

Predict diabetes by using machine learning with several methods such as KNN, Decision Tree, MLP, and Logistic Regression.


Slide Content

21 May 2024
FI4002 Simulation and Modeling of Physical Systems (Simulasi dan Pemodelan Sistem Fisis)
10220003 Bernike Hernita Sofiana
10220027 Annisa Sri Wardifa
10220075 Alyssa Hanifa Dhiyani
Diabetes Prediction
Research Based Learning: Machine Learning
Kharwal, A. (2020, October 23). Predict diabetes with machine learning. thecleverprogrammer. https://thecleverprogrammer.com/2020/07/13/predict-diabetes-with-machine-learning/

Diabetes
Definition: a chronic disease that arises when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces.
Number of people with diabetes: rose from 108 million in 1980 to 422 million in 2014.
Prevalence is rising more rapidly in low- and middle-income countries.
Manifestations: thick skin, high blood pressure, weight loss, etc.
Loke, A. (2023, April 5). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes

K-Nearest Neighbors
Definition: a supervised machine learning method employed to tackle classification and regression problems.
Ability: adapts to different patterns and makes predictions based on the local structure of the data.
Steps (a minimal sketch follows this list):
1. Selecting the optimal K value
2. Calculating distances
3. Finding the nearest neighbors
4. Voting for classification, or taking the average for regression
K-Nearest Neighbor (KNN) algorithm. GeeksforGeeks. (2024, January 25). https://www.geeksforgeeks.org/k-nearest-neighbours/
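A minimal sketch of these four steps with scikit-learn; the toy points and the choice K = 3 are purely illustrative assumptions, not data from the slides.

```python
# Minimal KNN sketch; the toy points and K = 3 are illustrative assumptions.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [2, 1], [8, 8], [9, 8]]        # toy feature vectors
y = [0, 0, 1, 1]                            # toy class labels

knn = KNeighborsClassifier(n_neighbors=3)   # step 1: select K
knn.fit(X, y)                               # KNN just stores the training data
print(knn.predict([[8, 7]]))                # steps 2-4: distances, neighbors, vote -> [1]
```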

Import Data
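A minimal sketch of this step, assuming the standard Pima Indians Diabetes CSV used by the referenced tutorial; the file name "diabetes.csv" is an assumption.

```python
# Load the dataset; "diabetes.csv" is an assumed file name for the standard
# Pima Indians Diabetes data used in the referenced tutorial.
import pandas as pd

df = pd.read_csv("diabetes.csv")
print(df.shape)    # expected: (768, 9)
print(df.head())
```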

Data Set Used
768 data points with 9 features each

Outcome
Outcome: the feature to be predicted -> 0 = No diabetes, 1 = Diabetes

Outcome Distribution
0: 500 counts
1: 268 counts
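A quick way to reproduce this count, reusing the DataFrame `df` from the loading sketch above:

```python
# Class balance of the target column.
print(df["Outcome"].value_counts())
# expected:
# 0    500
# 1    268
```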


Confirm the connection between model complexity & accuracy
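A sketch of how this check is typically done for KNN: sweep n_neighbors and compare training and test accuracy. The split settings (stratify, random_state) are assumptions.

```python
# Model complexity vs. accuracy for KNN: small K overfits, large K underfits.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = df.drop(columns="Outcome"), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=66)      # split settings are assumptions

neighbors = range(1, 11)
train_acc, test_acc = [], []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

plt.plot(neighbors, train_acc, label="training accuracy")
plt.plot(neighbors, test_acc, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```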

KNN Model Confusion Matrix
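A sketch of producing this matrix, re-fitting KNN with a fixed K from the sweep above; K = 9 is an assumption, not a value confirmed by the slides.

```python
# Confusion matrix for the KNN model; rows = true class, columns = predicted.
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)  # K = 9 assumed
print(confusion_matrix(y_test, knn.predict(X_test)))
```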

Results
Accuracy of KNN on training set: 79%
Accuracy of KNN on test set: 78%
Accuracy of Decision Tree on training set: 100%
Accuracy of Decision Tree on test set: 71.4%
Overfitting -> apply pre-pruning

Decision Tree Classifier
Definition: a non-parametric supervised learning method used for classification and regression.
Goal: to create a model that predicts the value of a target variable by learning simple decision rules.
Advantage: able to handle both numerical and categorical data.
Disadvantage: predictions of decision trees are neither smooth nor continuous, and they are not good at extrapolation.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. https://scikit-learn.org/stable/modules/tree.html
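A minimal sketch of the unpruned tree that produces the train/test gap reported above; random_state is an assumption.

```python
# Unpruned decision tree: fits the training set perfectly, generalizes worse.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"training accuracy: {tree.score(X_train, y_train):.3f}")  # 1.000 -> overfitting
print(f"test accuracy:     {tree.score(X_test, y_test):.3f}")
```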

Pre-Pruning
Set max_depth = 3 to decrease overfitting.
Accuracy of Decision Tree on training set: 77.3%
Accuracy of Decision Tree on test set: 74%
Accuracy of Decision Tree on training set: 76.4%
Accuracy of Decision Tree on test set: 71.9%
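A sketch of the pre-pruning step, capping the depth so the tree cannot memorize the training set:

```python
# Pre-pruned tree: max_depth = 3 limits model complexity.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))
```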

Feature Importance
How important each feature is for the decisions a decision tree classifier makes:
0: not used at all
1: perfectly predicts the target
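A sketch of reading and plotting the fitted tree's importances; `tree` is the classifier from the pre-pruning sketch above, and the horizontal bar chart is an assumed presentation style.

```python
# Feature importances of the fitted tree: one value per feature, summing to 1.
import numpy as np
import matplotlib.pyplot as plt

features = X.columns
plt.barh(np.arange(len(features)), tree.feature_importances_)
plt.yticks(np.arange(len(features)), features)
plt.xlabel("Feature importance")
plt.show()
```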

Feature Importance Visualization (without test size)

Feature Importance Visualization (with test size)

Feature Importance Visualization Comparison: no test size vs. test size = 0.3

Correlation Matrix
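A sketch of the correlation matrix; using seaborn's heatmap is an assumption about how the slide's figure was drawn.

```python
# Pairwise Pearson correlations between all nine columns, drawn as a heatmap.
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```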

Multi-Layer Perceptron to Predict Diabetes
Accuracy on training set: 0.73
Accuracy on test set: 0.72
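A minimal sketch of the MLP fit on the raw (unscaled) features; random_state and max_iter are assumptions.

```python
# MLP on unscaled data; accuracy stays modest because the features have very
# different ranges, which MLPs are sensitive to.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=42, max_iter=1000).fit(X_train, y_train)
print(mlp.score(X_train, y_train), mlp.score(X_test, y_test))
```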

Re-scale the Data
Accuracy on training set: 0.823
Accuracy on test set: 0.802
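A sketch of the re-scaling step; StandardScaler is an assumption about how the slides normalized the data.

```python
# Standardize features to zero mean / unit variance, fitting the scaler on the
# training split only to avoid leaking test-set statistics.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

mlp = MLPClassifier(random_state=42, max_iter=1000).fit(X_train_s, y_train)
print(mlp.score(X_train_s, y_train), mlp.score(X_test_s, y_test))
```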

Increasing the Number of Parameters
Accuracy on training set: 0.806
Accuracy on test set: 0.797
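One way to add parameters is to widen or deepen the hidden layers; the sizes below are illustrative assumptions, not the slides' actual configuration.

```python
# A larger network has more parameters and can overfit without regularization.
mlp_big = MLPClassifier(hidden_layer_sizes=(100, 100), random_state=42,
                        max_iter=1000).fit(X_train_s, y_train)
print(mlp_big.score(X_train_s, y_train), mlp_big.score(X_test_s, y_test))
```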

Logistic Regression
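A minimal logistic regression sketch on the same train/test split; the solver defaults and max_iter are assumptions.

```python
# Logistic regression baseline.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Logistic Regression Model Accuracy:",
      accuracy_score(y_test, logreg.predict(X_test)))
```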

Accuracy Score
Logistic Regression Model Accuracy: 0.7359307359307359

Confusion Matrix
[Confusion matrix plot for logistic regression; the recoverable cell values are 120 and 50 (correct predictions) and 30 and 31 (errors), consistent with the accuracy (120 + 50) / 231 ≈ 0.736 above.]

Diabetes with Equal Outcome

Data Set Used
536 data points with 9 features each

Outcome Distribution
Outcome: the feature to be predicted -> 0 = No diabetes, 1 = Diabetes
0: 268 counts
1: 268 counts
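A sketch of one way to build this balanced set, undersampling the majority class; the sampling call and random_state are assumptions about how the slides constructed it.

```python
# Balance the classes: keep all 268 positives, draw 268 of the 500 negatives,
# giving the 536-row dataset described above.
import pandas as pd

df_equal = pd.concat([
    df[df["Outcome"] == 0].sample(n=268, random_state=0),  # undersampled majority
    df[df["Outcome"] == 1],                                # all positive cases
])
print(df_equal["Outcome"].value_counts())   # 0: 268, 1: 268
```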

Confirm the connection between model complexity & accuracy (Original Data vs. Equal Outcome)

Results (Equal Outcome)
              KNN    Decision Tree
Training Set  87%    100%
Test Set      81%    70.8%

Pre-Pruning
Set max_depth = 3 to decrease overfitting. Decision Tree accuracy:
              Original Data   Equal Outcome
Training Set  77.3%           82.4%
Test Set      74%             85.7%

Feature Importance (Original Data vs. Equal Outcome)
How important each feature is for the decisions a decision tree classifier makes:
0: not used at all
1: perfectly predicts the target

Feature Importance Visualization (Original Data vs. Equal Outcome)

Correlation Matrix

Results of MLP Classifier (Equal Outcome)
              Original   Re-scaled Data   Increased Number of Parameters
Training Set  83%        84%              99.7%
Test Set      71%        72.7%            69.6%

Confusion Matrix
Logistic Regression Model Accuracy: 0.7453416149068323

 
Summary: Original Data vs. Equal Diabetes Outcome

Original Data:
  KNN: 79% training / 78% test
  Decision Tree: 100% training / 71.4% test
  MLP Classifier: 73% training / 72% test
  Logistic Regression: 73%
  Most important feature: Glucose; most correlated feature: Glucose

Equal Diabetes Outcome:
  KNN: 87% training / 81% test
  Decision Tree: 82.4% training / 85.7% test
  MLP Classifier: 83% training / 72.7% test
  Logistic Regression: 74%
  Most important feature: Insulin; most correlated feature: Glucose

Diabetes without Glucose Feature
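A sketch of the setup for this experiment: drop the Glucose column and rebuild the split before re-running the same models. The variable names and split settings are assumptions.

```python
# Remove the single most important feature and re-split the data.
from sklearn.model_selection import train_test_split

X_ng = df.drop(columns=["Outcome", "Glucose"])
y = df["Outcome"]
X_train_ng, X_test_ng, y_train_ng, y_test_ng = train_test_split(
    X_ng, y, stratify=y, random_state=66)   # same assumed split settings as before
```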

Confirm the connection between model complexity & accuracy (Original Data vs. Data without Glucose)

Results

Original Data:
              KNN    Decision Tree
Training Set  79%    100%
Test Set      78%    71.4%

Data without Glucose:
              KNN    Decision Tree
Training Set  74%    79%
Test Set      66%    72.7%

Pre-Pruning
Set max_depth = 3 to decrease overfitting. Decision Tree accuracy:
              Original Data   Data without Glucose
Training Set  77.3%           79%
Test Set      74%             72.7%

Feature Importance (Original Data vs. Data without Glucose)
How important each feature is for the decisions a decision tree classifier makes.

Feature Importance Visualization (Original Data vs. Data without Glucose)

Correlation Matrix (Data without Glucose)

Confusion Matrix
Original Data: Logistic Regression Model Accuracy: 0.7359307359307359
Data without Glucose: Logistic Regression Model Accuracy: 0.670995670995671

 
Summary: Original Data vs. Data without Glucose

Original Data:
  KNN: 79% training / 78% test
  Decision Tree: 100% training / 71.4% test
  MLP Classifier: 73% training / 72% test
  Logistic Regression: 73%
  Most important feature: Glucose; most correlated feature: Glucose

Data without Glucose:
  KNN: 74% training / 76.6% test
  Decision Tree: 100% training / 71.9% test
  MLP Classifier: 72% training / 69.3% test
  Logistic Regression: 73%
  Most important feature: Age; most correlated feature: BMI

THANK YOU