Presentation on project group 2
1. Prediction of Diabetes in Pima Indian Women
Submitted By: Group-2
Aditya K/24
R Panneerselvam/16
Shreedhar Mani Tripathi/04
Vivek Dubey/23
2. Background
• This dataset is originally from the National Institute of Diabetes
and Digestive and Kidney Diseases in the USA.
• The objective of the dataset is to diagnostically predict whether or
not a woman patient has diabetes, based on certain diagnostic
measurements included in the dataset.
3. The key variables
• Pregnancies: Number of times pregnant
• Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• Blood Pressure: Diastolic blood pressure (mm Hg)
• Skin Thickness: Triceps skin fold thickness (mm)
• Insulin: 2-hour serum insulin (mu U/ml)
• BMI: Body mass index (weight in kg / (height in m)^2)
• Diabetes Pedigree Function: Diabetes pedigree function
• Age: Age (years)
• Outcome: Class variable (0 or 1)
4. Steps
• Data Visualization
• Data Preprocessing: Splitting of the data set into
three parts for training, validation and test.
• Logistic Regression
• Building a Model for diabetes prediction
• Validation of Regression Model
• Testing of Regression Model
• Conclusion
7. DATA VISUALIZATION: Pregnancy
• The number of pregnancies per woman ranged from 0 to 17, with a mean
of 3.845 and a median of 3.
[Chart: No. of Pregnancies. Min. 0, 1st Qu. 1, Median 3, Mean 3.845, 3rd Qu. 6, Max. 17]
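The labels on the visualization slides (Min, 1st Qu, Median, Mean, 3rd Qu, Max) are the ones printed by R's `summary()`. A minimal Python sketch of the same summary is shown here using only the standard library. The helper name `summary_stats` and the sample values are assumptions for illustration; the values are toy numbers, not the dataset.

```python
import statistics

def summary_stats(values):
    """Return min, quartiles, median, mean and max, similar to R's summary()."""
    # method="inclusive" interpolates between data points; it should
    # correspond to R's default quantile definition (type 7)
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values),
        "1st Qu": q1,
        "Median": q2,
        "Mean": statistics.mean(values),
        "3rd Qu": q3,
        "Max": max(values),
    }

# Toy pregnancy counts for illustration only (not the real data)
toy_pregnancies = [0, 1, 1, 3, 3, 6, 6, 17]
summary = summary_stats(toy_pregnancies)
for name, value in summary.items():
    print(f"{name:>7}: {value}")
```

The same helper could be applied to each column in turn to reproduce the remaining visualization slides.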
8. DATA VISUALIZATION: Glucose
• The maximum glucose level recorded for a Pima Indian woman was 199 mg/dL;
the mean is 120.9 mg/dL and the median is 117 mg/dL.
[Chart: Glucose. 1st Qu. 99, Median 117, Mean 120.9, 3rd Qu. 140.2, Max. 199]
9. DATA VISUALIZATION: BP
• The maximum BP recorded for a Pima Indian woman was 122 mmHg; the mean
is 69.11 mmHg and the median is 72 mmHg.
[Chart: Blood Pressure. 1st Qu. 62, Median 72, Mean 69.11, 3rd Qu. 80, Max. 122]
10. DATA VISUALIZATION: Skin Thickness
• The maximum skin thickness recorded for a Pima Indian woman was 99 mm;
the mean is 20.54 mm and the median is 23 mm.
[Chart: Skin Thickness. Median 23, Mean 20.54, 3rd Qu. 32, Max. 99]
11. DATA VISUALIZATION: Insulin
• The maximum insulin recorded for a Pima Indian woman was 846 mu U/ml;
the mean is 79.8 mu U/ml and the median is 30.5 mu U/ml.
[Chart: Insulin. Median 30.5, Mean 79.8, 3rd Qu. 127.2, Max. 846]
12. DATA VISUALIZATION: BMI
• The maximum BMI recorded for a Pima Indian woman was 67.1; the mean
is 31.99 and the median is 32.
[Chart: BMI. 1st Qu. 27.3, Median 32, Mean 31.99, 3rd Qu. 36.6, Max. 67.1]
13. DATA VISUALIZATION: DPF
• The maximum Diabetes Pedigree Function recorded for a Pima Indian woman
was 2.42; the mean is 0.4719 and the median is 0.3725.
[Chart: Diabetes Pedigree Function. Min. 0.078, 1st Qu. 0.2437, Median 0.3725, Mean 0.4719, 3rd Qu. 0.6262, Max. 2.42]
14. DATA VISUALIZATION: Age
• The maximum age of a Pima Indian woman in the data was 81 years; the mean
age is 33.24 years and the median is 29 years.
[Chart: Age. Min. 21, 1st Qu. 24, Median 29, Mean 33.24, 3rd Qu. 41, Max. 81]
15. DATA VISUALIZATION
• Comparison of the means of the independent variables for Pima Indian women with and without
diabetes: the data show that BP, BMI, glucose, insulin and age are higher on average for
diabetic women than for women without diabetes. These findings are in line with
expectations.
[Chart: mean of each independent variable (Preg, Glucose, BP, ST, Insulin, BMI, DPF, Age) for women without diabetes vs. with diabetes]
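The comparison on this slide amounts to averaging each variable separately for the two outcome classes. A minimal sketch of that calculation follows; the helper `group_means` and the four records are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

def group_means(rows, group_key):
    """Mean of every other column, computed separately for each group value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    return {
        group: {
            col: mean(r[col] for r in members)
            for col in members[0] if col != group_key
        }
        for group, members in grouped.items()
    }

# Toy records for illustration only (not the real data)
toy_rows = [
    {"Glucose": 100, "BMI": 28.0, "Outcome": 0},
    {"Glucose": 110, "BMI": 30.0, "Outcome": 0},
    {"Glucose": 150, "BMI": 35.0, "Outcome": 1},
    {"Glucose": 130, "BMI": 33.0, "Outcome": 1},
]
means = group_means(toy_rows, "Outcome")
for outcome, cols in sorted(means.items()):
    print(outcome, cols)
```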
16. Logistic Regression: Prediction of Diabetes
• DATA PREPROCESSING: SPLITTING OF DATA:
• Before fitting the regression, we divided the
data into three parts:
• Training data set contains 70% of the total observations.
• Validation data set contains 20% of the total observations.
• Test data set contains 10% of the total observations.
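The 70/20/10 split described above can be sketched as follows. The deck does not say how rows were assigned to each part, so the random shuffle, the fixed seed and the helper name `split_dataset` are assumptions; the 100 row indices stand in for patient records.

```python
import random

def split_dataset(rows, train_frac=0.7, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains, about 10%
    return train, valid, test

# Hypothetical row indices standing in for the observations
train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Keeping the test part untouched until the very end is what makes the final check on it meaningful.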
17. Removal of multicollinearity from the training data set
• The maximum VIF is 1.6413 (for insulin), well below the commonly
used cut-offs of 5 or 10, which indicates little multicollinearity.
• The complete VIF values are shown in the Excel sheet.
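To read the VIF figure: VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. Inverting the formula shows how much of insulin's variance the other predictors explain, roughly 39%. The helper name `vif_to_r_squared` is an assumption for this sketch.

```python
def vif_to_r_squared(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the R^2 of the auxiliary regression."""
    return 1.0 - 1.0 / vif

max_vif = 1.6413  # maximum VIF reported on the slide (insulin)
r_squared = vif_to_r_squared(max_vif)
print(f"R^2 implied by VIF {max_vif}: {r_squared:.4f}")
```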
18. Removal of unusual observation
• The plot shows that the maximum Cook's distance is 0.084. It corresponds to
observation 229 in the original data and 385 in the training data set, and was
identified using the R command “which.max(cooks.distance(dbmodel))”. This value is
far below the conventional cut-off of 1, so we need not remove any observation.
19. • The model was tested using the Hosmer-Lemeshow (HL) goodness-of-fit test and was found to fit.
• Two variables, skin thickness and insulin, had high
p-values: 0.482913 and 0.211439 respectively.
• They were removed one at a time, checking the fit of the model
with the HL test after each removal. Insulin was removed first,
then skin thickness.
• The final model contains all the variables except insulin and
skin thickness.
• HL test: X-squared = 7.3947, df = 8, p-value = 0.4947. Since the p-value is
above the conventional 0.05 level, there is no evidence of lack of fit.
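The HL p-value can be checked by hand: for an even number of degrees of freedom, the chi-square upper-tail probability has a closed form. The sketch below applies it to the reported statistic; the function name `chi2_sf_even_df` is an assumption, and the formula is not valid for odd df.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    # exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

p_train = chi2_sf_even_df(7.3947, 8)  # HL statistic reported for the training fit
print(f"p-value: {p_train:.4f}")
```

A large p-value here means the observed and expected counts in the HL groups do not differ by more than chance would explain, which is why it is read as adequate fit rather than as significance.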
22. Validation data
• The intervals (estimate - 3*SE, estimate + 3*SE) of the coefficients
overlap between the training data and the testing data.
• HL test: X-squared = 4.688, df = 8, p-value = 0.7903
• Therefore the model is validated.
• We calculated the threshold value in
Excel and set it at 0.3.
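The overlap check on the estimate ± 3*SE intervals can be sketched as below. The deck does not list the coefficient estimates or standard errors, so every number here is hypothetical, as is the helper name `intervals_overlap`.

```python
def intervals_overlap(est_a, se_a, est_b, se_b, k=3):
    """True if [est - k*SE, est + k*SE] from two fits share at least one point."""
    low_a, high_a = est_a - k * se_a, est_a + k * se_a
    low_b, high_b = est_b - k * se_b, est_b + k * se_b
    return low_a <= high_b and low_b <= high_a

# Hypothetical estimates and standard errors, for illustration only
similar = intervals_overlap(0.035, 0.004, 0.030, 0.006)    # close estimates
different = intervals_overlap(0.035, 0.004, 0.080, 0.005)  # far-apart estimates
print(similar, different)
```

Overlapping intervals for every coefficient suggest that the two data sets support the same model.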
23. Testing data
• The same threshold value was applied to the
testing data, and the results are almost the same
as for the validation data. The model is therefore
considered tested.
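Applying the 0.3 threshold turns each predicted probability into a class label, which can then be compared with the true outcome. In this minimal sketch the probabilities and outcomes are hypothetical, and treating a probability equal to the threshold as positive is an assumption.

```python
def classify(probabilities, threshold=0.3):
    """Label a case 1 (diabetic) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def confusion_counts(actual, predicted):
    """Count true/false positives and negatives."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["FN" if a == 1 else "TN"] += 1
    return counts

# Hypothetical predicted probabilities and true outcomes, for illustration only
probs = [0.10, 0.25, 0.31, 0.45, 0.80, 0.28]
actual = [0, 0, 0, 1, 1, 1]
predicted = classify(probs)
counts = confusion_counts(actual, predicted)
print(predicted, counts)
```

A threshold below 0.5 flags more women as diabetic, trading extra false positives for fewer missed cases.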
24. Conclusion
• The variables that contribute to the prediction of
diabetes in a Pima Indian woman are
1. Number of Pregnancies,
2. Glucose,
3. Blood Pressure,
4. BMI,
5. Diabetes Pedigree Function,
6. Age.
• Insulin and skin thickness did not contribute to the
prediction of diabetes in Pima Indian women.
• Blood pressure has a negative effect on the predicted
probability of diabetes.
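How a negative coefficient affects the prediction can be seen from the logistic formula p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))). The sketch below uses the six retained variables, but the intercept, coefficients and patient values are hypothetical and are NOT the fitted values from this project.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, NOT the fitted values from the deck
intercept = -8.0
coefficients = {"Pregnancies": 0.12, "Glucose": 0.035, "BloodPressure": -0.013,
                "BMI": 0.09, "DiabetesPedigreeFunction": 0.9, "Age": 0.015}
# Hypothetical patient
patient = {"Pregnancies": 3, "Glucose": 120, "BloodPressure": 70,
           "BMI": 32.0, "DiabetesPedigreeFunction": 0.47, "Age": 33}

p_base = predict_probability(intercept, coefficients, patient)
p_higher_bp = predict_probability(intercept, coefficients,
                                  {**patient, "BloodPressure": 90})
print(f"{p_base:.3f} -> {p_higher_bp:.3f}")
```

With a negative blood pressure coefficient, raising blood pressure while holding the other variables fixed lowers the predicted probability.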