LUNG CAPACITY PREDICTOR
CS – 513 : DATA MINING PROJECT
TEAM MEMBERS
Abhishek Srivastava (10412516)
Dhrumin Desai (10412236)
Neeraj Ganvir (10411831)
Kanika Chopra (10410278)
PROJECT OVERVIEW
• Lung capacity is an indicator of a person’s health. Because the lungs supply oxygen to the blood and remove carbon dioxide, a waste gas, from the bloodstream, it is important to keep them healthy. Habits such as smoking not only reduce lung capacity but also damage the lungs and can lead to lung cancer.
• The objective of our project is to draw conclusions from the data provided and to develop prediction models, using data-mining techniques on the contributing attributes, to predict:
  • Whether a person has a “LOW”, “MEDIUM” or “HIGH” lung capacity; and
  • The person’s actual lung capacity
DATA USED
• The dataset comprises 724 observations of the 6 attributes listed below:

Attribute    Description
Age          Age of the person, ranging from 3 to 19 years. The range stops below 20 because experiments indicate that lung capacity increases until about age 20 and then slowly declines with age.
Height       Height of the person, measured in inches
Smoke        Whether the person is a smoker (Yes / No)
Gender       Whether the person is Male or Female
Caesarean    Whether the person was born by caesarean section (Yes / No)
TLC          Total lung capacity of the person

• The data was taken from: https://vincentarelbundock.github.io/Rdatasets/datasets.html
DATA SNAPSHOT
DATA EXPLORATION
SUMMARY/VISUALIZATION FOR DATA COLUMNS
• Age
• Height
• Smoke
• Gender
• Caesarean
• TLC
DERIVED CONCLUSIONS
• It is appropriate to examine the relationship between the 2 numeric predictor variables, Age and Height.
• From the plot we can see an almost linear relationship between Age and Height, i.e. older persons are taller.
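The "almost linear" relationship read off the plot can also be quantified with a correlation coefficient. The following is a minimal sketch, not the team's own code: the function name `pearson_r` and the six (age, height) pairs are hypothetical and are not taken from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (age in years, height in inches) pairs, NOT from the dataset.
ages = [4, 7, 10, 13, 16, 19]
heights = [40.0, 48.5, 55.0, 61.5, 66.0, 68.5]

r = pearson_r(ages, heights)
print(f"r = {r:.3f}")  # a value close to 1 indicates a strong linear trend
```

A value of r near 1 would support the reading of the scatter plot; it does not by itself show that the relationship is exactly linear.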
DERIVED CONCLUSIONS
• Exploring the relationship between Gender and Smoking:
• From the plot we can see that there are more smokers among Females than among Males.
DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and GENDER:
• From the plot we can see that Males have a higher Lung Capacity than Females.
DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and CAESAREAN:
• From the plot we can see that there is not much difference between the Lung Capacities of Caesarean and Non-Caesarean persons.
DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and HEIGHT:
• From the plot we can see that the Lung Capacity of a person increases with Height.
DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and AGE:
• From the plot we can see that the Lung Capacity of a person increases with Age.
DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and SMOKER:
• From the plot we can see that the Lung Capacity of a person is higher if he/she is a Smoker.
• But this is counter-intuitive, so let’s find out why.
DERIVED CONCLUSIONS
• Let’s consider Age along with Smoke as a factor for Lung Capacity.
• We can create 3 Age groups for that:
  • "<10"; "10-15"; ">15"
• Plotting again with these groups, this is what we find:
• Smokers appear to have a higher lung capacity because most smokers are in the older age groups, and in this dataset older persons have a higher lung capacity.
• Therefore, we can conclude that smoking considered alone gives a false impression; only when it is combined with age do we get the real picture.
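The effect described above (an overall comparison that reverses once age is taken into account) can be reproduced with a small grouped average. This is only an illustrative sketch: the records are hypothetical, the helper names `age_group` and `mean_tlc_by` are assumptions, and the group boundaries follow the three age groups named on the slide.

```python
from collections import defaultdict

def age_group(age):
    """Map an age in years to the three groups named on the slide."""
    if age < 10:
        return "<10"
    if age <= 15:
        return "10-15"
    return ">15"

def mean_tlc_by(records, key):
    """Average TLC for each distinct value of key(record)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of TLC, count]
    for rec in records:
        bucket = totals[key(rec)]
        bucket[0] += rec["tlc"]
        bucket[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}

# Hypothetical records, NOT taken from the dataset.
records = [
    {"age": 6, "smoke": "No", "tlc": 4.0},
    {"age": 8, "smoke": "No", "tlc": 5.0},
    {"age": 12, "smoke": "No", "tlc": 8.0},
    {"age": 13, "smoke": "Yes", "tlc": 7.5},
    {"age": 17, "smoke": "No", "tlc": 11.0},
    {"age": 18, "smoke": "Yes", "tlc": 10.0},
]

# Smoking alone: smokers look better, because they are all older.
overall = mean_tlc_by(records, lambda rec: rec["smoke"])
# Smoking within each age group: smokers are lower in every group.
stratified = mean_tlc_by(records, lambda rec: (age_group(rec["age"]), rec["smoke"]))

print(overall)
print(stratified)
```

In this toy data the smokers' overall mean is higher only because no smoker is under 10; within each age group that contains smokers, their mean is lower.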
K-MEANS CLUSTERING
ADDITION OF A NEW CATEGORICAL FIELD
• We introduced a new categorical variable called “Lung Cap” to categorize the actual lung capacities in our dataset into 3 groups, defined as follows:
  • Low: TLC < 6.5
  • Medium: TLC >= 6.5 and TLC < 10
  • High: TLC >= 10
• This was done in order to build classification models on Lung Cap (a factor) rather than on the actual numerical values.
• We chose the ranges for the Lung Cap variable based on the summary statistics of the TLC variable.
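The three thresholds above can be expressed as a small function. This is a sketch of the binning rule as stated on the slide; the function name `lung_cap_group` is an assumption made for illustration.

```python
def lung_cap_group(tlc):
    """Map a numeric TLC value to the Lung Cap category defined on the slide."""
    if tlc < 6.5:
        return "Low"
    if tlc < 10:
        return "Medium"
    return "High"

# The boundary values fall into the upper group, matching ">=" in the definition.
for value in (3.2, 6.5, 9.99, 10.0):
    print(value, lung_cap_group(value))
```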
SPLITTING DATA IN TRAINING & TEST
TRAINING DATA
TEST DATA
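The slides show the training and test sets only as screenshots and do not state the split ratio. The following is a minimal sketch of a random split, assuming an 80/20 ratio and a fixed seed; both values, and the function name `train_test_split`, are assumptions and not the team's documented choices.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 724 observations: row indices only.
rows = list(range(724))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split identical across runs, so the accuracies of different models are measured on the same test rows.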
KNN CLASSIFICATION
• We estimated the optimal value of K for our KNN classification model from the “K vs Error” curve.
• The optimal value of K comes out to be 11, and that is what we used for our model.
• Plot annotation: ERROR is minimum at K = 11
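A "K vs Error" curve of this kind can be produced by evaluating the classifier for several values of K and keeping the one with the lowest error. The sketch below is a plain-Python illustration, not the team's code: the data points are hypothetical, the features are left unscaled for brevity, and the function names are assumptions. One training point is deliberately mislabelled so that K = 1 makes a mistake.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def error_rate(train, holdout, k):
    """Fraction of holdout points that are misclassified for a given k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in holdout)
    return wrong / len(holdout)

def best_k(train, holdout, candidates):
    """Return (k with the lowest error, {k: error}) over the candidate values."""
    errors = {k: error_rate(train, holdout, k) for k in candidates}
    return min(errors, key=errors.get), errors

# Hypothetical ((age, height), group) points, NOT taken from the dataset.
train = [
    ((5, 42), "Low"), ((6, 45), "Low"), ((7, 47), "Low"),
    ((11, 56), "Medium"), ((12, 58), "Medium"), ((13, 60), "Medium"),
    ((17, 67), "High"), ((18, 69), "High"), ((19, 70), "High"),
    ((6, 44), "High"),  # deliberately noisy label
]
holdout = [((6, 44), "Low"), ((12, 57), "Medium"), ((18, 68), "High")]

k_star, errors = best_k(train, holdout, candidates=[1, 3, 5])
print(k_star, errors)
```

With K = 1 the noisy point decides the vote on its own; larger K averages it out, which is the kind of trade-off an error curve makes visible.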
KNN CLASSIFICATION
• We therefore performed KNN classification using K = 11.
• We achieved an accuracy of 80.68% for our KNN model.
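An accuracy figure like the one above is the share of test observations on the diagonal of the confusion matrix. The following sketch shows that calculation on hypothetical labels; the eight actual/predicted values are made up and do not reproduce the 80.68% result.

```python
from collections import Counter

LABELS = ["Low", "Medium", "High"]

def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes, in LABELS order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in LABELS] for a in LABELS]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels, NOT the team's test results.
actual = ["Low", "Low", "Medium", "Medium", "High", "High", "Medium", "Low"]
predicted = ["Low", "Medium", "Medium", "Medium", "High", "Medium", "Medium", "Low"]

matrix = confusion_matrix(actual, predicted)
for label, row in zip(LABELS, matrix):
    print(f"{label:<7}", row)
print(f"accuracy = {accuracy(matrix):.2%}")
```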
KKNN CLASSIFICATION
• We estimated the optimal value of K for our KKNN classification model from the “K vs Error” curve.
• The optimal value of K comes out to be 28, and that is what we used for our model.
• Plot annotation: ERROR is minimum at K = 28
KKNN CLASSIFICATION
• We therefore performed KKNN classification using K = 28.
• We achieved an accuracy of 80.00% for our KKNN model.
CART DECISION TREE
• We achieved an accuracy of 78.62% for our CART Decision Tree model.
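CART-style trees choose each split by comparing an impurity measure before and after the split. The slides do not say which criterion was used, so the sketch below assumes the Gini index, the measure classically associated with CART; the rows, the feature name `height`, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(rows, feature, threshold):
    """Weighted Gini impurity after splitting rows on feature < threshold."""
    left = [label for feats, label in rows if feats[feature] < threshold]
    right = [label for feats, label in rows if feats[feature] >= threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical (features, group) rows, NOT taken from the dataset.
rows = [
    ({"height": 45}, "Low"),
    ({"height": 50}, "Low"),
    ({"height": 60}, "Medium"),
    ({"height": 62}, "Medium"),
]

print(gini([label for _, label in rows]))  # impurity before any split
print(split_gini(rows, "height", 55))      # separates the classes cleanly
print(split_gini(rows, "height", 48))      # leaves one mixed child node
```

The tree builder would prefer the threshold with the lower weighted impurity, here the split at 55.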
C5.0 DECISION TREE
• We achieved an accuracy of 76.55% for our C5.0 Decision Tree model.
NEURAL NETWORK MODEL
FOR PREDICTING LUNG CAPACITY GROUP
• Performance Measure of our Neural Network model for predicting the Lung Capacity Group
NEURAL NETWORK MODEL
FOR PREDICTING ACTUAL LUNG CAPACITY
• Here we could not measure the performance using a confusion matrix, because the target is a continuous lung capacity value with many distinct values rather than a class label.
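For a continuous target, error measures such as RMSE or MAE are commonly used in place of a confusion matrix. The slides do not report either, so the following is only a sketch of how they could be computed; the actual and predicted values are hypothetical.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical TLC values, NOT the team's model output.
actual_tlc = [6.0, 8.0, 10.0]
predicted_tlc = [7.0, 8.0, 8.0]

print(f"RMSE = {rmse(actual_tlc, predicted_tlc):.3f}")
print(f"MAE  = {mae(actual_tlc, predicted_tlc):.3f}")
```

Both values are in the same units as TLC, which makes them easier to interpret than a percentage accuracy for this task.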
CONCLUSION
• We tried different models for classification/prediction on our data; the accuracies achieved are listed below:

Model                 Accuracy (%)
KNN                   80.68
KKNN                  80.00
CART Decision Tree    78.62
C5.0 Decision Tree    76.55
Neural Network        75.17