3. PROJECT OVERVIEW
• Lung capacity is an indicator of a person’s health. Because the lungs supply oxygen to the blood and
remove carbon dioxide, a waste gas, from the bloodstream, it is very important to keep them healthy.
Habits such as smoking not only reduce lung capacity but also damage the lungs and can lead to lung
cancer.
• The objective of our project is to draw conclusions from the data provided and to develop prediction
models, using data-mining techniques on the different contributing attributes, to predict:
• Whether the person has a “LOW”, “MEDIUM” or “HIGH” Lung Capacity; and
• The actual Lung Capacity of the person
4. DATA USED
• The dataset comprises 724 observations of the 6 attributes listed below –
Attribute    Description
Age          Age of the person, ranging from 3 to 19 years. This range is used because it has been
             shown experimentally that a person’s lung capacity increases until the age of 20 and
             then slowly declines as the person grows older.
Height       Height of the person, measured in inches
Smoke        Whether the person is a smoker - Yes / No
Gender       Whether the person is Male or Female
Caesarean    Whether the person was born by Caesarean section - Yes / No
TLC          Total lung capacity of the person
• The data has been taken from - https://vincentarelbundock.github.io/Rdatasets/datasets.html
9. DERIVED CONCLUSIONS
• It is appropriate to examine the relationship between the 2 numeric variables – Age and Height:
• From the plot we can see that there is an almost linear relationship between Age and Height, i.e. older persons tend to be taller.
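One way to put a number on the "almost linear" relationship read off the plot is the Pearson correlation coefficient. The sketch below is only an illustration: the function name `pearson_r` and the sample values are assumptions made up for this example and are not taken from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up illustrative values, NOT rows from the project's dataset.
ages = [4, 7, 10, 13, 16, 19]                      # years
heights = [40.0, 48.5, 55.0, 61.0, 66.5, 69.0]     # inches
r = pearson_r(ages, heights)
print(f"Pearson r between Age and Height: {r:.3f}")
```

A value of r close to 1 would be consistent with the near-linear trend seen in the plot; it does not by itself prove that the relationship is linear.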
10. DERIVED CONCLUSIONS
• Exploring the relationship between Gender and Smoking:
• From the plot we can see that more Females than Males are Smokers.
11. DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and GENDER:
• From the plot we can see that Males have a higher Lung Capacity than Females.
12. DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and CAESAREAN:
• From the plot we can see that there is not much difference between the Lung Capacities of persons born by Caesarean section and those who were not.
13. DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and HEIGHT:
• From the plot we can see that the Lung Capacity of a person increases with Height.
14. DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and AGE:
• From the plot we can see that the Lung Capacity of a person increases with Age.
15. DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and SMOKER:
• From the plot we can see that the Lung Capacity of a person is higher if he/she is a Smoker.
• But this is counter-intuitive, so let’s find out why.
16. DERIVED CONCLUSIONS
• Let’s consider Age along with Smoke as a factor for Lung Capacity.
• We can create 3 Age groups for that:
• “<10”; “10-15”; “>15”
• If we plot it again, this is what we find –
• Smokers appear to have a higher Lung Capacity because most of the Smokers are in the older age groups, and we know that older persons have a higher Lung Capacity.
• Therefore, we can conclude that Smoke considered alone gives a misleading picture; only when it is combined with Age do we get the real picture.
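The effect described above, where an overall comparison is distorted because Smokers are concentrated in the older age groups, can be sketched as follows. The rows are made-up values chosen only to show the mechanism and are not the project's data; the helper names are hypothetical, and placing ages 10 and 15 inside the “10-15” group is an assumption, since the slide does not state the boundaries.

```python
from collections import defaultdict

def age_group(age):
    """Map an age in years to one of the three groups used on the slide.
    Boundary handling (10 and 15 fall in "10-15") is an assumption."""
    if age < 10:
        return "<10"
    if age <= 15:
        return "10-15"
    return ">15"

def mean_capacity(rows, by_age_group):
    """Mean lung capacity per Smoke value, optionally split by age group."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for age, smoke, capacity in rows:
        key = (age_group(age), smoke) if by_age_group else smoke
        sums[key][0] += capacity
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

# Made-up rows (age, smoke, lung capacity), NOT taken from the dataset.
rows = [
    (6, "no", 4.0), (7, "no", 4.5), (8, "no", 5.0), (9, "no", 5.5),
    (12, "no", 8.0), (13, "no", 8.5), (13, "yes", 7.8),
    (17, "no", 11.0), (16, "yes", 10.2), (18, "yes", 10.8),
]

overall = mean_capacity(rows, by_age_group=False)
stratified = mean_capacity(rows, by_age_group=True)
print("Ignoring age:", overall)
print("Within age groups:", stratified)
```

In this toy data the smokers have the higher mean overall, yet within each age group that contains both smokers and non-smokers the comparison reverses. Whether the same reversal holds in the real dataset has to be read from the project's own plot.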
18. ADDITION OF A NEW CATEGORICAL FIELD
• We introduced a new Categorical variable called “Lung Cap” to categorize the actual lung capacities
given in our dataset into 3 groups, defined as follows –
Low - TLC < 6.5
Medium - TLC >= 6.5 and < 10
High - TLC >= 10
• This has been done in order to build classification models based on Lung Cap (factor) and not the actual
numerical values.
• We decided the ranges for the Lung Cap variable based on the summary details of the TLC variable,
which is as follows –
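Separately from the TLC summary, the three-way binning rule defined on this slide can be sketched in Python as below. The thresholds 6.5 and 10 come from the slide; the function name `lung_cap_category` is hypothetical.

```python
def lung_cap_category(tlc):
    """Map a total lung capacity (TLC) value to the "Lung Cap" class.

    Low:    TLC < 6.5
    Medium: 6.5 <= TLC < 10
    High:   TLC >= 10
    """
    if tlc < 6.5:
        return "Low"
    if tlc < 10:
        return "Medium"
    return "High"

# Arbitrary example values, not rows from the dataset.
for value in (3.2, 6.5, 9.99, 10.0):
    print(value, "->", lung_cap_category(value))
```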
20. KNN CLASSIFICATION
• Estimating the optimal value of K for our KNN classification model, based on the “K vs. Error” curve.
• The optimal value of K comes out to be 11, and that is what we have used for our model.
(Plot annotation: ERROR is minimum at K = 11)
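The idea behind a “K vs. Error” curve can be sketched in plain Python: evaluate the classifier for several values of K on held-out data and keep the K with the lowest error. This is a minimal illustration, not the code used in the project. The toy points, the candidate list, the helper names and the tie-breaking rule (smallest K wins) are all assumptions, and the features are left unscaled for brevity.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def error_rate(train, validation, k):
    """Fraction of validation points that are misclassified for this k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in validation)
    return wrong / len(validation)

def best_k(train, validation, candidates):
    """Return (k with the lowest error, {k: error}); ties go to the smaller k."""
    errors = {k: error_rate(train, validation, k) for k in candidates}
    chosen = min(errors, key=lambda k: (errors[k], k))
    return chosen, errors

# Made-up (age, height) -> class points, NOT taken from the dataset.
train = [
    ((5, 42), "Low"), ((6, 45), "Low"), ((7, 47), "Low"),
    ((11, 56), "Medium"), ((12, 58), "Medium"), ((13, 60), "Medium"),
    ((17, 68), "High"), ((18, 70), "High"), ((19, 71), "High"),
]
validation = [
    ((6, 44), "Low"), ((12, 57), "Medium"),
    ((18, 69), "High"), ((9, 52), "Medium"),
]

chosen, errors = best_k(train, validation, candidates=[1, 3, 5, 7])
for k, err in errors.items():
    print(f"K = {k}: error = {err:.2f}")
print("K with minimum error:", chosen)
```

Plotting `errors` against K gives the kind of curve shown on the slide; on the real data the minimum was reported at K = 11.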
21. KNN CLASSIFICATION
• We therefore performed KNN classification using K = 11.
• We achieved an accuracy of 80.68% with our KNN model.
22. KKNN CLASSIFICATION
• Estimating the optimal value of K for our KKNN classification model, based on the “K vs. Error” curve.
• The optimal value of K comes out to be 28, and that is what we have used for our model.
(Plot annotation: ERROR is minimum at K = 28)
23. KKNN CLASSIFICATION
• We therefore performed KKNN classification using K = 28.
• We achieved an accuracy of 80.00% with our KKNN model.
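Assuming that “KKNN” here refers to weighted k-nearest neighbours, as implemented in the R package kknn, the main difference from plain KNN is that closer neighbours get a larger say in the vote. The sketch below uses inverse-distance weights as one possible weighting; the slides do not say which kernel was used, so this choice, the function name and the toy points are all assumptions.

```python
from collections import defaultdict
from math import dist

def weighted_knn_predict(train, query, k, eps=1e-9):
    """Distance-weighted vote among the k nearest training points.

    Each neighbour contributes 1 / (distance + eps) to its class score;
    `eps` only guards against division by zero for exact matches.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    scores = defaultdict(float)
    for features, label in neighbours:
        scores[label] += 1.0 / (dist(features, query) + eps)
    return max(scores, key=scores.get)

# Made-up one-dimensional points, NOT taken from the dataset.
train = [((0.0,), "A"), ((4.0,), "B"), ((5.0,), "B")]
prediction = weighted_knn_predict(train, (1.0,), k=3)
print("Weighted vote:", prediction)
```

With k = 3 an unweighted majority vote would pick "B" (two of the three neighbours), while the weighted vote favours the single, much closer "A" point.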
32. NEURAL NETWORK MODEL FOR PREDICTING ACTUAL LUNG CAPACITY
• Here we could not measure performance using a Confusion Matrix, because the target is the actual (numeric) lung capacity, whose values are distinct rather than a small set of classes.
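When the target is a numeric value, prediction quality is commonly summarised with error measures such as the mean absolute error (MAE) or the root mean squared error (RMSE) instead of a confusion matrix. The sketch below shows how these are computed. It is an illustration only: the slides do not report such metrics, and the sample numbers are made up.

```python
from math import sqrt

def regression_errors(actual, predicted):
    """Return (MAE, RMSE) for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmse

# Made-up lung capacity values, NOT model output from the project.
actual = [6.0, 8.0, 10.0, 12.0]
predicted = [6.5, 7.5, 10.0, 13.0]
mae, rmse = regression_errors(actual, predicted)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Both values are in the same units as the lung capacity itself, which makes them easy to compare against the range of TLC in the data.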
33. CONCLUSION
• We tried different models for classification/prediction on our data; the performance achieved by each is listed below.
Model                 Accuracy (%)
KNN                   80.68
KKNN                  80.00
CART DECISION TREE    78.62
C5.0                  76.55
Neural Network        75.17