Lung Capacity Predictor Models

LUNG CAPACITY PREDICTOR
CS-513: DATA MINING PROJECT

TEAM MEMBERS
Abhishek Srivastava (10412516)
Dhrumin Desai (10412236)
Neeraj Ganvir (10411831)
Kanika Chopra (10410278)

PROJECT OVERVIEW
• Lung capacity is an indicator of a person’s health. Because the lungs supply oxygen to the blood and
remove carbon dioxide, a waste gas, from the bloodstream, it is very important to keep them healthy.
Habits such as smoking not only reduce lung capacity but also damage the lungs and can lead to lung cancer.
• The objective of our project is to draw conclusions from the data provided and to develop
prediction models, using data-mining techniques on the contributing attributes, to predict:
• Whether the person has a “LOW”, “MEDIUM” or “HIGH” lung capacity; and
• The actual lung capacity of the person

DATA USED
• The dataset comprises 724 observations of the 6 attributes listed below:
Attribute    Description
Age          Age of the person, ranging from 3 to 19 years, because it has been shown experimentally that a person’s lung capacity increases until age 20 and slowly declines afterwards
Height       Height of the person, measured in inches
Smoke        Whether the person is a smoker (Yes / No)
Gender       Whether the person is Male or Female
Caesarean    Whether the person was born by caesarean section (Yes / No)
TLC          Total lung capacity of the person
• The data was taken from https://vincentarelbundock.github.io/Rdatasets/datasets.html

DATA EXPLORATION
SUMMARY/VISUALIZATION FOR DATA COLUMNS
[Figures: summary and visualization for each column: Age, Height, Smoke, Gender, Caesarean, TLC]

DERIVED CONCLUSIONS
• It is appropriate to start by examining the relationship between the 2 numeric variables, Age and Height.
• The plot shows an almost linear relationship between Age and Height, i.e. older people tend to be taller.

DERIVED CONCLUSIONS
• Exploring the relationship between Gender and Smoking: there are more smokers among Females than among Males.

DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and GENDER: Males have a higher lung capacity than Females.

DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and CAESAREAN: there is not much difference in lung capacity between people born by caesarean section and those who were not.

DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and HEIGHT: the plot shows that a person’s lung capacity increases with Height.

DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and AGE: a person’s lung capacity increases with Age.

DERIVED CONCLUSIONS
• Examining the relationship between LUNG CAPACITY and SMOKER: lung capacity appears higher for people who smoke.
• This is counter-intuitive, so let’s find out why.

DERIVED CONCLUSIONS
• Let’s consider Age along with Smoke as a factor for lung capacity.
• We create 3 age groups for that: "<10", "10-15" and ">15".
• Plotting lung capacity against Smoke again, this time within each age group, explains the result.
• Smokers appear to have a higher lung capacity because most of the smokers are in the older age groups, and we know that older people in this dataset have a higher lung capacity.
• Therefore, smoking considered alone gives a misleading picture; only when it is combined with age do we get the real picture.
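
The stratified comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records are made-up toy values (not rows from the project dataset), the field names `age`, `smoke` and `tlc` are hypothetical, and the "10-15" group is assumed to include both 10 and 15.

```python
from collections import defaultdict
from statistics import mean


def age_group(age):
    """Map an age to one of the three groups used on the slide.
    Assumption: the 10-15 group includes both end points."""
    if age < 10:
        return "<10"
    if age <= 15:
        return "10-15"
    return ">15"


def mean_tlc_by(records, key):
    """Average TLC of the records, grouped by key(record)."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r["tlc"])
    return {k: mean(v) for k, v in groups.items()}


# Hypothetical toy records, invented only to show the effect.
records = [
    {"age": 8, "smoke": "no", "tlc": 5.0},
    {"age": 9, "smoke": "no", "tlc": 5.5},
    {"age": 12, "smoke": "no", "tlc": 8.0},
    {"age": 13, "smoke": "yes", "tlc": 7.5},
    {"age": 17, "smoke": "no", "tlc": 11.0},
    {"age": 17, "smoke": "yes", "tlc": 10.0},
    {"age": 18, "smoke": "yes", "tlc": 10.5},
]

# Smoke alone: smokers look better because they are mostly older.
overall = mean_tlc_by(records, lambda r: r["smoke"])

# Smoke within each age group: the comparison can reverse.
stratified = mean_tlc_by(records, lambda r: (age_group(r["age"]), r["smoke"]))

for group, value in sorted(stratified.items()):
    print(group, round(value, 2))
```

In this toy data the overall mean is higher for smokers, while within each age group that contains both smokers and non-smokers the smokers' mean is lower.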

ADDITION OF A NEW CATEGORICAL FIELD
• We introduced a new categorical variable called “Lung Cap” to categorize the actual lung capacities
given in our dataset into 3 groups, defined as follows:
• Low: TLC < 6.5
• Medium: TLC >= 6.5 and < 10
• High: TLC >= 10
• This has been done in order to build classification models based on Lung Cap (factor) and not the actual
numerical values.
• We decided the ranges for the Lung Cap variable based on the summary statistics of the TLC variable.
[Figure: summary of the TLC variable]
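
The Low / Medium / High thresholds defined above can be written as a small function. This is a minimal sketch; the function name `lung_cap_group` is hypothetical and only the cut-offs 6.5 and 10 come from the slide.

```python
def lung_cap_group(tlc):
    """Map a numeric TLC value to the Lung Cap category.

    Low:    TLC < 6.5
    Medium: 6.5 <= TLC < 10
    High:   TLC >= 10
    """
    if tlc < 6.5:
        return "Low"
    if tlc < 10:
        return "Medium"
    return "High"


# Boundary values fall into the upper group, matching the ">=" in the definition.
for value in (6.4, 6.5, 9.9, 10.0):
    print(value, lung_cap_group(value))
```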

SPLITTING DATA IN TRAINING & TEST
[Figures: training data and test data]
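
A random train/test split can be sketched as follows. The slides do not state the split ratio or the seed, so the 80/20 ratio, the seed value and the function name `split_train_test` are all assumptions made for illustration; the row ids stand in for the 724 observations.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into (train, test).

    Assumptions: test_fraction=0.2 and seed=42 are illustrative defaults,
    not values taken from the project.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 724 observations: just their row indices.
rows = list(range(724))
train, test = split_train_test(rows)
print(len(train), len(test))
```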

KNN CLASSIFICATION
• We estimated the optimal value of K for our KNN classification model from the “K vs. Error” curve.
• The error is minimum at K = 11, so that is the value we used for our model.
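
The idea behind a "K vs. Error" curve can be sketched with a plain majority-vote KNN. This is an illustration only: the toy points, the candidate K values and the helper names are hypothetical, the features are left unscaled for brevity, and the slides do not say which data the error was measured on, so the evaluation set here is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def error_rate(train_X, train_y, eval_X, eval_y, k):
    """Fraction of evaluation points that are misclassified for a given k."""
    wrong = sum(
        knn_predict(train_X, train_y, x, k) != y for x, y in zip(eval_X, eval_y)
    )
    return wrong / len(eval_y)


def k_vs_error(train_X, train_y, eval_X, eval_y, candidates):
    """Return (best_k, errors) where errors maps each candidate k to its error."""
    errors = {k: error_rate(train_X, train_y, eval_X, eval_y, k) for k in candidates}
    best_k = min(errors, key=errors.get)  # smallest error, first k on ties
    return best_k, errors


# Hypothetical toy data: (age, height) pairs with a Lung Cap label.
train_X = [(5, 45), (6, 48), (7, 50), (11, 58), (12, 60), (13, 62),
           (17, 68), (18, 70), (19, 72)]
train_y = ["Low"] * 3 + ["Medium"] * 3 + ["High"] * 3
eval_X = [(6, 47), (12, 61), (18, 69)]
eval_y = ["Low", "Medium", "High"]

best_k, errors = k_vs_error(train_X, train_y, eval_X, eval_y, [1, 3, 5])
print(best_k, errors)
```

Plotting `errors` against K gives the curve; the K with the lowest error is the one carried into the final model.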

KNN CLASSIFICATION
• We therefore performed KNN classification using K = 11.
• We achieved an accuracy of 80.68% for our KNN model.

KKNN CLASSIFICATION
• We estimated the optimal value of K for our KKNN classification model from the “K vs. Error” curve.
• The error is minimum at K = 28, so that is the value we used for our model.

KKNN CLASSIFICATION
• We therefore performed KKNN classification using K = 28.
• We achieved an accuracy of 80.00% for our KKNN model.
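
KKNN (weighted k-nearest neighbours) differs from plain KNN in that closer neighbours count for more in the vote. The sketch below is a simplified illustration of that idea, not the implementation used in the project: the triangular kernel, the normalisation by the distance to the (k+1)-th neighbour, the function name and the toy data are all assumptions.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_X, train_y, x, k):
    """Kernel-weighted vote among the k nearest neighbours of x.

    Assumption: distances are divided by the distance to the (k+1)-th
    neighbour and passed through a triangular kernel, weight = 1 - d.
    """
    dists = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    # Use the (k+1)-th neighbour as the scale, or the farthest point if absent.
    scale = dists[k][0] if len(dists) > k else dists[-1][0]
    scale = scale or 1.0  # guard against division by zero
    scores = defaultdict(float)
    for d, label in dists[:k]:
        scores[label] += max(0.0, 1.0 - d / scale)
    return max(scores, key=scores.get)


# Hypothetical 2-D toy points: two "Low" points close to the query,
# three "Medium" points farther away.
train_X = [(0, 0), (1, 0), (4, 0), (5, 0), (6, 0)]
train_y = ["Low", "Low", "Medium", "Medium", "Medium"]

# With k=4 an unweighted vote would be tied 2-2; the weights break the tie
# in favour of the closer "Low" points.
print(weighted_knn_predict(train_X, train_y, (1.5, 0), 4))
```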

CART DECISION TREE
• We achieved an accuracy of 78.62% for our CART Decision Tree model.

C5.0 DECISION TREE
• We achieved an accuracy of 76.55% for our C5.0 Decision Tree model.

NEURAL NETWORK MODEL
FOR PREDICTING LUNG CAPACITY GROUP

• Performance Measure of our Neural Network model for predicting the Lung Capacity Group

FOR PREDICTING ACTUAL LUNG CAPACITY

• Here we could not measure performance with a confusion matrix, because the actual lung capacity takes distinct numeric values rather than class labels.
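
For a numeric target, error measures such as RMSE or MAE are commonly used in place of a confusion matrix. The slides do not say which measure, if any, was used for this model, so the sketch below is only an assumption about how such an evaluation could look; the values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted TLC values, for illustration only.
actual = [6.0, 8.5, 10.0, 11.5]
predicted = [6.5, 8.0, 10.0, 12.5]
print(rmse(actual, predicted), mae(actual, predicted))
```

Both measures are in the same units as TLC, so lower values mean predictions closer to the actual lung capacity.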

CONCLUSION
• We tried different models for classification/prediction on our data; the accuracies achieved are listed below:
Model                 Accuracy (%)
KNN                   80.68
KKNN                  80.00
CART Decision Tree    78.62
C5.0 Decision Tree    76.55
Neural Network        75.17
