2. INTRODUCTION
• Lung cancer is the leading cause of cancer death in the US, with an estimated
234,030 new cases and 154,050 deaths in 2018.
• Early detection using low-dose computed tomography (CT) screening of high-risk
individuals can reduce lung cancer mortality by 20%.
• The current CT screening criteria are adults aged 55-77 who currently smoke and
have a 30 pack-year smoking history, but these simple criteria are relatively ineffective.
• Many studies suggest that lung cancer risk prediction models could lead to more
effective screening programs than the current screening criteria.
3. PROJECT PURPOSE
• Develop two risk prediction models for lung cancer using classification
algorithms in R:
Decision Tree – Classification and Regression Tree (CART)
Neural Network – Artificial Neural Network (ANN)
• Select the better model based on performance metrics.
• Identify the major risk factors associated with lung cancer.
4. DATA DESCRIPTION
• The data are a subset of the National Lung Screening Trial cohort.
• 1,000 randomized participants
• 22 attributes are potential risk factors and symptoms of lung cancer.
• Each observation has one of 3 possible classes: Low, Medium, High

Variable                   Type                Range
Patient ID                 Character
Age                        Numeric             14-73
Gender                     Binary              1-2
Smoking                    Numeric             1-8
Passive Smoking            Numeric             1-8
Air Pollution              Numeric             1-8
Occupational Hazards       Numeric             1-8
Genetic Risk               Numeric             1-7
Alcohol Use                Numeric             1-7
Chronic Lung Disease       Numeric             1-7
Dust Allergy               Numeric             1-7
Diet Balance               Numeric             1-7
Chest Pain                 Numeric             1-9
Short Breath               Numeric             1-9
Fatigue                    Numeric             1-9
Bloody Coughing            Numeric             1-9
Wheezing                   Numeric             1-7
Swallowing Difficulty      Numeric             1-7
Clubbing of finger nails   Numeric             1-7
Weight Loss                Numeric             1-7
Frequent Cold              Numeric             1-7
Dry Cough                  Numeric             1-7
Clubbing of finger nails   Numeric             1-9
Levels                     Character/Binary    High, Medium, Low
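
The evaluation reports metrics for the High level only, which suggests the three-class outcome was treated as a High-versus-rest problem. A minimal sketch of that relabeling follows, written in Python for illustration (the project itself used R). The function name and the sample labels are hypothetical, and the source does not state exactly how the classes were binarized.

```python
def binarize_level(levels, positive="High"):
    """Map each class label to 1 (positive class) or 0 (any other class)."""
    return [1 if level == positive else 0 for level in levels]


# Hypothetical labels for illustration only; not taken from the data set.
sample_levels = ["Low", "High", "Medium", "High", "Low"]
binary_target = binarize_level(sample_levels)
print(binary_target)
```

Any model trained on such a target predicts "High" versus "not High", which is the setting in which sensitivity and specificity are defined.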
10. MODEL EVALUATION
Model                         Accuracy   Sensitivity   Specificity   Precision   ROC Area
Decision Tree (High Level)    .9832      .9541         1             1           .9721
Neural Network (High Level)   .9899      1             .9841         .9732       .9636
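
The first four columns are standard functions of a binary confusion matrix. The sketch below, in Python for illustration (the project itself used R), shows how they are computed. The function name is hypothetical, and the counts are an assumption: they were chosen so that the formulas reproduce the Decision Tree row, since the source does not report the confusion matrix itself.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }


# Hypothetical counts (assumption): picked to match the Decision Tree row;
# the actual confusion matrix is not given in the source.
tree = classification_metrics(tp=104, fp=0, tn=189, fn=5)
for name, value in tree.items():
    print(f"{name}: {value:.4f}")
```

ROC area is not derivable from a single confusion matrix, because it depends on the model's scores across all thresholds, so it is left out of the sketch.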
11. DISCUSSION
• In medical testing, a false negative is more dangerous than a false positive, so the final risk prediction
model is the Artificial Neural Network, which has 100% sensitivity (0% false negatives) compared with the
Decision Tree's 95.41% sensitivity (4.59% false negatives).
• Based on the variable importance results, the most significant risk factors for lung cancer are Air Pollution,
Age, Smoking, Passive Smoking, and Alcohol Use.
• Future improvements:
Improve model performance by fine-tuning the model parameters.
Reduce the number of input features to prevent overfitting.
Increase the amount of input data for better model performance.
Compare additional classification algorithms (e.g., Support Vector Machine, Random Forest) for better model selection.
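
The selection rule used in this discussion (prefer the model that misses the fewest true cases) can be sketched as follows, in Python for illustration. The sensitivity values are the ones reported in the Model Evaluation table; the function and variable names are hypothetical.

```python
# Sensitivity values as reported in the Model Evaluation table.
reported_sensitivity = {
    "Decision Tree": 0.9541,
    "Neural Network": 1.0,
}


def false_negative_rate(sensitivity):
    """Share of true cases the model misses: FNR = 1 - sensitivity."""
    return 1.0 - sensitivity


def pick_most_sensitive(models):
    """Return the name of the model with the highest sensitivity."""
    return max(models, key=models.get)


for name, sens in reported_sensitivity.items():
    print(f"{name}: false negative rate = {false_negative_rate(sens):.2%}")
print("Selected:", pick_most_sensitive(reported_sensitivity))
```

Ranking on sensitivity alone ignores the cost of false positives; a screening program that also wanted to limit unnecessary follow-up scans would need to weigh specificity or precision as well.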
12. CONCLUSION
• The project developed a risk prediction model for lung cancer and identified the top
5 risk factors associated with lung cancer using classification methods from R packages.
• Using risk prediction models to select high-risk individuals for lung cancer screening
would be superior to the current selection criteria.
• Avoiding the major risk factors may help prevent lung cancer and lower its risk.
• The results are promising for the application of lung cancer risk prediction models
to selective screening.
13. REFERENCES
• American Lung Association, http://www.lung.org
• National Lung Screening Trial, https://www.cancer.gov/types/lung/research/nlst
• Fitting a neural network in R, https://www.r-bloggers.com
• Classification And Regression Trees for Machine Learning, https://machinelearningmastery.com
• Machine Learning in Medicine, Rahul C. Deo, Circulation. 2015;132:1920-1930, November 16, 2015
• Evaluation of Classification Model Accuracy: Essentials, http://www.sthda.com/english/articles
• Cross-Validation for Predictive Analytics using R, http://www.milanor.net/blog/cross-validation-for-predictive-analytics-using-r/
• Ideas on interpreting machine learning, Patrick Hall, Wen Phan, SriSatish Ambati, March 15, 2017
• R packages, https://cran.r-project.org/web/packages