A Data-Driven Nomogram for Breast Cancer Survival
Capstone by Lisa Federer, Bret Gaulin, Michal Haskell, Andy Pollack, and Carmen Smiley
The Problem
1 in 8 US women will be diagnosed with
breast cancer during their lifetime.
Image source: http://www.pathophys.org/wp-content/uploads/2012/12/breastcancer-copy.png
Our hypothesis
Certain demographic, diagnostic, and treatment parameters
can reliably predict survival time for women with breast
cancer. Using this knowledge, we could build a “calculator” to
estimate survival time for individuals.
The data
SEER (Surveillance, Epidemiology, and End Results) Program: http://seer.cancer.gov/
Data exploration and variable selection
• Survival time (months)
• Age at diagnosis
• Year of birth
• Race
• Origin (Hispanic recode)
• Stage
• Histology
• Tumor extent
• Number of primary tumors
• Laterality
• ER Status
• PR Status
• Radiation therapy
146 variables in the SEER database, narrowed to 13 variables of interest
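The narrowing step can be sketched in a few lines of Python. This is only an illustration: the column names are hypothetical stand-ins for the 13 variables listed above (the real SEER field names are not given in the slides), and dropping rows with missing values is an assumption, not necessarily what the team did.

```python
# Hypothetical column names for the 13 variables of interest; the actual
# SEER field names differ and are not stated in the slides.
SELECTED = [
    "survival_months", "age_at_diagnosis", "year_of_birth", "race",
    "hispanic_origin", "stage", "histology", "tumor_extent",
    "num_primaries", "laterality", "er_status", "pr_status", "radiation",
]


def select_variables(records, columns=SELECTED):
    """Keep only the chosen columns; drop rows where any of them is missing.

    `records` is an iterable of dicts, one per patient record.
    Dropping incomplete rows is an assumed policy for this sketch.
    """
    kept = []
    for row in records:
        subset = {col: row.get(col) for col in columns}
        if all(value not in (None, "") for value in subset.values()):
            kept.append(subset)
    return kept
```

Each returned dict contains exactly the 13 selected fields, so later modeling code never sees the other SEER variables.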
Classification by survival time
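Treating survival as a classification problem means binning survival time in months into discrete classes. The sketch below shows one way to produce the three classes (0, 1, 2) that appear in the results table. The cutoffs are hypothetical, since the slides do not state the boundaries that were actually used.

```python
from bisect import bisect_right

# Hypothetical cutoffs in months; the real class boundaries are not given.
CUTOFFS = (60, 120)


def survival_class(months, cutoffs=CUTOFFS):
    """Return the index of the interval that `months` falls into.

    With two cutoffs this yields 0 (below the first cutoff),
    1 (between the cutoffs), or 2 (at or above the second cutoff).
    """
    return bisect_right(cutoffs, months)


labels = [survival_class(m) for m in (12, 60, 119, 120, 200)]
# labels is [0, 1, 1, 2, 2]
```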
Classification model results
Model                Class    Precision  Recall  F-score  Support
Logistic Regression  0        0.65       0.43    0.23     146
                     1        0.55       0.37    0.44     416
                     2        0.55       0.76    0.63     471
                     Average  0.56       0.56    0.54     1033
Naïve Bayes          0        0.39       0.16    0.23     162
                     1        0.63       0.30    0.40     402
                     2        0.50       0.84    0.63     466
                     Average  0.54       0.84    0.63     1030
Decision Tree        0        0.30       0.39    0.34     131
                     1        0.76       0.75    0.75     416
                     2        0.80       0.75    0.78     485
                     Average  0.72       0.70    0.71     1032
Random Forest        0        0.58       0.43    0.50     183
                     1        0.73       0.75    0.74     414
                     2        0.71       0.78    0.74     440
                     Average  0.70       0.70    0.70     1037
K Neighbors          0        0.53       0.33    0.41     150
                     1        0.83       0.82    0.82     425
                     2        0.77       0.88    0.82     472
                     Average  0.76       0.77    0.76     1047
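For readers unfamiliar with the columns above, here is a minimal sketch of how per-class precision, recall, F-score, and support are defined, using the standard formulas. The toy labels are made up for illustration and are not the project's data.

```python
def class_report(y_true, y_pred, label):
    """Return (precision, recall, f_score, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    support = tp + fn  # number of true instances of this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score, support


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
report_class_0 = class_report(y_true, y_pred, 0)
report_class_2 = class_report(y_true, y_pred, 2)
```

Because the F-score is the harmonic mean of precision and recall, it always lies between the two.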
Regression and survival models
Cox proportional hazards model
\[ \lambda(t) = b_0(t) \exp(b_1 x_1 + \cdots + b_N x_N) \]
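One consequence of the Cox form is that the baseline hazard cancels when two individuals are compared, so their hazard ratio depends only on the coefficients and covariates. The sketch below illustrates this with made-up coefficients. They are not fitted values from this project.

```python
import math


def relative_hazard(coefs, x):
    """exp(b_1 x_1 + ... + b_N x_N): the hazard relative to the baseline."""
    return math.exp(sum(b * xi for b, xi in zip(coefs, x)))


def hazard_ratio(coefs, x_a, x_b):
    """Hazard of individual A divided by hazard of individual B.

    The baseline hazard appears in both numerator and denominator,
    so it cancels and is not needed here.
    """
    return relative_hazard(coefs, x_a) / relative_hazard(coefs, x_b)


# Hypothetical coefficients for (age in years, binary indicator).
coefs = [0.03, 0.5]
hr = hazard_ratio(coefs, [60, 1], [50, 1])  # equals exp(0.03 * 10)
```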
Aalen’s additive model
\[ \lambda(t) = b_0(t) + b_1(t) x_1 + \cdots + b_N(t) x_N \]
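In the additive model the covariates shift the hazard by time-varying amounts instead of multiplying it. The sketch below evaluates the additive hazard at one time point. The coefficient functions are hypothetical placeholders, since in practice they are estimated from data.

```python
def additive_hazard(t, baseline, coef_funcs, x):
    """Evaluate b_0(t) + b_1(t) x_1 + ... + b_N(t) x_N at time t.

    `baseline` and every entry of `coef_funcs` are callables of t.
    """
    return baseline(t) + sum(b(t) * xi for b, xi in zip(coef_funcs, x))


# Hypothetical coefficient functions, for illustration only.
def baseline(t):
    return 0.01


coef_funcs = [lambda t: 0.001 * t, lambda t: 0.002]
hazard_at_5 = additive_hazard(5.0, baseline, coef_funcs, [1.0, 2.0])
# 0.01 + (0.005 * 1.0) + (0.002 * 2.0), i.e. about 0.019
```

Unlike the Cox model, nothing here constrains the result to be non-negative, which is a known property of additive hazard models.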
Nomogram demonstration
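The slides do not describe how the nomogram was built. As a rough sketch of one common convention, each variable's contribution to a model's linear predictor can be rescaled so that the most influential variable spans 0 to 100 points. All coefficients, ranges, and variable names below are hypothetical.

```python
def nomogram_points(coefs, ranges, x):
    """Map each covariate's contribution onto a shared 0-100 point scale.

    coefs:  dict of variable name -> model coefficient
    ranges: dict of variable name -> (lowest, highest) plausible value
    x:      dict of variable name -> value for one individual
    """
    # The variable with the widest contribution span defines 100 points.
    max_span = max(abs(coefs[k]) * (hi - lo) for k, (lo, hi) in ranges.items())
    points = {}
    for name, b in coefs.items():
        lo, hi = ranges[name]
        # Measure from the end of the range that contributes least.
        start = lo if b >= 0 else hi
        points[name] = 100.0 * b * (x[name] - start) / max_span
    return points


# Hypothetical coefficients and ranges, for illustration only.
coefs = {"age": 0.04, "er_positive": -0.5}
ranges = {"age": (30, 80), "er_positive": (0, 1)}
pts = nomogram_points(coefs, ranges, {"age": 55, "er_positive": 0})
total_points = sum(pts.values())
```

The total is then read against a second scale that converts points into a predicted outcome, such as estimated survival time.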
Limitations
• Not a substitute for medical advice
• Missing or incomplete data in SEER database
Questions?
