Hacking Global Health London 2016

Hackathon on data from the Healthy Birth, Growth and Development knowledge integration (HBGDki) initiative of the Bill and Melinda Gates Foundation.

The analysis uses the R package caret for data preparation and modelling, and includes an example of trajectory clustering.

  1. Hacking Global Health: lessons learned from an Open Data Science Hackathon
     Giovanni M. Dall’Olio
  2. Background: the HBGDki initiative (Bill and Melinda Gates Foundation)
  3. The HBGDki data
     Objective of HBGDki:
     • Understand which factors affect child development
     Variables in the full dataset (curated from 122 studies):
     • Motor, cognitive, and language development
     • Environment, socioeconomic status
     • Parents’ reasoning skills and depressive symptoms
     • Infant temperament, breastfeeding, micronutrients, growth velocity, HAZ (height-for-age z-score), enteric infections
  4. Observations on HBGDki data
     Bias towards US studies
     • 90% of the data comes from US studies
     • US data may be collected more systematically or with better tools
     Data collected from several sources
     • Inconsistent data (different procedures used), although manually curated
     • Incomplete data
     Future plans
     • HBGDki plans to use insights from the current dataset to launch a global data collection study
     • The scope of the Hackathon is to see which types of analysis can be done and where efforts should be concentrated
  5. The Hackathon Challenge
     Predicting weight at birth, given ultrasound measurements
     • Predicting birth weight during pregnancy makes it possible to detect underweight babies and act in advance
     • Birth weight can be predicted from ultrasound measurements
     • Current methods are relatively good; the objective of the Hackathon is to improve them
  6. The Hackathon data
     Data size
     • 17,370 ultrasound scans from 2,525 samples collected from two studies
     Variables
     • GAGEDAYS: age of the foetus in days at the time of the ultrasound
     • SUBJID, STUDYID, SEX: subject and study id, sex of the baby
     • WTKG: predicted weight at birth, using best method in x
     • BWT_40: predicted weight at birth, using best method in literature
     • PARITY, GRAVIDA: number of times the mother has been pregnant before
     • Ultrasound measurements: ABCIRCM (abdominal circumference), BPDCM (biparietal diameter), FEMURCM (femur length), HCIRCM (head circumference)
  7. Exploratory 1: how much data there is, and how it is distributed
     [Figures: number of ultrasounds per subject; distribution of ultrasound measurements]
  8. Centering, scaling, and imputing data with caret
     library(caret)
     preProcess(., method=c("center", "scale", "knnImpute", "YeoJohnson"))
     • The caret library in R can be used to center and scale the data, apply a Yeo-Johnson transform to normalize it, and impute missing values
     [Figures: before transform; after transform]
  9. Exploratory 2: Correlation between variables
     • The ggpairs function from GGally creates pair plots quickly
  10. Correlation between variables, grouped by Study
  11. Exploratory 3: Differences between Studies
      • One group plotted PARITY (number of pregnancies) by Study
      • From the different distributions they hypothesized that Study 1 was from a high-income country and Study 2 from a medium-to-low-income country
      [Figure panels: Study 1; Study 2]
  12. A PCA of the four ultrasound measurements confirms they are highly correlated
      • We can merge these four variables into a single principal component, losing <1% of the variance
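The PCA step on slide 12 can be sketched in base R with prcomp. This is a minimal illustration on simulated data, not the HBGDki data: the four columns borrow the slide's variable names, but the values, the sample size and the noise level are assumptions chosen so that the columns are strongly correlated.

```r
# Simulated stand-in for the four ultrasound measurements (NOT the hackathon data):
# each column is a shared "size" signal plus a little independent noise.
set.seed(1)
n <- 500
size <- rnorm(n)
us <- data.frame(
  ABCIRCM = size + rnorm(n, sd = 0.05),
  BPDCM   = size + rnorm(n, sd = 0.05),
  FEMURCM = size + rnorm(n, sd = 0.05),
  HCIRCM  = size + rnorm(n, sd = 0.05)
)

# PCA on centered and scaled columns
pca <- prcomp(us, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores on the first principal component: one value per simulated scan
pc1 <- pca$x[, "PC1"]
```

When the first entry of var_explained is close to 1, as it is for these simulated columns, pc1 can replace the four measurements in later steps with little loss of variance.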
  13. My plan: trajectory clustering
      • Use trajectory clustering to classify growth trajectories into different groups. For example, one group of individuals may grow slower or faster than the others, or follow a differently shaped trajectory
      • Use non-ultrasound variables to characterize the different trajectory groups, e.g. does male sex increase the odds of being in a fast-growing group?
      • The plot on the right shows an example analysis
  14. Trajectory Clustering on PC1 of Ultrasound measurements
      cluster   n
      1         1
      2         12
      3         5
      4         578
      • Unfortunately, trajectory clustering of the data doesn’t show much
      • Almost all samples (578) follow the same trajectory
      • A cluster of 12 samples (cluster 2) follows a slightly faster growth trajectory than the others
  15. Characterizing Cluster 2
      • Cluster 2 contains 12 babies that grow slightly faster than the other groups
      • We can use a binomial (logistic) regression on other variables (sex, study id, parity) to determine whether they increase the odds of belonging to cluster 2
      • Results are not exciting, but they at least indicate a possible new direction of analysis when new data is available
      Logistic regression: odds of belonging to cluster 2 given Sex, Study ID and Parity
      Coefficients   Estimate   Std. Error   z-value   Pr(>|z|)
      (Intercept)      9.2496     729.0359     0.013   0.989877
      SEXMale          0.6564       0.2685     2.444   0.014517 *
      STUDYID        -14.2373     729.0359    -0.02    0.984419
      PARITY           0.508        0.133      3.82    0.000134 ***
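A regression of the kind shown on slide 15 can be sketched with glm from base R. The data below is simulated and the effect sizes are assumptions; the column in_cluster is a hypothetical 0/1 flag standing in for "belongs to cluster 2", and STUDYID is left out to keep the sketch short.

```r
# Simulated stand-in data (NOT the hackathon data)
set.seed(2)
n <- 600
d <- data.frame(
  SEX    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  PARITY = rpois(n, lambda = 1)
)

# Hypothetical cluster membership, generated with assumed effects on the log-odds scale
log_odds <- -2 + 0.7 * (d$SEX == "Male") + 0.5 * d$PARITY
d$in_cluster <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: log-odds of membership given sex and parity
fit <- glm(in_cluster ~ SEX + PARITY, data = d, family = binomial)

# Same columns as the table on the slide: Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- summary(fit)$coefficients

# Exponentiated coefficients read as odds ratios
odds_ratios <- exp(coef(fit))
```

An odds ratio above 1 for SEXMale would mean higher odds of membership for males. A very large standard error, like the one reported for STUDYID on the slide, is the pattern usually seen under (quasi-)complete separation, for example if every cluster 2 baby came from one study; the transcript does not say whether that is the case here.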
  16. Modeling with caret
      • The caret library is an interface to several R packages for modelling, clustering and regression
      • The train function can be used to:
        • Preprocess the data (centering, scaling, normalization)
        • Fit a model or regression
        • Do resampling and cross-validation
        • Select the best fit based on a metric
      ctrl <- trainControl(method="boot", number=10)
      train(BWT_40 ~ ., data=., method="gbm", trControl=ctrl, preProcess=c("center", "scale"), verbose=FALSE)
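The resampling idea behind method="boot" can be sketched without caret. The example below is an illustration of bootstrap resampling in base R, not caret's implementation: it swaps in lm for gbm, uses simulated data with assumed coefficients, and scores each fit on the rows left out of its bootstrap sample.

```r
# Simulated stand-in data (NOT the hackathon data); coefficients are assumptions
set.seed(3)
n <- 300
d <- data.frame(
  ABCIRCM  = rnorm(n, mean = 30, sd = 3),
  GAGEDAYS = rnorm(n, mean = 240, sd = 20)
)
d$BWT_40 <- 0.08 * d$ABCIRCM + 0.004 * d$GAGEDAYS + rnorm(n, sd = 0.3)

n_boot <- 10                 # number of bootstrap resamples
rmse <- numeric(n_boot)

for (b in seq_len(n_boot)) {
  idx <- sample(n, replace = TRUE)       # rows drawn with replacement
  oob <- setdiff(seq_len(n), idx)        # rows never drawn: held-out set
  fit <- lm(BWT_40 ~ ., data = d[idx, ]) # linear model as a stand-in for gbm
  pred <- predict(fit, newdata = d[oob, ])
  rmse[b] <- sqrt(mean((d$BWT_40[oob] - pred)^2))
}

# Resampled estimate of prediction error
mean_rmse <- mean(rmse)
```

Averaging the error over resamples gives a less optimistic estimate than scoring the model on the rows it was trained on, which is why a resampled RMSE is the figure worth comparing between models.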
  17. Generalized boosting regression on ultrasound data
      var        rel.inf
      ABCIRCM    42.3102187
      GAGEDAYS   34.7568922
      FEMURCM     7.0196893
      SEXMale     6.5910654
      BPDCM       4.7765837
      HCIRCM      2.6879421
      PARITY      1.5100042
      STUDYID     0.3476046
      • 25 resamplings
      • Data centered, scaled, knnImputed with caret
      • RMSE 0.294
  18. Focusing the model on weeks 15-25 slightly improves performance
      • 25 resamplings
      • Data centered, scaled, knnImputed with caret
      • RMSE 0.327
      gbm variable importance
      variable   Overall
      GAGEDAYS   100.00
      ABCIRCM     93.12
      HCIRCM      68.62
      FEMURCM     46.02
      BPDCM       29.54
      SEXMale     21.96
      PARITY      11.61
      STUDYID      0.00
  19. Caret is an interface to several R modelling packages
  20. Models
      What did teams do?
      Models tried:
      • Linear regression
      • Regularised regression (LASSO/Ridge)
      • Decision trees + AdaBoost
      • Random forests
      Using:
      • Last scan only
      • Last two scans
      • Last three scans
      • All 6 scans (if available)
      ‘Best’ model:
      • Last three scans
      • Elastic Net
      • MAPE ≈ 7.4% (MAE ≈ 0.24 kg)
      This can be improved by:
      • Adding scans closer to delivery back in (MAPE ≈ 6.4%)
      What did the winning team do better?
      • Feature engineering
      • Smart transforms of features to predict brain volume, density, etc.
      • Unfortunately their slides are no longer available
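The two error metrics quoted on slide 20 can be written out in a few lines of R. The function names mae and mape are chosen here for illustration, and the four birth weights are made-up numbers, not results from the hackathon.

```r
# Mean absolute error, in the units of the outcome (kg here)
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Mean absolute percentage error, in percent
mape <- function(actual, predicted) {
  100 * mean(abs((actual - predicted) / actual))
}

# Made-up birth weights in kg, for illustration only
actual    <- c(3.0, 3.5, 2.80, 4.0)
predicted <- c(3.3, 3.5, 2.52, 3.8)

mae_value  <- mae(actual, predicted)   # 0.195 kg
mape_value <- mape(actual, predicted)  # 6.25 %
```

MAE reports the typical miss in kilograms, while MAPE scales each miss by the true weight, so the same absolute error counts for more on a small baby than on a large one.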
  21. Lessons learned
      Cleaning data takes time
      • About 50% of the time was spent on cleaning and understanding the data
      • HBGDki’s investment in data curation is well justified
      Trajectory clustering
      • An approach to classify longitudinal data, even if incomplete
      • More samples and more variables would make it possible to characterize different classes of growth speed
      Caret
      • Common interface for several R modelling packages
      • Also useful for data cleaning and exploring
      Feature engineering
      • Models can be improved by understanding the variables and transforming them appropriately