Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data


Slides for my talk at PAG 2020 (PAG XXVIII)

Published in: Science

  1. Predicting Gene Loss in Plants: Lessons Learned from Laptop-Scale Data
     @PhilippBayer, Forrest Fellow, Edwards group, School of Biological Sciences, University of Western Australia
  2. Who am I?
     • Originally from Germany. PhD in Applied Bioinformatics at UQ, worked on genotyping-by-sequencing methods, finished 2016.
     • Now Forrest Fellow at UWA, Perth, in the Edwards group
  3. My toolbox
     • Originally did everything in Python, self-taught
     • Jupyter notebooks on my laptop, scripts on our servers
     • scikit-learn, pandas, fastai/Keras
     • Nowadays lots of R: workflowr, RStudio, caret
     • Whichever works. String fiddling in Python, then stats analysis/plotting in R.
  4. ‘Science’ vs ‘craft’
     • I think ML is much more a ‘craft’ than a ‘science’
     • It’s very hard to predict whether thing A or thing B will be more accurate or perform better; in many cases methods will perform similarly
     • At some point you develop a gut feel for what may and may not work -> craft!
  5. The project
     • Used sequencing data for ~300 lines of Brassica oleracea (cabbage), B. rapa, and B. napus (canola)
  6. XGBoost model
     • Can we find out which genomic elements predict gene variability? Lots of homeologous recombination, lots of transposon activity
     • Build three feature tables, one each for B. napus, B. oleracea and B. rapa, with one row per gene
     • Each table includes the size of the chromosome, whether the gene is within 1/2/3 kb of various transposons, whether the gene is in a syntenic block etc., to predict the column ‘is a gene variable’
  7. EDTA for TE prediction
  8. XGBoost model
     • Used XGBoost, one of the current state-of-the-art machine learning approaches for not-so-big data and feature tables (~ table of numbers)
     • Goal of the model: is a given gene ‘core’ or ‘variable’ (lost in at least one plant)?
     • Input data:
       • 120,000 canola genes (rows)
       • Transposons of different classes (columns)
       • Position on chromosome (columns)
  9. XGBoost
     n_estimators is probably the most important parameter. The higher it is, the longer training takes and the more accuracy you get, but the more overfitting you get too! Everything downstream takes longer too.
  10. Initial accuracy!
      I mean, biology is messy, right?? So 85.5% should be really good? That’s almost 86%! Woo!
  11. … but??
      • Can we trust that? We should check the confusion matrix!

                         Predicted core   Predicted variable
        Actual core               19914                  148
        Actual variable            3310                  507

  12. … but??
      • The confusion matrix shows us that in this case, accuracy is misleading!
      • XGBoost mostly predicts ‘core’ and calls it a day.
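
To see why the 85.5% figure is misleading, it helps to recompute it straight from the confusion matrix on slide 11, next to the per-class recall. This is a minimal sketch in plain Python; the variable names are illustrative and do not come from the talk's notebooks.

```python
# Counts from the confusion matrix on slide 11 (rows = actual, columns = predicted).
core_as_core, core_as_variable = 19914, 148
variable_as_core, variable_as_variable = 3310, 507

total = core_as_core + core_as_variable + variable_as_core + variable_as_variable

# Overall accuracy: all correct predictions over all predictions.
accuracy = (core_as_core + variable_as_variable) / total

# Recall per class: the share of actual members of a class that were found.
recall_core = core_as_core / (core_as_core + core_as_variable)
recall_variable = variable_as_variable / (variable_as_core + variable_as_variable)

print(f"accuracy:        {accuracy:.3f}")
print(f"recall core:     {recall_core:.3f}")
print(f"recall variable: {recall_variable:.3f}")
```

Accuracy comes out at roughly 85.5%, but only about 13% of the actually variable genes are recovered, which is the point the slide makes.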
  13. Imbalanced classes
      • Most real-life datasets have heavily imbalanced classes
      • Example: predicting a specific cancer. If >99% of people won’t develop that cancer, a model just saying ‘no cancer’ will have >99% accuracy
      • Class imbalance will make your models look like they perform well when in reality, they perform terribly
  14. Imbalanced classes
      • Scikit-learn has many spots where you can work against class imbalance
      • Data stratification:
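
The code shown on this slide is not part of the transcript. A minimal sketch of stratified splitting, assuming scikit-learn's `train_test_split` with its `stratify` argument is what was meant; the data below is a made-up stand-in for the gene feature table.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature table: 1000 rows, 16% 'variable' (label 1).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 160 + [0] * 840)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"variable share in train: {y_train.mean():.3f}")
print(f"variable share in test:  {y_test.mean():.3f}")
```

Without `stratify`, a random split of a small or very imbalanced dataset can leave the test set with noticeably more or fewer minority-class rows than the training set.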
  15. Imbalanced classes
      • Most models have some kind of parameter for class imbalance, for XGBoost:
      • (‘craft’: in my experience, values other than the suggested one above performed better)
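
The parameter on the slide is missing from the transcript. In XGBoost the usual knob is `scale_pos_weight`, and the commonly suggested starting value is the number of negative rows divided by the number of positive rows. The sketch below assumes that is what the slide showed and takes the class counts from the confusion matrix on slide 11; the `params` dictionary and the candidate list are hypothetical.

```python
# Class counts implied by the confusion matrix rows on slide 11.
n_core = 19914 + 148       # negative class (label 0)
n_variable = 3310 + 507    # positive class (label 1), the minority

# Commonly suggested starting point: negatives / positives.
suggested_weight = n_core / n_variable

# Hypothetical parameter dictionary; it could be passed on as
# xgboost.XGBClassifier(**params) where xgboost is installed.
params = {"scale_pos_weight": suggested_weight}

# The slide reports that other values performed better, so treat the ratio
# as one candidate among several (the extra values here are made up).
candidates = [1.0, suggested_weight, 2 * suggested_weight]

print(f"suggested scale_pos_weight: {suggested_weight:.2f}")
```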
  16. Imbalanced classes
      • The fit method also has a parameter for imbalanced classes:
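
The transcript does not say which parameter of `fit` was shown. One that matches the description is `sample_weight`, which many scikit-learn estimators accept. The sketch below assumes that reading and uses scikit-learn's `GradientBoostingClassifier` on made-up data as a stand-in for XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy imbalanced data: 80 'variable' rows (label 1), 420 'core' rows (label 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.array([1] * 80 + [0] * 420)

# 'balanced' gives each row the weight n_samples / (n_classes * class_count),
# so both classes end up with the same total weight.
weights = compute_sample_weight(class_weight="balanced", y=y)

# Stand-in model, not the one used in the talk.
model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(X, y, sample_weight=weights)

print(f"weight per variable row: {weights[y == 1][0]:.3f}")
print(f"weight per core row:     {weights[y == 0][0]:.3f}")
```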
  17. Imbalanced classes
      • So after implementing all this stuff, can I get a better class accuracy?

                         Predicted core   Predicted variable
        Actual core               16471                 3591
        Actual variable            1817                 2000

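
One way to answer that question is to put the two confusion matrices (slides 11 and 17) side by side. The sketch below uses plain Python; the helper `summarise` is illustrative, and balanced accuracy (the mean of the per-class recalls) is added here as a yardstick that the slides do not report.

```python
def summarise(core_core, core_var, var_core, var_var):
    """Return accuracy, per-class recall and balanced accuracy."""
    total = core_core + core_var + var_core + var_var
    recall_core = core_core / (core_core + core_var)
    recall_variable = var_var / (var_core + var_var)
    return {
        "accuracy": (core_core + var_var) / total,
        "recall_core": recall_core,
        "recall_variable": recall_variable,
        "balanced_accuracy": (recall_core + recall_variable) / 2,
    }

before = summarise(19914, 148, 3310, 507)    # slide 11
after = summarise(16471, 3591, 1817, 2000)   # slide 17

for name in before:
    print(f"{name:18s} before={before[name]:.3f}  after={after[name]:.3f}")
```

Overall accuracy drops from about 85.5% to about 77.4%, while recall for variable genes rises from about 13% to about 52%. The rebalanced model gives up some accuracy on core genes to find far more variable ones.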
  18. Base model
      • Shouldn’t I make a base model first?
      • I need to ‘beat’ something! I shouldn’t just use XGBoost because it’s the flashy thing to do!
  19. The base model
      • Of all of my genes, 84.02% are core: that’s what we have to beat!
      • VERY different from the 50/50 you might have assumed for two classes
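
The 84.02% can be reproduced from the row totals of the confusion matrix on slide 11. This is a minimal sketch of a majority-class baseline in plain Python; the names are illustrative.

```python
# Labels implied by the confusion matrix rows (0 = core, 1 = variable).
labels = [0] * (19914 + 148) + [1] * (3310 + 507)

# Majority-class baseline: always predict the most common label.
majority = max(set(labels), key=labels.count)
baseline_accuracy = labels.count(majority) / len(labels)

print(f"majority class: {majority}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
```

Against this baseline, the initial 85.5% accuracy is an improvement of only about 1.5 percentage points.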
  20. Summary of this part
      • Not shown: A whole bunch of experimenting with AUC, ROC, MCC, LightGBM, CatBoost, 10-fold cross-validation, imbalanced-learn, BayesSearchCV for parameter optimisation, fiddling with the probability cut-off, F1 scores (precision/recall)
      • (This talk is 15 minutes long, not 15 hours)
      • This is (maybe?) all I can get out of this dataset! At some point you have to walk away.
  21. What has the model learned?
      • That’s the actually interesting part!
      • XGBoost has built-in ‘gain’, ‘cover’ and ‘weight’ feature importance measures (I always forget what does what)
      • These treat rare or low-variance variables differently
  22. Less confusing: Shapley values!
      • In a (wrong) nutshell: Make all possible combinations of features, see how the model’s prediction changes based on what you left out
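
The nutshell description can be made concrete with a brute-force computation for a single sample. This sketch assumes that a "left out" feature is replaced by a baseline value, which is one common convention; the toy model and the numbers are made up. Enumerating all subsets is exponential in the number of features, so it only illustrates the idea and is not how one would explain a real XGBoost model.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of a subset of this size in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi += weight * (with_i - without_i)
        phis.append(phi)
    return phis


# Toy 'model' with an interaction between the first two features (made up).
def toy_model(f):
    return 2.0 * f[0] + f[0] * f[1] + 0.5 * f[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phis = shapley_values(toy_model, x, baseline)
print(phis)
```

The values add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features involved.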
  23. Running SHAP in Python
      • Easy to run, but takes a while:
      • It takes much longer than training! With XGBoost, higher model complexity settings (n_estimators) mean waaaaaay longer runtime
      • Comes with three kinds of plots: force plots, dependence plots, and summary plots
  24. SHAP in human survival (summary plot)
  25. B. napus SHAP
  26. SHAP dependence plot
  27. B. oleracea C1, B. napus C1
  28. Shapley values
      • Unlike F-values reported by XGBoost’s plot_importance, you can compare Shapley values between different models! plot_importance tells you only whether a feature is important; SHAP tells you whether high or low values are important too!
      • As expected, in B. napus the further a gene is from the centromere, the higher its Shapley values
  29. My ‘sources’
      • A lot of this I got from Twitter.
  30. My ‘sources’
      • Some I got from books:
        • Géron’s Hands-On Machine Learning (2nd ed) (Tim O’Reilly: ‘one of the best books O’Reilly has published in our entire history’)
        • Müller and Guido’s Introduction to Machine Learning with Python
      • And heaps of googling (, various Kaggle notebooks)
  31. My ‘sources’
      • And from Perth’s machine learning community!
  32. Summary
      Especially not yourself!
  33. Summary
      • Beware class imbalance! Don’t trust any measurement blindly.
      • ALWAYS check your predictions manually, either by looking at a confusion matrix or by digging into your raw predictions
      • At some point you just have to stop improving your model. This is a craft, not a science: it is hard to predict when to move on. Better to add features than to fiddle with the model.
  34. Summary
      • SHAP is a fun way to learn more about what the model actually learned, but the explanation is only as good as your model. A garbage model will have garbage explanations.
      • In my case: maybe Shapley can explain core genes, but not variable genes?
      • When building your own models, don’t get discouraged at all the things that can go wrong! There is a huge community off- and online to help you!
  35. Summary
      • All code shown today comes from Jupyter notebooks, all hosted at
  36. Acknowledgements
      Armin Scheben, Andy Yuan, Habib Rijzaani, Clémentine Mercé, Haifei (Ricky) Hu, Robyn Anderson, Cassie Fernandez, Monica Danilevicz, Jacob Marsh, Nicola & Andrew Forrest, Paul Johnson, Rochelle Gunn, Dave Edwards, Jacqueline Batley, Jason Williams, Nirav Merchant, Armand Gilles, Brent Verpaalen, Perth Machine Learning Group, Shujun Ou
      Heaps more on Twitter but Twitter’s Mentions doesn’t go past last October
      Contact: @philippbayer