Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data

@PhilippBayer
Forrest Fellow, Edwards group
School of Biological Sciences
University of Western Australia

Who am I?
• Originally from Germany. PhD in Applied
Bioinformatics at UQ, worked on genotyping by
sequencing methods, finished 2016.
• Now Forrest Fellow at UWA, Perth in Edwards
group

My toolbox
• Originally did everything in Python – self-taught
• Jupyter notebooks on my laptop, scripts on our
servers
• scikit-learn, pandas, fastai/Keras
• Nowadays lots of R – workflowr, RStudio, caret
• Whichever works: string fiddling in Python, then
statistical analysis and plotting in R.

‘Science’ vs ‘craft’
• I think ML is much more a ‘craft’ than a ‘science’
• It’s very hard to predict whether thing A or thing
B will be more accurate or perform better; in
many cases, methods will perform similarly
• At some point you develop a gut feel for what
may and may not work -> craft!

The project
• Used sequencing data for ~300 lines of Brassica
oleracea (cabbage), B. rapa, and B. napus (canola)

XGBoost model
• Can we find out which genomic elements predict
gene variability? Lots of homeologous
recombination, lots of transposon activity
• Build three feature tables, one each for B. napus,
B. oleracea and B. rapa, with one row per gene
• Each table includes the size of the chromosome,
whether the gene lies within 1, 2 or 3 kb of various
transposons, whether the gene is in a syntenic
block, etc., to predict the column ‘is the gene
variable’

XGBoost model
• Used XGBoost, one of the current state-of-
the-art machine learning approaches for not-so-
big data and feature tables (~ table of numbers)
• Goal of the model: is a given gene ‘core’ or
‘variable’ (lost in at least one plant)?
• Input data:
• 120,000 canola genes (rows)
• Transposons of different classes (columns)
• Position on chromosome (columns)

XGBoost
n_estimators is probably the most important parameter. The higher it is,
the longer training takes and the better the fit to the training data, but
the more overfitting you get too! Everything downstream takes longer as well.

Initial accuracy!
I mean, biology is messy, right?? So 85.5% should be really good?
That’s almost 86%! Woo!

… but??
• Can we trust that? We should check the
confusion matrix!
Class             Predicted core   Predicted variable
Actual core       19914            148
Actual variable   3310             507

… but??
• The confusion matrix shows us that in this case,
accuracy is misleading!
• XGBoost mostly predicts ‘core’ and calls it a day.
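To make the point concrete, here is a small sketch that recomputes accuracy and per-class recall from the four counts in the confusion matrix above. The variable names are my own; only the counts come from the slide.

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted).
core_as_core, core_as_variable = 19914, 148
variable_as_core, variable_as_variable = 3310, 507

total = core_as_core + core_as_variable + variable_as_core + variable_as_variable
accuracy = (core_as_core + variable_as_variable) / total

# Recall per class: of the genes that truly belong to a class, how many were found?
recall_core = core_as_core / (core_as_core + core_as_variable)
recall_variable = variable_as_variable / (variable_as_core + variable_as_variable)

print(f"accuracy:        {accuracy:.3f}")
print(f"recall core:     {recall_core:.3f}")
print(f"recall variable: {recall_variable:.3f}")
```

Overall accuracy is about 85.5%, yet only about 13% of the truly variable genes are recovered.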

Imbalanced classes
• Most real life datasets have heavily imbalanced
classes
• Example: Prediction of a specific cancer, >99%
of people won’t develop that cancer, so a model
just saying ‘no cancer’ will have >99% accuracy
• Class imbalance will make your models look like
they perform well when in reality, they perform
terribly

Imbalanced classes
• scikit-learn offers several places where you can
counteract class imbalance
• Data stratification:
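The code from this slide did not survive in the text, so the following is only a minimal sketch of what a stratified split can look like with scikit-learn's `train_test_split` and its `stratify` argument. The toy data, the 16% class share and the split size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a gene table: 1,000 rows, 16% labelled "variable" (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 160 + [0] * 840)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"variable share in train: {y_train.mean():.2f}")
print(f"variable share in test:  {y_test.mean():.2f}")
```

Without stratification, a random split of a heavily imbalanced table can leave the test set with noticeably more or fewer minority-class rows than the training set.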

Imbalanced classes
• Most models have some kind of parameter for
class imbalance, for XGBoost:
• (‘craft’ – in my experience, values other than
the suggested one performed better)
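The parameter shown on the slide is missing from the text. My assumption is that it was `scale_pos_weight`, for which the XGBoost documentation suggests sum(negative instances) / sum(positive instances) as a typical value. The sketch below computes that value from the class counts in this talk's confusion matrices and builds a few hypothetical alternatives to try, in the spirit of the ‘craft’ remark.

```python
# Class counts taken from the confusion matrices in this talk
# (core = negative class, variable = positive class).
n_core, n_variable = 20062, 3817

# Typical starting value suggested in the XGBoost documentation.
suggested = n_core / n_variable

# Hypothetical candidates around the suggestion; which one works best
# is an empirical question for your own data.
candidates = [1.0, round(suggested / 2, 2), round(suggested, 2), round(suggested * 2, 2)]

# Each dict could be passed on as xgboost.XGBClassifier(**params);
# xgboost itself is not imported in this sketch.
param_sets = [{"scale_pos_weight": w} for w in candidates]
print(param_sets)
```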

Imbalanced classes
• The fit method also has a parameter for
imbalanced classes:
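The slide's code is not in the text. I assume it referred to the `sample_weight` argument that `fit` accepts in XGBoost's scikit-learn wrapper and in most scikit-learn estimators. A minimal sketch of building such weights with scikit-learn, using toy labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels: 8 core genes (0) and 2 variable genes (1).
y = np.array([0] * 8 + [1] * 2)

# "balanced" weights each row by n_samples / (n_classes * class_count),
# so rows from the rare class count for more during training.
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)

# The weights would then be handed to the model, for example:
# model.fit(X, y, sample_weight=weights)
```

With these toy labels, every core row gets a weight of 0.625 and every variable row a weight of 2.5.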

Imbalanced classes
• So after implementing all this stuff, can I get a
better class accuracy?
Class             Predicted core   Predicted variable
Actual core       16471            3591
Actual variable   1817             2000
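A quick way to see what changed is to compare the two confusion matrices from this talk side by side. The helper below is a hypothetical convenience function, not code from the talk; balanced accuracy is simply the mean of the two per-class recalls.

```python
def summarise(core_core, core_var, var_core, var_var):
    """Accuracy, per-class recall and balanced accuracy for a 2x2 matrix."""
    total = core_core + core_var + var_core + var_var
    recall_core = core_core / (core_core + core_var)
    recall_variable = var_var / (var_core + var_var)
    return {
        "accuracy": (core_core + var_var) / total,
        "recall_core": recall_core,
        "recall_variable": recall_variable,
        "balanced_accuracy": (recall_core + recall_variable) / 2,
    }

# Counts from the first matrix (before) and this one (after the imbalance fixes).
before = summarise(19914, 148, 3310, 507)
after = summarise(16471, 3591, 1817, 2000)

for name in before:
    print(f"{name:18s} before={before[name]:.3f}  after={after[name]:.3f}")
```

Overall accuracy drops from about 85.5% to about 77.4%, while recall for variable genes rises from about 13% to about 52%. Whether that trade is worth it depends on what the predictions are for.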

Base model
• Shouldn’t I make a base model first?
• I need to ‘beat’ something! I shouldn’t just use
XGBoost because it’s the flashy thing to do!

The base model
• Of all of my genes, 84.02% are core – that’s
what we have to beat!
• VERY different from the 50/50 you might have
assumed for two classes
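One way to get such a base model is scikit-learn's `DummyClassifier`, which always predicts the most frequent class. This is a sketch under my own assumptions: the labels are rebuilt from the class counts in the confusion matrices, and the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the confusion-matrix counts: 0 = core, 1 = variable.
y = np.array([0] * 20062 + [1] * 3817)
X = np.zeros((len(y), 1))  # placeholder features, ignored by the dummy model

# Always predict the majority class ("core").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Against this baseline, the initial 85.5% is an improvement of only about 1.5 percentage points.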

Summary of this part
• Not shown: A whole bunch of experimenting with
AUC, ROC, MCC, LightGBM, CatBoost, 10-fold
cross-validation, imbalanced-learn, BayesSearchCV
for parameter optimisation, fiddling with the
probability cut-off, F1 scores (precision/recall)
• (This talk is 15 minutes long, not 15 hours)
• This is – maybe? – all I can get out of this
dataset! At some point you have to walk away.

What has the model learned?
• That’s the actually interesting part!
• XGBoost has in-built methods for ‘gain’, ‘cover’,
‘weight’ (I always forget what does what) feature
importance
• These treat rare or low-variance variables
differently

Less confusing: Shapley
values!
• In a (wrong) nutshell: Make all possible
combinations of features, see how the model’s
prediction changes based on what you left out
https://christophm.github.io/interpretable-ml-book/shapley.html#shapley
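The nutshell description can be sketched as an exact Shapley computation on a toy model. Everything here is hypothetical: the three feature names and the scoring function are made up for illustration, and real SHAP implementations use much faster algorithms than enumerating every subset.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley values: weighted average marginal contribution of each
    feature over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def toy_prediction(known):
    """Hypothetical model output when only the features in `known` are used."""
    score = 0.0
    if "near_transposon" in known:
        score += 10
    if "in_syntenic_block" in known:
        score += 4
    if {"near_transposon", "in_syntenic_block"} <= known:
        score += 6  # interaction, shared equally between the two features
    return score

features = ["near_transposon", "in_syntenic_block", "chromosome_size"]
phi = shapley_values(toy_prediction, features)
print(phi)
```

The values add up to the difference between the prediction with all features and the prediction with none, and the unused feature gets a value of zero.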

Running SHAP in Python
• Easy to run, but it takes much longer than
training! With XGBoost, higher model complexity
settings (n_estimators) mean waaaaaay longer runtime
• Comes with three kinds of plots: force plots,
dependence plots, and summary plots

SHAP in human survival
(summary plot)

Shapley values
• Unlike the F scores reported by XGBoost’s
plot_importance, you can compare Shapley
values between different models!
• plot_importance only tells you whether a feature
is important; SHAP also tells you whether high or
low values of that feature matter!
• As expected, in B. napus the further a gene is from
the centromere, the higher its Shapley values

My ‘sources’
• A lot of this I got from Twitter.

My ‘sources’
• Some I got from books:
• Géron’s Hands-On Machine Learning (2nd ed)
(Tim O’Reilly: ‘one of the best books O’Reilly
has published in our entire history’)
• Müller and Guido’s Introduction to Machine
Learning with Python
• And heaps of googling
(towardsdatascience.com, various Kaggle
notebooks)

My ‘sources’
• And from Perth’s machine learning community!

Summary
Especially not yourself!

Summary
• Beware class imbalance! Don’t trust any
measurement blindly.
• ALWAYS check your predictions manually,
either by looking at a confusion matrix or by
digging into your raw predictions
• At some point you just have to stop improving
your model. This is a craft, not a science – hard
to predict when to move on. Better to add
features than to fiddle with the model.

Summary
• SHAP is a fun way to learn more about what the
model actually learned – but the explanation is
only as good as your model. A garbage model
will have garbage explanations.
• In my case: maybe Shapley can explain core
genes, but not variable genes?
• When building your own models, don’t get
discouraged at all the things that can go wrong!
There is a huge community off- and online to
help you!

Summary
• All code shown today comes from Jupyter
notebooks, all hosted at
https://github.com/AppliedBioinformatics/

Acknowledgements
Armin Scheben
Andy Yuan
Habib Rijzaani
Clémentine Mercé
Haifei (Ricky) Hu
Robyn Anderson
Cassie Fernandez
Monica Danilevicz
Jacob Marsh
Nicola & Andrew Forrest
Paul Johnson
Rochelle Gunn
Dave Edwards
Jacqueline Batley
Jason Williams
Nirav Merchant
Armand Gilles
Brent Verpaalen
Heaps more on Twitter, but Twitter’s Mentions doesn’t go past last October
Perth Machine Learning Group
Shujun Ou
Contact:
Philipp.bayer@uwa.edu.au
@philippbayer
