Predicting Gene Loss in Plants:
Lessons Learned from Laptop-Scale
Data
@PhilippBayer
Forrest Fellow, Edwards group
School of Biological Sciences
University of Western Australia
Who am I?
• Originally from Germany. PhD in Applied
Bioinformatics at UQ, worked on genotyping by
sequencing methods, finished 2016.
• Now Forrest Fellow at UWA, Perth in Edwards
group
My toolbox
• Originally did everything in Python – self-taught
• Jupyter notebooks on my laptop, scripts on our
servers
• Scikit-learn, pandas, fastai/keras
• Nowadays lots of R – workflowr, RStudio, caret
• Whichever works. String fiddling in Python, then
stats analysis/plotting in R.
‘Science’ vs ‘craft’
• I think ML is much more a ‘craft’ than a ‘science’
• It’s very hard to predict whether thing A or thing
B will be more accurate or perform better; in
many cases the methods perform similarly
• At some point you develop a gut feel for what
may and may not work -> craft!
The project
• Used sequencing data for ~300 lines of Brassica
oleracea (cabbage), B. rapa, and B. napus (canola)
XGBoost model
• Can we find out which genomic elements predict
gene variability? Lots of homeologous
recombination, lots of transposon activity
• Build three feature tables (B. napus/oleracea/rapa),
one row per gene
• Each table includes the size of the chromosome, whether
the gene is within 1/2/3 kb of various transposons,
whether the gene is in a syntenic block etc., to
predict the column ‘is a gene variable’
EDTA for TE prediction
XGBoost model
• Used XGBoost, one of the current state-of-the-art
machine learning approaches for not-so-big data
and feature tables (~ table of numbers)
• Goal of the model: is a given gene ‘core’ or
‘variable’ (lost in at least one plant)?
• Input data:
• 120,000 canola genes (rows)
• Transposons of different classes (columns)
• Position on chromosome (columns)
XGBoost
n_estimators is probably the most important parameter. The higher it is,
the longer training takes and the more accuracy you get, but the more
overfitting you get too! Everything downstream takes longer as well.
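A minimal sketch of that trade-off on synthetic data. To keep dependencies small, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here (both expose an `n_estimators` parameter); the data set and the settings 10 and 300 are made up for illustration and are not the values used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy two-class data (flip_y randomises some labels).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for n_estimators in (10, 300):
    # Stand-in for an XGBoost model; only n_estimators changes between runs.
    model = GradientBoostingClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this setting.
    scores[n_estimators] = (model.score(X_train, y_train),
                            model.score(X_test, y_test))
    print(n_estimators, scores[n_estimators])
```

On noisy data like this, training accuracy typically keeps climbing with more trees while test accuracy levels off; that widening gap is the overfitting.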
Initial accuracy!
I mean, biology is messy, right?? So 85.5% should be really good?
That’s almost 86%! Woo!
… but??
• Can we trust that? We should check the
confusion matrix!
                 Predicted core  Predicted variable
Actual core               19914                 148
Actual variable            3310                 507
… but??
• The confusion matrix shows us that in this case,
accuracy is misleading!
• XGBoost mostly predicts ‘core’ and calls it a day.
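The numbers in the confusion matrix can be turned into per-class figures with plain arithmetic. A small sketch (the variable names are made up; the counts are the ones from the matrix):

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted).
core_as_core, core_as_variable = 19914, 148
variable_as_core, variable_as_variable = 3310, 507

total = core_as_core + core_as_variable + variable_as_core + variable_as_variable

# Overall accuracy: correct predictions over all predictions.
accuracy = (core_as_core + variable_as_variable) / total

# Recall per class: how many of the actual members of a class were found.
recall_core = core_as_core / (core_as_core + core_as_variable)
recall_variable = variable_as_variable / (variable_as_core + variable_as_variable)

print(f"accuracy:        {accuracy:.3f}")
print(f"recall core:     {recall_core:.3f}")
print(f"recall variable: {recall_variable:.3f}")
```

Overall accuracy comes out at about 85.5%, but only about 13% of the actual variable genes are predicted as variable.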
Imbalanced classes
• Most real-life datasets have heavily imbalanced
classes
• Example: predicting a specific cancer. >99%
of people won’t develop that cancer, so a model
that just says ‘no cancer’ will have >99% accuracy
• Class imbalance will make your models look like
they perform well when in reality they perform
terribly
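The cancer example can be checked in a few lines. This is a toy sketch with an invented cohort of 1000 people, 5 of whom develop the cancer:

```python
# Hypothetical cohort: 1000 people, 5 of whom develop the cancer (label 1).
y_true = [1] * 5 + [0] * 995

# A "model" that ignores its input and always answers 'no cancer' (label 0).
y_pred = [0] * len(y_true)

# Fraction of predictions that match the truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of actual cancer cases the model flagged.
cases_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.3f}, cancer cases found: {cases_found} of 5")
```

The accuracy is 99.5% even though not a single case is found.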
Imbalanced classes
• Scikit-learn has many spots where you can work
against class imbalance
• Data stratification:
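A minimal sketch of stratified splitting, assuming scikit-learn's `train_test_split` with its `stratify` argument is what is meant here. The toy data, class counts and variable names are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 840 'core' genes (0) and 160 'variable' genes (1), random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 840 + [1] * 160)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fraction of 'variable' genes in each split.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random split of a heavily imbalanced data set can leave the test set with noticeably more or fewer minority-class rows than the training set.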
Imbalanced classes
• Most models have some kind of parameter for
class imbalance, for XGBoost:
• (‘craft’ – in my experience, values other than the
suggested one performed better)
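A sketch of how such a value might be derived, assuming the parameter in question is XGBoost's `scale_pos_weight`, for which the XGBoost documentation suggests the ratio of negative to positive examples as a typical starting point. The class counts are the row sums of the confusion matrix; `n_estimators=100` and the candidate list are hypothetical.

```python
# Class counts taken from the confusion-matrix rows: 'core' is treated as the
# negative class (0), 'variable' as the positive class (1).
n_core = 19914 + 148
n_variable = 3310 + 507

# Commonly suggested starting value: negatives / positives.
suggested = n_core / n_variable

# Hypothetical parameter set; it could be passed on as
# xgboost.XGBClassifier(**params).
params = {"n_estimators": 100, "scale_pos_weight": suggested}

# 'Craft': treat the suggested value as one candidate among several to try.
candidates = [1.0, suggested / 2, suggested, suggested * 2]

print(f"suggested scale_pos_weight: {suggested:.2f}")
```

Each candidate would then be compared on held-out data using a metric that is not fooled by the imbalance, rather than on plain accuracy.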
Imbalanced classes
• The fit method also has a parameter for
imbalanced classes:
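A minimal sketch, assuming the parameter meant here is `sample_weight`, which the `fit` method of XGBoost's scikit-learn wrapper (and of most scikit-learn estimators) accepts. The weights are computed with scikit-learn's `compute_sample_weight`; the toy labels are invented.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels: 84 'core' genes (0) and 16 'variable' genes (1).
y = np.array([0] * 84 + [1] * 16)

# 'balanced' gives each sample the weight n_samples / (n_classes * class_count),
# so both classes end up with the same total weight.
weights = compute_sample_weight(class_weight="balanced", y=y)

# Total weight per class.
print(weights[y == 0].sum(), weights[y == 1].sum())

# The weights would then go into the fit call, for example:
# model.fit(X, y, sample_weight=weights)
```

Each rare 'variable' row now counts for more during training than each common 'core' row.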
Imbalanced classes
• So after implementing all this stuff, can I get a
better class accuracy?
                 Predicted core  Predicted variable
Actual core               16471                3591
Actual variable            1817                2000
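To compare the two confusion matrices on more than raw accuracy, here is a small sketch that computes accuracy, balanced accuracy (the mean of the per-class recalls) and the Matthews correlation coefficient (MCC) from the counts. The helper name `summarise` is made up, and 'variable' is treated as the positive class.

```python
from math import sqrt

def summarise(tn, fp, fn, tp):
    """Accuracy, balanced accuracy and MCC from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Mean of the recall on the negative class and the recall on the positive class.
    balanced = (tn / (tn + fp) + tp / (tp + fn)) / 2
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, balanced, mcc

# Counts from the first confusion matrix (before handling class imbalance)
# and the second one (after).
before = summarise(tn=19914, fp=148, fn=3310, tp=507)
after = summarise(tn=16471, fp=3591, fn=1817, tp=2000)

for name, (acc, bal, mcc) in (("before", before), ("after", after)):
    print(f"{name}: accuracy={acc:.3f} balanced={bal:.3f} mcc={mcc:.3f}")
```

For these two matrices, plain accuracy goes down while balanced accuracy and MCC both go up.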
Base model
• Shouldn’t I make a base model first?
• I need to ‘beat’ something! I shouldn’t just use
XGBoost because it’s the flashy thing to do!
The base model
• Of all of my genes, 84.02% are core – that’s
what we have to beat!
• VERY different from the 50/50 you might have
assumed for two classes
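One way to get such a baseline is scikit-learn's `DummyClassifier`, which always predicts the most frequent class. This is a sketch, not necessarily how the 84.02% was computed; the labels are rebuilt from the row sums of the confusion matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts: 20062 core (0), 3817 variable (1).
y = np.array([0] * 20062 + [1] * 3817)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predicts the most frequent class, here 'core'.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any real model has to do clearly better than this number before its accuracy means anything.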
Summary of this part
• Not shown: A whole bunch of experimenting with
AUC, ROC, MCC, LightGBM, CatBoost, 10-fold
validation, imbalanced-learn, BayesSearchCV
for parameter optimisation, fiddling with the
probability cut-off, f1 scores (precision/recall)
• (This talk is 15 minutes long, not 15 hours)
• This is – maybe? – all I can get out of this
dataset! At some point you have to walk away.
What has the model learned?
• That’s the actually interesting part!
• XGBoost has built-in ‘gain’, ‘cover’ and
‘weight’ feature importance measures (I always
forget which does what)
• These treat rare or low-variance variables
differently
Less confusing: Shapley values!
• In a (wrong) nutshell: Make all possible
combinations of features, see how the model’s
prediction changes based on what you left out
https://christophm.github.io/interpretable-ml-book/shapley.html#shapley
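The nutshell version can be written out literally for a tiny case. The sketch below computes exact Shapley values by brute force over all feature subsets; the toy "model" `toy_value` and its numbers are invented, and real SHAP implementations use much faster algorithms than this.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values for a function value(frozenset_of_features) -> float."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a subset of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Change in prediction when feature i joins the subset.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_value(present):
    """Toy model output when only the features in `present` are known."""
    score = 0.0
    if 0 in present:
        score += 2.0
    if 1 in present:
        score += 1.0
    if 0 in present and 2 in present:
        score += 1.0  # interaction between features 0 and 2
    return score

phi = shapley_values(toy_value, 3)
print(phi)
```

The values add up to the difference between the prediction with all features and the prediction with none, and the interaction term is split evenly between features 0 and 2.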
Running SHAP in Python
• Easy to run, but takes a while:
• But takes much longer than training! With
XGBoost, higher model complexity settings
(n_estimators) mean waaaaaay longer runtime
• Comes with three kinds of plots: force plots,
dependence plots, and summary plots
SHAP in human survival (summary plot)
B. napus SHAP
SHAP dependence plot
[Figure: SHAP dependence plots for B. oleracea C1 and B. napus C1]
Shapley values
• Unlike the F scores reported by XGBoost’s
plot_importance, you can compare Shapley
values between different models!
• plot_importance tells you only whether a feature
is important; SHAP tells you whether high/low is
important too!
• As expected, in B. napus the further away from
centromeres a gene is, the higher its Shapley values
My ‘sources’
• A lot of this I got from Twitter.
My ‘sources’
• Some I got from books –
• Géron’s Hands-On Machine Learning (2nd ed)
(Tim O’Reilly: ‘one of the best books O’Reilly
has published in our entire history’)
• Müller and Guido’s Introduction to Machine
Learning with Python
• And heaps of googling
(towardsdatascience.com, various Kaggle
notebooks)
My ‘sources’
• And from Perth’s machine learning community!
Summary
Especially not yourself!
Summary
• Beware class imbalance! Don’t trust any
measurement blindly.
• ALWAYS check your predictions manually,
either by looking at a confusion matrix or by
digging into your raw predictions
• At some point you just have to stop improving
your model. This is a craft, not a science – hard
to predict when to move on. Better to add
features than to fiddle with the model.
Summary
• SHAP is a fun way to learn more about what the
model actually learned – but the explanation is
only as good as your model. A garbage model
will have garbage explanations.
• In my case: maybe Shapley can explain core
genes, but not variable genes?
• When building your own models, don’t get
discouraged by all the things that can go wrong!
There is a huge community off- and online to
help you!
Summary
• All code shown today comes from Jupyter
notebooks, all hosted at
https://github.com/AppliedBioinformatics/
Acknowledgements
Armin Scheben
Andy Yuan
Habib Rijzaani
Clémentine Mercé
Haifei (Ricky) Hu
Robyn Anderson
Cassie Fernandez
Monica Danilevicz
Jacob Marsh
Nicola & Andrew Forrest
Paul Johnson
Rochelle Gunn
Dave Edwards
Jacqueline Batley
Jason Williams
Nirav Merchant
Armand Gilles
Brent Verpaalen
Heaps more on Twitter, but Twitter’s Mentions doesn’t go past last October
Perth Machine Learning Group
Shujun Ou
Contact:
Philipp.bayer@uwa.edu.au
@philippbayer


Editor's Notes

  1. This is a PCA by chromosome – as you can see, some chromosomes ‘diverge’ more than others, mostly caused by how long the chromosomes are.
  2. 85%! That’s good, right?!?
  3. But in reality, the model mostly predicts just ‘core’, so not much better!
  4. Notice the ‘generally’ – in my experience, other values than the generally suggested one can give you higher accuracies!
  5. The accuracy is worse now BUT I have more predicted variable genes! Yay!
  6. As a more intuitive example, SHAP in a model of human mortality – sex is encoded as 0 = male, 1 = female.
  7. B. napus – homeologous block! AAAND NO TRANSPOSONS
  8. This is again a human example. Dependence plots let you zoom into one feature only, compared with another feature.
  9. B. oleracea C1 on top, B. napus C1 on bottom. In B. oleracea, genes close to centromeres are ‘protected’ from gene loss (low Shapley), but being far away has no consequence. In B. napus, far-away genes have high Shapley values!