
An Introduction to Machine Learning and Genomics

Women Who Code-HSV Event:

‘An Introduction to Machine Learning and Genomics’: Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning with real-world examples, including applications in the field of genomics, with an emphasis on complex human disease research.

Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.


An Introduction to Machine Learning and Genomics

  1. 1. An Introduction to Machine Learning and Genomics Brittany N. Lasseigne, PhD HudsonAlpha Institute for Biotechnology 27 June 2017 @bnlasse blasseigne@hudsonalpha.org
  2. 2. • ‘Genomical’ Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  3. 3. • ‘Genomical’ Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  4. 4. Cancer: • Men have a 1 in 2 lifetime risk of developing cancer and a 1 in 4 lifetime risk of dying from cancer • Women have a 1 in 3 lifetime risk of developing cancer and a 1 in 5 lifetime risk of dying from cancer Psychiatric Illness: • 1 in 4 American adults suffers from a diagnosable mental disorder in any given year • ~6% suffer serious disabilities as a result Neurodegenerative Disease: • ~6.5M Americans suffer (AD, PD, MS, ALS, HD), expected to rise to 12M by 2030 Complex Human Diseases: usually caused by a combination of genetic, environmental, and lifestyle factors (most of which have not yet been identified) (American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017)
  5. 5. Complex problems • Which patients are at high risk for developing cancer? • What are early biomarkers of cancer? • Which patients are likely to be short- or long-term cancer survivors? • What chemotherapeutic might a cancer patient benefit from? Goal: improve disease prevention, diagnosis, prognosis, and treatment efficacy
  6. 6. Genomics • Understanding the function of the genome (total genetic material) and how it relates to human disease (studying all of the genes at once!) • The sequencing of the human genome paved the way for genomic studies • Our goal is to identify genetic/genomic variation associated with disease to improve patient care
  7. 7. Sequencing
  9. 9. Cells, Tissues, & Diseases + Functional Annotations = Multidimensional Data Sets (Big Data) to improve disease prevention, diagnosis, prognosis, and treatment efficacy (image from encodeproject.org)
  10. 10. Case study: The Cancer Genome Atlas • Multiple data types for 11,000+ patients across 33 tumor types • 549,625 files with 2,000+ metadata attributes • >2.5 petabytes of data
  11. 11. Genomics Data is Big Data (Stephens, et al. PLOS Biology, 2015) 1 zettabyte (ZB) = 1024 EB; 1 exabyte (EB) = 1024 PB; 1 petabyte (PB) = 1024 TB; 1 terabyte (TB) = 1024 GB
  12. 12. Astronomical vs. ‘Genomical’ Data: the ‘four-headed beast’ of the data life cycle (2025 projections) (Stephens, et al. PLOS Biology, 2015 and nanalyze.com) 1 zettabyte (ZB) = 1024 EB; 1 exabyte (EB) = 1024 PB; 1 petabyte (PB) = 1024 TB; 1 terabyte (TB) = 1024 GB
  13. 13. • ‘Genomical’ Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  14. 14. Cells, Tissues, & Diseases + Functional Annotations = Multidimensional Data Sets (images from encodeproject.org and xorlogics.com) • We have lots of data and complex problems • We want to make data-driven predictions and need to automate model building • Goal: improve disease prevention, diagnosis, prognosis, and treatment efficacy
  15. 15. Cells, Tissues, & Diseases + Functional Annotations = Multidimensional Data Sets (images from encodeproject.org and xorlogics.com) Complex problems + Big Data -> Machine Learning!
  16. 16. Machine Learning • A data analysis method that automates analytical model building • Makes data-driven predictions or discovers patterns without explicit human intervention • Useful when we have complex problems and lots of data (‘big data’) • Traditional programming: data + program -> output (e.g., [2,3] + the ‘+’ program -> 5) • Machine learning: data + output -> program (e.g., [2,3] and 5 -> learn ‘+’) • Our goal isn’t to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
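
To make the slide's diagram concrete, here is a toy R sketch (simulated data, not from the talk): given example inputs and their outputs, regression recovers the ‘+’ program.

```r
# Toy illustration of "data + output -> program": given example inputs and
# their sums, linear regression recovers addition (both weights come out ~1).
set.seed(1)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
y  <- x1 + x2                              # the hidden "program" is +
fit <- lm(y ~ x1 + x2)
round(coef(fit), 3)                        # intercept ~0, slopes ~1 and ~1
predict(fit, data.frame(x1 = 2, x2 = 3))   # ~5, i.e., [2,3] -> 5
```
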
  17. 17. Supervised Learning: prediction (known data + known response -> model -> predict response for new data); ex. linear & logistic regression. Unsupervised Learning: find patterns (uncategorized data -> clusters of categorized data); ex. clustering, Principal Component Analysis
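
A minimal R sketch of both modes on the iris data (illustrative only): lm() is supervised prediction, while kmeans() and prcomp() find patterns without any labels.

```r
# Supervised learning (prediction): a known response guides the fit.
model <- lm(Petal.Width ~ Petal.Length, data = iris)   # linear regression
predict(model, data.frame(Petal.Length = 4.0))         # predict for new data

# Unsupervised learning (find patterns): no response variable at all.
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)    # clustering
pca      <- prcomp(iris[, 1:4], scale. = TRUE)  # principal component analysis
table(clusters$cluster, iris$Species)           # how clusters line up with species
```
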
  18. 18. Real-World Machine Learning Applications: recommendation engines, mail sorting, self-driving cars, HBO’s Silicon Valley ‘not hotdog!’ app
  19. 19. The Rise of Machine Learning • Hardware Advances • Extreme-performance hardware (ex. application-specific integrated circuits) • Smaller, cheaper hardware (Moore’s law) • Cloud computing (ex. AWS) • Software Advances • New machine learning algorithms, including deep learning and reinforcement learning • Data Advances • High-performance, less expensive sensors & data generation • ex. wearables, next-gen sequencing, social media
  20. 20. Machine Learning with the R Programming Language (KDnuggets, 2015) • Python is also a great choice! • R tends to be favored by statisticians and academics (for research) • Python tends to be favored by engineers (with production workflows)
  21. 21. Burtch Works asked data scientists and predictive analytics pros: Which do you prefer to use?
  22. 22. The R Programming Language • Open-source implementation of S, which was originally developed at Bell Labs • Free programming language and software environment for advanced statistical computing and graphics • Functional programming language written primarily in C and Fortran • Good at data manipulation, modeling and computing, and data visualization • Cross-platform compatible • Vast community (e.g., CRAN, R-bloggers, Bioconductor) • Over 10,000 packages, including parallel/high-performance computing packages • Used extensively by statisticians and academics • Popularity has increased substantially in recent years • Drawbacks: the learning curve can be steep (better recently), limited GUI (RStudio!), documentation can be sparse, memory allocation can be an issue
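
A few lines of R illustrating the points above: vectorized operations, built-in statistics, and one-line package installation from CRAN.

```r
# R is vectorized and statistics-first; packages extend it in one line.
x <- c(2.1, 3.5, 4.8, 6.0)
x * 2                # arithmetic applies element-wise, no loop needed
summary(x)           # descriptive statistics are built in

# install.packages("ggplot2")   # one of 10,000+ CRAN packages
# library(ggplot2)              # load it into the session
```
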
  23. 23. • ‘Genomical’ Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  24. 24. Iris Dataset in R: Fisher’s/Anderson’s iris data set contains measurements (cm) of sepal length and width and petal length and width (4 features) for 50 flowers from each of 3 species (Iris setosa, versicolor, and virginica)
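
In R, the data set is available out of the box:

```r
# iris ships with base R: no download needed.
data(iris)
dim(iris)            # 150 rows (flowers) x 5 columns (4 features + Species)
str(iris)            # sepal/petal lengths and widths in cm, plus a species factor
table(iris$Species)  # 50 each: setosa, versicolor, virginica
```
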
  25. 25. Iris Dataset: Summarize/Descriptive Statistics (Observational). Traditional programming: data + program -> output (e.g., Sepal.Length + mean(x) -> 5.843)
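
The slide's example, reproduced in base R:

```r
# Descriptive statistics are traditional programming: data + program -> output.
mean(iris$Sepal.Length)     # 5.843, the output shown on the slide
summary(iris$Sepal.Length)  # min, quartiles, median, mean, max in one call
sd(iris$Sepal.Length)       # spread of the measurements
```
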
  26. 26. Iris Dataset: Correlation (still descriptive): t-test, p-value < 2.2*10^-16 (scatterplot colored by species: setosa, versicolor, virginica)
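
The same correlation in base R; cor.test() performs the t-based test quoted on the slide:

```r
# Correlation is still descriptive: it measures association, not prediction.
cor(iris$Petal.Length, iris$Petal.Width)       # ~0.96, strongly correlated
cor.test(iris$Petal.Length, iris$Petal.Width)  # t-based test, p < 2.2e-16 as on the slide
```
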
  27. 27. Iris Dataset: Linear Regression is Machine Learning! • The red line is a linear regression line fit to the data, describing petal width as a function of petal length • We can now PREDICT petal width given petal length: Petal.Width ~ 0.416*Petal.Length - 0.363 (y = mx + b) • Machine learning: data + output -> program (Petal.Length and Petal.Width in, the fitted equation out)
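
The slide's model, fit in one line of base R:

```r
# Fit the slide's regression line: petal width as a function of petal length.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)  # intercept ~ -0.363, slope ~ 0.416: Petal.Width ~ 0.416*Petal.Length - 0.363

# The fitted model is the learned "program"; apply it to new measurements:
predict(fit, data.frame(Petal.Length = c(1.5, 4.5)))
```
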
  28. 28. Iris Data: Adding Regularization (LASSO) • Model building with a large number of features for a moderate number of samples can result in ‘overfitting’: the model is too specific to the training set and not generalizable enough for accurate predictions on new data • Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) • This results in FEATURE SELECTION • Ridge regression and LASSO regression are methods of regression with regularization • Example: starting from Petal.Width ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + b, LASSO shrinks A and B to 0, leaving Petal.Width ~ C*Sepal.Length + b, fit as Petal.Width ~ 0.968*Sepal.Length + 0.187 (p-value < 2.2*10^-16)
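
A minimal LASSO sketch, assuming the glmnet package is installed; the exact coefficients retained depend on the penalty strength chosen by cross-validation.

```r
# LASSO (alpha = 1) penalizes coefficients, shrinking redundant ones to exactly
# zero; the surviving nonzero coefficients are the "selected" features.
library(glmnet)
x <- as.matrix(iris[, c("Petal.Length", "Sepal.Width", "Sepal.Length")])
y <- iris$Petal.Width

set.seed(1)
cv <- cv.glmnet(x, y, alpha = 1)   # cross-validation picks the penalty strength
coef(cv, s = "lambda.1se")         # zeroed rows = features penalized away
```
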
  29. 29. Iris Data: Decision Trees • Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive • Limitations: each decision boundary at each split is a concrete binary decision, and the decision criteria consider only one input feature at a time (not a combination of multiple input features) • Examples: video games, clinical decision models • Iris tree: Petal.Length < 2.35 cm -> Setosa (40/0/0); else Petal.Width < 1.65 cm -> Versicolor (0/40/12), else Virginica (0/0/28)
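
A sketch with the rpart package (assumed installed). The slide's class counts of 40 per species suggest a 120-flower training subset, approximated here with a random sample, so the exact cutpoints may differ slightly from the slide's.

```r
# Grow a classification tree on a 120-flower training subset of iris.
library(rpart)
set.seed(1)
train <- iris[sample(nrow(iris), 120), ]
tree  <- rpart(Species ~ ., data = train)
print(tree)   # splits on petal length first, then petal width, as on the slide
```
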
  30. 30. Iris Data: Ensemble Methods (example: tree bagging and boosting) • Instead of picking a single model, ensemble methods combine multiple models to fit the training data (‘bagging’ and ‘boosting’) • Random Forest is a decision tree ensemble method (Image: Machado, et al. Veterinary Research, 2015)
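
A Random Forest sketch with the randomForest package (assumed installed):

```r
# Random Forest: an ensemble of trees grown on bootstrap samples (bagging),
# each split considering a random subset of the features.
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)       # out-of-bag error estimate and confusion matrix
importance(rf)  # which features the ensemble leans on
```
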
  31. 31. Iris Data: Neural Nets • Neural networks (NNs) emulate how the human brain works, with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions • Lots of tuning parameters (# of hidden layers, # of neurons in each layer, and multiple ways to tune learning) • Learning is an iterative feedback mechanism where training-data error is used to adjust the corresponding input weights, propagated back to previous layers (i.e., back-propagation) • NNs are good at learning non-linear functions and can handle multiple outputs, but have long training times, and models are susceptible to local-minimum traps (can be mitigated by doing multiple rounds, which takes more time!) • Diagram: inputs X1, X2 -> ‘neuron’ (summation of inputs and activation with sigmoid fxn) -> output
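
A single-hidden-layer sketch with the nnet package (assumed installed); `size` is the number of hidden neurons, one of the tuning parameters noted on the slide.

```r
# Train a small neural network classifier on iris.
library(nnet)
set.seed(1)
nn <- nnet(Species ~ ., data = iris, size = 5, maxit = 500, trace = FALSE)
table(predicted = predict(nn, iris, type = "class"),
      actual    = iris$Species)   # training-set confusion matrix
```
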
  32. 32. Other Machine Learning Methods • Naive Bayes (based on prior probabilities) • Hidden Markov Models (Bayesian network with hidden states) • K Nearest Neighbors (instance-based learning: clustering!) • Support Vector Machines (discriminator defined by a separating hyperplane) • Additional ensemble method approaches (combining multiple models) • And new methods coming out all the time… Workflow: raw data -> clean/normalize data -> training set + test set -> build model -> tune model -> test -> apply to new data (validation cohort or model application). Algorithm selection is an important step!
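
The boxed workflow, sketched end to end in base R; a simple linear model stands in for whichever algorithm is selected.

```r
# Split the data, build the model on the training set, test on held-out data.
set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- lm(Petal.Width ~ Petal.Length, data = train)  # build model
pred  <- predict(model, newdata = test)                # test on unseen data
sqrt(mean((pred - test$Petal.Width)^2))                # test-set RMSE
```
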
  33. 33. • ‘Genomical’ Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  34. 34. Complex problems • Which patients are at high risk for developing cancer? • What are early biomarkers of cancer? • Which patients are likely to be short- or long-term cancer survivors? • What chemotherapeutic might a cancer patient benefit from? Improve disease prevention, diagnosis, prognosis, and treatment efficacy. Complex problems + Big Data -> Machine Learning
  35. 35. Integrating genomic data with machine learning to improve predictive modeling: 1) Cross-Cancer Patient Outcome Prediction Model; 2) Improved Kidney Cancer Patient Outcome Prediction Model
  36. 36. ‘Common Survival Genes’ across 19 cancers • ‘Common Survival Genes’: Cox regression uncorrected p-value < 0.05 for a gene in at least 9/19 cancers • 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation • Clustering by Cox regression p-values: 7 ‘Proliferative Informative Cancers’ and 12 ‘Non-Proliferative Informative Cancers’ (Heatmap: top cross-cancer survival genes vs. 19 cancer types (ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC), colored by scaled -log10 Cox p-value) Ramaker & Lasseigne, et al. 2017.
  37. 37. ‘Common Survival Genes’ across 19 cancers: same heatmap as the previous slide, with the 7 Proliferative Informative Cancers (PICs) highlighted. Ramaker & Lasseigne, et al. 2017.
  38. 38. ‘Common Survival Genes’ across 19 cancers: same heatmap, with both the Proliferative Informative Cancers (PICs) and the 12 Non-Proliferative Informative Cancers (Non-PICs) highlighted. Ramaker & Lasseigne, et al. 2017.
  39. 39. Cross-Cancer Patient Outcome Model: Cox regression with LASSO feature selection applied to ~20,000 gene expression values and cancer patient survival. Survival ~ -0.104 + 0.086*ADAM12 + 0.037*CKS1 - 0.088*CRYL1 + 0.056*DNA2 + 0.013*DONSON + 0.098*HJURP - 0.022*NDRG2 + 0.031*RAD54B + 0.040*SHOX2 - 0.155*SUOX. Ramaker & Lasseigne, et al. 2017.
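
A hypothetical sketch of this kind of model in R, assuming the glmnet package; the matrices below are simulated stand-ins, not the TCGA data, and this is not the authors' pipeline.

```r
# Cox regression with LASSO feature selection on simulated expression data.
library(glmnet)
set.seed(1)
n <- 200; p <- 1000   # 200 patients, 1,000 genes (stand-in for ~20,000)
expr <- matrix(rnorm(n * p), n, p,
               dimnames = list(NULL, paste0("gene", 1:p)))
y <- cbind(time   = rexp(n, rate = 0.1),   # simulated survival times
           status = rbinom(n, 1, 0.5))     # 1 = death observed, 0 = censored

cv  <- cv.glmnet(expr, y, family = "cox", alpha = 1)  # LASSO-penalized Cox model
sel <- coef(cv, s = "lambda.min")
sel[sel[, 1] != 0, , drop = FALSE]   # the genes LASSO retains (possibly none for pure noise)
```
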
  40. 40. Integrating genomic data with machine learning to improve predictive modeling: 1) Cross-Cancer Patient Outcome Prediction Model; 2) Improved Kidney Cancer Patient Survival Prediction Model
  41. 41. TCGA Kidney Renal Cell Carcinoma (KIRC) Data Set • 291 tumor samples with clinical, RNA-seq, DNAm, and CNV data available (~1/3 of patients died from disease)
  42. 42. Can we improve clinically relevant phenotype prediction with multi-omics classifiers? Clinically annotated multidimensional data sets (DNAm, CNV, RNA expression, protein expression, microRNA expression, mutations); the RNA, DNAm, and CNV data feed into Cox regression with LASSO feature selection
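
A hypothetical sketch of the approach, with simulated matrices standing in for the real RNA, DNAm, and CNV data (glmnet assumed installed; this is not the authors' code): reduce each data type to its top principal components, concatenate them, and fit a LASSO-penalized Cox model on the combined features.

```r
# Multi-omics idea: per-data-type PCA, then Cox LASSO on the combined PCs.
library(glmnet)
set.seed(1)
n <- 100
rna  <- matrix(rnorm(n * 500), n)  # RNA expression (patients x features)
dnam <- matrix(rnorm(n * 500), n)  # DNA methylation
cnv  <- matrix(rnorm(n * 500), n)  # copy number variation

pcs <- function(m, k = 10) prcomp(m, scale. = TRUE)$x[, 1:k]  # top k PCs
x   <- cbind(pcs(rna), pcs(dnam), pcs(cnv))
colnames(x) <- paste0(rep(c("RNA", "DNAm", "CNV"), each = 10), ".PC", 1:10)

y  <- cbind(time = rexp(n, 0.1), status = rbinom(n, 1, 0.5))  # simulated outcomes
cv <- cv.glmnet(x, y, family = "cox", alpha = 1)  # Cox LASSO on combined PCs
```
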
  43. 43. Multi-omic classifiers to predict patient outcome (RNA, DNAm, CNV). Test AUC by model: CNV <0.5; RNA 0.5683; DNAm 0.6794; DNAm+CNV PCs 0.6571; RNA+CNV PCs 0.6730; RNA+DNAm PCs 0.7397; RNA+DNAm+CNV PCs 0.7619
  44. 44. Multi-omic classifiers to predict patient outcome: same test AUC table as the previous slide, with the combined RNA+DNAm+CNV model highlighted as the best performer (0.7619)
  45. 45. Multi-omic classifiers to predict patient outcome • The RNA+DNAm+CNV model of patient survival outperformed each data type alone or paired with another single data type, as well as models built on features before dimension reduction • Synergistic effect from combining RNA, DNAm, and CNV into combined features for prediction of patient outcome • Some principal components were strongly correlated with CIN or DNAmIN status (Test AUC: CNV <0.5; RNA 0.5683; DNAm 0.6794; DNAm+CNV 0.6571; RNA+CNV 0.6730; RNA+DNAm 0.7397; RNA+DNAm+CNV 0.7619)
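
For reference, a test AUC like those in the table can be computed from predicted risks and observed outcomes, e.g., with the pROC package (assumed installed; the scores and outcomes here are simulated placeholders).

```r
# Compute an area under the ROC curve from held-out predictions.
library(pROC)
set.seed(1)
risk  <- rnorm(100)                    # model's predicted risk on held-out patients
event <- rbinom(100, 1, plogis(risk))  # observed outcome (1 = died of disease)
auc(roc(event, risk))                  # area under the ROC curve
```
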
  46. 46. Take-Home Message • Genomics generates big data to address complex biological problems, e.g., improving human disease prevention, diagnosis, prognosis, and treatment efficacy • Machine learning is a data analysis method that automates analytical model building to make data-driven predictions or discover patterns without explicit human intervention • Machine learning is a subfield of computer science: the algorithms are implemented in code • Machine learning is useful when we have complex problems with lots of ‘big’ data • Traditional programming: data + program -> output ([2,3] + ‘+’ -> 5); machine learning: data + output -> program ([2,3] and 5 -> ‘+’)
  47. 47. HudsonAlpha: hudsonalpha.org Information is Power: http://hudsonalpha.org/information-is-power R Programming Language and/or Machine Learning (mostly free): Software Carpentry (software-carpentry.org) and Data Carpentry (datacarpentry.org) coursera.org and datacamp.com Stanford Online’s ‘Statistical Learning’ class Books: Rosalind Franklin: The Dark Lady of DNA by Brenda Maddox (Female scientist biography) The Emperor of All Maladies by Siddhartha Mukherjee (History of cancer) The Gene by Siddhartha Mukherjee (History of genetics) Genome by Matt Ridley (Human Genome) Algorithms to Live By by Brian Christian and Tom Griffiths (CS application to real-life) Headstrong: 52 Women Who Changed Science-and the World by Rachel Swaby Lean In by Sheryl Sandberg (Women and the workplace) Bossypants by Tina Fey (Autobiography)
  48. 48. Thanks! Brittany N. Lasseigne, PhD @bnlasse blasseigne@hudsonalpha.org
