Hands-on Introduction to Machine Learning

This workshop is a hands-on introduction to machine learning with R and was presented on December 8, 2017 at the University of South Carolina for the 2017 Computational Biology Symposium held by the International Society for Computational Biology Regional Student Group-Southeast USA.


Hands-on Introduction to Machine Learning

  1. 1. Introduction to Machine Learning Brittany N. Lasseigne, PhD Senior Scientist HudsonAlpha Institute for Biotechnology 8 December 2017 @bnlasse blasseigne@hudsonalpha.org
  2. 2. • ‘Genomical’ and Biology Big Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  3. 3. • ‘Genomical’ and Biology Big Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  4. 4. Biology Big Data • Molecular and cellular profiling of large numbers of features in large numbers of samples (‘omics’ data) • Image processing: cell microscopy, neuroimaging, radiology and histology, crop imagery, etc. 4
  5. 5. Biology Big Data • Molecular and cellular profiling of large numbers of features in large numbers of samples (‘omics’ data) • Image processing: cell microscopy, neuroimaging, radiology and histology, crop imagery, etc. 4Esteva, et al. Nature, 2017.
  6. 6. Biology Big Data • Molecular and cellular profiling of large numbers of features in large numbers of samples (‘omics’ data) • Image processing: cell microscopy, neuroimaging, radiology and histology, crop imagery, etc. 4Esteva, et al. Nature, 2017. Resources: • Kan, Machine Learning applications in cell image analysis, Immunology and Cell Biology, 2017 • Angermueller, et al. Deep learning for computational biology, Mol Syst Biol, 2016. • Ching, et al. Opportunities and Obstacles for Deep Learning in Biology and Medicine, biorxiv, 2017
  7. 7. 5American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017. Complex Human Diseases: usually caused by a combination of genetic, environmental and lifestyle factors (most of which have not yet been identified)
  8. 8. 5American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017. Cancer: • Men have a 1 in 2 lifetime risk of developing cancer and a 1 in 4 lifetime risk of dying from cancer • Women have a 1 in 3 lifetime risk of developing cancer and a 1 in 5 lifetime risk of dying from cancer Psychiatric Illness: • 1 in 4 American adults suffer from a diagnosable mental disorder in any given year • ~6% suffer serious disabilities as a result Neurodegenerative Disease: • ~6.5M Americans suffer (AD, PD, MS, ALS, HD), expected to rise to 12M by 2030 Complex Human Diseases: usually caused by a combination of genetic, environmental and lifestyle factors (most of which have not yet been identified)
  9. 9. 5American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017. Cancer: • Men have a 1 in 2 lifetime risk of developing cancer and a 1 in 4 lifetime risk of dying from cancer • Women have a 1 in 3 lifetime risk of developing cancer and a 1 in 5 lifetime risk of dying from cancer Psychiatric Illness: • 1 in 4 American adults suffer from a diagnosable mental disorder in any given year • ~6% suffer serious disabilities as a result Neurodegenerative Disease: • ~6.5M Americans suffer (AD, PD, MS, ALS, HD), expected to rise to 12M by 2030 Complex Human Diseases: usually caused by a combination of genetic, environmental and lifestyle factors (most of which have not yet been identified)
  10. 10. • Which patients are at high risk of developing cancer? • What are early biomarkers of cancer? • Which patients are likely to be short/long term cancer survivors? • What chemotherapeutic might a cancer patient benefit from? 6 Improve disease prevention, diagnosis, prognosis, and treatment efficacy
  11. 11. • Which patients are at high risk of developing cancer? • What are early biomarkers of cancer? • Which patients are likely to be short/long term cancer survivors? • What chemotherapeutic might a cancer patient benefit from? 6 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Complex problems
  12. 12. Genomics • Understanding the function of the genome (total genetic material) and how it relates to human disease (studying all of the genes at once!) 7
  13. 13. Genomics • Understanding the function of the genome (total genetic material) and how it relates to human disease (studying all of the genes at once!) • The sequencing of the human genome paved the way for genomic studies 7
  14. 14. Genomics • Understanding the function of the genome (total genetic material) and how it relates to human disease (studying all of the genes at once!) • The sequencing of the human genome paved the way for genomic studies • Our goal is to identify genetic/genomic variation associated with disease to improve patient care 7
  15. 15. 8 Sequencing
  16. 16. 9
  17. 17. Image from encodeproject.org 10 Improve disease prevention, diagnosis, prognosis, and treatment efficacy
  18. 18. Image from encodeproject.org 10 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Multidimensional Data Sets
  19. 19. Cells, Tissues, & Diseases Image from encodeproject.org 10 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Multidimensional Data Sets
  20. 20. Cells, Tissues, & Diseases Functional Annotations Image from encodeproject.org 10 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Multidimensional Data Sets
  21. 21. Cells, Tissues, & Diseases Functional Annotations Image from encodeproject.org 10 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Multidimensional Data Sets Big Data
  22. 22. Genomics Data is Big Data 11Stephens, et al. PLOS Biology, 2015. 1 zettabyte (ZB) = 1024 EB 1 exabyte (EB) = 1024 PB 1 petabyte (PB) = 1024 TB 1 terabyte (TB) = 1024 GB
  23. 23. 12 Case study: The Cancer Genome Atlas • Multiple data types for 11,000+ patients across 33 tumor types • 549,625 files with 2000+ metadata attributes • >2.5 Petabytes of data
  24. 24. 12 Case study: The Cancer Genome Atlas • Multiple data types for 11,000+ patients across 33 tumor types • 549,625 files with 2000+ metadata attributes • >2.5 Petabytes of data 1 Petabyte of Data = 20M four-drawer filing cabinets filled with text or 13.3 years of HD-TV video or ~7 billion Facebook photos or 1 PB of MP3 songs requires ~2,000 years to play
  25. 25. Astronomical ‘Genomical’ Data: the ‘four-headed beast’ of the data life-cycle (2025 Projections) 13Stephens, et al. PLOS Biology, 2015 and nanalyze.com.
  26. 26. Astronomical ‘Genomical’ Data: the ‘four-headed beast’ of the data life-cycle (2025 Projections) 13Stephens, et al. PLOS Biology, 2015 and nanalyze.com. 1 zettabyte (ZB) = 1024 EB 1 exabyte (EB) = 1024 PB 1 petabyte (PB) = 1024 TB 1 terabyte (TB) = 1024 GB
  27. 27. Astronomical ‘Genomical’ Data: the ‘four-headed beast’ of the data life-cycle (2025 Projections) 13Stephens, et al. PLOS Biology, 2015 and nanalyze.com. 1 zettabyte (ZB) = 1024 EB 1 exabyte (EB) = 1024 PB 1 petabyte (PB) = 1024 TB 1 terabyte (TB) = 1024 GB
  28. 28. • ‘Genomical’ and Biology Big Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  29. 29. Cells, Tissues, & Diseases Functional Annotations Image from encodeproject.org and xorlogics.com. 15 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Multidimensional Data Sets
  30. 30. Cells, Tissues, & Diseases Functional Annotations Image from encodeproject.org and xorlogics.com. 15 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Multidimensional Data Sets • We have lots of data and complex problems • We want to make data-driven predictions and need to automate model building
  31. 31. Cells, Tissues, & Diseases Functional Annotations Image from encodeproject.org and xorlogics.com. 16 Multidimensional Data Sets Complex problems + Big Data —> Machine Learning!
  32. 32. Cells, Tissues, & Diseases Functional Annotations Image from encodeproject.org and xorlogics.com. 16 Multidimensional Data Sets Complex problems + Big Data —> Machine Learning! • Allows us to better utilize these increasingly large data sets to capture their inherent structure • Algorithms learn by training with data
  33. 33. • data analysis method that automates analytical model building • makes data-driven predictions or discovers patterns without explicit human intervention • Useful when we have complex problems and lots of data (‘big data’) Machine Learning 17
  34. 34. • data analysis method that automates analytical model building • makes data-driven predictions or discovers patterns without explicit human intervention • Useful when we have complex problems and lots of data (‘big data’) Machine Learning 17 Computer Data Program Output Traditional Programming Computer [2,3] + 5
  35. 35. • data analysis method that automates analytical model building • makes data-driven predictions or discovers patterns without explicit human intervention • Useful when we have complex problems and lots of data (‘big data’) Machine Learning 17 Computer Data Program Output Traditional Programming Computer [2,3] + 5 Computer Data Output Program Machine Learning Computer [2,3] 5 +
  36. 36. • data analysis method that automates analytical model building • makes data-driven predictions or discovers patterns without explicit human intervention • Useful when we have complex problems and lots of data (‘big data’) Machine Learning 17 Computer Data Program Output Traditional Programming Computer [2,3] + 5 Computer Data Output Program Machine Learning Computer [2,3] 5 + • Our goal isn’t to make perfect guesses, but to make useful guesses—we want to build a model that is useful for the future
  37. 37. 18 Supervised Learning: -Prediction Ex. linear & logistic regression Unsupervised Learning: -Find patterns Ex. Clustering, Principal Component Analysis
  38. 38. 18 Supervised Learning: -Prediction Ex. linear & logistic regression Unsupervised Learning: -Find patterns Ex. Clustering, Principal Component Analysis Known Data + Known Response YES NO
  39. 39. 18 Supervised Learning: -Prediction Ex. linear & logistic regression Unsupervised Learning: -Find patterns Ex. Clustering, Principal Component Analysis Known Data + Known Response YES NO MODEL
  40. 40. 18 Supervised Learning: -Prediction Ex. linear & logistic regression Unsupervised Learning: -Find patterns Ex. Clustering, Principal Component Analysis Known Data + Known Response YES NO MODEL NEW DATA Predict Response
  41. 41. 18 Supervised Learning: -Prediction Ex. linear & logistic regression Unsupervised Learning: -Find patterns Ex. Clustering, Principal Component Analysis Known Data + Known Response YES NO MODEL NEW DATA Predict Response Uncategorized Data
  42. 42. 18 Supervised Learning: -Prediction Ex. linear & logistic regression Unsupervised Learning: -Find patterns Ex. Clustering, Principal Component Analysis Known Data + Known Response YES NO MODEL NEW DATA Predict Response Clusters of Categorized Data Uncategorized Data
  43. 43. Real-World Machine Learning Applications 19 Recommendation Engine Mail Sorting Self-Driving Car
  44. 44. The Rise of Machine Learning • Hardware Advances • Extreme performance hardware (ex. application-specific integrated circuits) • Smaller, cheaper hardware (Moore’s law) • Cloud computing (ex. AWS) • Software Advances • New machine learning algorithms including deep learning and reinforcement learning • Data Advances • High-performance, less expensive sensors & data generation • ex. wearables, next-gen sequencing, social media 20
  45. 45. The Rise of Machine Learning • Hardware Advances • Extreme performance hardware (ex. application-specific integrated circuits) • Smaller, cheaper hardware (Moore’s law) • Cloud computing (ex. AWS) • Software Advances • New machine learning algorithms including deep learning and reinforcement learning • Data Advances • High-performance, less expensive sensors & data generation • ex. wearables, next-gen sequencing, social media 20
  46. 46. The Rise of Machine Learning • Hardware Advances • Extreme performance hardware (ex. application-specific integrated circuits) • Smaller, cheaper hardware (Moore’s law) • Cloud computing (ex. AWS) • Software Advances • New machine learning algorithms including deep learning and reinforcement learning • Data Advances • High-performance, less expensive sensors & data generation • ex. wearables, next-gen sequencing, social media 20 We often use R, but Python is also a great choice! • R tends to be favored by statisticians and academics (for research) • Python tends to be favored by engineers (with production workflows)
  47. 47. • Open-source implementation of S, which was originally developed at Bell Labs • Free programming language and software environment for advanced statistical computing and graphics • Functional programming language written primarily in C and Fortran • Good at data manipulation, modeling and computing, and data visualization • Cross-platform compatible • Vast community (e.g., CRAN, R-bloggers, Bioconductor) • Over 10,000 packages, including parallel/high-performance computing packages • Used extensively by statisticians and academics • Popularity has increased substantially in recent years • Drawbacks: the learning curve can be steep (better recently), limited GUI (RStudio!), documentation can be sparse, memory allocation can be an issue The R Programming Language 21
  48. 48. • ‘Genomical’ and Biology Big Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  49. 49. 23 Open RStudio
  50. 50. 24 Under File->New File->select R Script
  51. 51. 25 We will be working in the R script panel (top left)
  52. 52. 25 We will be working in the R script panel (top left)
  53. 53. Fisher’s/Anderson's iris data set: measurements (cm) of the sepal length and width, petal length and width, and species (Iris setosa, versicolor, and virginica) (5 features or variables) for 150 flowers (observations) Iris Dataset in R 26
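For reference, a minimal sketch (base R only) of loading and peeking at the built-in iris data described above; the exact commands shown in the workshop screenshots may differ slightly:

    data(iris)    # iris ships with base R; this makes it available in the workspace
    dim(iris)     # 150 observations (rows) by 5 variables (columns)
    head(iris)    # first six rows: four measurements in cm plus the Species factor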
  54. 54. 27 Data inspection: summary function, tab out, and ‘Help’ pages
  55. 55. 28 Data inspection: Built-in iris dataset (check out mtcars too!)
  56. 56. 29 Data inspection: execute a line of code with ctl+return or hit ‘Run’ button
  57. 57. 30 Data inspection: Can also inspect data with the str (structure) function
  58. 58. 31 Data inspection: Can also inspect data with the str (structure) function
  59. 59. 32 Data inspection: And examine the first 5 rows [x,] and first 5 columns [,y]
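A sketch of the inspection commands these slides walk through (summary, Help pages, str, and bracket indexing); the output is what appears in the workshop screenshots:

    summary(iris)   # per-column descriptive statistics; counts per species for the factor
    ?iris           # open the Help page for the data set
    str(iris)       # structure: 150 obs. of 5 variables; Species is a factor with 3 levels
    iris[1:5, ]     # first 5 rows  (row indices go before the comma)
    iris[, 1:5]     # first 5 columns (column indices go after the comma)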
  60. 60. 33 Data inspection: the plot function
  61. 61. 34 Data inspection: $ notation for calling columns by name
  62. 62. 35 Data inspection: the plot function
  63. 63. 36 Data inspection: the cor.test function
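A sketch of the plotting and correlation steps; pairing petal length with petal width is assumed here because that is the relationship modeled on the following slides:

    plot(iris)                                      # pairwise scatterplots of all variables

    # $ notation selects a single column by name
    plot(iris$Petal.Length, iris$Petal.Width,
         xlab = "Petal length (cm)", ylab = "Petal width (cm)")

    cor.test(iris$Petal.Length, iris$Petal.Width)   # correlation estimate with p-value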
  64. 64. Iris Dataset: Summarize/Descriptive Statistics (Observational) 37
  65. 65. Iris Dataset: Summarize/Descriptive Statistics (Observational) 37 Computer Data Program Output Traditional Programming
  66. 66. Iris Dataset: Summarize/Descriptive Statistics (Observational) 37 Computer Data Program Output Traditional Programming Computer Sepal.Length mean(x) 5.843
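In the traditional-programming framing above, we hand the computer data (a column of measurements) and a program (the mean function), and it returns the output:

    mean(iris$Sepal.Length)   # ~5.843, the value shown on the slide
    sd(iris$Sepal.Length)     # another descriptive statistic, for comparison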
  67. 67. 38 Data modeling:
  68. 68. 39 Data modeling: the lm function
  69. 69. 40 Data modeling: the lm function
  70. 70. 40 Data modeling: the lm function y ~ mx + b
  71. 71. 41 Data modeling: the lm function y ~ mx + b Petal.Width ~ 0.4158*Petal.Length - 0.3631
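A sketch of the lm() call behind the coefficients quoted on the slide; the formula Petal.Width ~ Petal.Length reads "model petal width as a function of petal length":

    fit <- lm(Petal.Width ~ Petal.Length, data = iris)
    coef(fit)       # intercept ~ -0.3631, Petal.Length slope ~ 0.4158
    summary(fit)    # full model summary with standard errors and R-squared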
  72. 72. 42 Data modeling: the abline function to add regression line to our plot
  73. 73. 43 Data modeling: the abline function to add regression line to our plot
  74. 74. Iris Dataset: Linear Regression is Machine Learning! • Purple line is a linear regression line fit to the data, describing petal width as a function of petal length • We can now PREDICT petal width given petal length Petal.Width ~ 0.4158*Petal.Length - 0.3631 (y=mx+b) 44
  75. 75. Iris Dataset: Linear Regression is Machine Learning! • Purple line is a linear regression line fit to the data, describing petal width as a function of petal length • We can now PREDICT petal width given petal length Petal.Width ~ 0.4158*Petal.Length - 0.3631 (y=mx+b) Computer Data Output Program Machine Learning 44
  76. 76. Iris Dataset: Linear Regression is Machine Learning! • Purple line is a linear regression line fit to the data, describing petal width as a function of petal length • We can now PREDICT petal width given petal length Petal.Width ~ 0.4158*Petal.Length - 0.3631 (y=mx+b) Computer Data Output Program Machine Learning Computer Petal.Length Petal.Width Petal.Width ~ 0.4158*Petal.Length - 0.3631 44
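A sketch of overlaying the regression line with abline() and then using the fitted model to predict, assuming the fit object from the previous step; the petal length value below is a hypothetical example:

    plot(iris$Petal.Length, iris$Petal.Width)
    abline(fit, col = "purple")                              # add the fitted regression line

    predict(fit, newdata = data.frame(Petal.Length = 4.5))   # predicted petal width (cm)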
  77. 77. 45 What do you notice about the plot?
  78. 78. 46 Data wrangling: Examining Iris species variable
  79. 79. 47 Data wrangling: Examining Iris species variable
  80. 80. 48 Data wrangling: Coding species labels as categories to color the points by
  81. 81. 49 Data wrangling: Examine species variable
  82. 82. 50 Data wrangling: the palette function describes a vector of default colors for plotting in R
  83. 83. 51 Data inspection: plotting with data points colored by species (setosa, versicolor, virginica)
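A sketch of coloring points by species; iris$Species is already a factor, so its integer codes (1, 2, 3) can index the default palette() colors:

    palette()                     # default color vector used for col = 1, 2, 3, ...
    levels(iris$Species)          # "setosa", "versicolor", "virginica"

    plot(iris$Petal.Length, iris$Petal.Width,
         col = as.integer(iris$Species),   # species 1/2/3 -> palette colors 1/2/3
         pch = 19)
    legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)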
  84. 84. Train an algorithm to classify Iris flowers by species 52 Fisher’s Iris Data n=150 Training Set n=105 Test Set n=45 70% 30%
  85. 85. 53 Defining training and test sets: use nrow function to code the total number of observations in the Iris dataset
  86. 86. 54 Defining training and test sets: use sample function to assign observations to the training set Note: I did not set a seed for this tutorial so you may get slightly different results. For more about setting seeds, see here: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
  87. 87. 55 Defining training and test sets: use sample function to assign observations to the training set, x is 1:n
  88. 88. 56 Defining training and test sets: use sample function to assign observations to the training set, size is round(0.7*n) -> 0.7*150 = 105
  89. 89. 57 Defining training and test sets: assign the 105 selected observations to the training set
  90. 90. 58 Defining training and test sets: assign the observations not selected for training to the test set (the remaining 45 observations)
  91. 91. 59 Fisher’s Iris Data n=150 Training Set: “iristrain” n=105 Test Set: “iristest” n=45 70% 30% Train an algorithm to classify Iris flowers by species
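A sketch of the 70/30 split diagrammed above; the object names iristrain and iristest follow the slides, the index variable name is illustrative, and because no seed is set the sampled rows will differ between runs, as noted earlier:

    n <- nrow(iris)                                   # 150 observations
    trainindex <- sample(1:n, size = round(0.7 * n))  # 105 randomly chosen row indices
    iristrain <- iris[trainindex, ]                   # training set, n = 105
    iristest  <- iris[-trainindex, ]                  # test set, the remaining n = 45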
  92. 92. Iris Data: Adding Regularization •Model building with a large # of features/variables for a moderate number of observations can result in ‘overfitting’ —the model is too specific to the training set and not generalizable enough for accurate predictions with new data 60
  93. 93. Iris Data: Adding Regularization •Model building with a large # of features/variables for a moderate number of observations can result in ‘overfitting’ —the model is too specific to the training set and not generalizable enough for accurate predictions with new data •Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of features/variables that are linearly dependent (redundant) •This results in FEATURE SELECTION 60
  94. 94. Iris Data: Adding Regularization •Model building with a large # of features/variables for a moderate number of observations can result in ‘overfitting’ —the model is too specific to the training set and not generalizable enough for accurate predictions with new data •Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of features/variables that are linearly dependent (redundant) •This results in FEATURE SELECTION •Example methods of regression with regularization: ridge, elastic net, LASSO 60
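For readers who want the underlying objective, the LASSO estimate minimizes the usual residual sum of squares plus an L1 penalty on the coefficients, written here in standard notation (λ is the tuning parameter discussed on the next slides):

    \hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j|

Larger λ pushes more coefficients to exactly zero, which is what performs the feature selection.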
  95. 95. LASSO: Least Absolute Shrinkage and Selection Operator • Linear regression: predictive analysis fitting a single line through data to describe relationships between one dependent variable and one or more independent variables 61 Credit Card Balance Credit Limit Images: Adapted from Tibshirani, et al.
  96. 96. LASSO: Least Absolute Shrinkage and Selection Operator • Linear regression: predictive analysis fitting a single line through data to describe relationships between one dependent variable and one or more independent variables • LASSO regression: performs variable selection by including a penalty that forces some coefficient estimates to be exactly zero based on a tuning parameter (λ), yielding a sparse model 61 Credit Card Balance Credit Limit Images: Adapted from Tibshirani, et al.
  97. 97. LASSO: Least Absolute Shrinkage and Selection Operator • Linear regression: predictive analysis fitting a single line through data to describe relationships between one dependent variable and one or more independent variables • LASSO regression: performs variable selection by including a penalty that forces some coefficient estimates to be exactly zero based on a tuning parameter (λ), yielding a sparse model 61 Credit Card Balance Credit Limit Images: Adapted from Tibshirani, et al.
  98. 98. LASSO: Least Absolute Shrinkage and Selection Operator • Linear regression: predictive analysis fitting a single line through data to describe relationships between one dependent variable and one or more independent variables • LASSO regression: performs variable selection by including a penalty that forces some coefficient estimates to be exactly zero based on a tuning parameter (λ), yielding a sparse model 61 Credit Card Balance Credit Limit Images: Adapted from Tibshirani, et al.
  99. 99. LASSO Tuning Parameter Selection • Select tuning parameter by cross-validation: – Partition data multiple times – Compute cross-validation error rate for each tuning parameter – Select tuning parameter value with smallest error 62 Example: 5-Fold Cross-Validation Image: goldenhelix.com
  100. 100. 63 Building a model for Iris species prediction: the glmnet package
  101. 101. 64 Building a model for Iris species prediction: the cv.glmnet function Note: I did not set a seed for this tutorial so you may get slightly different results. For more about setting seeds, see here: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
  102. 102. 65 Building a model for Iris species prediction: the cv.glmnet function on the training set, with x = as.matrix(iristrain[,-5]) and y = iristrain[,5]
  103. 103. 66 Building a model for Iris species prediction: use predict function to evaluate model in test set
  104. 104. 67 Building a model for Iris species prediction: use table function to view predicted species vs. actual species
  105. 105. 68 Building a model for Iris species prediction: view resulting predict object
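A sketch of the model-building and prediction steps on these slides, assuming the iristrain/iristest split from earlier; family = "multinomial" is an assumption for the three-class species outcome, and results will vary slightly without a seed:

    library(glmnet)

    # cross-validated LASSO: predictors as a numeric matrix, response is the Species factor
    cvfit <- cv.glmnet(x = as.matrix(iristrain[, -5]),
                       y = iristrain[, 5],
                       family = "multinomial")

    # predict species for the held-out test set at the lambda with minimum CV error
    pred <- predict(cvfit, newx = as.matrix(iristest[, -5]),
                    s = "lambda.min", type = "class")

    # confusion table: predicted vs. actual species
    table(predicted = as.vector(pred), actual = iristest$Species)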
  106. 106. 69 Building a model for Iris species prediction: examine cv.glmnet object
  107. 107. 71 Building a model for Iris species prediction: plot cv.glmnet object
  108. 108. 71 Building a model for Iris species prediction: plot cv.glmnet object # of predictors in the model Error Tuning Parameter Penalty
  109. 109. 71 Building a model for Iris species prediction: plot cv.glmnet object # of predictors in the model Error Tuning Parameter Penalty λmin Lambda with minimum cross-validated error λ1SE Largest lambda where error is within 1 standard error of minimum
  110. 110. 72 Building a model for Iris species prediction: comparing coefficients at lambda.min and lambda.1se
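A sketch of inspecting the fitted cv.glmnet object and comparing the two standard lambda choices:

    plot(cvfit)                     # CV error vs. log(lambda); top axis = number of nonzero predictors

    cvfit$lambda.min                # lambda with the minimum cross-validated error
    cvfit$lambda.1se                # largest lambda within 1 SE of the minimum (sparser model)

    coef(cvfit, s = "lambda.min")   # per-class coefficients; some are shrunk exactly to zero
    coef(cvfit, s = "lambda.1se")   # usually even fewer nonzero coefficients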
  111. 111. Iris Data: Adding Regularization (LASSO) • Model building with a large # of features for a moderate number of samples can result in ‘overfitting’ — the model is too specific to the training set and not generalizable enough for accurate predictions with new data • Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection 73
  112. 112. Iris Data: Adding Regularization (LASSO) • Model building with a large # of features for a moderate number of samples can result in ‘overfitting’ — the model is too specific to the training set and not generalizable enough for accurate predictions with new data • Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection 73 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b
  113. 113. Iris Data: Adding Regularization (LASSO) • Model building with a large # of features for a moderate number of samples can result in ‘overfitting’ — the model is too specific to the training set and not generalizable enough for accurate predictions with new data • Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection 73 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b 0 0 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b
  114. 114. Iris Data: Adding Regularization (LASSO) • Model building with a large # of features for a moderate number of samples can result in ‘overfitting’ — the model is too specific to the training set and not generalizable enough for accurate predictions with new data • Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection 73 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b 0 0 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
  115. 115. Iris Data: Adding Regularization (LASSO) • Model building with a large # of features for a moderate number of samples can result in ‘overfitting’ — the model is too specific to the training set and not generalizable enough for accurate predictions with new data • Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection 73 Computer Petal.Length Sepal.Width Sepal.Length Petal.Width Species Species(setosa) ~ 1.58*Sepal.Width + -2.36*Petal.Length + 5.96 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b 0 0 Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
  116. 116. Iris Data: Decision Trees • Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive • Decision tree limitations include that each decision boundary at each split is a concrete binary decision and the decision criteria only consider one input feature at a time (not a combination of multiple input features) • Examples: Video games, clinical decision models 74
  117. 117. Iris Data: Decision Trees • Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive • Decision tree limitations include that each decision boundary at each split is a concrete binary decision and the decision criteria only consider one input feature at a time (not a combination of multiple input features) • Examples: Video games, clinical decision models 74
  118. 118. Iris Data: Decision Trees • Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive • Decision tree limitations include that each decision boundary at each split is a concrete binary decision and the decision criteria only consider one input feature at a time (not a combination of multiple input features) • Examples: Video games, clinical decision models 74 Petal.Length < 2.35 cm Setosa (40/0/0)
  119. 119. Iris Data: Decision Trees • Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive • Decision tree limitations include that each decision boundary at each split is a concrete binary decision and the decision criteria only consider one input feature at a time (not a combination of multiple input features) • Examples: Video games, clinical decision models 74 Petal.Length < 2.35 cm Setosa (40/0/0) Petal.Width < 1.65 cm Versicolor (0/40/12) Virginica (0/0/28)
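A sketch of fitting a decision tree like the one drawn on these slides; the rpart package is an assumption (the slides do not name the package), but it produces splits of the same form (e.g. Petal.Length < 2.35 cm):

    library(rpart)

    tree <- rpart(Species ~ ., data = iristrain, method = "class")   # classification tree
    print(tree)                     # text view of splits and class counts at each node

    plot(tree, margin = 0.1)        # draw the tree
    text(tree, use.n = TRUE)        # label splits and leaves with observation counts

    treepred <- predict(tree, iristest, type = "class")
    table(predicted = treepred, actual = iristest$Species)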
  120. 120. Deep Learning (i.e. neural nets) • Subfield of machine learning describing ‘human-like AI’ • Algorithms are structured in layers to create artificial neural networks to learn and make decisions without human intervention • These networks represent the world as a nested hierarchy of concepts with each defined in relation to simpler concepts 75 X1 X2 Output (Summation of Input and Activation with Sigmoid Fxn) ‘Neuron’
  121. 121. Deep Learning (i.e. neural nets) • Subfield of machine learning describing ‘human-like AI’ • Algorithms are structured in layers to create artificial neural networks to learn and make decisions without human intervention • These networks represent the world as a nested hierarchy of concepts with each defined in relation to simpler concepts • Deep learning algorithms (compared to other machine learning): • need a lot more data to perform well • need more/better hardware • typically identify and extract features without human intervention • usually solve problems end-to-end instead of in parts • take a lot longer to train • are typically less interpretable 75 X1 X2 Output (Summation of Input and Activation with Sigmoid Fxn) ‘Neuron’
  122. 122. Deep Learning (i.e. neural nets) • Subfield of machine learning describing ‘human-like AI’ • Algorithms are structured in layers to create artificial neural networks to learn and make decisions without human intervention • These networks represent the world as a nested hierarchy of concepts with each defined in relation to simpler concepts • Deep learning algorithms (compared to other machine learning): • need a lot more data to perform well • need more/better hardware • typically identify and extract features without human intervention • usually solve problems end-to-end instead of in parts • take a lot longer to train • are typically less interpretable • Ex: Deep learning to automate resume scoring • Scoring performance may be excellent (i.e. near human performance) • Does not reveal why a particular applicant was given a score • Mathematically you can find out which nodes of the network were activated, but we don’t know what those neurons were supposed to model or what the layers of neurons were doing collectively • Interpretation is difficult 75 X1 X2 Output (Summation of Input and Activation with Sigmoid Fxn) ‘Neuron’
  123. 123. Deep Learning (i.e. neural nets) • Subfield of machine learning describing ‘human-like AI’ • Algorithms are structured in layers to create artificial neural networks to learn and make decisions without human intervention • These networks represent the world as a nested hierarchy of concepts with each defined in relation to simpler concepts • Deep learning algorithms (compared to other machine learning): • need a lot more data to perform well • need more/better hardware • typically identify and extract features without human intervention • usually solve problems end-to-end instead of in parts • take a lot longer to train • are typically less interpretable • Ex: Deep learning to automate resume scoring • Scoring performance may be excellent (i.e. near human performance) • Does not reveal why a particular applicant was given a score • Mathematically you can find out which nodes of the network were activated, but we don’t know what those neurons were supposed to model or what the layers of neurons were doing collectively • Interpretation is difficult 75 X1 X2 Output (Summation of Input and Activation with Sigmoid Fxn) ‘Neuron’
  124. 124. Other Machine Learning Methods • Neural Nets • Ensemble Methods (e.g. bagging, boosting) • Naive Bayes (based on prior probabilities) • Hidden Markov Models (Bayesian network with hidden states) • K Nearest Neighbors (instance-based learning) • Support Vector Machines (discriminator defined by a separating hyperplane) • Additional Ensemble Method Approaches (combining multiple models) • And new methods coming out all the time… 76
  125. 125. Other Machine Learning Methods • Neural Nets • Ensemble Methods (e.g. bagging, boosting) • Naive Bayes (based on prior probabilities) • Hidden Markov Models (Bayesian network with hidden states) • K Nearest Neighbors (instance-based learning) • Support Vector Machines (discriminator defined by a separating hyperplane) • Additional Ensemble Method Approaches (combining multiple models) • And new methods coming out all the time… Raw Data Clean/Normalize Data Training Set Test Set Build Model Test Apply to New Data (Validation Cohort or Model Application) Tune Model 76
  126. 126. Other Machine Learning Methods • Neural Nets • Ensemble Methods (e.g. bagging, boosting) • Naive Bayes (based on prior probabilities) • Hidden Markov Models (Bayesian network with hidden states) • K Nearest Neighbors (instance-based learning) • Support Vector Machines (discriminator defined by a separating hyperplane) • Additional Ensemble Method Approaches (combining multiple models) • And new methods coming out all the time… Raw Data Clean/Normalize Data Training Set Test Set Build Model Test Apply to New Data (Validation Cohort or Model Application) Tune Model 76 Algorithm Selection is an Important Step!
  127. 127. • ‘Genomical’ and Biology Big Data • Introduction to Machine Learning and R • Machine Learning Algorithms • Applying Machine Learning to Genomics Data + Problems
  128. 128. 78 Improve disease prevention, diagnosis, prognosis, and treatment efficacy
  129. 129. • Which patients are at high risk of developing cancer? • What are early biomarkers of cancer? • Which patients are likely to be short/long term cancer survivors? • What chemotherapeutic might a cancer patient benefit from? 78 Improve disease prevention, diagnosis, prognosis, and treatment efficacy
  130. 130. • Which patients are at high risk of developing cancer? • What are early biomarkers of cancer? • Which patients are likely to be short/long term cancer survivors? • What chemotherapeutic might a cancer patient benefit from? 78 Improve disease prevention, diagnosis, prognosis, and treatment efficacy Complex problems + Big Data —> Machine Learning
  131. 131. 79 Integrating genomic data with machine learning to improve predictive modeling Cross-Cancer Patient Outcome Prediction Model
  132. 132. ‘Common Survival Genes’ across 19 cancers • ‘Common Survival Genes’: Cox regression uncorrected p-value <0.05 for a gene in at least 9/19 cancers • 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation • Clustering by Cox regression p-values: 7 ‘Proliferative Informative Cancers’ and 12 ‘Non-Proliferative Informative Cancers’ [Heatmap: top cross-cancer survival genes, scaled -log10 Cox p-values, across 19 cancers: ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC] 80 Ramaker & Lasseigne, et al. 2017.
  133. 133. ‘Common Survival Genes’ across 19 cancers: Proliferative Informative Cancers (PICs) • ‘Common Survival Genes’: Cox regression uncorrected p-value <0.05 for a gene in at least 9/19 cancers • 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation • Clustering by Cox regression p-values: 7 ‘Proliferative Informative Cancers’ and 12 ‘Non-Proliferative Informative Cancers’ [Heatmap as above, with the 7 PICs indicated] 81 Ramaker & Lasseigne, et al. 2017.
  134. 134. ‘Common Survival Genes’ across 19 cancers: Proliferative Informative Cancers (PICs) and Non-Proliferative Informative Cancers (Non-PICs) • ‘Common Survival Genes’: Cox regression uncorrected p-value <0.05 for a gene in at least 9/19 cancers • 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation • Clustering by Cox regression p-values: 7 ‘Proliferative Informative Cancers’ and 12 ‘Non-Proliferative Informative Cancers’ [Heatmap as above, with the 7 PICs and 12 Non-PICs indicated] 82 Ramaker & Lasseigne, et al. 2017.
  135. 135. 83 Cross-Cancer Patient Outcome Model Ramaker & Lasseigne, et al. 2017.
  136. 136. 83 Cross-Cancer Patient Outcome Model Cox regression with LASSO feature selection ~20,000 gene expression values Cancer Patient Survival Survival~ -0.104 + 0.086*ADAM12 + 0.037*CKS1 - 0.088*CRYL1 + 0.056*DNA2 + 0.013*DONSON + 0.098*HJURP - 0.022*NDRG2 + 0.031*RAD54B + 0.040*SHOX2 - 0.155*SUOX Ramaker & Lasseigne, et al. 2017.
  137. 137. 83 Cross-Cancer Patient Outcome Model Cox regression with LASSO feature selection ~20,000 gene expression values Cancer Patient Survival Survival~ -0.104 + 0.086*ADAM12 + 0.037*CKS1 - 0.088*CRYL1 + 0.056*DNA2 + 0.013*DONSON + 0.098*HJURP - 0.022*NDRG2 + 0.031*RAD54B + 0.040*SHOX2 - 0.155*SUOX Ramaker & Lasseigne, et al. 2017.
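For illustration only, a sketch of how a Cox regression with LASSO feature selection can be set up in glmnet; the expression matrix, survival times, and event indicators below are simulated placeholders, not the TCGA data or the exact pipeline of Ramaker & Lasseigne, et al.:

    library(glmnet)
    set.seed(1)

    # placeholder data: 200 patients x 1,000 genes (the published model started from ~20,000 genes)
    expr   <- matrix(rnorm(200 * 1000), nrow = 200,
                     dimnames = list(NULL, paste0("gene", 1:1000)))
    time   <- rexp(200, rate = 0.1)     # follow-up time
    status <- rbinom(200, 1, 0.6)       # 1 = event (death) observed, 0 = censored

    # cross-validated Cox-LASSO; y is a two-column matrix with 'time' and 'status'
    cvcox <- cv.glmnet(x = expr, y = cbind(time = time, status = status), family = "cox")

    coef(cvcox, s = "lambda.min")       # the nonzero coefficients define the gene signature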
  138. 138. Take-Home Message • Genomics generates big data to address complex biological problems, e.g., improving human disease prevention, diagnosis, prognosis, and treatment efficacy • Machine learning is a data analysis method that automates analytical model building to make data-driven predictions or discover patterns without explicit human intervention • Machine learning is a subfield of computer science —> the algorithms are implemented in code • Machine learning is useful when we have complex problems with lots of ‘big’ data 84
  139. 139. Take-Home Message • Genomics generates big data to address complex biological problems, e.g., improving human disease prevention, diagnosis, prognosis, and treatment efficacy • Machine learning is a data analysis method that automates analytical model building to make data-driven predictions or discover patterns without explicit human intervention • Machine learning is a subfield of computer science —> the algorithms are implemented in code • Machine learning is useful when we have complex problems with lots of ‘big’ data 84 Computer Data Program Output Traditional Programming Computer [2,3] + 5
  140. 140. Take-Home Message • Genomics generates big data to address complex biological problems, e.g., improving human disease prevention, diagnosis, prognosis, and treatment efficacy • Machine learning is a data analysis method that automates analytical model building to make data-driven predictions or discover patterns without explicit human intervention • Machine learning is a subfield of computer science —> the algorithms are implemented in code • Machine learning is useful when we have complex problems with lots of ‘big’ data 84 Computer Data Program Output Traditional Programming Computer [2,3] + 5 Computer Data Output Program Machine Learning Computer [2,3] 5 +
  141. 141. HudsonAlpha: hudsonalpha.org R Programming Language and/or Machine Learning (mostly free): Software Carpentry (software-carpentry.org) and Data Carpentry (datacarpentry.org) coursera.org and datacamp.com Stanford Online’s ‘Statistical Learning’ class Books: Rosalind Franklin: The Dark Lady of DNA by Brenda Maddox (Female scientist biography) The Emperor of All Maladies by Siddhartha Mukherjee (History of cancer) The Gene by Siddhartha Mukherjee (History of genetics) Genome by Matt Ridley (Human Genome) Headstrong: 52 Women Who Changed Science-and the World by Rachel Swaby
  142. 142. 86 Thanks! Brittany N. Lasseigne, PhD @bnlasse blasseigne@hudsonalpha.org
  143. 143. Iris Data: Ensemble Methods Example: tree bagging and boosting • Instead of picking a single model, ensemble methods combine multiple models to fit the training data (‘bagging’ and ‘boosting’) • Random Forest is a Decision Tree Ensemble Method Image: Machado, et al. Veterinary Research, 2015. 87
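A sketch of a tree ensemble on the same iris split, using the randomForest package (package choice is an assumption; a random forest is a bagged ensemble of decision trees):

    library(randomForest)

    rf <- randomForest(Species ~ ., data = iristrain, ntree = 500)   # 500 bootstrap-sampled trees
    print(rf)                            # out-of-bag error estimate and confusion matrix

    rfpred <- predict(rf, iristest)
    table(predicted = rfpred, actual = iristest$Species)

    importance(rf)                       # which measurements the forest relied on most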
  144. 144. Iris Data: Neural Nets • Neural Networks (NNs) emulate how the human brain works with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions • Lots of tuning parameters (# of hidden layers, # of neurons in each layer, and multiple ways to tune learning) • Learning is an iterative feedback mechanism where training data error is used to adjust the corresponding input weights which is propagated back to previous layers (i.e., back-propagation) • NNs are good at learning non-linear functions and can handle multiple outputs, but have a long training time and models are susceptible to local minimum traps (can be mitigated by doing multiple rounds—takes more time!) X1 X2 Output (Summation of Input and Activation with Sigmoid Fxn) ‘Neuron’ 88
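A sketch of a single-hidden-layer network on the iris split, using the nnet package (an assumed, minimal stand-in for the networks described here, not a deep network):

    library(nnet)

    set.seed(1)
    nn <- nnet(Species ~ ., data = iristrain,
               size = 4,        # 4 neurons in the single hidden layer (illustrative choice)
               maxit = 500,     # maximum iterations of weight updates
               trace = FALSE)

    nnpred <- predict(nn, iristest, type = "class")
    table(predicted = nnpred, actual = iristest$Species)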
