This workshop is a hands-on introduction to machine learning with R and was presented on December 8, 2017 at the University of South Carolina for the 2017 Computational Biology Symposium held by the International Society for Computational Biology Regional Student Group-Southeast USA.
1.
Introduction to Machine Learning
Brittany N. Lasseigne, PhD
Senior Scientist
HudsonAlpha Institute for Biotechnology
8 December 2017
@bnlasse blasseigne@hudsonalpha.org
2.
• ‘Genomical’ and Biology Big Data
• Introduction to Machine
Learning and R
• Machine Learning Algorithms
• Applying Machine Learning to
Genomics Data + Problems
3.
• ‘Genomical’ and Biology Big Data
• Introduction to Machine
Learning and R
• Machine Learning Algorithms
• Applying Machine Learning to
Genomics Data + Problems
4.
Biology Big Data
• Molecular and cellular profiling of large numbers of features in large numbers
of samples (‘omics’ data)
• Image processing: cell microscopy, neuroimaging, radiology and histology,
crop imagery, etc.
4
5.
Biology Big Data
• Molecular and cellular profiling of large numbers of features in large numbers
of samples (‘omics’ data)
• Image processing: cell microscopy, neuroimaging, radiology and histology,
crop imagery, etc.
4Esteva, et al. Nature, 2017.
6.
Biology Big Data
• Molecular and cellular profiling of large numbers of features in large numbers
of samples (‘omics’ data)
• Image processing: cell microscopy, neuroimaging, radiology and histology,
crop imagery, etc.
4Esteva, et al. Nature, 2017.
Resources:
• Kan, Machine Learning applications in cell image analysis, Immunology and Cell Biology, 2017
• Angermueller, et al. Deep learning for computational biology, Mol Syst Biol, 2016.
• Ching, et al. Opportunities and Obstacles for Deep Learning in Biology and Medicine, biorxiv, 2017
7.
5American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017.
Complex Human Diseases:
usually caused by a combination of genetic, environmental, and lifestyle factors
(most of which have not yet been identified)
8.
5American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017.
Cancer:
• Men have a 1 in 2 lifetime risk of developing cancer and a 1 in 4 lifetime risk of dying from cancer
• Women have a 1 in 3 lifetime risk of developing cancer and a 1 in 5 lifetime risk of dying from
cancer
Psychiatric Illness:
• 1 in 4 American adults suffer from a diagnosable mental disorder in any given year
• ~6% suffer serious disabilities as a result
Neurodegenerative Disease:
• ~6.5M Americans suffer (AD, PD, MS, ALS, HD), expected to rise to 12M by 2030
Complex Human Diseases:
usually caused by a combination of genetic, environmental, and lifestyle factors
(most of which have not yet been identified)
9.
5American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017.
Cancer:
• Men have a 1 in 2 lifetime risk of developing cancer and a 1 in 4 lifetime risk of dying from cancer
• Women have a 1 in 3 lifetime risk of developing cancer and a 1 in 5 lifetime risk of dying from
cancer
Psychiatric Illness:
• 1 in 4 American adults suffer from a diagnosable mental disorder in any given year
• ~6% suffer serious disabilities as a result
Neurodegenerative Disease:
• ~6.5M Americans suffer (AD, PD, MS, ALS, HD), expected to rise to 12M by 2030
Complex Human Diseases:
usually caused by a combination of genetic, environmental, and lifestyle factors
(most of which have not yet been identified)
10.
• Which patients are high risk for developing cancer?
• What are early biomarkers of cancer?
• Which patients are likely to be short/long-term cancer survivors?
• What chemotherapeutic might a cancer patient benefit from?
6
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
11.
• Which patients are high risk for developing cancer?
• What are early biomarkers of cancer?
• Which patients are likely to be short/long-term cancer survivors?
• What chemotherapeutic might a cancer patient benefit from?
6
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Complex problems
12.
Genomics
• Understanding the function of the
genome (total genetic material) and
how it relates to human disease
(studying all of the genes at once!)
7
13.
Genomics
• Understanding the function of the
genome (total genetic material) and
how it relates to human disease
(studying all of the genes at once!)
• The sequencing of the human
genome paved the way for genomic
studies
7
14.
Genomics
• Understanding the function of the
genome (total genetic material) and
how it relates to human disease
(studying all of the genes at once!)
• The sequencing of the human
genome paved the way for genomic
studies
• Our goal is to identify genetic/genomic
variation associated with disease to
improve patient care
7
17.
Image from encodeproject.org 10
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
18.
Image from encodeproject.org 10
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
19.
Cells, Tissues, & Diseases
Image from encodeproject.org 10
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
20.
Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org 10
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
21.
Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org 10
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
Big Data
22.
Genomics Data is Big Data
11Stephens, et al. PLOS Biology, 2015.
1 zettabyte (ZB) = 1024 EB
1 exabyte (EB) = 1024 PB
1 petabyte (PB) = 1024 TB
1 terabyte (TB) = 1024 GB
23.
12
Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2000+ metadata attributes
• >2.5 Petabytes of data
24.
12
Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2000+ metadata attributes
• >2.5 Petabytes of data
1 Petabyte of Data =
20M four-drawer filing cabinets filled with text
or
13.3 years of HD-TV video
or
~7 billion Facebook photos
or
1 PB of MP3 songs requires ~2,000 years to play
25.
Astronomical ‘Genomical’ Data:
the ‘four-headed beast’ of the data life-cycle (2025 Projections)
13Stephens, et al. PLOS Biology, 2015 and nanalyze.com.
26.
Astronomical ‘Genomical’ Data:
the ‘four-headed beast’ of the data life-cycle (2025 Projections)
13Stephens, et al. PLOS Biology, 2015 and nanalyze.com.
1 zettabyte (ZB) = 1024 EB
1 exabyte (EB) = 1024 PB
1 petabyte (PB) = 1024 TB
1 terabyte (TB) = 1024 GB
27.
Astronomical ‘Genomical’ Data:
the ‘four-headed beast’ of the data life-cycle (2025 Projections)
13Stephens, et al. PLOS Biology, 2015 and nanalyze.com.
1 zettabyte (ZB) = 1024 EB
1 exabyte (EB) = 1024 PB
1 petabyte (PB) = 1024 TB
1 terabyte (TB) = 1024 GB
28.
• ‘Genomical’ and Biology Big Data
• Introduction to Machine
Learning and R
• Machine Learning Algorithms
• Applying Machine Learning to
Genomics Data + Problems
29.
Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org and xorlogics.com. 15
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
30.
Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org and xorlogics.com. 15
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
• We have lots of data and complex problems
• We want to make data-driven predictions
and need to automate model building
31.
Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org and xorlogics.com.
16
Multidimensional Data Sets
Complex problems + Big Data —> Machine Learning!
32.
Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org and xorlogics.com.
16
Multidimensional Data Sets
Complex problems + Big Data —> Machine Learning!
• Allows us to better utilize these increasingly large
data sets to capture their inherent structure
• Algorithms learn by training on data
33.
• Data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when we have complex problems and lots of data (‘big data’)
Machine Learning
17
34.
• Data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when we have complex problems and lots of data (‘big data’)
Machine Learning
17
Computer
Data
Program
Output
Traditional Programming
Computer
[2,3]
+
5
35.
• Data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when we have complex problems and lots of data (‘big data’)
Machine Learning
17
Computer
Data
Program
Output
Traditional Programming
Computer
[2,3]
+
5
Computer
Data
Output
Program
Machine Learning
Computer
[2,3]
5
+
36.
• Data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when we have complex problems and lots of data (‘big data’)
Machine Learning
17
Computer
Data
Program
Output
Traditional Programming
Computer
[2,3]
+
5
Computer
Data
Output
Program
Machine Learning
Computer
[2,3]
5
+
• Our goal isn’t to make perfect guesses, but to make useful guesses—we want to
build a model that is useful for the future
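To make the [2,3] → 5 example concrete, here is a minimal R sketch (not from the slides; the toy data are made up) contrasting the two diagrams:

```r
# Traditional programming: we supply the data AND the program; the computer returns the output
sum(c(2, 3))   # 5

# Machine learning: we supply example inputs and outputs, and the computer learns the "program"
# (here, lm() recovers coefficients of ~1 for x1 and x2, i.e., addition)
examples <- data.frame(x1 = c(2, 1, 4, 3, 5),
                       x2 = c(3, 2, 1, 5, 4))
examples$y <- examples$x1 + examples$x2
fit <- lm(y ~ x1 + x2, data = examples)
coef(fit)      # intercept ~0, x1 ~1, x2 ~1
```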
38.
18
Supervised Learning:
-Prediction
Ex. linear & logistic regression
Unsupervised Learning:
-Find patterns
Ex. Clustering, Principal Component Analysis
Known Data + Known Response
YES
NO
39.
18
Supervised Learning:
-Prediction
Ex. linear & logistic regression
Unsupervised Learning:
-Find patterns
Ex. Clustering, Principal Component Analysis
Known Data + Known Response
YES
NO
MODEL
40.
18
Supervised Learning:
-Prediction
Ex. linear & logistic regression
Unsupervised Learning:
-Find patterns
Ex. Clustering, Principal Component Analysis
Known Data + Known Response
YES
NO
MODEL
NEW DATA
Predict Response
41.
18
Supervised Learning:
-Prediction
Ex. linear & logistic regression
Unsupervised Learning:
-Find patterns
Ex. Clustering, Principal Component Analysis
Known Data + Known Response
YES
NO
MODEL
NEW DATA
Predict Response
Uncategorized Data
42.
18
Supervised Learning:
-Prediction
Ex. linear & logistic regression
Unsupervised Learning:
-Find patterns
Ex. Clustering, Principal Component Analysis
Known Data + Known Response
YES
NO
MODEL
NEW DATA
Predict Response
Clusters of Categorized Data
Uncategorized Data
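As a quick R illustration of the two branches (my own sketch using the built-in iris data, which the later slides introduce in detail):

```r
# Supervised: known data + known response -> model -> predict response for new data
model <- lm(Petal.Width ~ Petal.Length, data = iris)
predict(model, newdata = data.frame(Petal.Length = 4.0))

# Unsupervised: find patterns in uncategorized data
pca      <- prcomp(iris[, 1:4], scale. = TRUE)   # principal component analysis
clusters <- kmeans(iris[, 1:4], centers = 3)     # cluster into 3 groups
table(clusters$cluster, iris$Species)            # how clusters line up with species
```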
43.
Real-World Machine Learning Applications
19
Recommendation Engine
Mail Sorting
Self-Driving Car
44.
The Rise of Machine Learning
• Hardware Advances
• Extreme performance
hardware (ex.
application-specific
integrated circuits)
• Smaller, cheaper
hardware (Moore’s law)
• Cloud computing (ex.
AWS)
• Software Advances
• New machine learning
algorithms including
deep learning and
reinforcement learning
• Data Advances
• High-performance, less
expensive sensors & data
generation
• ex. wearables, next-gen
sequencing, social media
20
45.
The Rise of Machine Learning
• Hardware Advances
• Extreme performance
hardware (ex.
application-specific
integrated circuits)
• Smaller, cheaper
hardware (Moore’s law)
• Cloud computing (ex.
AWS)
• Software Advances
• New machine learning
algorithms including
deep learning and
reinforcement learning
• Data Advances
• High-performance, less
expensive sensors & data
generation
• ex. wearables, next-gen
sequencing, social media
20
46.
The Rise of Machine Learning
• Hardware Advances
• Extreme performance
hardware (ex.
application-specific
integrated circuits)
• Smaller, cheaper
hardware (Moore’s law)
• Cloud computing (ex.
AWS)
• Software Advances
• New machine learning
algorithms including
deep learning and
reinforcement learning
• Data Advances
• High-performance, less
expensive sensors & data
generation
• ex. wearables, next-gen
sequencing, social media
20
We often use R, but Python is also a great choice!
• R tends to be favored by statisticians and academics
(for research)
• Python tends to be favored by engineers (with
production workflows)
47.
• Open-source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical
computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages including parallel/high-performance compute packages
• Used extensively by statisticians and academics
• Popularity is substantially increasing in recent years
• Drawbacks: the learning curve can be steep (though improving), the GUI is limited (use RStudio!),
documentation can be sparse, and memory allocation can be an issue
The R Programming Language
21
48.
• ‘Genomical’ and Biology Big Data
• Introduction to Machine
Learning and R
• Machine Learning Algorithms
• Applying Machine Learning to
Genomics Data + Problems
51.
25
We will be working in the R script panel (top left)
52.
25
We will be working in the R script panel (top left)
53.
Fisher’s/Anderson's iris data set:
measurements (cm) of the sepal length and width, petal length and width, and species (Iris setosa, versicolor,
and virginica) (5 features or variables) for 150 flowers (observations)
Iris Dataset in R
26
54.
27
Data inspection:
summary function, tab out, and ‘Help’ pages
55.
28
Data inspection:
Built-in iris dataset (check out mtcars too!)
56.
29
Data inspection:
execute a line of code with Ctrl+Return or hit the ‘Run’ button
57.
30
Data inspection:
Can also inspect data with the str (structure) function
58.
31
Data inspection:
Can also inspect data with the str (structure) function
59.
32
Data inspection:
And examine the first 5 rows [x,] and first 5 columns [,y]
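Collecting the inspection steps from the preceding slides into one runnable sketch (the comments are mine):

```r
# iris ships with base R; mtcars is another built-in dataset worth exploring
summary(iris)   # per-column summaries; tab completion and ?iris (Help) also work
str(iris)       # structure: 150 obs. of 5 variables (4 numeric + 1 factor)
iris[1:5, ]     # first 5 rows, all columns
iris[, 1:5]     # all rows, first 5 columns (iris has exactly 5 columns)
```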
71.
41
Data modeling:
the lm function
y ~ mx + b
Petal.Width ~ 0.4158*Petal.Length - 0.3631
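A sketch of the lm() call behind those numbers; the coefficients printed on the slide come straight from this fit:

```r
# Model petal width as a linear function of petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)
# (Intercept) Petal.Length
#  -0.3630755    0.4157554   ->  Petal.Width ~ 0.4158*Petal.Length - 0.3631
summary(fit)    # coefficients, R-squared, residuals
```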
72.
42
Data modeling:
the abline function to add regression line to our plot
73.
43
Data modeling:
the abline function to add regression line to our plot
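A sketch of the plotting step (base graphics, as in the slides; the color mirrors the ‘purple line’ described on the next slide):

```r
# Scatterplot of the two petal measurements, then overlay the fitted line
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal.Length (cm)", ylab = "Petal.Width (cm)")
abline(fit, col = "purple", lwd = 2)   # 'fit' is the lm() object from above
```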
74.
Iris Dataset:
Linear Regression is Machine Learning!
• Purple line is a linear regression line
fit to the data describing petal width
as a function of petal length
• We can now PREDICT petal width
given petal length
Petal.Width ~ 0.4158*Petal.Length - 0.3631
(y=mx+b)
44
75.
Iris Dataset:
Linear Regression is Machine Learning!
• Purple line is a linear regression line
fit to the data describing petal width
as a function of petal length
• We can now PREDICT petal width
given petal length
Petal.Width ~ 0.4158*Petal.Length - 0.3631
(y=mx+b)
Computer
Data
Output
Program
Machine Learning
44
76.
Iris Dataset:
Linear Regression is Machine Learning!
• Purple line is a linear regression line
fit to the data describing petal width
as a function of petal length
• We can now PREDICT petal width
given petal length
Petal.Width ~ 0.4158*Petal.Length - 0.3631
(y=mx+b)
Computer
Data
Output
Program
Machine Learning
Computer
Petal.Length
Petal.Width
Petal.Width ~
0.4158*Petal.Length -
0.3631
44
82.
50
Data wrangling:
the palette function returns (or sets) the vector of default colors used for plotting in R
83.
51
Data inspection:
plotting with data points colored by species (setosa, versicolor, virginica)
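A sketch of the coloring trick: because Species is a factor, passing it to col picks the 1st, 2nd, and 3rd palette colors for the three species (the exact default colors depend on your R version):

```r
palette()   # the current vector of default plotting colors
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species,
     xlab = "Petal.Length (cm)", ylab = "Petal.Width (cm)")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 1)
```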
84.
Train an algorithm to classify
Iris flowers by species
52
Fisher’s Iris Data
n=150
Training Set
n=105
Test Set
n=45
70% 30%
85.
53
Defining training and test sets:
use nrow function to code the total number of observations in the Iris dataset
86.
54
Defining training and test sets:
use sample function to assign observations to the training set
Note: I did not set a seed for this tutorial so you may get slightly different results. For more about setting seeds,
see here: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
87.
55
Defining training and test sets:
use sample function to assign observations to the training set,
x is 1:n
88.
56
Defining training and test sets:
use sample function to assign observations to the training set,
size is round(0.7*n) -> 0.7*150 = 105
89.
57
Defining training and test sets:
assign the 105 selected observations to the training set
90.
58
Defining training and test sets:
assign the observations not selected for training to the test set (the remaining 45 observations)
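The preceding slides combined into one runnable sketch of the 70/30 split (no seed is set, matching the tutorial, so your exact split will vary):

```r
n <- nrow(iris)                                   # 150 observations
trainrows <- sample(1:n, size = round(0.7 * n))   # 105 rows sampled without replacement
iristrain <- iris[trainrows, ]                    # training set, n = 105
iristest  <- iris[-trainrows, ]                   # test set, the remaining 45 observations
# set.seed(<any integer>) before sample() for a reproducible split
```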
91.
59
Fisher’s Iris Data
n=150
Training Set: “iristrain”
n=105
Test Set: “iristest”
n=45
70% 30%
Train an algorithm to classify
Iris flowers by species
92.
Iris Data: Adding Regularization
•Model building with a large # of features/variables for a moderate number of
observations can result in ‘overfitting’ —the model is too specific to the
training set and not generalizable enough for accurate predictions with new
data
60
93.
Iris Data: Adding Regularization
•Model building with a large # of features/variables for a moderate number of
observations can result in ‘overfitting’ —the model is too specific to the
training set and not generalizable enough for accurate predictions with new
data
•Regularization is a technique for preventing this by introducing tuning
parameters that penalize the coefficients of features/variables that are
linearly dependent (redundant)
•This results in FEATURE SELECTION
60
94.
Iris Data: Adding Regularization
•Model building with a large # of features/variables for a moderate number of
observations can result in ‘overfitting’ —the model is too specific to the
training set and not generalizable enough for accurate predictions with new
data
•Regularization is a technique for preventing this by introducing tuning
parameters that penalize the coefficients of features/variables that are
linearly dependent (redundant)
•This results in FEATURE SELECTION
•Example methods of regression with regularization: ridge, elastic net, LASSO
60
95.
LASSO:
Least Absolute Shrinkage and Selection Operator:
• Linear regression:
predictive analysis fitting a single
line through data to describe
relationships between one
dependent variable and one or
more independent variables
61
Credit Card Balance
Credit Limit
Images: Adapted from Tibshirani, et al.
96.
LASSO:
Least Absolute Shrinkage and Selection Operator:
• Linear regression:
predictive analysis fitting a single
line through data to describe
relationships between one
dependent variable and one or
more independent variables
• LASSO regression:
perform variable selection by
including a penalty that forces some
coefficient estimates to be exactly
zero based on a tuning parameter
(λ), yielding a sparse model
61
Credit Card Balance
Credit Limit
Images: Adapted from Tibshirani, et al.
97.
LASSO:
Least Absolute Shrinkage and Selection Operator:
• Linear regression:
predictive analysis fitting a single
line through data to describe
relationships between one
dependent variable and one or
more independent variables
• LASSO regression:
perform variable selection by
including a penalty that forces some
coefficient estimates to be exactly
zero based on a tuning parameter
(λ), yielding a sparse model
61
Credit Card Balance
Credit Limit
Images: Adapted from Tibshirani, et al.
98.
LASSO:
Least Absolute Shrinkage and Selection Operator:
• Linear regression:
predictive analysis fitting a single
line through data to describe
relationships between one
dependent variable and one or
more independent variables
• LASSO regression:
perform variable selection by
including a penalty that forces some
coefficient estimates to be exactly
zero based on a tuning parameter
(λ), yielding a sparse model
61
Credit Card Balance
Credit Limit
Images: Adapted from Tibshirani, et al.
99.
LASSO Tuning Parameter Selection
• Select tuning
parameter by cross-
validation:
– Partition data
multiple times
– Compute cross-
validation error rate
for each tuning
parameter
– Select tuning
parameter value
with smallest error
62
Example: 5-Fold Cross-Validation
Image: goldenhelix.com
100.
63
Building a model for Iris species prediction:
the glmnet package
101.
64
Building a model for Iris species prediction:
the cv.glmnet function
Note: I did not set a seed for this tutorial so you may get slightly different results. For more about setting seeds,
see here: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
102.
65
Building a model for Iris species prediction:
the cv.glmnet function on the training set
as.matrix(iristrain[,-5])
iristrain[,5]
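A sketch of the fitting call; the x and y arguments are the ones shown on the slide, and the multinomial family is my addition since Species has three classes:

```r
# install.packages("glmnet") if it is not already installed
library(glmnet)

x <- as.matrix(iristrain[, -5])   # predictors: the four flower measurements
y <- iristrain[, 5]               # response: Species (a 3-level factor)
cvfit <- cv.glmnet(x, y, family = "multinomial")   # LASSO (alpha = 1 is the default)
```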
103.
66
Building a model for Iris species prediction:
use predict function to evaluate model in test set
104.
67
Building a model for Iris species prediction:
use table function to view predicted species vs. actual species
105.
68
Building a model for Iris species prediction:
view resulting predict object
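A sketch of the evaluation steps; s = "lambda.min" and type = "class" are my choices for returning species labels at the best cross-validated penalty:

```r
pred <- predict(cvfit, newx = as.matrix(iristest[, -5]),
                s = "lambda.min", type = "class")
table(predicted = pred, actual = iristest$Species)   # confusion matrix: predicted vs. actual species
```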
106.
69
Building a model for Iris species prediction:
examine cv.glmnet object
107.
71
Building a model for Iris species prediction:
plot cv.glmnet object
108.
71
Building a model for Iris species prediction:
plot cv.glmnet object
# of predictors in the model
Error
Tuning Parameter Penalty
109.
71
Building a model for Iris species prediction:
plot cv.glmnet object
# of predictors in the model
Error
Tuning Parameter Penalty
λmin
Lambda with
minimum cross-
validated error
λ1SE
Largest lambda where
error w/in 1 standard
error of minimum
110.
72
Building a model for Iris species prediction:
comparing coefficients at lambda.min and lambda.1se
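A sketch of inspecting and plotting the cv.glmnet object; coef() at the two standard lambda choices shows which features the penalty has zeroed out:

```r
plot(cvfit)                     # cross-validated error vs. log(lambda); top axis = # of predictors
cvfit$lambda.min                # lambda with minimum cross-validated error
cvfit$lambda.1se                # largest lambda within 1 SE of that minimum
coef(cvfit, s = "lambda.min")   # per-class coefficients; some are exactly 0
coef(cvfit, s = "lambda.1se")   # sparser model under the stronger penalty
```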
111.
Iris Data: Adding Regularization (LASSO)
• Model building with a large # of
features for a moderate number of
samples can result in ‘overfitting’ —
the model is too specific to the
training set and not generalizable
enough for accurate predictions
with new data
• Regularization is a technique for
preventing this by introducing tuning
parameters that penalize the
coefficients of variables that are
linearly dependent (redundant) -> Feature Selection
73
112.
Iris Data: Adding Regularization (LASSO)
• Model building with a large # of
features for a moderate number of
samples can result in ‘overfitting’ —
the model is too specific to the
training set and not generalizable
enough for accurate predictions
with new data
• Regularization is a technique for
preventing this by introducing tuning
parameters that penalize the
coefficients of variables that are
linearly dependent (redundant) -> Feature Selection
73
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
113.
Iris Data: Adding Regularization (LASSO)
• Model building with a large # of
features for a moderate number of
samples can result in ‘overfitting’ —
the model is too specific to the
training set and not generalizable
enough for accurate predictions
with new data
• Regularization is a technique for
preventing this by introducing tuning
parameters that penalize the
coefficients of variables that are
linearly dependent (redundant) -> Feature Selection
73
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
0 0
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
114.
Iris Data: Adding Regularization (LASSO)
• Model building with a large # of
features for a moderate number of
samples can result in ‘overfitting’ —
the model is too specific to the
training set and not generalizable
enough for accurate predictions
with new data
• Regularization is a technique for
preventing this by introducing tuning
parameters that penalize the
coefficients of variables that are
linearly dependent (redundant) -> Feature Selection
73
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
0 0
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
115.
Iris Data: Adding Regularization (LASSO)
• Model building with a large # of
features for a moderate number of
samples can result in ‘overfitting’ —
the model is too specific to the
training set and not generalizable
enough for accurate predictions
with new data
• Regularization is a technique for
preventing this by introducing tuning
parameters that penalize the
coefficients of variables that are
linearly dependent (redundant) -> Feature Selection
73
Computer
Petal.Length
Sepal.Width
Sepal.Length
Petal.Width
Species
Species(setosa)~
1.58*Sepal.Width +
-2.36*Petal.Length
+ 5.96
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
0 0
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C* Sepal.Length + D* Petal.Width + b
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
116.
Iris Data: Decision Trees
• Decision trees can take different
data types (categorical, binary,
numeric) as input/output
variables, handle missing data
and outliers well, and are intuitive
• Decision tree limitations include
that each decision boundary at
each split is a concrete binary
decision and the decision criteria
only consider one input feature at
a time (not a combination of
multiple input features)
• Examples: Video games, clinical
decision models
74
117.
Iris Data: Decision Trees
• Decision trees can take different
data types (categorical, binary,
numeric) as input/output
variables, handle missing data
and outliers well, and are intuitive
• Decision tree limitations include
that each decision boundary at
each split is a concrete binary
decision and the decision criteria
only consider one input feature at
a time (not a combination of
multiple input features)
• Examples: Video games, clinical
decision models
74
118.
Iris Data: Decision Trees
• Decision trees can take different
data types (categorical, binary,
numeric) as input/output
variables, handle missing data
and outliers well, and are intuitive
• Decision tree limitations include
that each decision boundary at
each split is a concrete binary
decision and the decision criteria
only consider one input feature at
a time (not a combination of
multiple input features)
• Examples: Video games, clinical
decision models
74
Petal.Length < 2.35 cm
Setosa (40/0/0)
119.
Iris Data: Decision Trees
• Decision trees can take different
data types (categorical, binary,
numeric) as input/output
variables, handle missing data
and outliers well, and are intuitive
• Decision tree limitations include
that each decision boundary at
each split is a concrete binary
decision and the decision criteria
only consider one input feature at
a time (not a combination of
multiple input features)
• Examples: Video games, clinical
decision models
74
Petal.Length < 2.35 cm
Setosa (40/0/0)
Petal.Width < 1.65 cm
Versicolor (0/40/12) Virginica (0/0/28)
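The slides don't name a package; a sketch using rpart (one common choice) reproduces a tree like the one shown, with a Petal.Length split followed by a Petal.Width split (exact cutpoints depend on the training sample):

```r
# install.packages(c("rpart", "rpart.plot")) if needed
library(rpart)
library(rpart.plot)

tree <- rpart(Species ~ ., data = iristrain, method = "class")
rpart.plot(tree)                                       # draw the decision tree
treepred <- predict(tree, iristest, type = "class")
table(predicted = treepred, actual = iristest$Species) # confusion matrix on the test set
```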
120.
Deep Learning (i.e. neural nets)
• Subfield of machine learning describing ‘human-like AI’
• Algorithms are structured in layers to create artificial neural
networks to learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of
concepts with each defined in relation to simpler concepts
75
X1
X2
Output
(Summation of
Input and
Activation with
Sigmoid Fxn)
‘Neuron’
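A minimal sketch of the single ‘neuron’ in the diagram: a weighted sum of the inputs passed through a sigmoid activation (the weights and bias here are arbitrary illustrative values):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

neuron <- function(x1, x2, w1 = 0.8, w2 = -0.5, bias = 0.1) {
  sigmoid(w1 * x1 + w2 * x2 + bias)   # summation of weighted inputs, then activation
}
neuron(x1 = 1, x2 = 2)   # a value between 0 and 1
```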
121.
Deep Learning (i.e. neural nets)
• Subfield of machine learning describing ‘human-like AI’
• Algorithms are structured in layers to create artificial neural
networks to learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of
concepts with each defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
• need a lot more data to perform well
• need more/better hardware
• typically identify and extract features without human
intervention
• usually solve problems end-to-end instead of in parts
• take a lot longer to train
• are typically less interpretable
75
X1
X2
Output
(Summation of
Input and
Activation with
Sigmoid Fxn)
‘Neuron’
122.
Deep Learning (i.e. neural nets)
• Subfield of machine learning describing ‘human-like AI’
• Algorithms are structured in layers to create artificial neural
networks to learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of
concepts with each defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
• need a lot more data to perform well
• need more/better hardware
• typically identify and extract features without human
intervention
• usually solve problems end-to-end instead of in parts
• take a lot longer to train
• are typically less interpretable
• Ex: Deep learning to automate resume scoring
• Scoring performance may be excellent (i.e. near human
performance)
• Does not reveal why a particular applicant was given a score
• Mathematically you can find out which nodes of the network
were activated, but we don’t know what those neurons were
supposed to model or what the layers of neurons were doing
collectively
• Interpretation is difficult
75
X1
X2
Output
(Summation of
Input and
Activation with
Sigmoid Fxn)
‘Neuron’
123.
Deep Learning (i.e. neural nets)
• Subfield of machine learning describing ‘human-like AI’
• Algorithms are structured in layers to create artificial neural
networks to learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of
concepts with each defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
• need a lot more data to perform well
• need more/better hardware
• typically identify and extract features without human
intervention
• usually solve problems end-to-end instead of in parts
• take a lot longer to train
• are typically less interpretable
• Ex: Deep learning to automate resume scoring
• Scoring performance may be excellent (i.e. near human
performance)
• Does not reveal why a particular applicant was given a score
• Mathematically you can find out which nodes of the network
were activated, but we don’t know what those neurons were
supposed to model or what the layers of neurons were doing
collectively
• Interpretation is difficult
75
X1
X2
Output
(Summation of
Input and
Activation with
Sigmoid Fxn)
‘Neuron’
124.
Other Machine Learning Methods
• Neural Nets
• Ensemble Methods (e.g. bagging, boosting)
• Naive Bayes (based on prior probabilities)
• Hidden Markov Models (Bayesian network
with hidden states)
• K Nearest Neighbors (instance-based
learning—clustering!)
• Support Vector Machines (discriminator
defined by a separating hyperplane)
• Additional Ensemble Method Approaches
(combining multiple models)
• And new methods coming out all the time…
76
125.
Other Machine Learning Methods
• Neural Nets
• Ensemble Methods (e.g. bagging, boosting)
• Naive Bayes (based on prior probabilities)
• Hidden Markov Models (Bayesian network
with hidden states)
• K Nearest Neighbors (instance-based
learning—clustering!)
• Support Vector Machines (discriminator
defined by a separating hyperplane)
• Additional Ensemble Method Approaches
(combining multiple models)
• And new methods coming out all the time…
Raw Data
Clean/Normalize Data
Training Set Test Set
Build Model
Test
Apply to New Data
(Validation Cohort or
Model Application)
Tune Model
76
126.
Other Machine Learning Methods
• Neural Nets
• Ensemble Methods (e.g. bagging, boosting)
• Naive Bayes (based on prior probabilities)
• Hidden Markov Models (Bayesian network
with hidden states)
• K Nearest Neighbors (instance-based
learning—clustering!)
• Support Vector Machines (discriminator
defined by a separating hyperplane)
• Additional Ensemble Method Approaches
(combining multiple models)
• And new methods coming out all the time…
Raw Data
Clean/Normalize Data
Training Set Test Set
Build Model
Test
Apply to New Data
(Validation Cohort or
Model Application)
Tune Model
76
Algorithm Selection is an Important Step!
127.
• ‘Genomical’ and Biology Big Data
• Introduction to Machine
Learning and R
• Machine Learning Algorithms
• Applying Machine Learning to
Genomics Data + Problems
128.
78
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
129.
• Which patients are high risk for
developing cancer?
• What are early biomarkers of
cancer?
• Which patients are likely to be
short/long-term cancer survivors?
• What chemotherapeutic might a
cancer patient benefit from?
78
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
130.
• Which patients are high risk for
developing cancer?
• What are early biomarkers of
cancer?
• Which patients are likely to be
short/long-term cancer survivors?
• What chemotherapeutic might a
cancer patient benefit from?
78
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Complex problems + Big Data —>
Machine Learning
131.
79
Integrating genomic data with machine learning to improve
predictive modeling
Cross-Cancer Patient Outcome Prediction Model
132.
[Heatmap color scale: scaled -log10 Cox p-value, from -1 to 3]
‘Common Survival Genes’ across 19 cancers
• ‘Common Survival Genes’
Cox regression uncorrected p-value
<0.05 for a gene in at least 9/19
cancers:
• 84 genes, enriched for
proliferation-related processes
including mitosis, cell and
nuclear division, and spindle
formation
• Clustering by Cox regression p-
values:
7 ‘Proliferative Informative Cancers’
and 12 ‘Non-Proliferative Informative
Cancers’
80
[Heatmap rows: top cross-cancer survival genes; columns: 19 TCGA cancer types (ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC)]
Ramaker & Lasseigne, et al. 2017.
133.
[Heatmap color scale: scaled -log10 Cox p-value, from -1 to 3]
‘Common Survival Genes’ across 19 cancers
Proliferative Informative Cancers
(PICs)
81
[Heatmap rows: top cross-cancer survival genes; columns: 19 TCGA cancer types (ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC)]
• ‘Common Survival Genes’
Cox regression uncorrected p-value
<0.05 for a gene in at least 9/19
cancers:
• 84 genes, enriched for
proliferation-related processes
including mitosis, cell and
nuclear division, and spindle
formation
• Clustering by Cox regression p-
values:
7 ‘Proliferative Informative Cancers’
and 12 ‘Non-Proliferative Informative
Cancers’
Ramaker & Lasseigne, et al. 2017.
134.
[Heatmap color scale: scaled -log10 Cox p-value, from -1 to 3]
‘Common Survival Genes’ across 19 cancers
Proliferative Informative Cancers
(PICs)
82
[Heatmap rows: top cross-cancer survival genes; columns: 19 TCGA cancer types (ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC)]
Non-Proliferative Informative Cancers
(Non-PICs)
• ‘Common Survival Genes’
Cox regression uncorrected p-value
<0.05 for a gene in at least 9/19
cancers:
• 84 genes, enriched for
proliferation-related processes
including mitosis, cell and
nuclear division, and spindle
formation
• Clustering by Cox regression p-
values:
7 ‘Proliferative Informative Cancers’
and 12 ‘Non-Proliferative Informative
Cancers’
Ramaker & Lasseigne, et al. 2017.
135.
83
Cross-Cancer Patient Outcome Model
Ramaker & Lasseigne, et al. 2017.
138.
Take-Home Message
• Genomics generates big data to address complex biological problems, e.g., improving human
disease prevention, diagnosis, prognosis, and treatment efficacy
• Machine learning is a data analysis method that automates analytical model building to make
data-driven predictions or discover patterns without explicit human intervention
• Machine learning is a subfield of computer science -> the algorithms are implemented in code
• Machine learning is useful when we have complex problems with lots of ‘big’ data
84
139.
Take-Home Message
• Genomics generates big data to address complex biological problems, e.g., improving human
disease prevention, diagnosis, prognosis, and treatment efficacy
• Machine learning is a data analysis method that automates analytical model building to make
data-driven predictions or discover patterns without explicit human intervention
• Machine learning is a subfield of computer science -> the algorithms are implemented in code
• Machine learning is useful when we have complex problems with lots of ‘big’ data
84
Computer
Data
Program
Output
Traditional Programming
Computer
[2,3]
+
5
140.
Take-Home Message
• Genomics generates big data to address complex biological problems, e.g., improving human
disease prevention, diagnosis, prognosis, and treatment efficacy
• Machine learning is a data analysis method that automates analytical model building to make
data-driven predictions or discover patterns without explicit human intervention
• Machine learning is a subfield of computer science -> the algorithms are implemented in code
• Machine learning is useful when we have complex problems with lots of ‘big’ data
84
Computer
Data
Program
Output
Traditional Programming
Computer
[2,3]
+
5
Computer
Data
Output
Program
Machine Learning
Computer
[2,3]
5
+
141.
HudsonAlpha:
hudsonalpha.org
R Programming Language and/or Machine Learning (mostly free):
Software Carpentry (software-carpentry.org) and Data Carpentry (datacarpentry.org)
coursera.org and datacamp.com
Stanford Online’s ‘Statistical Learning’ class
Books:
Rosalind Franklin: The Dark Lady of DNA by Brenda Maddox (Female scientist biography)
The Emperor of All Maladies by Siddhartha Mukherjee (History of cancer)
The Gene by Siddhartha Mukherjee (History of genetics)
Genome by Matt Ridley (Human Genome)
Headstrong: 52 Women Who Changed Science-and the World by Rachel Swaby
142.
86
Thanks!
Brittany N. Lasseigne, PhD
@bnlasse blasseigne@hudsonalpha.org
143.
Iris Data: Ensemble Methods
Example: tree bagging and boosting
• Instead of picking a single model, ensemble methods
combine multiple models to fit the training data
(‘bagging’ and ‘boosting’)
• Random Forest is a Decision Tree Ensemble Method
Image: Machado, et al. Veterinary Research, 2015. 87
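A sketch using randomForest, one widely used tree-ensemble implementation (the package and settings are my choice; the slide only names the method):

```r
# install.packages("randomForest") if needed
library(randomForest)

rf <- randomForest(Species ~ ., data = iristrain, ntree = 500)
rfpred <- predict(rf, iristest)
table(predicted = rfpred, actual = iristest$Species)   # confusion matrix on the test set
```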
144.
Iris Data: Neural Nets
• Neural Networks (NNs) emulate how the
human brain works with a network of
interconnected neurons (essentially
logistic regression units) organized in
multiple layers, allowing more complex,
abstract, and subtle decisions
• Lots of tuning parameters (# of hidden
layers, # of neurons in each layer, and
multiple ways to tune learning)
• Learning is an iterative feedback
mechanism where training data error is
used to adjust the corresponding input
weights which is propagated back to
previous layers (i.e., back-propagation)
• NNs are good at learning non-linear
functions and can handle multiple
outputs, but have a long training time and
models are susceptible to local minimum
traps (can be mitigated by doing multiple
rounds—takes more time!)
X1
X2
Output
(Summation of
Input and
Activation with
Sigmoid Fxn)
‘Neuron’
88
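A sketch using the nnet package (a single-hidden-layer network; the package and tuning values are my choice for illustration):

```r
# install.packages("nnet") if needed
library(nnet)

nn <- nnet(Species ~ ., data = iristrain,
           size = 5,          # 5 neurons in the hidden layer
           maxit = 500,       # allow more training iterations
           trace = FALSE)     # suppress the iteration log
nnpred <- predict(nn, iristest, type = "class")
table(predicted = nnpred, actual = iristest$Species)
```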