This document provides an introduction to machine learning and its applications in genomics and biology. It discusses how biology and genomics data have become "big data" due to technological advances in sequencing and data generation. Machine learning is well-suited for analyzing these large, multidimensional datasets and addressing complex biological questions. The document outlines different machine learning approaches like supervised and unsupervised learning, and provides examples of real-world applications. R and Python are introduced as popular programming languages for machine learning.
Deep Learning Explained: The future of Artificial Intelligence and Smart Netw...Melanie Swan
This talk provides an overview of an important emerging artificial intelligence technology, deep learning neural networks. Deep learning is a branch of computer science focused on machine learning algorithms that model and make predictions about data. A key distinction is that deep learning is not merely a software program, but a new class of information technology that is changing the concept of the modern technology project by replacing hard-coded software with a capacity to learn and execute tasks. In the future, deep learning smart networks might comprise a global computational infrastructure tackling real-time data science problems such as global health monitoring, energy storage and transmission, and financial risk assessment.
Women Who Code-HSV Event:
'An Introduction to Machine Learning and Genomics'. Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning with real-world examples including applications in the field of genomics with an emphasis on complex human disease research.
Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.
BIG DATA AND MACHINE LEARNING
Big Data refers to collections of data so large in volume, and growing so rapidly over time, that traditional data management tools cannot store or process them efficiently.
This is a very simple introduction to clustering with some real-world examples. At the end of the lecture I use the Stack Overflow API to test some clustering. I also wanted to try Facebook, but there were some problems with its API.
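The clustering idea itself fits in a few lines; below is a minimal, hedged 1-D k-means sketch in Python (a toy illustration with made-up numbers, not the Stack Overflow demo from the lecture):

```python
# Minimal 1-D k-means: assign each point to its nearest center, then
# move each center to the mean of its assigned points, and repeat.
def kmeans_1d(points, iters=10):
    # Deterministic toy initialization for k=2: start at the extremes.
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.7])
```

On this toy data the two centers settle near the two obvious groups around 1 and 9.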
Random Forest Classifier in Machine Learning | Palin AnalyticsPalin analytics
Random Forest is a supervised ensemble learning algorithm. Ensemble algorithms combine multiple algorithms of the same or different kinds to classify objects.
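The ensemble idea can be sketched in a few lines of Python. The following hedged toy trains simple threshold "stumps" on bootstrap resamples of 1-D data and classifies by majority vote; it illustrates bagging plus voting, not a real random forest implementation, and the data is made up:

```python
import random

# Toy random-forest-style ensemble: train threshold "stumps" on
# bootstrap resamples of the data, then classify by majority vote.
def train_stump(sample):
    best_t, best_acc = None, -1.0
    for t, _ in sample:  # candidate thresholds taken from sampled points
        acc = sum((x > t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def forest_predict(stumps, x):
    votes = sum(x > t for t in stumps)
    return votes * 2 > len(stumps)  # majority vote of the ensemble

random.seed(0)
data = [(0.1, False), (0.3, False), (0.4, False),
        (1.1, True), (1.3, True), (1.5, True)]
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]
preds = [forest_predict(stumps, x) for x, _ in data]
```

Individual stumps can be wrong on borderline points, but the majority vote is usually correct, which is the point of the ensemble.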
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hakky St
This is the documentation of a study meeting in our lab.
The book is "Hands-On Machine Learning with Scikit-Learn and TensorFlow", and this covers Chapter 8.
And then there were ... Large Language ModelsLeon Dohmen
It is not often, even in the ICT world, that one witnesses a revolution. The rise of the Personal Computer, the rise of mobile telephony and, of course, the rise of the Internet are some of those revolutions. So what is ChatGPT really? Is ChatGPT also such a revolution? And like any revolution, does ChatGPT have its winners and losers? And who are they? How do we ensure that ChatGPT contributes to a positive impulse for "Smart Humanity"?
During a keynote on April 3 and 13, 2023, Piek Vossen explained the impact of Large Language Models like ChatGPT.
Prof. Piek Th.J.M. Vossen, PhD, is Full Professor of Computational Lexicology at the Faculty of Humanities, Department of Language, Literature and Communication (LCC) at VU Amsterdam:
What is ChatGPT? What technology and thought processes underlie it? What are its consequences? What choices are being made? In the presentation, Piek will elaborate on the basic principles behind Large Language Models and how they are used as a basis for Deep Learning in which they are fine-tuned for specific tasks. He will also discuss a specific variant GPT that underlies ChatGPT. It covers what ChatGPT can and cannot do, what it is good for and what the risks are.
An Introduction to Generative AI - May 18, 2023CoriFaklaris1
For this plenary talk at the Charlotte AI Institute for Smarter Learning, Dr. Cori Faklaris introduces her fellow college educators to the exciting world of generative AI tools. She gives a high-level overview of the generative AI landscape and how these tools use machine learning algorithms to generate creative content such as music, art, and text. She then shares examples of generative AI tools and demonstrates how she has used some of them to enhance teaching and learning in the classroom and to boost her productivity in other areas of academic life.
Many powerful Machine Learning algorithms are based on graphs, e.g., Page Rank (Pregel), Recommendation Engines (collaborative filtering), text summarization, and other NLP tasks. Also, the recent developments with Graph Neural Networks connect the worlds of Graphs and Machine Learning even further.
Considering data pre-processing and feature engineering which are both vital tasks in Machine Learning Pipelines extends this relationship across the entire ecosystem. In this session, we will investigate the entire range of Graphs and Machine Learning with many practical exercises.
These slides were presented at a meetup in Kansas City by Bahador Khaleghi of H2O.ai.
More details can be viewed here: https://www.meetup.com/Kansas-City-Artificial-Intelligence-Deep-Learning/events/265662978/
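Since the session names PageRank as a graph-based algorithm, here is a hedged, minimal power-iteration sketch in Python on a toy three-node graph (assuming no dangling nodes; the graph and node names are illustrative):

```python
# Power-iteration PageRank on a tiny directed graph.
# `graph` maps each node to the list of nodes it links to.
def pagerank(graph, damping=0.85, iters=50):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        # Each node keeps a base share and receives a damped share
        # of rank from every node that links to it.
        new = {node: (1.0 - damping) / n for node in graph}
        for node, outlinks in graph.items():
            share = damping * rank[node] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(graph)
```

Node C collects links from both A and B, so it ends up with the highest score; the ranks always sum to one.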
Can we use data to train machine learning models and perform statistical analysis without putting private data at risk? Tools and techniques such as Federated Learning, Differential Privacy, and Homomorphic Encryption enable safer work on the data.
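As a hedged illustration of one of these techniques, differential privacy's Laplace mechanism can be sketched in a few lines of Python (toy code, not a production DP library; the dataset, query, and epsilon value are made up):

```python
import math
import random

# Laplace mechanism sketch: answer a counting query with noise drawn
# from Laplace(0, sensitivity/epsilon). A count query has sensitivity 1,
# since adding or removing one record changes the count by at most 1.
def laplace_noise(scale, rng):
    u = rng.random() - 0.5          # uniform in [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sample

def private_count(records, predicate, epsilon, rng):
    true_count = sum(predicate(r) for r in records)
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 47, 31]
noisy = private_count(ages, lambda a: a > 30, epsilon=0.5, rng=rng)
```

Any single release is noisy, but the noise is unbiased: averaged over many releases, the answers center on the true count, while smaller epsilon means more noise and stronger privacy.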
Our fall 12-Week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meetup and learn how easily you can use R for advanced machine learning. In this meetup, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will join us remotely through Google Hangouts.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com.
Pre-requisite (if any): R / Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
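XGBoost's core idea, gradient boosting, can be sketched independently of the package. The following hedged Python toy (the talk itself uses the R package; the data here is made up) fits depth-1 regression stumps to the residuals of the current ensemble under squared loss:

```python
# Toy gradient boosting for squared loss on 1-D data: each round fits a
# depth-1 regression "stump" to the residuals of the current ensemble.
def fit_stump(xs, residuals):
    best = None
    for t in xs:  # candidate split points taken from the data itself
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def boost(xs, ys, rounds=50, lr=0.1):
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + lr * (lmean if x <= t else rmean)
                 for x, p in zip(xs, preds)]
    return preds

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
preds = boost(xs, ys)
```

Each round reduces the remaining residual, so after enough rounds the training error is small; real XGBoost adds regularization, deeper trees, and a second-order loss approximation on top of this scheme.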
Week 1 lecture for High School Bioinformatics course; covers why we need to use computers in biology, what bioinformatics/computational biology is, an introduction to machine learning, and examples from current research
Frankie Rybicki slide set for Deep Learning in Radiology / MedicineFrank Rybicki
These are my #AI slides for medical deep learning using #radiology and medical imaging examples. Please use them & modify to teach your own group about medical AI.
Keynote presented at the Phenotype Foundation first annual meeting.
Describes data sharing, data annotation, and the need for further development of tools, ontologies, and ontology mappings.
Amsterdam, January 18, 2016
The Uneven Future of Evidence-Based MedicineIda Sim
An Apple ResearchKit study enrolled 22,000 people in five days. A study claims that Twitter can be used to identify depressed patients. A computer program crunches genomic data, the published literature, and electronic health record data to guide cancer treatment. The pace, the data sources, and the methods for generating medical evidence are changing radically. What will — what should — evidence-based medicine look like in a faster, personalized, data-dense tomorrow?
- Presented as the 3rd Annual Cochrane Lecture, October 2015 in Vienna, Austria.
Towards automated phenotypic cell profiling with high-content imagingOla Spjuth
Presentation by Ola Spjuth (Uppsala University and Scaleout) at the Chemical Biology Seminar Series, February 6th, at Karolinska Institutet and Science for Life Laboratory, Stockholm, Sweden.
ABSTRACT
Phenotypic profiling of cells with high-content imaging is emerging as an important methodology with high predictive power. The true power of these methods comes when they are integrated into automated, robotized systems that can run continuously rather than being restricted to batch analysis. One of the main challenges then becomes how to manage and continuously analyze the large amounts of data produced. In this talk I will present our efforts to establish an automated lab for cell profiling of drugs using multiplexed fluorescence imaging (Cell Painting). I will describe our computational and lab infrastructure as well as the systems, tools, and methods we are developing to sustain continuous profiling of cells and continuous AI modeling. A key objective in the group is improving screening and toxicity assessment, but also exploring predictions of mechanisms and pathways. The long-term goal is to build a closed-loop system where results from analyses are used by an AI system to design the next round of experiments and iteratively improve the confidence in predictions. Research website: https://pharmb.io
Genome sharing projects around the world nijmegen oct 29 - 2015Fiona Nielsen
Genome sharing projects across the world
Did you ever wonder what happened to the exponential increase in genome sequencing data? It is out there around the world and a lot of it is consented for research use. This means that if you just know where to find the data, you can potentially analyse gigabytes of data to power your research.
In this talk Fiona will present community genome initiatives, the genome sharing projects across the world, how you can benefit from this wealth of data in your work, and how you can boost your academic career by sharing and collaboration.
by Fiona Nielsen, Founder and CEO of DNAdigest and Repositive
With a background in software development, Fiona pursued her career in bioinformatics research at Radboud University Nijmegen. Now a scientist turned entrepreneur, Fiona founded DNAdigest and its social enterprise spin-out, Repositive Ltd. Both the charity and the company focus on efficient and ethical sharing of genetic data for research to accelerate diagnostics and cures for genetic diseases.
Using Bioinformatics Data to inform Therapeutics discovery and developmentEleanor Howe
Diamond Age Data Science and Zafgen, Inc, co-present on their work in using bioinformatics data effectively in the context of a small therapeutics company.
Eleanor Howe, PhD, CEO of Diamond Age, presents on the different types of computational biologist, the characteristics of a good bioinformatics team, and the pluses and minuses of using deep learning/AI in a discovery biology context.
Huseyin Mehmet, VP of Discovery Research at Zafgen, describes his team's work with Diamond Age and how it uses their capabilities to inform Zafgen's drug development. He discusses the need of biotech companies for a diverse, experienced bioinformatics team.
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...Health Catalyst
It’s been over six years since IBM’s Watson amazed all of us on Jeopardy, but it has yet to deliver similar breakthroughs in healthcare. The headline in last week’s Forbes article read, “MD Anderson Benches IBM Watson In Setback For Artificial Intelligence In Medicine.” Is it really a setback for the entire industry or not? Health Catalyst’s EVP for Product Development, Dale Sanders, believes that the challenges are unique to IBM’s machine learning strategy in healthcare. If they adjust that strategy and better manage expectations about what’s possible for machine learning in medicine, the future will be brighter for Watson, their clients, and AI in healthcare in general. Watson’s success is good for all of us, but its failure is bad for all of us, too.
Join Dale as he discusses:
The good news: Machine learning technology is accelerating at a rate beyond Moore’s Law. Dale believes that machine learning algorithms and models are doubling in capability every six months.
The bad news: The healthcare data ecosystem is not nearly as rich as many would believe, and certainly not as rich as that used to train Watson for Jeopardy. Without high-volume, high-quality data, Watson’s potential and the constant advances in machine learning algorithms will hit a glass ceiling in healthcare.
The best news: By adjusting strategy and expectations, there are still plenty of opportunities to do great things with machine learning by using the current data content in healthcare, while we build out the volume and breadth of data we need to truly understand the patient at the center of the healthcare picture… and you don’t need an army of PhD data scientists to do it.
Multi-source connectivity as the driver of solar wind variability in the heliosphereSérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA's Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Seminar on U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is the branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
It is an analytical method that measures the amount of light absorbed by the analyte.
Richard's adventures in two entangled wonderlandsRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Introduction:
RNA interference (RNAi), or Post-Transcriptional Gene Silencing (PTGS), is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals, and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the worm's development.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that these transcripts must cause the silencing through RNA-RNA interactions.
Types of RNAi (non-coding RNAs):
miRNA: 23-25 nt long; trans-acting; binds its target mRNA with mismatches; inhibits translation.
siRNA: 21 nt long; cis-acting; binds its target mRNA through a perfectly complementary sequence.
piRNA (Piwi-interacting RNA): 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers degradation of the target mRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute protein (an endonuclease), which cleaves the target mRNA.
DICER: endonuclease (RNase Family III)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN :
1. PAZ (PIWI/Argonaute/Zwille): recognizes the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity).
MiRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they play a key role in regulating gene expression.
18.-21. Cells, Tissues, & Diseases; Functional Annotations
Image from encodeproject.org
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
Big Data
23. Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2,000+ metadata attributes
• >2.5 petabytes of data
1 Petabyte of Data =
20 million four-drawer filing cabinets filled with text, or
13.3 years of HD-TV video, or
~7 billion Facebook photos, or
~2,000 years of MP3 songs played back-to-back
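A quick sanity check on that last figure (an illustration, not from the slides), assuming a ~128 kbps MP3, i.e., roughly 1 MB of audio per minute:

```python
# Back-of-the-envelope check: how long would 1 PB of MP3 audio take to play?
petabyte_bytes = 1e15
bytes_per_minute = 1e6              # assumption: ~1 MB of audio per minute at ~128 kbps
minutes = petabyte_bytes / bytes_per_minute
years = minutes / (60 * 24 * 365)   # minutes -> years
# years comes out to roughly 1,900, consistent with the "~2,000 years" figure
```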
29. Cells, Tissues, & Diseases Functional Annotations
Images from encodeproject.org and xorlogics.com
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
• We have lots of data and complex problems
• We want to make data-driven predictions and need to automate model building
Complex problems + Big Data -> Machine Learning!
• Allows us to better utilize these increasingly large data sets to capture their inherent structure
• Learning algorithms by training with data
33. Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when you have complex problems and lots of data ('big data')
Traditional Programming: Computer + Data + Program -> Output
(e.g., Data = [2,3], Program = '+', Output = 5)
Machine Learning: Computer + Data + Output -> Program
(e.g., Data = [2,3], Output = 5, learned Program = '+')
• Our goal isn't to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
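The [2,3] -> 5 contrast above can be sketched in a few lines of Python (an illustration, not from the talk): instead of hard-coding the '+' program, we learn it from (data, output) pairs as a linear model whose weights converge to [1, 1], i.e., addition.

```python
# Traditional programming: we supply the program ("+") and get the output.
def traditional(data):
    return data[0] + data[1]

# Machine learning: we supply (data, output) pairs and learn the "program",
# here a linear model y = w1*x1 + w2*x2, fit by stochastic gradient descent.
examples = [([2, 3], 5), ([1, 4], 5), ([7, 2], 9), ([0, 6], 6), ([3, 3], 6)]

w = [0.0, 0.0]
lr = 0.01
for _ in range(2000):
    for (x1, x2), y in examples:
        pred = w[0] * x1 + w[1] * x2
        err = pred - y
        w[0] -= lr * err * x1   # gradient step on each weight
        w[1] -= lr * err * x2

# The learned weights approximate [1, 1]: the model has "learned" addition.
```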
46. The Rise of Machine Learning
• Hardware Advances
  • Extreme-performance hardware (e.g., application-specific integrated circuits)
  • Smaller, cheaper hardware (Moore's law)
  • Cloud computing (e.g., AWS)
• Software Advances
  • New machine learning algorithms, including deep learning and reinforcement learning
• Data Advances
  • High-performance, less expensive sensors & data generation
  • e.g., wearables, next-gen sequencing, social media
We often use R, but Python is also a great choice!
• R tends to be favored by statisticians and academics (for research)
• Python tends to be favored by engineers (with production workflows)
47. The R Programming Language
• Open-source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, and data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages, including parallel/high-performance computing packages
• Used extensively by statisticians and academics
• Popularity has increased substantially in recent years
• Drawbacks: the learning curve can be steep (better recently), limited GUI (RStudio!), documentation can be sparse, and memory allocation can be an issue
53. Iris Dataset in R
Fisher's/Anderson's iris data set: measurements (cm) of sepal length and width, petal length and width, and species (Iris setosa, versicolor, and virginica), i.e., 5 features (variables) for 150 flowers (observations)
92. Iris Data: Adding Regularization
• Model building with a large # of features/variables for a moderate number of observations can result in 'overfitting': the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of features/variables that are linearly dependent (redundant)
• This results in FEATURE SELECTION
• Example methods of regression with regularization: ridge, elastic net, LASSO
111. Iris Data: Adding Regularization (LASSO)
• Model building with a large # of features for a moderate number of samples can result in 'overfitting': the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection
Full model:
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b
LASSO shrinks C and D to 0, leaving:
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
Fitted model: Species(setosa) ~ -2.36*Petal.Length + 1.58*Sepal.Width + 5.96
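The shrink-to-zero behavior above can be reproduced with a minimal coordinate-descent LASSO (a sketch on toy data, not the talk's R code; the data, lambda value, and function names are illustrative). The second feature is an exact copy of the first, i.e., linearly dependent, and the L1 penalty drives its coefficient to zero.

```python
# Toy data: feature 2 duplicates feature 1 (redundant); y depends on it only once.
x1 = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0]
X = [[v, v] for v in x1]
y = [2.0 * v for v in x1]

def soft_threshold(rho, lam):
    """The LASSO shrinkage operator."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, sweeps=100):
    """Minimize (1/2n)*sum((y - X@b)^2) + lam*sum(|b_j|) by coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # correlation of feature j with the partial residual (feature j held out)
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                      for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z
    return b

b = lasso_cd(X, y, lam=0.5)
# b[0] stays close to the true coefficient 2; the redundant b[1] is shrunk to zero
```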
116. Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Limitations: each decision boundary at each split is a concrete binary decision, and the decision criteria consider only one input feature at a time (not a combination of multiple input features)
• Examples: video games, clinical decision models
Tree learned on the iris data (class counts setosa/versicolor/virginica at each leaf):
Petal.Length < 2.35 cm -> Setosa (40/0/0)
otherwise: Petal.Width < 1.65 cm -> Versicolor (0/40/12), else Virginica (0/0/28)
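The tree above translates directly into if/else rules (a sketch using the split thresholds shown on the slides; the function name is illustrative):

```python
# The iris decision tree from the slides as plain if/else rules.
# Thresholds (2.35 cm, 1.65 cm) are the splits shown on the slide.
def classify_iris(petal_length, petal_width):
    if petal_length < 2.35:
        return "setosa"        # leaf purity on slide: (40/0/0)
    if petal_width < 1.65:
        return "versicolor"    # (0/40/12)
    return "virginica"         # (0/0/28)

# e.g., a typical setosa flower has a short petal:
label = classify_iris(petal_length=1.4, petal_width=0.2)
```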
122. Deep Learning (i.e., neural nets)
• Subfield of machine learning describing 'human-like AI'
• Algorithms are structured in layers to create artificial neural networks that learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of concepts, each defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
  • need a lot more data to perform well
  • need more/better hardware
  • typically identify and extract features without human intervention
  • usually solve problems end-to-end instead of in parts
  • take a lot longer to train
  • are typically less interpretable
• Ex: deep learning to automate resume scoring
  • Scoring performance may be excellent (i.e., near human performance)
  • but it does not reveal why a particular applicant was given a score
  • Mathematically you can find out which nodes of the network were activated, but we don't know what those neurons were supposed to model or what the layers of neurons were doing collectively
  • Interpretation is difficult
'Neuron': inputs X1 and X2 -> Output (summation of inputs and activation with a sigmoid fxn)
124. Other Machine Learning Methods
• Neural nets
• Ensemble methods (e.g., bagging, boosting)
• Naive Bayes (based on prior probabilities)
• Hidden Markov models (a Bayesian network with hidden states)
• K nearest neighbors (instance-based learning: clustering!)
• Support vector machines (a discriminator defined by a separating hyperplane)
• Additional ensemble-method approaches (combining multiple models)
• And new methods coming out all the time…
Typical workflow: Raw Data -> Clean/Normalize Data -> split into Training Set and Test Set -> Build Model -> Test -> Tune Model -> Apply to New Data (validation cohort or model application)
Algorithm selection is an important step!
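The train/test split step of the workflow above can be sketched as follows (an illustration; the 30% fraction and function name are assumptions, not from the talk): hold out part of the observations so the model is evaluated on data it never saw during training.

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Shuffle the observations (to avoid ordering bias), then hold out a test set."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]   # (training set, test set)

data = list(range(100))                   # stand-in for 100 cleaned observations
train, test = train_test_split(data)
```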
144. Iris Data: Neural Nets
• Neural networks (NNs) emulate how the human brain works with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions
• Lots of tuning parameters (# of hidden layers, # of neurons in each layer, and multiple ways to tune learning)
• Learning is an iterative feedback mechanism in which training-set error is used to adjust the corresponding input weights, and this adjustment is propagated back to previous layers (i.e., back-propagation)
• NNs are good at learning non-linear functions and can handle multiple outputs, but have a long training time, and models are susceptible to local-minimum traps (this can be mitigated by doing multiple rounds, which takes more time!)
'Neuron': inputs X1 and X2 -> Output (summation of inputs and activation with a sigmoid fxn)
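The single 'neuron' in the diagram (weighted summation of X1 and X2 followed by a sigmoid activation) can be written out directly; the weights and bias here are illustrative, not from the talk:

```python
import math

def sigmoid(z):
    """Squash the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w1=1.0, w2=1.0, bias=0.0):
    """One 'neuron': weighted summation of inputs X1, X2 plus a sigmoid activation."""
    return sigmoid(w1 * x1 + w2 * x2 + bias)

# With zero inputs and bias, the sigmoid sits at its midpoint, 0.5;
# strongly positive or negative inputs saturate toward 1 or 0.
```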