Week 1 lecture for High School Bioinformatics course; covers why we need to use computers in biology, what bioinformatics/computational biology is, an introduction to machine learning, and examples from current research
15. Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
Big Data
16. Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2000+ metadata attributes
• >2.5 Petabytes of data
1 petabyte (PB) of data is roughly equivalent to:
• 20 million four-drawer filing cabinets filled with text
• 13.3 years of HD-TV video
• ~7 billion Facebook photos
• ~2,000 years of continuous MP3 playback
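The MP3 comparison can be checked with quick arithmetic. This is a rough sketch, not a figure from the slides: it assumes a decimal petabyte (10^15 bytes) and about 1 MB per minute of MP3 audio, which is a common rule of thumb for ~128 kbps files. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the petabyte comparisons.
BYTES_PER_PETABYTE = 10**15          # decimal definition: 1 PB = 1,000 TB
MP3_BYTES_PER_MINUTE = 10**6         # ASSUMPTION: ~1 MB per minute of audio
MINUTES_PER_YEAR = 60 * 24 * 365.25  # minutes in an average year


def years_of_mp3(petabytes):
    """Years of non-stop playback needed to listen to this many PB of MP3s."""
    minutes = petabytes * BYTES_PER_PETABYTE / MP3_BYTES_PER_MINUTE
    return minutes / MINUTES_PER_YEAR


def files_per_patient(n_files, n_patients):
    """Average number of data files per patient."""
    return n_files / n_patients


print(f"1 PB of MP3s: about {years_of_mp3(1):,.0f} years of playback")
print(f"2.5 PB (TCGA scale): about {years_of_mp3(2.5):,.0f} years")
print(f"TCGA: about {files_per_patient(549_625, 11_000):.0f} files per patient")
```

Under these assumptions 1 PB comes out a little over 1,900 years, which agrees with the "~2,000 years" on the slide.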
20. Cells, Tissues, & Diseases Functional Annotations
Image from encodeproject.org and xorlogics.com.
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
• We have lots of data and complex problems
• We want to manage lots of data and make data-driven predictions
Complex problems + Big Data → Computer Science + Mathematics
Complex problems + Big Data → Machine Learning!
28. Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when you have complex problems and lots of data (‘big data’)
Traditional Programming: Data + Program → Computer → Output
Example: [2,3] and + → Computer → 5
Machine Learning: Data + Output → Computer → Program
Example: [2,3] and 5 → Computer → +
• Our goal isn’t to make perfect guesses, but to make useful guesses: we want to build a model that stays useful on future data
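The contrast between traditional programming and machine learning can be mimicked in a few lines of Python. This is only a toy sketch: real machine learning searches a far larger space of models, and the names `traditional`, `learn_program`, and `CANDIDATES` are invented for this example.

```python
import operator

# Traditional programming: data + program go in, output comes out.
def traditional(data, program):
    return program(*data)

# Toy "machine learning": data + output go in, a program comes out.
# ASSUMPTION: the learner may only choose among these three operations.
CANDIDATES = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def learn_program(examples):
    """Return the candidate operations that reproduce every (inputs, output) pair."""
    return [
        name
        for name, fn in CANDIDATES.items()
        if all(fn(*inputs) == output for inputs, output in examples)
    ]

print(traditional([2, 3], operator.add))   # data [2,3] + program "+" -> 5
print(learn_program([((2, 3), 5)]))        # data [2,3] + output 5 -> ['+']
print(learn_program([((2, 2), 4)]))        # ambiguous: ['+', '*']
```

The last call shows why more data helps: the single example (2, 2) → 4 fits both addition and multiplication, so a learner cannot tell which one will be useful on future inputs.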
32. Iris Dataset in R
Fisher’s/Anderson's iris data set: measurements (cm) of the sepal length and width and petal length and width (4 features) for 50 flowers from each of 3 species (Iris setosa, versicolor, and virginica)
36. Iris Data: Adding Regularization (LASSO)
• Building a model with a large number of features and only a moderate number of samples can result in ‘overfitting’: the model is too specific to the training set and not generalizable enough to make accurate predictions on new data
• Regularization is a technique for preventing this: it introduces a tuning parameter that penalizes large coefficients, which shrinks the coefficients of variables that are linearly dependent (redundant)
• With LASSO, some coefficients shrink to exactly zero, which results in FEATURE SELECTION
• Ridge regression and LASSO regression are both methods of regression with regularization (ridge shrinks coefficients but keeps every feature)
Diagram: Petal.Length, Sepal.Width, and Sepal.Length go into the computer to model Petal.Width
Full model: Petal.Width ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + b
After LASSO (A = 0, B = 0): Petal.Width ~ 0*Petal.Length + 0*Sepal.Width + C*Sepal.Length + b
Reduced model: Petal.Width ~ C*Sepal.Length + b
Fitted model: Petal.Width ~ 0.968*Sepal.Length + 0.187
p-value < 2.2 × 10^-16
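To see how a LASSO penalty can push coefficients to exactly zero, here is a minimal pure-Python sketch using coordinate descent. It is not the R analysis behind the slide. It assumes the objective (1/2n)·(sum of squared errors) + lam·(sum of |coefficients|), leaves the features unscaled, and uses only the first five flowers of each species, so its numbers will not match the fitted model above. The names `soft_threshold` and `lasso_fit` are invented for this example.

```python
# Features: Petal.Length, Sepal.Width, Sepal.Length; target: Petal.Width (cm).
# Small illustrative subset: first five flowers of each iris species.
X = [
    [1.4, 3.5, 5.1], [1.4, 3.0, 4.9], [1.3, 3.2, 4.7], [1.5, 3.1, 4.6], [1.4, 3.6, 5.0],
    [4.7, 3.2, 7.0], [4.5, 3.2, 6.4], [4.9, 3.1, 6.9], [4.0, 2.3, 5.5], [4.6, 2.8, 6.5],
    [6.0, 3.3, 6.3], [5.1, 2.7, 5.8], [5.9, 3.0, 7.1], [5.6, 2.9, 6.3], [5.8, 3.0, 6.5],
]
y = [0.2, 0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 1.5, 2.5, 1.9, 2.1, 1.8, 2.2]


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_sweeps=500):
    """Fit y ~ X with an L1 penalty; returns (coefficients, intercept)."""
    n, p = len(X), len(X[0])
    # Center features and target so the intercept can be recovered at the end.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # how strongly feature j tracks what the others miss
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                others = sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - others)
                z += Xc[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    b = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, b


for lam in (0.0, 0.05, 0.5, 100.0):
    w, b = lasso_fit(X, y, lam)
    print(f"lam={lam}: coefficients={[round(c, 3) for c in w]}, intercept={round(b, 3)}")
```

As the penalty `lam` grows, more coefficients are set to exactly zero. With a very large penalty every feature is dropped and the model just predicts the average petal width.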
37. Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Decision tree limitations: each split is a hard binary decision, and each decision criterion considers only one input feature at a time (not a combination of multiple input features)
• Examples: video games, clinical decision models
Example tree (class counts in each leaf: setosa/versicolor/virginica):
Petal.Length < 2.35 cm? Yes: Setosa (40/0/0)
Otherwise, Petal.Width < 1.65 cm? Yes: Versicolor (0/40/12); No: Virginica (0/0/28)
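The tree on this slide is small enough to write out as plain if-statements. The sketch below hard-codes the two thresholds shown on the slide; the function name `classify_iris` is made up for illustration, and a real decision-tree library would learn the thresholds from the training data instead.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Follow the two splits of the example tree, one feature at a time."""
    if petal_length_cm < 2.35:   # first split: petal length
        return "setosa"
    if petal_width_cm < 1.65:    # second split: petal width
        return "versicolor"
    return "virginica"


# One well-known flower from each species in the iris data set:
print(classify_iris(1.4, 0.2))   # setosa
print(classify_iris(4.7, 1.4))   # versicolor
print(classify_iris(6.0, 2.5))   # virginica
```

Each `if` looks at a single feature and makes a hard yes/no cut, which is exactly the limitation listed above. It is also why 12 virginica flowers with narrow petals land in the versicolor leaf.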
39. Other Machine Learning Methods
• Ensemble Methods (combine multiple models, e.g. bagging and boosting)
• Neural Nets (inspired by the human brain)
• Naive Bayes (applies Bayes' theorem with prior probabilities)
• Hidden Markov Models (Bayesian network with hidden states)
• K Nearest Neighbors (instance-based learning: a new sample is labeled by its closest training samples)
• Support Vector Machines (discriminator defined by a separating hyperplane)
• And new methods coming out all the time…
Machine learning workflow: Raw Data → Clean/Normalize Data → split into Training Set and Test Set → Build Model → Test → Tune Model (repeat as needed) → Apply to New Data (Validation Cohort or Model Application)
Algorithm Selection is an Important Step!
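The workflow can be sketched end to end with one of the methods listed above, K Nearest Neighbors. This is a toy illustration under stated assumptions: it uses only the first five flowers of each iris species, skips the cleaning step because the data are already tidy, and the helper names (`train_test_split`, `knn_predict`, `accuracy`) are written here from scratch rather than taken from a library.

```python
import random
from collections import Counter

# Raw data: (sepal length, sepal width, petal length, petal width) in cm, species.
IRIS_SUBSET = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"), ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"), ((4.6, 3.1, 1.5, 0.2), "setosa"),
    ((5.0, 3.6, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"), ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"), ((5.5, 2.3, 4.0, 1.3), "versicolor"),
    ((6.5, 2.8, 4.6, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"), ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"), ((6.3, 2.9, 5.6, 1.8), "virginica"),
    ((6.5, 3.0, 5.8, 2.2), "virginica"),
]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, features, k=3):
    """Label a flower by majority vote among its k closest training flowers."""
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out flowers whose species is predicted correctly."""
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)


train, test = train_test_split(IRIS_SUBSET)   # Training Set / Test Set
for k in (1, 3, 5):                           # Tune Model: try several k values
    print(f"k={k}: test accuracy = {accuracy(train, test, k):.2f}")

# Apply to New Data: a flower that was not in the data set.
print(knn_predict(IRIS_SUBSET, (6.0, 3.0, 5.5, 2.0), k=3))
```

With only 15 flowers the accuracy numbers are not meaningful. The point is the shape of the process: the test set is kept apart while the model is built, then used to check it.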
41. Example Computational Biology Experiments and Tasks:
• Example 1: Identify Variants Associated with a Predisposition to ALS
  • Annotation
  • Databasing
  • Statistical Programming (analysis + visualization)
  • Hypothesis-Generating Research
• Example 2: Develop Biomarkers for Kidney Cancer Diagnosis
  • Statistical Programming (analysis + visualization)
  • Machine Learning
  • Direct Clinical Application
  • Interdependent and Complementary ‘Wet’/‘Dry’ Biology Research
• Example 3: Generate Pan-Cancer Models of Patient Prognosis
  • Statistical Programming (analysis + visualization)
  • Machine Learning
  • Software Development
  • Computational Research
65. From Bench To Bedside: ‘liquid biopsies’ from peripheral fluids
Diagram labels: Patient with kidney cancer; Blood Test; Urine Test; Cell-free DNA
• Early diagnosis for non-specific symptoms
• Distinguish small benign lesions from malignant tumors
• Follow patients after surgery or during treatment to watch for recurrence
• Monitor molecular changes associated with patient outcome