The presentation opens with background on big data, focusing on metagenomic data, and discusses the hurdles of analyzing such data with conventional approaches. It then gives a brief introduction to machine learning approaches, with a biological example for each. Finally, it presents work focused on the implementation of a machine learning approach, Random Forest, for the functional annotation and taxonomic classification of metagenomic data.
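The Random Forest setup described above can be sketched minimally with scikit-learn. Everything below is an illustrative assumption, not the presenter's actual pipeline: the feature matrix stands in for per-read k-mer frequency vectors, and the labels stand in for taxonomic classes.

```python
# Sketch: Random Forest for taxonomic classification of metagenomic reads.
# All data here is synthetic; in practice X would hold k-mer frequencies
# (or similar features) per read/contig, and y would hold taxon labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_reads, n_kmers = 300, 64          # 64 = 4^3 trinucleotide frequencies, say
X = rng.random((n_reads, n_kmers))  # stand-in for per-read feature vectors
y = rng.integers(0, 3, n_reads)     # stand-in for 3 taxonomic classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # accuracy on held-out reads
```

With random labels the accuracy is near chance; the point is only the shape of the workflow: featurize sequences, fit the forest, evaluate on held-out data.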
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ... - Larry Smarr
06.09.15
Invited Talk
2006 Synthetic Biology Symposium
Aliso Creek Inn
Title: Building a Community Cyberinfrastructure to Support Marine Microbial Ecology Metagenomics
Laguna Beach, CA
This document provides an overview of genomics and metagenomics. It begins with an introduction to genomics, describing genome assembly, validation, and metabolic reconstruction. It then covers metagenomics, discussing its history, pitfalls, and potentials. Key points include that genomics analyzes the parts list of a single genome, while metagenomics analyzes the collective genomes of an entire microbial community. Metagenomics has been used to explore novel sequences from various environments, perform comparative analyses between ecosystems, and extract genomes from low-abundance species.
Building an Information Infrastructure to Support Microbial Metagenomic Sciences - Larry Smarr
06.01.14
Presentation for the Microbe Project Interagency Team
Title: Building an Information Infrastructure to Support Microbial Metagenomic Sciences
La Jolla, CA
Microbial Metagenomics Drives a New Cyberinfrastructure - Larry Smarr
06.03.03
Invited Talk
School of Biological Sciences
University of California, Irvine
Title: Microbial Metagenomics Drives a New Cyberinfrastructure
Irvine, CA
The document discusses metagenomics analysis tools and challenges. It summarizes several metagenome analysis portals that provide computational analysis and public sample databases. It also discusses the rapid growth of metagenomic data being produced, challenges around quality control, feature identification, characterization and presentation of metagenomic data, and the need for standardized metadata and data formats. The future directions highlighted include studying strain variation, expanding metadata capture and standards, and developing improved assembly, binning and analysis methods.
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS - Lubna MRL
After billions of years of evolution, prokaryotes have developed a huge diversity of regulatory mechanisms, many of which are probably uncharacterized. Now that the powerful tool of whole-transcriptome analysis can be used to study the RNA of bacteria and archaea, a new set of unexpected RNA-based regulatory strategies might be revealed.
Metagenomics, together with in vitro evolution and high-throughput screening technologies, provides industry with an unprecedented chance to bring biomolecules into industrial application.
10.02.19
Invited talk
Symposium #1816, Managing the Exaflood: Enhancing the Value of Networked Data for Science and Society
Title: Advancing the Metagenomics Revolution
San Diego, CA
The Emerging Global Collaboratory for Microbial Metagenomics Researchers - Larry Smarr
08.07.30
Invited Talk
Delivered From Calit2@UCSD
Monash University MURPA Lecture
Title: The Emerging Global Collaboratory for Microbial Metagenomics Researchers
Melbourne, Australia
Viral Metagenomics (CABBIO 20150629 Buenos Aires) - bedutilh
This is a one-hour lecture about metagenomics, focusing on discovery of viruses and unknown sequence elements. It is part of a one-day workshop about metagenome assembly of crAssphage, a bacteriophage virus found in human gut. The hands-on workflow can be found at http://tbb.bio.uu.nl/dutilh/CABBIO/ and should be doable in one afternoon with supervision. There is also an iPython notebook about this here: https://github.com/linsalrob/CrAPy
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics - Christopher Mason
This document outlines plans for multi-site sequencing studies to generate standardized human and bacterial genome sequencing datasets. Samples include a human trio, bacterial isolates, and mixtures, which will be sequenced in triplicate across three sites on various platforms including Illumina HiSeq X Ten, HiSeq 4000, HiSeq 2500, NextSeq 500, Life Tech Ion Proton, Ion S5, Pacific Biosciences, Oxford Nanopore, and others. The goals are to measure intra- and inter-lab variation, sequencing performance at GC extremes, and establish molecular standards for assessing sequencing methods in DNA, RNA, and metagenomics. Data will be analyzed by a team to benchmark tools and published by October 2017.
Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. The broad field was referred to as environmental genomics, ecogenomics or community genomics. Recent studies use "shotgun" Sanger sequencing or next generation sequencing (NGS) to get largely unbiased samples of all genes from all the members of the sampled communities.
Meren's pirate presentation at the STAMPS course to talk about the basic concepts most binning algorithms use to bin contigs into genome bins: sequence composition, and differential coverage.
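The first of the two signals named above, sequence composition, can be made concrete with a toy tetranucleotide-frequency profile. The function below is an illustrative sketch, not any particular binner's implementation; real tools combine such profiles with per-sample differential coverage and cluster contigs into bins.

```python
# Sketch: sequence composition signal used by most binning algorithms.
# A contig is summarized as its tetranucleotide frequency profile; contigs
# from the same genome tend to have similar profiles.
from itertools import product

def tetranucleotide_freqs(seq):
    """Return the normalized frequency of each of the 4^4 = 256 tetramers."""
    tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = {t: 0 for t in tetramers}
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:          # skips windows containing N, etc.
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in tetramers]

profile = tetranucleotide_freqs("ACGTACGTACGTAAAA")
print(len(profile))                   # 256-dimensional composition vector
```

Contigs whose profiles (and coverage patterns across samples) are similar end up in the same genome bin, typically via clustering in this feature space.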
This document provides an overview of metagenomic analysis. It discusses collecting metagenomic data through sampling and sequencing environmental samples. It then covers various bioinformatics approaches used in metagenomic analysis such as assembly, binning, and annotation of sequencing data. Specific tools and algorithms for these approaches are also described, including reference-based and de novo assembly, compositional and similarity-based binning methods like AbundanceBin.
Shotgun metagenomics involves collecting environmental samples, extracting DNA from the samples, sequencing the DNA using shotgun sequencing, and then analyzing the sequence data computationally. Key steps include assembling reads into longer contigs to aid analysis and annotation. While assembly works well for some datasets, challenges include repeats, low coverage of low-abundance species, and strain variation. High coverage, often 10x or more per genome, is critical for robust assembly. The amount of sequencing needed can be substantial, such as terabases of data to deeply sample microbial communities.
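The "10x or more" guideline above follows from simple coverage arithmetic: expected coverage C = N·L/G, where N is the number of reads, L the read length, and G the genome size. A back-of-envelope helper, with made-up example numbers:

```python
# Coverage arithmetic behind the "10x or more per genome" guideline:
# expected coverage C = N * L / G  (reads x read length / genome size).
def reads_needed(target_coverage, genome_size_bp, read_length_bp):
    """Number of reads required to reach the target average coverage."""
    return target_coverage * genome_size_bp / read_length_bp

# e.g. 10x over a 5 Mbp genome with 150 bp reads
n = reads_needed(10, 5_000_000, 150)
print(round(n))  # ~333,333 reads for ONE genome at uniform coverage
```

In a community, low-abundance genomes receive only a fraction of the reads, which is why total sequencing requirements can climb into terabases.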
Creating a Cyberinfrastructure for Advanced Marine Microbial Ecology Research... - Larry Smarr
The document discusses the creation of the Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) project. CAMERA aims to provide metagenomic sequencing and analysis of marine microbes at high speeds. It will include data from the Sorcerer II expedition and other projects. The document outlines how CAMERA will utilize Calit2's infrastructure including high-performance computing resources and optical networks to enable remote interactive analysis of large-scale genomic and environmental data sets.
Metagenomics is the study of genetic material recovered directly from environmental samples without culturing organisms. It allows researchers to study the 99.9% of microorganisms that cannot be cultured. Metagenomic analyses of ocean samples revealed over a million new genes and unexpected light-energy pathways in bacteria. Metagenomics has two main approaches - sequence-driven which sequences DNA and compares to databases, and function-driven which screens DNA clones for a desired function. Both approaches have limitations but are complementary. Metagenomics has applications in discovering new antibiotics and enzymes and studying human microbiomes and antibiotic resistance.
The document discusses the rise of big data in microbiology due to decreasing costs of DNA sequencing and computational resources. It describes how high-throughput sequencing is generating vast amounts of microbial genomic and metagenomic data. However, analyzing these large, complex datasets presents numerous technical and social challenges for microbiologists, including handling data volume, integrating diverse data types, accessing resources, and incentivizing data sharing. Overcoming these bottlenecks will be key to unlocking the scientific insights contained within the microbial "big data" tidal wave.
Dag Harmsen presented on the evolution and challenges of cgMLST for harmonizing bacterial genome sequencing and analysis. Key points include:
- cgMLST (core genome multilocus sequence typing) involves identifying and comparing alleles across a fixed set of core genome genes and has been applied to outbreak investigation and global pathogen nomenclature.
- Tools for cgMLST analysis have been developed and improved to work on read, draft, and complete genome levels and allow scalable, additive analysis of single genes to whole genomes.
- Standardizing a hierarchical cgMLST-based approach and developing common nomenclature poses challenges but is important for microbial genotypic surveillance across laboratories and countries.
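The core comparison in these points, counting differing alleles over a fixed scheme of core-genome loci, can be sketched as follows. Locus names and allele numbers are invented for illustration; real schemes assign numeric allele identifiers per locus from a shared nomenclature server.

```python
# Sketch of the cgMLST comparison: two isolates are compared by counting
# loci with differing allele calls over a fixed core-genome scheme.
def allelic_distance(profile_a, profile_b):
    """Count loci with differing allele calls; missing calls are skipped."""
    shared = [locus for locus in profile_a
              if locus in profile_b
              and profile_a[locus] is not None
              and profile_b[locus] is not None]
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])

# hypothetical allele profiles (None = locus not called in that assembly)
isolate1 = {"locus_0001": 5, "locus_0002": 12, "locus_0003": 1, "locus_0004": None}
isolate2 = {"locus_0001": 5, "locus_0002": 14, "locus_0003": 1, "locus_0004": 7}
print(allelic_distance(isolate1, isolate2))  # 1 differing allele (locus_0002)
```

Small allelic distances suggest epidemiological linkage, which is why a shared scheme and nomenclature matter for cross-laboratory surveillance.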
Whole genome sequencing of bacteria & analysis - drelamuruganvet
This document discusses the history and advancements of whole genome sequencing of bacteria. It begins with early sequencing methods like Sanger sequencing and describes the development of next generation sequencing technologies like 454 sequencing, Illumina sequencing, and third generation single molecule sequencing. The document then discusses genome assembly, annotation, and various applications of bacterial genome sequencing like identification of genes and SNPs, comparative genomics, and metagenomics. Important databases for bacterial genomic data are also listed.
Metagenomics is the study of microbial communities directly from environmental samples without culturing individual species. It sequences all DNA from a sample simultaneously, bypassing the need for culture. Analysis of metagenomic data involves screening and phylogenetic studies of the large amounts of sequence data. Metagenomics can provide insights into microbial community structure and interactions, and discover novel enzymes and genes with industrial or pharmaceutical applications. Challenges include DNA purification issues, contamination, sequencing errors, and difficulties assembling less abundant genomes from immense metagenomic datasets.
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ... - GigaScience, BGI Hong Kong
This document discusses challenges in comparing metagenomic data from different environments and studies. It argues that when exploring a new environment, multiple methodological approaches should be used to capture natural and methodological variations. When performing global comparisons, methodological variations should be considered for all environments. Defining ecosystems precisely at the microorganism level is important. The author's vision is for projects like the Earth Microbiome Project to use flexible experimental designs informed by different experts to best represent microbial communities.
Metagenomics research is a vast field that studies the genetic content of environmental samples. Binning is a bioinformatics technique that supports the genomic analysis of such samples.
This document provides an overview of bioinformatics and discusses key concepts like:
- Bioinformatics combines biology, computer science, and information technology to analyze large amounts of biological data.
- High-throughput DNA sequencing has generated vast genomic data that requires bioinformatics tools and databases accessible via the internet to analyze and share.
- Popular sequence alignment tools like BLAST, FASTA, and ClustalW are used to search databases and compare sequences, helping researchers analyze genes and genomes.
This document discusses challenges and approaches for assembling large metagenomic and genomic datasets using short read sequencing data. Three main challenges are discussed: 1) Assembling the parasitic nematode H. contortus genome due to high polymorphism and repeats. Digital normalization helped enable assembly by reducing redundancy and errors. 2) Assembling the lamprey transcriptome with no reference and too much data. Digital normalization reduced the data volume and enabled assembly. 3) Assembling large soil metagenomic datasets, which are difficult due to their scale and complexity. Data partitioning separates reads into bins to enable "divide and conquer" assembly approaches. While progress has been made, challenges remain around strain variation and scaffolding.
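The digital normalization idea mentioned above can be sketched in a few lines: keep a read only while its k-mers are not yet well covered, so redundant high-coverage reads are discarded before assembly. Real implementations (e.g. khmer) use probabilistic k-mer counting and larger k; the exact dictionary and toy parameters here are simplifications.

```python
# Toy sketch of digital normalization: stream reads and keep a read only if
# its median k-mer count (seen so far) is below a coverage cutoff.
from collections import Counter
from statistics import median

def diginorm(reads, k=4, cutoff=3):
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)          # read adds new information: keep it
            counts.update(kmers)       # ...and record its k-mers
    return kept

reads = ["ACGTACGT"] * 10 + ["TTTTCCCC"]
print(len(diginorm(reads)))  # -> 4: three copies saturate the cutoff; the novel read is kept
```

Because the decision is streaming and per-read, the method scales to datasets far too large to assemble directly, at the cost of discarding some true coverage information.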
WHAT IS BIOINFORMATICS?
Computational biology/bioinformatics is the application of computer science and allied technologies to answer biologists' questions about the mysteries of life. It has evolved to serve as the bridge between:
Observations (data) in diverse biologically-related disciplines and
The derivations of understanding (information)
APPLICATIONS OF BIOINFORMATICS
Computer Aided Drug Design
Microarray Bioinformatics
Proteomics
Genomics
Biological Databases
Phylogenetics
Systems Biology
Bioinformatics is the application of computational tools and techniques to analyze and interpret biological data. It involves the development of these tools and databases, as well as their application to better understand biological systems and functions at the molecular level through analysis of genetic sequences, protein structures, and more. The goal is to gain a global understanding of cellular functions by analyzing genetic data as dictated by the central dogma of biology, and relating sequence information to protein functions and cellular processes.
Experimental methods and improvements in big data sets
1. Experiments in systems biology use quantitative data from multiple omics techniques like microarrays, sequencing, proteomics, lipidomics, and metabolomics to study biological systems.
2. Computational models are used to simulate dynamic changes in molecules over time based on precise quantification from experiments.
3. Both hypothesis-generating and hypothesis-driven studies are important in systems biology, with the latter focusing on targeted subsets of molecules or organelles.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales - jennomics
Presentation at a workshop conducted by the UC Davis Bioinformatics Core Facility: Using the Linux Command Line for Analysis of High Throughput Sequence Data, September 15-19, 2014
Large scale machine learning challenges for systems biology - Maté Ongenaert
Large scale machine learning challenges for systems biology
by dr. Yvan Saeys - Machine Learning and Data Mining group, Bioinformatics and Systems Biology Division, VIB-UGent Department of Plant Systems Biology
Due to technological advances, the amount of biological data, and the pace at which it is generated, have increased dramatically during the past decade. To extract new knowledge from these ever-increasing data sets, automated techniques such as data mining and machine learning have become standard practice.
In this talk, I will give an overview of large scale machine learning challenges in bioinformatics and systems biology, highlighting the importance of using scalable and robust techniques such as ensemble learning methods implemented on large computing grids.
I will present some of our state-of-the-art tools to solve problems such as biomarker discovery, large scale network inference, and biomedical text mining at PubMed scale.
This document provides an overview of bioinformatics and discusses key topics in the field. It begins by defining bioinformatics as the application of information technology to the analysis and management of biological data, facilitated by the use of computers. It then lists some common applications of bioinformatics like sequence analysis, molecular modeling, phylogeny analysis, and medical informatics. The document also discusses some of the promises of genomics and bioinformatics for applications in medicine, agriculture, and other fields. It provides a brief history of the emergence of bioinformatics as a field in the 1990s. Finally, it outlines some of the main topics that will be covered in the bioinformatics course, including databases, algorithms, interface design, and computational methods.
Here are some suggestions for open online bioinformatics lectures and courses from famous universities:
- MIT OpenCourseWare has free bioinformatics course materials and videos from MIT courses.
- edX has massive open online courses (MOOCs) in bioinformatics from universities like Harvard, Berkeley, MIT. Some are free to audit.
- Coursera has bioinformatics courses from top universities like Johns Hopkins, University of Toronto, Peking University.
- YouTube has full lecture videos from bioinformatics courses at universities like Stanford, UC San Diego, University of Cambridge.
- Khan Academy has introductory bioinformatics lectures on topics like sequence alignment, gene finding, protein structure.
- EMBL-EBI offers free online bioinformatics training courses through its Train online platform.
This document provides an overview of cloud bioinformatics and the challenges of analyzing large datasets from next-generation sequencing (NGS). It discusses how bioinformatics uses computational methods to study genes, proteins, and genomes. The advent of NGS has led to huge datasets that require high-performance computing. Cloud computing provides access to pooled computing resources in a cost-effective manner and helps address the bioinformatics challenge of assembling and analyzing NGS data. The document also outlines common bioinformatics software and resources available through WestGrid and Galaxy that can be used for sequence assembly, annotation, and other applications.
The document describes a seminar on high-throughput sequencing bioinformatics. It discusses analyzing microbiome samples using 16S rRNA sequencing and tools like Mothur and QIIME. It provides an overview of analyzing 16S sequences, including quality filtering, OTU clustering, classification, and diversity analysis. It also outlines running a Mothur tutorial to analyze a mock microbiome dataset from 21 samples using the Mothur MiSeq standard operating procedure.
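The diversity-analysis step in such 16S workflows often comes down to simple estimators over an OTU count table; Shannon diversity, H' = -Σ p_i ln p_i over OTU relative abundances p_i, is typical. The counts below are toy values, not from the tutorial dataset.

```python
# Shannon diversity H' for one sample, computed from raw OTU counts.
import math

def shannon(counts):
    total = sum(counts)
    props = [c / total for c in counts if c > 0]   # relative abundances p_i
    return -sum(p * math.log(p) for p in props)

otu_counts = [50, 30, 15, 5]                       # toy OTU table column
print(round(shannon(otu_counts), 3))               # -> 1.142
```

Tools like Mothur and QIIME report this and related indices (Chao1, Simpson, etc.) per sample once reads have been quality-filtered, clustered into OTUs, and classified.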
Bioinformatics is the application of information technology to the storage, management and analysis of biological information. It involves using computers to analyze biological data, especially DNA and protein sequences. Some key areas of bioinformatics include sequence analysis, molecular modeling, phylogeny/evolution, medical informatics, image analysis, and statistics. Bioinformatics has many applications in medicine like genome analysis for genetic diseases, understanding drug effects, and facilitating drug design. It can also be applied to crop and livestock improvement. Next generation sequencing technologies like Illumina, SOLiD, and 454 have enabled rapid DNA sequencing at a lower cost compared to Sanger sequencing. These new technologies are driving major advances in biological research.
Biotechnophysics: DNA Nanopore Sequencing - Melanie Swan
Biophysics (not merely bioengineering) is required to understand the fundamental mechanisms of biology in order to make technologies (bench and bioinformatic) for understanding them
This document provides an overview of bioinformatics. It defines bioinformatics as the science of collecting, analyzing and conceptualizing biological data through computational techniques. It discusses that bioinformatics involves managing, organizing and processing biological information from databases, as well as analyzing, visualizing and sharing biological data over the internet. It also outlines some of the goals of bioinformatics like organizing the human and mouse genomes, as well as some applications like genomic and protein sequence analysis, protein structure prediction, and characterizing genomes.
Bioinformatics is an interdisciplinary field that uses computational tools to analyze and manage biological data such as genes, genomes, proteins, and medical information. It involves developing mathematical models to understand relationships in complex biological systems. Key areas include analyzing protein and gene sequences, structures, and functions; understanding evolution and molecular interactions; and developing "virtual cells" through integrated modeling. Major challenges include integrating heterogeneous biological data sources and developing robust computational methods.
Bioinformatics issues and challenges presentation at S P College - SKUASTKashmir
This document provides an overview of bioinformatics and some key concepts:
- It discusses the exponential growth of biological data from technologies like PCR and microarrays, and how bioinformatics is needed to analyze this data.
- Bioinformatics is defined as integrating biology and computer science to collect, analyze, and interpret large amounts of molecular-level information. It uses databases and tools to study genomes, proteins, and biological processes.
- Major databases like GenBank, EMBL, and SwissProt store DNA, RNA, protein sequences and provide access to researchers. Tools like BLAST are used to search databases and analyze sequences.
- Benefits of bioinformatics include advances in medicine, agriculture, forensics
This document provides an overview of downstream analyses that can be performed after variant identification and filtering in a typical variant calling pipeline. It discusses visualization of variant data in each gene to identify potential causative variants. It also mentions association studies as another type of downstream analysis where variants are tested for association with disease phenotypes. The goal of downstream analyses is to help prioritize variants for further investigation.
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole-genomes to be studied at population-level rather then for small number of individuals. This provides new power to whole genome association studies (WGAS
), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant, for example the dataset from 1000 Genomes project with genomes of 2504 individuals includes nearly 85M genomic variants with raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods that was found to be useful in this context, both because of its potential for parallelization and its robustness. Although there is a number of big data implementations available (including Spark ML) they are tuned for typical dataset with large number of samples and relatively small number of variables, and either fail or are inefficient in the GWAS context especially, that a costly data preprocessing is usually required.
To address these problems, we have developed the RandomForestHD – a Spark based implementation optimized for highly dimensional data sets. We have successfully RandomForestHD applied it to datasets beyond the reach of other tools and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and related challenges, present RandomForestHD with its design principles and implementation details with regards to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
[DSC Europe 23][DigiHealth] Vesna Pajic - Machine Learning Techniques for omi...DataScienceConferenc1
The document discusses machine learning techniques for analyzing omics data. It introduces Velsera, a bioinformatics company, and describes how they used machine learning to predict cancer cell line responses to drugs based on gene expression data. Specifically, they cleaned the data, performed feature selection, and tested models like elastic net, GAMs, and XGBoost (which performed best). The final model identified 20 important genes, including one the client was interested in and another potential biomarker the client was unaware of.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
The cost of acquiring information by natural selectionCarl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
1. Computational Analysis of High-throughput
Biological Data Using
Machine Learning Approaches
Ashok K Sharma
1220104
MetaInformatics Laboratory
IISER Bhopal
2. Topics Covered in the Talk
Introduction
Beginning of Genomics
Current Sequencing Scenario
Metagenomic Approaches
The Conventional Approach for Data Analysis- Limitations
Machine Learning Approaches and their Implementations
SVM
HMM
Naive Bayes
Random Forest
Discuss the Work Done So Far
Future Directions
3. Beginning of Genomics
DNA was first isolated by the Swiss physician Friedrich Miescher in 1869
The term "genome" was introduced by the German botanist Hans Winkler in 1920
The history of modern genomics began in the 1970s
Nucleotide sequence of bacteriophage lambda DNA (~48 kb)
F. Sanger et al., J Mol Biol, 1982
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd (~1,800 kb)
Fleischmann RD et al., Science, 1995
Sequencing and analysis of the human genome (3 billion bp; ~3 billion USD; 10 years)
ES Lander et al., Nature, 2001
4. Next Generation Sequencing Technologies Leading to the Sequencing Era

Sequencer | Read length | ~Cost/Mb | ~Data/run
Roche 454 | 400-800 bp | $20 | 450 Mb
Ion Torrent | 200 bp | $2 | 10 Mb-1 Gb
Illumina | 150 bp | $0.50 | 600 Gb
PacBio SMRT | ~20 kb | $1.40 | 350 Mb

[Images: Roche 454, Ion Torrent, and Illumina/Solexa sequencers]
5. Metagenomics: A New Approach to Sequence the Unknown
•The idea of cloning DNA directly from environmental samples was first proposed by Pace in 1985
•The term "metagenome" was coined by Handelsman in 1998
The First Large-Scale Metagenomics Project:
Environmental Genome Shotgun Sequencing of the Sargasso Sea (1.6 Gb and 1.2 million genes)
J. C. Venter et al., Science, 2004
The First Large-Scale Organismal Study:
Model study comparing the gut flora of 124 European individuals (576.7 Gb and 3.3 million genes)
Qin et al., Nature, 2010
~98% of bacteria cannot be cultured and hence cannot be sequenced by conventional approaches
6. Genomics and Metagenomics Have Exponentially Increased the Sequence Databases
[Chart: Growth of GenBank (1984-2013), sequences in millions, rising toward ~180 million]
[Chart: Published papers on "Metagenomics" in PubMed]
[Chart: Cost per human genome, falling from ~$100M to ~$1,000]
Running projects (https://gold.jgi-psf.org/):
• Metagenomic: 538
• Non-metagenomic: 18,787
10. Genomics vs Metagenomics

GENOMICS: culture a single microbe → DNA isolation → fragmentation of DNA → sequencing → assembly → analysis
METAGENOMICS: community of microbial species, mainly unculturable → DNA isolation → fragmentation of DNA → sequencing → assembly → analysis

The Metagenomic Challenges
• Assembly
• Taxonomic Assignment
• Metabolic Pathway Construction
• Gene Prediction
• Functional Annotation
• Comparative Analysis
11. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach for Data Analysis- Limitations
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
12. Conventional Methods Cannot Be Used for Metagenomic Data Analysis
•Homology-based approach: BLAST
•The most widely used method among researchers
•Uses dynamic programming
Each query sequence is fragmented into seeds and searched against all 4.7 million sequences of the database
It would take about 17 years on a Xeon 2.6 GHz PC to BLAST the >3 million metagenomic genes from a single project
BLAST of 1,000 genes against the NR database: ~1 day (25.5 hrs) in 2012, ~2 days (47.1 hrs) in 2014
[Chart: growth of the NCBI NR database, from <1 GB and ~4 GB in earlier years to ~10 GB (2012), ~13 GB (2013), ~17 GB (2014), future: ????]
13. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
14. Machine Learning: A Valuable Alternative
Key idea: learn from known data and predict on unknown data
The homology-search paradigm (one query against a database of 4.7 million sequences) memorizes the information and processes everything at once. Machine learning instead works in three steps:
• Learning: from known examples or data
• Hypothesis: derive a hypothesis based on the training examples
• Prediction: based on the hypothesis, predict on unknown queries
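The learn / hypothesis / prediction loop above can be sketched with any off-the-shelf classifier; the toy data and the choice of a nearest-neighbour model below are purely illustrative, not from the talk:

```python
from sklearn.neighbors import KNeighborsClassifier

# Learning: known examples (feature vectors plus labels) form the training data.
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_train = ["class1", "class1", "class2", "class2"]

# Hypothesis: the fitted model is the hypothesis derived from the examples.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Prediction: the hypothesis is applied to an unknown query.
print(model.predict([[0.15, 0.85]])[0])  # falls near the class1 examples
```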
16. Properties of Training Examples
• Training dataset: well curated and free from noise
• Features: fixed-length patterns
Example: the protein sequence MKWMPFVGTMPLVQTKSITDLCAPLC is decomposed into overlapping dipeptides (MK, KW, WM, MP, ...), whose frequencies form a fixed-length feature matrix.
[Table: illustrative matrix of pattern frequencies]
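A fixed-length feature vector of this kind can be computed by counting overlapping dipeptides; a minimal sketch (the 400-dimensional layout over the 20 standard amino acids is a common convention, not prescribed by the slide):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dipeptide_frequencies(seq):
    """Fixed-length feature vector: frequency of each of the 400 dipeptides."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        dp = seq[i:i + 2]
        if dp in counts:          # skip pairs with non-standard residues
            counts[dp] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in pairs]

# The slide's example sequence yields a 400-dimensional vector,
# regardless of the sequence's length.
features = dipeptide_frequencies("MKWMPFVGTMPLVQTKSITDLCAPLC")
print(len(features))
```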
17. Support Vector Machines (SVM)
[Plot: two classes (Class 1, Class 2) in feature space (X1, X2)]
An SVM finds the maximal margin that separates the two classes
18. Support Vector Machines (SVM)
[Plots: data that is not linearly separable in (X1, X2) becomes separable after mapping into a higher-dimensional space (X1, X2, X3)]
Kernels shown: linear kernel; polynomial kernel, d = 2; Gaussian kernel, sigma = 1
(Ben-Hur, et al., 2008)
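A small sketch of the three kernels named on the slide, using scikit-learn's SVC on toy XOR-style data (the data and parameter values are illustrative):

```python
from sklearn.svm import SVC

# XOR-style toy data: not linearly separable in the original (X1, X2) space.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [1, 2, 2, 1] * 5

# The three kernels from the slide: linear, polynomial (d = 2), Gaussian.
scores = {}
for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2, "coef0": 1}),
                       ("rbf", {"gamma": 0.5})]:  # gamma = 1 / (2 * sigma^2)
    clf = SVC(kernel=kernel, **params).fit(X, y)
    scores[kernel] = clf.score(X, y)

# The non-linear kernels can separate the classes; the linear kernel cannot.
print(scores)
```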
19.
20. Carbohydrate Metabolism / Amino acid metabolism / Nucleotide metabolism (Models 1-3)
Feature extraction from the query sequence:
•Dipeptide frequency (AD: 0.08, RH: 0.02)
•Amino acid composition (A: 0.14, F: 0.05)
The features are mapped into a high-dimensional feature space; each model returns a prediction value (Prediction Value 1-3), and classification is based on the maximum prediction value.
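The one-model-per-class scheme above, with classification by maximum prediction value, might be sketched as follows (the features, data, and class separation are all hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the slide's three models; 3-D points clustered
# around one axis per class are hypothetical features.
classes = ["Carbohydrate", "AminoAcid", "Nucleotide"]
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(2 * np.eye(3)[i], 0.3, size=(20, 3))
               for i in range(3)])
y = np.repeat(classes, 20)

# One binary SVM per functional class ("Model 1" .. "Model 3").
models = {c: SVC(kernel="linear").fit(X, y == c) for c in classes}

def classify(query):
    """Classification based on the maximum prediction (decision) value."""
    values = {c: m.decision_function([query])[0] for c, m in models.items()}
    return max(values, key=values.get)

print(classify([0.0, 2.0, 0.0]))
```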
21. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors
Bhasin M and Raghava GPS, Nucl. Acids Res. 2004;32:W383-W389
G-protein coupled receptors (GPCRs) are important targets for drug design
Dipeptide frequency is used as the feature
Pipeline: protein sequence → SVM (is it a GPCR?) → GPCR recognition (99.5% accuracy) → SVMs for family prediction → SVMs for sub-family prediction
22. Hidden Markov Models (HMM)
• A powerful statistical tool widely used in modeling sequences
• Markov chains answer questions such as "What is the probability of rain today?" given the preceding states
• Profiles are built from multiple sequence alignments, e.g.:
AYTGGGTACC
AYT-GGTMCC
AYCGGG-MC-
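The weather question on the slide is answered by multiplying transition probabilities along a chain; a minimal sketch with made-up probabilities:

```python
# A minimal Markov chain, as in the slide's weather example.
# The transition and start probabilities are illustrative, not from the talk.
transition = {
    "Sunny": {"Sunny": 0.8, "Rain": 0.2},
    "Rain":  {"Sunny": 0.4, "Rain": 0.6},
}
start = {"Sunny": 0.7, "Rain": 0.3}

def sequence_probability(states):
    """P(s1, ..., sn) = P(s1) * product of P(s_i | s_{i-1})."""
    p = start[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

# 0.7 * 0.8 * 0.2
print(sequence_probability(["Sunny", "Sunny", "Rain"]))
```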
23. Carbohydrate Metabolism / Amino acid metabolism / Nucleotide metabolism
The query sequence is searched against an HMM profile database; prediction is based on the best profile match.
24. The Pfam Protein Families Database
• A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
• Identifying the domains that occur within proteins can provide insights into their function
Steps used for building Pfam:
A manually curated collection of protein families (3,071 families)
Each curated family is represented by a seed and a full alignment
HMM profiles are built using HMMER 3.0
Widely used for identification of protein structure and function
Marco Punta et al., Nucleic Acids Res, 2011
25. Naive Bayes Classifier
1. A simple probabilistic classifier based on Bayes' theorem
2. The goal is to determine the most probable hypothesis, combining the prior probability of a class with the likelihood of the data X given that class
[Plot: two classes (Class 1, Class 2) in feature space (X1, X2)]
Kohenen J. et al. In Silico Biol 2009;9(1-2):23-34
26. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy
Qiong Wang et al., Appl Environ Microbiol, 2007
Algorithm: word sizes between 6 and 9 bases; e.g. the read AUGCGUCAGCUCGAUCGAUCUA yields the overlapping words AUGCGUCA, UGCGUCAG, GCGUCAGC, CGUCAGCU, ...
Word-specific priors: P_i = [n(w_i) + 0.5] / (N + 1)
Genus-specific conditional probabilities: P(w_i|G) = [m(w_i) + P_i] / (M + 1)
Naive Bayesian assignment: P(G|S) = P(S|G) * P(G) / P(S)
Bootstrap confidence estimation for each query sequence
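The formulas above can be turned into a toy word-based classifier; this sketch follows the slide's equations, but the log-space scoring and the two reference sequences are illustrative additions, and the bootstrap step is omitted:

```python
from math import log

def words(seq, k=8):
    """Overlapping k-base words of a read (word sizes of 6-9 are typical)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(refs, k=8):
    """refs: {genus: [sequences]}. Returns word priors and per-genus counts."""
    all_seqs = [s for seqs in refs.values() for s in seqs]
    N = len(all_seqs)
    n = {}                                   # n(wi): sequences containing wi
    for s in all_seqs:
        for w in words(s, k):
            n[w] = n.get(w, 0) + 1
    prior = {w: (c + 0.5) / (N + 1) for w, c in n.items()}   # Pi
    per_genus = {}
    for g, seqs in refs.items():
        m = {}                               # m(wi) within this genus
        for s in seqs:
            for w in words(s, k):
                m[w] = m.get(w, 0) + 1
        per_genus[g] = (m, len(seqs))        # (counts, M)
    return prior, per_genus

def classify(query, prior, per_genus, k=8):
    """Assign the genus maximizing sum of log P(wi|G) over the query's words."""
    scores = {}
    for g, (m, M) in per_genus.items():
        scores[g] = sum(log((m.get(w, 0) + prior[w]) / (M + 1))
                        for w in words(query, k) if w in prior)
    return max(scores, key=scores.get)

refs = {"GenusA": ["AUGCGUCAGCUCGAUCGAUCUA"],
        "GenusB": ["UUUUAAAACCCCGGGGUUUUAA"]}
prior, per_genus = train(refs)
print(classify("AUGCGUCAGCUC", prior, per_genus))
```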
27. Classification and Regression Trees (CART)
[Diagram: a decision tree of binary splits (X1 > C1? X2 > C2? X1 > C3? X2 > C4?), each leaf assigning class 1 or 2, partitioning the (X1, X2) plane into rectangular regions at the cut-offs C1-C4]
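A minimal CART sketch with scikit-learn, echoing the slide's recursive binary splits (the points and labels are invented):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy (X1, X2) points labelled with classes 1 and 2; values are illustrative.
X = [[1, 1], [1, 3], [3, 1], [3, 3], [5, 1], [5, 3]]
y = [1, 2, 2, 1, 1, 2]

# CART: recursive binary partitioning, one variable and cut-off per node.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The fitted rules mirror the slide's "X1 > C1? ... X2 > C2?" diagram.
print(export_text(tree, feature_names=["X1", "X2"]))
```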
28. Random Forest
• A collection of unpruned CARTs (Tree 1, Tree 2, Tree 3, ...)
• Bagging avoids overfitting
• Improves prediction accuracy
• Encourages diversity among the trees
Svetnik V et al., J Chem Inf Comput Sci 2003 Nov-Dec;43(6):1947-58
29. Features of Random Forest
o A cross-validation procedure is built into random forest, as each tree in the forest has its own training data (a bootstrap sample) and test data (the out-of-bag, OOB, data)
o The OOB error rate gives the overall percentage of misclassification
o It also ranks the features by their importance for the classification
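The OOB machinery described above is exposed directly by scikit-learn's random forest; a sketch on synthetic data standing in for sequence-derived features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for sequence features: 200 samples, 10 features,
# of which only 3 carry class signal.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# oob_score=True: every tree is evaluated on the samples left out of its own
# bootstrap sample, giving the built-in cross validation described above.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))
print("Top features:", rf.feature_importances_.argsort()[::-1][:3])
```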
30. MODEL
[Diagram: three trees (t1, t2, t3), each recursively splitting the training classes A, B, C]
Feature extraction from the query sequence:
•Dipeptide frequency (AD: 0.08, RH: 0.02)
•Amino acid composition (A: 0.14, F: 0.05)
Classes: Carbohydrate Metabolism, Amino acid metabolism, Nucleotide metabolism
The query sequence is classified based on the majority of votes across the trees.
31. Prediction of Protein-RNA Binding Sites Using Random Forest
Zhi Ping Liu et al., Bioinformatics, 2010
• Protein-RNA interactions play a key role in a number of biological processes
Dataset:
339 protein-RNA complexes from RsiteDB
ENTANGLE was used to define the interaction sites between protein chains and RNA
Features:
Interaction propensity, hydrophobicity, relative accessible surface area, secondary structure, conservation score, and side-chain environment
32. Machine Learning Methods Are Becoming Popular for Biological Data Analysis
[Charts: number of PubMed publications per year for SVM (1976-2013), HMM (1976-2013), and Random Forest (2003-2013), each rising toward ~700 publications per year]
http://www.ncbi.nlm.nih.gov/pubmed
33. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
34. Implementation of Machine Learning for the Analysis of Metagenomic Data in My Recent Projects
A fast and accurate functional classifier of genomic and metagenomic sequences

35. METHODOLOGY: the eggNOG database was used
Routine task for metagenomic analysis: sequencing, assembly and ORF prediction of the metagenome (ORF1, ORF2, ...)
Each ORF receives a functional class and a functional annotation, e.g.:
Class O (Cellular Processes and Signaling): serine-type endopeptidase
Class J (Information Storage and Processing): tRNA synthetase
2.3 million sequences were divided into 22 functional classes
Dipeptide frequencies were used as input features for optimization and training of the Random Forest
The final Random Forest model was integrated with RAPSearch2
Manuscript submitted, 2014
36. Stand-alone Server
Query sequence (genomic or metagenomic) → Random Forest → functional class prediction → RAPSearch → functional annotation
37. A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets

38. METHODOLOGY: the Greengenes database was used
Routine task for metagenomic analysis: 16S rRNA is the marker gene used to identify microbial species; either the hypervariable regions (HVRs) or the complete 16S gene is sequenced from the metagenome and taxonomically classified
Sequences for the hypervariable regions were extracted and grouped according to their taxonomic information
4-mer nucleotide composition was used as the input feature for training and optimization of the RF
Sequences discarded during clustering and real metagenomic 16S sequences were used for testing
Manuscript submitted, 2014
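The 4-mer nucleotide composition used here can be computed as a 256-dimensional frequency vector; a minimal sketch (the read is invented):

```python
from itertools import product

BASES = "ACGT"

def tetramer_composition(seq):
    """256-dimensional 4-mer frequency vector for a DNA read."""
    kmers = ["".join(p) for p in product(BASES, repeat=4)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - 3):
        w = seq[i:i + 4]
        if w in counts:           # skip windows with ambiguous bases
            counts[w] += 1
    total = max(len(seq) - 3, 1)
    return [counts[k] / total for k in kmers]

# Any read, whatever its length, maps to the same fixed-length vector,
# suitable as Random Forest input.
vec = tetramer_composition("ACGTACGTGGCCTTAA")
print(len(vec))
```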
39.
40. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
41. Future Directions
• Analysis of metagenomic data generated from
the laboratory projects
• Implementation of machine learning in the
analyses of metagenomic data
• Metabolic pathway analysis and reconstruction
42. Acknowledgement
•Thesis Supervisor : Dr. Vineet Sharma
•Lab Members:
•Dr. Sanjiv Kumar
•Darshan Dhakan
•Ankit Gupta
•Rituja Saxena
•Parul Milttal
•Vishnu Prasoodanan
•Harish K
•Nikhil Chuadhary
•IISER Bhopal for providing the fellowship for doctoral
research
Editor's Notes
The history of genomics started when DNA was first isolated by Miescher; a few years later the term "genome" was introduced by Winkler. As you all know, a genome is an organism's complete set of DNA, which contains its hereditary information.
The history of modern genomics began in the 1970s, when Sanger first reported his method for determining the order of nucleotides in DNA using chain-terminating nucleotide analogues.
In 1982 the first bacteriophage genome, around 48 kb in size, was determined using the shotgun digest method, before the first automated DNA sequencer appeared. From the sequence, reading frames for 46 genes were clearly identified.
After a long wait following the bacteriophage genome, the first free-living bacterium was sequenced in 1995; its complete genome was around 1,800 kb. This came more than 100 years after DNA was first isolated. The sequencing of H. influenzae gave new direction to genome sequencing, and at the same time several large-scale genome projects for higher eukaryotes were started and completed.
Only six years later, in 2001, the sequencing of the human genome was completed; it was the largest scientific effort in the history of mankind. The human genome contains 2.91 billion base pairs, with an estimated 30,000 to 40,000 genes. It was 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. The project cost around 3 billion US dollars and took 10 years to complete.
Sanger sequencing was used for all genomes from lower to higher eukaryotes, and its high time and cost were the major bottleneck in genome analysis. After the successful completion of the Human Genome Project, several new approaches reached the market, starting a new era of sequencing.
* A bacteriophage is a bacterial virus that infects bacterial species such as Escherichia coli (E. coli). *
NGS brought a spike in genomic analysis by speeding up sequencing and reducing its cost.
The majority of the microbes on Earth are still unknown because most microbes cannot be cultured.
Only a very small fraction of the microbes found in nature have been grown in pure culture, so we lack a comprehensive view of the genetic diversity present on the Earth's surface.
An approach to this problem has emerged, called metagenomics or environmental genomics.
This study shed new light on the diversity of life on Earth. A total of more than 1.6 Gb of sequence from Sargasso Sea samples yielded 1.2 million previously unknown gene sequences. Before the analysis of the Sargasso Sea data, the NCBI non-redundant amino acid (nr) dataset contained some 1.5 million peptides, about 630,000 of which were classified as bacterial.
The sizes of the genomic and metagenomic databases increase rapidly; this large amount of data requires efficient tools for fast and accurate analysis.
For both genomic and metagenomic analyses, the common steps after sequencing are:
Assembly
Annotation
Comparison
3. I will not talk much about genomic analysis here; I will mainly talk about the metagenomic analysis of high-throughput data.
Any metagenomic analysis mainly focuses on two questions. The first is "who is out there?": the prime focus is to find which organisms are present in the specific environment, in what proportion each species occurs, which organisms dominate, and whether the dominance of particular organisms makes that environment unique and differentiates it from others.
The second question focuses on the functional part: which genes are present in that environment, what functions they perform, and in which pathways they occur. Using the abundances of functions and pathways, we try to find the unique functions and unique pathways present in the environment of our choice.
To answer these two questions, sequencing is followed by sequential analysis steps, discussed in the next slide.
Together, these analyses give a new understanding of the numbers and abundance of a microbial community and of how these parameters change in response to external stimuli.
In the study of complex communities, it is often necessary to ask how much sequence is enough to understand a community and to carry out comparative analyses of related communities. In many cases this information can be obtained with methods based on the 16S rRNA sequence, which can reveal a tremendous amount of information about microbial diversity and abundance.
Metagenomics projects differ from traditional microbial sequencing projects in many respects.
An SVM finds the maximal margin separating the two classes and then outputs the hyperplane separator at the center of the margin.
When the data are not linearly separable, all the points are projected into a higher dimension using a mapping, in the hope that separability will improve; this mapping is called a kernel function.
Accurately identifying genes from metagenomic fragments is one of the most fundamental issues. I am discussing here a novel gene prediction method for metagenomic fragments, MetaGUN, based on the SVM machine learning approach.
π(F) represents the vector of initial probabilities.
The quality of the seed alignment is the crucial factor determining the quality of the Pfam resource, influencing not only all data generated within the database but also the outcome of external searches that use its profile HMMs.
With strong independence assumptions between predictors.
Bayesian decision theory came long before; it was studied in statistical theory and, more specifically, in the field of pattern recognition.
The most probable hypothesis: for given data d plus initial knowledge about the prior probabilities; the prior probabilities reflect background knowledge.
P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
The Ribosomal Database Project (RDP) Classifier, a naive Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the higher taxonomy.
2. It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment.
3. Type sequences follow Bergey's taxonomy (average sequence length 1,460 bases, ranging from 1,200 to 1,833 bases).
4. Complete rRNA database sequences follow NCBI's taxonomy; near-full-length (1,200 bases) 16S rRNA sequences were obtained, and the taxonomic information for these databases was obtained from GenBank.
5. Let n(wi) be the number of sequences containing subsequence wi.
6. The conditional probability that a member of G contains wi was estimated with the equation above.
7. The probability that an unknown query sequence S is a member of genus G follows from Bayes' rule, where P(G) is the prior probability of a sequence being a member of G and P(S) is the overall probability of observing sequence S from any genus.
8. Overall classification accuracy was assessed by query size.
In the standard classification situation we have observations from two or more known classes and want to develop a rule for assigning current and new observations in to classes using numerical and/or predictor variables.
Classification trees build these rules by recursive binary partitioning in to the regions that are increasingly homogenous with the respect to the class variable.
These homogenous regions are called nodes. At each step in the fitting of the classification trees a optimization is carried out to select a node, particular variable cut off or group of codes. That results in homogenous subgroup of the data.
Root node: entry point for the collection of the data. Inner node: a que is asked about the data and one child node for per possible answer. Leaf node correspond to the decision to take if reached.
RF uses an ensemble of decision trees built from the samples, their class designations, and the predictor variables, since results from ensemble models are generally much better than those of a single model. Basically, each tree is created according to the implemented algorithm, and if pruning is enabled, an additional step looks at which nodes/branches can be removed without affecting performance too much.
2. There are two kinds of ensembles: bagging and boosting. In bagging we do not look back at the earlier trees, while boosting considers the earlier trees and strives to compensate for their weaknesses (which can lead to overfitting of the data). RF is an example of a bagging method.
3. RF is among the most popular machine learning methods nowadays because: 1. it is a versatile classification algorithm suitable for the analysis of large data sets; 2. it gives high prediction accuracy and provides information on variable importance; 3. it is effective, fast, and easy to use.
Algorithm
1. Bootstrapping is used to grow the classification trees in the forest. If there are N samples, N cases are drawn at random with replacement from the original data (covering about 2/3 of the distinct samples) and used for training; the remaining ~1/3, called the out-of-bag (OOB) samples, are used for prediction. The resulting error rate is called the out-of-bag error rate.
2. If there are M input variables, a number m << M is specified such that at each node m variables are selected at random and evaluated for their ability to split the data. The variable producing the largest decrease in impurity is chosen to separate the samples at each parent node. The impurity measure here is the Gini impurity; a decrease in Gini impurity corresponds to an increase in the amount of order in the sample classes introduced by a split.
3. Random selection of the variables for splitting ensures low correlation between the trees and prevents overfitting of the random forest model.
Last point: every classification tree in the forest casts an unweighted vote for a sample, and the majority of votes determines the class of the sample. A single tree in the forest is a weak classifier because it is trained on a subset of the data; that is why the combined contribution of all trees in the forest yields a strong classifier.
4. The training process is complete when the forest is fully grown; the trained model can then be used to predict the classes of unknown samples.
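Steps 1–4 can be sketched as a toy forest in plain Python. To keep it short, each "tree" here is a single-split stump rather than a full tree, which is a simplification of a real random forest; bootstrap sampling, random feature subsets (m of M), unweighted majority voting, and the OOB error rate are all present. The function names are choices made for this sketch.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(X, y, feat_subset):
    """One weak tree: among a random subset of features, pick the
    (feature, threshold) split with the largest Gini decrease."""
    best, best_gain, parent = None, 0.0, gini(y)
    for f in feat_subset:
        for t in sorted({row[f] for row in X}):
            li = [y[i] for i, r in enumerate(X) if r[f] <= t]
            ri = [y[i] for i, r in enumerate(X) if r[f] > t]
            if not li or not ri:
                continue
            gain = parent - (len(li) * gini(li) + len(ri) * gini(ri)) / len(y)
            if gain > best_gain:
                labels = (Counter(li).most_common(1)[0][0],
                          Counter(ri).most_common(1)[0][0])
                best, best_gain = (f, t, labels), gain
    if best is None:                       # no useful split: constant stump
        c = Counter(y).most_common(1)[0][0]
        best = (0, float("inf"), (c, c))
    return best

def stump_predict(stump, row):
    f, t, (left, right) = stump
    return left if row[f] <= t else right

def fit_forest(X, y, n_trees=25, m=1, seed=0):
    """Bagging: each tree is trained on a bootstrap sample (drawn with
    replacement) and considers only m randomly chosen features."""
    rng = random.Random(seed)
    n, M = len(X), len(X[0])
    forest, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(range(M), m)                 # m << M features
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        forest.append(stump)
        for i in set(range(n)) - set(idx):              # out-of-bag cases
            oob_votes[i][stump_predict(stump, X[i])] += 1
    labeled = [(i, v) for i, v in enumerate(oob_votes) if v]
    oob_err = sum(v.most_common(1)[0][0] != y[i]
                  for i, v in labeled) / max(len(labeled), 1)
    return forest, oob_err

def forest_predict(forest, row):
    """Each tree casts one unweighted vote; the majority wins."""
    return Counter(stump_predict(s, row) for s in forest).most_common(1)[0][0]
```

The OOB error comes for free: each sample is scored only by the trees that never saw it during training, which is the internal cross-validation mentioned below.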
The expected error rate of the classification of a new sample by a classifier is estimated by a cross-validation procedure, such as leave-one-out or k-fold cross-validation, or by aggregating the OOB error rate from all trees.
In addition to this internal cross-validation, RF also calculates variable importance.
Two measures are used. For the Gini importance of a variable, the decreases in Gini impurity are summed over every node in the forest at which that variable is used for splitting. For the permutation importance, the values of a predictor variable are randomly shuffled to break the association between the response and the predictor.
To calculate the permutation variable importance, the prediction accuracy after permutation is subtracted from the prediction accuracy before permutation, and the difference is averaged over all trees in the forest to give the permutation importance value.
If the predictor never had any meaningful association with the response, shuffling its values will produce little or no change in model accuracy; on the other hand, if the predictor was strongly correlated with the response, the permutation should produce a large drop in accuracy.
These two measures help to find the variables most strongly related to the response and to identify a small number of variables sufficient for good prediction.
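The permutation importance calculation can be sketched generically in Python: given any fitted prediction function, shuffle one predictor column at a time and record the average drop in accuracy. This simplified sketch scores a single model on one data set rather than averaging per-tree as described above; the names `accuracy` and `permutation_importance` are choices made here.

```python
import random

def accuracy(predict_fn, X, y):
    """Fraction of rows whose predicted class matches the response."""
    return sum(predict_fn(r) == c for r, c in zip(X, y)) / len(y)

def permutation_importance(predict_fn, X, y, n_repeats=10, seed=0):
    """For each predictor, shuffle its column to break the association
    with the response, and record the average accuracy drop:
    importance = accuracy_before - accuracy_after_permutation."""
    rng = random.Random(seed)
    base = accuracy(predict_fn, X, y)
    importances = []
    for f in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[f] for row in X]
            rng.shuffle(col)                       # permute one column only
            Xp = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(predict_fn, Xp, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A predictor the model ignores gets an importance of exactly zero, since permuting it cannot change any prediction, while a predictor the model relies on shows a clear accuracy drop.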
An example of predicting RNA binding sites. (a) Actual interface residues with RNA in protein 1R3E:A. (b) Predictions are mapped onto the original structure, where different prediction categories are represented by different colors. (c) Structure of the protein–RNA complex with an example of prediction in the zoomed part. (d) Mutual interaction propensity between the triplets and nucleotides in the protein. Triplets are listed by sliding residues through the protein sequence. The box part corresponds to the values of residues in the zoomed part of (c). (e) Upper panel shows the interface propensity of each amino acid type in the dataset. It is defined as the proportion of an amino acid in interaction sites divided by the proportion of the residue in the dataset (see more in Supplementary Materials). Lower panel shows the interface propensity of binding with RNA for the residues in the protein. The box part corresponds to the values of the zoomed sites.
PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
Example: a PubMed search for “naive bayes” returns 590 results.