Beiko hpcs

(an example of)
Computing the
Microbial World
Rob Beiko
June 25, 2014

Siddique et al. (2014) Front Microbiol

Lawley et al., PLoS Genet (2012)

The Breakfast Organisms
"Bacon Fields" Author: Michael DeForge

240M “pieces”, each 150 nucleotides long
3.6 x 10^10 nucleotides
~40 GB
Hundreds of “species”
Genomes between 1.5M – 6M nucleotides
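
The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: the one-byte-per-nucleotide storage estimate is an assumption not stated on the slide, and real sequence files also carry headers and quality scores, which is one plausible reason the quoted size is larger.

```python
# Figures quoted on the slide.
n_reads = 240_000_000   # 240M sequenced "pieces"
read_length = 150       # nucleotides per piece

total_nt = n_reads * read_length   # 36,000,000,000 = 3.6 x 10^10
print(f"{total_nt:.1e} nucleotides")

# Assumption (not from the slide): one byte per nucleotide,
# with no headers or quality scores.
raw_gb = total_nt / 1e9
print(f"about {raw_gb:.0f} GB of raw sequence")
```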

150 nt x 150 nt
We know this And this
But not this

Who is doing what?
Marker genes: WHO
Environmental “shotgun”: WHAT
The challenge of
METAGENOME CLASSIFICATION

Clues – Sequence similarity
(homology)
150 nt x 150 nt
Reference
genes
Take the WHOLE SEQUENCE
Best
Worst

Clues – composition
150 nt x 150 nt
Reference
genome
k-mer profiles
Genome #1:
20% G & C
30% A & T
Genome #2:
24% G & C
26% A & T
Best
Worst
Take a
K-MER FREQUENCY
DECOMPOSITION
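
A k-mer frequency decomposition can be sketched as below. This is a minimal illustration, not the code used in the talk: the function names, the default k, and the Manhattan distance used to compare profiles are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return relative frequencies of all 4**k k-mers over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return {km: 0.0 for km in kmers}
    return {km: counts[km] / total for km in kmers}


def profile_distance(p, q):
    """Manhattan distance between two profiles (hypothetical choice of metric)."""
    return sum(abs(p[km] - q[km]) for km in p)


read_profile = kmer_profile("GGCTGGACCA", k=2)
```

With k = 1 the profile reduces to the base composition shown for Genome #1 and Genome #2; larger k gives a richer signature at the cost of sparser counts on short reads.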

Homology >> Composition
* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
But homology evidence can
mislead or be absent
Homology + Composition >
Homology alone
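
The three sequences on this slide can be compared position by position. The sketch below assumes an ungapped comparison of equal-length sequences; real homology searches use alignments that allow gaps, so this only illustrates the ranking idea.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


query = "GGCTGGACCA"                                   # the "*" sequence
candidates = {"1": "GACTGGACCA", "2": "GGCCGGACTA"}

# Fewest mismatches first.
ranked = sorted(candidates, key=lambda name: mismatches(query, candidates[name]))
for name in ranked:
    print(name, mismatches(query, candidates[name]))
```

Sequence 1 differs from the query at one position and sequence 2 at two, so sequence 1 is the better match by similarity alone.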

Query: GGCTGGACCA
Subject: GCCTGGTCCA, GCCAGGTGCA, GCCTGTCCA, plus many other database sequences (shown as NNNNNNNNNN)
Exact string search? NO
BLAST? OK, but SLOW!

A compromise: UBLAST
• BLAST seeks out very similar “anchor points”
between a pair of sequences before doing a more
thorough search
• Typically, a query is compared against all candidate DB
sequences, but most will return no hits
UBLAST:
(1) Query, DB sequences
(2) k-mer table
(3) Rank DB based on k-mer matching
(4) Do detailed search until there is no more point
[Figure: the query GGCTGGACCA is compared against a database in which GCCTGGTCCA, GCCAGGTGCA and GCCTGTCCA are scattered among many other sequences (shown as NNNNNNNNNN); after ranking, those three sequences sit at the top of the list.]
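
The four steps can be sketched in a few functions. This is an illustration of the idea on the slide, not the UBLAST implementation: the ungapped identity score stands in for the detailed search, and the names and parameters (`k`, `min_identity`, `max_rejects`) are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_table(db, k):
    """Step 2: map each k-mer to the database sequences that contain it."""
    table = defaultdict(set)
    for name, seq in db.items():
        for km in kmers(seq, k):
            table[km].add(name)
    return table


def rank_candidates(query, table, k):
    """Step 3: order database sequences by number of k-mers shared with the query."""
    shared = defaultdict(int)
    for km in kmers(query, k):
        for name in table.get(km, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))


def identity(a, b):
    """Stand-in for the detailed search: ungapped identity over the shorter length."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def search(query, db, table, k=3, min_identity=0.7, max_rejects=2):
    """Step 4: check candidates in rank order; stop once max_rejects have failed."""
    hits, rejects = [], 0
    for name in rank_candidates(query, table, k):
        score = identity(query, db[name])
        if score >= min_identity:
            hits.append((name, score))
        else:
            rejects += 1
            if rejects >= max_rejects:
                break
    return hits


# Step 1: query and database sequences (taken from the slide).
db = {"s1": "GCCTGGTCCA", "s2": "GCCAGGTGCA", "s3": "GCCTGTCCA", "s4": "NNNNNNNNNN"}
query = "GGCTGGACCA"
table = build_kmer_table(db, k=3)
ranking = rank_candidates(query, table, k=3)
hits = search(query, db, table, k=3)
```

Sequences that share no k-mer with the query never appear in the ranking, so the expensive comparison is never run on them.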

Compositional models
• Interpolated Markov models: adaptively generate
frequency models based on extending k-mers with
sufficiently high frequencies
• One model per genome
• Evaluate probability of each k-mer in query sequence,
given shorter k-mers in sequence
• Model construction can take a while
k = 4 k = 5 k = 6 k = 7
PhymmBL: Brady and Salzberg (2009) Nat Methods
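
The idea of relying on longer k-mers only where they are frequent enough can be sketched as follows. This is a deliberately simplified stand-in, not the interpolated Markov model used by PhymmBL: the count-based weighting, the `threshold` value, and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def train(genome, max_order):
    """Count next-base occurrences for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome)):
        for order in range(max_order + 1):
            if i - order < 0:
                break
            counts[genome[i - order:i]][genome[i]] += 1
    return counts


def base_probability(counts, context, base, threshold=20):
    """Blend estimates from short to long contexts.

    A longer context gets full weight only once it has been seen at least
    `threshold` times (hypothetical rule); otherwise the estimate falls back
    partly on the shorter context.
    """
    prob = 0.25  # uniform starting point over A, C, G, T
    for order in range(len(context) + 1):
        ctx = context[len(context) - order:]
        seen = sum(counts[ctx].values()) if ctx in counts else 0
        if seen == 0:
            break
        weight = min(1.0, seen / threshold)
        estimate = (counts[ctx].get(base, 0) + 1) / (seen + 4)  # add-one smoothing
        prob = weight * estimate + (1 - weight) * prob
    return prob


def log_likelihood(counts, read, max_order):
    """Sum of log probabilities of each base given its preceding context."""
    total = 0.0
    for i, base in enumerate(read):
        context = read[max(0, i - max_order):i]
        total += math.log(base_probability(counts, context, base))
    return total


# One model per genome (toy genomes for illustration only).
model_a = train("ACGT" * 50, max_order=3)
model_b = train("AATT" * 50, max_order=3)
read = "ACGTACGT"
```

A read is assigned to the genome whose model gives it the highest log-likelihood; training touches every position at every order, which is part of why model construction is slower than simple k-mer counting.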

An alternative: Naïve Bayes
• Just compute the frequency of each k-mer for a fixed
length k
• Build one frequency model for each genome
• FAST
• Assumes conditional independence – may not matter
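P(F | G_i) = \prod_{w \in F} P(w | G_i)
where P(F | G_i) is the probability of a query fragment F originating from genome G_i, the product runs over all k-mers w in the fragment, and P(w | G_i) is the frequency of that k-mer in G_i.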
Parks et al. (2011) BMC Bioinformatics
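
A fixed-k Naïve Bayes classifier of this kind fits in a few lines. This is a minimal sketch rather than the published implementation: the pseudocount, the use of log probabilities to avoid underflow, and the implicit uniform prior over genomes are assumptions made for the example.

```python
import math
from collections import Counter


def train_model(genome, k):
    """One frequency model per genome: k-mer counts and their total."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return counts, sum(counts.values())


def log_prob(fragment, model, k, pseudocount=1.0):
    """Sum of log k-mer frequencies, treating the k-mers as independent."""
    counts, total = model
    n_possible = 4 ** k
    score = 0.0
    for i in range(len(fragment) - k + 1):
        kmer = fragment[i:i + k]
        # Pseudocount keeps unseen k-mers from giving log(0).
        freq = (counts[kmer] + pseudocount) / (total + pseudocount * n_possible)
        score += math.log(freq)
    return score


def classify(fragment, models, k):
    """Pick the genome with the highest log probability (uniform prior assumed)."""
    return max(models, key=lambda name: log_prob(fragment, models[name], k))


# Toy genomes for illustration only.
models = {
    "G1": train_model("ACGT" * 100, 3),
    "G2": train_model("AATT" * 100, 3),
}
best = classify("ACGTACGTAC", models, 3)
```

Training is a single counting pass per genome, which is where the speed advantage comes from; overlapping k-mers are clearly not independent, but the ranking of genomes is often unaffected.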

RITA: Rapid Identification of
Taxonomic Assignments
UBLAST filter
MacDonald et al. (2012) Nucleic Acids Res

Evaluation set
• “Fake metagenome”: take sequences from known
genomes, randomly sample fragments of 50, 100,
200 and 1000 nt in different trials
• Build reference models from other genomes – can
leave close relatives out of reference model
• Leave out other strains within the same species – not so
hard
• Leave out other classes in the same phylum - HARD
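
Building such an evaluation set can be sketched as below. The function names, the taxonomy dictionary layout, and the toy genomes are hypothetical; only the fragment lengths and the leave-out idea come from the slide.

```python
import random


def sample_fragments(genome, length, n, rng):
    """Draw n random fragments of a fixed length from a known genome."""
    if length > len(genome):
        raise ValueError("fragment length exceeds genome length")
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(n)]
    return [genome[s:s + length] for s in starts]


def leave_out(reference, taxonomy, query_name, rank):
    """Drop every reference genome sharing the query's taxon at the given rank.

    Leaving out the query's whole class, for example, forces the classifier
    to predict the phylum from genomes in other classes.
    """
    held_out = taxonomy[query_name][rank]
    return {
        name: seq
        for name, seq in reference.items()
        if taxonomy[name][rank] != held_out
    }


# Toy data for illustration only.
reference = {"g1": "ACGT" * 300, "g2": "AATT" * 300, "g3": "GGCC" * 300}
taxonomy = {
    "g1": {"phylum": "P1", "class": "C1"},
    "g2": {"phylum": "P1", "class": "C1"},
    "g3": {"phylum": "P1", "class": "C2"},
}

rng = random.Random(0)  # fixed seed so trials are repeatable
fragments = sample_fragments(reference["g1"], 100, 5, rng)
training_set = leave_out(reference, taxonomy, "g1", "class")
```

Repeating the sampling with lengths of 50, 100, 200 and 1000 nt gives the separate trials described above.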

But does it work?
[Figure: results plotted against DNA sequence length. Series: Full RITA; Best class (homology and composition agree). Panels: Predicting genus from different species; Predicting phylum from different class.]

Conclusions
• Careful attention needs to be paid to the choice of
approach – simple is better
• RITA illustrates two key points in (microbial)
bioinformatics:
1. Homology: How heuristic are you willing to go?
2. Naïve Bayes: Keep it simple until told otherwise
• Technological change means that many bioinformatics
algorithms will be irrelevant in 5 years
