Hadoop for Bioinformatics
Deepak Singh
Amazon Web Services
Hadoop World, NYC
Via Reavel under a CC-BY-NC-ND license
By ~Prescott under a CC-BY-NC license
data sets
many data sets
PFAM PDB
GENBANK ENSEMBL
Many Others
manageable
Image: Matt Wood
Human genome
Image: Matt Wood
Image: Matt Wood
~100 TB/Week
>2 PB/Year
Image: Matt Wood
years
days
hours
gigabytes
terabytes
petabytes
really fast
typical informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
killer app
Via Argonne National Labs under a CC-BY-SA license
Via asklar under a CC-BY license
Image: Chris Dagdigian
rethink algorithms
rethink computing
rethink data management
rethink data sharing
operational mindset
scalability
we are data geeks not data center geeks
two key trends
develop applications
distribute applications
use applications
some work
filters
some work
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
• Read Mapping
• Mapping & SNP Discovery
• De novo Genome Assembly
Short Read Mapping
Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008)
African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)
Alignment: > 10,000 CPU hrs
Seed & Extend
Good alignments must have a significant exact alignment
Minimal exact alignment length = l/(k+1), for a read of length l with at most k differences
Expensive to scale
Need parallelization framework
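The seed length bound above follows from the pigeonhole principle: if a read of length l is cut into k+1 non-overlapping pieces and the alignment has at most k differences, at least one piece must match exactly. A minimal Python sketch of that bound is given here; the function names are hypothetical and are not taken from CloudBurst or any other tool.

```python
def min_exact_seed_length(read_length: int, max_differences: int) -> int:
    """Pigeonhole bound l // (k + 1) on the longest guaranteed exact match."""
    if read_length <= 0 or max_differences < 0:
        raise ValueError("read_length must be positive and max_differences non-negative")
    return read_length // (max_differences + 1)


def exact_seeds(read: str, max_differences: int) -> list[tuple[int, str]]:
    """Cut a read into k + 1 non-overlapping seeds of the guaranteed length.

    With at most k differences spread over k + 1 disjoint seeds,
    at least one seed is untouched and can be looked up exactly.
    """
    seed_length = min_exact_seed_length(len(read), max_differences)
    if seed_length == 0:
        raise ValueError("read is too short for this many differences")
    return [
        (i * seed_length, read[i * seed_length:(i + 1) * seed_length])
        for i in range(max_differences + 1)
    ]


if __name__ == "__main__":
    # A 36 bp read with up to 3 differences has an exact match of at least 9 bp.
    print(min_exact_seed_length(36, 3))
    print(exact_seeds("ACGTACGTACGT", 2))
```

Each seed would then be matched exactly against the reference and the hit extended to a full alignment; the extension step is what becomes expensive at scale.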
CloudBurst efficiently reports every k-difference alignment of every read
Many applications only need the best alignment
Bowtie: Ultrafast short read aligner
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10(3): R25.
SOAPsnp: Consensus alignment and SNP calling
Crossbow: Rapid whole genome SNP analysis
Ben Langmead
Preprocessed reads
Map: Bowtie
Sort: Bin and partition
Reduce: SOAPsnp
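The map, sort, and reduce stages above can be illustrated with a toy, single-process Python sketch. This is not Crossbow's code: the brute-force aligner stands in for Bowtie, the majority-vote caller stands in for SOAPsnp, and the reference sequence, reads, function names, and thresholds are all hypothetical.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical toy reference; a real run would use a whole genome.
REFERENCE = {"chr1": "ACGTTAGCCATG"}


def map_read(read, max_mismatches=1):
    """Map stage (stand-in for Bowtie): emit ((chrom, pos), base) per aligned base."""
    best = None
    for chrom, ref in REFERENCE.items():
        for start in range(len(ref) - len(read) + 1):
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if best is None or mismatches < best[0]:
                best = (mismatches, chrom, start)
    if best is not None and best[0] <= max_mismatches:
        _, chrom, start = best
        for offset, base in enumerate(read):
            yield (chrom, start + offset), base


def reduce_position(key, bases, min_depth=2):
    """Reduce stage (stand-in for SOAPsnp): majority-vote consensus per position."""
    chrom, pos = key
    consensus, depth = Counter(bases).most_common(1)[0]
    ref_base = REFERENCE[chrom][pos]
    if consensus != ref_base and depth >= min_depth:
        yield (chrom, pos, ref_base, consensus)


def run_pipeline(reads):
    pairs = [kv for read in reads for kv in map_read(read)]
    pairs.sort(key=itemgetter(0))  # sort stage: bin by genomic position
    snps = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        snps.extend(reduce_position(key, [base for _, base in group]))
    return snps


if __name__ == "__main__":
    print(run_pipeline(["ACGTTA", "GTTCGC", "TTCGCC", "TCGCCA"]))
```

In a Hadoop deployment the sort step is handled by the framework's shuffle, so only the map and reduce functions need to be supplied.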
Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster.
Comparing Genomes
Estimating relative evolutionary rates from sequence comparisons:
Identification of probable orthologs
Admissible comparisons: A or B vs. D; C vs. E
Inadmissible comparisons: A or B vs. E; C vs. D
[Figure: species tree and gene tree for genes A, B, C, D, E; S. cerevisiae and C. elegans]
Estimating relative evolutionary rates from sequence comparisons:
1. Orthologs found using the Reciprocal Smallest Distance (RSD) algorithm
2. Build alignment between two orthologs
>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
3. Estimate distance given a substitution matrix
[Figure: substitution matrix over amino acids (Phe, Ala, Pro, Leu, Thr) with off-diagonal entries µπ]
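To make step 3 concrete, here is a simplified Python sketch that estimates a distance from a pairwise alignment. It deliberately swaps in a much simpler technique than the slide describes: a Poisson-corrected proportion of differing sites, d = -ln(1 - p), rather than a likelihood estimate under a substitution matrix. The function name and the short example sequences are hypothetical.

```python
import math


def poisson_distance(aligned_a: str, aligned_b: str) -> float:
    """Poisson-corrected distance between two aligned protein sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    p is the fraction of remaining columns that differ; d = -ln(1 - p).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    p = differing / compared
    if p >= 1.0:
        return math.inf  # every site differs; the correction is undefined
    return -math.log(1.0 - p)


if __name__ == "__main__":
    # Hypothetical short alignment, not the sequences shown above.
    print(poisson_distance("MSGR-TIL", "MSGKATIV"))
```

A matrix-based estimate weights each substitution by how likely it is, so conservative replacements count for less than radical ones; the sketch treats all differences equally.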
RSD algorithm summary
Genome I, Genome J
Forward: align sequence Ib against Genome J & calculate distances
  b vs. a: D = 0.2
  b vs. b: D = 0.3
  b vs. c: D = 0.1
Reverse: align sequence Jc against Genome I & calculate distances
  c vs. a: D = 1.2
  c vs. b: D = 0.1
  c vs. c: D = 0.9
Orthologs: ib - jc, D = 0.1
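The reciprocal check summarized above can be sketched in a few lines of Python. This is an illustration only, assuming distances are already available through a lookup function; it is not the Roundup implementation. The distances for Ib and Jc are taken from the summary figure as best it can be read, and the remaining four values are hypothetical fill-ins so the example runs.

```python
def closest(query, candidates, distance):
    """Return the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda c: distance(query, c))


def reciprocal_smallest_distance(genome_i, genome_j, distance):
    """Report (gene_i, gene_j, D) pairs that are each other's closest hit."""
    orthologs = []
    for gene_i in genome_i:
        best_j = closest(gene_i, genome_j, distance)   # forward search
        back_i = closest(best_j, genome_i, distance)   # reverse search
        if back_i == gene_i:
            orthologs.append((gene_i, best_j, distance(gene_i, best_j)))
    return orthologs


# Distances keyed by (gene in I, gene in J).
DISTANCES = {
    ("Ib", "Ja"): 0.2, ("Ib", "Jb"): 0.3, ("Ib", "Jc"): 0.1,  # from the figure
    ("Ia", "Jc"): 1.2, ("Ic", "Jc"): 0.9,                      # from the figure
    ("Ia", "Ja"): 0.4, ("Ia", "Jb"): 1.0,                      # hypothetical
    ("Ic", "Ja"): 1.1, ("Ic", "Jb"): 0.5,                      # hypothetical
}


def lookup(a, b):
    """Symmetric lookup so forward and reverse searches share one table."""
    return DISTANCES[(a, b)] if (a, b) in DISTANCES else DISTANCES[(b, a)]


if __name__ == "__main__":
    print(reciprocal_smallest_distance(["Ia", "Ib", "Ic"], ["Ja", "Jb", "Jc"], lookup))
```

Each gene in Genome I is an independent unit of work, which is what makes the all-against-all genome comparison a natural fit for a parallel framework.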
Prof. Dennis Wall
Harvard Medical School
Roundup is a database of orthologs and their evolutionary distances.
To get started, click browse. Alternatively, you can read our documentation here.
Good luck, researchers!