2014 khmer protocols

Making de novo assembly cheap & easy:
standardized protocols for mRNAseq and
metagenome assembly and analysis
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu

My lab’s focus
• De novo assembly and efficient/effective use of NGS, especially for non-model organisms.
• Open source software engineering.
• Training and education in NGS.

There is quite a bit of life left to sequence & assemble.

http://pacelab.colorado.edu/

Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.

So, we want to go from raw data:
@SRR606249.17/1    (name)
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1    (quality score)
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ

…to “assembled” original sequence.

UMD assembly primer (cbcb.umd.edu)
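
For readers new to the raw data format shown above: each read is a four-line FASTQ record (name, sequence, a "+" separator, and one quality character per base). A minimal Python sketch of reading such records follows; the names `FastqRecord` and `read_fastq` are illustrative and not part of khmer, and the sketch assumes the simple unwrapped four-line layout.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The two paired reads from the slide.
EXAMPLE = """@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
"""

records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    # "N" marks a base the sequencer could not call.
    print(record.name, "uncalled bases:", record.sequence.count("N"))
```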

Practical memory measurements

Velvet measurements (Adina Howe)

Shotgun sequencing & de novo assembly:
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness

Why are big data sets difficult?
Need to resolve errors: the more coverage there is, the
more errors there are.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
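
A rough way to see this scaling is to count distinct k-mers, used here as a stand-in for the memory a k-mer-based assembler needs. The sketch below is a toy simulation, not a measurement: the genome, read length, 1% substitution error rate, and k=20 are all assumptions chosen for illustration. Error-free reads can never produce more distinct k-mers than the genome contains, while reads with errors keep adding new ones as the data set grows.

```python
import random


def distinct_kmers(reads, k):
    """Count the distinct k-mers across a collection of reads."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return len(seen)


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly and apply random substitution errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute a different base to model a sequencing error.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


K = 20  # assumed k-mer size
genome_rng = random.Random(0)
GENOME = "".join(genome_rng.choice("ACGT") for _ in range(1000))
TRUE_KMER_BOUND = len(GENOME) - K + 1  # most k-mers error-free reads can yield

results = []
for n_reads in (100, 500, 2500):
    clean = distinct_kmers(
        simulate_reads(GENOME, n_reads, 100, 0.0, random.Random(1)), K)
    noisy = distinct_kmers(
        simulate_reads(GENOME, n_reads, 100, 0.01, random.Random(1)), K)
    results.append((n_reads, clean, noisy))
    print(f"{n_reads:5d} reads: {clean:6d} k-mers without errors, "
          f"{noisy:6d} with errors")
```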

The scaling problem
• We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
• Since ~2008:
  • The field has engaged in lots of engineering optimization…
  • …but the data generation rate has consistently outstripped Moore’s Law.

Our solution: Digital normalization

Contig assembly now scales a lot better.

Most samples can be assembled in < 50 GB of memory.
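
The slides do not spell out how digital normalization works, so here is a minimal sketch of the idea: stream through the reads, estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff C. This sketch uses an exact in-memory `Counter` for clarity, which is an assumption made for readability; khmer itself uses a fixed-memory probabilistic counting structure. The function name, k=20, and the example reads are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage so far is below the cutoff.

    Coverage is estimated as the median count of the read's k-mers
    among the reads already kept.
    """
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read


# One sequence seen 100 times (high coverage) and one seen once (low coverage).
COMMON = "ACGTTGCAAGGCTTAACCGGATCGA"
RARE = "TTTTGGGGCCCCAAAATTTTGGGGC"
reads = [COMMON] * 100 + [RARE]

kept = list(digital_normalization(reads, k=20, cutoff=20))
print(f"kept {len(kept)} of {len(reads)} reads")
```

In this toy input the redundant copies of the high-coverage read are discarded once its coverage reaches the cutoff, while the rare read is retained. Discarding redundant reads also discards the errors they carry, which is why memory use drops.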

Diginorm is widely useful, becoming widely used:
1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al., 2013; pmid

Second problem: too many choices!
What programs and options do you use??
• Read trimming and filtering (x100)
• Assembly (x10)
• Annotation (x20)
• Quantification (x20)
• Science! (x 10,000)

Third problem: training
• I teach:
  • Summer NGS course (two weeks, KBS); heavily oversubscribed.
  • Many ad hoc workshops
  • Fall BEACON course (intro computational science)
• Others teach:
  • Summer/fall workshops (Robin Buell)
  • Various genomics/bioinformatics courses (Shin-han Shiu, Rob Britton, ???)

Overall training results:
• We can fairly easily get people over the initial “technical” hump (here are some programs, here’s how to use them).
• We can begin to teach people the way to think about the problem.
• People have a really tough time connecting generic instruction to their own research, however!
(And people need to learn how to analyze their own

Solution? khmer-protocols
• Effort to provide standard “cheap” assembly protocols for Illumina mRNAseq & metagenomes in the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
• Open, versioned, forkable, citable.

Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression

“Eel Pond” mRNAseq protocol
1. Adapter trim & quality filter
2. Diginorm to C=20
3. Trim high-coverage reads at low-abundance k-mers
4. Assemble with Trinity
5. Group transcripts
6. Annotate x database
7. RSEM (Map QC reads to count)
8. EBSeq (Differential expression analysis)
9. Extracting differentially expressed genes & graphing
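
The step "trim high-coverage reads at low-abundance k-mers" can be sketched as follows. The idea is that in a read from a well-covered region, a k-mer seen only once or twice is probably a sequencing error, so the read is cut just before the base that introduces it. This is a simplified illustration under stated assumptions, not khmer's implementation: the function name, the thresholds `min_abundance=2` and `high_coverage=20`, the exact `Counter`, and the tiny k=5 are all chosen for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_at_low_abundance(read, counts, k, min_abundance=2, high_coverage=20):
    """Truncate a high-coverage read at its first low-abundance k-mer."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return read
    # Leave low-coverage reads alone: their rare k-mers may be real.
    if median(counts[km] for km in read_kmers) < high_coverage:
        return read
    for i, km in enumerate(read_kmers):
        if counts[km] < min_abundance:
            # Keep everything before the last base of the offending k-mer.
            return read[:i + k - 1]
    return read


K = 5  # small k so the toy example stays readable
TRUE_READ = "ACGTTGCAAGGCTTAACCGGATCGA"
ERROR_READ = TRUE_READ[:22] + "T" + TRUE_READ[23:]  # one substitution near the end
reads = [TRUE_READ] * 30 + [ERROR_READ]

counts = Counter()
for read in reads:
    counts.update(kmers(read, K))

trimmed = [trim_at_low_abundance(read, counts, K) for read in reads]
print(trimmed[-1])
```

The error-free reads pass through unchanged, and the read with the substitution loses only its erroneous tail.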

“Kalamazoo” metagenome protocol
1. Adapter trim & quality filter
2. Diginorm to C=10
3. Trim high-coverage reads at low-abundance k-mers
4. Diginorm to C=5
5. Too big to assemble? Partition graph and split into "groups"; reinflate groups (optional)
6. Small enough to assemble? Assemble!!!
7. Map reads to assembly
8. Annotate contigs with abundances
9. Prokka

Show: Web site

http://khmer-protocols.readthedocs.org/

Show: mRNAseq output
Differential expression graph

What khmer-protocols is:
• Starting point.
• Defensible initial solution to get initial results. Works on ~80% or more of samples, guesstimated.
• Great (?) way to learn
• 100% reproducible; methods section on computational analysis is more or less written for you.
• Fairly fast and inexpensive (comparatively) (~$100/data set)

What khmer-protocols is not:
• The One True Solution.
• The Best Solution.
• Proprietary.
• Closed.
• Slow and expensive (comparatively).

Speed up/efficiency?
[Figure, two panels: “Walltime to complete assemblies” (y-axis: total walltime, hrs) and “RAM needed to complete assemblies” (y-axis: total memory used, GB). Each panel compares diginorm (DN) and raw (RAW) reads by sample, for occ oases, occ trinity, ocu oases, and ocu trinity. Credit: Elijah Lowe]

Diginorm increases sensitivity (very slightly :)

Evaluation by homology against a reference gene

37 extra from diginorm, vs 17 lost;

64 extra from diginorm, vs 15 lost;
Elijah Lowe
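
As a quick check on "very slightly", the net change in each comparison is simply the hits gained minus the hits lost. The slide does not say which sample or assembler each pair of numbers belongs to, so the sketch below leaves them unlabeled.

```python
# Gained/lost counts as given on the slide; the comparisons are unlabeled there.
comparisons = [
    {"gained": 37, "lost": 17},
    {"gained": 64, "lost": 15},
]

net_gains = [c["gained"] - c["lost"] for c in comparisons]
for c, net in zip(comparisons, net_gains):
    print(f"+{c['gained']} / -{c['lost']} -> net {net:+d}")
```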

Please use!
• Would love feedback: what worked? What didn’t work?
• Cannot support khmer protocols on HPC, but can support it in the cloud; iCER may (?) support it on HPC: all of the software is installed. (We are working on better default support for HPC.)

Links & more references
• ged.msu.edu/angus/ - NGS course materials
• khmer-protocols.readthedocs.org - khmer protocols
• Cloud computing discussion next Wed, 1/22, 2pm, iCER. Don’t e-mail me at: ctb@msu.edu
