Making de novo assembly cheap & easy:
standardized protocols for mRNAseq and
metagenome assembly and analysis
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
My lab’s focus
• De novo assembly and efficient/effective use of NGS, especially for non-model organisms.
• Open source software engineering.
• Training and education in NGS.
There is quite a bit of life left to sequence & assemble.

http://pacelab.colorado.edu/
Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.
First problem: lots of data!
So, we want to go from raw data:
@SRR606249.17/1   (name)
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1   (quality score)
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
…to “assembled” original sequence.

UMD assembly primer (cbcb.umd.edu)
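The raw-data slide shows four-line FASTQ records: a name line, the sequence, a "+" separator, and one quality character per base. A minimal Python sketch of reading such a record follows. It assumes the Phred+33 quality encoding, which the slide does not state, and the helper names (`parse_fastq`, `phred_scores`) are made up for illustration.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters to integers (Phred+33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# First record from the slide.
EXAMPLE = """@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
"""

records = list(parse_fastq(EXAMPLE))
scores = phred_scores(records[0].quality)
print(records[0].name, len(records[0].sequence), min(scores), max(scores))
```

Under that encoding, the "#" characters decode to the lowest score in the read, and they line up with the "N" (uncalled) bases in the sequence.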
Practical memory measurements

Velvet measurements (Adina Howe)
Shotgun sequencing & de novo
assembly:
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
Why are big data sets difficult?
Need to resolve errors: the more coverage there is, the
more errors there are.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
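To make the "memory ~ real variation + number of errors" point concrete, here is a toy Python sketch using the sentence and the errorful reads from the shotgun slide. It counts the distinct k-mers (substrings of length k) that a k-mer-based assembler would have to store. The choice of k = 8 characters is arbitrary, and treating text characters as bases is only an analogy.

```python
from typing import Set

K = 8  # arbitrary k-mer length for this toy example


def kmers(text: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


TRUE_SEQUENCE = (
    "It was the best of times, it was the worst of times, it was the "
    "age of wisdom, it was the age of foolishness"
)

# Reads from the slide; each one carries a single substitution error.
READS = [
    "It was the Gest of times, it was the wor",
    ", it was the worst of timZs, it was the",
    "isdom, it was the age of foolisXness",
    ", it was the worVt of times, it was the",
    "mes, it was Ahe age of wisdom, it was th",
    "It was the best of times, it Gas the wor",
    "mes, it was the age of witdom, it was th",
    "isdom, it was tIe age of foolishness",
]

true_kmers = kmers(TRUE_SEQUENCE, K)
read_kmers: Set[str] = set()
for read in READS:
    read_kmers |= kmers(read, K)

# k-mers that exist only because of sequencing errors.
erroneous = read_kmers - true_kmers

print("distinct k-mers in the true sequence:", len(true_kmers))
print("distinct k-mers seen in the reads:   ", len(read_kmers))
print("k-mers introduced by errors:         ", len(erroneous))
```

Re-reading a correct region adds no new k-mers, but each single-character error can add up to K novel ones. That is why the number of stored k-mers, and hence memory, keeps growing with data set size even after the real sequence has been fully sampled.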
The scaling problem
• We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
• Since ~2008:
  ◦ The field has engaged in lots of engineering optimization…
  ◦ …but the data generation rate has consistently outstripped Moore’s Law.
Our solution: Digital
normalization
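The digital normalization slides are figures, so the idea is sketched here in Python. As commonly described, diginorm streams through the reads once, estimates each read's coverage as the median abundance of its k-mers among the reads kept so far, and discards the read if that estimate has already reached a cutoff C. This sketch uses an exact in-memory `Counter` and ignores reverse complements; khmer itself uses a memory-efficient probabilistic counting structure, so treat this as an illustration of the logic only. The function name and default values are assumptions.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def digital_normalization(reads: Iterable[str], k: int = 20,
                          cutoff: int = 20) -> List[str]:
    """Keep a read only if its estimated coverage is still below cutoff."""
    counts: Counter = Counter()  # k-mer abundances among kept reads
    kept: List[str] = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k; nothing to estimate from
        # Median k-mer abundance serves as the coverage estimate.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy usage: one heavily oversampled read plus one rare read.
common = "ACGTTGCAAGCTAGGATCCA"
rare = "TTTTGGGGCCCCAAAATTGG"
kept = digital_normalization([common] * 100 + [rare], k=8, cutoff=5)
print(len(kept), "of 101 reads kept")
```

Redundant high-coverage reads are dropped once the cutoff is reached, while the rare read is retained. Because errors mostly arrive with the discarded redundant reads, far fewer erroneous k-mers reach the assembler.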
Contig assembly now scales a lot better.

Most samples can be assembled in < 50 GB of
memory.
Diginorm is widely useful, becoming
widely used:
1. Assembly of the H. contortus parasitic nematode
genome, a “high polymorphism/variable coverage”
problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P.
marinus) transcriptome, a “big assembly” problem.
(in prep)
3. Osedax symbiont metagenome, a “contaminated
metagenome” problem (Goffredi et al, 2013; pmid
Second problem: too many choices!
What programs and options do you use??
• Read trimming and filtering (x100)
• Assembly (x10)
• Quantification (x20)
• Annotation (x20)
• Science! (x 10,000)
Third problem: training
• I teach:
  ◦ Summer NGS course (two weeks, KBS); heavily oversubscribed.
  ◦ Many ad hoc workshops
  ◦ Fall BEACON course (intro computational science)
• Others teach:
  ◦ Summer/fall workshops (Robin Buell)
  ◦ Various genomics/bioinformatics courses (Shin-han Shiu, Rob Britton, ???)
Overall training results:
• We can fairly easily get people over the initial “technical” hump (here are some programs, here’s how to use them).
• We can begin to teach people the way to think about the problem.
• People have a really tough time connecting generic instruction to their own research, however!
(And people need to learn how to analyze their own
Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.
Solution? khmer-protocols
• Effort to provide standard “cheap” assembly protocols for Illumina mRNAseq & metagenomes in the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
• Open, versioned, forkable, citable.
(Diagram: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression)
“Eel Pond” mRNAseq protocol
• Adapter trim & quality filter
• Diginorm to C=20
• Trim high-coverage reads at low-abundance k-mers
• Assemble with Trinity
• Group transcripts
• Annotate x database
• RSEM (Map QC reads to count)
• EBSeq (Differential expression analysis)
• Extracting differentially expressed genes & graphing
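The protocol step that trims high-coverage reads at low-abundance k-mers can be sketched as follows. This is one plausible reading of the step, not the khmer implementation: a read whose median k-mer abundance is high is assumed to come from a well-sampled region, so a rare k-mer inside it is probably an error, and the read is cut just before the base that completes that k-mer. Low-coverage reads are left alone because their rare k-mers may be real. All function names, thresholds, and the exact trim position are assumptions for illustration.

```python
from collections import Counter
from statistics import median
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer across all reads (exact counting, toy scale only)."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def trim_at_low_abundance(read: str, counts: Counter, k: int,
                          min_abundance: int = 2,
                          coverage_threshold: int = 20) -> str:
    """Trim a high-coverage read at its first low-abundance k-mer."""
    abundances = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    if not abundances or median(abundances) < coverage_threshold:
        return read  # too short or low coverage: leave untouched
    for i, abundance in enumerate(abundances):
        if abundance < min_abundance:
            # The k-mer at i spans read[i:i + k]; drop its final base onward.
            return read[:i + k - 1]
    return read


# Toy usage: 30 correct copies, one copy with an error at position 17,
# and one unrelated low-coverage read.
good = "ACGTTGCAAGCTAGGATCCA"
bad = good[:17] + "G" + good[18:]
rare = "TTTTGGGGCCCCAAAATTGG"
counts = count_kmers([good] * 30 + [bad, rare], k=8)

trimmed = trim_at_low_abundance(bad, counts, k=8)
print(trimmed)
```

In this example the erroneous read is cut back to its error-free prefix, while the correct read and the low-coverage read pass through unchanged.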
“Kalamazoo” metagenome protocol
• Adapter trim & quality filter
• Diginorm to C=10
• Trim high-coverage reads at low-abundance k-mers
• Diginorm to C=5
• Too big to assemble? Partition graph, split into "groups", reinflate groups (optional)
• Small enough to assemble? Assemble!!!
• Map reads to assembly
• Annotate contigs with abundances
• Prokka
Show: Web site

http://khmer-protocols.readthedocs.org/
Show: mRNAseq output
Differential expression graph
Show: mRNAseq spreadsheet
Show: BLAST server
Soon: Galaxy integration
What khmer-protocols is:
• Starting point.
  ◦ Defensible initial solution to get initial results. Works on ~80% or more of samples, guesstimated.
• Great (?) way to learn
• 100% reproducible; methods section on computational analysis is more or less written for you.
• Fairly fast and inexpensive (comparatively) (~$100/data set)
What khmer-protocols is not:
• The One True Solution.
• The Best Solution.
• Proprietary.
• Closed.
• Slow and expensive (comparatively).
Speed up/efficiency?
[Figure: two bar-chart panels, “Walltime to complete assemblies” (total walltime, hrs) and “RAM needed to complete assemblies” (total memory used, GB), each comparing DN (diginorm) vs RAW reads per sample for occ oases, occ trinity, ocu oases, and ocu trinity.]
Elijah Lowe
Diginorm increases sensitivity (very
slightly :)

Evaluation by homology against a reference gene

37 extra from diginorm, vs 17 lost;

64 extra from diginorm, vs 15 lost;
Elijah Lowe
Please use!
• Would love feedback: what worked? What didn’t work?
• Cannot support khmer-protocols on HPC, but can support it in the cloud; iCER may (?) support it on HPC -- all of the software is installed.
(We are working on better default support for HPC.)
Links & more references
• ged.msu.edu/angus/ - NGS course materials
• khmer-protocols.readthedocs.org - khmer protocols
• Cloud computing discussion next Wed, 1/22, 2pm, iCER.
Don’t e-mail me at: ctb@msu.edu


Editor's Notes

  • #19 Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.