2014 khmer protocols


Published in: Technology, Spiritual

  1. 1. Making de novo assembly cheap & easy: standardized protocols for mRNAseq and metagenome assembly and analysis. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. Jan 2014
  2. 2. My lab’s focus: • De novo assembly and efficient/effective use of NGS, especially for non-model organisms. • Open source software engineering. • Training and education in NGS.
  3. 3. There is quite a bit of life left to sequence & assemble.
  4. 4. Three problems: 1. Assembly memory & compute requirements? 2. It’s a complex process; what are good defaults? 3. Training is limited in opportunity, difficult for students, not always effective.
  5. 5. First problem: lots of data!
  6. 6. So, we want to go from raw data (each record: name line, sequence, “+”, quality score line): @SRR606249.17/1 GAGTATGTTCTCATAGAGGTTGGTANNNNT + B@BDDFFFHHHHHJIJJJJGHIJHJ####1 @SRR606249.17/2 CGAANNNNNNNNNNNNNNNNNCCTGGCTCA + CCCF#################22@GHIJJJ
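A minimal sketch of reading records like the two shown on this slide. It assumes the common four-lines-per-record FASTQ layout and Phred+33 quality encoding; the helper names `parse_fastq` and `phred33` are made up for illustration and are not part of khmer.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of input (or a blank line): stop
            return
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality


def phred33(quality):
    """Decode quality characters to integer scores, assuming Phred+33."""
    return [ord(char) - 33 for char in quality]


# The two reads from the slide, one record per four lines.
EXAMPLE = """@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
"""

records = list(parse_fastq(io.StringIO(EXAMPLE)))
for name, seq, qual in records:
    # Report read name, length, number of uncalled bases, and lowest quality.
    print(name, len(seq), seq.count("N"), min(phred33(qual)))
```

In these two reads the uncalled bases (`N`) line up with the lowest quality character (`#`), which is the kind of read that the trimming and filtering steps later in the deck are meant to clean up.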
  7. 7. …to “assembled” original sequence. UMD assembly primer (
  8. 8. Practical memory measurements Velvet measurements (Adina Howe)
  9. 9. Shotgun sequencing & de novo assembly: It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  10. 10. Why are big data sets difficult? Need to resolve errors: the more coverage there is, the more errors there are. Memory usage ~ “real” variation + number of errors; number of errors ~ size of data set.
  11. 11. The scaling problem: • We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware. • Since ~2008: the field has engaged in lots of engineering optimization… but the data generation rate has consistently outstripped Moore’s Law.
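The scaling argument on this slide can be illustrated with a toy simulation. This is a sketch under assumed parameters (a random 2,000 bp genome, 50 bp reads, k=20, a 1% substitution error rate), none of which come from the slides. It uses the number of distinct k-mers as a rough stand-in for assembly-graph memory.

```python
import random


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly from the genome, adding substitution errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


def distinct_kmers(reads, k):
    """Count the distinct k-mers across all reads."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return len(seen)


rng = random.Random(42)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(2000))
K, READ_LEN = 20, 50
true_kmers = distinct_kmers([genome], K)

results = []
for coverage in (10, 20, 40, 80):
    n_reads = coverage * len(genome) // READ_LEN
    clean = distinct_kmers(simulate_reads(genome, n_reads, READ_LEN, 0.0, rng), K)
    noisy = distinct_kmers(simulate_reads(genome, n_reads, READ_LEN, 0.01, rng), K)
    results.append((coverage, clean, noisy))
    print(f"{coverage}x  error-free: {clean}  1% errors: {noisy}  (genome: {true_kmers})")
```

Without errors the k-mer count can never exceed what the genome itself contains, however deep the coverage. With errors, each added read brings new spurious k-mers, so the count keeps growing with the size of the data set.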
  12. 12. Our solution: Digital normalization
  13. 13. Digital normalization
  14. 14. Digital normalization
  15. 15. Digital normalization
  16. 16. Digital normalization
  17. 17. Digital normalization
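The idea behind these slides can be sketched in a few lines of Python: estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff C. This is a simplified illustration, not khmer's implementation. It uses an exact in-memory `Counter`, whereas khmer uses a memory-efficient probabilistic counter, and the function names and the toy reads are invented for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass: keep a read only if its estimated coverage is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        # Median k-mer count among reads kept so far = coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute counts
    return kept


# Toy input: one heavily sampled read and one rarely sampled read.
common = "ACGTACGTAGCTAGCTAGGATCCA"
rare = "TTGACCGATAGGCTAACGTTAGCA"
reads = [common] * 100 + [rare] * 3

kept = digital_normalization(reads, k=8, cutoff=5)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The redundant copies of the high-coverage read are discarded once its coverage reaches the cutoff, while every copy of the low-coverage read is retained. That is why the data set shrinks without losing low-abundance sequence.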
  18. 18. Contig assembly now scales a lot better. Most samples can be assembled in < 50 GB of memory.
  19. 19. Diginorm is widely useful, becoming widely used: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid
  20. 20. Second problem: too many choices! What programs and options do you use?? Read trimming and filtering (x100); Assembly (x10); Annotation (x20); Quantification (x20); Science! (x 10,000).
  21. 21. Third problem: training • I teach: Summer NGS course (two weeks, KBS), heavily oversubscribed; many ad hoc workshops; Fall BEACON course (intro computational science). • Others teach: Summer/fall workshops (Robin Buell); various genomics/bioinformatics courses (Shin-han Shiu, Rob Britton, ???).
  22. 22. Overall training results: • We can fairly easily get people over the initial “technical” hump (here are some programs, here’s how to use them). • We can begin to teach people the way to think about the problem. • People have a really tough time connecting generic instruction to their own research, however! (And people need to learn how to analyze their own
  23. 23. Three problems: 1. Assembly memory & compute requirements? 2. It’s a complex process; what are good defaults? 3. Training is limited in opportunity, difficult for students, not always effective.
  24. 24. Solution? khmer-protocols • Effort to provide standard “cheap” assembly protocols for Illumina mRNAseq & metagenomes in the cloud. • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set. • Open, versioned, forkable, citable. [Pipeline diagram: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression]
  25. 25. “Eel Pond” mRNAseq protocol: Adapter trim & quality filter → Diginorm to C=20 → Trim high-coverage reads at low-abundance k-mers → Assemble with Trinity → Group transcripts → Annotate x database → RSEM (Map QC reads to count) → EBSeq (Differential expression analysis) → Extracting differentially expressed genes & graphing
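The step “trim high-coverage reads at low-abundance k-mers” can be sketched as follows. This is an illustration under assumptions, not the protocol's actual script: the thresholds (`min_abundance=2`, `high_coverage=20`), the function names, and the toy reads are all invented here. A read is trimmed at its first rare k-mer only when the read as a whole looks high-coverage, since a rare k-mer inside a well-covered read is most likely a sequencing error.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_read(read, counts, k, min_abundance=2, high_coverage=20):
    """Truncate a high-coverage read just before its first low-abundance k-mer ends."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return read
    abundances = [counts[km] for km in read_kmers]
    if median(abundances) < high_coverage:
        return read  # low-coverage read: rare k-mers may be real, so keep it whole
    for i, abundance in enumerate(abundances):
        if abundance < min_abundance:
            # Keep every base covered by the k-mers before position i.
            return read[:i + k - 1]
    return read


# Toy data: 30 correct copies of a read plus one copy with a single error.
K = 8
true_read = "ACGTTGCAAGGCTTAGCCATGGATCAGT"
error_read = true_read[:24] + "G" + true_read[25:]  # C -> G at index 24

counts = Counter()
for read in [true_read] * 30 + [error_read]:
    counts.update(kmers(read, K))

print(trim_read(true_read, counts, K))   # no rare k-mers, so left untouched
print(trim_read(error_read, counts, K))  # cut just before the erroneous base
```

In this toy case the trimmed read ends immediately before the substituted base, so the error-containing k-mers never reach the assembler.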
  26. 26. “Kalamazoo” metagenome protocol: Adapter trim & quality filter → Diginorm to C=10 → Trim high-coverage reads at low-abundance k-mers → Diginorm to C=5. Too big to assemble? Partition graph → Split into "groups" → Reinflate groups (optional). Small enough to assemble? Assemble!!! Downstream steps: Map reads to assembly; Annotate contigs with abundances; Prokka.
  27. 27. Show: Web site
  28. 28. Show: mRNAseq output Differential expression graph
  29. 29. Show: mRNAseq spreadsheet
  30. 30. Show: BLAST server
  31. 31. Soon: Galaxy integration
  32. 32. What khmer-protocols is: • Starting point. • Defensible initial solution to get initial results. Works on ~80% or more of samples, guesstimated. • Great (?) way to learn. • 100% reproducible; methods section on computational analysis is more or less written for you. • Fairly fast and inexpensive (comparatively) (~$100/data set).
  33. 33. What khmer-protocols is not: • The One True Solution. • The Best Solution. • Proprietary. • Closed. • Slow and expensive (comparatively).
  34. 34. Speed up/efficiency? [Two plots: walltime to complete assemblies (total walltime, hrs) and RAM needed to complete assemblies (total memory used, GB); panels: occ oases, occ trinity, ocu oases, ocu trinity; x-axis: sample, DN vs. RAW.] Elijah Lowe
  35. 35. Diginorm increases sensitivity (very slightly :) Evaluation by homology against a reference gene 37 extra from diginorm, vs 17 lost; 64 extra from diginorm, vs 15 lost; Elijah Lowe
  36. 36. Please use! • Would love feedback: what worked? What didn’t work? • Cannot support khmer protocols on HPC, but can support it in the cloud; iCER may (?) support it on HPC -- all of the software is installed. (We are working on better default support for HPC.)
  37. 37. Links & more references • NGS course materials • khmer protocols • Cloud computing discussion next Wed, 1/22, 2pm, iCER. Don’t e-mail me at: