Better science through superior software

Presentation given to the BEACON 2013 Congress during the "Collaborating with Industry" sandbox

Original w/ slide notes at: https://docs.google.com/presentation/d/1mmvD0R3fLIl11TmFHij5fGcMDb9qJxy_nwENO2Rt-YI/edit?usp=sharing


  1. Better science through superior software
     Michael R. Crusoe, Software Engineer & Bioinformatician
     The GED Lab @ Michigan State
     mcrusoe@msu.edu, @biocrusoe
  2. Open, online science
     Much of the software and many of the approaches discussed today are available online:
     khmer software: http://github.com/ged-lab/khmer/
     Titus’s blog: http://ivory.idyll.org/blog/
     Titus’s Twitter: @ctitusbrown
  3. Overview
     ● Next-gen sequencing data deluge
     ● ♫How do you solve a problem like big data?♫
     ● Impact of khmer software
     ● Future work
     ● Being a good F/OSS community member and leading by example
     ● Acknowledgements
  4. Problem
     “The power of next-gen. sequencing: get 180x coverage... and then watch your assemblies never finish” - Erich Schwarz
  5. “Three types of data scientists.” (Bob Grossman, U. Chicago, at XLDB 2012)
     1. Your data gathering rate is slower than Moore’s Law.
     2. Your data gathering rate matches Moore’s Law.
     3. Your data gathering rate exceeds Moore’s Law.
  6. “Three types of data scientists.”
     1. Your data gathering rate is slower than Moore’s Law.
        => Be lazy, all will work out.
     2. Your data gathering rate matches Moore’s Law.
        => You need to write good software, but all will work out.
     3. Your data gathering rate exceeds Moore’s Law.
        => You need serious help.
  7. A software & algorithms approach: can we develop lossy compression approaches that
     1. Reduce data size & remove errors => efficient processing?
     2. Retain all “information”? (think JPEG)
     If so, then we can store only the compressed data for later reanalysis.
     The short answer is: yes, we can.
  8. Digital normalization approach
     A digital analog to cDNA library normalization, diginorm:
     ● Is reference free;
     ● Is single pass: looks at each read only once;
     ● Does not “collect” the majority of errors;
     ● Keeps all low-coverage reads & retains all information.
  9. GED Lab’s approach: khmer
     diginorm: discards most of the data while retaining the information content.
     partitioning: splits transcriptomic and meta{transcript,gen}omic datasets.
     fast k-mer counting: for better preprocessing, repeat detection, and sequencing coverage estimates.
     Reference-free variant calling
     - More to come -
  10. The GED lab at MSU: Theoretical => applied solutions.
  11. Impact
      ● Any biologist can use our tools on a rented cloud computer, cheaply.
      ● Overcome complexity: Erich Schwarz assembled H. contortus when it was previously not possible.
      ● Overcome data excess: 5.1 billion reads from 50 different sea lamprey tissues -> the diginorm technique removed 98.7% as redundant.
  12. Future work
      ● targeted-gene assembly from short reads (Fish et al., Ribosomal Database Project)
      ● rRNA search in shotgun data
      ● error-correction for mRNAseq & metagenomic data
      ● strain variation collapse, assembly, and recovery
      ● Goal: make most assembly easy and all evaluation easy.
  13. Interactions
      khmer both builds upon existing Free and Open-Source Software (F/OSS) and is itself released under an open-source license.
      Used in curricula: both Software Carpentry ANGUS-based courses and the MSU NGS summer course.
  14. ● BIG DATA grant reviewers specifically mentioned the GED Lab’s “[...] long and successful track-record and experience in following rigorous but open software development processes.” -> CTB received 3-year NIH R01 support
      ● Transparent and public software development yielded participation from others.
  15. Personal Acknowledgments
      C. Titus Brown for slides, employment
  16. Acknowledgements
      Lab members involved
      ● Adina Howe (w/Tiedje)
      ● Jason Pell
      ● Arend Hintze
      ● Rosangela Canino-Koning
      ● Qingpeng Zhang
      ● Elijah Lowe
      ● Likit Preeyanon
      ● Jiarong Guo
      ● Tim Brom
      ● Kanchan Pavangadkar
      ● Eric McDonald
      ● Chris Welcher
      Collaborators
      ● Jim Tiedje, MSU
      ● Billie Swalla, UW
      ● Janet Jansson, LBNL
      ● Susannah Tringe, JGI
      Funding
      USDA NIFA; NSF IOS; BEACON.
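
The digital normalization idea from slide 8 (single pass, reference free, keep low-coverage reads, drop redundant ones) can be sketched in a few lines of Python. This is a minimal illustration and not khmer's implementation. It assumes that a read's coverage is estimated as the median count of its k-mers seen so far, which is how diginorm is usually described but is not stated on the slides. The function names `kmers` and `normalize` and the default values of `k` and `cutoff` are hypothetical, and an exact in-memory `Counter` stands in for whatever memory-efficient counting structure khmer uses.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below the cutoff.

    Assumption: coverage of a read is the median count of its k-mers
    among the reads kept so far. Each read is examined exactly once.
    """
    counts = Counter()  # exact counts; a stand-in for a compact counter
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[w] for w in words) < cutoff:
            # Low coverage so far: keep the read and record its k-mers.
            counts.update(words)
            yield read
        # Otherwise the read is redundant and is dropped without
        # adding its k-mers (or its errors) to the counts.


if __name__ == "__main__":
    # Ten copies of one read plus a single rare read (toy data).
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
    for kept in normalize(reads, k=4, cutoff=3):
        print(kept)
```

With this toy input, only the first three copies of the abundant read are kept, while the single rare read always passes. Because dropped reads never update the counts, k-mers unique to redundant reads, sequencing errors included, are not accumulated.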
