1. Data of Unusual Size in Metagenomics
   C. Titus Brown (ctb@msu.edu)
   Asst Professor, Michigan State University (Microbiology, Computer Science, and BEACON)
2. Openness
   • Twit me! @ctitusbrown
   • My blog: http://ivory.idyll.org/blog/
   • Grants, preprints, etc: http://ged.msu.edu/
   • Software: BSD, github.com/ged-lab/.
3. Thanks
   • My lab, esp. Jason Pell, Arend Hintze, Adina Chuang Howe, Qingpeng Zhang, and Eric McDonald
   • Michigan State, USDA and NSF for $$
4. “Three types of data scientists.” (Bob Grossman, U. Chicago, at XLDB 2012)
   1. Your data gathering rate is slower than Moore’s Law.
   2. Your data gathering rate matches Moore’s Law.
   3. Your data gathering rate exceeds Moore’s Law.
5. Metagenomics
   • Randomly sequence DNA from mixed microbial communities, e.g. soil.
   • DNA sequencing rates (cost/volume) have been outpacing Moore’s Law for ~5 years now… A terabase for ~$10k today.
6. Analogy: feeding libraries into a paper shredder, digitizing the shreds, and reconstructing the books.
7. “Shredding libraries” is a good analogy!
   • Lots of copies of Dickens’ “Tale of Two Cities”, SAT study guides, etc.
   • Not as many copies of <obscure hipster author>.
   • Many different editions with minor differences, + Reader’s Digest, excerpts, etc.
   • (Although for libraries we usually know the language)
8. Two points:
   1. If we feed all of the libraries in the world into a paper shredder and mix, how do we recover the book content!?
9. Two points:
   2. That’s actually an awful lot of data…
10. Digression: Data of Unusual Size (aka Big Data) in Scientific Research
   • Research is already hard enough:
     – Novel, fast-moving, heterogeneous data types.
     – Unknown answers.
   • Big Data => scaling, requires good engineering:
     – Apply or invent new data structures & algorithms.
     – Write usable, functioning, reusable software.
   (Hint: academics are not good at one of these things)
11. The assembly problem
   • The N**2 approach: look at all overlapping fragments.
   • The word-based approach: further decompose fragments into fixed-length, overlapping, hashable words (k-mers).
   (Only one of these scales…)
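A minimal sketch of the word-based decomposition (illustrative only: the short k and the sequence here are made up; real analyses typically use k around 20 or more):

    def kmers(read, k):
        """Decompose a read into its overlapping, fixed-length words (k-mers)."""
        return [read[i:i+k] for i in range(len(read) - k + 1)]

    # Overlap detection becomes hash-table lookups on shared k-mers,
    # instead of comparing all N**2 pairs of fragments.
    read = "ACGTACGTGGACCTTAGCCTAAGG"
    print(kmers(read, k=8))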
12. Shotgun sequencing
   “Coverage” is simply the average number of reads that overlap each true base in the genome.
   Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
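As a back-of-the-envelope formula (the standard estimate, not spelled out on the slide): coverage C = (number of reads x read length) / genome length. A quick check with hypothetical numbers:

    def coverage(num_reads, read_len, genome_len):
        """Average number of reads overlapping each base: C = N * L / G."""
        return num_reads * read_len / genome_len

    # Hypothetical example: 3 billion 100 bp reads against the ~3 Gbp
    # human genome gives 100x, the 300 Gbp figure quoted a few slides on.
    print(coverage(3_000_000_000, 100, 3_000_000_000))  # -> 100.0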
13. Reducing to k-mer overlaps
   Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
14. Errors create new k-mers
   Each single base error generates ~k new k-mers. Generally, erroneous k-mers show up only once – errors are random.
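A small demonstration of the ~k claim (again with an illustrative sequence and k=8): every k-mer window that spans the error position becomes a novel word.

    def kmer_set(seq, k):
        return {seq[i:i+k] for i in range(len(seq) - k + 1)}

    true_read = "ACGTACGTGGACCTTAGCCTAAGGCATT"
    bad_read = true_read[:14] + "G" + true_read[15:]  # one substitution, mid-read

    novel = kmer_set(bad_read, 8) - kmer_set(true_read, 8)
    print(len(novel))  # 8: the error lands in k distinct, never-before-seen k-mers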
15. So, our k-mer data contains both true and false k-mers.
16. Random sampling => deep sampling needed
   Typically 10-100x needed for robust recovery (300 Gbp for human).
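Why such depth is needed (a standard Poisson argument, not on the slide): under random sampling, the chance that a given base is never sequenced is about e^-C, so low coverage leaves real gaps.

    import math

    # Fraction of bases expected to be missed entirely at coverage C,
    # under the usual Poisson model of random sampling.
    for C in (1, 5, 10, 100):
        print(f"{C:>3}x -> ~{math.exp(-C):.1e} of bases never sequenced")
    # 1x misses ~37% of bases; by 10x that drops to ~4.5e-05.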
17. Can we efficiently distinguish true from false?
   (Figure from Conway TC & Bromage AJ, Bioinformatics 2011;27:479-486.)
18. Uneven representation complicates matters.
   Since you’re sequencing at random, you need to sequence deeply in order to be sensitive to rare hipster books. These rare hipster books may be important to understanding culture: not only best-sellers have influence!
19. Streaming algorithm to do so: digital normalization
20-24. Digital normalization (five image-only slides stepping through the algorithm).
25. Streaming algorithm for lossy compression of data sets.
   • Converts random sampling to systematic sampling by building an assembly graph on the fly.
   • Can discard up to 99.9% of the data set and errors, and still retain all information necessary for assembly.
   • Acts as a prefilter for assemblers; ~5 lines of Python (sketched below).
   • Each piece of data is only examined once (!)
   • Most errors are never collected => low memory.
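A minimal sketch of that core loop, assuming the published description of digital normalization (keep a read only if the median abundance of its k-mers is below a coverage cutoff); a plain dict stands in for the fixed-memory counting structure the real code uses:

    from statistics import median

    def digital_normalization(reads, k=20, cutoff=20):
        """Streaming diginorm: keep a read only if the median abundance
        of its k-mers, over the reads kept so far, is below the cutoff.
        Assumes every read is at least k bases long."""
        counts = {}  # stand-in for a fixed-memory counting table
        for read in reads:
            kmers = [read[i:i+k] for i in range(len(read) - k + 1)]
            if median(counts.get(km, 0) for km in kmers) < cutoff:
                for km in kmers:
                    counts[km] = counts.get(km, 0) + 1
                yield read  # each read is examined exactly once

Reads from already-saturated regions are discarded without their k-mers ever being counted, which is why most errors are never collected.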
26. Separately, apply Bloom filters to storing the information/data.
   “Exact” is for best possible information-theoretical storage.
   (Pell et al., PNAS 2012)
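A toy Bloom filter for k-mer membership, to show the idea (illustrative only; the talk's actual implementation is the lab's C++/Python codebase, not this):

    import hashlib

    class KmerBloomFilter:
        """Set membership for k-mers in fixed memory: no false
        negatives, and a small, tunable false-positive rate."""
        def __init__(self, num_bits=8_000_000, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, kmer):
            # Derive several independent bit positions per k-mer.
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{kmer}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, kmer):
            for p in self._positions(kmer):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, kmer):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(kmer))

Storing only which k-mers exist, rather than the reads themselves, is what shifts memory use from the size of the data toward the size of the information in it.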
27. Some details
   • This was, previously, completely intractable.
   • Implemented in C++ and Python; “good practice” (?)
   • We’ve changed scaling behavior from data to information.
   • Practical scaling for ~soil metagenomics is 10-100x: need < 1 TB of RAM for ~2 TB of data, ~2 weeks.
   • Just beginning to explore threading, multicore, etc. (BIG DATA grant proposal)
   • Goal is to scale to 50 Tbp of data (~5-50 TB RAM currently)
28. My rules of thumb for Big Data (for a better tomorrow)
   1. Write well-understood filters and components, not monolithic programs.
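In that spirit, a small illustrative filter (a hypothetical script, not from the talk): it reads FASTA on stdin, drops short sequences, and writes FASTA on stdout, so it composes with other components via pipes.

    #!/usr/bin/env python
    """Usage: cat reads.fa | python min_length.py 50 > filtered.fa"""
    import sys

    def fasta_records(stream):
        """Yield (header, sequence) pairs from a FASTA stream."""
        header, chunks = None, []
        for line in stream:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    min_len = int(sys.argv[1])
    for header, seq in fasta_records(sys.stdin):
        if len(seq) >= min_len:
            print(header)
            print(seq)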
29. My rules of thumb for Big Data (for a better tomorrow)
   2. Throw away data as quickly as possible.
30. My rules of thumb for Big Data (for a better tomorrow)
   3. Scripting is an extremely effective way to connect serious software to scientists.
31. My rules of thumb for Big Data (for a better tomorrow)
   4. Streaming/online approaches are worth the effort to develop them. (OK, this is obvious to this audience)
32. My rules of thumb for Big Data (for a better tomorrow)
   1. Write well-understood filters and components, not monolithic programs.
   2. Throw away data as quickly as possible.
   3. Scripting is an extremely effective way to connect serious software to scientists.
   4. Streaming/online approaches are worth the effort to develop them.