Hadoop for Bioinformatics


My Hadoop World presentation

  • amazing
  • duaridhi, I am more of an observer of what people are doing, although I do dabble as much as time permits (i.e. not much)
  • Hi Deepak. I enjoyed your presentation.. Are you currently working on this?
  • Data geek = sits in a dark room staring at a monitor. Data center geek = sits in a dark warehouse staring at a monitor.

    PS: At least you have that picture for posterity
  • Hi Deepak, enjoyed the presentation. What's the difference between a data geek and a data center geek? Yours geekily, Duncan.

    P.S. glad I got a haircut since slide #44 :-)

Transcript

  • 1. Hadoop for Bioinformatics. Deepak Singh, Amazon Web Services. Hadoop World, NYC
  • 2. Via Reavel under a CC-BY-NC-ND license
  • 3. By ~Prescott under a CC-BY-NC license
  • 4. data sets
  • 5. many data sets
  • 6. PFAM PDB GENBANK ENSEMBL Many Others
  • 7. manageable
  • 8. Image: Matt Wood
  • 9. Human genome Image: Matt Wood
  • 10. Image: Matt Wood
  • 11. ~100 TB/Week Image: Matt Wood
  • 12. ~100 TB/Week >2 PB/Year Image: Matt Wood
  • 13. years
  • 14. days
  • 15. hours
  • 16. gigabytes
  • 17. terabytes
  • 18. petabytes
  • 19. really fast
  • 20. typical informatics workflow
  • 21. Via Christolakis under a CC-BY-NC-ND license
  • 22. Via Argonne National Labs under a CC-BY-SA license
  • 23. killer app Via Argonne National Labs under a CC-BY-SA license
  • 24. Via asklar under a CC-BY license
  • 25. Image: Chris Dagdigian
  • 26. rethink algorithms
  • 27. rethink computing
  • 28. rethink data management
  • 29. rethink data sharing
  • 30. operational mindset
  • 31. scalability
  • 32. we are data geeks not data center geeks
  • 33. two key trends
  • 34. develop applications
  • 35. distribute applications
  • 36. use applications
  • 37. some work
  • 38. filters some work ^
  • 39. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  • 40. Read Mapping; Mapping & SNP Discovery; De novo Genome Assembly
  • 41. Short Read Mapping
  • 42. Asian individual genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008). African individual genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008).
  • 43. Alignment > 10000 CPU hrs
  • 44. Seed & Extend: Good alignments must have significant exact alignment. Minimal exact alignment length = l/(k+1).
  • 45. Seed & Extend: Good alignments must have significant exact alignment. Minimal exact alignment length = l/(k+1). Expensive to scale.
  • 46. Seed & Extend: Good alignments must have significant exact alignment. Minimal exact alignment length = l/(k+1). Expensive to scale.
  • 47. Seed & Extend: Good alignments must have significant exact alignment. Minimal exact alignment length = l/(k+1). Expensive to scale. Need parallelization framework.
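The bound on slides 44 to 47 follows from the pigeonhole principle: if a read of length l aligns with at most k differences and is cut into k+1 non-overlapping pieces, at least one piece must be free of differences. Below is a minimal Python sketch of that idea, restricted to mismatches for simplicity. The helper names and the toy sequences are illustrative assumptions, not taken from the talk.

```python
def seed_length(read_len: int, k: int) -> int:
    """Minimal exact-match length guaranteed by the bound l/(k+1)."""
    return read_len // (k + 1)


def split_into_seeds(read: str, k: int) -> list[tuple[int, str]]:
    """Cut a read into k+1 non-overlapping seeds, returned as (offset, seed)."""
    s = seed_length(len(read), k)
    return [(i * s, read[i * s:(i + 1) * s]) for i in range(k + 1)]


def exact_seed_hits(read: str, ref_window: str, k: int) -> list[int]:
    """Offsets of the seeds that match the reference window exactly."""
    return [off for off, seed in split_into_seeds(read, k)
            if ref_window[off:off + len(seed)] == seed]


# Toy data (hypothetical): a 36 bp reference window and a read that
# differs from it at exactly k = 3 positions.
reference = "ACGTACGTTAGCCGATAGGCTTAACCGGTTAAGCTA"
bases = list(reference)
for pos in (2, 13, 30):
    bases[pos] = "A" if bases[pos] != "A" else "C"
read = "".join(bases)

seed_len = seed_length(len(read), 3)
surviving = exact_seed_hits(read, reference, 3)
print(seed_len, surviving)
```

With three mismatches spread over four seeds, at least one seed is untouched, so an exact-match index lookup on seeds cannot miss a valid 3-mismatch alignment. Only the extension step has to tolerate differences.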
  • 48. CloudBurst: Catalog k-mers; Collect seeds; End-to-end alignment
  • 49. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  • 50. CloudBurst efficiently reports every k-difference alignment of every read
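The three steps named on slide 48 map naturally onto a map, shuffle, reduce pattern. The following is a simplified in-memory sketch of that pattern, not CloudBurst's actual implementation: it handles mismatches only, runs in a single process, and all function names and sequences are assumptions made for illustration.

```python
from collections import defaultdict


def map_kmers(name, seq, k, is_ref):
    """Map: emit (k-mer, record). The reference emits every k-mer;
    a read emits only its non-overlapping seeds."""
    step = 1 if is_ref else k
    for off in range(0, len(seq) - k + 1, step):
        yield seq[off:off + k], (is_ref, name, off)


def shuffle(pairs):
    """Shuffle: group records by k-mer, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return groups


def reduce_seeds(groups, ref, reads, max_diff):
    """Reduce: extend each shared seed to an end-to-end alignment and
    keep it if it has at most max_diff mismatches."""
    hits = set()
    for recs in groups.values():
        ref_offs = [off for is_ref, _, off in recs if is_ref]
        read_recs = [(n, off) for is_ref, n, off in recs if not is_ref]
        for name, r_off in read_recs:
            read = reads[name]
            for g_off in ref_offs:
                start = g_off - r_off
                if start < 0 or start + len(read) > len(ref):
                    continue
                window = ref[start:start + len(read)]
                diffs = sum(a != b for a, b in zip(read, window))
                if diffs <= max_diff:
                    hits.add((name, start, diffs))
    return sorted(hits)


# Toy data (hypothetical): r1 is ref[4:16] with one substituted base.
ref = "TTGACCGTAGGCTAACGTTAGC"
reads = {"r1": "CCGTAGGATAAC"}
k = len(reads["r1"]) // (1 + 1)   # seed length for 1 allowed mismatch

pairs = list(map_kmers("ref", ref, k, True))
for name, seq in reads.items():
    pairs += map_kmers(name, seq, k, False)

alignments = reduce_seeds(shuffle(pairs), ref, reads, max_diff=1)
print(alignments)   # (read name, reference start, mismatches)
```

Because every k-mer key is processed independently in the reduce step, the work partitions cleanly across machines, which is the property the "need parallelization framework" slide points at.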
  • 51. many applications only need the best alignment
  • 52. Bowtie: Ultrafast short read aligner. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 53. SOAPsnp: Consensus alignment and SNP calling. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 54. Crossbow: Rapid whole genome SNP analysis. Ben Langmead. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 55. Preprocessed reads
  • 56. Preprocessed reads; Map: Bowtie
  • 57. Preprocessed reads; Map: Bowtie; Sort: Bin and partition
  • 58. Preprocessed reads; Map: Bowtie; Sort: Bin and partition; Reduce: SOAPsnp
  • 59. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
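The pipeline built up on slides 55 to 58 (map with an aligner, sort into genomic bins, reduce with a SNP caller) can be illustrated with a toy single-process sketch. To be clear about the assumptions: `toy_align` is a naive stand-in for Bowtie, the majority vote in `reduce_phase` is a naive stand-in for SOAPsnp, and `BIN_SIZE`, the reference and the reads are all made up for the example. Only the shape of the data flow is meant to carry over.

```python
from collections import Counter
from itertools import groupby

BIN_SIZE = 4  # hypothetical bin width, far smaller than any real setting


def toy_align(read, ref):
    """Stand-in for Bowtie: best ungapped placement by mismatch count."""
    return min(range(len(ref) - len(read) + 1),
               key=lambda s: sum(a != b for a, b in zip(read, ref[s:])))


def map_phase(reads, ref):
    """Map: align each read and emit ((bin, position), base) per base."""
    for read in reads:
        start = toy_align(read, ref)
        for i, base in enumerate(read):
            pos = start + i
            yield (pos // BIN_SIZE, pos), base


def sort_phase(pairs):
    """Sort: order by (bin, position) and partition the records by bin."""
    ordered = sorted(pairs)
    return {b: list(recs)
            for b, recs in groupby(ordered, key=lambda kv: kv[0][0])}


def reduce_phase(bin_records, ref):
    """Reduce (stand-in for SOAPsnp): majority-vote consensus per position,
    reporting positions where the consensus differs from the reference."""
    snps = []
    for pos, recs in groupby(bin_records, key=lambda kv: kv[0][1]):
        consensus = Counter(base for _, base in recs).most_common(1)[0][0]
        if consensus != ref[pos]:
            snps.append((pos, ref[pos], consensus))
    return snps


# Toy data (hypothetical): three reads that all carry a G-to-T change
# at reference position 5.
ref = "ACGTTGCAAGCT"
reads = ["GTTTCA", "TTTCAAG", "CGTTTC"]

bins = sort_phase(map_phase(reads, ref))
snps = [snp for b in sorted(bins) for snp in reduce_phase(bins[b], ref)]
print(snps)   # (position, reference base, consensus base)
```

Each bin can be handed to a separate reducer because the sort step guarantees that all evidence for a position arrives together and in order.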
  • 60. Comparing Genomes
  • 61. Estimating relative evolutionary rates from sequence comparisons: identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. [Figure: species tree and gene tree for genes A-E in S. cerevisiae and C. elegans]
  • 62. Estimating relative evolutionary rates from sequence comparisons:
    1. Orthologs found using the reciprocal smallest distance algorithm
    2. Build alignment between two orthologs
       >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
       >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
    3. Estimate distance given a substitution matrix
    [Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr with entries µπ; species tree and gene tree for genes A-E in S. cerevisiae and C. elegans]
  • 63. RSD algorithm summary. [Figure: sequences from Genome I and Genome J are aligned and distances calculated in both directions; the pair ib - jc is reported as orthologs with D = 0.1]
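The reciprocal step of the RSD algorithm summarized on slides 62 and 63 can be sketched as follows. This is a minimal illustration that assumes the pairwise distances have already been computed (in the real method they come from aligning sequences and estimating distances under a substitution matrix). The gene names and distance values are hypothetical and are not the ones from the slide's figure.

```python
def closest(query, distances):
    """Return the hit with the smallest evolutionary distance to query."""
    hits = distances[query]
    return min(hits, key=hits.get)


def reciprocal_smallest_distance(i_to_j, j_to_i):
    """Report (gene_i, gene_j, distance) when each gene is the other's
    smallest-distance hit in the opposite genome."""
    orthologs = []
    for gene_i in i_to_j:
        gene_j = closest(gene_i, i_to_j)
        if gene_j in j_to_i and closest(gene_j, j_to_i) == gene_i:
            orthologs.append((gene_i, gene_j, i_to_j[gene_i][gene_j]))
    return orthologs


# Hypothetical distances from genes in genome I to hits in genome J...
i_to_j = {
    "Ia": {"Ja": 0.3, "Jb": 1.2},
    "Ib": {"Jb": 0.9, "Jc": 0.1},
    "Ic": {"Jc": 0.4, "Jb": 0.5},
}
# ...and from genes in genome J back to hits in genome I.
j_to_i = {
    "Ja": {"Ia": 0.3, "Ib": 0.8},
    "Jb": {"Ic": 0.2, "Ia": 1.2},
    "Jc": {"Ib": 0.1, "Ic": 0.9},
}

orthologs = reciprocal_smallest_distance(i_to_j, j_to_i)
print(orthologs)
```

In this toy data Ic's closest hit is Jc, but Jc's closest hit is Ib, so the Ic and Jc pair fails the reciprocal test and is not reported.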
  • 64. Prof. Dennis Wall Harvard Medical School
  • 65. Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  • 66. massive computational demand
  • 67. 1000 genomes = 5,994,000 processes = 23,976,000 hours
  • 68. 2737 years
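The figures on slides 67 and 68 can be reproduced with a few lines of arithmetic. The sketch below assumes an all-against-all pairwise comparison of genomes. The 12 runs per genome pair and the 4 hours per process are not stated on the slides; they are inferred here only because they make the numbers agree, so treat both constants as assumptions.

```python
from math import comb

GENOMES = 1000
RUNS_PER_PAIR = 12       # assumption: inferred from 5,994,000 / 499,500
HOURS_PER_PROCESS = 4    # assumption: inferred from 23,976,000 / 5,994,000
HOURS_PER_YEAR = 24 * 365

pairs = comb(GENOMES, 2)                 # unordered genome pairs
processes = pairs * RUNS_PER_PAIR
hours = processes * HOURS_PER_PROCESS
years = hours / HOURS_PER_YEAR           # serial time on a single CPU

print(f"{pairs:,} pairs, {processes:,} processes, "
      f"{hours:,} hours, about {years:.0f} years")
```

The quadratic growth in pairs is what makes the workload a poor fit for a single machine and a good fit for many independent parallel tasks.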
  • 69. periodic task
  • 70. must scale up
  • 71. not scalability gurus
  • 72. hadoop streaming
  • 73. compared 50+ genomes
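Hadoop Streaming lets any program that reads lines from standard input and writes tab-separated key/value lines to standard output act as a mapper, which suits existing command-line bioinformatics tools. The sketch below shows what such a mapper could look like for pairwise genome comparison. It is not the Roundup code: `compare_genomes`, the input layout (one genome pair per line) and the genome names are hypothetical placeholders.

```python
import io
from itertools import combinations


def compare_genomes(genome_a, genome_b):
    """Hypothetical placeholder for the real ortholog computation."""
    return f"compared:{genome_a}:{genome_b}"


def mapper(lines):
    """Turn each 'genomeA<TAB>genomeB' input line into 'key<TAB>value'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        genome_a, genome_b = line.split("\t")
        result = compare_genomes(genome_a, genome_b)
        yield f"{genome_a}|{genome_b}\t{result}"


def run(stream_in, stream_out):
    """Streaming entry point: read input lines, write output lines."""
    for out in mapper(stream_in):
        stream_out.write(out + "\n")


# Build a small job input (one genome pair per line) and run the mapper
# against in-memory streams instead of stdin/stdout.
genomes = ["S_cerevisiae", "C_elegans", "H_sapiens"]
job_input = "".join(f"{a}\t{b}\n" for a, b in combinations(genomes, 2))

buffer = io.StringIO()
run(io.StringIO(job_input), buffer)
output_lines = buffer.getvalue().splitlines()
print(output_lines)
```

In an actual streaming job the script would call `run(sys.stdin, sys.stdout)`, and the framework would take care of splitting the pair list across mapper tasks.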
  • 74. what’s next?
  • 75. de novo assembly
  • 76. machine learning and statistics
  • 77. protein structure prediction
  • 78. docking
  • 79. trajectory analysis
  • 80. key driving factors?
  • 81. the ecosystem
  • 82. Pig
  • 83. Cascading
  • 84. Hive
  • 85. RHIPE
  • 86. domain specific libraries and tools
  • 87. http://aws.amazon.com/publicdatasets/
  • 88. http://aws.amazon.com/education/
  • 89. Thank you! deesingh@amazon.com; Twitter: @mndoci. Presentation ideas from @mza, @simon and @lessig