Hadoop for Bioinformatics

My Hadoop World presentation


Published in Technology, Business
  • amazing
  • duaridhi, I am more of an observer of what people are doing, although I do dabble as much as time permits (i.e. not much)
  • Hi Deepak. I enjoyed your presentation. Are you currently working on this?
  • Data geek = sits in a dark room staring at a monitor.
    Data center geek = sits in a dark warehouse staring at a monitor.

    PS: At least you have that picture for posterity
  • Hi Deepak, enjoyed the presentation. What's the difference between a data geek and a data center geek? Yours geekily, Duncan.

    P.S. glad I got a haircut since slide #44 :-)

Transcript

  • 1. Hadoop for Bioinformatics Deepak Singh Amazon Web Services Hadoop World, NYC
  • 2. Via Reavel under a CC-BY-NC-ND license
  • 3. By ~Prescott under a CC-BY-NC license
  • 4. data sets
  • 5. many data sets
  • 6. PFAM PDB GENBANK ENSEMBL Many Others
  • 7. manageable
  • 8. Image: Matt Wood
  • 9. Human genome Image: Matt Wood
  • 10. Image: Matt Wood
  • 11. ~100 TB/Week Image: Matt Wood
  • 12. ~100 TB/Week >2 PB/Year Image: Matt Wood
  • 13. years
  • 14. days
  • 15. hours
  • 16. gigabytes
  • 17. terabytes
  • 18. petabytes
  • 19. really fast
  • 20. typical informatics workflow
  • 21. Via Christolakis under a CC-BY-NC-ND license
  • 22. Via Argonne National Labs under a CC-BY-SA license
  • 23. killer app Via Argonne National Labs under a CC-BY-SA license
  • 24. Via asklar under a CC-BY license
  • 25. Image: Chris Dagdigian
  • 26. rethink algorithms
  • 27. rethink computing
  • 28. rethink data management
  • 29. rethink data sharing
  • 30. operational mindset
  • 31. scalability
  • 32. we are data geeks not data center geeks
  • 33. two key trends
  • 34. develop applications
  • 35. distribute applications
  • 36. use applications
  • 37. some work
  • 38. filters some work ^
  • 39. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  • 40. Read Mapping; Mapping & SNP Discovery; De novo Genome Assembly
  • 41. Short Read Mapping
  • 42. Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008); African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)
  • 43. Alignment > 10000 CPU hrs
  • 44. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1)
  • 45. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  • 46. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  • 47. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale Need parallelization framework
  • 48. CloudBurst: catalog k-mers; collect seeds; end-to-end alignment
  • 49. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  • 50. CloudBurst efficiently reports every k-difference alignment of every read
  • 51. many applications only need the best alignment
  • 52. Bowtie: Ultrafast short read aligner Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 53. SOAPsnp: Consensus alignment and SNP calling Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 54. Crossbow: Rapid whole genome SNP analysis Ben Langmead Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • 55. Preprocessed reads
  • 56. Preprocessed reads Map: Bowtie
  • 57. Preprocessed reads Map: Bowtie Sort: Bin and partition
  • 58. Preprocessed reads Map: Bowtie Sort: Bin and partition Reduce: SOAPsnp
  • 59. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  • 60. Comparing Genomes
  • 61. Estimating relative evolutionary rates from sequence comparisons: identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. [Figure: species tree and gene tree for genes A to E in S. cerevisiae and C. elegans]
  • 62. Estimating relative evolutionary rates from sequence comparisons:
    1. Orthologs found using the reciprocal smallest distance (RSD) algorithm
    2. Build alignment between two orthologs
       >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
       >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
    3. Estimate distance given a substitution matrix
    [Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr with µπ entries; species tree and gene tree for genes A to E in S. cerevisiae and C. elegans]
  • 63. RSD algorithm summary [Figure: Genome I, Genome J; two "Align sequences & calculate distances" steps; pairwise distances ranging from D = 0.1 to D = 1.2; Orthologs: ib - jc, D = 0.1]
  • 64. Prof. Dennis Wall Harvard Medical School
  • 65. Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  • 66. massive computational demand
  • 67. 1000 genomes = 5,994,000 processes = 23,976,000 hours
  • 68. 2737 years
  • 69. periodic task
  • 70. must scale up
  • 71. not scalability gurus
  • 72. hadoop streaming
  • 73. compared 50+ genomes
  • 74. what’s next?
  • 75. de novo assembly
  • 76. machine learning and statistics
  • 77. protein structure prediction
  • 78. docking
  • 79. trajectory analysis
  • 80. key driving factors?
  • 81. the ecosystem
  • 82. Pig
  • 83. Cascading
  • 84. Hive
  • 85. RHIPE
  • 86. domain specific libraries and tools
  • 87. http://aws.amazon.com/publicdatasets/
  • 88. http://aws.amazon.com/education/
  • 89. Thank you! deesingh@amazon.com; Twitter:@mndoci Presentation ideas from @mza, @simon and @lessig
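
Note on slides 44 to 47 (Seed & Extend): the bound l/(k+1) is a pigeonhole argument. If a read of length l is cut into k+1 non-overlapping pieces and the alignment has at most k differences, at least one piece must match exactly. The following is a minimal sketch of that arithmetic, not code from CloudBurst; the function names and the 35 bp example read are made up for illustration, and integer division is assumed for the seed length.

```python
def min_exact_seed_length(read_length: int, max_differences: int) -> int:
    """Seed length l // (k + 1) guaranteed to match exactly (pigeonhole bound)."""
    if read_length <= 0 or max_differences < 0:
        raise ValueError("read_length must be positive, max_differences non-negative")
    seed_length = read_length // (max_differences + 1)
    if seed_length == 0:
        raise ValueError("too many differences allowed for a read this short")
    return seed_length


def split_into_seeds(read: str, max_differences: int) -> list:
    """Cut the read into k + 1 non-overlapping seeds as (offset, seed) pairs.

    With at most k differences, at least one of these seeds is untouched,
    so an exact-match lookup on the seeds cannot miss a valid alignment.
    """
    seed_length = min_exact_seed_length(len(read), max_differences)
    return [
        (i * seed_length, read[i * seed_length:(i + 1) * seed_length])
        for i in range(max_differences + 1)
    ]


# Hypothetical 35 bp read, allowing up to 3 differences.
read = "ACGTACGTAGCTAGCTAGGATCCATGCATGCAAGT"
print(min_exact_seed_length(len(read), 3))
for offset, seed in split_into_seeds(read, 3):
    print(offset, seed)
```

Shorter seeds (larger k) hit many more places in the reference, which is one way to read the "expensive to scale" point on slide 45.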
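
Note on slides 55 to 58 (Crossbow pipeline): the map, sort and reduce stages can be pictured with a toy, single-process sketch. Everything here is a stand-in and an assumption: `map_read` is a naive ungapped aligner (not Bowtie), `reduce_bin` is a majority-vote caller (not SOAPsnp), and the reference string, bin size and thresholds are invented. It only illustrates how records keyed by genomic bin flow from one stage to the next.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTTGCAAGGCTTAACCGGATCGATTACA"  # toy reference (hypothetical)
BIN_SIZE = 10        # genomic partition width (assumed)
MAX_MISMATCHES = 1   # alignment tolerance (assumed)


def map_read(read):
    """Map: place the read at its best ungapped position, emit (bin, (pos, base))."""
    best = None
    for pos in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[0]:
            best = (mismatches, pos)
    if best is None or best[0] > MAX_MISMATCHES:
        return []  # unaligned reads are dropped
    start = best[1]
    return [((start + i) // BIN_SIZE, (start + i, base)) for i, base in enumerate(read)]


def sort_and_partition(records):
    """Sort: group records by bin and order them by position within each bin."""
    bins = defaultdict(list)
    for key, value in records:
        bins[key].append(value)
    return {key: sorted(values) for key, values in sorted(bins.items())}


def reduce_bin(values, min_depth=2):
    """Reduce: report positions whose most common base differs from the reference."""
    pileup = defaultdict(Counter)
    for pos, base in values:
        pileup[pos][base] += 1
    calls = []
    for pos in sorted(pileup):
        base, _ = pileup[pos].most_common(1)[0]
        depth = sum(pileup[pos].values())
        if depth >= min_depth and base != REFERENCE[pos]:
            calls.append((pos, REFERENCE[pos], base))
    return calls


def run_pipeline(reads):
    records = [record for read in reads for record in map_read(read)]
    calls = []
    for values in sort_and_partition(records).values():
        calls.extend(reduce_bin(values))
    return calls


# Two reads carry a T->G change at reference position 12; one read matches exactly.
print(run_pipeline(["AGGCGTAACC", "GCGTAACCGG", "ACGTTGCAAG"]))
```

Because each bin is reduced independently, the reduce stage can run on as many machines as there are bins, which is the property the slides rely on.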
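
Note on slides 62 and 63 (RSD): the reciprocal step can be sketched as follows. This is a simplified illustration, not the Roundup implementation. It assumes the pairwise distances have already been estimated and are supplied as nested dictionaries; the gene names and distance values are illustrative and are not a reconstruction of the slide's figure.

```python
def reciprocal_smallest_distance(dist_i_to_j, dist_j_to_i):
    """Return (gene_i, gene_j, distance) for pairs that pick each other.

    dist_i_to_j[a][b] is the distance from gene a (genome I) to gene b (genome J),
    and dist_j_to_i[b][a] is the distance found when searching in the other direction.
    """
    orthologs = []
    for gene_i, forward in dist_i_to_j.items():
        if not forward:
            continue
        # Forward step: closest gene in genome J.
        gene_j = min(forward, key=forward.get)
        backward = dist_j_to_i.get(gene_j, {})
        # Reciprocal step: that gene's closest match in genome I must be gene_i.
        if backward and min(backward, key=backward.get) == gene_i:
            orthologs.append((gene_i, gene_j, forward[gene_j]))
    return orthologs


# Illustrative distances only.
forward = {"ib": {"ja": 1.2, "jb": 0.2, "jc": 0.1}}
reverse = {"jc": {"ia": 0.3, "ib": 0.1, "ic": 0.9}}
print(reciprocal_smallest_distance(forward, reverse))
```

In this toy input, ib's closest gene in genome J is jc, and jc's closest gene in genome I is ib, so the pair is reported with its distance.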
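
Note on slides 67 and 68: the figures are consistent with all-against-all genome pairs. The slides do not state the per-pair or per-process numbers, so the 12 processes per pair and 4 hours per process below are assumptions, back-calculated from the slide totals.

```python
from math import comb

GENOMES = 1000
PROCESSES_PER_PAIR = 12  # assumption: 5,994,000 / C(1000, 2)
HOURS_PER_PROCESS = 4    # assumption: 23,976,000 / 5,994,000

genome_pairs = comb(GENOMES, 2)            # unordered pairs of genomes
processes = genome_pairs * PROCESSES_PER_PAIR
cpu_hours = processes * HOURS_PER_PROCESS
cpu_years = cpu_hours / (24 * 365)         # serial time on a single CPU

print(f"{genome_pairs:,} pairs, {processes:,} processes, "
      f"{cpu_hours:,} hours, about {cpu_years:.0f} years")
```

The pair count grows quadratically with the number of genomes, which is why slides 69 and 70 treat this as a periodic task that must scale up.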