Hadoop for Bioinformatics
Deepak Singh
Amazon Web Services
Hadoop World, NYC
Via Reavel under a CC-BY-NC-ND license
By ~Prescott under a CC-BY-NC license
data sets
many data sets
PFAM, PDB, GENBANK, ENSEMBL, Many Others
manageable
Image: Matt Wood
Human genome
Image: Matt Wood
Image: Matt Wood
~100 TB/Week
Image: Matt Wood
~100 TB/Week
                       >2 PB/Year
Image: Matt Wood
years
days
hours
gigabytes
terabytes
petabytes
really fast
typical informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
killer app




Via Argonne National Labs under a CC-BY-SA license
Via asklar under a CC-BY license
Image: Chris Dagdigian
rethink algorithms
rethink computing
rethink data management
rethink data sharing
operational mindset
scalability
we are data geeks not data center geeks
two key trends
develop applications
distribute applications
use applications
some work
filters
some work
   ^
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
• Read Mapping
• Mapping & SNP Discovery
• De novo Genome Assembly
Short Read Mapping
Asian Individual Genome: 3.3 Billion 35bp, 104 GB (Wang et al., 2008)
African Individual Genome: 4.0 Billion 35bp, 144 GB...
Alignment > 10000 CPU hrs
Seed & Extend
Good alignments must have significant
exact alignment

Minimal exact alignment length = l/(k+1)
Seed & Extend
Good alignments must have significant
exact alignment

Minimal exact alignment length = l/(k+1)



          ...
Seed & Extend
Good alignments must have significant
exact alignment

Minimal exact alignment length = l/(k+1)



          ...
Seed & Extend
Good alignments must have significant
exact alignment

Minimal exact alignment length = l/(k+1)



          ...
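The bound l/(k+1) on the Seed & Extend slides follows from the pigeonhole principle: if a read of length l is cut into k+1 non-overlapping pieces, at most k of them can contain a difference, so at least one piece matches the reference exactly. The sketch below illustrates the idea under simplifying assumptions: it counts mismatches only (no insertions or deletions), scans the reference with plain string search instead of an index, and uses function names that are hypothetical rather than taken from any tool in the talk.

```python
def min_seed_length(read_len: int, max_diffs: int) -> int:
    """Pigeonhole bound: length of the exact seed guaranteed to exist, floor(l / (k + 1))."""
    return read_len // (max_diffs + 1)


def split_into_seeds(read: str, max_diffs: int):
    """Cut the read into k+1 non-overlapping seeds, returned as (offset, seed) pairs."""
    seed_len = min_seed_length(len(read), max_diffs)
    if seed_len == 0:
        raise ValueError("read is too short for this many differences")
    seeds = [(i * seed_len, read[i * seed_len:(i + 1) * seed_len])
             for i in range(max_diffs)]
    # The last seed absorbs any remainder so the seeds cover the whole read.
    seeds.append((max_diffs * seed_len, read[max_diffs * seed_len:]))
    return seeds


def seed_and_extend(read: str, reference: str, max_diffs: int):
    """Return sorted reference positions where the read aligns with <= max_diffs mismatches."""
    hits = set()
    for offset, seed in split_into_seeds(read, max_diffs):
        start = reference.find(seed)              # seed: exact match only
        while start != -1:
            pos = start - offset                  # implied start of the full read
            if 0 <= pos <= len(reference) - len(read):
                window = reference[pos:pos + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_diffs:       # extend: verify the full alignment
                    hits.add(pos)
            start = reference.find(seed, start + 1)
    return sorted(hits)
```

For a 36 bp read and k = 3, `min_seed_length(36, 3)` gives 9, so only exact 9-mers need to be looked up before the more expensive extension step runs.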
CloudBurst




Catalog k-mers     Collect seeds   End-to-end alignment
http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
CloudBurst efficiently reports every k-difference alignment of every read
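The three phases named on the CloudBurst slide (catalog k-mers, collect seeds, end-to-end alignment) map naturally onto map, shuffle, and reduce. The following is an in-memory sketch of that shape, not CloudBurst's actual code: it counts mismatches only, the reducer looks sequences up in ordinary Python objects instead of carrying flanking sequence in each record, and all function names are hypothetical.

```python
from collections import defaultdict


def map_kmers(name, seq, seed_len, is_reference):
    """Map (catalog k-mers): emit (k-mer, record) pairs.
    The reference emits every overlapping k-mer; a read emits non-overlapping seeds."""
    step = 1 if is_reference else seed_len
    for off in range(0, len(seq) - seed_len + 1, step):
        yield seq[off:off + seed_len], (is_reference, name, off)


def shuffle(pairs):
    """Shuffle (collect seeds): group records that share the same k-mer."""
    groups = defaultdict(list)
    for kmer, record in pairs:
        groups[kmer].append(record)
    return groups


def reduce_seed(records, reference, reads, max_diffs):
    """Reduce (end-to-end alignment): extend each reference/read seed pair."""
    ref_offsets = [off for is_ref, _, off in records if is_ref]
    read_hits = [(name, off) for is_ref, name, off in records if not is_ref]
    for ref_off in ref_offsets:
        for read_name, read_off in read_hits:
            read = reads[read_name]
            start = ref_off - read_off
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            diffs = sum(a != b for a, b in zip(read, window))
            if diffs <= max_diffs:
                yield read_name, start, diffs


def align_all(reference, reads, max_diffs):
    """Run the three phases and return sorted (read, position, mismatches) tuples."""
    seed_len = min(len(r) for r in reads.values()) // (max_diffs + 1)
    pairs = list(map_kmers("ref", reference, seed_len, True))
    for name, seq in reads.items():
        pairs.extend(map_kmers(name, seq, seed_len, False))
    results = set()  # the same alignment can be found from more than one seed
    for records in shuffle(pairs).values():
        results.update(reduce_seed(records, reference, reads, max_diffs))
    return sorted(results)
```

Because each k-mer group is processed independently, the reduce step is the part that a framework such as Hadoop can spread across many machines.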
many applications only need the best alignment
Bowtie: Ultrafast short read aligner




Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignme...
SOAPsnp: Consensus alignment and SNP calling




Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient...
Crossbow: Rapid whole genome SNP analysis



                                                                             ...
Preprocessed reads
Preprocessed reads



   Map: Bowtie
Preprocessed reads



     Map: Bowtie



Sort: Bin and partition
Preprocessed reads



     Map: Bowtie



Sort: Bin and partition


  Reduce: SOAPsnp
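The map, sort, and reduce stages on the Crossbow slides can be mimicked in a few lines to show how the data flows between them. This is only a toy sketch: `map_align` is a brute-force stand-in for the Bowtie step, `reduce_call_snps` is a majority vote standing in for the SNP-calling step, and `BIN_SIZE` and `min_depth` are hypothetical parameters chosen for illustration.

```python
from collections import Counter, defaultdict

BIN_SIZE = 10  # hypothetical width, in bases, of each genome partition


def map_align(read, reference):
    """Map (stand-in for the aligner): best ungapped position by mismatch count."""
    def mismatches(pos):
        return sum(a != b for a, b in zip(read, reference[pos:pos + len(read)]))
    best = min(range(len(reference) - len(read) + 1), key=mismatches)
    return best, read


def bin_and_sort(alignments):
    """Sort: split alignments into per-base records, binned by genome region."""
    bins = defaultdict(list)
    for pos, read in alignments:
        for i, base in enumerate(read):
            bins[(pos + i) // BIN_SIZE].append((pos + i, base))
    return {b: sorted(records) for b, records in sorted(bins.items())}


def reduce_call_snps(records, reference, min_depth=2):
    """Reduce (stand-in for the SNP caller): majority-vote consensus per position."""
    pileup = defaultdict(Counter)
    for pos, base in records:
        pileup[pos][base] += 1
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            yield pos, reference[pos], base


def crossbow_like(reads, reference):
    """Chain the three stages; returns (position, reference base, called base) tuples."""
    alignments = [map_align(read, reference) for read in reads]
    snps = []
    for records in bin_and_sort(alignments).values():
        snps.extend(reduce_call_snps(records, reference))
    return snps
```

Binning by genome region is what lets each reducer call SNPs on its own partition without seeing the rest of the genome.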
Crossbow condenses over 1,000 hours of resequencing computation into a few hour...
Comparing Genomes
Estimating relative evolutionary rates from sequence comparisons:
Identification of probable ort...
Estimating relative evolutionary rates from sequence comparisons:
1. Orthologs found ...
RSD algorithm summary
[Figure: Genome I and Genome J; Ib ...]
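The core of the reciprocal smallest distance idea is a two-way check: a gene in Genome I and a gene in Genome J are paired only if each is the other's closest match by evolutionary distance. The sketch below shows just that reciprocal check. It assumes pre-aligned, equal-length sequences and uses a simple mismatch fraction (`p_distance`) as a placeholder for a proper distance estimate from a substitution matrix; all names are hypothetical.

```python
def p_distance(a, b):
    """Toy distance: fraction of differing positions (assumes equal-length, aligned input)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest(query_seq, genome, dist):
    """Name of the gene in `genome` with the smallest distance to `query_seq`."""
    return min(genome, key=lambda name: dist(query_seq, genome[name]))


def rsd_orthologs(genome_i, genome_j, dist=p_distance):
    """Return (gene_i, gene_j, distance) for every reciprocal smallest-distance pair."""
    orthologs = []
    for name_i, seq_i in genome_i.items():
        name_j = closest(seq_i, genome_j, dist)          # forward: I -> J
        back = closest(genome_j[name_j], genome_i, dist)  # reverse: J -> I
        if back == name_i:                                # reciprocal, so keep the pair
            orthologs.append((name_i, name_j, dist(seq_i, genome_j[name_j])))
    return orthologs
```

Every pair of genomes can be processed independently of every other pair, which is why this workload parallelizes so easily.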
Prof. Dennis Wall
Harvard Medical School
Roundup is a database of orthologs
and their evolutionary distances.
To get started, click browse. Alternatively, you can
...
massive computational demand
1000 genomes = 5,994,000 processes = 23,976,000 hours
2737 years
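The figures on these slides can be checked with a little arithmetic. The slides give only the totals; the factors of 12 runs per genome pair and 4 hours per process below are inferred by division and should be read as assumptions, not as numbers stated in the talk.

```python
from math import comb

GENOMES = 1000
PROCESSES = 5_994_000        # from the slide
TOTAL_HOURS = 23_976_000     # from the slide
HOURS_PER_YEAR = 24 * 365

pairs = comb(GENOMES, 2)                     # 499,500 unordered genome pairs
runs_per_pair = PROCESSES / pairs            # inferred: 12 runs per pair
hours_per_process = TOTAL_HOURS / PROCESSES  # inferred: 4 hours per process
serial_years = TOTAL_HOURS / HOURS_PER_YEAR  # about 2737 years on a single core

print(f"{pairs:,} pairs, {runs_per_pair:g} runs per pair, "
      f"{hours_per_process:g} h per process, {serial_years:.0f} years serial")
```

The number of pairs grows quadratically with the number of genomes, which is why the job has to be rerun and scaled up as new genomes are added.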
periodic task
must scale up
not scalability gurus
hadoop streaming
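Hadoop Streaming lets any executable act as a mapper or reducer: the program reads records as lines on standard input and writes tab-separated key/value lines to standard output. The slides do not show how the genome comparison was wired up, so the following is only a guess at the general shape: the input format (one whitespace-separated genome pair per line), the key format, and `compare_genomes` are all hypothetical.

```python
import io


def compare_genomes(genome_a, genome_b):
    """Hypothetical placeholder for the real per-pair computation."""
    return f"compared:{genome_a}:{genome_b}"


def run_mapper(lines, out):
    """Read one genome pair per line and write 'key<TAB>value' lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:      # skip blank or malformed records
            continue
        a, b = sorted(fields)     # canonical order so each pair has a single key
        out.write(f"{a}|{b}\t{compare_genomes(a, b)}\n")


# Local demonstration with in-memory streams instead of stdin/stdout.
demo_out = io.StringIO()
run_mapper(["mouse human\n", "yeast worm\n"], demo_out)
print(demo_out.getvalue(), end="")
```

In an actual streaming job the script would call `run_mapper(sys.stdin, sys.stdout)` and be supplied through the `-mapper` option, so existing command-line tools can be reused without writing Java.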
compared 50+ genomes
what’s next?
de novo assembly
machine learning and statistics
protein structure prediction
docking
trajectory analysis
key driving factors?
the ecosystem
Pig
Cascading
Hive
RHIPE
domain specific libraries and tools
http://aws.amazon.com/publicdatasets/
http://aws.amazon.com/education/
Thank you!
deesingh@amazon.com; Twitter: @mndoci
Presentation ideas from @mza, @simon and @lessig
Hw09   Hadoop For Bioinfomatics
Hw09 Hadoop For Bioinfomatics

  1. 1. Hadoop for Bioinformatics Deepak Singh Amazon Web Services Hadoop World, NYC
  2. 2. Via Reavel under a CC-BY-NC-ND license
  3. 3. By ~Prescott under a CC-BY-NC license
  4. 4. data sets
  5. 5. many data sets
  6. 6. PFAM PDB GENBANK ENSEMBL Many Others
  7. 7. manageable
  8. 8. Image: Matt Wood
  9. 9. Human genome Image: Matt Wood
  10. 10. Image: Matt Wood
  11. 11. ~100 TB/Week Image: Matt Wood
  12. 12. ~100 TB/Week >2 PB/Year Image: Matt Wood
  13. 13. years
  14. 14. days
  15. 15. hours
  16. 16. gigabytes
  17. 17. terabytes
  18. 18. petabytes
  19. 19. really fast
  20. 20. typical informatics workflow
  21. 21. Via Christolakis under a CC-BY-NC-ND license
  22. 22. Via Argonne National Labs under a CC-BY-SA license
  23. 23. killer app Via Argonne National Labs under a CC-BY-SA license
  24. 24. Via asklar under a CC-BY license
  25. 25. Image: Chris Dagdigian
  26. 26. rethink algorithms
  27. 27. rethink computing
  28. 28. rethink data management
  29. 29. rethink data sharing
  30. 30. operational mindset
  31. 31. scalability
  32. 32. we are data geeks not data center geeks
  33. 33. two key trends
  34. 34. develop applications
  35. 35. distribute applications
  36. 36. use applications
  37. 37. some work
  38. 38. filters some work ^
  39. 39. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  40. 40. • Read Mapping • Mapping & SNP Discovery • De novo Genome Assembly
  41. 41. Short Read Mapping
  42. 42. Asian Individual Genome: 3.3 Billion 35bp, 104 GB (Wang et al., 2008) African Individual Genome: 4.0 Billion 35bp, 144 GB (Bentley et al., 2008)
  43. 43. Alignment > 10000 CPU hrs
  44. 44. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1)
  45. 45. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  46. 46. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  47. 47. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale Need parallelization framework
  48. 48. CloudBurst Catalog k-mers Collect seeds End-to-end alignment
  49. 49. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  50. 50. CloudBurst efficiently reports every k-difference alignment of every read
  51. 51. many applications only need the best alignment
  52. 52. Bowtie: Ultrafast short read aligner Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  53. 53. SOAPsnp: Consensus alignment and SNP calling Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  54. 54. Crossbow: Rapid whole genome SNP analysis Ben Langmead Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  55. 55. Preprocessed reads
  56. 56. Preprocessed reads Map: Bowtie
  57. 57. Preprocessed reads Map: Bowtie Sort: Bin and partition
  58. 58. Preprocessed reads Map: Bowtie Sort: Bin and partition Reduce: SOAPsnp
  59. 59. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  60. 60. Comparing Genomes
  61. 61. Estimating relative evolutionary rates from sequence comparisons: Identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. [Figure: species tree and gene tree for genes A, B, C, D, E in S. cerevisiae and C. elegans]
  62. 62. Estimating relative evolutionary rates from sequence comparisons: 1. Orthologs found using the reciprocal smallest distance algorithm. 2. Build alignment between two orthologs: >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-… >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL… 3. Estimate distance given a substitution matrix. [Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr with µπ entries; species tree and gene tree for genes A, B, C, D, E in S. cerevisiae and C. elegans]
  63. 63. RSD algorithm summary. [Figure: sequences from Genome I and Genome J (Ib, Jc) are aligned and distances calculated in both directions, with example distances D = 0.1, 1.2, 0.2, 0.3, 0.9; orthologs: ib - jc, D = 0.1]
  64. 64. Prof. Dennis Wall Harvard Medical School
  65. 65. Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  66. 66. massive computational demand
  67. 67. 1000 genomes = 5,994,000 processes = 23,976,000 hours
  68. 68. 2737 years
  69. 69. periodic task
  70. 70. must scale up
  71. 71. not scalability gurus
  72. 72. hadoop streaming
  73. 73. compared 50+ genomes
  74. 74. what’s next?
  75. 75. de novo assembly
  76. 76. machine learning and statistics
  77. 77. protein structure prediction
  78. 78. docking
  79. 79. trajectory analysis
  80. 80. key driving factors?
  81. 81. the ecosystem
  82. 82. Pig
  83. 83. Cascading
  84. 84. Hive
  85. 85. RHIPE
  86. 86. domain specific libraries and tools
  87. 87. http://aws.amazon.com/publicdatasets/
  88. 88. http://aws.amazon.com/education/
  89. 89. Thank you! deesingh@amazon.com; Twitter:@mndoci Presentation ideas from @mza, @simon and @lessig