Hw09 Hadoop For Bioinformatics
Transcript

  1. Hadoop for Bioinformatics Deepak Singh Amazon Web Services Hadoop World, NYC
  2. Via Reavel under a CC-BY-NC-ND license
  3. By ~Prescott under a CC-BY-NC license
  4. data sets
  5. many data sets
  6. PFAM PDB GENBANK ENSEMBL Many Others
  7. manageable
  8. Image: Matt Wood
  9. Human genome Image: Matt Wood
  10. Image: Matt Wood
  11. ~100 TB/Week Image: Matt Wood
  12. ~100 TB/Week >2 PB/Year Image: Matt Wood
  13. years
  14. days
  15. hours
  16. gigabytes
  17. terabytes
  18. petabytes
  19. really fast
  20. typical informatics workflow
  21. Via Christolakis under a CC-BY-NC-ND license
  22. Via Argonne National Labs under a CC-BY-SA license
  23. killer app Via Argonne National Labs under a CC-BY-SA license
  24. Via asklar under a CC-BY license
  25. Image: Chris Dagdigian
  26. rethink algorithms
  27. rethink computing
  28. rethink data management
  29. rethink data sharing
  30. operational mindset
  31. scalability
  32. we are data geeks not data center geeks
  33. two key trends
  34. develop applications
  35. distribute applications
  36. use applications
  37. some work
  38. filters some work ^
  39. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  40. • Read Mapping • Mapping & SNP Discovery • De novo Genome Assembly
  41. Short Read Mapping
  42. Asian individual genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008). African individual genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008).
  43. Alignment > 10000 CPU hrs
  44. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1)
  45. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  46. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale
  47. Seed & Extend Good alignments must have significant exact alignment Minimal exact alignment length = l/(k+1) Expensive to scale Need parallelization framework
  48. CloudBurst Catalog k-mers Collect seeds End-to-end alignment
  49. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  50. CloudBurst efficiently reports every k-difference alignment of every read
  51. many applications only need the best alignment
  52. Bowtie: Ultrafast short read aligner Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  53. SOAPSnp: Consensus alignment and SNP calling Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  54. Crossbow: Rapid whole genome SNP analysis Ben Langmead Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  55. Preprocessed reads
  56. Preprocessed reads Map: Bowtie
  57. Preprocessed reads Map: Bowtie Sort: Bin and partition
  58. Preprocessed reads Map: Bowtie Sort: Bin and partition Reduce: SoapSNP
  59. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  60. Comparing Genomes
  61. Estimating relative evolutionary rates from sequence comparisons: identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. [Figure: species tree and gene tree with leaves A, B, C, D, E; S. cerevisiae, C. elegans]
  62. Estimating relative evolutionary rates from sequence comparisons: (1) Orthologs found using the reciprocal smallest distance algorithm. (2) Build alignment between two orthologs: >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-… >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL… (3) Estimate distance given a substitution matrix. [Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr with entries µπ; species tree and gene tree with leaves A, B, C, D, E; S. cerevisiae, C. elegans]
  63. RSD algorithm summary. [Figure: sequences from Genome I and Genome J are aligned and distances calculated in both directions; candidate pairs are labeled with distances D between 0.1 and 1.2; orthologs: ib - jc, D = 0.1]
  64. Prof. Dennis Wall Harvard Medical School
  65. Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  66. massive computational demand
  67. 1000 genomes = 5,994,000 processes = 23,976,000 hours
  68. 2737 years
  69. periodic task
  70. must scale up
  71. not scalability gurus
  72. hadoop streaming
  73. compared 50+ genomes
  74. what’s next?
  75. de novo assembly
  76. machine learning and statistics
  77. protein structure prediction
  78. docking
  79. trajectory analysis
  80. key driving factors?
  81. the ecosystem
  82. Pig
  83. Cascading
  84. Hive
  85. RHIPE
  86. domain specific libraries and tools
  87. http://aws.amazon.com/publicdatasets/
  88. http://aws.amazon.com/education/
  89. Thank you! deesingh@amazon.com; Twitter:@mndoci Presentation ideas from @mza, @simon and @lessig
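
The seed-and-extend bound on slides 44 to 47 follows from the pigeonhole principle: if a read of length l aligns with at most k differences, cutting it into k + 1 non-overlapping pieces leaves at least one piece untouched, so an exact match of length at least l/(k+1), rounded down, must exist. A minimal Python sketch of that calculation follows. The function names are hypothetical and are not taken from CloudBurst, and differences are treated as mismatches only.

```python
def min_seed_length(read_length: int, max_differences: int) -> int:
    """Shortest exact match guaranteed by the pigeonhole bound l / (k + 1)."""
    if read_length <= 0 or max_differences < 0:
        raise ValueError("read_length must be positive, max_differences non-negative")
    return read_length // (max_differences + 1)


def split_into_seeds(read: str, max_differences: int) -> list[str]:
    """Cut a read into k + 1 non-overlapping seeds of the guaranteed length.

    Leftover bases at the end of the read are ignored in this sketch.
    """
    seed_len = min_seed_length(len(read), max_differences)
    return [read[i * seed_len:(i + 1) * seed_len]
            for i in range(max_differences + 1)]
```

For the 35 bp reads on slide 42 and k = 3, `min_seed_length(35, 3)` gives a seed length of 8, so at least one of the four seeds must match the reference exactly.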
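
The Crossbow dataflow on slides 55 to 58 (map with Bowtie, sort by bin and partition, reduce with SoapSNP) can be illustrated with a toy pipeline. Everything below is a stand-in under stated assumptions: the brute-force aligner is not Bowtie, the majority vote is not SoapSNP's statistical model, and `REFERENCE`, `BIN_SIZE` and all function names are invented for illustration.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGGTCAGTTACGATC"  # toy reference, not real data
BIN_SIZE = 10                       # hypothetical partition width in bases


def map_align(read, reference=REFERENCE, max_mismatches=1):
    """Stand-in for the map step: brute-force best placement of one read.

    Emits one (bin, (position, base)) record per aligned base.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    if best is None:
        return []  # unaligned reads are dropped
    start = best[0]
    return [((start + i) // BIN_SIZE, (start + i, base))
            for i, base in enumerate(read)]


def shuffle_sort(records):
    """Stand-in for the sort phase: group by bin, order by position."""
    bins = defaultdict(list)
    for key, value in records:
        bins[key].append(value)
    return {key: sorted(values) for key, values in sorted(bins.items())}


def reduce_call_snps(pileup, reference=REFERENCE):
    """Stand-in for the reduce step: majority-vote consensus per position."""
    by_position = defaultdict(Counter)
    for pos, base in pileup:
        by_position[pos][base] += 1
    snps = []
    for pos in sorted(by_position):
        consensus, _count = by_position[pos].most_common(1)[0]
        if consensus != reference[pos]:
            snps.append((pos, reference[pos], consensus))
    return snps


def run_pipeline(reads):
    """Chain map, sort and reduce on a list of reads."""
    records = [rec for read in reads for rec in map_align(read)]
    snps = []
    for _bin, pileup in shuffle_sort(records).items():
        snps.extend(reduce_call_snps(pileup))
    return snps
```

Keying the map output by genomic bin is what lets each reducer work on one region independently, which is the property the sort step on slide 57 provides.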
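
The reciprocal smallest distance idea summarized on slides 62 and 63 can be sketched as follows: a gene in genome I and a gene in genome J are reported as orthologs only if each is the other's smallest-distance partner. This sketch assumes the pairwise distances have already been estimated and stored in dictionaries; the alignment and distance-estimation steps are omitted, and the gene labels and distance values are invented for illustration.

```python
def reciprocal_smallest_distance(dist_forward, dist_reverse):
    """Return (i, j, distance) for pairs that are mutual nearest partners.

    dist_forward[i][j]: distance from gene i (genome I) to gene j (genome J)
    dist_reverse[j][i]: distance from gene j (genome J) back to gene i
    """
    orthologs = []
    for i, row in dist_forward.items():
        if not row:
            continue
        j = min(row, key=row.get)        # closest gene in genome J
        back = dist_reverse.get(j, {})
        if back and min(back, key=back.get) == i:  # does j point back to i?
            orthologs.append((i, j, row[j]))
    return orthologs


# Toy distances, invented for illustration.
forward = {
    "ia": {"ja": 0.4, "jc": 0.3},
    "ib": {"ja": 1.2, "jb": 0.9, "jc": 0.1},
}
reverse = {
    "ja": {"ia": 0.4, "ib": 1.2},
    "jc": {"ia": 0.3, "ib": 0.1, "ic": 0.2},
}
pairs = reciprocal_smallest_distance(forward, reverse)
```

Here `ia` picks `jc` as its closest partner, but `jc` is closer to `ib`, so only the `ib` and `jc` pair is kept.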
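
The totals on slides 67 and 68 can be reproduced with a short calculation. The slides give only the totals, so the two factors below are assumptions inferred from them: 12 runs per genome pair and 4 hours per run are the values that make the arithmetic come out to the published figures.

```python
from math import comb

genomes = 1000
pairs = comb(genomes, 2)           # every unordered pair of genomes

runs_per_pair = 12                 # assumption inferred from the slide totals
hours_per_run = 4                  # assumption inferred from the slide totals

processes = pairs * runs_per_pair
total_hours = processes * hours_per_run
total_years = total_hours / (24 * 365)

print(f"{pairs:,} pairs, {processes:,} processes, "
      f"{total_hours:,} hours, about {round(total_years):,} years")
```

The quadratic growth in the number of pairs is what drives the demand: doubling the number of genomes roughly quadruples the work.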
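
Slide 72 names Hadoop Streaming, which runs any executable as a mapper or reducer: the program reads input lines on standard input and writes tab-separated key and value pairs to standard output. A minimal mapper sketch in Python follows. It assumes each input line holds one tab-separated genome pair, and `compare_genomes` is a hypothetical placeholder for the real per-pair computation, which the slides do not describe.

```python
import sys


def compare_genomes(genome_a, genome_b):
    """Hypothetical placeholder for the per-pair ortholog computation."""
    return f"compared:{genome_a}:{genome_b}"


def map_lines(lines):
    """Turn 'genomeA<TAB>genomeB' lines into 'key<TAB>value' records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        genome_a, genome_b = line.split("\t")
        key = f"{genome_a}|{genome_b}"
        yield f"{key}\t{compare_genomes(genome_a, genome_b)}"


if __name__ == "__main__":
    # Hadoop Streaming feeds input splits on stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Because each genome pair is independent, a script like this can be handed to the streaming jar through its `-mapper` option and the framework takes care of distributing the pairs across the cluster.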
