Hadoop for Bioinformatics

My Hadoop World presentation

Comments
  • amazing
  • duaridhi, I am more of an observer of what people are doing, although I do dabble as much as time permits (i.e. not much)
  • Hi Deepak. I enjoyed your presentation.. Are you currently working on this?
  • Data geek = sits in a dark room staring at a monitor. Data center geek = sits in a dark warehouse staring at a monitor.

    PS: At least you have that picture for posterity
  • Hi Deepak, enjoyed the presentation. What's the difference between a data geek and a data center geek? Yours geekily, Duncan.

    P.S. glad I got a haircut since slide #44 :-)

Hadoop for Bioinformatics: Presentation Transcript

  • Hadoop for Bioinformatics. Deepak Singh, Amazon Web Services. Hadoop World, NYC
  • Via Reavel under a CC-BY-NC-ND license
  • By ~Prescott under a CC-BY-NC license
  • data sets
  • many data sets
  • PFAM PDB GENBANK ENSEMBL Many Others
  • manageable
  • Image: Matt Wood
  • Human genome Image: Matt Wood
  • Image: Matt Wood
  • ~100 TB/Week Image: Matt Wood
  • ~100 TB/Week >2 PB/Year Image: Matt Wood
  • years
  • days
  • hours
  • gigabytes
  • terabytes
  • petabytes
  • really fast
  • typical informatics workflow
  • Via Christolakis under a CC-BY-NC-ND license
  • Via Argonne National Labs under a CC-BY-SA license
  • killer app Via Argonne National Labs under a CC-BY-SA license
  • Via asklar under a CC-BY license
  • Image: Chris Dagdigian
  • rethink algorithms
  • rethink computing
  • rethink data management
  • rethink data sharing
  • operational mindset
  • scalability
  • we are data geeks not data center geeks
  • two key trends
  • develop applications
  • distribute applications
  • use applications
  • some work
  • filters some work ^
  • High Throughput Sequence Analysis Mike Schatz, University of Maryland
  • Read Mapping; Mapping & SNP Discovery; De novo Genome Assembly
  • Short Read Mapping
  • Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008). African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)
  • Alignment > 10000 CPU hrs
  • Seed & Extend: good alignments must contain a significant exact alignment. For a read of length l with at most k differences, the minimal exact alignment length is l/(k+1). Expensive to scale; need a parallelization framework.
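The seed-length bound on this slide follows from a pigeonhole argument: if a read of length l aligns with at most k differences, cutting it into k+1 non-overlapping pieces leaves at least one piece untouched, so an exact match of length at least floor(l/(k+1)) must exist. A minimal Python sketch of that bound follows; the function names are illustrative and not taken from any tool mentioned in the talk.

```python
from typing import List, Tuple


def min_seed_length(read_length: int, max_differences: int) -> int:
    """Pigeonhole bound from the slide: l // (k + 1)."""
    if read_length <= 0 or max_differences < 0:
        raise ValueError("need read_length > 0 and max_differences >= 0")
    return read_length // (max_differences + 1)


def split_into_seeds(read: str, max_differences: int) -> List[Tuple[int, str]]:
    """Cut a read into k + 1 non-overlapping seeds, returned as (offset, seed).

    The last seed absorbs any remainder, so every seed is at least
    min_seed_length long and the seeds cover the whole read.
    """
    k = max_differences
    size = min_seed_length(len(read), k)
    if size == 0:
        raise ValueError("read is too short for this many differences")
    seeds = []
    for i in range(k + 1):
        start = i * size
        end = len(read) if i == k else start + size
        seeds.append((start, read[start:end]))
    return seeds


if __name__ == "__main__":
    # A 35 bp read with up to 3 differences has an exact seed of >= 8 bp.
    print(min_seed_length(35, 3))
    print(split_into_seeds("ACGTACGTAC", 2))
```

With k differences spread over k+1 seeds, at least one seed must match the reference exactly, which is why only exact seed hits need to be extended.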
  • CloudBurst Catalog k-mers Collect seeds End-to-end alignment
  • http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  • CloudBurst efficiently reports every k-difference alignment of every read
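The three phases named on the CloudBurst slide (catalog k-mers, collect seeds, end-to-end alignment) line up with map, shuffle, and reduce. The sketch below simulates that flow in memory in Python. It is not CloudBurst's implementation: it assumes mismatches only (no indels), a single short reference string, and hypothetical function names, and it exists only to show how shared seeds bring a read and a reference position together.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

Record = Tuple[str, Tuple[str, str, int]]  # (seed, (source, read_id, offset))


def map_reference(ref: str, s: int) -> Iterator[Record]:
    """Map: emit every length-s window of the reference, keyed by its sequence."""
    for i in range(len(ref) - s + 1):
        yield ref[i:i + s], ("ref", "", i)


def map_read(read_id: str, read: str, s: int) -> Iterator[Record]:
    """Map: emit the non-overlapping length-s seeds of one read."""
    for i in range(0, len(read) - s + 1, s):
        yield read[i:i + s], ("read", read_id, i)


def shuffle(records: List[Record]) -> Dict[str, List[Tuple[str, str, int]]]:
    """Shuffle: collect all records that share the same seed sequence."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(ref: str, reads: Dict[str, str], k: int) -> Set[Tuple[str, int, int]]:
    """Return (read_id, reference_start, mismatches) for hits with <= k mismatches."""
    if not reads:
        return set()
    s = min(len(r) for r in reads.values()) // (k + 1)  # seed length l/(k+1)
    if s == 0:
        raise ValueError("reads are too short for this many mismatches")
    records = list(map_reference(ref, s))
    for read_id, read in reads.items():
        records.extend(map_read(read_id, read, s))

    hits = set()
    # Reduce: for each shared seed, extend to an end-to-end comparison.
    for values in shuffle(records).values():
        ref_offsets = [off for src, _, off in values if src == "ref"]
        read_seeds = [(rid, off) for src, rid, off in values if src == "read"]
        for read_id, seed_offset in read_seeds:
            read = reads[read_id]
            for ref_offset in ref_offsets:
                start = ref_offset - seed_offset
                if start < 0 or start + len(read) > len(ref):
                    continue
                mm = mismatches(read, ref[start:start + len(read)])
                if mm <= k:
                    hits.add((read_id, start, mm))
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCATGCAAGCT"
    print(align(reference, {"r1": "TGCATGCA", "r2": "TGCTTGCA"}, k=1))
```

Keying on the seed sequence is what makes the problem parallel: each reducer sees only the reads and reference positions that share one seed, so the work can be spread across machines.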
  • many applications only need the best alignment
  • Bowtie: Ultrafast short read aligner Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
  • SOAPsnp: Consensus alignment and SNP calling
  • Crossbow: Rapid whole genome SNP analysis (Ben Langmead)
  • Preprocessed reads -> Map: Bowtie -> Sort: Bin and partition -> Reduce: SOAPsnp
  • Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
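The map, sort, and reduce stages of the Crossbow slides can be illustrated with a toy pipeline in Python. Everything here is a stand-in under stated assumptions: reads arrive already aligned (Bowtie is not invoked), `BIN_SIZE` is a hypothetical partition width, and the reduce step uses a naive majority vote rather than SOAPsnp's statistical model.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, List, Tuple

BIN_SIZE = 4  # hypothetical partition width in bases


def map_alignment(start: int, read: str) -> Iterator[Tuple[int, Tuple[int, str]]]:
    """Map stand-in: turn one aligned read into (bin, (position, base)) records."""
    for offset, base in enumerate(read):
        pos = start + offset
        yield pos // BIN_SIZE, (pos, base)


def sort_and_partition(records) -> Dict[int, List[Tuple[int, str]]]:
    """Sort: group records by genomic bin and order each bin by position."""
    bins = defaultdict(list)
    for bin_id, value in records:
        bins[bin_id].append(value)
    return {bin_id: sorted(values) for bin_id, values in sorted(bins.items())}


def reduce_bin(reference: str, pileup: List[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """Reduce stand-in: report positions whose majority base differs from the reference."""
    by_pos = defaultdict(Counter)
    for pos, base in pileup:
        by_pos[pos][base] += 1
    snps = []
    for pos in sorted(by_pos):
        base, _count = by_pos[pos].most_common(1)[0]
        if base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


def call_snps(reference: str, alignments: List[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """Run map, sort, and reduce in sequence; returns (position, ref_base, called_base)."""
    records = [rec for start, read in alignments for rec in map_alignment(start, read)]
    snps = []
    for _bin_id, pileup in sort_and_partition(records).items():
        snps.extend(reduce_bin(reference, pileup))
    return snps


if __name__ == "__main__":
    ref = "ACGTACGT"
    aligned = [(0, "ACGA"), (2, "GAAC"), (3, "AACG")]
    print(call_snps(ref, aligned))
```

Binning by position means each reducer handles one contiguous stretch of the genome, so SNP calling for different regions can proceed independently.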
  • Comparing Genomes
  • Estimating relative evolutionary rates from sequence comparisons: identification of probable orthologs. Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D. [Figure: species tree and gene tree for genes A to E in S. cerevisiae and C. elegans]
  • Estimating relative evolutionary rates from sequence comparisons:
    1. Orthologs found using the reciprocal smallest distance algorithm
    2. Build alignment between two orthologs
       >Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
       >Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
    3. Estimate distance given a substitution matrix
    [Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr with µπ entries; species tree and gene tree for genes A to E in S. cerevisiae and C. elegans]
  • RSD algorithm summary. [Figure: sequences from Genome I and Genome J are aligned and distances calculated in both directions; the pair ib - jc (D = 0.1) is reported as orthologs]
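The reciprocal step summarized above can be sketched in a few lines of Python: a pair is kept only when each sequence is the other's closest match across the two genomes. This is a simplified illustration, not the Roundup implementation. The `p_distance` function (fraction of differing positions in equal-length sequences) is a placeholder for the alignment and substitution-matrix distance described on the slides, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def p_distance(a: str, b: str) -> float:
    """Placeholder distance: fraction of differing positions (equal lengths only)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest(query: str, candidates: Dict[str, str],
            dist: Callable[[str, str], float]) -> str:
    """Name of the candidate sequence with the smallest distance to the query."""
    return min(candidates, key=lambda name: dist(query, candidates[name]))


def reciprocal_smallest_distance(
    genome_i: Dict[str, str],
    genome_j: Dict[str, str],
    dist: Callable[[str, str], float] = p_distance,
) -> List[Tuple[str, str, float]]:
    """Keep (name_i, name_j, distance) only when the best hit is mutual."""
    orthologs = []
    for name_i, seq_i in genome_i.items():
        name_j = closest(seq_i, genome_j, dist)           # forward search I -> J
        back = closest(genome_j[name_j], genome_i, dist)  # reverse search J -> I
        if back == name_i:
            orthologs.append((name_i, name_j, dist(seq_i, genome_j[name_j])))
    return orthologs


if __name__ == "__main__":
    genome_i = {"ia": "AAAAAAAAAA", "ib": "CCCCCCCCCC", "ic": "CCCCCCCCGG"}
    genome_j = {"ja": "AAAAAAAAAT", "jc": "CCCCCCCCGG"}
    print(reciprocal_smallest_distance(genome_i, genome_j))
```

In the example, ib finds jc as its closest match, but jc is closer to ic, so ib is not reported; only mutual best pairs survive.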
  • Prof. Dennis Wall Harvard Medical School
  • Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here. Good luck, researchers!
  • massive computational demand
  • 1000 genomes = 5,994,000 processes = 23,976,000 hours
  • 2737 years
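The figures on these two slides can be checked with simple arithmetic. The short Python sketch below uses only the numbers shown above; the per-pair and per-process factors it derives are inferred from those numbers, not stated in the slides.

```python
from math import comb

GENOMES = 1000
PROCESSES = 5_994_000      # from the slide
CPU_HOURS = 23_976_000     # from the slide
HOURS_PER_YEAR = 24 * 365  # 8,760, ignoring leap years

pairs = comb(GENOMES, 2)                   # 499,500 unordered genome pairs
processes_per_pair = PROCESSES / pairs     # 12.0 (inferred)
hours_per_process = CPU_HOURS / PROCESSES  # 4.0 (inferred)
years = CPU_HOURS / HOURS_PER_YEAR         # about 2737 years on one CPU

if __name__ == "__main__":
    print(f"{pairs:,} pairs, {processes_per_pair:g} processes per pair")
    print(f"{hours_per_process:g} hours per process, {years:.0f} serial years")
```

The pair count grows quadratically with the number of genomes, which is why the slides that follow stress that the task must scale up.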
  • periodic task
  • must scale up
  • not scalability gurus
  • hadoop streaming
  • compared 50+ genomes
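Hadoop Streaming lets any program that reads lines on standard input and writes tab-separated key/value lines on standard output act as a mapper or reducer, which is part of its appeal for teams that are "not scalability gurus". The Python sketch below shows that contract with a hypothetical record format of `gene_i<TAB>gene_j<TAB>distance`; the format and function names are assumptions for illustration, not the pipeline used in the talk.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Pass through well-formed 3-field records; the first field is the key."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 3 or fields[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        yield "\t".join(fields)


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """For each key, keep the record with the smallest distance.

    Input must be sorted by key, as the framework guarantees between
    the map and reduce phases.
    """
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for _gene_i, group in groupby(records, key=lambda rec: rec[0]):
        best = min(group, key=lambda rec: float(rec[2]))
        yield "\t".join(best)


if __name__ == "__main__":
    # Local simulation of map -> sort -> reduce on a few sample records.
    sample = ["ib\tjc\t0.1", "ib\tja\t1.2", "ia\tjb\t0.3", "# comment", ""]
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

To run under Hadoop Streaming, each function would live in its own script that feeds it `sys.stdin` and prints each yielded line; the local `sorted(...)` call stands in for the framework's shuffle and sort.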
  • what’s next?
  • de novo assembly
  • machine learning and statistics
  • protein structure prediction
  • docking
  • trajectory analysis
  • key driving factors?
  • the ecosystem
  • Pig
  • Cascading
  • Hive
  • RHIPE
  • domain specific libraries and tools
  • http://aws.amazon.com/publicdatasets/
  • http://aws.amazon.com/education/
  • Thank you! deesingh@amazon.com; Twitter: @mndoci. Presentation ideas from @mza, @simon and @lessig