Talk at dnGASP workshop, April 5, 2011

Presentation Transcript

  • Combining "overlap-layout-consensus" and de Bruijn graph approaches for de novo genome assembly
    Alexey Sergushichev, Anton Alexandrov, Sergey Kazakov, Sergey Melnikov, Vladislav Isenbaev, Fedor Tsarev
    St. Petersburg State University of IT, Mechanics and Optics, Russia
    In collaboration with: Egor Prokhortchouk and Ekaterina Khrameeva, Genoanalytica, Moscow, Russia
    Sequence Mapping and Assembly Assessment Project, dnGASP workshop, Barcelona, April 5th, 2011
  • Introduction
    • Imagine you have two computers:
      – 24 cores (Intel Xeon 2.40 GHz), 24 GB RAM
      – 24 cores (AMD Opteron 6174 2.2 GHz), 64 GB RAM
    • ...but you don't know about the second one ☺
    • You are to assemble the genome from the dnGASP contest
  • Algorithm
  • Error Correction: Read Truncation
    • Scan each part of each PE read from the end until the first base with quality less than 90%
    • Truncate each part of each read at that position
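The truncation step can be sketched roughly as follows. The slide does not give the quality encoding or the exact scan direction, so this sketch makes two loud assumptions: "quality" is taken to be the probability that a base call is correct, derived from a Phred score, and the scan is assumed to start at the first base of the read and keep the prefix before the first low-quality base. The function names are hypothetical.

```python
def error_probability(phred: int) -> float:
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-phred / 10)


def truncate_read(bases: str, phred_scores: list, min_correct: float = 0.90) -> str:
    """Cut the read at the first base whose probability of being correct
    falls below min_correct (assumed reading of 'quality less than 90%')."""
    for i, q in enumerate(phred_scores):
        if 1.0 - error_probability(q) < min_correct:
            return bases[:i]  # keep only the bases before the low-quality one
    return bases  # no low-quality base found, keep the whole read


# Usage: the fourth base has Phred 5 (about 68% correct), so the read is cut there.
print(truncate_read("ACGTAC", [30, 30, 20, 5, 30, 30]))
```

Each half of a paired-end read would be passed through the function separately.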
  • Error Correction: Frequency Analysis
    • Consider all 30-character substrings of the reads and their reverse complements
    • Count the occurrences of each of these substrings
      – Occurs rarely: contains an error (is untrusted)
      – Occurs frequently: is trusted
    • The threshold for each case was chosen manually
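A minimal in-memory sketch of the frequency analysis, for illustration only: it counts every k-mer together with its reverse complement and splits them by a threshold. The function names are hypothetical, reads are assumed to contain only A, C, G and T, and the default cutoff of 4 is the one quoted on the distribution-curve slide. A plain `Counter` would not scale to the billions of 30-mers in the real data.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k=30):
    """Count each k-mer of each read and the reverse complement of that k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] += 1
            counts[reverse_complement(kmer)] += 1
    return counts


def trusted_kmers(counts, min_count=4):
    """k-mers seen at least min_count times are treated as trusted."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


# Usage with a toy k of 3 instead of 30.
counts = count_kmers(["ACGTA", "ACGTA"], k=3)
print(sorted(trusted_kmers(counts)))
```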
  • Error Correction: Distribution Curve
    [Plot: distribution of 30-mer occurrence counts; x-axis ticks from 1 to 47, y-axis ticks from 0 to 3 000 000 000]
    • < 4 occurrences: untrusted
    • Other 30-mers: trusted
  • Error Correction: Buckets
    • Memory:
      – Each substring is stored as a 64-bit integer
      – Number of occurrences: a 32-bit integer
      – ~6·10^9 distinct 30-mers in all PE reads
      – 72 GB in total
    • Split the 30-mers into buckets according to their prefixes
    • Prefix of length k → 4^k buckets
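The memory estimate and the bucketing rule can be checked with a short sketch. The 2-bits-per-base packing is an assumption (the slide only says a 30-mer fits in a 64-bit integer, which 30 × 2 = 60 bits does), and the function names are hypothetical.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base (assumed encoding)."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


def bucket_index(kmer: str, prefix_len: int) -> int:
    """Bucket number in the range [0, 4**prefix_len), taken from the prefix."""
    return encode_kmer(kmer[:prefix_len])


def table_bytes(distinct_kmers: int, key_bytes: int = 8, count_bytes: int = 4) -> int:
    """Bytes needed for one 64-bit key and one 32-bit counter per k-mer."""
    return distinct_kmers * (key_bytes + count_bytes)


# About 6e9 distinct 30-mers at 12 bytes each is 72e9 bytes, i.e. 72 GB.
print(table_bytes(6 * 10**9) / 10**9, "GB")
print(bucket_index("GTACCA", prefix_len=2), "of", 4**2, "buckets")
```

With a prefix of length k, each bucket holds on average 1/4^k of the table, so k can be chosen to make one bucket fit into the available RAM.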
  • Error Correction
    • Process each bucket separately
    • Consider an untrusted 30-mer
      – Try to change one base in it (outside the k-base prefix): (30-k)·3 ways
      – If exactly one resulting 30-mer is trusted, fix the corresponding read
    • To fix an error in the prefix we can load 3k more buckets into RAM, or...
    • Not load them: consider the reverse complement of the 30-mer instead
      – Example: AGTACAT ↔ ATGTACT
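The substitution search and the reverse-complement trick can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the trusted k-mers are held in one in-memory set instead of per-prefix buckets, that set is assumed to contain reverse complements as well, and all names are hypothetical. An error inside the prefix of a k-mer sits in the suffix of its reverse complement, so only suffix positions ever need to be changed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def suffix_candidates(kmer: str, prefix_len: int):
    """Yield every k-mer that differs in exactly one base outside the prefix:
    (len(kmer) - prefix_len) * 3 candidates in total."""
    for pos in range(prefix_len, len(kmer)):
        for base in "ACGT":
            if base != kmer[pos]:
                yield kmer[:pos] + base + kmer[pos + 1:]


def correct_kmer(kmer: str, trusted: set, prefix_len: int):
    """Return the unique trusted single-base fix, or None if there is none
    or more than one."""
    fixes = {c for c in suffix_candidates(kmer, prefix_len) if c in trusted}
    # Errors in the prefix: search the suffix of the reverse complement instead.
    rc = reverse_complement(kmer)
    fixes.update(reverse_complement(c)
                 for c in suffix_candidates(rc, prefix_len) if c in trusted)
    return fixes.pop() if len(fixes) == 1 else None


# Usage with toy 7-mers: the error is in the first base, inside the prefix.
trusted = {"AGTACAT", "ATGTACT"}
print(correct_kmer("CGTACAT", trusted, prefix_len=2))
```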
  • Error Correction: Results
    • Used the machine with 24 cores and 24 GB RAM for 24 hours
    • Number of distinct 30-mers:
      – Before: 6 533 327 606
      – After: 3 911 459 530 (~40% less)
    • Number of trusted 30-mers:
      – Before: 3 070 814 230
      – After: 3 369 674 264 (~10% more)
  • Quasi-contigs Assembly
    • Input: a set of PE reads
    • Goal: fill the gap between the ends
    • From this picture…
  • Quasi-contigs Assembly
    • …to this
      [Diagram labels: 114, 114, AGCT..., ~500]
    • Construct a de Bruijn graph from the reads
    • Find paths between the vertices corresponding to the ends of the reads
      – with a brute-force algorithm
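A brute-force path search of this kind can be sketched as a length-bounded depth-first search. The sketch assumes that vertices are k-mers and edges are trusted (k+1)-mers, as listed on the parameters slide, and that both read ends are given on the same strand. The function names and the `limit` parameter are hypothetical.

```python
def build_graph(trusted_edge_mers):
    """Each trusted (k+1)-mer is an edge from its first k bases to its last k."""
    graph = {}
    for mer in trusted_edge_mers:
        graph.setdefault(mer[:-1], []).append(mer[1:])
    return graph


def find_quasi_contigs(graph, start, end, min_len, max_len, limit=2):
    """Enumerate sequences spelled by paths from start to end whose length
    lies in [min_len, max_len]; stop once `limit` sequences are found."""
    found = []
    stack = [(start, start)]  # (current vertex, sequence spelled so far)
    while stack and len(found) < limit:
        vertex, seq = stack.pop()
        if vertex == end and len(seq) >= min_len:
            found.append(seq)
        if len(seq) < max_len:  # the length bound guarantees termination
            for nxt in graph.get(vertex, ()):
                stack.append((nxt, seq + nxt[-1]))
    return found


# Usage with toy 3-mer vertices and 4-mer edges taken from "ACGTTGCA".
graph = build_graph(["ACGT", "CGTT", "GTTG", "TTGC", "TGCA"])
print(find_quasi_contigs(graph, "ACG", "GCA", min_len=5, max_len=10))
```

With `limit=2` the three outcomes are easy to tell apart: one sequence means the insert is restored, two means there are several ways to restore it, and an empty list means there is none.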
  • T-Services Company
    • Overall performance of the cluster is over 20 Tflops, based on:
      – 2 x AMD Opteron 6174 "Magny-Cours" 2.2 GHz, 64 GB RAM DDR3 1333 MHz
      – 2 x Intel Xeon E5410 2.33 GHz, 16 GB RAM DDR2 667 MHz
      – 2 x Intel Xeon E5450 3.0 GHz, 16 GB RAM DDR2 667 MHz
    • Provided exclusive access to the node with 64 GB of RAM
  • Quasi-Contigs Assembly Parameters
    • Used the machine with 24 cores and 64 GB of RAM for 20 hours
    • Vertices: 30-mers
    • Edges: trusted 31-mers
    • Minimal length of a quasi-contig: 334
    • Maximal length of a quasi-contig: 550
  • Quasi-Contigs Assembly Results
    • 67% of inserts restored to quasi-contigs
    • Not restored:
      – ~27% of inserts: many ways to restore
      – ~6% of inserts: no way to restore
  • Quasi-Contigs Assembly Results
    [Plot: length distributions; pink: insert lengths, blue: quasi-contig lengths; x-axis ticks from 264 to 544, y-axis ticks from 0 to 1.4E-02]
  • Contigs & Scaffolds Assembly
    • Contigs assembly: Newbler
      – Used quasi-contigs from 24 files (of 88)
      – 60 hours
    • Scaffolds assembly: ABySS
      – 40 hours per library
  • Overall Results
                       n     mean     N50       max       Sum
    Newbler: A    401257     3694    7379   6279498   1.482e9
    ABySS: A      422207     4635   12580   6279661   1.492e9
    ABySS: B      417403     4808   22788   6279463   1.516e9
    ABySS: C      526028     3647   14170   6279463   1.522e9
    ABySS: D      580217     3275    8070   6279463   1.525e9
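For readers unfamiliar with the columns, the summary statistics can be computed as in the sketch below. It uses the common definition of N50 (the length L such that sequences of length at least L cover at least half of the total); the slides do not say which tool or exact definition produced the reported values, and the function names are hypothetical.

```python
def n50(lengths):
    """Length of the sequence at which the running total of lengths, taken
    in decreasing order, first reaches half of the grand total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


def summarize(lengths):
    """The five columns of the results table for one assembly."""
    total = sum(lengths)
    return {
        "n": len(lengths),
        "mean": total / len(lengths),
        "N50": n50(lengths),
        "max": max(lengths),
        "sum": total,
    }


# Usage on toy contig lengths.
print(summarize([2, 3, 4, 5, 6]))
```

As a sanity check on the table, n × mean reproduces the Sum column: 401257 × 3694 ≈ 1.482e9 for the Newbler row.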
  • Work in Progress
    • Develop a software module to replace Newbler (contig assembly from quasi-contigs)
    • Develop a software module to replace ABySS for scaffold assembly
    • Improve the quality of quasi-contigs assembly
    • Reduce RAM requirements
  • Questions?