20131001 lab meeting

2013 Oct. 01

part of slides for lab meeting

20131001 lab meeting: Presentation Transcript

  • Error correction for next generation sequencing. Wu Chihua (Gigi), Matsuyama Lab, M2, Bioinformatics Group. October 1st, 2013. (Slide footer date: Tuesday, November 5, 2013)
  • Agenda: Background; Existing research; Toy experiment; Future work; References
  • Background: why & what
  • DNA Sequencing: Angelina Jolie tested for one gene; what about the other 20,000?
  • From 1 gene to 20,000: full genome sequence
  • Genome: an organism's complete set of DNA
  • Chromosome = DNA + protein
  • Base pair (bp); bases: A, G, T, C. Chromosome = DNA + protein
  • Chromosome; Gene: 20,000+
  • Human genome: 3 billion bp. Human DNA: 50 ~ 250 Mbp. Human gene: average 3,000 bp; largest 2,400,000 bp.
  • Next Generation Sequencing: high throughput & cheaper; short reads output
  • Figure source: Elaine R. Mardis. A decade’s perspective on DNA sequencing technology. Figure 1.
  • Source: Wikipedia. http://en.wikipedia.org/wiki/DNA_sequencing#cite_note-quail2012-37
  • Error Correction: highly accurate sequenced reads will likely lead to higher quality results.
  • Existing Research
  • Possible directions: To handle large genomes and larger datasets. To handle insertion and deletion errors. To correct hybrid datasets from multiple next generation platforms. To develop error correction methods for datasets in population studies.
  • Toy experiment
  • Pipeline: take the short reads; find similar pairs of reads with SlideSort; vote on each position using the paired reads; decide the new base; correct the erroneous bases.
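The voting steps on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: it assumes equal-length reads and substitution errors only, uses a brute-force Hamming comparison as a stand-in for SlideSort, and the function names (`similar_pairs`, `correct_reads`) and the strict-majority rule are hypothetical choices.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(reads: list[str], d: int) -> list[tuple[int, int]]:
    """Brute-force stand-in for SlideSort: index pairs within distance d."""
    return [(i, j) for i, j in combinations(range(len(reads)), 2)
            if len(reads[i]) == len(reads[j])
            and hamming(reads[i], reads[j]) <= d]


def correct_reads(reads: list[str], d: int) -> list[str]:
    """Majority vote per position over each read and its similar partners."""
    neighbours = {i: [i] for i in range(len(reads))}  # a read votes for itself
    for i, j in similar_pairs(reads, d):
        neighbours[i].append(j)
        neighbours[j].append(i)

    corrected = []
    for i, read in enumerate(reads):
        new_bases = []
        for pos, base in enumerate(read):
            votes = Counter(reads[k][pos] for k in neighbours[i])
            winner, count = votes.most_common(1)[0]
            # Replace the base only when another base holds a strict majority.
            new_bases.append(winner if 2 * count > len(neighbours[i]) else base)
        corrected.append("".join(new_bases))
    return corrected
```

For example, `correct_reads(["ATGCATA", "ATGCATA", "ATGCATA", "ATGCTTA"], d=1)` rewrites the fourth read to match the other three, while a read with no similar partner is left unchanged.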
  • SlideSort: all pairs similarity search (APSS) for sequence datasets. APSS: find all similar pairs in a dataset. Performance of SlideSort: 10 minutes for 10 million reads; 2~3 GB for 10 million reads. Complexity of SlideSort: time O(N+α), where equivalence classes are found in O(N) and α is the number of neighbor pairs.
  • SlideSort. Input: a set of short reads and a distance threshold d. Output: alignments and distances of all similar pairs. Example input: ATGCATA, ATTCATT, ATGCTCA, ATGCCCA, AAGTCGG, ATGTATT, AAGGTCG, ATGCTTA. Example output: ATGCATA / ATGCTTA (ed = 1); ATGCATA / ATGCTCA (ed = 2); AAG-TCGG / AAGGTCG- (ed = 2).
  • Naive approach: O(N^2). Example reads: ACGC……., ATGC……., AAGT……. How to reduce computational cost? (*Animation by Prof. Shimizu)
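To make the quadratic cost concrete, here is a small sketch of the naive approach: every pair of reads is compared with a standard dynamic-programming edit distance. The function names and the choice d = 2 are assumptions for illustration; the eight reads are the example input from the SlideSort slide.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def naive_apss(reads: list[str], d: int) -> list[tuple[str, str, int]]:
    """All-pairs search by brute force: N * (N - 1) / 2 comparisons."""
    hits = []
    for a, b in combinations(reads, 2):
        dist = edit_distance(a, b)
        if dist <= d:
            hits.append((a, b, dist))
    return hits


EXAMPLE_READS = ["ATGCATA", "ATTCATT", "ATGCTCA", "ATGCCCA",
                 "AAGTCGG", "ATGTATT", "AAGGTCG", "ATGCTTA"]

for a, b, dist in naive_apss(EXAMPLE_READS, d=2):
    print(a, b, dist)
```

With 8 reads this is only 28 comparisons, but the count grows quadratically, which is impractical at the 10-million-read scale quoted for SlideSort. The brute-force run with d = 2 returns the three pairs shown on the slide along with other pairs within the threshold.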
  • Basic strategy: 1. Filtering stage: find subsets sharing common substring(s). 2. Pair-wise comparison stage: compare all pairs within each subset. Example reads: ATGC……., AAGT……. (*Animation by Prof. Shimizu)
  • SlideSort: S1 and S2 are each decomposed into m blocks. If the edit distance between S1 and S2 is at most d, then at least (m-d) blocks are common to S1 and S2, at similar positions.
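A short argument for the bound on this slide, added as a sketch. The block notation $B_i$ is an assumption introduced here for illustration and does not appear on the slide.

```latex
S_1 = B_1 B_2 \cdots B_m, \qquad
\bigl|\{\, i : B_i \text{ contains an edited position} \,\}\bigr| \le d
\;\Longrightarrow\;
\bigl|\{\, i : B_i \text{ appears unchanged in } S_2 \,\}\bigr| \ge m - d .
```

Each edit operation touches at most one block, so d edits can alter at most d of the m blocks. With substitutions only, the unchanged blocks stay at exactly the same positions. With insertions and deletions they can shift by up to d positions, which is presumably why the slide says "at similar position".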
  • SlideSort. First step: quickly find a subset of short reads that share (m-d) common blocks (k-mers). Second step: calculate the edit distance between all pairs in the subset (equivalence class), then output the pairs whose edit distance is at most d, together with their alignments and scores. Figure: among reads S1 to S6, the reads S1, S2 and S5 share ATGC……. and form an equivalence class.
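The two steps can be illustrated with a simplified filter-and-verify sketch. It does not reproduce SlideSort's actual implementation: it assumes equal-length reads and substitution errors only (Hamming distance, so common blocks sit at identical positions), and the names `split_blocks` and `filter_and_verify` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_blocks(read: str, m: int) -> list[str]:
    """Cut a read into m contiguous blocks of near-equal length."""
    bounds = [len(read) * i // m for i in range(m + 1)]
    return [read[bounds[i]:bounds[i + 1]] for i in range(m)]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def filter_and_verify(reads: list[str], d: int, m: int) -> list[tuple[int, int, int]]:
    """Return (i, j, distance) for read pairs within Hamming distance d."""
    if m <= d:
        raise ValueError("need more blocks than allowed mismatches (m > d)")

    # Step 1 (filtering): reads that agree on the same (m - d) block positions
    # land in the same bucket, which plays the role of an equivalence class.
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        blocks = split_blocks(read, m)
        for chosen in combinations(range(m), m - d):
            key = (len(read), chosen, tuple(blocks[b] for b in chosen))
            buckets[key].append(idx)

    # Step 2 (pair-wise comparison): verify only pairs that share a bucket.
    seen = set()
    hits = []
    for members in buckets.values():
        for i, j in combinations(members, 2):  # members are in index order
            if (i, j) in seen:
                continue
            seen.add((i, j))
            dist = hamming(reads[i], reads[j])
            if dist <= d:
                hits.append((i, j, dist))
    return sorted(hits)
```

For instance, `filter_and_verify(["ATGCATAC", "ATGCTTAC", "ATGCTCAC", "GGGGGGGG"], d=1, m=4)` returns `[(0, 1, 1), (1, 2, 1)]`, and the unrelated fourth read is never compared against the others because it shares no bucket with them. The filter cannot miss a true pair: d mismatches touch at most d blocks, so at least one choice of (m - d) block positions agrees in both reads.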
  • Toy Experiment. Data: test.fasta. Simulator: Stampy (an open-source tool that can simulate short-read errors). Number of sequences: 5. Max_seq_length: 51. Min_seq_length: 51.
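A data set of this shape (5 reads, each 51 bases long) can be mocked up without Stampy. The sketch below is a hypothetical stand-in, not Stampy's interface or error model: it samples reads from a random reference and injects uniform substitution errors. The reference length, error rate, seeds and function names are all assumptions.

```python
import random


def simulate_reads(reference: str, n_reads: int, read_len: int,
                   error_rate: float, seed: int = 0) -> list[str]:
    """Sample reads from a reference and inject random substitution errors."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        for pos, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                bases[pos] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


def to_fasta(reads: list[str]) -> str:
    """Format reads as FASTA records named seq0, seq1, ..."""
    return "".join(f">seq{i}\n{read}\n" for i, read in enumerate(reads))


ref_rng = random.Random(1)
reference = "".join(ref_rng.choice("ACGT") for _ in range(200))
toy_reads = simulate_reads(reference, n_reads=5, read_len=51, error_rate=0.02)
print(to_fasta(toy_reads), end="")
```

Because the true reference is known in a simulation like this, corrected reads can be compared against it directly, which is not possible with reads from real experiments.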
  • Toy Experiment (results table): seq 0 1 ◉ 1 2 1 1 ✖ 4 1 △ 3 1
  • Discussion: It is unclear whether the test data generated by Stampy is suitable. The data set is far too small.
  • Future work: Use a proper, bigger dataset. Select data sets from real experiments in online databases instead of simulations. Try a Bayesian model.
  • References: Kana Shimizu, Koji Tsuda. SlideSort: all pairs similarity search for short reads. Bioinformatics (2011) 27 (4): 464-470. • Elaine R. Mardis. A decade’s perspective on DNA sequencing technology. • Next Generation Sequencing (NGS) Market [Platforms (Illumina HiSeq, MiSeq, Life Technologies Ion Proton/PGM, 454 Roche), Bioinformatics (RNA-Seq, ChIP-Seq), (Pyrosequencing, SBS, SMRT), (Diagnostics, Personalized Medicine)] - Global Forecast to 2017. • Michael L. Metzker. Sequencing technologies — the next generation. • Xiao Yang, Sriram P. Chockalingam, Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics (2013) 14 (1): 56-66.