# Algorithm of NGS Data

Published in: Technology, Education

### Algorithm of NGS Data

1. Speaker: Eric C.Y., LEE. Advisor: I-Fang Chung. Monday, March 21, 2011.
2. Outline • Motivation • Workflow • Result • Conclusion • My Comment
3. Motivation • High-throughput sequencing technologies now play an important role in the life sciences. • Different high-throughput sequencing technologies are competing to sequence an individual human genome for less than \$1,000 within a few years. (Science, Vol. 311, March 17, 2006)
4. Motivation • The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing, and sharing the data.
5. Workflow • Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ... • Analysis datasets: Dataset1, Dataset2, Dataset3, ... • Preliminary result: for location, for mismatch, ...
6. Coding Strategy • "Optimal encoding of these integers from a compression standpoint depends on their distribution in order to assign shorter binary codes to more probable symbols." (Shannon's entropy coding theory; Claude Shannon, 1916-2001)
7. Encoding Strategies • Fixed codes: Golomb-Rice codes, Elias Gamma codes, Monotone Value codes • Variable codes: Huffman code
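8. Golomb-Rice Codes: set m = 10 and encode n = 42.

    Encoding of the quotient part (unary):

    | q | Output bits |
    |---|---|
    | 0 | 0 |
    | 1 | 10 |
    | 2 | 110 |
    | 3 | 1110 |
    | 4 | 11110 |
    | 5 | 111110 |
    | 6 | 1111110 |
    | N | N repetitions of 1, followed by 0 |

    Encoding of the remainder part (for r ≥ 6 the binary column holds r + 6, so the 4-bit codes cannot be confused with the 3-bit ones):

    | r | Binary | Output bits |
    |---|---|---|
    | 0 | 0000 | 000 |
    | 1 | 0001 | 001 |
    | 2 | 0010 | 010 |
    | 3 | 0011 | 011 |
    | 4 | 0100 | 100 |
    | 5 | 0101 | 101 |
    | 6 | 1100 | 1100 |
    | 7 | 1101 | 1101 |
    | 8 | 1110 | 1110 |
    | 9 | 1111 | 1111 |

    Example: n = 42 and m = 10 give q = 4 and r = 2, so the output is 11110 followed by 010, that is, 11110010.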
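
The quotient and remainder tables on this slide can be reproduced with a few lines of Python. This is a minimal sketch rather than the presenter's implementation; the function name `golomb_encode` is an assumption made for illustration. Strictly speaking, Rice codes restrict m to a power of two, so m = 10 is the general Golomb case, which needs the truncated-binary remainder shown in the table.

```python
def golomb_encode(n: int, m: int) -> str:
    """Golomb code of a non-negative integer n with divisor m >= 2.

    The quotient is written in unary (q ones, then a zero) and the
    remainder in truncated binary, as in the slide's two tables.
    """
    if n < 0 or m < 2:
        raise ValueError("need n >= 0 and m >= 2")
    q, r = divmod(n, m)
    unary = "1" * q + "0"
    k = (m - 1).bit_length()   # bits needed for the largest remainder
    cutoff = (1 << k) - m      # remainders below this use only k - 1 bits
    if r < cutoff:
        remainder = format(r, f"0{k - 1}b")
    else:
        remainder = format(r + cutoff, f"0{k}b")
    return unary + remainder


print(golomb_encode(42, 10))  # slide example: q = 4, r = 2
```

With m = 10 the cutoff is 16 - 10 = 6, which is why remainders 6 to 9 appear in the table as the 4-bit values 12 to 15.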
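9. Elias Gamma Codes

    | Number | 2^n + remainder | Output |
    |---|---|---|
    | 1 | 2^0 + 0 | 1 |
    | 2 | 2^1 + 0 | 010 |
    | 3 | 2^1 + 1 | 011 |
    | 4 | 2^2 + 0 | 00100 |
    | 5 | 2^2 + 1 | 00101 |
    | 6 | 2^2 + 2 | 00110 |
    | 7 | 2^2 + 3 | 00111 |
    | 8 | 2^3 + 0 | 0001000 |
    | 9 | 2^3 + 1 | 0001001 |
    | 10 | 2^3 + 2 | 0001010 |
    | 11 | 2^3 + 3 | 0001011 |
    | 12 | 2^3 + 4 | 0001100 |
    | 13 | 2^3 + 5 | 0001101 |
    | 14 | 2^3 + 6 | 0001110 |
    | 15 | 2^3 + 7 | 0001111 |
    | 16 | 2^4 + 0 | 000010000 |
    | 17 | 2^4 + 1 | 000010001 |

    Example: 42 = 2^5 + 10, which is encoded as 00000101010.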
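
The table follows one rule: write the number in binary and prefix it with as many zeros as there are bits after the leading 1. A minimal Python sketch of that rule follows; the function names are illustrative assumptions, not taken from the slides.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer n."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1 only")
    binary = format(n, "b")                 # e.g. 42 -> '101010'
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of Elias gamma codewords."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":               # count the zero prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


print(elias_gamma_encode(42))  # slide example: 42 = 2^5 + 10
```

Because the zero prefix announces the length of each codeword, codewords can be concatenated without separators and still be decoded unambiguously.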
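10. MOV Coding: each code begins at the Elias Gamma code's most significant 1 bit, so the leading zeros are dropped.

    | Number | 2^n + remainder | Output |
    |---|---|---|
    | 1 | 2^0 + 0 | 1 |
    | 2 | 2^1 + 0 | 10 |
    | 3 | 2^1 + 1 | 11 |
    | 4 | 2^2 + 0 | 100 |
    | 5 | 2^2 + 1 | 101 |
    | 6 | 2^2 + 2 | 110 |
    | 7 | 2^2 + 3 | 111 |
    | 8 | 2^3 + 0 | 1000 |
    | 9 | 2^3 + 1 | 1001 |
    | 10 | 2^3 + 2 | 1010 |
    | 11 | 2^3 + 3 | 1011 |
    | 12 | 2^3 + 4 | 1100 |
    | 13 | 2^3 + 5 | 1101 |
    | 14 | 2^3 + 6 | 1110 |
    | 15 | 2^3 + 7 | 1111 |
    | 16 | 2^4 + 0 | 10000 |
    | 17 | 2^4 + 1 | 10001 |

    Decode example: in 10001 the leading 1 is followed by 4 bits, so the value is 2^4 + (0001)_2 = 17.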
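
The decode example on this slide can be sketched in Python as follows. The function names are assumptions for illustration, and the sketch assumes each codeword's length is already known to the decoder; the slide does not show how codeword boundaries are signaled in the actual format.

```python
def mov_encode(n: int) -> str:
    """Binary form of n starting at its most significant 1 bit."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return format(n, "b")


def mov_decode(code: str) -> int:
    """Decode one codeword whose length is known: 2^(len - 1) + remaining bits."""
    tail = code[1:]
    return (1 << (len(code) - 1)) + (int(tail, 2) if tail else 0)


print(mov_decode("10001"))  # leading 1, then the 4 bits 0001
```

Compared with Elias gamma, each codeword is shorter by its zero prefix, which is the saving the slide points at.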
11. Huffman Codes • Example text: "this is an example of a huffman tree"
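
A Huffman code for the slide's example sentence can be built with a priority queue. The following is a minimal sketch using Python's standard `heapq` and `collections.Counter`; the helper name `huffman_code` and the tie-breaking order are assumptions, so the individual codewords may differ from the tree drawn on the slide even though the total encoded length is the same.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict[str, str]:
    """Return a prefix-free code {symbol: bit string} for a non-empty text."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))
```

Frequent symbols such as the space receive short codewords and rare ones such as "x" receive long ones. The code table (or tree) has to be stored next to the data, since the decoder cannot rebuild it from the bit stream alone.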
12. Workflow • Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ... • Analysis datasets: Dataset1, Dataset2, Dataset3, ... • Preliminary result: for location, for mismatch, ...
13. Dataset1 • Retrotransposon Ty3 insertion sites in the yeast genome. • 6,439,584 reads of 19 bp. • Highly clustered. • High degree of repetition. • At most two substitutions. • Pie chart labels: 0: 54%, 1: 14%, 2: 32%.
14. Dataset2 • In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans. • Mapped to hg18. • 1,697,990 reads of 25 bp. • At most two substitutions. • Pie chart labels: 0: 76%; labels 1 and 2 account for 18% and 6% (the pairing is not recoverable from the transcript).
15. Dataset2: Nucleotide Substitutions
16. Dataset3 • Corresponds to a full diploid human genome sequencing experiment for an Asian individual. • Large dataset; only the reads mapped to chr22 are used. • 31,118,531 reads of 30 to 40 bp. • Pie chart labels: 0: 61%, 1: 20%, 2: 19%.
17. Workflow • Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ... • Analysis datasets: Dataset1, Dataset2, Dataset3, ... • Preliminary result: for location, for mismatch, ...
18. Alignment Result Example (Bowtie output fields) • Name of read that aligned • Strand • Name of reference sequence where the alignment occurs • 0-based offset into the forward reference strand • Read sequence • Read quality • Value of ceiling • Mismatch descriptors
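
The fields listed on this slide correspond to Bowtie's default tab-separated output, where a mismatch descriptor has the form `offset:reference-base>read-base`. The following Python sketch parses one such line. The names `Alignment`, `Mismatch`, and `parse_bowtie_line` are assumptions for illustration, and the sample line is invented, not taken from the datasets in the talk.

```python
from typing import NamedTuple


class Mismatch(NamedTuple):
    offset: int        # 0-based offset within the read
    ref_base: str
    read_base: str


class Alignment(NamedTuple):
    read_name: str
    strand: str
    ref_name: str
    offset: int        # 0-based offset into the forward reference strand
    sequence: str
    qualities: str
    ceiling: int
    mismatches: list


def parse_bowtie_line(line: str) -> Alignment:
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))      # the mismatch column may be absent
    name, strand, ref, offset, seq, qual, ceiling, mm = fields[:8]
    mismatches = []
    for descriptor in filter(None, mm.split(",")):
        pos, bases = descriptor.split(":")
        ref_base, read_base = bases.split(">")
        mismatches.append(Mismatch(int(pos), ref_base, read_base))
    return Alignment(name, strand, ref, int(offset), seq, qual,
                     int(ceiling), mismatches)


# Hypothetical 25-bp read with one mismatch at read offset 22.
sample = "\t".join(["read_1", "+", "chr22", "14430018",
                    "ACGT" * 6 + "A", "I" * 25, "0", "22:G>A"])
aln = parse_bowtie_line(sample)
print(aln.ref_name, aln.offset, aln.mismatches)
```

Only the reference name, strand, offset, and mismatch descriptors are needed to rebuild a read against the reference, which is why the later slides concentrate on encoding those columns.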
19. Encoding Location Information • Standalone: encode each column independently. • Combined: compress the chromosome, strand, and mismatch columns together.
20. Apply the Algorithms • Elias Gamma (EG) Absolute • The sequences cannot be sorted. • Applied to Dataset3.
21. Apply the Algorithms • Elias Gamma Relative (REG) • The sequences can be sorted, and compression performance is much better. • The location addresses are sorted and encoded as relative offsets instead of absolute addresses.
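
The gain from relative addresses can be illustrated with a short Python sketch: sort the locations, then gamma-encode the gap to the previous location instead of the location itself. This is an illustration under stated assumptions, not the format used in the talk. The names `reg_encode` and `reg_decode` are hypothetical, the sample locations are made up, and each gap is shifted by one because Elias gamma cannot encode 0 (two reads at the same location have a gap of 0).

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def reg_encode(locations: list[int]) -> str:
    """Sort non-negative locations and gamma-encode the gaps between them."""
    bits, previous = [], 0
    for location in sorted(locations):
        bits.append(elias_gamma(location - previous + 1))  # +1: gamma has no code for 0
        previous = location
    return "".join(bits)


def reg_decode(bits: str) -> list[int]:
    """Invert reg_encode, returning the sorted locations."""
    locations, previous, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2) - 1
        i += zeros + 1
        previous += gap
        locations.append(previous)
    return locations


# Hypothetical clustered read locations.
locations = [1000000, 1000004, 1000004, 1000019, 1000500]
relative_bits = reg_encode(locations)
absolute_bits = "".join(elias_gamma(loc + 1) for loc in locations)
print(len(relative_bits), "bits relative vs", len(absolute_bits), "bits absolute")
```

Only the first location pays the full cost; in clustered data the later gaps are small numbers with short gamma codes. The price is that the original read order is lost by sorting.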
22. Apply the Algorithms • Relative Elias Gamma Indexed (REG Indexed) • Sort the reads and create an index file. • Combine chromosome, strand, and mismatches, and compress them by relative location. • Cannot be applied to Dataset3.
23. Apply the Algorithms • Monotone Value (MOV) • Sort the sequences by chromosome and location. • Encode the absolute address.
24. Apply the Algorithms • Huffman codes • Focused on the "relative" start position. • This algorithm has to store the Huffman tree for decompression.
25. Comments on Encoding Location • REG suits all three datasets. • For Dataset1, use the unique chromosome locations and count their frequencies for coding. REG is an ideal solution for highly repetitive datasets. • Huffman coding is not good for Dataset1.
26. Encoding Mismatch Information • Each read may contain 1 or 2 mismatches, each with a nucleotide value. • One line records the mismatch information of each read. If there is no mismatch, the line is left blank.
27. Mismatches of Dataset2 • If the mismatch is at position 23: from the start the offset is 22 (binary 10110); from the end it is 2 (binary 10). • Calculate the position from the end of the read.
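
The arithmetic behind this slide is small enough to check directly. The sketch below assumes that "position 23" is 1-based and that the read is 25 bp long, as stated for Dataset2; the function name `mismatch_offsets` is hypothetical.

```python
def mismatch_offsets(position: int, read_length: int) -> tuple[int, int]:
    """Return (offset from start, offset from end) for a 1-based position."""
    from_start = position - 1
    from_end = read_length - position
    return from_start, from_end


start, end = mismatch_offsets(23, 25)
print(format(start, "b"), format(end, "b"))  # binary forms of the two offsets
```

When mismatches cluster near the end of a read, counting from the end gives smaller numbers and therefore shorter codes.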
28. Nucleotide Substitution • Use numbers instead of characters. • ASCII: A: 65 (1000001), C: 67 (1000011), G: 71 (1000111), T: 84 (1010100). • 2-bit codes: A: 00, C: 01, G: 10, T: 11.
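
The 2-bit mapping on this slide can be sketched in Python as follows. The helper names `pack` and `unpack` are assumptions for illustration, and the sketch handles only the four bases A, C, G, and T.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}   # mapping from the slide
BASE = {value: base for base, value in CODE.items()}


def pack(sequence: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a nucleotide string of the given length from its packed form."""
    return "".join(BASE[(value >> (2 * (length - 1 - i))) & 0b11]
                   for i in range(length))


packed = pack("ACGT")
print(format(packed, "08b"), unpack(packed, 4))
```

Each base shrinks from the 7 significant bits of its ASCII code to 2 bits. The length has to be stored or known separately, because leading A bases pack to zero bits.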
29. Combining Location and Mismatch • Example mismatches: 19G, 30A, 34T. • Count the frequencies and code the location and mismatch together. • 19G: 00001010110 {11 bits}; 19G: 10110 {5 bits}.
30. Final Encoding • Dataset1: mismatches take up most of the space because the dataset is already sorted. • Dataset2: locations are sparse, so they take up most of the storage. • Dataset3: this dataset is balanced because it has full coverage of the genome.
31. Implementation • Based on REG Indexed for location information and combined encoding for mismatch information. • Pass 1: count the mismatches. • Pass 2: perform the actual encoding.
32. Result: Dataset1 (bar chart, size in bytes)

    | Method | Size (bytes) |
    |---|---|
    | Original | 1,030,333,440 |
    | Best Compression | 56,078,940 |
    | GenCompress | 56,166,419 |
    | gzip | 41,378,624 |
    | bzip2 | 42,233,336 |
    | 7zip | 30,651,664 |
33. Result: Dataset2 (bar chart, size in bytes)

    | Method | Size (bytes) |
    |---|---|
    | Original | 353,181,920 |
    | Best Compression | 35,983,322 |
    | GenCompress | 36,099,480 |
    | gzip | 95,688,992 |
    | bzip2 | 94,030,320 |
    | 7zip | 83,319,584 |
34. Result: Dataset3 (bar chart, size in bytes)

    | Method | Size (bytes) |
    |---|---|
    | Original | 8,869,613,392 |
    | Best Compression | 390,541,330 |
    | GenCompress | 390,541,330 |
    | gzip | 618,818,824 |
    | bzip2 | 955,061,616 |
    | 7zip | 411,811,520 |
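
The three result slides are easier to compare as compression ratios (original size divided by compressed size). The Python sketch below only rearranges the byte counts quoted on the slides; the dictionary layout and the name `compression_ratio` are assumptions for illustration.

```python
# Byte counts as quoted on the three result slides.
SIZES = {
    "Dataset1": {"original": 1_030_333_440, "GenCompress": 56_166_419,
                 "gzip": 41_378_624, "bzip2": 42_233_336, "7zip": 30_651_664},
    "Dataset2": {"original": 353_181_920, "GenCompress": 36_099_480,
                 "gzip": 95_688_992, "bzip2": 94_030_320, "7zip": 83_319_584},
    "Dataset3": {"original": 8_869_613_392, "GenCompress": 390_541_330,
                 "gzip": 618_818_824, "bzip2": 955_061_616, "7zip": 411_811_520},
}


def compression_ratio(original: int, compressed: int) -> float:
    """How many times smaller the compressed file is."""
    return original / compressed


ratios = {
    dataset: {tool: round(compression_ratio(sizes["original"], size), 2)
              for tool, size in sizes.items() if tool != "original"}
    for dataset, sizes in SIZES.items()
}

for dataset, by_tool in ratios.items():
    print(dataset, by_tool)
```

On these figures GenCompress produces the smallest output for Dataset2 and Dataset3, while the general-purpose tools produce smaller files for Dataset1.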
35. Conclusion • Any genome sequence can be used for mapping the reads. • In terms of time consumed, GenCompress is worth using.
36. Compression Time (bar chart, seconds; GenCompress, gzip, bzip2, and 7zip on Dataset1 to Dataset3) • Bar values as extracted around Dataset1: 20, 10, 78, 107 • Around Dataset2: 5, 13, 20, 77 • Around Dataset3: 111, 70, 422, 447
37. Decompression Time (bar chart, seconds; GenCompress, gzip, bzip2, and 7zip on Dataset1 to Dataset3) • Bar values as extracted around Dataset1: 2, 2, 7, 4 • Around Dataset2: 1, 1, 4, 2 • Around Dataset3: 15, 13, 53, 21
38. Conclusion • Hard drives are not expensive; the real cost is bandwidth. • The method does not consider quality scores. • Read identifiers are also important. • Mismatches may be contaminants or de novo, or the reference sequence may be unfinished. • Only the best match is considered.
39. Conclusion • Huffman tree in Dataset1 and Dataset2.
40. My Comments • They should release the code as open source. • Hardware configuration: why RAID1?
41. Thanks for your attention!