Algorithm of NGS Data


  1. Speaker: Eric C.Y., LEE
     Advisor: I-Fang Chung
     Monday, March 21, 2011
  2. Outline
     • Motivation
     • Workflow
     • Result
     • Conclusion
     • My Comments
  3. Motivation
     • High-throughput sequencing (HTS) technology now plays an important role in the life sciences.
     • Different HTS technologies are competing to sequence an individual human genome for less than $1,000 within a few years. (Science, Vol. 311, 17 March 2006)
  4. Motivation
     • The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing, and sharing the data.
  5. Workflow
     • Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman
     • Analyze datasets: Dataset1, Dataset2, Dataset3, ...
     • Preliminary results: for location, for mismatch, ...
  6. Coding Strategy
     Optimal encoding of these integers from a compression standpoint depends on their distribution in order to assign shorter binary codes to more probable symbols.
     (Shannon's entropy coding theory; Claude Shannon, 1916-2001)
  7. Encoding Strategies
     • Fixed Codes
       • Golomb-Rice Codes
       • Elias Gamma Codes
       • Monotone Value Codes
     • Variable Codes
       • Huffman Codes
  8. Golomb-Rice Codes
     Set m = 10 and encode n = 42.
     Encoding of quotient part:
       q    output bits
       0    0
       1    10
       2    110
       3    1110
       4    11110
       5    111110
       6    1111110
       ..   ..
       N    <N repetitions of 1>0
     Encoding of remainder part:
       r    binary    output bits
       0    0000      000
       1    0001      001
       2    0010      010
       3    0011      011
       4    0100      100
       5    0101      101
       6    1100      1100
       7    1101      1101
       8    1110      1110
       9    1111      1111
     n = 42, n/m gives q = 4, r = 2; output is 11110 followed by 010, that is 11110010.
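The m = 10, n = 42 example on the Golomb-Rice slide can be reproduced with a short Python sketch. It assumes the remainder is written with truncated binary coding, which matches the remainder table on the slide. The function names `unary`, `truncated_binary`, and `golomb_encode` are illustrative and are not taken from GenCompress.

```python
def unary(q: int) -> str:
    """Quotient part: q ones followed by a terminating zero."""
    return "1" * q + "0"


def truncated_binary(r: int, m: int) -> str:
    """Remainder part for divisor m >= 2 (truncated binary coding)."""
    b = (m - 1).bit_length()      # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m         # number of remainders that get the short code
    if r < cutoff:
        return format(r, f"0{b - 1}b")
    return format(r + cutoff, f"0{b}b")


def golomb_encode(n: int, m: int) -> str:
    """Golomb code of a non-negative integer n with divisor m >= 2."""
    if n < 0 or m < 2:
        raise ValueError("expected n >= 0 and m >= 2")
    q, r = divmod(n, m)
    return unary(q) + truncated_binary(r, m)


print(golomb_encode(42, 10))  # quotient 4, remainder 2
```

When m is a power of two, `cutoff` is zero and every remainder uses the same number of bits, which is the Rice special case.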
  9. Elias Gamma Codes
     number   2^n      output
     1        2^0+0    1
     2        2^1+0    010
     3        2^1+1    011
     4        2^2+0    00100
     5        2^2+1    00101
     6        2^2+2    00110
     7        2^2+3    00111
     8        2^3+0    0001000
     9        2^3+1    0001001
     10       2^3+2    0001010
     11       2^3+3    0001011
     12       2^3+4    0001100
     13       2^3+5    0001101
     14       2^3+6    0001110
     15       2^3+7    0001111
     16       2^4+0    000010000
     17       2^4+1    000010001
     Example: 42 = 2^5+10, output 00000101010
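The Elias Gamma table can be generated with a few lines of Python. This is a minimal sketch of the standard code (n written in binary, preceded by one zero for every bit after the leading 1). The function names are illustrative.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer n."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = format(n, "b")                  # starts with the significant 1-bit
    return "0" * (len(binary) - 1) + binary  # zero prefix gives the length


def elias_gamma_decode(bits: str) -> int:
    """Decode a single Elias gamma codeword."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)


print(elias_gamma_encode(42))
```

The codeword for n uses 2*floor(log2 n) + 1 bits, so small integers are cheap and large ones are expensive.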
  10. MOV Coding
      Each code begins at the significant 1-bit of the Elias Gamma code.
      number   2^n      output
      1        2^0+0    1
      2        2^1+0    10
      3        2^1+1    11
      4        2^2+0    100
      5        2^2+1    101
      6        2^2+2    110
      7        2^2+3    111
      8        2^3+0    1000
      9        2^3+1    1001
      10       2^3+2    1010
      11       2^3+3    1011
      12       2^3+4    1100
      13       2^3+5    1101
      14       2^3+6    1110
      15       2^3+7    1111
      16       2^4+0    10000
      17       2^4+1    10001
      Decode: 10001 {4 bit} gives 2^4 + (0001)_2 = 17
  11. Huffman Codes
      Example text: "this is an example of a huffman tree"
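A Huffman code for the example sentence on the slide can be built with a priority queue. The sketch below is a generic textbook construction, not the GenCompress implementation, and the name `huffman_code` is illustrative. The individual codewords depend on how ties are broken, but the total encoded length does not.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict:
    """Return a {symbol: bit string} prefix code built from symbol frequencies."""
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, partial code table).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # two lightest subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
table = huffman_code(text)
encoded_bits = sum(len(table[ch]) for ch in text)
print(encoded_bits, "bits versus", 8 * len(text), "bits at 8 bits per character")
```

Frequent symbols such as the space receive short codewords and rare ones such as "x" receive long ones. The decoder needs the same table, which is why the tree has to be stored with the compressed data.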
  12. Workflow
      • Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman
      • Analyze datasets: Dataset1, Dataset2, Dataset3, ...
      • Preliminary results: for location, for mismatch, ...
  13. Dataset1
      • Retrotransposon Ty3 insertion sites in the yeast genome.
      • 6,439,584 reads of 19 bp.
      • Highly clustered.
      • High degree of repetition.
      • At most two substitutions.
      • Mismatches per read (pie chart): 0: 54%, 1: 14%, 2: 32%.
  14. Dataset2
      • In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.
      • Mapped to hg18.
      • 1,697,990 reads of 25 bp.
      • At most two substitutions.
      • Mismatches per read (pie chart): 0: 76%; slices labeled 1 and 2 with 6% and 18%.
  15. Dataset2 Nucleotide Substitutions
  16. Dataset3
      • Corresponds to a full diploid human genome sequencing experiment for an Asian individual.
      • Large dataset. Only mapped to chr22.
      • 31,118,531 reads of 30 to 40 bp.
      • Mismatches per read (pie chart): 0: 61%, 1: 20%, 2: 19%.
  17. Workflow
      • Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman
      • Analyze datasets: Dataset1, Dataset2, Dataset3, ...
      • Preliminary results: for location, for mismatch, ...
  18. Alignment Result Example (Bowtie)
      • Name of read that aligned
      • Strand
      • Name of reference sequence where the alignment occurs
      • 0-based offset into the forward reference strand
      • Read sequence
      • Read quality
      • Value of ceiling
      • Mismatch descriptors
  19. Encoding Location Information
      • Standalone: each column is encoded independently.
      • Combined: the chromosome, strand, and mismatch columns are combined and compressed together.
  20. Apply the Algorithms
      • Elias Gamma (EG) Absolute
        • The reads cannot be sorted.
        • Applied to Dataset3.
  21. Apply the Algorithms
      • Elias Gamma Relative (REG)
        • The reads can be sorted, which gives much better compression.
        • After sorting, each location address is stored relative to the previous one instead of as an absolute value.
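The gain from relative addresses can be illustrated with a small Python sketch: sort the locations, take the gap to the previous location, and Elias-gamma-encode each gap. Storing `gap + 1` so that repeated locations (gap 0) remain encodable is an assumption made here for illustration; the slides do not say how GenCompress handles a zero gap. The example positions are made up.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer n."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def relative_gamma_encode(positions) -> str:
    """Sort non-negative positions and gamma-encode the gaps between them."""
    bits, previous = [], 0
    for pos in sorted(positions):
        bits.append(elias_gamma_encode(pos - previous + 1))  # +1: assumption
        previous = pos
    return "".join(bits)


def relative_gamma_decode(bits: str) -> list:
    """Inverse of relative_gamma_encode."""
    positions, previous, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i + zeros] == "0":
            zeros += 1
        value = int(bits[i + zeros:i + 2 * zeros + 1], 2)
        i += 2 * zeros + 1
        previous += value - 1
        positions.append(previous)
    return positions


positions = [1010, 1003, 1000, 1003]          # hypothetical read start offsets
relative = relative_gamma_encode(positions)
absolute = "".join(elias_gamma_encode(p + 1) for p in sorted(positions))
print(len(relative), "bits relative versus", len(absolute), "bits absolute")
```

Clustered or repeated locations produce small gaps, and Elias gamma spends few bits on small integers, which is consistent with the slide's remark that the relative variant compresses much better once the reads are sorted.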
  22. Apply the Algorithms
      • Relative Elias Gamma Indexed (REG Indexed)
        • Sorts the reads and creates an index file.
        • Combines chromosome, strand, and mismatches, and compresses them by relative location.
        • Cannot be applied to Dataset3.
  23. Apply the Algorithms
      • Monotone Value (MOV)
        • Sorts the reads by chromosome and location.
        • Encodes the absolute address.
  24. Apply the Algorithms
      • Huffman codes
        • Focuses on the "relative" start position.
        • The Huffman tree has to be stored for decompression.
  25. Comments on encoding location
      • REG suits all three datasets.
      • For Dataset1, the unique chromosome locations are used and their frequencies are counted for coding. REG is an ideal solution for a highly repetitive dataset.
      • Huffman coding does not work well for Dataset1.
  26. Encoding Mismatch Information
      • Each read may contain 1 or 2 mismatches, each with a nucleotide value.
      • One line records the mismatch information for a read. If there is no mismatch, the line is left blank.
  27. Mismatches of Dataset2
      • If the mismatch is at position 23:
        • From the start it is 22: 10110
        • From the end it is 2: 10
      • Calculate the position from the end of the read.
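The arithmetic on the Dataset2 mismatch slide can be checked with a short Python sketch. It assumes the mismatch position 23 is 1-based and the read length is 25 bp (the Dataset2 read length), which reproduces the slide's offsets 22 and 2. The function name `mismatch_offsets` is illustrative.

```python
def mismatch_offsets(position: int, read_length: int):
    """Return (offset from start, offset from end) for a 1-based position."""
    from_start = position - 1
    from_end = read_length - position
    return from_start, from_end


from_start, from_end = mismatch_offsets(23, 25)
start_bits = format(from_start, "b")   # plain binary, no length prefix
end_bits = format(from_end, "b")
print(from_start, start_bits, from_end, end_bits)
```

Counting from the end pays off only when mismatches tend to fall near the end of the read, because then the offsets are small numbers that need few bits.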
  28. Nucleotide Substitution
      • Use numbers instead of characters.
        base   ASCII   7-bit ASCII   2-bit code
        A      65      1000001       00
        C      67      1000011       01
        G      71      1000111       10
        T      84      1010100       11
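The saving from the 2-bit nucleotide code can be seen in a few lines of Python. The mapping is the one on the slide; the helper names `pack_2bit` and `ascii_7bit` are illustrative.

```python
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}


def pack_2bit(seq: str) -> str:
    """Encode a nucleotide string with 2 bits per base."""
    return "".join(TWO_BIT[base] for base in seq)


def ascii_7bit(seq: str) -> str:
    """Encode the same string with 7-bit ASCII, for comparison."""
    return "".join(format(ord(base), "07b") for base in seq)


seq = "ACGT"
print(pack_2bit(seq), len(pack_2bit(seq)), "bits")
print(ascii_7bit(seq), len(ascii_7bit(seq)), "bits")
```

A lookup with `TWO_BIT[base]` raises `KeyError` for any other symbol such as N, so a real encoder would need an escape mechanism for ambiguous bases.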
  29. Combining Location and Mismatch
      • Example mismatch descriptors: 19G, 30A, 34T
      • Count the frequencies, then code the location and mismatch together.
      • 19G: 00001010110 {11 bit}
      • 19G: 10110 {5 bit}
  30. Final Encoding
      • Dataset1: mismatches take up most of the space, because the data are already sorted.
      • Dataset2: locations are sparse, so they take up much of the storage.
      • Dataset3: this dataset is balanced, because it has full coverage of the genome.
  31. Implementation
      • Based on REG Indexed for location information and combined encoding for mismatch information.
      • Pass 1: count the mismatches.
      • Pass 2: perform the actual encoding.
  32. Result: Dataset1 (bytes)
      Original            1,030,333,440
      Best Compression       56,078,940
      GenCompress            56,166,419
      gzip                   41,378,624
      bzip2                  42,233,336
      7zip                   30,651,664
  33. Result: Dataset2 (bytes)
      Original              353,181,920
      Best Compression       35,983,322
      GenCompress            36,099,480
      gzip                   95,688,992
      bzip2                  94,030,320
      7zip                   83,319,584
  34. Result: Dataset3 (bytes)
      Original            8,869,613,392
      Best Compression      390,541,330
      GenCompress           390,541,330
      gzip                  618,818,824
      bzip2                 955,061,616
      7zip                  411,811,520
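The byte counts on the three Result slides are easier to compare as compression ratios (original size divided by compressed size). The sketch below uses the Original and GenCompress figures exactly as transcribed from the slides; the dictionary and function names are illustrative.

```python
# (original bytes, GenCompress bytes) as transcribed from the Result slides
SIZES = {
    "Dataset1": (1_030_333_440, 56_166_419),
    "Dataset2": (353_181_920, 36_099_480),
    "Dataset3": (8_869_613_392, 390_541_330),
}


def compression_ratio(original: int, compressed: int) -> float:
    """How many times smaller the compressed file is than the original."""
    return original / compressed


ratios = {name: compression_ratio(*pair) for name, pair in SIZES.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The same function can be applied to the gzip, bzip2, and 7zip rows to compare the tools per dataset.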
  35. Conclusion
      • Any genome sequence can be used as the reference for mapping the reads.
      • In terms of time consumed, GenCompress is worth using.
  36. Compression Time
      Bar chart (sec, axis 0 to 500) comparing GenCompress, gzip, bzip2, and 7zip on Dataset1, Dataset2, and Dataset3.
      Bar labels in extraction order: 20, 10, 78, 107, 5, 13, 20, 77, 111, 70, 422, 447.
  37. Decompression Time
      Bar chart (sec, axis 0 to 60) comparing GenCompress, gzip, bzip2, and 7zip on Dataset1, Dataset2, and Dataset3.
      Bar labels in extraction order: 2, 2, 7, 4, 1, 1, 4, 2, 15, 13, 53, 21.
  38. Conclusion
      • Hard drives are not expensive; the real cost is bandwidth.
      • Quality scores are not considered.
      • Read identifiers are also important.
      • Mismatches may be contaminants or de novo, or the reference sequence may be unfinished.
      • Only the best match is considered.
  39. Conclusion
      • Huffman tree in Dataset1 and Dataset2.
  40. My Comments
      • They should release the source code.
      • Hardware configuration: why RAID 1?
  41. Thanks for your attention!