W-curve Commercial Applications


Published on

A short description of the W-curve and its application to aligning genomic sequences.

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

W-curve Commercial Applications

  1. 1. Tools for the Next Generation Genomics has become Big Data. The tools for viewing, analyzing it have not: Utilities still perform single-pass alignments. Viewers still use characters. Alignment tools are still recursive. This limits research and clinical analytics.
  2. 2. Sequence alignment is the core Searching, comparing, aligning sequences. Smith-Waterman algorithm is the standard. Single pass: cannot fine-tune alignments. Recursive: not suited to map-reduce / clouds. Whole-gene: cannot align crossovers, CNV, reversals. Fragment indexing only works for stable sequences.
  3. 3. Another approach: W-curve Originally designed as a visualization technique. Convert DNA sequence to geometry. Visually compare relatively large sequences. Geometry has richer content for computational approaches: Parallel analysis. Fuzzy math for comparison.
  4. 4. W-curve geometry Each base is at the corner of a square. All curves begin at ( 0, 0, 0 ). Points are halfway to the next base's corner. Z-axis advances in steps of one.
  5. 5. Generating the next point from P to P'
  6. 6. Starting a curve: CG... From ( 0, 0, 0 ) Halfway to C ( 0.0, -0.5, 1 ) Halfway to G ( -0.5, -0.25, 2 ) First point is always 0.5 along an axis.
  7. 7. Visual analysis The best pattern recognition engine known: Our Brians. The W-curve's graphical pattern works well with people. For example: Looking at wild, drug resistant, sample HIV. Looking along the curve quickly shows differing sequences.
  8. 8. HIV-1 POL Fragment: Wild, 2 DR, Sample
  9. 9. Why it works Autoregression is a central feature of the W-curve. Curves converge quickly after a SNP or gap. But remain apart long enough to see the difference.
  10. 10. Autoregression: SNP
  11. 11. Autogregression: GAP
  12. 12. Automating comparison Geometry allows many approaches. Simple one is aligning local points. Use GIS: Store the library as points. Generate the sample as circles. Find all of the points within a circle. Cluster these to get the alignment.
  13. 13. Fuzzy Matching: Circles & Points.
  14. 14. Query your genes Find all library points “near” each sample point. Cluster them by z-offset between library, sample. These are local alignments. Local clusters be grouped to handle crossovers, gaps, insertions.
  15. 15. SQL is readable Three values using postgis: Library sequence. Library base number. Sample base offset. select b.seq_id, b.base_no, a.base_no – b.base_no as 'delta', from sample a, library b where st_contains ( a.vertex, b.vertex ) order by b.seq_id, b.delta, b.base_no ;
  16. 16. Crossover: Multiple library sequences are tiled onto the sample. Adjacent when sorted by sample base.
  17. 17. Example: crossover gp120 ... 1 6880 +93 1 6881 +93 1 6882 +93 ... 1 7665 +93 ... 2 6307 -82 2 6308 -82 2 6310 -82 2 6311 -82 ... 2 7054 -82 Two library sequences: 1, 2. Sort: ID, Offset, Base No.
  18. 18. Example: crossover gp120 1 6880 7665 +93 2 6307 7054 -82 Two sequences: 1, 2. Sort: ID, Offset, Base No. Collapse to fragments.
  19. 19. Example: crossover gp120 1 6973 7758 +93 2 6225 6972 -82 Two sequences: 1, 2. Sort: ID, Offset, Base No. Collapse to fragments. Add offset to get sample bases.
  20. 20. Example: crossover gp120 2 6225 6972 -82 1 6973 7758 +93 Two sequences: 1, 2. Sort: ID, Offset, Base No. Collapse to fragments. Add offset to get sample bases. Sort by sample base.
  21. 21. Example: crossover gp120 2 6225 6972 -82 1 6973 7758 +93 Two sequences: 1, 2. Sort: ID, Offset, Base No. Collapse to fragments. Add offset to get sample bases. Sort by sample base. Crossovers meet at 6972-3. Similar algorithms for CNV, crossover chromosomes.
  22. 22. Multi-pass alignments Messy genes: HIV-1, oconogenes. Blast, Fasta, Clustal often spread the sample out too far. Clustering provides tighter alignments. Result can be re-filtered to remove extraneous data. Scoring based on relevant portion of sequence.
  23. 23. Applying the cloud The W-curve algorithm is suitable for map-reduce: Individual triples are independent. Clustering can be carried out in stages. Each stage can accumulate clusters, fragments. Work on large genes or databases can run in parallel. Geometric data also suitable for machine-learning, fuzzy math.
  24. 24. Parallel execution: BLAST vs. W-curve Parallel execution with BLAST requires fragmenting the sequence. Separate outputs re-combined, usually manually. Boundary issues at fragment boundaries. W-curve alignment runs in parallel at all levels. Samples do not have to be broken up. No boundary issues.
  25. 25. Population, evolution, microbiome, plants: Large-scale, often fuzzy matches. Wide search in a large database. Mix of large and small fragments. Multiple alignments for each fragment. All of these can be automated.
  26. 26. Example: Clustering HIV-1 Clades for HIV-1 use the entire gp120 sequence. ~80 bases out of 1500 make up CD4 binding site. The rest is highly-variable. BLAST & Clustal assume all variation is significant. Result: The clades are based on 95% white noise.
  27. 27. Example: HIV-1 Classic HIV-1 antibody study. Clades use entire gp120 sequence. No diagonal: antibodies don't react with all of gp120. Need to base clades on subset.
  28. 28. Applying the W-curve First pass performs global alignments of gp120. Candidate CD4 binding sites can be extracted. Finer-grained alignment of 80+ bases in CD4 binding site. Result is a set of clinically useful clades.
  29. 29. Other approaches S-W cannot handle multi-base in FastQ data. Has no way to deal with quality measures. W-curve could generate multiple paths or a volume. Comparison via fuzzy math with library sequences. Result: Direct comparison of FastQ data.
  30. 30. Visualization Interactive tools can help annotation, manipulation of sequences. Tagging bases, shading common areas. Selecting matching regions interactively for study. Modifying difficult alignments by hand. All of these can benefit from a graphical, visual approach.
  31. 31. Business Case The W-curve algorithm is public. Most of the automation is in filtering, however. Different problems require special filters. Job control, database optimization also matter. Domain-specific knowledge is the real value added. Custom filters, database generation & management. Interactive user interface as either plugin or separate tool.
  32. 32. Contact Steven Lembark Workhorse Computing lembark@wrkhors.com +1 888 359 3508 Ed Bayham EXPLOR Bioventurues elb@explorbioventures.com