Bioinformatics Computation and Visualization at the University of Louisville


  • 72 MPixels
  • NGS: Next Generation Sequencing

    1. 1. Bioinformatics Computation and Visualization at the University of Louisville<br />Eric C. Rouchka and Adel S. Elmaghraby<br />Computer Engineering and Computer Science Department<br />November 16, 2010<br />
    2. 2. Abstract<br />Current high-throughput molecular biology techniques are providing researchers with data that grow at a rate equal to or faster than Moore’s law. The ability to store, manipulate, and analyze this “Big Data” requires intelligent use of HPC hardware and software resources. At the University of Louisville, we are specifically interested in understanding gene expression in a variety of disease and disorder states through analysis of microarray data, next generation sequencing of transcriptomes, and visualization of high-resolution in-situ hybridization images of the central nervous system. We use a variety of approaches, including GPU computing and a Dell Visualization Cluster, to help achieve faster results.<br />HPC, GPU, Clusters<br />
    3. 3. Gene Expression Visualization<br />Analyzing in-situ hybridization images of the central nervous system.<br />
    4. 4. Statement of Problem<br />A multitude of high-resolution biological imaging techniques are available, including:<br />Magnetic resonance<br />Ultrasound<br />Computed tomography<br />X-ray<br />Histological<br />
    5. 5. University of Louisville Database<br />View genes involved in CNS<br />neurotransmitters<br />neuroreceptors<br />Ages: <br />E13.5 (Embryonic day 13.5)<br />P0 (Postnatal day 0 – newborn)<br />P7 (Postnatal day 7)<br />Typical Size (TIFF Format)<br />6000 x 6000 pixels<br />ranges from 3000 x 3000 to 30,000 x 30,000<br />30 MB to 800 MB per image<br />
    6. 6. CNS Image Types<br />Whole Brain<br />Eye (Retina)<br />Spinal Cord<br />
    7. 7. UofL In-Situ Hybridization Database<br />GOALS<br />Tie in-situ database into gene expression (microarray; rtPCR) experiments<br />Link to other existing information<br />Localize and quantify signal<br />
    8. 8. Purpose<br />Search Images of Interest<br />by gene name<br />by developmental stage<br />by tissue type<br />Share Images<br />publicly<br />private groups<br />Annotate Images<br />Store Images<br />
    9. 9.
    10. 10. Partitioning for Web Viewing<br />
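
A minimal sketch of how a large image might be partitioned for web viewing. The slide gives only the title, so the 256-pixel tile size, the function names, and the halving pyramid are all assumptions made for illustration, not the database's actual scheme.

```python
from math import ceil

def tile_grid(width, height, tile=256):
    """Return (x, y, w, h) boxes covering a width x height image, row by row.

    The tile size of 256 is an assumed default; edge tiles are clipped.
    """
    boxes = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            boxes.append((x, y, min(tile, width - x), min(tile, height - y)))
    return boxes

def pyramid_levels(width, height, tile=256):
    """Count zoom levels when each level halves the previous one,
    stopping once the whole image fits inside a single tile."""
    levels = 1
    while width > tile or height > tile:
        width, height = ceil(width / 2), ceil(height / 2)
        levels += 1
    return levels

# Typical image size quoted for the database: 6000 x 6000 pixels.
boxes = tile_grid(6000, 6000)
print(len(boxes))                    # 576 tiles (24 x 24) at full resolution
print(boxes[-1])                     # (5888, 5888, 112, 112), the clipped corner tile
print(pyramid_levels(6000, 6000))    # 6
```

With a scheme like this, a browser only has to fetch the handful of tiles that intersect the current viewport instead of a 30 MB to 800 MB TIFF.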
    11. 11.
    12. 12. Extending the image display on multiple tiles (15,000 x 4,800 available display pixels)<br />High Resolution in-situ hybridization of mouse retina<br />Utilizing the Dell Video Wall<br />
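
The display figures above can be checked with a little arithmetic. This is only an illustrative sketch: `fit_scale` is a hypothetical helper, and the image sizes are the ones quoted for the database earlier in the deck.

```python
WALL_W, WALL_H = 15_000, 4_800   # available display pixels quoted on the slide

def fit_scale(img_w, img_h, wall_w=WALL_W, wall_h=WALL_H):
    """Largest uniform scale factor at which the whole image fits on the wall."""
    return min(wall_w / img_w, wall_h / img_h)

print(WALL_W * WALL_H / 1e6)       # 72.0 megapixels across the wall
print(fit_scale(6000, 6000))       # 0.8: a typical image must shrink to 80%
print(fit_scale(30_000, 30_000))   # 0.16: the largest images shrink to 16%
```

Even a 72-megapixel wall cannot show a typical 6000 x 6000 image at full resolution in one piece, because the wall's 4,800-pixel height is the limiting dimension.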
    13. 13. Research Areas<br /><ul>
        <li>TSS classification</li>
        <li>Transcription factor detection</li>
        <li>Translational control</li>
        <li>SNP analysis</li>
        <li>Repeat analysis</li>
        <li>Alternative splicing</li></ul>Sources of Variability<br />Control of gene expression<br /><ul>
        <li>2nd level microarray analysis</li>
        <li>Gene interactions</li>
        <li>In-situ hybridization</li>
        <li>Machine learning</li>
        <li>DNA computing</li>
        <li>Primer design</li>
        <li>Gene structure prediction</li></ul>DNA and Protein Sequence Analysis<br />Other<br />FUNCTIONAL GENOMICS<br />
    25. 25. Moore’s Law vs. GenBank<br />286 Processor<br />134,000 Transistors<br />
    26. 26. Hard Drive Storage vs. NGS<br />Stein L. (2010) Genome Biology 2010. 11:207.<br />
    27. 27. The LINE-1 Retrotransposon<br />Diagram labels: 5’ UTR, ORF 1, ORF 2, 3’ UTR, Poly-A, Antisense Promoter; scale 1K to 6K<br />Adapted from: Babushok DV, Kazazian HH, Jr. Progress in understanding the biology of the human mutagen LINE-1. Hum Mutat. 2007 Jun;28(6):527-39. <br /><ul>
        <li>Long Interspersed Nuclear Element-1</li>
        <li>A repeat sequence found pervasively throughout the genome.</li>
        <li>Each copy may or may not be capable of transcription or retrotransposition.</li></ul>What is the LINE-1 Life Cycle<br /><ul>
        <li>Comprises ~21% of the genome</li>
        <li>As many as 100 copies are estimated to be active and capable of retrotransposition.</li>
        <li>Most copies of LINE-1, however, are truncated at the 3’ end, or otherwise not intact, and are inactive.</li></ul>Ribosomes<br />Counts<br />ORF1: Zipper domain<br />ORF 2: endonuclease, reverse transcriptase<br />Distance from 3’ end<br />
    32. 32. How does LINE-1 affect cellular function<br />5’ UTR<br />ORF 2<br />3’ UTR<br />ORF 1<br />Down regulation<br />Poly-A<br />1K<br />2K<br />3K<br />4K<br />5K<br />6K<br />Antisense Promoter<br />Splice isoforms<br />Ectopic expression<br />
    33. 33. LINE1 5’ Signature<br />GGGGAGGAGCCAAG <br />String search for exact match in fastq records<br />Reverse Complement<br />CTTGGCTCCTCCCC<br />Identify and collect those sequences that contain the 5’ LINE1 signature element and at least 25 nucleotides of additional sequence flanking the LINE1 element<br />LINE1 element<br />Flanking sequence<br />Align against reference human sequence with BLAT<br /> SRR000921.50547 …TGAGTAAATAATGGA*GGGGAGGAGCCAAGAT…<br />SRR003709.200687 …CTTGGCTCCTCCCCC*AAAAGGAATCATTTTAAA…<br />Flanking sequence<br />LINE1 element<br />Identify and collect those sequences whose flanking sequence maps uniquely to the genome but whose alignment does not extend to cover the LINE1 element. Isolate the flanking sequences and create a BLAST-searchable database comprised of the flanks.<br />Convert all fastq to fasta and BLAST against the flanking sequence database<br /> >SRR000921.50547 <br />…TGAGTAAATAATGGA<br />>SRR003709.200687 <br />AAAAGGAATCATTTTAAA…<br />
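
The first step of the workflow above (exact string search for the signature or its reverse complement, keeping reads with at least 25 flanking nucleotides) could be sketched as follows. The function names are hypothetical, the parser assumes well-formed four-line FASTQ records, and the flank is cut at the edge of the 14-nucleotide match; the slide does not say exactly where the original pipeline placed that boundary. The BLAT and BLAST steps are not shown.

```python
SIGNATURE = "GGGGAGGAGCCAAG"   # 5' LINE1 signature from the slide
MIN_FLANK = 25                 # minimum flanking nucleotides from the slide

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def parse_fastq(lines):
    """Yield (read_id, sequence) pairs; assumes strict 4-line records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)   # '+' separator line
        next(it)   # quality line (unused here)
        yield header.strip()[1:].split()[0], seq

def flank_of(seq):
    """Return the sequence flanking the signature, or None if the read has
    no exact match or the flank is shorter than MIN_FLANK."""
    pos = seq.find(SIGNATURE)
    if pos != -1:
        flank = seq[:pos]                 # forward hit: flank precedes the element
    else:
        rc = reverse_complement(SIGNATURE)
        pos = seq.find(rc)
        if pos == -1:
            return None
        flank = seq[pos + len(rc):]       # reverse hit: flank follows the element
    return flank if len(flank) >= MIN_FLANK else None

def collect_flanks(lines):
    """Map read_id -> flank for every read that passes the filter."""
    return {rid: f for rid, seq in parse_fastq(lines)
            if (f := flank_of(seq)) is not None}

print(reverse_complement(SIGNATURE))      # CTTGGCTCCTCCCC
```

The returned dictionary can be written out as FASTA (`>read_id` followed by the flank) for the alignment step.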
    34. 34. GPU and CUDA<br />
    35. 35. Typical Computational Operations<br />RNA folding using the Nussinov algorithm, a dynamic programming method with O(n^3) time complexity.<br />Clustering of gene data and textual information.<br />
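
For reference, a minimal CPU sketch of the Nussinov recurrence, showing where the cubic cost comes from: two loops fill the table and a third tries every split point. It assumes Watson-Crick pairs only and returns just the maximum number of base pairs without a traceback; the `min_loop` parameter is an addition for illustration. This is not the lab's GPU implementation.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}  # Watson-Crick only (assumption)

def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence.

    best[i][j] holds the optimum for seq[i..j]; cells below the diagonal
    are never written and stay 0, which handles empty inner intervals.
    A pair (i, j) is allowed only when j - i > min_loop.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):                 # interval length minus one
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j],      # base i left unpaired
                        best[i][j - 1])      # base j left unpaired
            if (seq[i], seq[j]) in PAIRS and span > min_loop:
                score = max(score, best[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):        # bifurcation: the third nested loop
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]

print(nussinov("GGGAAACCC"))   # 3
print(nussinov("GCGC"))        # 2
```

Cells on the same anti-diagonal (equal `span`) do not depend on one another, which is the property a parallel version can exploit.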
    36. 36. A binary matrix representation of a secondary structure of an RNA sequence.<br />Hamada M et al. Bioinformatics 2009;25:465-473<br />© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email:<br />
    37. 37. VaIG Lab<br />Use Dell Alienware with Nvidia GPUs for<br />Hierarchical clustering of DNA microarray data, 48 times speedup over single core CPU, using Tesla C-870<br />Nussinov RNA folding, 290 times speedup, using Tesla C-2050<br />Processing of PubMed abstracts, ongoing<br />SAT (propositional logic) as applied to haplotype inference, ongoing<br />Semi-supervised support vector machine (S3VM), ongoing<br />
    38. 38. Sample Speed up using GPU<br />Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU<br />Dar-Jen Chang, Ahmed H. Desoky, Ming Ouyang, Eric C. Rouchka,<br />2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing<br />
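
As a plain CPU reference for the two measures named above, a short sketch in standard-library Python. It is not the GPU code from the cited paper, and the function names are illustrative. `pearson` assumes neither vector is constant, since a zero variance would divide by zero.

```python
from math import sqrt

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)

def pairwise(points, metric):
    """Symmetric n x n matrix of metric(points[i], points[j])."""
    n = len(points)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            out[i][j] = out[j][i] = metric(points[i], points[j])
    return out

data = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
print(pairwise(data, manhattan))
print(pairwise(data, pearson))
```

Every (i, j) entry is computed independently of the others, so the work grows quadratically with the number of data points and divides naturally across many parallel threads.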
    39. 39. University of Louisville and Dell<br />A positive experience<br />
    40. 40. J.B. Speed School Industry Affiliates<br />