Jillian ms defense-4-14-14-ja
Published in: Education, Technology
  • To paint with a very broad brush… Science combined observation with experimentation, manual data collection and manual visualization. Very effective formula.
  • To give a personal example: my grandma. Blood scientist. Took samples, looked under microscope, saw something strange, remembered things about the patient, made connections, ran down the hall to her colleagues: “Eureka!” She used to tell me: Observations were what really mattered!
  • Fix slide. She kept up with her field and reads journal articles today. Says: “No one looks under a microscope. It’s all DNA.” Worse, now it is heading toward ‘big data’. How will we make observations? Big data depends on automation: data collected through digital sensors, processed and filtered automatically, analyzed with computational methods and data mining. How will we make observations? What if there are errors in these processes? Need visualization to put the expert in this automated loop. But accomplishing this effectively for big data is challenging.
  • Define this more precisely
  • Picture: ( “Computer! Tell me the answer please.” )
  • Define ortholog. Define assembly.
  • Transcript

    • 1. Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays Jillian Aurisano Master of Science Defense April 16, 2014
    • 2. Science has historically looked like this:
    • 3. Up until very recently “Observations!” Expertise explore, make observations Collect samples
    • 4. “No one looks under a microscope anymore. It’s all DNA.” How do scientists make discoveries?
    • 5. How do we bring experts into the loop? • From direct collection of data, direct observation of results, direct interpretation and analysis • To automated data collection, automated filtering and automated analysis • Need visualization to bring experts into the loop • But how do we handle big data? • What’s our Big Data microscope? “Picard: Computer, scan everything, run diagnostics, and tell us the answer.” “Computer: Results are inconclusive.”
    • 6. Can Big Displays help? • Evidence suggests that these environments can have a positive impact on perception and cognition • But how do we use them to effectively address big data problems? • Can existing visualizations simply be ‘scaled-up’ to fit or are new approaches needed?
    • 7. In this thesis I will… Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics. I worked closely over several years with a team of computational biologists. This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays: BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)
    • 8. Outline 1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop 2) Examine potential impact of Big Displays on Big Data visualization 3) Evaluate scalability in existing comparative genomics visualizations. My work: BactoGeNIE 4/5/6) Describe my design, implementation, results 7) Think about the future. In the process, learn something about scaling up visual approaches to big data and big displays
    • 9. Warning: Biology is used in this thesis!
    • 10. Genome sequencing boom • Sequencing costs decreasing faster than Moore’s Law • So, we are able to produce massive volumes of sequence data • Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
    • 11. What is a genome? What is a gene? • Genomes consist of one or more long molecules of ‘DNA’ • DNA consists of chained nucleotides (A, C, T, G), which pair across the two strands as ‘base pairs’ • All the genes in an organism are in its ‘genome’ • Genes determine traits in an organism • Genes ‘code’ for proteins, and proteins do the work to make traits happen
    • 12. How are genomes sequenced? • Sequencing • Assembly • Annotation • Output: – Genome feature files – Raw sequence files Michael Schatz Cold Spring Harbor
    • 13. Lots of genome sequences -> opportunity Big challenge: Hard to figure out what a novel gene does • Traditionally: do wet-lab research to figure out – but expensive, time-consuming • Sequence the gene, and use computational methods to predict the function of the protein – If novel gene, may not provide answer • Can complete genome sequences help? • Comparative gene neighborhood analysis
    • 14. From genome structure to gene-product function • In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome. • Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors • But need to examine lots of genomes for statistical significance [Figure: gene1, gene2, gene3 and gene4 side by side in the genome, linked to an unknown biological process]
    • 15. Comparing gene neighborhoods across different genomes • Genes with similar sequences likely produce proteins with similar functions • Orthologs: similar genes from different genomes • Algorithms to compare genes between different genomes DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2
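The idea on slides 14 and 15 can be sketched in a few lines of Python. This is an illustration only, not code from BactoGeNIE: the function name `recurring_neighbors`, the `window` parameter and the toy strains are all assumptions, and each gene is represented simply by the label of its ortholog cluster.

```python
from collections import Counter


def recurring_neighbors(genomes, target, window=2):
    """Count, per ortholog cluster, how many genomes place it near `target`.

    `genomes` maps a strain name to its ordered list of ortholog-cluster
    labels (a hypothetical representation chosen for this sketch).
    Only the first occurrence of `target` in each genome is considered.
    """
    counts = Counter()
    genomes_with_target = 0
    for genes in genomes.values():
        if target not in genes:
            continue
        genomes_with_target += 1
        i = genes.index(target)
        # Up to `window` genes on each side of the target.
        neighborhood = genes[max(0, i - window):i] + genes[i + 1:i + 1 + window]
        counts.update(set(neighborhood))  # count each cluster once per genome
    return counts, genomes_with_target


# Toy data: three strains, genes named by ortholog cluster.
genomes = {
    "strain1": ["A", "B", "C", "D"],
    "strain2": ["A", "B", "C", "D"],
    "strain3": ["A", "C", "D", "B"],
}
counts, n = recurring_neighbors(genomes, target="B", window=1)
for cluster, hits in counts.most_common():
    print(f"{cluster}: next to B in {hits} of {n} genomes")
```

A cluster that sits next to the target in most genomes is a candidate for a shared function. With only three strains the tally means little, which is the point of the slide: many genomes are needed before a recurring neighbor is statistically convincing.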
    • 16. Role for visualization in this problem • Why not use automated methods to find common sets of genes around gene targets? • Why visualization? • 3 E’s: Exploration, Expertise, Errors
    • 17. Exploration • Patterns and anomalies without knowing in advance what you are looking for • Automated methods: Target: gene B; Common subsequences: Strains 1, 2, 3: {A, B, C, D} [Figure: four panels (Duplication, Truncation, Deletion, Inversion), each showing the gene order around the target in Strains 1, 2 and 3]
    • 18. Expertise • Experts make connections that will be missed by automated methods – Not just the anomaly, but significance of the anomaly – Knowledge about strains, protein families involved in finding significant anomalies [Figure: gene neighborhoods for StrainA, StrainB and StrainC]
    • 19. Errors • Verify automated methods • Uncertainty and errors in data generation [Figure: two panels, ‘Breaks in assembly’ and ‘Missed gene boundaries’, each comparing the data with the ground truth for Strains 1, 2 and 3. Automated methods report common subsequences {A, B, C, D} for Strains 1 and 3 in both panels, but only {A, D} for Strain 2 in one panel and {A, B} in the other]
    • 20. To address this problem: • Visualization must help bring experts into the data mining loop 1) Helps experts identify sources of error 2) Allows experts to explore the data 3) Enables researchers to integrate expertise in data analysis So: overview visualization is not enough. Need gene-neighborhood details • Visualization must scale to enable comparisons between hundreds to thousands of genomes
    • 21. Big displays: Opportunity for big data? • The question is: can these environments be used to visualize big data sets better? • Evidence suggests yes: – Physical navigation over virtual navigation • Reduced need to pan and zoom • Reduced need for context switching • Utilize embodied cognition • Multiple levels of detail accessible through physical movement – Externalize more information that can be accessed simultaneously Lance Long
    • 22. Porting from small to big displays • Maybe porting genome visualizations to these environments is sufficient? • Ruddle2013: – Export high-resolution graphical output from existing genomics visualizations – Display these large images on big display – Evidence that this had a positive impact on researcher reasoning • However, effective visualization on big displays involves more than simply scaling up the representation
    • 23. Pixel-Density Scalability • As pixel-density increases, does a visual approach take advantage of increased pixels-per-inch to show more entities, relationships or to show data at higher detail? Evaluation: • High-Density Representation? • Use increased pixels per inch to show more entities and relationships at higher detail? • Simultaneous detail and overview? • With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
    • 24. Display-Size Scalability • As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships? Evaluation • Encode big data spatially • Cluster related elements: • spatial memory • direct, visual comparisons • Physical navigation over virtual navigation: • Overviews at a distance, details up-close
    • 25. Perceptual and Analytic Task Scalability • Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels? • Does perception suffer if you scale the approach up? • Analytic tasks performed pre-attentively • Analytic tasks aided by visual queries • Aids to visual search for performing analytic tasks
    • 26. Examining current genomic data visualizations • Does it address this problem? • Show gene neighborhoods • Comparative • Does this visualization allow comparison between more than a few gene neighborhoods? • If you scale the visual approach up, does it: • Allow more comparisons of gene neighborhoods (Analytic Task Scalability) • Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability) • In the process, remain sensible to a human viewer (Perceptual scalability)
    • 27. Line-based comparative approaches • On load, align 1-2 genes to a chosen gene in a reference genome • Draw a line or a band to connect orthologs • In many cases, repurpose genome browsers to be comparative by adding comparative track • Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England). McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current protocols in Bioinformatics Hoboken, NJ, USA: John Wiley & Sons
    • 28. Line-based approaches expanded: Mauve • Like parallel coordinates • Draw lines between orthologs • Color genes by their block within that genome (not colored by orthology) • Example shows 9 genomes Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
    • 29. Line-based approaches: Critique • Pixel-density scalable? – Not a high-density representation – Need space for the ‘comparative track’ • Display size scalable? – Hard to follow lines across a display – Hard to compare similar neighborhoods across the display – No overview from a distance, details up close • Perceptual scalability for comparing gene neighborhoods? – Lots of visual clutter – Comparisons not pre-attentive – No aid to visual search • Number of genomes – Published up to 9 – Private groups have adapted frameworks for 10-50 genomes on big display Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
    • 30. PSAT: Color and alignment • PSAT – Orthologs encoded using color – Strand on which gene is positioned is encoded by orientation to the center line – Text is given by default Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
    • 31. PSAT: Critique • Pixel-Density Scalability – Not high-density representation because of text labels • Perceptual scalability for comparing gene neighborhoods? – Can’t scale to large number of genes- not enough colors Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
    • 32. GeneRiViT: Alignment and color • GeneRiViT – Align against arbitrary gene – Color by presence/absence – Examples show 4 genomes – Critique: • No discussion of scalability • Overview visualization • Doesn’t address our problem Price, A. et al "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
    • 33. Dot plots • Coordinates of genes in two genomes are used as x and y axes • Orthologous genes in other genomes are plotted • Each genome given a unique color • Critique: – Doesn’t provide ‘gene-neighborhood’ view – Overview tool – Hard to follow beyond a few genomes Price, A. et al "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
    • 34. Overview Visualization: Sequence Surveyor • Not this domain problem, but interesting approach • Each gene is drawn as a rectangle • Several possible variables for position: Ordinal position • Several possible variables for color: – Position in one reference genome – Use a color ramp, for wide range of colors Albers, D. et al "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
    • 35. Overview Visualization: Sequence Surveyor • Pixel-density scalable – High-density representation – High-detail representation • Display size scalability – May be difficult to compare patterns from one side of display to another • Perceptual Scalability – Colors allow for pre-attentive identification of patterns – Avoids visual clutter Albers, D. et al "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
    • 36. Copy number variations on big displays • Orchestral: – Visualization of a different data type – Effective use of color to enable pre-attentive identification of similarities across genomes – High-density representation – Details-up-close, overview from a distance Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
    • 37. BactoGeNIE Demo
    • 38. Program details • Implemented in C++ using Qt and the QGraphicsView framework • Upload: – genome feature files – FASTA files (raw gene sequences) • CD-HIT algorithm processes sequence files to compute ortholog ‘clusters’ • MySQL database to store big datasets – Loads 1000 contigs into memory, rest stored in database • Optimized for PubMed datasets • Prototyped on E. coli draft genomes – Capable of displaying any contigs from thousands of E. coli draft genomes • On EVL Cyber-commons wall, around 400 contigs in view
    • 39. BactoGeNIE: High density representation • Compressed genome encoding • No text labels, instead ‘on-demand’ • No ‘comparative track’ • Encode orthology using – User applied color: pre-attentive orthology identification – Coordinated highlighting: scalable visual query – Alignment: use space to encode similarity
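As a rough illustration of the clustering step on slide 38, the sketch below reads the text of a CD-HIT `.clstr` output file into a mapping from cluster number to gene identifiers. The slides do not say how BactoGeNIE consumes CD-HIT output, so the function name `parse_clstr` and the sample gene names are assumptions; only the general `.clstr` layout (a `>Cluster N` header followed by one member per line) is taken from CD-HIT itself.

```python
import re


def parse_clstr(text):
    """Group sequence identifiers by CD-HIT cluster number.

    Each member line looks like '0<TAB>350aa, >gene_id... *'; the
    identifier is the text between '>' and the trailing '...'.
    """
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        else:
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return clusters


# Hypothetical sample in .clstr layout; gene names are made up.
sample = """>Cluster 0
0\t350aa, >geneA_strain1... *
1\t348aa, >geneA_strain2... at 98.5%
>Cluster 1
0\t120aa, >geneB_strain1... *
"""
clusters = parse_clstr(sample)
print(clusters)
```

Genes that land in the same cluster are the ones a viewer would treat as orthologs, for example by giving them the same color.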
    • 40. Use space to encode similarity • Goals: – Make it easier to perform comparisons across many genomes (Analytic task scalability) – Accommodate increased display size (Display Size Scalability) – Make similarities and differences easy to see (Perceptual Scalability) • Sorting and Alignment – Sort by contig length – Sort by gene content – Dynamically align against any gene
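The sorting and alignment operations on slide 40 can be sketched as follows. This is a hedged illustration rather than the BactoGeNIE implementation (which the slides say is written in C++ with Qt): `sort_contigs` and `alignment_offsets` are hypothetical names, contigs are modeled as ordered lists of ortholog-cluster labels, and contig length is approximated by gene count instead of base pairs.

```python
def sort_contigs(contigs, target=None):
    """Order contigs longest first; with a target, contigs containing it come first."""
    def key(item):
        name, genes = item
        has_target = target is not None and target in genes
        # False sorts before True, so contigs holding the target lead.
        return (not has_target, -len(genes), name)
    return [name for name, _ in sorted(contigs.items(), key=key)]


def alignment_offsets(contigs, target):
    """Shift (in gene slots) that puts `target` in the same column on every contig."""
    positions = {
        name: genes.index(target)
        for name, genes in contigs.items()
        if target in genes
    }
    if not positions:
        return {}
    column = max(positions.values())
    return {name: column - pos for name, pos in positions.items()}


# Toy contigs, genes named by ortholog cluster.
contigs = {
    "c1": ["A", "B", "C", "D"],
    "c2": ["X", "A", "B", "C", "D"],
    "c3": ["E", "F"],
}
order = sort_contigs(contigs, target="B")
offsets = alignment_offsets(contigs, "B")
print(order, offsets)
```

Drawing each contig shifted by its offset stacks the target gene in one column, so differences in the surrounding neighborhood show up as breaks in otherwise vertical stripes.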
    • 41. Interactivity • On hovering, contig expands in height, so easier to select genes of interest in high-density view • ‘Pop-up’ menu for each gene that gives info and allows for: – application of color: • ‘tagging’ operation • Scalable query – ‘targeting’ operation (described next) • User can sort genomes by: – Gene target – Contig length
    • 42. ‘Gene Targeting’ Function to create high resolution, comparative ‘maps’ • User selects a gene of interest • This gene is given a base color • Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’ • Orthologous genes in related genomes are given the same colors • Contigs containing this gene are brought to the top • The target gene is centered • Orthologs are aligned to the target
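A minimal sketch of the color assignment described on slide 42, under stated assumptions: genes are ortholog-cluster labels, "upstream" is taken to mean a lower index in the reference contig (real genomes would also need strand information), and colors are returned as symbolic `(ramp, step)` keys rather than RGB values. The names `target_colors` and `color_contig` are hypothetical.

```python
def target_colors(reference, target, ramp_length=3):
    """Map ortholog clusters near `target` to a (ramp, step) color key.

    The target gets the base color; neighbors get a step along an
    'upstream' or 'downstream' ramp according to their distance.
    """
    i = reference.index(target)
    colors = {target: ("base", 0)}
    for step in range(1, ramp_length + 1):
        if i - step >= 0:
            colors.setdefault(reference[i - step], ("upstream", step))
        if i + step < len(reference):
            colors.setdefault(reference[i + step], ("downstream", step))
    return colors


def color_contig(genes, colors):
    """Apply the reference color keys to another contig; None means uncolored."""
    return [colors.get(gene) for gene in genes]


reference = ["A", "B", "C", "D", "E"]
colors = target_colors(reference, target="C", ramp_length=2)

# A related contig whose neighborhood is inverted and has an extra gene Z.
other = ["D", "C", "B", "Z"]
painted = color_contig(other, colors)
print(painted)
```

Because orthologs inherit the reference colors, the inverted contig shows the downstream ramp on the left and the upstream ramp on the right, and the unrelated gene stays uncolored. That is the kind of difference the slide expects a viewer to pick out at a glance.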
    • 43. Gene targeting function • Clustering to promote direct comparisons • Overviews at a distance • Details up close • Pre-attentive identification of similarities and differences between gene neighborhoods Lance Long
    • 44. Examples
    • 45. Pixel-density Scalability BactoGeNIE fits the pixel-density scalability criteria: High-density data display, identifier display and orthology encoding
    • 46. Display Size Scalability • BactoGeNIE is the only approach to use clustering and show multiple levels of detail
    • 47. Perceptual Scalability and Analytic Tasks BactoGeNIE: • Similarity is pre-attentively accessible • Avoids visual clutter • Visual query for orthologs
    • 48. Graphical Scalability: Display Resolution vs Number of Genomes [Chart: number of genomes (axis ticks 0 to 1000) against display resolution in pixels (axis ticks 480 to 4320) for BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco and Mauve]
    • 49. Preliminary User Feedback • A version of BactoGeNIE used by computational biology team on NxN pixels and MxM inches resolution tiled display wall • “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes” • “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest. • This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons. • The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns. • This tool has a potential to serve both for visualization as well as data mining needs.” Usage of a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users
    • 50. Summary of contributions • A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view • First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps • First to show overviews with gene neighborhood details that can be accessed through physical movement • Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
    • 51. What’s next? Design • Integration with different levels of detail • Multiple color ramps • Advanced ordering in y, based on similarity to target or strain phylogeny Implementation • Scalability in rendering using parallelization on the GPU • Port to SAGE Evaluation • User studies and evaluations of perceptual scalability
    • 52. Scalable Design, Big Data, Big Displays • Need visualization to provide an interface between automated analysis and the expert • Porting existing visual approaches to big data and big displays will not always work • Need to design for increased – pixel-density – display size – volume of analytical tasks
    • 53. Thanks! • Acknowledgements: – Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory – Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto
