Bacterial Gene Neighborhood
Investigation Environment: A
Scalable Genome Visualization for
Big Displays
Jillian Aurisano
Master of Science Defense
April 16, 2014
Science has historically looked like this:
Up until very recently
“Observations!”
Expertise
explore,
make observations
Collect samples
“No one looks under a microscope anymore.
It’s all DNA.”
How do
scientists make
discoveries?
How do we bring experts into the
loop?
• From direct collection of
data, direct observation of
results, and direct interpretation
and analysis
• To automated data
collection, automated
filtering and automated
analysis
• Need visualization to bring
experts into the loop
• But how do we handle big
data?
• What’s our Big Data
microscope?
“ Picard: Computer; scan
everything, run diagnostics, and tell us
the answer.”
“Computer: Results are inconclusive”
Can Big Displays help?
• Evidence suggests that these environments
can have a positive impact on perception and
cognition
• But how do we use them to effectively
address big data problems?
• Can existing visualizations simply be ‘scaled-up’
to fit, or are new approaches needed?
In this thesis I will…
Examine a specific big data visualization problem:
comparative gene neighborhood analysis in
bacterial genomics
I worked closely over several years with a team of
computational biologists
This work has led to the design and implementation
of a new visualization approach designed to scale to
big data and big displays
BactoGeNIE
(‘Bact(o)erial Gene Neighborhood Investigation
Environment’)
Outline
1) Describe comparative bacterial gene
neighborhood analysis to understand how to
bring experts into the loop
2) Examine potential impact of Big Displays on Big
Data visualization
3) Evaluate scalability in existing comparative
genomics visualizations
My work: BactoGeNIE
4/5/6) Describe my design, implementation, results
7) Think about the future
In the process, learn something about scaling up
visual approaches to big data and big displays
Warning: Biology is used in this thesis!
Genome sequencing boom
• Sequencing costs
decreasing faster
than Moore’s Law
• So, we are able to
produce massive
volumes of
sequence data
• Bacterial genomes
are small, so we are
generating
thousands of
complete bacterial
genome sequences
Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale
Genome Sequencing Program, 2012
<www.genome.gov/sequencingcosts>
What is a genome? What is a gene?
• A genome consists of one or
more long molecules of ‘DNA’
• DNA consists of chained
nucleotide molecules
(A, C, T, G); genome length is
counted in ‘base pairs’
• All the genes in an organism
are in its ‘genome’
• Genes determine traits in an
organism
• Genes ‘code’ for proteins, and
proteins do the work to make
traits happen
How are genomes sequenced?
• Sequencing
• Assembly
• Annotation
• Output:
– Genome feature
files
– Raw sequence
files
Michael Schatz
Cold Spring Harbor
Lots of genome sequences ->
opportunity
Big challenge: Hard to figure out what a novel gene
does
• Traditionally: do wet-lab research to figure out
– but expensive, time-consuming
• Sequence the gene, and use computational
methods to predict the function of the protein
– If novel gene, may not provide answer
• Can complete genome sequences help?
• Comparative gene neighborhood analysis
From genome structure
to gene-product function
• In bacteria, genes
whose products are
involved in similar
functions are often placed
close to each other in
the genome.
• Research suggests that
it is possible to predict
gene-product function
in bacteria based on
commonly recurring
gene neighbors
• But, need to examine
lots of genomes for
statistical significance?
[Figure: gene1, gene2, gene3, gene4 in a row, labeled "Biological process?"]
Comparing gene neighborhoods across
different genomes
• Genes with similar sequences likely produce
proteins with similar functions
• Orthologs: similar genes from different genomes
• Algorithms to compare genes between different
genomes
DeMeo et al. BMC Molecular
Biology 2008 9:2
doi:10.1186/1471-2199-9-2
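A minimal sketch of the neighbor-counting idea, assuming each genome has already been reduced to an ordered list of ortholog cluster IDs. The function name `neighbor_counts`, the `window` parameter, and the toy strains are illustrative assumptions, not part of any tool described here.

```python
from collections import Counter


def neighbor_counts(genomes, target, window=2):
    """Count, per ortholog cluster, the number of genomes in which it
    lies within `window` genes of `target`."""
    counts = Counter()
    for genes in genomes.values():
        nearby = set()
        for i, gene in enumerate(genes):
            if gene == target:
                lo = max(0, i - window)
                nearby.update(genes[lo:i])                  # genes before the target
                nearby.update(genes[i + 1:i + 1 + window])  # genes after the target
        nearby.discard(target)
        counts.update(nearby)  # each genome contributes at most 1 per cluster
    return counts


# Hypothetical toy data: gene order in three strains
genomes = {
    "strain1": ["A", "B", "C", "D"],
    "strain2": ["A", "B", "C", "X"],
    "strain3": ["Y", "B", "C", "D"],
}
counts = neighbor_counts(genomes, "B", window=1)
```

A cluster that recurs next to the target in most genomes is a candidate for shared function. Whether the recurrence is statistically meaningful depends on how many genomes are examined.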
Role for visualization in this problem
• Why not use automated methods to find
common sets of genes around gene targets?
• Why visualization?
• 3 E’s: Exploration, Expertise, Errors
Exploration
• Patterns and anomalies without knowing in
advance what you are looking for
Automated methods:
Target: gene B
Common subsequences:
Strains 1, 2, 3: {A, B, C, D}
[Figure: four panels, Duplication, Truncation, Deletion, and Inversion, each drawing the neighborhood of genes A, B, C, and D in Strains 1, 2, and 3]
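A small sketch of why a set-based summary can hide the anomalies named above. The strain data is hypothetical, and `common_gene_set` stands in for an automated method; it is not the algorithm used by any specific tool.

```python
def common_gene_set(strains):
    """Set-based summary: genes present in every strain,
    ignoring order and copy number."""
    gene_sets = [set(genes) for genes in strains.values()]
    return set.intersection(*gene_sets)


def differs_from_reference(strains, reference):
    """Names of strains whose ordered gene list differs from the reference."""
    ref = strains[reference]
    return [name for name, genes in strains.items() if genes != ref]


# Hypothetical neighborhoods around target gene B
strains = {
    "strain1": ["A", "B", "C", "D"],
    "strain2": ["A", "B", "B", "C", "D"],  # assumed duplication of B
    "strain3": ["A", "C", "B", "D"],       # assumed inversion of B-C
}
summary = common_gene_set(strains)
changed = differs_from_reference(strains, "strain1")
```

Here `summary` is the same `{A, B, C, D}` for all three strains, while the ordered comparison flags strain2 and strain3. The set report alone would not show an expert that anything unusual happened.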
Expertise
• Experts make connections that will be missed by
automated methods
– Not just the anomaly, but significance of the anomaly
– Knowledge about strains, protein families involved in
finding significant anomalies
[Figure: neighborhoods of StrainA, StrainB, and StrainC, with one marked "!"]
Errors
• Verify
automated
methods
• Uncertainty
and errors in
data
generation
[Figure: two panels, labeled "Breaks in assembly" and "Missed gene boundaries". Each sets the data for Strains 1, 2, and 3, and the common subsequences reported by automated methods, against the ground truth.]
Automated methods, common subsequences (first panel):
Strains 1 and 3: {A, B, C, D}
Strain 2: {A, D}
Automated methods, common subsequences (second panel):
Strains 1 and 3: {A, B, C, D}
Strain 2: {A, B}
To address this problem:
• Visualization must help bring experts into the
data mining loop
1) Helps experts identify sources of error
2) Allows experts to explore the data
3) Enables researchers to integrate expertise in data
analysis
So: overview visualization not enough.
Need gene-neighborhood details
• Visualization must scale to enable comparisons
between hundreds to thousands of genomes
Big displays: Opportunity for big data?
• The question is: can these environments be used to
visualize big data sets better?
• Evidence suggests yes:
– Physical navigation over virtual navigation
• Reduced need to pan and zoom
• Reduced need for context switching
• Utilize embodied cognition
• Multiple levels of detail accessible through physical movement
– Externalize more information that can be accessed
simultaneously
Lance Long
Porting from small to big displays
• Maybe porting genome visualizations to these
environments is sufficient?
• Ruddle2013:
– Export high-resolution graphical output from
existing genomics visualizations
– Display these large images on a big display
– Evidence that this had a positive impact on
researcher reasoning
• However, effective visualization on big displays
involves more than simply scaling up the
representation
Pixel-Density Scalability
• As pixel-density increases, does a visual approach take
advantage of increased pixels-per-inch to show more
entities and relationships, or to show data at higher detail?
Evaluation:
• High-Density Representation?
• use increased pixels per inch to show more entities and
relationships at higher detail?
• Simultaneous detail and overview?
• With increased pixel density, representation shows details
and overviews at the same time, without relying on
Focus+Context
Display-Size Scalability
• As display size increases, does a visual approach take
advantage of the increased space to depict more
entities or relationships?
Evaluation
• Encode big data spatially
• Cluster related elements:
• spatial memory
• direct, visual comparisons
• Physical navigation over virtual navigation:
• Overviews at a distance, details up-close
Perceptual and Analytic Task Scalability
• Does a visual approach scale up to enable the
performance of an analytic task across more
data, more space, more pixels?
• Does perception suffer if you scale the approach
up?
• Analytic tasks performed pre-attentively
• Analytic tasks aided by visual queries
• Aids to visual search for performing analytic tasks
Examining current genomic data
visualizations
• Does it address this problem?
• Show gene neighborhoods
• Comparative
• Does this visualization allow comparison between
more than a few gene neighborhoods?
• If you scale the visual approach up, does it:
• Allow more comparisons of gene neighborhoods (Analytic
Task Scalability)
• Take advantage of big displays in size and pixel-density
(Pixel-Density Scalability and Display-Size Scalability)
• In the process, remain sensible to a human viewer
(Perceptual scalability)
Line-based comparative approaches
• On load, align 1-2 genes to
a chosen gene in a
reference genome
• Draw a line or a band to
connect orthologs
• In many cases, repurpose
genome browsers to be
comparative by adding
a comparative track
• Tools: PSAT, GBrowse_syn,
SynView, ACT, CGAT,
Combo, MizBee, Mauve
Pan, X. et al. (2005).
SynBrowse: a synteny
browser for
comparative sequence
analysis. Bioinformatics
(Oxford, England).
McKay et al. Using
the Generic Synteny
Browser
(GBrowse_syn).
Current protocols in
Bioinformatics
Hoboken, NJ, USA:
John Wiley & Sons
Line-based approaches expanded:
Mauve
• Like parallel
coordinates
• Draw lines between
orthologs
• Color genes by their
block within that
genome (not colored
by orthology)
• Example shows 9
genomes
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved
genomic sequence with rearrangements." Genome research 14.7
(2004): 1394-140
Line-based approaches: Critique
• Pixel-density scalable?
– Not a high-density representation
– Need space for the ‘comparative track’
• Display size scalable?
– Hard to follow lines across a display
– Hard to compare similar neighborhoods
across the display
– No overview from a distance, details up
close
• Perceptual scalability for comparing
gene neighborhoods?
– Lots of visual clutter
– Comparisons not pre-attentive
– No aid to visual search
• Number of genomes
– Published up to 9
– Private groups have adapted frameworks
for 10-50 genomes on a big display
Darling, Aaron CE, et al. "Mauve: multiple
alignment of conserved genomic sequence with
rearrangements." Genome research 14.7 (2004):
1394-140
PSAT: Color and alignment
• PSAT
– Orthologs encoded
using color
– Strand on which gene
is positioned is
encoded by
orientation to the
center line
– Text is given by
default
Fong, Christine, et al. "PSAT: a
web tool to compare genomic
neighborhoods of multiple
prokaryotic genomes." BMC
bioinformatics 9.1 (2008): 170.
PSAT: Critique
• Pixel-Density
Scalability
– Not high-density
representation
because of text labels
• Perceptual scalability
for comparing gene
neighborhoods?
– Can’t scale to a large
number of genes: not
enough colors
Fong, Christine, et al. "PSAT: a
web tool to compare genomic
neighborhoods of multiple
prokaryotic genomes." BMC
bioinformatics 9.1 (2008): 170.
GeneRiViT: Alignment and color
• GeneRiViT
– Align against arbitrary
gene
– Color by
presence/absence
– Examples show 4 genomes
– Critique:
• No discussion of scalability
• Overview visualization
• Doesn’t address our
problem
Price, A. et al "Gene-RiViT: A visualization tool
for comparative analysis of gene
neighborhoods in prokaryotes." Biological
Data Visualization (BioVis), 2012 IEEE
Symposium on. IEEE, 2012.
Dot plots
• Coordinates of genes in
two genomes are used
as x and y axis
• Orthologous genes in
other genomes are
plotted
• Each genome given a
unique color
• Critique:
– Doesn’t provide ‘gene-
neighborhood’ view
– Overview tool
– Hard to follow beyond
a few genomes
Price, A. et al "Gene-RiViT: A visualization tool
for comparative analysis of gene
neighborhoods in prokaryotes." Biological
Data Visualization (BioVis), 2012 IEEE
Symposium on. IEEE, 2012.
Overview Visualization: Sequence
Surveyor
• Not this domain
problem, but
interesting approach
• Each gene is drawn as a
rectangle
• Several possible
variables for position:
Ordinal position
• Several possible
variables for color:
– Position in one
reference genome
– Use a color ramp, for
wide range of colors
Albers,D. et al "Sequence surveyor: Leveraging overview for scalable
genomic alignment visualization." Visualization and Computer
Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
Overview Visualization: Sequence
Surveyor
• Pixel-density scalable
– High-density representation
– High-detail representation
• Display size scalability
– May be difficult to compare
patterns from one side of
display to another
• Perceptual Scalability
– Colors allow for pre-attentive
identification of patterns
– Avoids visual clutter
Albers,D. et al "Sequence surveyor: Leveraging overview
for scalable genomic alignment visualization."
Visualization and Computer Graphics, IEEE Transactions
on 17.12 (2011): 2392-2401.
Copy number variations on big displays
• Orchestral:
– Visualization of a different data type
– Effective use of color to enable pre-attentive
identification of similarities across genomes
– High-density representation
– Details-up-close, overview from a distance
Ruddle, Roy A., et al. "Leveraging
wall-sized high-resolution displays for
comparative genomics analyses of
copy number variation." Biological
Data Visualization (BioVis), 2013 IEEE
Symposium on. IEEE, 2013.
BactoGeNIE Demo
Program details
• Implemented in C++ using Qt and the QGraphicsView
framework
• Upload:
– genome feature files
– FASTA files (raw gene sequences)
• CD-HIT algorithm processes sequence files to compute
ortholog ‘clusters’
• MySQL database to store big datasets
– Loads 1000 contigs into memory, rest stored in database
• Optimized for PubMed datasets
• Prototyped on E. coli draft genomes
– Capable of displaying any contigs from thousands of E. coli draft
genomes
• On EVL Cyber-commons wall, around 400 contigs in view
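One way to picture "1000 contigs in memory, rest in the database" is a bounded cache in front of a loader. The sketch below is in Python rather than the C++/Qt of the actual program, and the least-recently-used eviction policy, the `ContigCache` class, and the `load_from_database` stand-in are all assumptions for illustration; the slides do not say how BactoGeNIE chooses what stays in memory.

```python
from collections import OrderedDict


class ContigCache:
    """Keep at most `capacity` contigs in memory; fetch the rest on demand."""

    def __init__(self, load_contig, capacity=1000):
        self._load = load_contig      # callable: contig_id -> contig record
        self._capacity = capacity
        self._items = OrderedDict()   # ordered oldest-used to newest-used

    def get(self, contig_id):
        if contig_id in self._items:
            self._items.move_to_end(contig_id)  # mark as recently used
            return self._items[contig_id]
        contig = self._load(contig_id)
        self._items[contig_id] = contig
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)     # evict least recently used
        return contig


loads = []


def load_from_database(contig_id):
    # Stand-in for a database query; records each call for inspection
    loads.append(contig_id)
    return {"id": contig_id, "genes": []}


cache = ContigCache(load_from_database, capacity=2)
for contig_id in ["c1", "c2", "c1", "c3", "c1", "c2"]:
    cache.get(contig_id)
```

With a capacity of 2, the second request for `c1` is served from memory, while `c2` has to be reloaded after `c3` pushed it out.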
BactoGeNIE: High density
representation
• Compressed genome
encoding
• No text labels, instead
‘on-demand’
• No ‘comparative track’
• Encode orthology using
– User applied color: pre-
attentive orthology
identification
– Coordinated
highlighting: scalable
visual query
– Alignment: use space to
encode similarity
Use space to encode similarity
• Goals:
– Make it easier to perform comparisons across many
genomes (Analytic task scalability)
– Accommodate increased display size (Display Size
Scalability)
– Make similarities and differences easy to see
(Perceptual Scalability)
• Sorting and Alignment
– Sort by contig length
– Sort by gene content
– Dynamically align against any gene
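A rough sketch of aligning against a gene and sorting, assuming each contig is a list of `(cluster, start, end)` tuples in base-pair coordinates. The function names, the tuple layout, and the sort key are hypothetical.

```python
def align_to_gene(contigs, target):
    """Return {contig: x_shift} so that the target gene starts at x = 0
    in every contig that contains it (None when the gene is absent)."""
    shifts = {}
    for name, genes in contigs.items():
        starts = [start for cluster, start, _end in genes if cluster == target]
        shifts[name] = -starts[0] if starts else None
    return shifts


def sort_for_display(contigs, shifts):
    """Contigs containing the target first, then longer contigs first."""
    def length(name):
        return max(end for _cluster, _start, end in contigs[name])
    return sorted(contigs, key=lambda n: (shifts[n] is None, -length(n)))


# Hypothetical contigs: (ortholog cluster, start bp, end bp)
contigs = {
    "c1": [("A", 0, 100), ("B", 150, 300), ("C", 350, 500)],
    "c2": [("B", 20, 170), ("C", 220, 400)],
    "c3": [("X", 0, 900)],
}
shifts = align_to_gene(contigs, "B")
order = sort_for_display(contigs, shifts)
```

Adding each shift to a contig's gene coordinates places every copy of the target in the same column, so horizontal position itself carries the comparison.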
Interactivity
• On hovering, contig expands in height, so easier
to select genes of interest in high-density view
• ‘Pop-up’ menu for each gene that gives info and
allows for:
– application of color:
• ‘tagging’ operation
• Scalable query
– “targeting” operation (described next)
• User can sort genomes by:
– Gene target
– Contig length
‘Gene Targeting’ Function to create
high resolution, comparative ‘maps’
• User selects a gene of interest
• This gene is given a base color
• Two color ramps are applied to adjacent genes,
one ‘upstream’ and one ‘downstream’
• Orthologous genes in related genomes are given
the same colors
• Contigs containing this gene are brought to the
top
• The target gene is centered
• Orthologs are aligned to the target
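The coloring steps above can be sketched as follows, assuming contigs are ordered lists of ortholog cluster IDs and that "upstream" simply means earlier in the list. The RGB values, `ramp`, `target_colors`, and `color_contig` are illustrative choices, not the colors or functions used in BactoGeNIE.

```python
def ramp(start_rgb, end_rgb, steps):
    """Linearly interpolate `steps` RGB colors from start_rgb to end_rgb."""
    if steps == 1:
        return [start_rgb]
    return [
        tuple(round(s + (e - s) * i / (steps - 1)) for s, e in zip(start_rgb, end_rgb))
        for i in range(steps)
    ]


def target_colors(reference, target, base=(255, 255, 0),
                  upstream=((255, 200, 200), (255, 0, 0)),
                  downstream=((200, 200, 255), (0, 0, 255))):
    """Map each ortholog cluster in the reference contig to a color,
    based on its position relative to the target gene."""
    idx = reference.index(target)
    colors = {target: base}
    before = reference[:idx][::-1]   # nearest upstream neighbor first
    after = reference[idx + 1:]      # nearest downstream neighbor first
    for genes, (near, far) in ((before, upstream), (after, downstream)):
        if genes:
            for cluster, rgb in zip(genes, ramp(near, far, len(genes))):
                colors.setdefault(cluster, rgb)  # first occurrence wins
    return colors


def color_contig(contig, colors, default=(128, 128, 128)):
    """Orthologs in any contig reuse the color assigned to their cluster."""
    return [colors.get(cluster, default) for cluster in contig]


reference = ["A", "B", "T", "C", "D"]   # hypothetical contig, target "T"
colors = target_colors(reference, "T")
other = color_contig(["X", "T", "C"], colors)
```

Because color depends only on the cluster, a conserved neighborhood shows the same gradient in every row, and a rearranged or missing gene shows up as a break in that gradient.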
Gene targeting function
• Clustering to
promote direct
comparisons
• Overviews at a
distance
• Details up close
• Pre-attentive
identification of
similarities and
differences between
gene neighborhoods
Lance Long
Examples
Pixel-density Scalability
BactoGeNIE fits
the pixel-density
scalability
criteria:
High-density data
display, identifier
display and
orthology
encoding
Display Size Scalability
• BactoGeNIE
is the only
approach to
use
clustering
and show
multiple
levels of
detail
Perceptual Scalability and Analytic
Tasks
BactoGeNIE:
• Similarity is pre-
attentively
accessible
• Avoids visual
clutter
• Visual query for
orthologs
Graphical Scalability:
Display Resolution vs Number of
Genomes
[Chart: number of genomes shown (y-axis, "Genomes", 0 to 1000) against display resolution (x-axis, "Pixels", 480 to 4320) for BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco, and Mauve]
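A back-of-the-envelope sketch of how rows in view could grow with resolution, assuming the x-axis values are vertical pixel counts and that each genome occupies one fixed-height row. The 4-pixel row height and 1-pixel gap are made-up parameters, not measurements from any of the tools in the chart.

```python
def rows_in_view(display_height_px, row_height_px, gap_px=0):
    """Number of full rows that fit: n rows need
    n * row_height + (n - 1) * gap pixels."""
    if row_height_px <= 0:
        raise ValueError("row_height_px must be positive")
    return (display_height_px + gap_px) // (row_height_px + gap_px)


heights = [480, 720, 1080, 1440, 2160, 2880, 3240, 4320]
capacity = {h: rows_in_view(h, row_height_px=4, gap_px=1) for h in heights}
```

Under these assumptions capacity grows linearly with vertical resolution. An approach that needs text labels or a comparative track per row has a larger effective row height and a lower ceiling.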
Preliminary User Feedback
• A version of BactoGeNIE used by computational biology team on NxN pixels
and MxM inches resolution tiled display wall
• “This tool has been widely used by members of the team to show the
comparative analyses of genomic context for several bacterial genomes”
• “Genome browsers such as JBrowse enable researchers to do comparative
genome analyses for nearly 10-50 genomes. But fail to work when we are
studying several hundreds of genomes of interest.
• This tool is really unique and it’s the only tool that I am aware of that can
scale up to any number of genome comparisons.
• The ability to load multiple tracks of genomes, and the zoom in and out
options with color coding, annotation tracks makes it very convenient for
scientists to quickly look at patterns.
• This tool has a potential to serve both for visualization as well as data mining
needs.”
Usage of a version without the gene targeting approach.
Future study will concentrate on this feature with a wider community of users
Summary of contributions
• A novel design that is the first to enable direct
comparisons between hundreds of gene
neighborhoods in one view
• First interactive, large-scale comparative gene
neighborhood approach, with on-the-fly
sorting, dynamic alignment, user-selected color
and color ramps
• First to show overviews with gene-neighborhood
details that can be accessed through physical
movement
• Introduces a novel visualization approach, ‘gene
targeting’, that translates genomic data into
high-resolution genomic maps
What’s next?
Design
• Integration with different levels of detail
• Multiple color ramps
• Advanced ordering in y, based on similarity to target or
strain phylogeny
Implementation
• Scalability in rendering using parallelization on the GPU
• Port to SAGE
Evaluation
• User studies and evaluations of perceptual scalability
Scalable Design, Big Data, Big Displays
• Need visualization to provide an interface
between automated analysis and the expert
• Porting existing visual approaches to big data
and big displays will not always work
• Need to design for increased
– pixel-density
– display size
– volume of analytical tasks
Thanks!
• Acknowledgements:
– Jason Leigh, Andy Johnson, Khairi Reda, Lance
Long, Uthman Shabazz, and everyone in the
Electronic Visualization Laboratory
– Barry Goldman, David Bush, Niran Iyer, Shawn
Stricklin and the rest of the computational biology
team at Monsanto


Bacterial Gene Neighborhood Comparison at Scale

  • 1. Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays Jillian Aurisano Master of Science Defense April 16, 2014
  • 2. Science has historically looked like this:
  • 3. Up until very recently “Observations!” Expertise explore, make observations Collect samples
  • 4. “No one looks under a microscope anymore. It’s all DNA.” How do scientists make discoveries?
  • 5. How do we bring experts into the loop? • From direct collection of data, direct observation of results, direct interpretation and analysis • To automated data collection, automated filtering and automated analysis • Need visualization to bring experts into the loop • But how do we handle big data? • What’s our Big Data microscope? “Picard: Computer, scan everything, run diagnostics, and tell us the answer.” “Computer: Results are inconclusive.”
  • 6. Can Big Displays help? • Evidence suggests that these environments can have a positive impact on perception and cognition • But how do we use them to effectively address big data problems? • Can existing visualizations simply be ‘scaled-up’ to fit or are new approaches needed?
  • 7. In this thesis I will… Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics I worked closely over several years with a team of computational biologists This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)
  • 8. Outline 1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop 2) Examine potential impact of Big Displays on Big Data visualization 3) Evaluate scalability in existing comparative genomics visualizations My work: BactoGeNIE 4/5/6) Describe my design, implementation, results 7) Think about the future In the process, learn something about scaling up visual approaches to big data and big displays
  • 9. Warning: Biology is used in this thesis!
  • 10. Genome sequencing boom • Sequencing costs decreasing faster than Moore’s Law • So, we are able to produce massive volumes of sequence data • Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
  • 11. What is a genome? What is a gene? • Genomes consist of one or more long molecules of ‘DNA’ • DNA consists of chained nucleotide molecules (A, C, T, G) also called ‘base pairs’ • All the genes in an organism are in its ‘genome’ • Genes determine traits in an organism • Genes ‘code’ for proteins, and proteins do the work to make traits happen
  • 12. How are genomes sequenced? • Sequencing • Assembly • Annotation • Output: – Genome feature files – Raw sequence files Michael Schatz Cold Spring Harbor
  • 13. Lots of genome sequences-> opportunity Big challenge: Hard to figure out what a novel gene does • Traditionally: do wet-lab research to figure out – but expensive, time-consuming • Sequence the gene, and use computational methods to predict the function of the protein – If novel gene, may not provide answer • Can complete genome sequences help? • Comparative gene neighborhood analysis
  • 14. From genome structure to gene-product function • In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome. • Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors • But, need to examine lots of genomes for statistical significance? gene1 gene2 gene3 gene4 Biological process ?
  • 15. Comparing gene neighborhoods across different genomes • Genes with similar sequences likely produce proteins with similar functions • Orthologs: similar genes from different genomes • Algorithms to compare genes between different genomes DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2
  • 16. Role for visualization in this problem • Why not use automated methods to find common sets of genes around gene targets? • Why visualization? • 3 E’s: Exploration, Expertise, Errors
  • 17. Exploration • Patterns and anomalies without knowing in advance what you are looking for • Automated methods: Target: gene B; Common subsequences: Strains 1, 2, 3: {A, B, C, D} [Figure: gene-order diagrams for Strains 1, 2 and 3 in four panels: Duplication, Truncation, Deletion, Inversion]
  • 18. Expertise • Experts make connections that will be missed by automated methods – Not just the anomaly, but significance of the anomaly – Knowledge about strains, protein families involved in finding significant anomalies [Figure: StrainA, StrainB, StrainC, with a “!” marker]
  • 19. Errors • Verify automated methods • Uncertainty and errors in data generation [Figure: two panels labeled “Breaks in assembly” and “Missed gene boundaries”, each comparing the data, the automated-method output, and the ground truth for Strains 1, 2 and 3. Automated methods, one panel: Common subsequences: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, D}. Automated methods, other panel: Common subsequences: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, B}]
  • 20. To address this problem: • Visualization must help bring experts into the data mining loop 1) Helps experts identify sources of error 2) Allows experts to explore the data 3) Enables researchers to integrate expertise in data analysis So: overview visualization not enough. Need gene-neighborhood details • Visualization must scale to enable comparisons between hundreds to thousands of genomes
  • 21. Big displays: Opportunity for big data? • The question is: can these environments be used to visualize big data sets better? • Evidence suggests yes: – Physical navigation over virtual navigation • Reduced need to pan and zoom • Reduced need for context switching • Utilize embodied cognition • Multiple levels of detail accessible through physical movement – Externalize more information that can be accessed simultaneously Lance Long
  • 22. Porting from small to big displays • Maybe porting genome visualizations to these environments is sufficient? • Ruddle2013: – Export high-resolution graphical output from existing genomics visualizations – Display these large images on big display – Evidence that this had a positive impact on researcher reasoning • However, effective visualization on big displays involves more than simply scaling up the representation
  • 23. Pixel-Density Scalability • As pixel-density increases, does a visual approach take advantage of increased pixels-per-inch to show more entities, relationships or to show data at higher detail? Evaluation: • High-Density Representation? • Use increased pixels per inch to show more entities and relationships at higher detail? • Simultaneous detail and overview? • With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
  • 24. Display-Size Scalability • As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships? Evaluation • Encode big data spatially • Cluster related elements: • spatial memory • direct, visual comparisons • Physical navigation over virtual navigation: • Overviews at a distance, details up-close
  • 25. Perceptual and Analytic Task Scalability • Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels? • Does perception suffer if you scale the approach up? • Analytic tasks performed pre-attentively • Analytic tasks aided by visual queries • Aids to visual search for performing analytic tasks
  • 26. Examining current genomic data visualizations • Does it address this problem? • Show gene neighborhoods • Comparative • Does this visualization allow comparison between more than a few gene neighborhoods? • If you scale the visual approach up, does it: • Allow more comparisons of gene neighborhoods (Analytic Task Scalability) • Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability) • In the process, remain sensible to a human viewer (Perceptual scalability)
  • 27. Line-based comparative approaches • On load, align 1-2 genes to a chosen gene in a reference genome • Draw a line or a band to connect orthologs • In many cases, repurpose genome browsers to be comparative by adding comparative track • Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England). McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current protocols in Bioinformatics Hoboken, NJ, USA: John Wiley & Sons
  • 28. Line-based approaches expanded: Mauve • Like parallel coordinates • Draw lines between orthologs • Color genes by their block within that genome (not colored by orthology) • Example shows 9 genomes Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  • 29. Line-based approaches: Critique • Pixel-density scalable? – Not a high-density representation – Need space for the ‘comparative track’ • Display size scalable? – Hard to follow lines across a display – Hard to compare similar neighborhoods across the display – No overview from a distance, details up close • Perceptual scalability for comparing gene neighborhoods? – Lots of visual clutter – Comparisons not pre-attentive – No aid to visual search • Number of genomes – Published up to 9 – Private groups have adapted frameworks for 10-50 genomes on big display Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  • 30. PSAT: Color and alignment • PSAT – Orthologs encoded using color – Strand on which gene is positioned is encoded by orientation to the center line – Text is given by default Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  • 31. PSAT: Critique • Pixel-Density Scalability – Not high-density representation because of text labels • Perceptual scalability for comparing gene neighborhoods? – Can’t scale to large number of genes: not enough colors Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  • 32. GeneRiViT: Alignment and color • GeneRiViT – Align against arbitrary gene – Color by presence/absence – Examples show 4 genomes – Critique: • No discussion of scalability • Overview visualization • Doesn’t address our problem Price, A. et al "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  • 33. Dot plots • Coordinates of genes in two genomes are used as x and y axes • Orthologous genes in other genomes are plotted • Each genome given a unique color • Critique: – Doesn’t provide ‘gene-neighborhood’ view – Overview tool – Hard to follow beyond a few genomes Price, A. et al "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  • 34. Overview Visualization: Sequence Surveyor • Not this domain problem, but interesting approach • Each gene is drawn as a rectangle • Several possible variables for position: Ordinal position • Several possible variables for color: – Position in one reference genome – Use a color ramp, for wide range of colors Albers, D. et al "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  • 35. Overview Visualization: Sequence Surveyor • Pixel-density scalable – High-density representation – High-detail representation • Display size scalability – May be difficult to compare patterns from one side of display to another • Perceptual Scalability – Colors allow for pre-attentive identification of patterns – Avoids visual clutter Albers, D. et al "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  • 36. Copy number variations on big displays • Orchestral: – Visualization of a different data type – Effective use of color to enable pre-attentive identification of similarities across genomes – High-density representation – Details-up-close, overview from a distance Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
  • 38. Program details • Implemented in C++ using Qt and the QGraphicsView framework • Upload: – genome feature files – FASTA files (raw gene sequences) • CD-HIT algorithm processes sequence files to compute ortholog ‘clusters’ • MySQL database to store big datasets – Loads 1000 contigs into memory, rest stored in database • Optimized for PubMed datasets • Prototyped on E. coli draft genomes – Capable of displaying any contigs from thousands of E. coli draft genomes • On EVL Cyber-commons wall, around 400 contigs in view
  • 39. BactoGeNIE: High density representation • Compressed genome encoding • No text labels, instead ‘on-demand’ • No ‘comparative track’ • Encode orthology using – User applied color: pre-attentive orthology identification – Coordinated highlighting: scalable visual query – Alignment: use space to encode similarity
  • 40. Use space to encode similarity • Goals: – Make it easier to perform comparisons across many genomes (Analytic task scalability) – Accommodate increased display size (Display Size Scalability) – Make similarities and differences easy to see (Perceptual Scalability) • Sorting and Alignment – Sort by contig length – Sort by gene content – Dynamically align against any gene
  • 41. Interactivity • On hovering, contig expands in height, so it is easier to select genes of interest in high-density view • ‘Pop-up’ menu for each gene that gives info and allows for: – application of color: • ‘tagging’ operation • Scalable query – ‘targeting’ operation (described next) • User can sort genomes by: – Gene target – Contig length
  • 42. ‘Gene Targeting’ Function to create high resolution, comparative ‘maps’ • User selects a gene of interest • This gene is given a base color • Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’ • Orthologous genes in related genomes are given the same colors • Contigs containing this gene are brought to the top • The target gene is centered • Orthologs are aligned to the target
  • 43. Gene targeting function • Clustering to promote direct comparisons • Overviews at a distance • Details up close • Pre-attentive identification of similarities and differences between gene neighborhoods Lance Long
  • 45. Pixel-density Scalability BactoGeNIE fits the pixel-density scalability criteria: High-density data display, identifier display and orthology encoding
  • 46. Display Size Scalability • BactoGeNIE is the only approach to use clustering and show multiple levels of detail
  • 47. Perceptual Scalability and Analytic Tasks BactoGeNIE: • Similarity is pre-attentively accessible • Avoids visual clutter • Visual query for orthologs
  • 48. Graphical Scalability: Display Resolution vs Number of Genomes [Chart: axes labeled Pixels (ticks 480, 720, 1080, 1440, 2160, 2880, 3240, 4320) and Genomes (ticks 0 to 1000 in steps of 100); series: BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco, Mauve]
  • 49. Preliminary User Feedback • A version of BactoGeNIE used by computational biology team on NxN pixels and MxM inches resolution tiled display wall • “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes” • “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest. • This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons. • The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns. • This tool has a potential to serve both for visualization as well as data mining needs.” Usage of a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users
  • 50. Summary of contributions • A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view • First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps • First to show overviews with gene neighborhood details that can be accessed through physical movement • Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
  • 51. What’s next? Design • Integration with different levels of detail • Multiple color ramps • Advanced ordering in y, based on similarity to target or strain phylogeny Implementation • Scalability in rendering using parallelization on the GPU • Port to SAGE Evaluation • User studies and evaluations of perceptual scalability
  • 52. Scalable Design, Big Data, Big Displays • Need visualization to provide an interface between automated analysis and the expert • Porting existing visual approaches to big data and big displays will not always work • Need to design for increased – pixel-density – display size – volume of analytical tasks
  • 53. Thanks! • Acknowledgements: – Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory – Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto
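The ‘gene targeting’ steps listed on slide 42 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not BactoGeNIE's actual C++/Qt code: each contig is assumed to be an ordered list of ortholog-cluster labels, the function name `gene_targeting` and its `window` parameter are hypothetical, colors are returned as abstract (ramp, distance) pairs rather than RGB values, and strand and inversions are ignored.

```python
def gene_targeting(contigs, reference, target_index, window=2):
    """Sketch of slide 42: color, reorder and align contigs around a target gene.

    contigs: dict mapping contig name -> ordered list of ortholog-cluster labels
    (an assumed, simplified data model).
    """
    ref = contigs[reference]
    target = ref[target_index]

    # The target gets a base color; neighbors get a position on one of two
    # ramps, 'upstream' or 'downstream', indexed by distance from the target.
    colors = {target: ("base", 0)}
    for d in range(1, window + 1):
        if target_index - d >= 0:
            colors.setdefault(ref[target_index - d], ("upstream", d))
        if target_index + d < len(ref):
            colors.setdefault(ref[target_index + d], ("downstream", d))

    # Colors are keyed by cluster label, so orthologs in other contigs
    # automatically receive the same color.

    # Contigs containing the target are brought to the top.
    with_target = [name for name, genes in contigs.items() if target in genes]
    without = [name for name, genes in contigs.items() if target not in genes]

    # Horizontal offset (in gene slots) that puts each contig's first copy of
    # the target in column 0, so orthologs line up with the target.
    offsets = {name: -contigs[name].index(target) for name in with_target}
    return colors, with_target + without, offsets


if __name__ == "__main__":
    demo = {
        "strain3": ["E", "F"],
        "strain1": ["A", "B", "C", "D"],
        "strain2": ["X", "A", "B", "C"],
    }
    colors, order, offsets = gene_targeting(demo, "strain1", 1)
    print(colors, order, offsets)
```

Keying the color map by ortholog cluster rather than by individual gene is what makes the coloring comparative: a deletion or rearrangement in one strain shows up as a break in the ramp.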

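Slide 14's idea of predicting function from commonly recurring gene neighbors can also be made concrete. The sketch below is an assumption-laden illustration rather than the method used by any tool named in the slides: genomes are assumed to be ordered lists of ortholog-cluster labels, `neighbor_frequencies` and its parameter `k` are hypothetical names, and only the first copy of the target in each genome is considered.

```python
from collections import Counter


def neighbor_frequencies(genomes, target, k=2):
    """Fraction of target-containing genomes in which each gene occurs
    within k positions of the target (first occurrence only)."""
    counts = Counter()
    n_with_target = 0
    for genes in genomes.values():
        if target not in genes:
            continue
        n_with_target += 1
        i = genes.index(target)
        # Genes up to k slots upstream and downstream of the target;
        # a set, so a gene counts at most once per genome.
        neighborhood = set(genes[max(0, i - k):i] + genes[i + 1:i + 1 + k])
        counts.update(neighborhood)
    if n_with_target == 0:
        return {}
    return {gene: c / n_with_target for gene, c in counts.items()}


if __name__ == "__main__":
    demo = {
        "strain1": ["A", "B", "C", "D"],
        "strain2": ["A", "B", "D"],
        "strain3": ["X", "B", "C"],
    }
    print(neighbor_frequencies(demo, "B", k=1))
```

A summary like this is the kind of automated output that slides 16 to 19 argue still needs visual inspection, since duplications, assembly breaks and missed gene boundaries all distort the counts.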
Editor's Notes

  1. To paint with a very broad brush… Science combined observation with experimentation, manual data collection and manual visualization. Very effective formula.
  2. To give a personal example: my grandma. Blood scientist. Took samples, looked under microscope, saw something strange, remembered things about the patient, made connections, ran down the hall to her colleagues: “Eureka!” She used to tell me: Observations were what really mattered!
  3. Fix slide. She kept up with her field; reads journal articles today. Says: “No one looks under a microscope. It’s all DNA.” Worse, now it is heading toward ‘big data’. How will we make observations? Big data depends on automation. Data collected through digital sensors. Processed and filtered automatically. Analyzed with computational methods and data mining. How will we make observations? What if there are errors in these processes? Need visualization to put the expert in this automated loop. But accomplishing this effectively for big data is challenging.
  4. Define this more precisely
  5. Picture: ( “Computer! Tell me the answer please.” )
  6. Define ortholog. Define assembly.