Bacterial Gene Neighborhood Comparison at Scale

Bacterial Gene Neighborhood
Investigation Environment: A
Scalable Genome Visualization for
Big Displays
Jillian Aurisano
Master of Science Defense
April 16, 2014

Science has historically looked like this:

Up until very recently
“Observations!”
Expertise
explore,
make observations
Collect samples

“No one looks under a microscope anymore.
It’s all DNA.”
How do
scientists make
discoveries?

How do we bring experts into the
loop?
• From direct collection of
data, direct observation of
results, and direct interpretation
and analysis
• To automated data
collection, automated
filtering and automated
analysis
• Need visualization to bring
experts into the loop
• But how do we handle big
data?
• What’s our Big Data
microscope?
“Picard: Computer, scan everything, run diagnostics, and tell us the answer.”
“Computer: Results are inconclusive.”

Can Big Displays help?
• Evidence suggests that these environments
can have a positive impact on perception and
cognition
• But how do we use them to effectively
address big data problems?
• Can existing visualizations simply be ‘scaled-
up’ to fit or are new approaches needed?

In this thesis I will…
Examine a specific big data visualization problem:
comparative gene neighborhood analysis in
bacterial genomics
I worked closely over several years with a team of
computational biologists
This work has led to the design and implementation
of a new visualization approach designed to scale to
big data and big displays
BactoGeNIE
(‘Bact(o)erial Gene Neighborhood Investigation
Environment’)

Outline
1) Describe comparative bacterial gene
neighborhood analysis to understand how to
bring experts into the loop
2) Examine potential impact of Big Displays on Big
Data visualization
3) Evaluate scalability in existing comparative
genomics visualizations
My work: BactoGeNIE
4/5/6) Describe my design, implementation, results
7) Think about the future
In the process, learn something about scaling up
visual approaches to big data and big displays

Warning: Biology is used in this thesis!

Genome sequencing boom
• Sequencing costs
decreasing faster
than Moore’s Law
• So, we are able to
produce massive
volumes of
sequence data
• Bacterial genomes
are small, so we are
generating
thousands of
complete bacterial
genome sequences
Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012
<www.genome.gov/sequencingcosts>

What is a genome? What is a gene?
• Genomes consist of one or
more long molecules of ‘DNA’
• DNA consists of chained
nucleotide molecules
(A, C, T, G) also called ‘base
pairs’
• All the genes in an organism
are in its ‘genome’
• Genes determine traits in an
organism
• Genes ‘code’ for proteins, and
proteins do the work to make
traits happen

How are genomes sequenced?
• Sequencing
• Assembly
• Annotation
• Output:
– Genome feature
files
– Raw sequence
files
Michael Schatz
Cold Spring Harbor

Lots of genome sequences->
opportunity
Big challenge: Hard to figure out what a novel gene
does
• Traditionally: do wet-lab research to figure out
– but expensive, time-consuming
• Sequence the gene, and use computational
methods to predict the function of the protein
– If novel gene, may not provide answer
• Can complete genome sequences help?
• Comparative gene neighborhood analysis

From genome structure
to gene-product function
• In bacteria, genes
whose products are
involved in similar
functions are often placed
close to each other in
the genome.
• Research suggests that
it is possible to predict
gene-product function
in bacteria based on
commonly recurring
gene neighbors
• But, need to examine
lots of genomes for
statistical significance?
[Figure: gene1, gene2, gene3, gene4; label “Biological process”; “?”]
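The idea of "commonly recurring gene neighbors" can be sketched as a simple count across genomes. The Python below is an illustrative sketch only and is not taken from the thesis: it assumes each genome is available as an ordered list of ortholog labels, and the names `neighbor_counts` and `window` are hypothetical.

```python
from collections import Counter


def neighbor_counts(genomes, target, window=2):
    """Count in how many genomes each gene occurs near `target`.

    `genomes` maps a genome name to an ordered list of gene labels
    (assumed here to be ortholog identifiers). A gene is "near" the
    target if it lies within `window` positions of any copy of it.
    """
    counts = Counter()
    for genes in genomes.values():
        nearby = set()
        for i, gene in enumerate(genes):
            if gene != target:
                continue
            lo = max(0, i - window)
            hi = min(len(genes), i + window + 1)
            nearby.update(g for g in genes[lo:hi] if g != target)
        # Each genome contributes at most one vote per neighboring gene.
        counts.update(nearby)
    return counts


# Toy data (hypothetical): three strains around the target gene3.
genomes = {
    "strain1": ["gene1", "gene2", "gene3", "gene4"],
    "strain2": ["gene9", "gene2", "gene3", "gene4"],
    "strain3": ["gene1", "gene2", "gene3", "gene7"],
}
counts = neighbor_counts(genomes, target="gene3", window=1)
```

A gene that neighbors the target in most genomes (here gene2) is a candidate for shared function; with only three genomes the count says little, which is why many genomes are needed.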

Comparing gene neighborhoods across
different genomes
• Genes with similar sequences likely produce
proteins with similar functions
• Orthologs: similar genes from different genomes
• Algorithms to compare genes between different
genomes
DeMeo et al. BMC Molecular
Biology 2008 9:2
doi:10.1186/1471-2199-9-2

Role for visualization in this problem
• Why not use automated methods to find
common sets of genes around gene targets?
• Why visualization?
• 3 E’s: Exploration, Expertise, Errors

Exploration
• Patterns and anomalies without knowing in advance what you are looking for
• Automated methods:
– Target: gene B
– Common subsequences: Strains 1, 2, 3: {A, B, C, D}
[Figure: four panels labeled Duplication, Truncation, Deletion and Inversion, each showing genes A, B, C, D in Strain 1, Strain 2 and Strain 3]
Expertise
• Experts make connections that will be missed by
automated methods
– Not just the anomaly, but significance of the anomaly
– Knowledge about strains, protein families involved in
finding significant anomalies
[Figure: StrainA, StrainB, StrainC; “!” marker]

Errors
• Verify automated methods
• Uncertainty and errors in data generation
[Figure: two panels, captioned “Breaks in assembly” and “Missed gene boundaries”. Each shows Data and Ground truth rows for Strain 1, Strain 2 and Strain 3 over genes A, B, C, D, next to the automated result.]
• Automated methods, first panel: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, D}
• Automated methods, second panel: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, B}

To address this problem:
• Visualization must help bring experts into the
data mining loop
1) Helps experts identify sources of error
2) Allows experts to explore the data
3) Enables researchers to integrate expertise in data
analysis
So: overview visualization not enough.
Need gene-neighborhood details
• Visualization must scale to enable comparisons
between hundreds to thousands of genomes

Big displays: Opportunity for big data?
• The question is: can these environments be used to
visualize big data sets better?
• Evidence suggests yes:
– Physical navigation over virtual navigation
• Reduced need to pan and zoom
• Reduced need for context switching
• Utilize embodied cognition
• Multiple levels of detail accessible through physical movement
– Externalize more information that can be accessed
simultaneously
Lance Long

Porting from small to big displays
• Maybe porting genome visualizations to these
environments is sufficient?
• Ruddle2013:
– Export high-resolution graphical output from
existing genomics visualizations
– Display these large images on big display
– Evidence that this had a positive impact on
researcher reasoning
• However, effective visualization on big displays
involves more than simply scaling up the
representation

Pixel-Density Scalability
• As pixel-density increases, does a visual approach take
advantage of increased pixels-per-inch to show more
entities and relationships, or to show data at higher detail?
Evaluation:
• High-Density Representation?
• Use increased pixels per inch to show more entities and
relationships at higher detail?
• Simultaneous detail and overview?
• With increased pixel density, representation shows details
and overviews at the same time, without relying on
Focus+Context

Display-Size Scalability
• As display size increases, does a visual approach take
advantage of the increased space to depict more
entities or relationships?
Evaluation
• Encode big data spatially
• Cluster related elements:
• spatial memory
• direct, visual comparisons
• Physical navigation over virtual navigation:
• Overviews at a distance, details up-close

Perceptual and Analytic Task Scalability
• Does a visual approach scale up to enable the
performance of an analytic task across more
data, more space, and more pixels?
• Does perception suffer if you scale the approach
up?
• Analytic tasks performed pre-attentively
• Analytic tasks aided by visual queries
• Aids to visual search for performing analytic tasks

Examining current genomic data
visualizations
• Does it address this problem?
• Show gene neighborhoods
• Comparative
• Does this visualization allow comparison between
more than a few gene neighborhoods?
• If you scale the visual approach up, does it:
• Allow more comparisons of gene neighborhoods (Analytic
Task Scalability)
• Take advantage of big displays in size and pixel-density
(Pixel-Density Scalability and Display-Size Scalability)
• In the process, remain sensible to a human viewer
(Perceptual scalability)

Line-based comparative approaches
• On load, align 1-2 genes to
a chosen gene in a
reference genome
• Draw a line or a band to
connect orthologs
• In many cases, repurpose
genome browsers to be
comparative by adding
a comparative track
• Tools: PSAT, GBrowse_syn,
SynView, ACT, CGAT,
Combo, MizBee, Mauve
Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England).
McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons.

Line-based approaches expanded:
Mauve
• Like parallel
coordinates
• Draw lines between
orthologs
• Color genes by their
block within that
genome (not colored
by orthology)
• Example shows 9
genomes
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved
genomic sequence with rearrangements." Genome research 14.7
(2004): 1394-140

Line-based approaches: Critique
• Pixel-density scalable?
– Not a high-density representation
– Need space for the ‘comparative track’
• Display size scalable?
– Hard to follow lines across a display
– Hard to compare similar neighborhoods
across the display
– No overview from a distance, details up
close
• Perceptual scalability for comparing
gene neighborhoods?
– Lots of visual clutter
– Comparisons not pre-attentive
– No aid to visual search
• Number of genomes
– Published up to 9
– Private groups have adapted frameworks
for 10-50 genomes on big display
Darling, Aaron CE, et al. "Mauve: multiple
alignment of conserved genomic sequence with
rearrangements." Genome research 14.7 (2004):
1394-140

PSAT: Color and alignment
• PSAT
– Orthologs encoded
using color
– Strand on which gene
is positioned is
encoded by
orientation to the
center line
– Text is given by
default
Fong, Christine, et al. "PSAT: a
web tool to compare genomic
neighborhoods of multiple
prokaryotic genomes." BMC
bioinformatics 9.1 (2008): 170.

PSAT: Critique
• Pixel-Density
Scalability
– Not high-density
representation
because of text labels
• Perceptual scalability
for comparing gene
neighborhoods?
– Can’t scale to a large
number of genes: not
enough colors
Fong, Christine, et al. "PSAT: a
web tool to compare genomic
neighborhoods of multiple
prokaryotic genomes." BMC
bioinformatics 9.1 (2008): 170.

GeneRiViT: Alignment and color
• GeneRiViT
– Align against arbitrary
gene
– Color by
presence/absence
– Examples show 4 genomes
– Critique:
• No discussion of scalability
• Overview visualization
• Doesn’t address our
problem
Price, A. et al "Gene-RiViT: A visualization tool
for comparative analysis of gene
neighborhoods in prokaryotes." Biological
Data Visualization (BioVis), 2012 IEEE
Symposium on. IEEE, 2012.

Dot plots
• Coordinates of genes in
two genomes are used
as x and y axis
• Orthologous genes in
other genomes are
plotted
• Each genome given a
unique color
• Critique:
– Doesn’t provide ‘gene-
neighborhood’ view
– Overview tool
– Hard to follow beyond
a few genomes
Price, A. et al "Gene-RiViT: A visualization tool
for comparative analysis of gene
neighborhoods in prokaryotes." Biological

Overview Visualization: Sequence
Surveyor
• Not this domain
problem, but
interesting approach
• Each gene is drawn as a
rectangle
• Several possible
variables for position:
Ordinal position
• Several possible
variables for color:
– Position in one
reference genome
– Use a color ramp, for
wide range of colors
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable
genomic alignment visualization." Visualization and Computer
Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.

Overview Visualization: Sequence
Surveyor
• Pixel-density scalable
– High-density representation
– High-detail representation
• Display size scalability
– May be difficult to compare
patterns from one side of
display to another
• Perceptual Scalability
– Colors allow for pre-attentive
identification of patterns
– Avoids visual clutter
Albers, D. et al. "Sequence surveyor: Leveraging overview
for scalable genomic alignment visualization."
Visualization and Computer Graphics, IEEE Transactions
on 17.12 (2011): 2392-2401.

Copy number variations on big displays
• Orchestral:
– Visualization of a different data type
– Effective use of color to enable pre-attentive
identification of similarities across genomes
– High-density representation
– Details-up-close, overview from a distance
Ruddle, Roy A., et al. "Leveraging
wall-sized high-resolution displays for
comparative genomics analyses of
copy number variation." Biological

Program details
• Implemented in C++ using Qt and the QGraphicsView
framework
• Upload:
– genome feature files
– FASTA files (raw gene sequences)
• CD-HIT algorithm processes sequence files to compute
ortholog ‘clusters’
• MySQL database to store big datasets
– Loads 1000 contigs into memory, rest stored in database
• Optimized for PubMed datasets
• Prototyped on E. coli draft genomes
– Capable of displaying any contigs from thousands of E. coli draft
genomes
• On EVL Cyber-commons wall, around 400 contigs in view

BactoGeNIE: High density
representation
• Compressed genome
encoding
• No text labels by default;
labels shown ‘on-demand’
• No ‘comparative track’
• Encode orthology using
– User-applied color: pre-attentive
orthology identification
– Coordinated
highlighting: scalable
visual query
– Alignment: use space to
encode similarity

Use space to encode similarity
• Goals:
– Make it easier to perform comparisons across many
genomes (Analytic task scalability)
– Accommodate increased display size (Display Size
Scalability)
– Make similarities and differences easy to see
(Perceptual Scalability)
• Sorting and Alignment
– Sort by contig length
– Sort by gene content
– Dynamically align against any gene

Interactivity
• On hovering, contig expands in height, so easier
to select genes of interest in high-density view
• ‘Pop-up’ menu for each gene that gives info and
allows for:
– application of color:
• ‘tagging’ operation
• Scalable query
– “targeting” operation (described next)
• User can sort genomes by:
– Gene target
– Contig length

‘Gene Targeting’ Function to create
high resolution, comparative ‘maps’
• User selects a gene of interest
• This gene is given a base color
• Two color ramps are applied to adjacent genes,
one ‘upstream’ and one ‘downstream’
• Orthologous genes in related genomes are given
the same colors
• Contigs containing this gene are brought to the
top
• The target gene is centered
• Orthologs are aligned to the target

Gene targeting function
• Clustering to
promote direct
comparisons
• Overviews at a
distance
• Details up close
• Pre-attentive
identification of
similarities and
differences between
gene neighborhoods
Lance Long

Pixel-density Scalability
BactoGeNIE fits
the pixel-density
scalability
criteria:
High-density data
display, identifier
display and
orthology
encoding

Display Size Scalability
• BactoGeNIE
is the only
approach to
use
clustering
and show
multiple
levels of
detail

Perceptual Scalability and Analytic
Tasks
BactoGeNIE:
• Similarity is pre-
attentively
accessible
• Avoids visual
clutter
• Visual query for
orthologs

Graphical Scalability:
Display Resolution vs Number of
Genomes
[Chart: number of genomes (y-axis, 0 to 1000) against display resolution in pixels (x-axis: 480, 720, 1080, 1440, 2160, 2880, 3240, 4320) for BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco and Mauve]
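The relationship between display resolution and the number of genomes in view can be approximated with a back-of-envelope calculation. The function below is only a sketch of that arithmetic: the row heights and gaps are invented values, not measurements of any of the tools listed, and `rows_that_fit` is a hypothetical name.

```python
def rows_that_fit(vertical_pixels, row_height_px, gap_px=0):
    """Largest n with n * row_height + (n - 1) * gap <= vertical_pixels."""
    if row_height_px <= 0 or gap_px < 0:
        raise ValueError("row height must be positive and gap non-negative")
    return (vertical_pixels + gap_px) // (row_height_px + gap_px)


# Invented example values: a tall labeled track versus a compressed row.
labeled_rows = rows_that_fit(1080, row_height_px=10, gap_px=2)
compact_rows = rows_that_fit(2160, row_height_px=4, gap_px=1)
```

A representation that needs fewer pixels per genome row gains more rows from every increase in vertical resolution, which is the sense in which a high-density encoding scales with the display.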

Preliminary User Feedback
• A version of BactoGeNIE used by computational biology team on NxN pixels
and MxM inches resolution tiled display wall
• “This tool has been widely used by members of the team to show the
comparative analyses of genomic context for several bacterial genomes”
• “Genome browsers such as JBrowse enable researchers to do comparative
genome analyses for nearly 10-50 genomes. But fail to work when we are
studying several hundreds of genomes of interest.
• This tool is really unique and it’s the only tool that I am aware of that can
scale up to any number of genome comparisons.
• The ability to load multiple tracks of genomes, and the zoom in and out
options with color coding, annotation tracks makes it very convenient for
scientists to quickly look at patterns.
• This tool has a potential to serve both for visualization as well as data mining
needs.”
Usage of a version without the gene targeting approach.
Future study will concentrate on this feature with a wider community of users

Summary of contributions
• A novel design that is the first to enable direct
comparisons between hundreds of gene
neighborhoods in one view
• First interactive, large-scale comparative gene
neighborhood approach, with on-the-fly
sorting, dynamic alignment, user-selected color
and color ramps
• First to show overviews with gene-neighborhood
details that can be accessed through physical
movement
• Introduces a novel visualization approach, ‘gene
targeting’, that translates genomic data into
high-resolution genomic maps

What’s next?
Design
• Integration with different levels of detail
• Multiple color ramps
• Advanced ordering in y, based on similarity to target or
strain phylogeny
Implementation
• Scalability in rendering using parallelization on the GPU
• Port to SAGE
Evaluation
• User studies and evaluations of perceptual scalability

Scalable Design, Big Data, Big Displays
• Need visualization to provide an interface
between automated analysis and the expert
• Porting existing visual approaches to big data
and big displays will not always work
• Need to design for increased
– pixel-density
– display size
– volume of analytical tasks

Thanks!
• Acknowledgements:
– Jason Leigh, Andy Johnson, Khairi Reda, Lance
Long, Uthman Shabazz, and everyone in the
Electronic Visualization Laboratory
– Barry Goldman, David Bush, Niran Iyer, Shawn
Stricklin and the rest of the computational biology
team at Monsanto
