3. Outline
• Big data and how it changes the way we do
science
• How visualization and big displays may help
• Walk through the process of building a new
visualization with scientists for big displays
• Discuss current and future work and where I
think we are headed
4. My grandma
• Lab tech at a university specializing in
blood
– Blood types, blood groups, antibodies,
transfusion
– Everything to do with blood
• Bottom-up understanding of how science
works
• Informed my understanding of science
5. My grandma’s story of science
Collect samples
Look through the
microscope
See patterns
See anomalies
Call upon
expertise
6. My grandma’s recipe for good science
Science!
- unlimited observations
- deep expertise
- endless curiosity
- lots of grunt work
- a pinch of luck
- a community of
scientists
- Know where your samples come
from
- Record observations
- Be careful of your assumptions
- Don't be in a tunnel
- Consider alternate explanations
Bake for 40+ years.
Serves your community
and humanity
Eureka!
7. Science and scientists
• Looking at other scientists, something in common:
– People looking, observing, thinking, exploring, communicating, making
decisions
– using human expertise, curiosity, confusion, excitement…
• Deeply human process to investigate the world and produce new
knowledge
8. Science research in college
• In college I started doing
biology research in an
immunology lab studying how
immune cells developed
• My grandma would ask me if I
looked under the microscope
and made observations
9. My story of science
Generate samples
Apply to a chip
10. My “big data” story of science
• Traditional methods: a student would focus on
collecting one or two data points
– Closer to my grandma’s experience
– Couldn’t directly observe these molecules, but you
were isolating a small picture and collecting a small
result that could be understood
• New methods: digitized data collection allowed
one student to collect thousands of data points
– Potentially more comprehensive picture
– Fast, efficient and cheap
– But very difficult to directly understand and control
11. Science not from my grandma’s recipe
Science!
- unlimited observations
- deep expertise
- endless curiosity
- lots of grunt work
- a pinch of luck
- a community of
scientists
- Know where your samples come
from
- Record observations
- Be careful of your assumptions
- Don't be in a tunnel
- Consider alternate explanations
Bake for 40+ years.
Serves your community
and humanity
• I was removed from generating the
data (black box)
• Few observations:
– The processes we studied were
too small to observe
– Our big data result was hard to
map into what I knew
• Assumptions and tunnel thinking were
baked into every step of this
process
– By necessity
– Too hard to consider alternate
explanations
12. The opportunity and cost of big data
• Measuring more, drawing a
big picture
• But our capacity to
understand does not grow
at the same rate as our
data
• For me, for my grandma:
disorientation from losing
direct connection with
science
“No one looks under a
microscope anymore.
It is all DNA and
computers and chips.
How do we make
discoveries?”
13. Role for computational approaches in
big data and science
• Automated systems will help with
big data
• But it is not just about computers
giving us answers
• People can build and transmit
new scientific knowledge
• We need to give scientists access
points to computational methods
• To do this, borrowing my grandma’s
image: we need some sort of big
data microscope
Picard: “Computer: scan everything, run
diagnostics, and tell us the answer.”
Computer: “The answer is 42.”
14. There is a computer science field which focuses on
bringing scientists back ‘into the loop’ and
building ways for scientists to observe, explore,
use prior knowledge, share findings…
(all the things that have made science work)
DATA VISUALIZATION
16. What is data visualization?
Goal: Visually representing data on interactive
devices so that users can view, explore and
analyze data and share findings with others
17. Powerful human visual system
• Around 60% of our brain is involved in
processing visual information
• Evolved to recognize visual patterns, outliers
and trends
• To bring our expertise to bear on data, we just
need good visual representations of data
18. Research in visualization
• Several international research conferences and journals on data visualization
– http://www.ieeevis.org/
• Questions:
– How to best represent data of different types
– How to design efficient algorithms for representing data
– How to help users perform different kinds of tasks
– How to design new ways to interact with visualizations
• Combines computer science, psychology, art, math/statistics, diverse application
domains (sciences, engineering, business, humanities, journalism, sports…)
19. My lucky break into the data vis world
• Just starting my MS degree in computer science, I discovered the
Electronic Visualization Lab at University of Illinois at Chicago
– Big displays, touch displays, stereoscopic displays, gesture recognition
• One day later: A group of biologists had a big data problem and believed
new visualizations and big displays could help
• No one in the lab knew biology
This is amazing.
I need to work
here!
22. EVL history
• Founded in
1973
• art/CS lab
• Developing
new
environments
for visualizing
data and
collaborating
23. EVL today: Big displays for big data
• Big data revolution in
science
• At the same time:
display resolutions and
sizes also increasing
• Improved rendering
power from graphics
cards
• Tiled display walls
using
– Display clusters
– Single machine with
multiple graphics
cards
24. Can big displays help with big data?
• These environments are cool and futuristic and
beautiful but…
• Can they help us solve big data problems?
25. BactoGeNIE overview
• Worked with a team of biologists who had thousands of bacterial genomes and a large tiled display
wall
• We learned that we needed new visualizations that would
– Scale up to the wall
– Scale up to large data volumes
Next: the motivating problem and how I came up with the design, as an example of how big displays could
help with big data.
26. My biology collaborators and their
genome sequencing boom
• In 2000 it took billions of dollars and
hundreds of researchers to
sequence the human genome
• Since then, changes in genome
sequencing technology enabled
cheap and fast genome
sequencing
• My bacterial genomics
collaborators suddenly could
sequence thousands of complete
genomes of closely
related bacterial strains
27. Why are bacterial genome sequences
important?
• Understanding bacterial
genomes will help us
– Develop antibiotics
– Understand antibiotic
resistance
– Find genes that may be useful
in drug development and
agriculture
• Finding subtle differences
between genomes in related
strains may help us explain why
strains of bacteria have
different properties
– E.g., one is antibiotic resistant,
another is not
https://www.patricbrc.org/portal/portal/patric/Home
28. What is a genome sequence? What
does the data look like?
• Genome: the complete genetic material of an organism;
consists of a set of long sequences of nucleotides,
the chemical building blocks of DNA
• Gene: a small sequence of nucleotides within a
genome that encodes a product, such as a protein,
which performs functions in an organism.
• Genomic data includes
– Sequence: a linear sequence of subunits called
nucleotides.
– Annotation: position of genes and other elements within the
genome sequence
• With the gene sequences, we can identify related genes
across different genomes: orthologs
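To make the two kinds of genomic data above concrete, here is a minimal Python sketch. The `Gene` record, its field names, and the `ortholog_group` labels are hypothetical and chosen only for illustration; they are not the file format or data model used in the project.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """One annotated gene: where it sits in a genome (hypothetical fields)."""
    genome: str          # strain / genome identifier
    gene_id: str         # identifier within that genome
    start: int           # start position, in nucleotides
    end: int             # end position, in nucleotides
    strand: int          # +1 or -1 (orientation)
    ortholog_group: str  # label shared by related genes across genomes


def group_orthologs(genes):
    """Map each ortholog group label to the genes that carry it."""
    groups = defaultdict(list)
    for gene in genes:
        groups[gene.ortholog_group].append(gene)
    return dict(groups)


# Toy annotation for two made-up strains.
genes = [
    Gene("strain1", "s1_g1", 100, 400, +1, "A"),
    Gene("strain1", "s1_g2", 450, 900, +1, "B"),
    Gene("strain2", "s2_g1", 120, 420, -1, "A"),
]
orthologs = group_orthologs(genes)
print(sorted(orthologs))                   # ['A', 'B']
print([g.genome for g in orthologs["A"]])  # ['strain1', 'strain2']
```

In this sketch the sequence itself is left out; only the annotation (positions, orientation) and the ortholog label are kept, since those are what a neighborhood comparison needs.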
29. Specific problem: Comparative Gene
Neighborhood Analysis
• In bacteria, a gene’s neighbors in the
genome may be involved in similar
functions.
• Sequencing genomes would allow
researchers to compare neighborhoods
around interesting genes
• This would allow my collaborators to
– Explore to find new genes
– Dig into differences between gene
neighborhoods in related bacterial strains
[Figure: a gene neighborhood (gene1, gene2, gene3, gene4) and a biological process; question marks denote unknown roles]
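The comparison described on this slide can be sketched in a few lines of Python. This is an illustrative simplification, assuming each genome is reduced to an ordered list of ortholog labels; the function names and the toy strains are hypothetical.

```python
def neighborhood(gene_order, target, flank=2):
    """Return up to `flank` genes on each side of `target`, in genome order."""
    i = gene_order.index(target)  # raises ValueError if target is absent
    return gene_order[max(0, i - flank): i + flank + 1]


def compare_neighborhoods(genomes, target, flank=2):
    """Split neighboring genes into those shared by all strains and the rest.

    Strains that do not contain `target` are ignored.
    """
    hoods = {
        name: set(neighborhood(order, target, flank)) - {target}
        for name, order in genomes.items()
        if target in order
    }
    if not hoods:
        return set(), set()
    shared = set.intersection(*hoods.values())
    variable = set.union(*hoods.values()) - shared
    return shared, variable


# Toy gene orders for three made-up strains.
genomes = {
    "strain1": ["A", "B", "C", "D", "E"],
    "strain2": ["A", "B", "C", "X", "E"],
    "strain3": ["B", "C", "D"],
}
shared, variable = compare_neighborhoods(genomes, target="C", flank=1)
print(shared)            # {'B'}
print(sorted(variable))  # ['D', 'X']
```

Genes in `variable` are the kind of difference between related strains that a researcher might want to dig into.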
30. What we needed
• We needed a visualization that would
– Show the interesting differences and similarities
around genes of interest
– Scale to lots of genomes
– Scale to big displays
• How should we ‘draw’ this genomic data to
help the researchers do their work?
31. First: looked at existing visualizations
• Could they find the features that interested
them?
• Did these scale up to larger numbers of
genomes?
• Designed for
– Small collections of genomes (2-9), small numbers of
genes
– Because the ability to sequence so many genomes is
new!
• Why didn’t they scale?
– Line connections and text: visual clutter as you scale up
– Color to show similarity, but not enough colors
• Conclusion: encodings and layouts incompatible
with large numbers of gene neighborhoods
McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons.
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC Bioinformatics 9.1 (2008): 170.
33. Next: What did they want to observe?
• Content
• Order and orientation
• Context for addressing errors in data verification
[Figure: gene neighborhoods (genes A, B, C, D) in Strain 1, Strain 2 and Strain 3 compared with the ground truth; labels mark breaks, break points and a gap]
35. How to make a high density
design?
• Traditional visualizations use lines and text to
indicate related genes in different genomes
– Low density
– Lots of visual clutter
– Hard to follow on compressed, high pixel-
density displays
• Our solution: High density encoding
– Color to encode similarity
– Removed the text, made it available ‘on
demand’
[Figure: existing, low-density approaches (text "gene id" labels, >100 pixels) compared with BactoGeNIE (8-16 pixels): high-density display; color, not orthology lines; identification on demand]
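A back-of-the-envelope sketch of why the denser encoding matters, in Python. It assumes the pixel figures above are the horizontal space given to one gene, and it uses a hypothetical 8000-pixel-wide wall; neither assumption comes from the slides. The `gene_at` helper illustrates identification on demand: no text is drawn, and a gene id is looked up only when a position is queried.

```python
def genes_per_row(display_width_px, gene_width_px):
    """How many genes fit side by side across the display."""
    return display_width_px // gene_width_px


def layout_row(gene_ids, gene_width_px):
    """Give each gene a horizontal pixel span (left, right, id); no text drawn."""
    return [
        (i * gene_width_px, (i + 1) * gene_width_px, gene_id)
        for i, gene_id in enumerate(gene_ids)
    ]


def gene_at(row, x):
    """Identification on demand: the gene id under pixel x, or None."""
    for left, right, gene_id in row:
        if left <= x < right:
            return gene_id
    return None


WALL_WIDTH_PX = 8000  # hypothetical display width, for illustration only

print(genes_per_row(WALL_WIDTH_PX, 100))  # 80
print(genes_per_row(WALL_WIDTH_PX, 16))   # 500
print(genes_per_row(WALL_WIDTH_PX, 8))    # 1000

row = layout_row(["g1", "g2", "g3"], gene_width_px=12)
print(gene_at(row, 13))  # g2
print(gene_at(row, 40))  # None
```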
36. How to design for large displays?
• Goal: consider whether the design scales up spatially across a big display.
• An increase in display size could hamper the perception of data and relationships.
– For example, when related entities are on opposite ends of the display, direct
comparison is prevented
37. How to design for large displays?
• Solution: clustering, grouping and alignment features
that bring related genes and their
neighborhoods together to enable comparisons
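One way to picture the alignment idea is the Python sketch below. It is a simplified illustration, not the BactoGeNIE implementation: each strain is assumed to be an ordered list of ortholog labels, and rows are padded on the left so the gene of interest falls in the same column everywhere.

```python
def align_on_target(rows, target):
    """Pad each row on the left so that `target` lands in the same column.

    Rows that do not contain `target` are left out of the result.
    """
    offsets = {
        name: order.index(target)
        for name, order in rows.items()
        if target in order
    }
    if not offsets:
        return {}
    column = max(offsets.values())  # column where every target will sit
    return {
        name: [None] * (column - offset) + list(rows[name])
        for name, offset in offsets.items()
    }


# Toy gene orders for three made-up strains.
rows = {
    "strain1": ["A", "B", "C", "D"],
    "strain2": ["C", "D"],
    "strain3": ["B", "C", "E"],
}
aligned = align_on_target(rows, target="C")
for name, row in aligned.items():
    print(name, row)
# strain1 ['A', 'B', 'C', 'D']
# strain2 [None, None, 'C', 'D']
# strain3 [None, 'B', 'C', 'E']
```

With the target in one column, related neighborhoods sit directly above and below each other instead of on opposite ends of the display.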
38. Design to target perceptual scalability
• Perceptual scalability:
– Allow someone to look across a large
number of entities on a big display
surface and see patterns
• Interaction design: gene targeting function:
– User selects a gene of interest
– Scene is reconfigured
– A gradient is applied to neighbors and
orthologs
• Upstream (yellow to green)
• Downstream (yellow to blue)
• Encodes distance to target, order and
orientation
• Outcome: Make priority features stand out
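The gradient can be sketched as a simple color interpolation in Python. The RGB values, the linear interpolation, and the `max_distance` cutoff are assumptions made for illustration; the slide only states the color directions (yellow to green upstream, yellow to blue downstream). This sketch encodes distance and side only, not gene orientation.

```python
YELLOW = (255, 255, 0)
GREEN = (0, 255, 0)
BLUE = (0, 0, 255)


def lerp_color(a, b, t):
    """Linearly interpolate between RGB colors a and b for t in [0, 1]."""
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))


def neighbor_color(offset, max_distance):
    """Color for a gene `offset` positions from the target gene.

    offset < 0: upstream (yellow to green); offset > 0: downstream
    (yellow to blue); offset == 0: the target itself (yellow).
    Distances beyond `max_distance` are clamped to the end color.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    t = min(abs(offset), max_distance) / max_distance
    end = GREEN if offset < 0 else BLUE
    return lerp_color(YELLOW, end, t)


print(neighbor_color(0, 5))   # (255, 255, 0)
print(neighbor_color(-5, 5))  # (0, 255, 0)
print(neighbor_color(5, 5))   # (0, 0, 255)
```

Because the color depends on the signed distance to the selected gene, the same ortholog appearing at a different position in another strain gets a visibly different shade, which is what makes changes in order stand out.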
39. Case Study: Neighborhood around orthologs to a hypothetical
protein in 673 draft genomes from E. coli.
42. Multiple views showing different kinds of biological
data (mock ups)
• Biologists don’t typically examine just one data type,
but many at once
• Difficult to do on small displays
46. 5 years from now? 10 years from
now?
• Digital wallpaper
• Naturalistic inputs to visualization
• ‘smart’ systems to track your behavior
• Advances in graphics
• Enable: High resolution ‘smart rooms’ for science?
47. What excites me
• At their best, scientists express curiosity, passion,
interest and joy through their work
• Big data presents a fantastic opportunity, but needs
data visualization to keep scientists in the loop
• Need data visualization so we can make big decisions
about how to use science
• I hope to see a more visually rich and beautiful world
of technology for science
My grandma was a lab tech at a university and she specialized in everything to do with blood
Blood types, blood groups, antibodies, transfusions
If there was something we know about blood, my grandma was all about it
She had wanted to become a doctor, but it was the 50s and 60s and she faced some discouragement
But, her career gave her this deep bottom-up perspective on how science worked
And she shared this understanding with me when I was a kid
Though it was odd to grow up with lots of stories about blood and someone deeply, deeply invested in blood
She told me lots of stories about the things she found and how her job went, and it generally went something like this:
When she would tell me these stories, it usually ended with some advice for me. And this advice could be expressed as a recipe for good science.
Some grandmas pass along recipes for oatmeal cookies. Others pass along how to do science.
Know where your data comes from
Observations matter
Assumptions are the root of all problems
Try to avoid tunnel thinking
Think about alternate explanations
and of course bake for 50 years at room temperature
So, the core of this recipe, the core of how my grandma did science, was my grandma.
She was the one looking, observing, noticing oddities, following up, bringing all her curiosity and interest and persistence and… all these distinctively human things to the process.
And this is how science has historically been done. My grandma in her community was echoing the same amazing, crazy, passionate insanity that has had such an immense impact on our world.
There are lots of visualizations that pertain to this problem, but it became clear when discussing them with the scientists that these ideas did not scale visually. They were fundamentally not designed for comparative tasks across large collections of genomes.
There has been fantastic work on large displays by X and Y, but not to address this domain problem.