WHAT TO EXPECT
WHEN YOU ARE
VISUALIZING
Krist Wongsuphasawat / @kristw
Based on true stories
Forever querying
Never-ending cleaning
Hopelessly prototyping
Last minute coding
and many more…
Computer Engineer
Bangkok, Thailand
Chulalongkorn University
Programming + Soccer
(P.S. These are actually not my robots, but our competitors’.)
PhD in Computer Science
Information Visualization
Univ. of Maryland
IBM
Microsoft
Data Visualization Scientist
Twitter
#interactive visualizations
Open-source projects
Visual Analytics Tools
DATA + ME = VIS
Me
clients, data, requirements, etc.
WHAT TO EXPECT?
1. EXPECT POTENTIAL MISMATCHES
INPUT (DATA)
What clients think they have vs. what they usually have
YOU
What clients think you are vs. what they will get
OUTPUT (VIS)
What clients ask for vs. what they really need
COMMUNICATE
I need this. Take this.
I need this. Here you are.
I need this. Take this.
& COMPROMISE
2. EXPECT DIFFERENT REQUIREMENTS
DIFFERENT GOALS
Present
Communicate information effectively
Explore
Exploratory analysis, Reusable tools for exploration
Explore + Present
Analyze data + tell story
Enjoy
More flexible
3. EXPECT TO CLEAN DATA
DATA SOURCES
Open data
Publicly available
Internal data
Private, owned by clients’ organization
Self-collected data
Manual, site scraping, etc.
Combine the above
MANY FORMS OF DATA
Standalone files
txt, csv, tsv, json, Google Docs, …, pdf*
APIs
better quality with more overhead
Databases
doesn’t necessarily mean they are organized
Big data
bigger pain
HAVING ALL TWEETS
How people think I feel vs. how I really feel.
CHALLENGES
Get relevant Tweets
hashtag: #oscars
keywords: “spotlight” (movie name)
Too big
Need to aggregate & reduce size
Slow
Long processing time (hours)
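The first two challenges above (get relevant Tweets, then aggregate to reduce size) could look roughly like the sketch below. The tweet shape (`text`, `createdAt`) and the function names are assumptions made for illustration, not Twitter's actual schema or the scripts used in the talk.

```javascript
// Hypothetical tweet shape: { text: string, createdAt: ISO-8601 string }.
const HASHTAGS = ['#oscars'];
const KEYWORDS = ['spotlight'];

// Keep a tweet if its text contains any hashtag or keyword (case-insensitive).
function isRelevant(tweet) {
  const text = tweet.text.toLowerCase();
  return HASHTAGS.concat(KEYWORDS).some(term => text.includes(term));
}

// Reduce size: keep one count per hour instead of every relevant tweet.
function countPerHour(tweets) {
  const counts = {};
  tweets.filter(isRelevant).forEach(tweet => {
    const hour = tweet.createdAt.slice(0, 13); // e.g. "2016-02-28T20"
    counts[hour] = (counts[hour] || 0) + 1;
  });
  return counts;
}
```

A plain keyword such as "spotlight" also matches tweets that have nothing to do with the movie, so a real filter would probably need extra context.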
GETTING BIG DATA
Hadoop Cluster: Data Storage → Tool: Pig / Scalding (slow) → Smaller dataset
Your laptop: Smaller dataset → Tool: node.js / python / excel (fast) → Final dataset
CLEANING
Data come in different formats.
tsv to json
Quality of data collection.
null, missing data, typos, timestamp
Filter
Remove unnecessary data
Conversion
Change country code from 3-letter (USA) to 2-letter (US)
Correct time of day based on users’ timezone
Convert lat/lon to county
etc.
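A minimal sketch of three of the cleaning steps listed above (tsv to json, filtering out missing values, and country code conversion). The column names and the tiny lookup table are illustrative assumptions; a real table would cover every ISO 3166-1 code.

```javascript
// Illustrative lookup only, not a complete ISO 3166-1 mapping.
const ALPHA3_TO_ALPHA2 = { USA: 'US', THA: 'TH', GBR: 'GB' };

// Parse tab-separated text (first line is the header) into an array of objects.
function tsvToJson(tsv) {
  const [header, ...rows] = tsv.trim().split('\n');
  const keys = header.split('\t');
  return rows.map(row => {
    const cells = row.split('\t');
    const record = {};
    keys.forEach((key, i) => { record[key] = cells[i]; });
    return record;
  });
}

// Drop records with a missing country, then convert 3-letter codes to 2-letter.
// Codes that are not in the lookup are passed through unchanged.
function clean(records) {
  return records
    .filter(r => r.country && r.country !== 'null')
    .map(r => Object.assign({}, r, {
      country: ALPHA3_TO_ALPHA2[r.country] || r.country,
    }));
}
```

Keeping parsing and cleaning as separate small functions makes each step easy to re-run when the data changes.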
4. EXPECT TO CLEAN DATA A LOT
70-80% of time cleaning data
“DATA JANITOR”
WHY?
Definition of “clean” depends on the task.
e.g. Restaurant reviews
USER  RESTAURANT   RATING
=========================
A     MCDONALD’S   3
B     MCDONALDS    3
C     MCDONALD     4
D     MCDONALDS    5
E     IHOP         4
F     SUBWAY       4
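If the task is an average rating per restaurant, the three spellings of McDonald's have to be merged first. One way to do that is sketched below; the alias table and function names are hypothetical and would have to be maintained by hand for real data.

```javascript
// Uppercase and strip everything except letters and digits,
// so "MCDONALD’S" and "MCDONALDS" collapse to the same key.
function normalizeName(name) {
  return name.toUpperCase().replace(/[^A-Z0-9]/g, '');
}

// Hand-maintained aliases for spellings that normalization alone cannot merge.
const ALIASES = { MCDONALD: 'MCDONALDS' };

function canonicalName(name) {
  const key = normalizeName(name);
  return ALIASES[key] || key;
}

// Average rating per canonical restaurant name.
function averageRatings(reviews) {
  const totals = {};
  reviews.forEach(({ restaurant, rating }) => {
    const key = canonicalName(restaurant);
    if (!totals[key]) totals[key] = { sum: 0, n: 0 };
    totals[key].sum += rating;
    totals[key].n += 1;
  });
  const averages = {};
  Object.keys(totals).forEach(key => {
    averages[key] = totals[key].sum / totals[key].n;
  });
  return averages;
}
```

For a different task, such as counting reviews per user, the same table is already clean and none of this is needed.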
Data issues can surface at any point
in the project timeline.
RAMSAY & RAMSEY
It takes time to process data.
Run. Wait… Oops! Re-run. Wait…
RECOMMENDATIONS
Always think that you will have to do it again
document the process, automation
Reusable scripts
break a gigantic do-it-all function into smaller ones
Reusable data
keep for future project
5. EXPECT TO TRY AND BREAK THINGS
#D3BROKEANDMADEART
https://twitter.com/hashtag/d3brokeandmadeart
6. EXPECT TO ITERATE UNTIL IT WORKS
7. EXPECT DEADLINE
EXAMPLE PROJECTS
EXAMPLE 1: STORYTELLING
WHAT TO EXPECT
timely
Deadlines are strict. Unexpected events can also come up.
wide audience
easy to explain and understand, multi-device support
one-off projects
content screening
Reveal the talking points
of every episode of Game of Thrones
from fans’ conversations
Problem is coming.
CHAPTER I
Problem
Want to know what the audience
says about a TV show
from Tweets
HBO’s Game of Thrones
Based on a book series “A Song of Ice and Fire”
Medieval Fantasy. Knights, magic and dragons.
Brief Story
A King dies. 
A lot of contenders wage a war
to reclaim the throne.
Minor characters with no claim to the throne
set their own plans in action to gain power
when all the major characters end up killing each other.
Brave/Honest/Honorable characters die.
Intelligent but shady characters
and characters who know nothing
continue to live.
While humans are busy killing each other,
ice zombies “White walkers” are invading from the North.
The only group that seems to care about this
is a neutral group called the Night’s Watch.
Many characters.
Anybody can die.
6 seasons (60 episodes) so far
Multiple storylines in each episode
Ideas
Common words
Too much noise
Characters
How often was each character mentioned?
I demand a trial by prototyping.
CHAPTER II
Prototyping
Pull sample data
from Twitter API
Entity recognition and counting
naive approach
List of names
Daenerys Targaryen,Khaleesi
Jon Snow
Sansa Stark
Tyrion Lannister
Arya Stark
Cersei Lannister
Khal Drogo
Gregor Clegane,Mountain
Margaery Tyrell
Joffrey Baratheon
Bran Stark
Theon Greyjoy
Jaime Lannister
Brienne
Eddard Stark,Ned Stark
Ramsay Bolton
Sandor Clegane,Hound
Ygritte
Stannis Baratheon
Petyr Baelish,Little Finger
Robb Stark
Bronn
Varys
Catelyn Stark
Oberyn Martell
Daario Naharis
Davos Seaworth
Jorah Mormont
Melisandre
Myrcella Baratheon
Tywin Lannister
Tommen Baratheon
Grey Worm
Tyene Sand
Rickon Stark
Missandei
Roose Bolton
Robert Baratheon
Jojen Reed
Jeor Mormont
Tormund Giantsbane
Lysa Arryn
Yara Greyjoy,Asha Greyjoy
Samwell Tarly,Sam
Hodor
Victarion Greyjoy
High Sparrow
Dragon
Winter
Dothraki
Sample Tweet
Sample data
Character    Count
Hodor        10000
Jon Snow     5000
Daenerys     4000
Bran Stark   3000
…            …
*These numbers are made up for presentation, not real data.
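The naive entity recognition and counting step could be sketched as follows, assuming each line of the name list is a canonical name followed by optional comma-separated aliases (as in the list above). The function names are hypothetical, and this is not the actual script from the project.

```javascript
// A few lines from the name list: canonical name, then optional aliases.
const NAME_LIST = [
  'Daenerys Targaryen,Khaleesi',
  'Jon Snow',
  'Eddard Stark,Ned Stark',
  'Hodor',
];

// Turn each line into { character, names }, with names lowercased for matching.
function buildMatchers(lines) {
  return lines.map(line => {
    const names = line.split(',').map(n => n.trim());
    return { character: names[0], names: names.map(n => n.toLowerCase()) };
  });
}

// Count, per character, the number of tweets that mention any of their names.
// A tweet counts at most once per character.
function countMentions(tweets, matchers) {
  const counts = {};
  tweets.forEach(text => {
    const lower = text.toLowerCase();
    matchers.forEach(({ character, names }) => {
      if (names.some(n => lower.includes(n))) {
        counts[character] = (counts[character] || 0) + 1;
      }
    });
  });
  return counts;
}
```

Plain substring matching is what makes this naive: a short alias such as "Sam" would also match "same", so short names probably need word-boundary checks.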
When you play the game of vis,
you iterate or you die.
CHAPTER III
Where to go from here?
+ episodes
The Guardian & Google Trends

http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
+ emotion
+ connections
Gain insights from a single episode
emotion & connections
Sample data
INDIVIDUALS (+ top emojis)
Character          Count
Hodor              10000
Jon Snow           5000
Daenerys           4000
…                  …
CONNECTIONS (+ top emojis)
Character          Count
Jon Snow+Sansa     1000
Tormund+Brienne    500
Bran Stark+Hodor   300
…                  …
*These numbers are made up for presentation, not real data.
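Connection counts like these can be derived by counting how often two characters appear in the same tweet. The sketch below assumes the input is, for each tweet, the list of characters already detected in it; that input format and the `A+B` key are assumptions for illustration.

```javascript
// mentionsPerTweet: array of arrays, one list of character names per tweet.
// Returns counts keyed by "A+B", with the two names in alphabetical order.
function countConnections(mentionsPerTweet) {
  const counts = {};
  mentionsPerTweet.forEach(characters => {
    // De-duplicate and sort so that each pair always produces the same key.
    const unique = Array.from(new Set(characters)).sort();
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = unique[i] + '+' + unique[j];
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  });
  return counts;
}
```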
Graph
NODES (+ top emojis)
Character          Count
Hodor              1000
Jon Snow           500
Daenerys           400
…                  …
LINKS (+ top emojis)
Character          Count
Jon Snow+Sansa     1000
Tormund+Brienne    500
Bran Stark+Hodor   300
…                  …
*These numbers are made up for presentation, not real data.
Network Visualization
Node-link diagram
Force-directed layout
http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384
Issue: Hairball
Why?
Too many nodes & edges
nodes = nodes.filter(n => n.count > 100)
links = links.filter(l => l.count > 100)
The force is (too) strong.
force
.charge(…)
.gravity(…)
.linkDistance(…)
.linkStrength(…)
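When thresholding nodes and links separately, a link can survive while one of its endpoints is filtered out. The sketch below filters both together; it assumes nodes carry an `id` and links refer to nodes by id in `source` and `target`, which is a hypothetical data shape rather than the one used in the talk.

```javascript
// Keep nodes above the threshold, then keep only links that are above the
// threshold AND whose two endpoints both survived the node filter.
function filterGraph(nodes, links, minCount) {
  const keptNodes = nodes.filter(n => n.count > minCount);
  const keptIds = new Set(keptNodes.map(n => n.id));
  const keptLinks = links.filter(l =>
    l.count > minCount && keptIds.has(l.source) && keptIds.has(l.target)
  );
  return { nodes: keptNodes, links: keptLinks };
}
```

Fewer nodes and edges usually help more than tuning the force parameters alone, since the parameters only change how the same hairball is spread out.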
Issue: Occlusions
Tried: Fixed positions
+ Collision Detection
http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
+ Community Detection
https://github.com/upphiminn/jLouvain
+ Collision Detection (with clusters)
https://bl.ocks.org/mbostock/7881887
Tormund + Brienne
Issue: Convex hull
http://bl.ocks.org/mbostock/4341699
d3.geom.hull(vertices)
x & y only, no radius
Example
Fix it
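Because the hull is computed from x & y only, circles near the boundary stick out of it. One possible fix, sketched here under assumptions, is to sample points on each circle's circumference and take the hull of those points. The `{x, y, r}` node shape, the helper names and the step count are hypothetical; the hull routine is a standalone monotone-chain implementation rather than d3's, and this is not necessarily the fix used in the talk.

```javascript
// Sample `steps` points on the circumference of each circle {x, y, r}.
function circlePoints(circles, steps = 16) {
  const points = [];
  circles.forEach(({ x, y, r }) => {
    for (let i = 0; i < steps; i++) {
      const angle = (2 * Math.PI * i) / steps;
      points.push([x + r * Math.cos(angle), y + r * Math.sin(angle)]);
    }
  });
  return points;
}

// Convex hull of [x, y] points (Andrew's monotone chain algorithm).
function convexHull(points) {
  const pts = points.slice().sort((a, b) => a[0] - b[0] || a[1] - b[1]);
  if (pts.length < 3) return pts;
  const cross = (o, a, b) =>
    (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
  const build = ordered => {
    const chain = [];
    for (const p of ordered) {
      // Pop points that would make a non-left turn.
      while (chain.length >= 2 &&
             cross(chain[chain.length - 2], chain[chain.length - 1], p) <= 0) {
        chain.pop();
      }
      chain.push(p);
    }
    chain.pop(); // last point is the first point of the other chain
    return chain;
  };
  return build(pts).concat(build(pts.slice().reverse()));
}

// Hull that encloses whole circles, not just their centers.
function hullAroundCircles(circles, steps) {
  return convexHull(circlePoints(circles, steps));
}
```

More steps per circle give a rounder outline at the cost of more vertices. The sampled points could presumably also be passed to `d3.geom.hull` directly, since it takes an array of `[x, y]` pairs.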
Let’s get other episodes.
Hadoop remembers.
CHAPTER IV
More data
Hadoop
Rewrite the scripts in Scalding
to get archived data
How much data do we need?
Whole week?
5 days?
2 days?
A day?
etc.
Transitions
not so smooth
After switching episode
1. Store old positions for existing objects.
2. Assign new initial positions.*
3. Run simulation without updating <svg> for n rounds.
4. Animate objects from old to new positions.
5. Resume simulation and update <svg> every tick.
*Initial positions
Default: random
Better starting points: heuristics based on degree of nodes
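Steps 1, 2 and 4 can be sketched without any library, as below. The node shape (`id`, `x`, `y`), the function names and the single `fallback` start position for new nodes are assumptions; the talk mentions degree-based heuristics for better starting points, which are not reproduced here. For step 3, the d3 v3 force layout exposes `force.tick()`, which can be called in a loop to advance the simulation without redrawing.

```javascript
// Step 1: remember where each existing node currently is, keyed by id.
function storePositions(nodes) {
  const old = new Map();
  nodes.forEach(n => old.set(n.id, { x: n.x, y: n.y }));
  return old;
}

// Step 2: nodes that already existed start from their old position;
// brand-new nodes start from a fallback position (e.g. the center).
function assignInitialPositions(nextNodes, oldPositions, fallback) {
  nextNodes.forEach(n => {
    const start = oldPositions.get(n.id) || fallback;
    n.x = start.x;
    n.y = start.y;
  });
}

// Step 4: position at time t in [0, 1] on the way from old to new.
function interpolatePosition(from, to, t) {
  return {
    x: from.x + (to.x - from.x) * t,
    y: from.y + (to.y - from.y) * t,
  };
}
```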
Animate Nodes & Links
Remove
delay
Move & Change size/thickness
Add new
// Create selection, keyed by entity id
const selection = svg.selectAll('g.node')
  .data(nodes, d => d.entity.id);

// Fade “exit” nodes to opacity 0 and remove
selection.exit()
  .transition()
  .duration(1000)
  .style('opacity', 0)
  .remove();

// Add “enter” nodes with opacity 0
const sEnter = selection.enter().append('g')
  .classed('node', true)
  .attr('transform', d => `translate(${d.x},${d.y})`)
  .style('opacity', 0)
  .call(force.drag);

sEnter.append('circle')
  .attr('r', d => d.r)
  .style('fill', d => options.colorScale(d.entity.group));

// After 1s delay, use transition to move nodes and fade in new nodes
// (in d3 v3, appended “enter” nodes are merged into the update selection)
const sTrans = selection.transition()
  .delay(1000)
  .duration(2000)
  .attr('transform', d => `translate(${d.x},${d.y})`)
  .style('opacity', 1);

sTrans.select('circle')
  .attr('r', d => d.r);
Animate Communities
Remove
delay
Move & Change shape*
Add new
http://blockbuilder.org/kristw/f9ffe87dd8b4038b5867e853c27cebb7
Default
t=0 t=1
Smoother
t=0 t=0.5 t=0.51 t=1
Code
// original
path.attr('d', hull);
// with custom interpolation
path.attrTween('d', (d,i,currentAttr) =>
interpolateHull(d, currentAttr)
)
Colors
Default: d3.category10()
Distinct but nothing about the context
Custom palette
Colors related to the groups/houses.
Black = Night’s Watch
Blue = North
Red = Daenerys
Gold = Lannister
…
Hold the vis.
CHAPTER V
The vis is not enough.
Legend
Navigation
Top 3
Adjust threshold
Recap
Filtered Recap
Tooltip
Demo
https://interactive.twitter.com/game-of-thrones
Mobile Support
A visualizer always evaluates his work.
CHAPTER VI
“Feedback is the breakfast of champions.”
— Ken Blanchard
Self & Peer
Does it solve the problem?
Google Analytics
Pageviews
Visitors
Actions
Referrals
Sites/Social
Feedback
EXAMPLE 2: VISUAL ANALYTICS TOOLS
Data sources → get → explore → analyze → present → Output
ad-hoc scripts
tools for exploration
WHAT TO EXPECT
richer, more features
to support exploration of complex data
more technical audience
product managers, engineers, data scientists
accuracy
designed for dynamic input
long-term projects
USER ACTIVITY LOGS
Users → Use → Twitter
Product Managers: Curious
Engineers → Instrument → Twitter → Write → Log data in Hadoop
WHAT IS BEING LOGGED?
activities
tweet from home timeline on twitter.com
tweet from search page on iPhone
sign up
log in
retweet
etc.
ORGANIZE?
LOG EVENT A.K.A. “CLIENT EVENT”
client : page : section : component : element : action
web : home : timeline : tweet_box : button : tweet
1) User ID
2) Timestamp
3) Event name
4) Event detail
[Lee et al. 2012]
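The six-level event name above can be split into named fields with a few lines of code. This is only a sketch: the function name and the error handling are assumptions, while the field names follow the `client : page : section : component : element : action` pattern on the slide.

```javascript
const FIELDS = ['client', 'page', 'section', 'component', 'element', 'action'];

// Split "web:home:timeline:tweet_box:button:tweet" into named fields.
function parseEventName(name) {
  const parts = name.split(':').map(p => p.trim());
  if (parts.length !== FIELDS.length) {
    throw new Error('Expected ' + FIELDS.length + ' parts but got ' + parts.length);
  }
  const event = {};
  FIELDS.forEach((field, i) => { event[field] = parts[i]; });
  return event;
}
```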
LOG DATA
bigger than Tweet data
Users → Use → Twitter
Engineers → Instrument → Twitter → Write → Log data in Hadoop
Product Managers (Curious) → Ask → Data Scientists
Data Scientists → Find, Clean, Analyze, Monitor → Log data in Hadoop
Scribe Radar
Project / Find & Monitor client events
GOALS
Search for client events
Explore client event collection
Monitor changes
CLIENT EVENT HIERARCHY
iphone → home → - → - → - → impression
iphone → home → - → tweet → tweet → click
iphone:home:-:-:-:impression
iphone:home:-:tweet:tweet:click
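A hierarchy like this can be built by walking each event name one level at a time and summing counts along the path. The sketch below is an assumption about how that might be done; the input shape (`name`, `count`) and the tree node layout are hypothetical, not Scribe Radar's actual data model.

```javascript
// events: [{ name: 'iphone:home:-:tweet:tweet:click', count: 20 }, ...]
// Returns a tree where each node's count is the sum of all events below it.
function buildHierarchy(events) {
  const root = { name: 'root', count: 0, children: {} };
  events.forEach(({ name, count }) => {
    let node = root;
    node.count += count;
    name.split(':').forEach(part => {
      if (!node.children[part]) {
        node.children[part] = { name: part, count: 0, children: {} };
      }
      node = node.children[part];
      node.count += count;
    });
  });
  return root;
}
```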
DETECT CHANGES
TODAY compared to 7 DAYS AGO
(the same event hierarchy, one copy per day)
CALCULATE CHANGES
+5% +5% +5%
+10% +10% +10%
-5% -5% -5%
DIFF
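The diff between today and 7 days ago is a percent change per event. A minimal sketch, assuming counts are available as plain objects keyed by event name; how events that are missing on one side should be treated is a design choice, and here a brand-new event is reported as `Infinity`.

```javascript
// Percent change from weekAgo to today, e.g. 100 -> 110 is +10.
function percentChange(today, weekAgo) {
  if (weekAgo === 0) return today === 0 ? 0 : Infinity;
  return ((today - weekAgo) * 100) / weekAgo;
}

// Compare two { eventName: count } objects; missing events count as 0.
function diffCounts(todayCounts, weekAgoCounts) {
  const names = new Set(
    Object.keys(todayCounts).concat(Object.keys(weekAgoCounts))
  );
  const diff = {};
  names.forEach(name => {
    diff[name] = percentChange(todayCounts[name] || 0, weekAgoCounts[name] || 0);
  });
  return diff;
}
```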
DISPLAY CHANGES
Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
Demo / Scribe Radar
Twitter for Banana
WORKFLOW
Requested / Identify needs
Design & Prototype
Make it work for sample dataset
Refine & Generalize
Productionize
Document & Release
Maintain & Support
Keep it running, feature requests & bug fixes
8. EXPECT TO REFINE AND POLISH
REFINE & POLISH
UX / UI
Color
Animation
Mobile support
Performance
Loading time, Data file size
“The little of visualisation design” by Andy Kirk
http://www.visualisingdata.com/2016/03/little-visualisation-design/
9. EXPECT TO GET FEEDBACK
FEEDBACK
Logging
User study
Forum, User group
Office hours
10. EXPECT TO IMPROVE
HOW TO BE BETTER?
Time is limited.

Grow the team
Expand skills
Improve tooling
Solve a problem once and for all
Automate repetitive tasks
http://twitter.github.io/labella.js
Demo / Labella.js
https://github.com/twitter/d3kit
Demo / d3Kit
http://www.slideshare.net/kristw/d3kit
yeoman.io
Demo / Yeoman
SUMMARY
INPUT YOU OUTPUT
EXPECT
1) potential mismatches
2) different requirements
3) to clean data
4) to clean data a lot
5) to try and break things
6) to iterate until it works
7) deadline
8) to refine and polish
9) to get feedback
10) to improve
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
#VOTE
ACKNOWLEDGEMENT
Nicolas Garcia Belmonte, Robert Harris, Miguel Rios,
Simon Rogers, Jimmy Lin, Linus Lee, Chuang Liu,
and many colleagues at Twitter.
RESOURCES
Images
Banana phone http://goo.gl/GmcMPq
Bar chart https://goo.gl/1G1GBg
Boss https://goo.gl/gcY8Kw
Champions League http://goo.gl/DjtNKE
Database http://goo.gl/5N7zZz
Fishing shark http://goo.gl/2fp4zW
Globe visualization http://goo.gl/UiGMMj
Harry Potter http://goo.gl/Q9Cy64
Holding phone http://goo.gl/It2TzH
Kiwi orange http://goo.gl/ejQ73y
Kiwi http://goo.gl/9yk7o5
Library https://goo.gl/HVeE6h
Library earthquake http://goo.gl/rBqBrs
Minion http://goo.gl/I19Ijg
NBA http://goo.gl/p7HBdG
NFL http://goo.gl/feQMZs
Orange & Apple http://goo.gl/NG6RIL
Pile of paper http://goo.gl/mGLQTx
Premier League http://goo.gl/AqIINO
Scrooge McDuck https://goo.gl/aKv8D7
The Sound of Music https://goo.gl/dqHlzj
Trash pile http://goo.gl/OsFfo3
Tyrion http://goo.gl/WaBonl
Watercolor Map by Stamen Design
THANK YOU
QUESTIONS?

What to expect when you are visualizing (v.2)