Krist Wongsuphasawat / @kristw
6 THINGS TO EXPECT
WHEN YOU ARE
VISUALIZING
Computer Engineer
Bangkok, Thailand
Chulalongkorn University
Programming + Soccer
(P.S. These are actually not my robots, but our competitors’.)
PhD in Computer Science
Information Visualization
Univ. of Maryland
IBM
Microsoft
Data Scientist
Analytics, Experiment
Twitter
Engineering Manager
Data Experience
Airbnb
#interactive visualizations
interactive.twitter.com
Open-source projects
Apache Superset committer
labella.js (3000+ stars)
react-vega
Visual Analytics Tools
Internal tools
Academic papers
kristw.yellowpigz.com
ME + DATA = VIS
Data, I’m ready!
Here I come!
WHAT TO EXPECT?
1. EXPECT TO FIND THE REAL NEED
INPUT (DATA)
What clients think they have vs. what they usually have
YOU
What clients think you are vs. what they will get
OUTPUT (VIS)
What clients ask for vs. what they really need
COMMUNICATE
GOALS
Present data
Communicate information effectively
Analyze data
Exploratory data analysis
Tools to analyze data
Reusable tools for exploration
Enjoy
Combination of above
Who is the audience?
What do you want to tell them?
What are the questions?
Who will use this?
What would they use this for?
Who is the audience?
I need this. Take this.
I need this. Here you are.
& COMPROMISE
2. EXPECT TO CLEAN DATA A LOT
70-80% of time cleaning data
“DATA JANITOR”
Collect + Clean + Transform
DATA WRANGLING
WHY DOES IT TAKE SO MUCH TIME?
2.1 Many sources and data formats
DATA SOURCES
Open data
Publicly available
Internal data
Private, owned by clients’ organization
Self-collected data
Manual, site scraping, etc.
Combine the above
DATA FORMAT
Standalone files
txt, csv, tsv, json, Google Docs, …, pdf*
Databases
doesn’t necessarily mean they are organized
API
better quality with more overhead
Website
Big data*
NEED TO…
Change format
e.g. tsv => json
Combine data
Resolve multiple sources of truth
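A format change such as tsv => json can be sketched as follows. This is a minimal illustration that assumes the first row of the file is a header; the function name and the sample rows are made up for this example.

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str) -> str:
    """Convert tab-separated text (first row = header) into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Every value stays a string; casting types is a separate cleaning step.
    return json.dumps(list(reader), indent=2)


# Made-up sample input, for illustration only.
sample = "user\trestaurant\trating\nA\tIHOP\t4\nB\tSUBWAY\t4\n"
print(tsv_to_json(sample))
```

Note that the ratings come out as strings ("4"), which is one more reason a format change is rarely the last cleaning step.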
2.2 Data transformation is needed.
EXAMPLES
Convert latitude/longitude into zip code
Change country code from 3-letter (USA) to 2-letter (US)
Correct time of day based on users’ timezone
etc.
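Two of the transformations above might look like this in practice. This is only a sketch: the lookup table is a deliberately tiny, hand-written sample (a real project would load the full ISO 3166-1 list), and the timezone helper assumes the user's UTC offset is already known and ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical, deliberately incomplete lookup table for illustration.
ALPHA3_TO_ALPHA2 = {"USA": "US", "THA": "TH", "GBR": "GB"}


def to_alpha2(code):
    """Return the 2-letter country code, or None when the code is unknown."""
    return ALPHA3_TO_ALPHA2.get(code.strip().upper())


def local_hour(utc_timestamp, utc_offset_hours):
    """Hour of day in the user's timezone, given a fixed UTC offset (no DST handling)."""
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_timestamp.astimezone(user_tz).hour


print(to_alpha2("usa"))
# 23:00 UTC seen from a UTC+7 timezone falls in the early morning of the next day.
print(local_hour(datetime(2020, 1, 1, 23, 0, tzinfo=timezone.utc), 7))
```

Returning None for unknown codes, instead of guessing, keeps bad rows visible so they can be reviewed later.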
2.3 Data collection issues
EXAMPLES
Typos
Incorrect values
Incorrect timestamps
Missing data
2.4 Definition of “clean” data
IS THIS CLEAN?
USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4
How many reviews are there?
Clean.
How many restaurants are there?
Not clean.
McDonald, McDonald’s, McDonalds
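The two questions can be checked directly against the table. The sketch below assumes one possible normalization rule (drop punctuation, then apply a small alias table); the alias table is hypothetical and would need to be curated by hand for real data.

```python
import re

reviews = [
    ("A", "MCDONALD’S", 3),
    ("B", "MCDONALDS", 3),
    ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5),
    ("E", "IHOP", 4),
    ("F", "SUBWAY", 4),
]

# Hypothetical alias table: maps known variants to one canonical spelling.
ALIASES = {"MCDONALD": "MCDONALDS"}


def normalize(name):
    """Uppercase, strip everything except letters and digits, then resolve aliases."""
    key = re.sub(r"[^A-Z0-9]", "", name.upper())
    return ALIASES.get(key, key)


print("reviews:", len(reviews))
print("restaurants (raw):", len({name for _, name, _ in reviews}))
print("restaurants (normalized):", len({normalize(name) for _, name, _ in reviews}))
```

Counting reviews needs no cleaning at all, while counting restaurants gives a different answer before and after normalization. Whether data is "clean" depends on the question.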
2.5 Bigger data, bigger problems
HAVING ALL TWEETS
How people think I feel vs. how I really feel.
GETTING BIG DATA
Hadoop Cluster: Data Storage -> Tool: Scalding (slow) -> Smaller dataset
Your laptop: Smaller dataset -> Tool: node.js / python / excel (fast) -> Final dataset
CHALLENGES
Slow
Long processing time (hours)
Get relevant Tweets
hashtag: #oscars
keywords: “parasite” (movie name)
Too big
Need to aggregate & reduce size
Harder to spot problems
RAMSAY & RAMSEY
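The "get relevant Tweets, then aggregate & reduce size" steps can be sketched on a laptop-sized sample. This is an illustration only, not the Scalding job used on the cluster; the tweets, the hourly bucket, and the function names are made up.

```python
import re
from collections import Counter

HASHTAGS = {"#oscars"}
KEYWORDS = {"parasite"}


def is_relevant(text):
    """True if the tweet contains one of the hashtags or keywords (case-insensitive)."""
    lowered = text.lower()
    tags = set(re.findall(r"#\w+", lowered))
    words = set(re.findall(r"\w+", lowered))
    return bool(tags & HASHTAGS) or bool(words & KEYWORDS)


def count_per_hour(tweets):
    """Reduce (iso_timestamp, text) pairs to counts keyed by 'YYYY-MM-DDTHH'."""
    counts = Counter()
    for timestamp, text in tweets:
        if is_relevant(text):
            counts[timestamp[:13]] += 1
    return counts


# Made-up tweets, for illustration only.
sample = [
    ("2020-02-10T02:15:00", "Parasite wins! #Oscars"),
    ("2020-02-10T02:40:00", "what a night #oscars"),
    ("2020-02-10T03:05:00", "good morning"),
]
print(count_per_hour(sample))
```

A keyword such as "parasite" also matches tweets that are not about the movie, and a misspelled name (Ramsay vs. Ramsey) silently drops matches. Both problems are much harder to spot once the data has been aggregated.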
2.6 New issues can show up any time.
RECOMMENDATIONS
Always assume that you will have to do it again
document the process, automate it
Reusable scripts
break a gigantic do-it-all function into smaller ones
Reusable data
keep it for future projects
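As one way to read "break a gigantic do-it-all function into smaller ones", the sketch below splits a cleaning script into small steps that can be tested and reused on their own. The column names and the steps themselves are assumptions chosen for illustration.

```python
import csv
import io


def parse(text):
    """Read comma-separated text (first row = header) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def strip_whitespace(rows):
    """Trim stray spaces around every value; missing values become empty strings."""
    return [{key: (value or "").strip() for key, value in row.items()} for row in rows]


def drop_missing(rows, required):
    """Keep only rows where every required field is non-empty."""
    return [row for row in rows if all(row.get(field) for field in required)]


def cast_rating(rows):
    """Convert the rating column from text to int."""
    return [{**row, "rating": int(row["rating"])} for row in rows]


def clean(text):
    """The whole pipeline, written as a readable sequence of small steps."""
    rows = parse(text)
    rows = strip_whitespace(rows)
    rows = drop_missing(rows, required=("user", "rating"))
    return cast_rating(rows)


print(clean("user,rating\nA, 3 \nB,\n ,5\n"))
```

When a new issue shows up later, it usually means adding or fixing one small step instead of rereading the entire script.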
3. PREPARE TO ITERATE
It was a great idea … until I actually tried it.
Celebrate your failures
#D3BrokeAndMadeArt
TIPS
Don’t give up.
If stuck, look for inspiration.
The vis that gives you insights may or may not be the best vis for sharing.
Exploration vs. Communication
Keep it as simple as possible
but not simpler.
“Necessity is the mother of invention.”
— English Proverb
DEADLINE
Set milestones and deadlines.
PROJECTS
STORYTELLING PROJECTS
timely
Deadlines are strict. Events can also be unexpected.
wide audience
easy to explain and understand, multi-device support
one-off project
scope
analyze data to find stories and find best way to present them
HAPPY NEW YEAR AROUND THE WORLD
[ PROJECT ]
HAPPY NEW YEAR 2013
twitter.github.io/interactive/newyear2014/
BOBA SCIENCE
[ PROJECT ]
https://medium.com/s/story/boba-science-how-can-i-drink-a-bubble-tea-to-ensure-that-i-dont-finish-the-tea-before-the-bobas-7fc5fd0e442d
GAME OF THRONES
[ PROJECT ]
Reveal the talking points of every episode of Game of Thrones
from fans’ conversations
Problem is coming.
Problem
Want to know what the audience of a TV show
talks about, from Tweets
HBO’s Game of Thrones
Based on the book series “A Song of Ice and Fire”
Medieval Fantasy. Knights, magic and dragons.
Brief Story
A King dies. 
A lot of contenders wage a war
to reclaim the throne.
Minor characters with no claim to the throne
set their own plans in action to gain power
when all the major characters end up killing each other.
Brave/Honest/Honorable characters die.
Intelligent but shady characters
and characters who know nothing
continue to live.
While humans are busy killing each other,
ice zombies, the “White Walkers”, are invading from the North.
The only group that seems to care about this
is a neutral group called the Night’s Watch.
Many characters.
Anybody can die.
8 seasons
Multiple storylines in each episode
Ideas
Common words
Too much noise
Characters
How often was each character mentioned?
Prototyping
Pull sample data
from Twitter API
Entity recognition and counting
naive approach
List of names
Daenerys Targaryen,Khaleesi
Jon Snow
Sansa Stark
Tyrion Lannister
Arya Stark
Cersei Lannister
Khal Drogo
Gregor Clegane,Mountain
Margaery Tyrell
Joffrey Baratheon
Bran Stark
Theon Greyjoy
Jaime Lannister
Brienne
Eddard Stark,Ned Stark
Ramsay Bolton
Sandor Clegane,Hound
Ygritte
Stannis Baratheon
Petyr Baelish,Little Finger
Robb Stark
Bronn
Varys
Catelyn Stark
Oberyn Martell
Daario Naharis
Davos Seaworth
Jorah Mormont
Melisandre
Myrcella Baratheon
Tywin Lannister
Tommen Baratheon
Grey Worm
Tyene Sand
Rickon Stark
Missandei
Roose Bolton
Robert Baratheon
Jojen Reed
Jeor Mormont
Tormund Giantsbane
Lysa Arryn
Yara Greyjoy,Asha Greyjoy
Samwell Tarly,Sam
Hodor
Victarion Greyjoy
High Sparrow
Dragon
Winter
Dothraki
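The naive "entity recognition and counting" step can be sketched with the name list above, reading each line as a canonical name followed by optional comma-separated aliases. The matching rule (case-insensitive whole-phrase match, each character counted at most once per tweet) is an assumption for this illustration, and the tweets are made up.

```python
import re
from collections import Counter

# A few lines in the same format as the name list: canonical name, then aliases.
NAME_LIST = """\
Daenerys Targaryen,Khaleesi
Jon Snow
Eddard Stark,Ned Stark
Hodor
"""


def load_aliases(text):
    """Map each canonical name (first entry on a line) to all of its search terms."""
    aliases = {}
    for line in text.splitlines():
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if terms:
            aliases[terms[0]] = terms
    return aliases


def build_patterns(aliases):
    """One case-insensitive, whole-phrase regex per character."""
    return {
        name: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for name, terms in aliases.items()
    }


def count_mentions(tweets, patterns):
    """Count, per character, the number of tweets that mention them."""
    counts = Counter()
    for text in tweets:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


# Made-up tweets, for illustration only.
tweets = [
    "HODOR!!! hodor",
    "Khaleesi and Jon Snow finally meet",
    "ned stark deserved better",
    "I love dragons",
]
print(count_mentions(tweets, build_patterns(load_aliases(NAME_LIST))))
```

The approach is naive on purpose: a tweet that only says "Jon" or misspells a name is missed, which is why aliases such as "Khaleesi" or "Ned Stark" are listed explicitly.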
Sample Tweet
Sample data
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
Bran Stark 3000
… …
*These numbers are made up for presentation, not real data.
Where to go from here?
+ episodes
The Guardian & Google Trends
http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
+ emotion
+ connections
Gain insights from a single episode
emotion & connections
Sample data
INDIVIDUALS (+ top emojis)
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
… …
CONNECTIONS (+ top emojis)
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
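One way to derive the connection counts is to count how often two characters are mentioned in the same tweet. That definition of a "connection" is an assumption made for this sketch, and the input below is made up.

```python
from collections import Counter
from itertools import combinations


def count_connections(mentions_per_tweet):
    """Count co-mentions: each unordered pair of characters is counted once per tweet."""
    links = Counter()
    for names in mentions_per_tweet:
        # Sorting makes (A, B) and (B, A) land on the same key.
        for pair in combinations(sorted(set(names)), 2):
            links[pair] += 1
    return links


# Made-up input: the characters detected in each tweet.
mentions = [
    ["Jon Snow", "Sansa Stark"],
    ["Sansa Stark", "Jon Snow", "Hodor"],
    ["Hodor"],
]
print(count_connections(mentions))
```

The per-character counts become the nodes of a graph and these pair counts become its links.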
Graph
NODES (+ top emojis)
Character Count
Hodor 1000
Jon Snow 500
Daenerys 400
… …
LINKS (+ top emojis)
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
Network Visualization
Node-link diagram
Force-directed layout
http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384
Issue: Hairball
Issue: Occlusions
Tried: Fixed positions
+ Collision Detection
http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
+ Community Detection
https://github.com/upphiminn/jLouvain
+ Collision Detection (with clusters)
https://bl.ocks.org/mbostock/7881887
Tormund + Brienne
Let’s get other episodes.
More data
Hadoop
Rewrite the scripts in Scalding
to get archived data
How much data do we need?
Whole week?
5 days?
2 days?
A day?
etc.
Transitions
Changing episode
Community transition
t=0 t=1
Smoother
t=0 t=0.5 t=0.51 t=1
Colors
Default: D3 category10
Distinct, but says nothing about the context
Custom palette
Colors related to the groups/houses.
Black = Night’s Watch
Blue = North
Red = Daenerys
Gold = Lannister
…
The vis is not enough.
Legend
Navigation
Top 3
Adjust threshold
Recap
Filtered Recap
Tooltip
Demo
https://interactive.twitter.com/game-of-thrones
Mobile Support
Self & Peer
Does it solve the problem?
Google Analytics
Pageviews
Visitors
Actions
Referrals
Sites/Social
Feedback
ANALYTICS TOOLS
VISUAL ANALYTICS TOOL PROJECTS
richer, more features
to support exploration of complex data
more technical audience
product managers, engineers, data scientists
accuracy
designed for dynamic input
long-term projects
PROJECT LIFECYCLE
Identify needs
Design and prototype
Make it work for sample dataset
Refine, generalize and productionize
Make it work for other cases
Document and release
Maintain and support
Keep it running, feature requests & bug fixes
VISUAL ANALYTICS FOR LOG EVENTS
[ PROJECT ]
USER ACTIVITY LOGS
[Diagram: Users use Twitter. Engineers instrument Twitter, which writes log data in Hadoop. Product Managers are curious.]
WHAT IS BEING LOGGED?
Activities
tweet from home timeline on twitter.com
tweet from search page on iPhone
sign up
log in
retweet
etc.
ORGANIZE?
LOG EVENT A.K.A. “CLIENT EVENT”
client : page : section : component : element : action
web : home : timeline : tweet_box : button : tweet
1) User ID
2) Timestamp
3) Event name
4) Event detail
[Lee et al. 2012]
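The six-part event name above can be split into named fields. This is a minimal sketch; the class and function names are invented for illustration and do not come from Twitter's codebase.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    """The six levels of an event name, from most general to most specific."""
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event_name(name):
    """Split 'client:page:section:component:element:action' into its six parts."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 parts, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


event = parse_event_name("web:home:timeline:tweet_box:button:tweet")
print(event.client, event.page, event.action)
```

Rejecting names that do not have exactly six parts surfaces malformed events early instead of letting them shift into the wrong fields.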
LOG DATA
bigger than Tweet data
[Diagram: Users use Twitter. Engineers instrument Twitter, which writes log data in Hadoop. Curious Product Managers ask Data Scientists, who find, clean, analyze and monitor the log data.]
client page section component element action
Event
50,000+ event types
one graph / event
x 50,000
DESIGN
CLIENT EVENT HIERARCHY
iphone home -
- - impression
tweet tweet click
iphone:home:-:-:-:impression
iphone:home:-:tweet:tweet:click
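Event names like the two above can be folded into a hierarchy where each node carries the total count of the events beneath it. The nested-dict layout and the counts below are assumptions made up for this sketch.

```python
def new_node():
    return {"count": 0, "children": {}}


def build_hierarchy(event_counts):
    """Fold 'a:b:c:...' event names into a tree; each node sums the counts below it."""
    root = new_node()
    for name, count in event_counts.items():
        node = root
        node["count"] += count
        for part in name.split(":"):
            node = node["children"].setdefault(part, new_node())
            node["count"] += count
    return root


# Made-up counts, for illustration only.
counts = {
    "iphone:home:-:-:-:impression": 100,
    "iphone:home:-:tweet:tweet:click": 20,
}
tree = build_hierarchy(counts)
home = tree["children"]["iphone"]["children"]["home"]
print(home["count"])
```

Grouping by shared prefixes is what turns tens of thousands of separate event types into a single structure that can be browsed level by level.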
DETECT CHANGES
TODAY compared to 7 DAYS AGO
CALCULATE CHANGES
+5% +5% +5%
+10% +10% +10%
-5% -5% -5%
DIFF
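The change for each event can be computed as a percentage difference between today and 7 days ago. A minimal sketch with made-up counts; how the real tool treats events that are missing on one of the two days is not stated in the slides, so the None convention here is an assumption.

```python
def percent_change(today, week_ago):
    """Percentage change from week_ago to today; None when there is no baseline."""
    if week_ago == 0:
        return None
    return (today - week_ago) * 100.0 / week_ago


def diff_counts(today, week_ago):
    """Percentage change for every event name seen on either day."""
    names = sorted(set(today) | set(week_ago))
    return {name: percent_change(today.get(name, 0), week_ago.get(name, 0)) for name in names}


# Made-up counts, for illustration only.
today = {"iphone:home:-:-:-:impression": 105, "iphone:home:-:tweet:tweet:click": 95}
week_ago = {"iphone:home:-:-:-:impression": 100, "iphone:home:-:tweet:tweet:click": 100}
print(diff_counts(today, week_ago))
```

Comparing against the same weekday one week earlier avoids mistaking the normal weekday/weekend rhythm for a real change.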
DISPLAY CHANGES
Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
Demo / Scribe Radar
Twitter for Banana
4. RESERVE TIME FOR REFINEMENT
Details separate good and great work
“The first 90% of the code
accounts for the first 90% of the development time.
The remaining 10% of the code
accounts for the other 90% of the development time.”
— Tom Cargill, Bell Labs
REFINE & POLISH
UX / UI + Mobile Support
Color
Animation / Transition
Metadata for SEO
Social media preview images
Performance
Loading time, Data file size
“The little of visualisation design” by Andy Kirk
http://www.visualisingdata.com/2016/03/little-visualisation-design/
Example
Issue: Convex hull
http://bl.ocks.org/mbostock/4341699
x & y only, no radius
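A hull computed from node centers (x & y only) cuts through the circles on its border. One possible fix, not necessarily the one used in the project, is to sample points on each circle's circumference and take the hull of those points instead. The sketch below uses the standard monotone chain algorithm; the sample count of 16 is an arbitrary choice.

```python
import math


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hull_around_circles(circles, samples=16):
    """Hull that encloses circles (x, y, r) by sampling points on each circumference."""
    points = []
    for x, y, r in circles:
        for i in range(samples):
            angle = 2 * math.pi * i / samples
            points.append((x + r * math.cos(angle), y + r * math.sin(angle)))
    return convex_hull(points)


circles = [(0, 0, 1), (10, 0, 1)]
centers_only = convex_hull([(x, y) for x, y, _ in circles])
with_radius = hull_around_circles(circles)
print(len(centers_only), len(with_radius))
```

More samples per circle give a smoother outline at the cost of more hull vertices.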
Fix it
Flatten the curve
https://www.fastcompany.com/90476143/the-story-behind-flatten-the-curve-the-defining-chart-of-the-coronavirus
THE ORIGIN
From a paper, “Interim pre-pandemic planning guidance:
community strategy for pandemic influenza mitigation in the United States:
early, targeted, layered use of nonpharmaceutical interventions”,
published in 2007 by the CDC
https://stacks.cdc.gov/view/cdc/11425
REVIVAL
Rosamund Pearce, a data journalist
at The Economist, rebuilt it for a
piece about COVID-19.
She changed the labeling scheme to
assist colorblind readers.
https://www.economist.com/briefing/2020/02/29/covid-19-is-now-in-50-countries-and-things-will-get-worse
THE LINE
Drew Harris, an assistant
professor at Thomas
Jefferson University, came across
the graphic in The Economist.
He recalled using it a decade
earlier as a pandemic
preparedness trainer.
So he added the dotted line,
“healthcare system capacity”.
https://www.nytimes.com/article/flatten-curve-coronavirus.html
THEN IT WENT VIRAL
5. PLAN FOR FEEDBACK
or find ways to get some
“Feedback is the breakfast of champions.”
— Ken Blanchard
FEEDBACK
During development
Feedback sessions with clients/potential users
After release
Logging
User study
Forum, User group
Office hours
6. LOOK BACK FOR IMPROVEMENT
HOW TO BE BETTER?
Retrospective
What could have been better?
Wishlist
Expand skillset
Learning opportunities
Get help
Grow the team
Improve tooling
Solve a problem once and for all
Automate repetitive tasks
REUSABLE WORK
LABELLA.JS
[ PROJECT ]
GRID MAP
[ PROJECT ]
COVID-19 Situation in Thailand
by province
VX = REACT + D3
[ PROJECT ]
SUMMARY
6 STEPS
1. Expect to find the real need
2. Expect to clean data a lot
3. Prepare to iterate
4. Reserve time for refinement
5. Plan for feedback
6. Look back for improvement
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
ACKNOWLEDGEMENT
My former and current colleagues at Twitter and Airbnb
for their collaboration and support in these projects;
and my wife for taking care of our two kids
while I made these slides.
THANK YOU
QUESTIONS?

6 things to expect when you are visualizing (2020 Edition)