This document discusses what to expect when creating data visualizations. It covers 6 main points:
1. Expect to find the real need by understanding audience goals, questions, and intended use of the visualization. Compromise may be needed.
2. Expect to spend significant time (70-80%) cleaning data due to issues like multiple data sources and formats, missing values, and errors.
3. Expect trial and error in the prototyping process to solve problems and meet deadlines. Iteration is important.
4. For larger datasets, expect challenges in processing, analyzing, and reducing size to find relevant insights. Tools like Hadoop can help handle bigger data.
5.
6. (P.S. These are actually not my robots, but our competitors’.)
Krist Wongsuphasawat / @kristw
Computer Engineer
Bangkok, Thailand
PhD in Computer Science
Information Visualization
Univ. of Maryland
IBM
Microsoft
Data Visualization Scientist
Twitter
23. GOALS
Present data
Communicate information effectively
Analyze data
Exploratory data analysis
Tools to analyze data
Reusable tools for exploration
Enjoy
Combination of above
Who is the audience?
What do you want to tell?
What are the questions?
Who will use this?
What would they use this for?
Who is the audience?
34. DATA SOURCES
Open data
Publicly available
Internal data
Private, owned by clients’ organization
Self-collected data
Manual, site scraping, etc.
Combine the above
35. DATA FORMAT
Standalone files
txt, csv, tsv, json, Google Docs, …, pdf*
Databases
doesn’t necessarily mean they are organized
API
better quality with more overhead
Website
Big data*
42. IS THIS CLEAN?
USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4
How many reviews are there?
Clean.
How many restaurants are there?
Not clean.
McDonald, McDonald’s, McDonalds
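The two questions above can be sketched in Python. This is a minimal illustration rather than the speaker's method: the `ALIASES` table and the `normalize` helper are hypothetical, and real data usually needs a larger, hand-checked alias list.

```python
# Rows copied from the slide: (user, restaurant, rating).
reviews = [
    ("A", "MCDONALD’S", 3),
    ("B", "MCDONALDS", 3),
    ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5),
    ("E", "IHOP", 4),
    ("F", "SUBWAY", 4),
]

# Hypothetical alias table mapping known variants to one canonical spelling.
ALIASES = {"MCDONALD": "MCDONALDS"}


def normalize(name: str) -> str:
    """Uppercase, drop apostrophes, then apply the alias table."""
    key = name.upper().replace("’", "").replace("'", "").strip()
    return ALIASES.get(key, key)


# Counting reviews does not depend on how restaurants are spelled.
review_count = len(reviews)

# Counting restaurants does: compare distinct names before and after cleaning.
raw_restaurants = {restaurant for _, restaurant, _ in reviews}
clean_restaurants = {normalize(restaurant) for _, restaurant, _ in reviews}

print(review_count, len(raw_restaurants), len(clean_restaurants))
```

Whether a dataset counts as clean depends on the question: the same rows answer the review count correctly and the restaurant count incorrectly until the names are normalized.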
50. GETTING BIG DATA
Data Storage: Hadoop Cluster
Tool: Scalding (slow)
Smaller dataset
Tool: node.js / python / excel (fast), on your laptop
Final dataset
51. CHALLENGES
Slow
Long processing time (hours)
Get relevant Tweets
hashtag: #oscars
keywords: “moonlight” (movie name)
Too big
Need to aggregate & reduce size
Harder to spot problems
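The filter-then-aggregate step can be sketched on a laptop-sized sample. This is only an illustration under stated assumptions: the records are made up, the field names `text` and `hour` are hypothetical, and the matching rules are the simplest ones that fit the hashtag and keyword on the slide.

```python
from collections import Counter

# Made-up records; the field names "text" and "hour" are assumptions.
tweets = [
    {"text": "Can't wait for the #Oscars tonight", "hour": "20:00"},
    {"text": "Moonlight deserves best picture", "hour": "20:00"},
    {"text": "What a beautiful moonlight walk", "hour": "21:00"},
    {"text": "Traffic is terrible today", "hour": "21:00"},
]

HASHTAGS = {"#oscars"}
KEYWORDS = {"moonlight"}


def is_relevant(text: str) -> bool:
    """Keep a Tweet if it has a tracked hashtag or mentions a keyword."""
    lowered = text.lower()
    words = lowered.split()
    has_hashtag = any(tag in words for tag in HASHTAGS)
    has_keyword = any(keyword in lowered for keyword in KEYWORDS)
    return has_hashtag or has_keyword


# Reduce size: keep relevant Tweets only, then aggregate to counts per hour.
relevant = [tweet for tweet in tweets if is_relevant(tweet["text"])]
counts_per_hour = Counter(tweet["hour"] for tweet in relevant)

print(dict(counts_per_hour))
```

The third record matches the keyword although it is about a walk, not the movie. Once the data is aggregated to counts, false positives like this are harder to spot.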
55. RECOMMENDATIONS
Always think that you will have to do it again
document the process, automation
Reusable scripts
break a gigantic do-it-all function into smaller ones
Reusable data
keep for future projects
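One way to read the "smaller functions" recommendation is a pipeline of single-purpose steps. The sketch below is hypothetical: the function names, the `character` and `count` columns, and the sample CSV are all invented for illustration.

```python
import csv
import io
from collections import Counter


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean_rows(rows):
    """Trim whitespace, convert counts to int, drop rows with no character."""
    return [
        {"character": row["character"].strip(), "count": int(row["count"])}
        for row in rows
        if row["character"].strip()
    ]


def total_by_character(rows):
    """Sum counts per character."""
    totals = Counter()
    for row in rows:
        totals[row["character"]] += row["count"]
    return totals


def run_pipeline(csv_text):
    """Compose the small steps; each one can be tested and reused alone."""
    return total_by_character(clean_rows(load_rows(csv_text)))


# Made-up sample input with stray whitespace and a missing name.
SAMPLE = "character,count\nHodor,3\n Jon Snow ,2\n,5\nHodor,1\n"
totals = run_pipeline(SAMPLE)
print(dict(totals))
```

When the work has to be redone with new data, only `load_rows` needs to change, and the cleaning step can be reused in a later project.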
65. WHAT TO EXPECT
timely
Deadlines are strict, and unexpected events can happen.
wide audience
easy to explain and understand, multi-device support
one-off project
scope
analyze data to find stories and find best way to present them
74. While humans are busy killing each other,
ice zombies “White walkers” are invading from the North.
The only group who seems to care about this
is a neutral group called the Night’s Watch.
75. HBO’s Game of Thrones
Based on a book series “A Song of Ice and Fire”
Medieval Fantasy. Knights, magic and dragons.
Many characters.
Anybody can die.
6 seasons (60 episodes) so far
Multiple storylines in each episode
84. Sample data
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
Bran Stark 3000
… …
*These numbers are made up for presentation, not real data.
85. When you play the game of vis,
you iterate or you die.
CHAPTER III
87. + episodes
The Guardian & Google Trends
http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
92. Sample data
INDIVIDUALS + top emojis
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
… …
CONNECTIONS + top emojis
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
93. Graph
NODES + top emojis
Character Count
Hodor 1000
Jon Snow 500
Daenerys 400
… …
LINKS + top emojis
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
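Turning the two tables on this slide into a graph can be sketched as follows. The counts are the made-up numbers from the slide. The `build_graph` helper and the output shape (a `nodes` list and a `links` list with `source` and `target` keys, a layout commonly used for force-directed graphs) are assumptions, not the format the project actually used.

```python
# Made-up counts copied from the slide, not real data.
node_counts = {"Hodor": 1000, "Jon Snow": 500, "Daenerys": 400}
link_counts = {
    "Jon Snow+Sansa": 1000,
    "Tormund+Brienne": 500,
    "Bran Stark+Hodor": 300,
}


def build_graph(node_counts, link_counts):
    """Combine per-character and per-pair counts into nodes and links."""
    counts = dict(node_counts)
    links = []
    for pair, value in link_counts.items():
        source, target = pair.split("+")
        # A character seen only in a pair still needs a node.
        counts.setdefault(source, 0)
        counts.setdefault(target, 0)
        links.append({"source": source, "target": target, "value": value})
    nodes = [{"id": name, "count": count} for name, count in counts.items()]
    return {"nodes": nodes, "links": links}


graph = build_graph(node_counts, link_counts)
print(len(graph["nodes"]), len(graph["links"]))
```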
117. Colors
Default: D3 category10
Distinct, but says nothing about the context
Custom palette
Colors related to the groups/houses.
Black = Night’s Watch
Blue = North
Red = Daenerys
Gold = Lannister
…
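A custom palette like this can be kept as a simple lookup with a fallback. In the sketch below the hex values are illustrative assumptions (the slide names only the hues), and `color_for` is a hypothetical helper.

```python
# Hex values are illustrative assumptions; the slide names only the hues.
HOUSE_COLORS = {
    "Night’s Watch": "#000000",  # black
    "North": "#1f77b4",          # blue
    "Daenerys": "#d62728",       # red
    "Lannister": "#ffd700",      # gold
}

# Neutral gray for any group that has no custom color assigned.
FALLBACK_COLOR = "#999999"


def color_for(group: str) -> str:
    """Return the custom color for a group, or the fallback gray."""
    return HOUSE_COLORS.get(group, FALLBACK_COLOR)


print(color_for("Lannister"), color_for("Unknown group"))
```

Unlike a default categorical scheme, the mapping is fixed per group, so a group keeps its color across charts.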
135. WHAT TO EXPECT
richer, more features
to support exploration of complex data
more technical audience
product managers, engineers, data scientists
accuracy
designed for dynamic input
long-term projects
164. See
Client event collection
Engineers & Data Scientists
client : page : section : component : element : action
HOW TO VISUALIZE?
narrow down
Interactions
search box => filter
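The six-level event naming and the search-box filter can be sketched as follows. The event names are invented examples, and `ClientEvent`, `parse_event` and `filter_events` are hypothetical helpers. A plain substring match is assumed for the search box.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    """One field per level: client : page : section : component : element : action."""
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event(name: str) -> ClientEvent:
    """Split a colon-separated event name into its six levels."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 levels, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


def filter_events(names, query):
    """Narrow down: keep event names containing the search text."""
    needle = query.lower()
    return [name for name in names if needle in name.lower()]


# Invented event names for illustration only.
EVENTS = [
    "web:home:timeline:tweet:avatar:click",
    "web:profile:header:follow_button:icon:click",
    "iphone:home:timeline:tweet:link:open",
]

matches = filter_events(EVENTS, "timeline")
first = parse_event(matches[0])
print(len(matches), first.client, first.action)
```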
175. RUN AN EXPERIMENT
Develop feature
Track metrics
1. No. of Tweets read
2. No. of Tweets sent
3. No. of Users
4. …
Set bucket size
How many users?
176. RETROSPECTIVE ANALYSIS
Data scientist analyzed 100+ past experiments.
Many useful insights.
- We could move metric A by X% on average.
- Experiment 18 moved metric A the most
- Which team was able to move metric A successfully?
- etc.
Amount of knowledge transfer = slide deck + wiki page.
Reproduce for recent experiments? Manually.
Make results more accessible
and convenient to use.
Automatic
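Automating the retrospective questions can be sketched as a small summary over stored experiment results. Everything here is hypothetical: the records, the field names and the numbers are invented for illustration and are not results from the talk.

```python
from statistics import mean

# Hypothetical results: percent change in "metric A" per past experiment.
experiments = [
    {"id": 17, "team": "growth", "metric_a_change_pct": 0.4},
    {"id": 18, "team": "timeline", "metric_a_change_pct": 1.0},
    {"id": 19, "team": "growth", "metric_a_change_pct": -0.2},
]

# "We could move metric A by X% on average."
average_change = mean(e["metric_a_change_pct"] for e in experiments)

# "Which experiment moved metric A the most?"
top_experiment = max(experiments, key=lambda e: e["metric_a_change_pct"])

# "Which team was able to move metric A successfully?"
successful_teams = {
    e["team"] for e in experiments if e["metric_a_change_pct"] > 0
}

print(round(average_change, 2), top_experiment["id"], sorted(successful_teams))
```

Rerunning the script on new records answers the same questions for recent experiments without manual work.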
180. Metric Mover
I like to move it, move it
Krist Wongsuphasawat, Joseph Liu,
Matthew Schreiner, Andy Schlaikjer, Lucile Lu and Busheng Lou
183. Process
Set OKRs
Implement a feature
Setup experiment
Run experiment
Interpret results
+1.0%
# of posts
Challenges
How easy or hard is it to move this metric?
How much change to aim for?
How much to expect from one experiment?
What were the successful features?
Who had experience with this?
How good is this?
196. Users who watch cat GIFs Users who like cat GIFs Users who post cat GIFs
**These are fake data.**
197. WORKFLOW
Identify needs
Design and prototype
Make it work for sample dataset
Refine, generalize and productionize
Make it work for other cases
Document and release
Maintain and support
Keep it running, feature requests & bug fixes
199. REFINE & POLISH
UX / UI
+ Mobile Support
Color
Animation / Transition
Performance
Loading time, Data file size
“The little of visualisation design” by Andy Kirk
http://www.visualisingdata.com/2016/03/little-visualisation-design/
200. “The first 90% of the code
accounts for the first 90% of the development time.
The remaining 10% of the code
accounts for the other 90% of the development time.”
— Tom Cargill, Bell Labs
205. HOW TO BE BETTER?
Time is limited.
Learn from the past
Expand skills
Get help / Grow the team
Improve tooling
Solve a problem once and for all
Automate repetitive tasks
210. EXPECT…
1. to find the real need
2. to clean data a lot
3. trial and error
4. time for refinement
5. feedback
6. to improve
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
213. ACKNOWLEDGEMENT
My colleagues at Twitter for their collaboration
and support in these projects;
and my wife for taking care of the baby
while I make these slides.