9. Krist Wongsuphasawat / @kristw
Engineering Manager, Data Experience, Airbnb
IBM, Microsoft, Twitter
PhD in Computer Science, Information Visualization, Univ. of Maryland
Computer Engineer, Bangkok, Thailand
23. GOALS
Present data
Communicate information effectively
Analyze data
Exploratory data analysis
Tools to analyze data
Reusable tools for exploration
Enjoy
Combination of above
Questions to ask for each goal
Present data: Who is the audience? What do you want to tell?
Analyze data: What are the questions?
Tools to analyze data: Who will use this? What would they use this for?
Enjoy: Who is the audience?
34. DATA SOURCES
Open data
Publicly available
Internal data
Private, owned by clients’ organization
Self-collected data
Manual, site scraping, etc.
Combine the above
35. DATA FORMAT
Standalone files
txt, csv, tsv, json, Google Docs, …, pdf*
Databases
doesn’t necessarily mean the data is organized
API
better quality with more overhead
Big data*
42. IS THIS CLEAN?
USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4
How many reviews are there? Clean.
How many restaurants are there? Not clean.
McDonald, McDonald’s, and McDonalds are the same restaurant spelled three ways.
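A minimal sketch of why the same table is clean for one question and not for the other. The rows are the ones on the slide; the `normalize` helper and the hand-built `ALIASES` table are assumptions for illustration, not a general-purpose name matcher.

```python
import re
from collections import Counter

# Rows from the slide: (user, restaurant, rating).
reviews = [
    ("A", "MCDONALD’S", 3),
    ("B", "MCDONALDS", 3),
    ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5),
    ("E", "IHOP", 4),
    ("F", "SUBWAY", 4),
]

# Hypothetical alias table, written by hand after inspecting the data.
ALIASES = {"MCDONALD": "MCDONALDS"}


def normalize(name: str) -> str:
    """Uppercase, drop everything except letters and digits, then apply aliases."""
    key = re.sub(r"[^A-Z0-9]", "", name.upper())
    return ALIASES.get(key, key)


raw_names = {restaurant for _, restaurant, _ in reviews}
clean_counts = Counter(normalize(restaurant) for _, restaurant, _ in reviews)

print(len(reviews))       # number of reviews: no cleaning needed
print(len(raw_names))     # distinct names before cleaning
print(len(clean_counts))  # distinct restaurants after cleaning
```

Counting reviews works on the raw rows, while counting restaurants only gives a sensible answer after the names are normalized.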
50. GETTING BIG DATA
Data Warehouse (lots of machines)
Tool: Spark, Hadoop, etc. (slow)
Smaller dataset (your laptop)
Tool: node.js / python / excel (fast)
Final dataset
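The second, fast step on the laptop might look like the sketch below: read the smaller dataset one row at a time and aggregate it into the final dataset. The column names (`tweet_id`, `character`) and the rows are made up for illustration; the slides do not show the real schema.

```python
import csv
import io
from collections import Counter


def aggregate(rows, key_field):
    """Count rows per value of key_field, reading one row at a time."""
    counts = Counter()
    for row in rows:
        counts[row[key_field]] += 1
    return counts


# Stand-in for the smaller dataset exported from the warehouse (made-up rows).
smaller_dataset = io.StringIO(
    "tweet_id,character\n"
    "1,Hodor\n"
    "2,Jon Snow\n"
    "3,Hodor\n"
)

final_dataset = aggregate(csv.DictReader(smaller_dataset), "character")
print(final_dataset.most_common())
```

In practice `io.StringIO` would be replaced by an open file; the aggregation stays the same.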
51. CHALLENGES
Slow
Long processing time (hours)
Get relevant Tweets
keywords: “parasite” (movie name)
Too big
Need to aggregate & reduce size
Harder to spot problems
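A rough sketch of the "relevant Tweets" problem: the keyword alone also matches tweets that are not about the movie, so a second check is needed before aggregating. The context words and the sample tweets below are assumptions made up for illustration, not the filter actually used.

```python
import re
from collections import Counter

KEYWORD = re.compile(r"\bparasite\b", re.IGNORECASE)
# Hypothetical context words suggesting the tweet is about the movie.
CONTEXT = re.compile(r"\b(movie|film|oscars?)\b", re.IGNORECASE)


def is_relevant(text):
    """Keep a tweet only if it has the keyword and at least one context word."""
    return bool(KEYWORD.search(text)) and bool(CONTEXT.search(text))


# Made-up tweets: (day, text).
tweets = [
    ("2020-02-09", "Parasite wins best picture at the Oscars!"),
    ("2020-02-09", "My dog has a parasite, off to the vet"),
    ("2020-02-10", "Finally watched the movie Parasite"),
]

# Aggregate to one count per day to reduce the size of the result.
per_day = Counter(day for day, text in tweets if is_relevant(text))
print(per_day)
```

A stricter filter drops some relevant tweets and a looser one lets noise through, so it is worth inspecting a sample of both the kept and the rejected tweets.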
55. RECOMMENDATIONS
Always assume that you will have to do it again
document the process, automate it
Reusable scripts
break large scripts into smaller ones
Reusable data
keep for future projects
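One way to read "break large scripts into smaller ones" is sketched below: each step is a small function that can be rerun or reused alone, and a short `run` function wires them together. The function names and the `name,count` input format are hypothetical.

```python
def load(lines):
    """Step 1: parse 'name,count' lines into (name, count) tuples."""
    rows = []
    for line in lines:
        name, count = line.split(",")
        rows.append((name, int(count)))
    return rows


def keep_top(rows, n):
    """Step 2: keep the n rows with the highest counts."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def to_report(rows):
    """Step 3: format rows as one 'name: count' line each."""
    return "\n".join(f"{name}: {count}" for name, count in rows)


def run(lines, n=2):
    """Whole process in one place, so doing it again is a single call."""
    return to_report(keep_top(load(lines), n))


print(run(["a,1", "b,3", "c,2"]))
```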
59. TIPS
Don’t give up.
If stuck, take a break. Look for inspiration.
The vis that gives you insights may or may not be the vis for sharing.
Exploration vs. Communication
Keep it as simple as possible
but not simpler.
Set deadlines
71. HBO’S GAME OF THRONES
Based on the book series “A Song of Ice and Fire”
Medieval Fantasy. Knights, magic and dragons.
Many characters.
Anybody can die.
8 seasons
Multiple storylines in each episode
79. SAMPLE DATA
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
Bran Stark 3000
… …
*These numbers are made up for presentation, not real data.
85. SAMPLE DATA
INDIVIDUALS (+ top emojis)
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
… …
CONNECTIONS (+ top emojis)
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
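A sketch of how two tables like these could be produced, assuming each count is the number of tweets that mention a character or a pair of characters (the slide does not say how the counts were computed). The input list is hypothetical and made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of characters detected in each tweet.
mentions_per_tweet = [
    {"Jon Snow", "Sansa"},
    {"Hodor"},
    {"Bran Stark", "Hodor"},
    {"Jon Snow"},
]

individuals = Counter()
connections = Counter()
for characters in mentions_per_tweet:
    individuals.update(characters)
    # Sort first so each pair gets one canonical key, e.g. "Bran Stark+Hodor".
    for a, b in combinations(sorted(characters), 2):
        connections[f"{a}+{b}"] += 1

print(individuals)
print(connections)
```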
86. GRAPH
NODES (+ top emojis)
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
… …
LINKS (+ top emojis)
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
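The nodes and links above can be turned into a weighted graph along these lines. This is a minimal sketch using the made-up sample numbers from the slide; the adjacency-dictionary layout is an assumption, not the structure used in the actual project.

```python
# Sample rows from the slide (made-up numbers, not real data).
nodes = {"Hodor": 10000, "Jon Snow": 5000, "Daenerys": 4000}
links = {"Jon Snow+Sansa": 1000, "Tormund+Brienne": 500, "Bran Stark+Hodor": 300}

# adjacency[a][b] is the weight of the undirected link between a and b.
adjacency = {name: {} for name in nodes}
for pair, weight in links.items():
    source, target = pair.split("+")
    # Characters that appear only in a link still need a node entry.
    adjacency.setdefault(source, {})[target] = weight
    adjacency.setdefault(target, {})[source] = weight

for name in sorted(adjacency):
    print(name, adjacency[name])
```

Characters with no links, such as Daenerys in this sample, remain in the graph as isolated nodes.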
107. “The first 90% of the code
accounts for the first 90% of the development time.
The remaining 10% of the code
accounts for the other 90% of the development time.”
— Tom Cargill, Bell Labs
108. REFINE & POLISH
Color
UX / UI + Mobile Support
Animation / Transition
Metadata for SEO
Social media preview images
Performance
Loading time, Data file size
129. WHAT COULD HAVE BEEN BETTER?
If I knew how to do XXX…
Learning opportunities
If I had someone who could do XXX…
Look for help
Grow the team
If I did not have to do the same tasks again…
Reusable components
Automate repetitive tasks
135. WHAT I TELL MYSELF BEFORE VISUALIZING
1. Expect to find the real need
2. Expect to clean data a lot
3. Prepare to iterate again & again
4. Reserve time for refinement
5. Plan for feedback
6. Look back to move forward
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
136. ACKNOWLEDGEMENT
My former and current colleagues at Twitter and Airbnb
for their collaboration and support in these projects;
and my wife for taking care of our two kids
while I made these slides.