Guest lecture at Prof. David Gotz's UNC Chapel Hill INLS 690 Visual Analytics class (Given remotely) on Nov 10, 2015.
Many demos can also be accessed from interactive.twitter.com and kristw.yellowpigz.com
This document summarizes Krist Wongsuphasawat's work in data visualization at Twitter. It describes how he obtains tweet data, visualizes it using tools like R and D3 to show trends over time, locations, and text. Examples include visualizations of events like the World Cup and State of the Union. The process involves getting relevant data, visualizing it, evaluating the results, and iterating. The goal is to transform big Twitter data into smaller, insightful visualizations that tell stories.
Using Visualizations to Monitor Changes and Harvest Insights from a Global-sc..., by Krist Wongsuphasawat
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services.
Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization.
This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights.
In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types.
Two interactive visualizations were developed for these purposes. We discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
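The funnel analysis mentioned in the abstract can be illustrated with a small sketch: given per-user event logs, count how many users reach each step of a funnel in order. This is an invented example in Python (the event names and data are hypothetical, not Twitter's actual log schema), just to make the concept concrete.

```python
def funnel_counts(logs, steps):
    """logs: dict of user -> ordered list of event names.
    Returns how many users reach each funnel step in sequence."""
    counts = [0] * len(steps)
    for events in logs.values():
        pos = 0
        for e in events:
            if pos < len(steps) and e == steps[pos]:
                counts[pos] += 1
                pos += 1
    return counts

logs = {
    "u1": ["open_app", "compose", "send_tweet"],
    "u2": ["open_app", "compose"],
    "u3": ["open_app", "scroll", "compose", "send_tweet"],
    "u4": ["compose"],  # never opened the app first, so drops out at step 1
}
print(funnel_counts(logs, ["open_app", "compose", "send_tweet"]))  # [3, 3, 2]
```

With tens of thousands of event types, the hard part the paper addresses is not this counting but letting analysts find and choose the right steps interactively.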
In this talk, I reflect on the tasks commonly involved in crafting visualizations and show examples of different applications of information/data visualization. Along the way I will share my workflow, point out common pitfalls, and provide recommendations.
These slides were from my guest lecture in the InfoVis class at the UC Berkeley iSchool on Apr 11, 2016. Thank you, Prof. Marti Hearst, for the invitation.
The document describes Krist Wongsuphasawat's background and work in data visualization. It notes that he has a PhD in Computer Science from the University of Maryland, where he studied information visualization. He currently works as a data visualization scientist at Twitter, where he builds internal tools to analyze log data and monitor changes over time. Some of his projects include Scribe Radar, which allows users to search through and visualize client event data in order to find patterns and monitor effects of product changes. The document provides details on his approaches for dealing with large log datasets and visualizing user activity sequences.
Making Sense of Millions of Thoughts: Finding Patterns in the Tweets, by Krist Wongsuphasawat
I gave this presentation at Workshop on Interactive Language Learning, Visualization, and Interfaces / ACL 2014 in Baltimore, MD on June 27, 2014.
http://nlp.stanford.edu/events/illvi2014/index.html
ABSTRACT
Every day on Twitter, millions of thoughts are captured and shared with the world in the form of 140-character messages, or Tweets. There are many things we could learn from these thoughts if we could figure out a way to digest this gigantic dataset. Visualization is one of the many ways to extract information from these Tweets. In this presentation, I will talk about several visualizations based on Tweets, as well as share experiences and challenges from working with Tweet data.
Slides from the VIS in practice panel "Increasing the Impact of Visualization Research" during IEEE VIS 2017 in Phoenix, AZ. http://www.visinpractice.rwth-aachen.de/panel.html
This document discusses expectations when visualizing data and creating visualizations. It covers 6 main points:
1. Expect to find the real need by understanding audience goals, questions, and intended use of the visualization. Compromise may be needed.
2. Expect to spend significant time (70-80%) cleaning data due to issues like multiple data sources and formats, missing values, and errors.
3. Expect trial and error in the prototyping process to solve problems and meet deadlines. Iteration is important.
4. For larger datasets, expect challenges in processing, analyzing, and reducing size to find relevant insights. Tools like Hadoop can help handle bigger data.
5.
The document analyzes Twitter data related to two contrasting events in July 2011: the Norway attacks and Amy Winehouse's death. It finds that the Norway attacks received more tweets initially but interest declined more gradually, while Winehouse's death sparked a large initial volume of tweets that fell off steeply. Both events saw mostly neutral tweets, with negative tweets outnumbering others for the Norway attacks later on. The analysis used Apache Hadoop and Hive to process Twitter data on Amazon EC2, identifying challenges around duplicates, parsing errors, and incomplete data recovery.
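The kind of aggregation the Hive jobs above perform at scale can be sketched in miniature: deduplicate tweets by id, then count daily volume for a keyword. The records below are invented for illustration; the real pipeline ran on Hadoop/Hive over full Twitter data.

```python
from collections import Counter

tweets = [
    {"id": 1, "date": "2011-07-23", "text": "Shocked by the news from Norway"},
    {"id": 1, "date": "2011-07-23", "text": "Shocked by the news from Norway"},  # duplicate record
    {"id": 2, "date": "2011-07-23", "text": "RIP Amy Winehouse"},
    {"id": 3, "date": "2011-07-24", "text": "Thoughts with Norway today"},
]

seen = set()
daily = Counter()
for t in tweets:
    if t["id"] in seen:
        continue  # drop duplicates, one of the data-quality issues noted above
    seen.add(t["id"])
    if "norway" in t["text"].lower():
        daily[t["date"]] += 1

print(dict(daily))  # {'2011-07-23': 1, '2011-07-24': 1}
```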
"Apache Spark™ is a fast and general engine for large-scale data processing."" Above statement is taken from Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100 % true, tell still little to the wondering noob.
Why take interest in Apache Spark? Apache Spark promise being up to 100x faster than Hadoop MapReduce in certain scenarios. It provide comprehensible programming model (familiar to everyone who is used to functional programming) and vast ecosystem of tools.
In my talk I will try to reveal secrets of Apache Spark for the very beginners.
We will do first quick introduction to the set of problems commonly known as BigData: what they try to solve, what are their obstacles and challenges and how those can be addressed. We will quickly take a pick on MapReduce: theory and implementation. We will then move to Apache Spark. We will see what was the main factor that drove its creators to introduce yet another large-scala processing engine. We will see how it works, what are its main advantages. Presentation will be mix of slides and code examples.
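The MapReduce model the talk introduces can be shown as a toy word count: map each line to (word, 1) pairs, shuffle by key, then reduce each group by summing. This is a pure-Python sketch of the idea, not Hadoop or Spark code.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map step: emit (word, 1) for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # sorting by key stands in for the framework's shuffle step
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in group)
            for k, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big problems", "big wins"]))
print(counts)  # {'big': 3, 'data': 1, 'problems': 1, 'wins': 1}
```

Spark's contribution is largely about keeping intermediate results like these in memory across chained operations instead of writing them to disk between jobs.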
Slides for the sixth meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Slides for the first meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Building a Graph-based Analytics Platform, by Kenny Bastani
Meetup is a valuable source of data for understanding trends around products or brands. Meetup does not provide an analytics package to track group statistics over time unless you are an administrator of a group. There are no third-party tools or websites that analyze Meetup trends to understand how communities grow.
In this talk I will present a graph-based analytics platform that uses the Meetup.com API to collect and analyze membership statistics over time.
This talk will cover:
How to poll and import periodic data from the Meetup.com API into Neo4j using Node.js.
How to track Meetup group growth over time with a Neo4j graph database using Node.js.
How to apply tags to meetup groups and report combined growth of all groups over time.
How to build an interactive documented analytics API to support applications using Node.js and Neo4j.
How to build a business dashboard to visualize time-based statistics and reports using a Node.js based REST API that queries Neo4j.
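The talk's pipeline uses Node.js; as a language-neutral sketch of the "poll, then import into Neo4j" step, here is how a polled membership snapshot could be turned into a Cypher MERGE statement. The node labels, property keys, and group name are all invented for illustration, not taken from the talk.

```python
def snapshot_to_cypher(group, day, members):
    """Build an idempotent Cypher statement recording a group's
    member count on a given day (hypothetical schema)."""
    return (
        f"MERGE (g:Group {{name: '{group}'}}) "
        f"MERGE (d:Day {{date: '{day}'}}) "
        f"MERGE (g)-[s:STATS_ON]->(d) "
        f"SET s.members = {members}"
    )

stmt = snapshot_to_cypher("Graph Database Meetup", "2014-10-22", 512)
print(stmt)
```

Because MERGE is idempotent, re-running the poller for the same day updates the count instead of creating duplicate nodes, which keeps the time series clean.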
Webinar: Mastering Python - An Excellent tool for Web Scraping and Data Anal..., by Edureka!
The free webinar on Python titled "Mastering Python - An Excellent tool for Web Scraping and Data Analysis" was conducted by Edureka on 14th November 2014
Google Desktop is desktop search software that indexes files on a computer and allows users to search emails, files, music, photos and more from a sidebar. It features file indexing, a sidebar with gadgets for email, notes, photos, news and weather, and quick searching across the computer from the sidebar or taskbar. Google Desktop runs on Mac OS X, Linux and Windows and continues to index files in the background as they change.
The document discusses web scraping and outlines a step-by-step process for scraping comments from a Dutch website called GeenStijl. It begins with using regular expressions to scrape the comments, but notes that existing parsers can make the process more elegant, especially for complex websites. It then demonstrates using the lxml module and XPath to scrape reviews from another site in a more structured way. The document provides remarks on regular expressions and XPath, and encourages exploring different scraping techniques.
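The first step described above, scraping comments with regular expressions, can be shown in miniature. The HTML snippet and class name here are invented; as the slides note, a real parser (such as lxml with XPath) is more robust for complex, messy pages.

```python
import re

html = """
<div class="comment">Eerste!</div>
<div class="comment">Mooi stuk.</div>
<div class="other">navigation</div>
"""

# non-greedy capture between the opening and closing tags
comments = re.findall(r'<div class="comment">(.*?)</div>', html)
print(comments)  # ['Eerste!', 'Mooi stuk.']
```

A regex like this breaks as soon as the markup changes (extra attributes, nested tags), which is exactly the argument the document makes for switching to a proper parser.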
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
This document summarizes a presentation on unsupervised and supervised machine learning techniques for automated content analysis. It recaps types of automated content analysis, describes unsupervised techniques like principal component analysis (PCA) and latent Dirichlet allocation (LDA), and supervised machine learning techniques like regression. It provides examples of applying these techniques to cluster Facebook messages and predict newspaper reading. The document concludes by noting the presenter will use a portion of labeled data to estimate models and check predictions against the remaining labeled data.
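The holdout idea at the end of that summary, estimating a model on part of the labeled data and checking predictions on the rest, can be sketched with a deliberately trivial majority-class "model". The labels and the 75/25 split are invented for illustration.

```python
from collections import Counter

labels = ["politics", "sports", "politics", "politics",
          "sports", "politics", "politics", "sports"]
split = int(len(labels) * 0.75)
train, test = labels[:split], labels[split:]

# the "model": always predict the most common training label
majority = Counter(train).most_common(1)[0][0]
accuracy = sum(1 for y in test if y == majority) / len(test)
print(majority, accuracy)
```

Any real supervised classifier would replace the majority rule, but the estimate-on-train, score-on-held-out-test loop is the same.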
This document provides an overview and objectives of a course on web scraping and analytics with Python. The course covers web scraping concepts and the BeautifulSoup package for scraping websites. It also demonstrates scraping an IMDB webpage to extract movie data and the PyDoop package for performing analytics on large datasets with Hadoop using Python. Examples of preprocessing text data with NLTK on Hadoop are also provided.
This document provides an overview of a presentation on automated content analysis using regular expressions and natural language processing. The presentation covers topics like bottom-up vs top-down analysis, what regular expressions are and how they can be used in Python, stemming, parsing sentences, and combining techniques like stemming and stopword removal. Examples are given on using regular expressions to count actors in articles and check the number of a document from LexisNexis. The takeaway message is about an upcoming take-home exam and future meetings.
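The "count actors in articles with regular expressions" example mentioned above might look like this; the articles and actor names are invented, and word-boundary patterns keep "Obama" from matching inside longer tokens.

```python
import re

articles = [
    "Obama met Merkel in Berlin. Merkel welcomed Obama warmly.",
    "Merkel spoke to the press.",
]

actors = {"Obama": r"\bObama\b", "Merkel": r"\bMerkel\b"}
counts = {name: sum(len(re.findall(pat, a)) for a in articles)
          for name, pat in actors.items()}
print(counts)  # {'Obama': 2, 'Merkel': 3}
```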
This document provides information about Google Search and search engines. It discusses how search engines work and lists some of the major search engines. It provides background on Google's founding and growth. The document outlines several search operators and tips for using them, such as using quotation marks, AND/OR, wildcards, and excluding terms. It also discusses searching specific file types, unit conversions directly in search, and searching offline without an internet connection.
The Sourcecon webinar slides delivered by Andy Headworth from http://sironaconsulting.com/ on 22nd October 2014. It is about using Twitter and Google Plus to source candidates.
It covers sourcing individuals on both Google+ and Twitter as well as sourcing candidates from Communities and Twitter Lists.
600+ SEARCHABLE Sourcing Tools compiled by Susanna Frazier (@ohsusannamarie)
This document provides a list of over 600 sourcing tools categorized by their functions. It describes each tool's name, current version, category and a brief description. The tools cover a wide range of functions including search, social media, email, documents, scheduling and more. They allow users to easily access information, automate tasks and integrate various online services.
Big Graph Analytics on Neo4j with Apache Spark, by Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big graphs that are exported from Neo4j and subsequently updated with the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
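To show what a GraphX PageRank job computes at scale, here is power-iteration PageRank in pure Python on a tiny invented directed graph with no dangling nodes (every node has at least one outgoing edge, so no special handling is needed).

```python
def pagerank(edges, nodes, damping=0.85, iters=50):
    """Iteratively distribute each node's rank along its out-edges."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_deg = {n: sum(1 for s, _ in edges if s == n) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for s, t in edges:
            new[t] += damping * rank[s] / out_deg[s]
        rank = new
    return rank

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges, nodes)
print(max(ranks, key=ranks.get))  # 'c' is linked to by both a and b
```

GraphX runs essentially this iteration, but with the rank updates partitioned and exchanged across the cluster instead of held in one dictionary.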
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
Effective and efficient Google searching PowerPoint tutorial, by Jaclyn Lee Parrott
This document provides guidance on effective Google searching. It discusses Google's mission to organize the world's information and make it accessible. It also notes that Google profiles users to target advertising and its products may change. The document then provides examples of basic Google searches and demonstrates more advanced search techniques. It stresses evaluating sources and avoiding plagiarism. Finally, it includes an exercise for readers to practice advanced Google searches.
Open Source Search Tools for the www2010 conference, by Ted Drake
Presentation by Ted Drake and Rosie Jones for the www2010 conference in North Carolina. It discusses open source search software, APIs, and trends.
The Midterm Presentation of the Con-Action project represents the combined effort of all 20 members of the Digital Media Master projects DaVisMo (data visualization and student mobility) and Confetti (embodied conversational agents) in 2009/2010 at the University of Bremen and the University of the Arts Bremen in Germany.
The accompanying videos can be found at: http://vimeo.com/channels/digitalmedia
Delineating Cancer Genomics through Data Visualization, by Rupam Das
In spite of advances in technologies for working with data, people spend an undue amount of time understanding the data and manipulating it into a holistic visualization. Data visualization software for complex datasets, such as those in cancer genomics (which we have taken as a case study), is not able to provide effective visualization for users. Identification and characterization of cancer are important areas of research that are based on the integrated analysis of multiple heterogeneous genomics datasets. In this report, we review the key issues and challenges associated with cancer genomics through an exploration of data visualization techniques, interactions, and methods, which will in turn advance the state of the art.
"Apache Spark™ is a fast and general engine for large-scale data processing."" Above statement is taken from Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100 % true, tell still little to the wondering noob.
Why take interest in Apache Spark? Apache Spark promise being up to 100x faster than Hadoop MapReduce in certain scenarios. It provide comprehensible programming model (familiar to everyone who is used to functional programming) and vast ecosystem of tools.
In my talk I will try to reveal secrets of Apache Spark for the very beginners.
We will do first quick introduction to the set of problems commonly known as BigData: what they try to solve, what are their obstacles and challenges and how those can be addressed. We will quickly take a pick on MapReduce: theory and implementation. We will then move to Apache Spark. We will see what was the main factor that drove its creators to introduce yet another large-scala processing engine. We will see how it works, what are its main advantages. Presentation will be mix of slides and code examples.
Slides for the sixth meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Slides for the first meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
Building a Graph-based Analytics PlatformKenny Bastani
Meetup is a valuable source of data for understanding trends around products or brands. Meetup does not support an analytics package to track group statistics overtime unless you are an administrator of a group. There are no third-party tools or websites that analyze Meetup trends to understand how communities grow.
In this talk I will present a graph-based analytics platform that uses the Meetup.com API to collect and analyze membership statistics over time.
This talk will cover:
How to poll and import periodic data from the Meetup.com API into Neo4j using Node.js.
How to track meetup group growth over time using a Neo4j graph database using Node.js.
How to apply tags to meetup groups and report combined growth of all groups over time.
How to build an interactive documented analytics API to support applications using Node.js and Neo4j.
How to build a business dashboard to visualize time-based statistics and reports using a Node.js based REST API that queries Neo4j.
Webinar: Mastering Python - An Excellent tool for Web Scraping and Data Anal...Edureka!
The free webinar on Python titled "Mastering Python - An Excellent tool for Web Scraping and Data Analysis" was conducted by Edureka on 14th November 2014
Google Desktop is desktop search software that indexes files on a computer and allows users to search emails, files, music, photos and more from a sidebar. It features file indexing, a sidebar with gadgets for email, notes, photos, news and weather, and quick searching across the computer from the sidebar or taskbar. Google Desktop runs on Mac OS X, Linux and Windows and continues to index files in the background as they change.
The document discusses web scraping and outlines a step-by-step process for scraping comments from a Dutch website called GeenStijl. It begins with using regular expressions to scrape the comments, but notes that existing parsers can make the process more elegant, especially for complex websites. It then demonstrates using the lxml module and XPath to scrape reviews from another site in a more structured way. The document provides remarks on regular expressions and XPath, and encourages exploring different scraping techniques.
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
This document summarizes a presentation on unsupervised and supervised machine learning techniques for automated content analysis. It recaps types of automated content analysis, describes unsupervised techniques like principal component analysis (PCA) and latent Dirichlet allocation (LDA), and supervised machine learning techniques like regression. It provides examples of applying these techniques to cluster Facebook messages and predict newspaper reading. The document concludes by noting the presenter will use a portion of labeled data to estimate models and check predictions against the remaining labeled data.
This document provides an overview and objectives of a course on web scraping and analytics with Python. The course covers web scraping concepts and the BeautifulSoup package for scraping websites. It also demonstrates scraping an IMDB webpage to extract movie data and the PyDoop package for performing analytics on large datasets with Hadoop using Python. Examples of preprocessing text data with NLTK on Hadoop are also provided.
This document provides an overview of a presentation on automated content analysis using regular expressions and natural language processing. The presentation covers topics like bottom-up vs top-down analysis, what regular expressions are and how they can be used in Python, stemming, parsing sentences, and combining techniques like stemming and stopword removal. Examples are given on using regular expressions to count actors in articles and check the number of a document from LexisNexis. The takeaway message is about an upcoming take-home exam and future meetings.
This document provides information about Google Search and search engines. It discusses how search engines work and lists some of the major search engines. It provides background on Google's founding and growth. The document outlines several search operators and tips for using them, such as using quotation marks, AND/OR, wildcards, and excluding terms. It also discusses searching specific file types, unit conversions directly in search, and searching offline without an internet connection.
The Sourcecon webinar slides delivered by Andy Headworth from http://sironaconsulting.com/ on 22nd October 2014. It is about using Twitter and Google Plus to source candidates.
It covers sourcing individuals on both Google+ and Twitter as well as sourcing candidates from Communities and Twitter Lists.
600+ SEARCHABLE Sourcing Tools compiled by Susanna Frazier @ohsusannamarieSusanna Frazier
This document provides a list of over 600 sourcing tools categorized by their functions. It describes each tool's name, current version, category and a brief description. The tools cover a wide range of functions including search, social media, email, documents, scheduling and more. They allow users to easily access information, automate tasks and integrate various online services.
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
Effective and efficient google searching power point tutorialJaclyn Lee Parrott
This document provides guidance on effective Google searching. It discusses Google's mission to organize the world's information and make it accessible. It also notes that Google profiles users to target advertising and its products may change. The document then provides examples of basic Google searches and demonstrates more advanced search techniques. It stresses evaluating sources and avoiding plagiarism. Finally, it includes an exercise for readers to practice advanced Google searches.
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Ted Drake
Presentation by Ted DRAKE and Rosie JONES for the www2010 conference in North Carolina. This discusses the open source search software, APIs and trends.
The Midterm Presentation of the Con-Action project represents the combined effort of all 20 members of the Digital Media Master projects DaVisMo (data visualization and student mobility) and Confetti (embodied conversational agents) in 2009/2010 at the University of Bremen and the University of the Arts Bremen in Germany.
The accompanying videos can be found at: http://vimeo.com/channels/digitalmedia
Delineating Cancer Genomics through Data VisualizationRupam Das
In spite in advances in technologies for working with data, people spend undue amount of time in understanding the data and manipulating it into holistic visualization. Data visualization software for complex dataset such as in cancer genomics (which we have taken as case study) are not able to provide effective visualization for the users. Identification and characterization of cancer detection are important areas of research that are based on the integrated analysis of multiple heterogeneous genomics datasets. In this report, we review the key issues and challenges associated with cancer genomics through exploration of data visualization techniques, interactions and methods, which will in-turn advance the state of the art.
This document provides an overview and agenda for a data visualization capstone project for a social networking platform for business families. It discusses the client and problem to be solved through developing a new family tree visualization. The document then covers the project goals, context diagram, challenges in deciding the visual model, research conducted, experimentation, requirements management, and quality attributes to guide the architecture and development of the new family tree system.
Information Visualization for Knowledge Discovery: An Introduction, by Krist Wongsuphasawat
This document provides an introduction to information visualization and its role in knowledge discovery. It discusses the challenges of understanding large datasets and how information visualization techniques like scatter plots, maps, and interactive visualizations can help identify patterns, trends, outliers and support communication and discovery. Examples of information visualization tools and techniques are presented across different data types like temporal, hierarchical, and network data.
This document discusses research on applying text mining and information retrieval techniques for fact finding in regulatory investigations from electronic documents. The researchers are developing methods for semantic search in e-discovery to iteratively retrieve relevant evidence from emails, forums, and other sources by integrating structural context and extracting knowledge from unstructured text. Their current work includes using Twitter mining as a form of conversational search and entity linking to semantically enrich documents.
This document discusses data visualization in Python and Django. It provides motivation for representing business analytic data graphically using charts and diagrams. It describes sources of data, preprocessing data, and categorizing data as real-time or batch-based. Visualization can be done on the server or client. Tools are discussed for data analysis and visualization libraries like Matplotlib are mentioned. Appendices provide code examples for scatter plots, loading data from databases, and refreshing views.
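The "load data from a database, then plot" pattern that the summary's appendices cover can be sketched with the standard library alone: pull (x, y) pairs out of SQLite, ready to hand to Matplotlib's scatter() on the server. The table and column names here are invented for illustration.

```python
import sqlite3

# in-memory database standing in for the app's real backing store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(10, 110), (20, 190), (30, 320)])

rows = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()
xs, ys = zip(*rows)  # two tuples, directly usable as scatter-plot axes
print(xs, ys)
```

From here, `matplotlib.pyplot.scatter(xs, ys)` on the server (or shipping the pairs as JSON to a client-side library) covers both rendering options the document contrasts.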
The document discusses collaboration between artists and scientists and suggests that great collaboration takes both art and science. It recommends helping the artists and scientists on your team work together. It also mentions that you can try Mindjet for free for 30 days.
Collaboration: A hands-on demo using Confluence wikiSarah Maddox
A Scriptorium webinar about technical communication, collaboration and Confluence wiki. This slide deck includes screenshots of the parts of the demo that were live on the wiki during the presentation.
d3Kit is a set of tools that speeds up development of D3-related projects. It is a lightweight library that handles the basic groundwork tasks you need when building visualizations with D3.
Presentation on the Art of Visual Thinking and the application in the Visual Practice. Why and How it works. Presentation made at Innovation in Mind 2012 and for EMBA program at University of Geneva. For more information on research on this topic go to ForbesOste.com
Overview of Confluence and its features and how it is useful for enterprises. Updated with new social features in Confluence 3.0 and SharePoint Integration
This document summarizes Damian Trilling's workshop on analyzing big Twitter data. The workshop covers collecting Twitter data using yourTwapperkeeper, formatting it as CSV files, and writing a Python script to analyze the data. Trilling demonstrates a Python script that identifies tweets mentioning Poland by searching tweet texts and counting matches. Participants are then instructed to download example files and write their own analysis script.
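The workshop's counting script is not reproduced in this summary, so here is a minimal sketch of the same idea; the CSV layout and the `text` column name are assumptions for illustration:

```python
import csv

def count_mentions(csv_path, keyword):
    """Count tweets whose text mentions the keyword (case-insensitive)."""
    matches = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumes the exported CSV has a "text" column with the tweet body
            if keyword.lower() in row.get("text", "").lower():
                matches += 1
    return matches
```

Participants could then adapt the match condition (e.g., regular expressions or multiple keywords) for their own analysis scripts.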
This document provides a tour of data visualization and its relationship to data science. It discusses how visualization can turn data into valuable insights through exploratory data analysis and storytelling. Examples are given of how visualization has been used at Twitter to analyze Ballon d'Or voting data, New Year's tweets, user activity logs, and user sessions to glean insights. Visualization is described as taking data and turning it into visual displays and interactive tools to help audiences understand large amounts of information and for communicating known facts or exploring data.
Microsoft Graph is the rich, robust API for an increasing number of products across Microsoft. Microsoft Graph has a large footprint of tools, SDKs, and API capabilities you can incorporate in your projects. Come see what's new across products and available for developers -- you'll take away code and tools you'll undoubtedly use as you build apps and services.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
Unlock your Big Data with Analytics and BI on Office 365Brian Culver
Companies have huge amounts of data waiting to be explored. With Azure HDInsights you can realize the value of your data. With Microsoft Excel 2013 and Office 365, you have a complete platform for BI solutions and services. Power BI allows companies to manipulate and study a variety of data points, gain actionable insights and share their insights. PowerPivot, Power View, Power Query, Power Map and Power BI Sites let users analyze and make decisions using structured and unstructured data.
Attendee Takeaways:
1. Learn to setup and configure HDInsights on Microsoft Azure.
2. Understand how to use Excel for BI capabilities.
3. Build a BI Dashboard in Office365.
Want more apps to be built on your open data? Discover ways to make data more developer friendly. We will look at the history of the Internet and current trends to build an understanding of standards and interfaces to make your data future friendly. At the same time it will make it more useful to developers, citizens and your own organization.
Matthew Russell's "Unleashing Twitter Data for Fun and Insight" presentation from Strata 2011. See http://strataconf.com/strata2011/public/schedule/detail/17714 for an overview of the talk.
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence during the rollout that validated some of their decisions and are also shaping the future roadmap of these efforts.
Architecting for change: LinkedIn's new data ecosystemYael Garten
2016 StrataHadoop NYC conference talk.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52182
Abstract:
Last year, LinkedIn embarked on an ambitious mission to completely revamp the mobile experience for its members. This would mean a completely new mobile application, reimagined user experiences, and new interaction concepts. As the team evaluated the impact of this big rewrite on the data analytics ecosystem, they observed a few problems.
Over the past few years, LinkedIn has become extremely good at incrementally changing the site one mini-feature at a time, often in conjunction with hundreds of other incremental changes. LinkedIn’s experimentation platform ensures that it is always monitoring a wide gamut of impacted metrics with every change before rolling fully forward. However, when it comes to rolling out a big change like this, different challenges crop up. You have to roll out the entire application all at once; the new experience means that you have no baseline on new metrics; and existing metrics may see double-digit changes just because of the new experience or because the metric’s logic is no longer accurate—the challenge is in figuring out which is which.
This document summarizes a presentation about building a real-time analytics API at scale using Citus, an open-source PostgreSQL extension. The presentation discusses how Algolia moved from ElasticSearch to Citus to enable sub-second analytics queries on billions of events per day. Key points include how Algolia configured Citus to shard and distribute data across clusters, used roll-up tables to aggregate raw events into aggregated metrics on different time intervals, and could perform queries on these aggregated tables with sub-800ms latency at scale. The approach using Citus as the foundation has proven successful for Algolia's analytics needs.
B365 saturday practical guide to building a scalable search architecture in s...Thuan Ng
This document outlines Thuan Nguyen's presentation on building a scalable search architecture in SharePoint 2013. The presentation covers common misunderstandings about search architecture, the logical components of search, and a practical guide to assessing needs, designing, implementing, and verifying a scalable search solution. It provides examples of sample search architectures for different volumes of content and use cases. The document concludes with references and a call for questions.
Usually, DataOps means applying DevOps principles to existing data analytics projects. We accidentally reversed it, taking a DevOps initiative and catalyzing adoption of data-driven practices across our company.
What started as a practical initiative to bring better reliability and visibility to our software product had the unexpected effect of catalyzing a transformation that helped our organization become more data-driven across the company. What we learned in the process was how and why DevOps principles can naturally expand the role of a traditional operations team and bring wider culture change to the organization.
Georgi Kobilarov presented on the status and future of DBpedia. DBpedia extracts structured data from Wikipedia and makes it available as linked open data. Current challenges include improving data quality, handling live Wikipedia updates, adding other data sources, and developing a new approach for infobox extraction using a domain-specific ontology. The vision is for DBpedia to become the Wikipedia of structured data and enable users and applications to access and query this data without having to understand its technical implementation.
Similar to Adventure in Data: A tour of visualization projects at Twitter (20)
“Which visualization library should I use?” Typically, making this decision is not about whether one library is “better” than another, but whether the specific library is more suitable for what the developer is trying to achieve. To answer this question thoroughly, we need to better understand the design space of visualization libraries. The talk will give a tour of many kinds of visualization libraries on the web across the design space, while explaining the framework and design philosophy that the audience can learn along the way. The audience will expand their horizons and become more aware of the wide universe of libraries. The next time they come across a new package, they can use this framework as a lens to analyze its offerings and how it differs from or resembles the libraries they already know.
Encodable: Configurable Grammar for Visualization ComponentsKrist Wongsuphasawat
There are so many libraries of visualization components nowadays, with APIs that often differ from one another. Could these components be more similar, both in terms of the APIs and common functionalities? For someone developing a new visualization component, what should the API look like? This work drew inspiration from visualization grammar, decoupled the grammar from its rendering engine, and adapted it into a configurable grammar for individual components called Encodable. Encodable helps component authors define a grammar for their components and parse encoding specifications from users into utility functions for the implementation.
This document discusses expectations and challenges when visualizing data. The key points are:
1. Expect to find the real need by understanding the audience and goals better than the client. Expect to clean data, which can take a significant amount of time due to multiple sources and formats.
2. Prepare to iterate as the initial visualization may not meet needs or deadlines. Celebrate failures as learning opportunities.
3. Visualization projects include storytelling projects with strict deadlines and analytical tools to support data exploration by technical teams over the long term. The project lifecycle involves identifying needs, prototyping, refining, and maintaining the visualization.
This document summarizes the key expectations and challenges when visualizing data or building visual analytics tools. There are several main points:
1. Expect potential mismatches between what clients think they need versus what the data and visualization actually require, requiring clear communication and compromise.
2. Different projects will have different goals that require flexibility in the types of visualizations created, whether for presentation, exploration, or both.
3. A significant amount of time, often 70-80%, will be spent cleaning and preparing data prior to visualization due to issues like missing values, formatting inconsistencies, and data quality problems.
4. Iteration is essential to work out bugs and refine visualizations to best meet requirements and deadlines.
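As a hedged illustration of the cleaning work point 3 describes, the sketch below normalizes dates arriving in multiple formats and drops unusable rows; all field names and formats are invented for this example:

```python
from datetime import datetime

def clean_record(raw):
    """Normalize one raw record; return None if it can't be salvaged."""
    value = (raw.get("value") or "").strip()
    if not value:
        return None  # missing measurement: drop the row
    # Different sources deliver dates in different formats (assumed list)
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            date = datetime.strptime(raw["date"].strip(), fmt).date()
            break
        except ValueError:
            continue
    else:
        return None  # unparseable date: drop the row
    return {"date": date.isoformat(), "value": float(value)}
```

Even a toy cleaner like this shows why the preparation phase dominates a project's schedule: every new source adds formats and failure modes to handle.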
This document discusses storytelling with data and data visualization. It begins with an introduction to the speaker and their background. It then covers topics like data sources, challenges in working with big data, applications of data analysis, examples of data stories, the data analysis process, and a case study analyzing tweets about the TV show Game of Thrones. Throughout there are references to iterating on prototypes and using feedback to improve. The overall message is that telling stories from data takes collecting relevant data, exploring it through multiple iterations, and presenting insights in an engaging way.
Reveal the talking points of every episode of Game of Thrones from fans' conv...Krist Wongsuphasawat
You may not be sure how Lord Varys collects information from his little birds, but in this talk you will hear how we can collect information from our little birds.
@kristw shares a behind-the-scenes view of his latest data visualization project, which shows how each #GameOfThrones episode was discussed on Twitter. Using data visualization, we can extract and reveal the stories of every episode from fans’ Tweets.
https://interactive.twitter.com/game-of-thrones
These slides are from a talk given at Bay Area d3 User Group meetup on June 9, 2016.
http://www.meetup.com/Bay-Area-d3-User-Group/events/231281298
A talk at Data Visualization Summit 2014 in Santa Clara, CA
ABSTRACT: What is the thought process that transforms data into visualizations? In this presentation, I will talk about guidelines that will help you when starting with raw data, walk through standard techniques, and also discuss things to keep in mind when making design decisions.
This document proposes a narrative display to summarize sports tournaments using a tree layout to show tournament structure, small multiples to detail individual matches, and tweets to capture fan reactions, using data from the 2012-2013 UEFA Champions League as an example. The display would be hosted at uclfinal.twitter.com and divided into three main sections for tournament overview, match details, and fans' reactions.
The document summarizes Krist Wongsuphasawat's presentation on visualizing event sequences at the 2013 Data Visualization Summit in San Francisco. Wongsuphasawat discussed techniques for visualizing event sequences, including using glyphs on a timeline to represent events, using interval width to represent duration, color and shape to distinguish event types, faceting for high density sequences, and aggregation techniques like binning and kernel density estimation. He demonstrated the LifeFlow tool for providing overviews and summaries of event sequence data. Wongsuphasawat also discussed alignment of sequences, outcome-based aggregation with the Outflow tool, and applications to analyzing big event sequence data like customer checkout processes at eBay.
Krist Wongsuphasawat's Dissertation Proposal Slides: Interactive Exploration ...Krist Wongsuphasawat
This dissertation by Krist Wongsuphasawat from the University of Maryland describes research on interactive exploration of event sequences in temporal categorical data. The document discusses how this type of data arises in domains like electronic health records and student records. It proposes designing effective visualization and interaction techniques to support users in exploring event sequences when they are uncertain about what they are looking for. The research aims to provide an overview of event sequences in temporal categorical data as well as a flexible temporal search approach.
Outflow: Exploring Flow, Factors and Outcome of Temporal Event SequencesKrist Wongsuphasawat
My presentation at IEEE VisWeek 2012 in Seattle, WA
//// Abstract:
Event sequence data is common in many domains, ranging from electronic medical records (EMRs) to sports events. Moreover, such sequences often result in measurable outcomes (e.g., life or death, win or loss). Collections of event sequences can be aggregated together to form event progression pathways. These pathways can then be connected with outcomes to model how alternative chains of events may lead to different results. This paper describes the Outflow visualization technique, designed to (1) aggregate multiple event sequences, (2) display the aggregate pathways through different event states with timing and cardinality, (3) summarize the pathways’ corresponding outcomes, and (4) allow users to explore external factors that correlate with specific pathway state transitions. Results from a user study with twelve participants show that users were able to learn how to use Outflow easily with limited training and perform a range of tasks both accurately and rapidly.
This document discusses information visualization and its uses for knowledge discovery through visual representations of data and user interactions. It provides examples of visualizations of different data types, such as maps, networks, temporal data. Visualizations can benefit data analysis by helping detect patterns and trends, and aid presentation by helping communicate information. However, they also carry drawbacks if used to mislead. The document promotes visualization tools like ManyEyes for collaborative data analysis.
This document discusses using information visualization techniques in healthcare, specifically for electronic medical records (EMRs). It provides examples of systems like LifeLines and LifeFlow that visualize patient data longitudinally over time to help clinicians understand large amounts of patient data and identify patterns. Visualizations of EMR data can help improve healthcare quality by enabling faster decision making and better recall of patient information.
LifeFlow: Understanding Millions of Event Sequences in a Million PixelsKrist Wongsuphasawat
The document describes LifeFlow, a novel visualization tool that provides an overview and summary of millions of event sequence records. LifeFlow addresses challenges in displaying and exploring large event data at scale while preserving important information about all possible sequences and time gaps. It is demonstrated on medical and transportation event data use cases. LifeFlow supports exploration, identification of anomalies and errors, and asking richer questions of event sequence data.
Finding Comparable Temporal Categorical Records: A Similarity Measure with an...Krist Wongsuphasawat
1. The document proposes a new similarity measure called M&M (Match and Mismatch) for comparing temporal categorical records.
2. It also introduces an interactive visualization tool called Similan that uses the M&M measure and scatterplot visualization to help users find the most similar records to a target record.
3. An initial usability study found that Similan was easy to use but had some interface issues, and that the scatterplot effectively explained how records were dissimilar while giving an overview of the data. Ongoing work focuses on improving the similarity measure and interface.
Paper presentation at the Workshop on Visual Analytics in Healthcare in conjunction with the IEEE VisWeek 2011, Providence, RI, 2011.
Abstract:
Electronic Medical Record (EMR) databases contain a large number of temporal events, such as diagnosis dates for various symptoms. Analyzing disease progression pathways in terms of these observed events can provide important insights into how diseases evolve over time. Moreover, connecting these pathways to the eventual outcomes of the corresponding patients can help clinicians understand how certain progression paths may lead to better or worse outcomes. In this paper, we describe the Outflow visualization technique, designed to summarize temporal event data that has been extracted from the EMRs of a cohort of patients. We include sample analyses to show examples of the insights that can be learned from this visualization.
The document summarizes research on visualizing temporal categorical data from electronic health records. It describes tools like LifeLines for visualizing a single patient record, LifeLines2 for searching and comparing multiple records, and Similan for similarity-based search. A new tool called LifeFlow is introduced for aggregating and visualizing patterns across large numbers of records. The research was conducted over 10+ years at the University of Maryland by researchers including Krist Wongsuphasawat, Taowei David Wang, Catherine Plaisant, and Ben Shneiderman.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey," our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
3. Krist Wongsuphasawat / @kristw
Computer Engineer (Bangkok, Thailand)
PhD in Computer Science, Univ. of Maryland (Information Visualization)
IBM, Microsoft
Data Visualization Scientist, Twitter
4. Krist Wongsuphasawat / @kristw
Adventure in data
A whirlwind tour of visualization projects at Twitter
9. Challenges
• Too much data; want only relevant Tweets
  • hashtag: #BRA
  • keywords: “goal”
• Need to aggregate & reduce size
• Long processing time (hours)
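The filter-then-aggregate step described above can be sketched roughly as follows. The tweet record format, the match rules, and the per-minute bucketing are illustrative assumptions, not Twitter's actual pipeline:

```python
from collections import Counter
from datetime import datetime

# Minimal sketch: keep only relevant Tweets (hashtag or keyword match),
# then reduce them to per-minute counts. Record format is hypothetical.
def is_relevant(tweet, hashtags=("#bra",), keywords=("goal",)):
    text = tweet["text"].lower()
    return any(h in text for h in hashtags) or any(k in text for k in keywords)

def aggregate_by_minute(tweets):
    """Reduce matching tweets to per-minute counts."""
    counts = Counter()
    for tweet in tweets:
        if is_relevant(tweet):
            ts = datetime.fromisoformat(tweet["created_at"])
            counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return counts

tweets = [
    {"text": "GOAL! #BRA", "created_at": "2014-07-08T21:01:30"},
    {"text": "what a save", "created_at": "2014-07-08T21:01:45"},
    {"text": "another goal?!", "created_at": "2014-07-08T21:02:10"},
]
print(aggregate_by_minute(tweets))
# Counter({'2014-07-08 21:01': 1, '2014-07-08 21:02': 1})
```

At real scale this filtering and bucketing runs as a batch job over the firehose, which is where the hours-long processing time comes from.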
17. Projects
• Storytelling: to understand the world and share the stories
• Analytics Tools: to understand Twitter users and improve the service
• Creative: to showcase the data and inspire
18. Storytelling (1)
Events: World Cup, Election, Oscars, TV Shows, New Year, Earthquake, Super Bowl, Protest, …
Behaviors: Sleeping, Daylight saving, Language, Fasting, Information spread, Commute, …
58. Time + Text + Geo State of the Union
twitter.github.io/interactive/sotu2014
59. 1) Timeline + topics from Tweets
2) Context (speech)
3) Volume of Tweets by topic during the selected part of the SOTU
4) Density map of Tweets about the selected topic
93. Client event collection
Engineers & Data Scientists → Log data in Hadoop → Aggregate → 10,000+ event types

date      client  page  section  comp.  elem.  action      count
20141011  web     home  home     -      -      impression  100
20141011  web     home  wtf      -      -      click       20

(wtf = Who-to-Follow)
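The rollup behind the table above might look something like this in miniature: raw client events are grouped on the six-part event name plus date, producing one count per combination. The event record format and the `rollup` helper are hypothetical:

```python
from collections import Counter

# Sketch of the aggregation step: group raw client events by
# (date, client, page, section, component, element, action) and count.
# The records below are illustrative, not Twitter's actual log format.
def rollup(events):
    counts = Counter()
    for e in events:
        key = (e["date"], e["client"], e["page"], e["section"],
               e["component"], e["element"], e["action"])
        counts[key] += 1
    return counts

events = [
    {"date": "20141011", "client": "web", "page": "home", "section": "home",
     "component": "-", "element": "-", "action": "impression"},
    {"date": "20141011", "client": "web", "page": "home", "section": "home",
     "component": "-", "element": "-", "action": "impression"},
    {"date": "20141011", "client": "web", "page": "home", "section": "wtf",
     "component": "-", "element": "-", "action": "click"},
]
for key, count in rollup(events).items():
    print(*key, count)
```

With 10,000+ distinct event types, the output of this grouping is exactly the kind of table shown on the slide.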
99. client : page : section : component : element : action
Search → Find: web home * * impression*
Log data in Hadoop → Aggregate
Client event collection
Engineers & Data Scientists
105. client : page : section : component : element : action
Search → Find: web home * * impression*
Query → Return
Results:
web : home : home : - : - : impression
web : home : wtf : - : - : impression
Log data in Hadoop → Aggregate
search can be better (10,000+ event types)
not everybody knows the names (What are all sections under web:home?)
one graph / event, x 10,000
Client event collection
Engineers & Data Scientists
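The search step over event names could be sketched with simple wildcard matching on the colon-joined, six-part names, in the spirit of the query "web home * * impression*" above. The small `EVENTS` catalog and the exact pattern syntax are assumptions for illustration; the real system searches 10,000+ event types:

```python
from fnmatch import fnmatch

# Sketch of wildcard search over hierarchical event names.
# The catalog here is illustrative; the real one comes from the logs.
EVENTS = [
    "web:home:home:-:-:impression",
    "web:home:wtf:-:-:impression",
    "web:home:wtf:-:-:click",
    "iphone:home:home:-:-:impression",
]

def search(pattern, events=EVENTS):
    """Return every event name matching a glob-style pattern."""
    return [e for e in events if fnmatch(e, pattern)]

print(search("web:home:*:*:*:impression*"))
# ['web:home:home:-:-:impression', 'web:home:wtf:-:-:impression']
```

A query like `*:*:wtf:*:*:*` answers questions such as "what events exist under the wtf section?", which is the kind of lookup the slide says not everybody knows how to do.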
111. How to visualize?
See: narrow down
Interactions: search box => filter
client : page : section : component : element : action
Client event collection
Engineers & Data Scientists
126. Funnel analysis
banana : home : - : - : - : impression (home page)
banana : profile : - : - : - : impression (profile page)
banana : search : - : - : - : impression (search page)
Specify all funnels manually! n funnels → n jobs, n hours
127. Goal
banana : home : - : - : - : impression (home page)
…
1 job => all funnels, visualized
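The "1 job => all funnels" idea can be sketched as a single pass that counts, per user, every contiguous event pattern up to some length; any funnel's step counts can then be read off the resulting table. The log format and the `count_patterns` helper are illustrative:

```python
from collections import Counter

# Sketch: one pass over per-user event sequences counts how many users
# contain each contiguous pattern (up to max_len), so every funnel is
# answerable afterwards without a dedicated job per funnel.
def count_patterns(user_events, max_len=3):
    """user_events: {user: [event, event, ...]} in time order."""
    counts = Counter()
    for events in user_events.values():
        seen = set()
        for i in range(len(events)):
            for n in range(1, max_len + 1):
                pattern = tuple(events[i:i + n])
                if len(pattern) == n:
                    seen.add(pattern)
        for pattern in seen:
            counts[pattern] += 1  # count each pattern once per user
    return counts

logs = {
    "u1": ["home", "profile", "search"],
    "u2": ["home", "search"],
    "u3": ["home", "profile"],
}
counts = count_patterns(logs)
print(counts[("home",)], counts[("home", "profile")],
      counts[("home", "profile", "search")])
# 3 2 1
```

Reading the three numbers as a funnel: 3 users reached home, 2 of them went on to profile, and 1 continued to search; the same table answers any other funnel over these events.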
128. Related work
• Visualize an overview of event sequences [Wongsuphasawat et al. 2011, Monroe et al. 2013, …]
• Big data? eBay checkout sequences [Shen et al. 2013]
161. Final process
1. Define set of events
2. Pick alignment, direction and window size
3. Run Hadoop job (with more aggregation): gazillion patterns (TBs) → ~100,000 patterns (10MB)
4. Wait for it… (2+ hrs)
5. Visualize
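Steps 1-3 of the process above could look roughly like this in miniature: keep only the chosen event set, align each sequence on an anchor event, cut a fixed window after it, and count identical patterns. The event names, the `extract_patterns` helper, and the forward-only direction are assumptions for illustration:

```python
from collections import Counter

# Sketch of steps 1-3: filter to a chosen event set, align on an anchor
# event, take a fixed-size forward window, and count matching patterns.
def extract_patterns(sequences, event_set, align_on, window=2):
    counts = Counter()
    for seq in sequences:
        filtered = [e for e in seq if e in event_set]  # step 1: event set
        if align_on in filtered:
            i = filtered.index(align_on)               # step 2: alignment
            pattern = tuple(filtered[i:i + window + 1])  # forward window
            counts[pattern] += 1                       # step 3: aggregate
    return counts

sequences = [
    ["login", "home", "search", "tweet"],
    ["login", "home", "tweet", "logout"],
    ["home", "search", "tweet"],
]
counts = extract_patterns(
    sequences, {"login", "home", "search", "tweet"}, "login")
print(counts)
```

The same shrinking happens at scale: the Hadoop job collapses terabytes of raw sequences into a pattern table small enough (megabytes) to visualize interactively.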
163. Summary
• Large-scale user activity logs + visual analytics
• Used in day-to-day operations at Twitter
• Generalizes to smaller systems
Challenge: big data → aggregate & sacrifice → small data → visualize & interact
168. Projects
• Storytelling: to understand the world and share the stories
• Analytics Tools: to understand Twitter users and improve the service
• Creative: to showcase the data and inspire
• Reusable Toolkits: to implement once and for all
171. Conclusions
• Data are everywhere.
• Many applications: Journalism, Product development, Art, etc.
• Combine visualization with other skills: HCI, Design, Stats, ML, etc.
• Don’t repeat yourself.
Krist Wongsuphasawat / @kristw
interactive.twitter.com / kristw.yellowpigz.com