Mining Social Web Data Like a Pro: Four Steps to Success
Presented by Matthew A. Russell
"Data Journalism and Interactivity" - GDA Seminar
Quito, Ecuador - 20 September 2013
Hello
Trained as a Computer Scientist
CTO @ Digital Reasoning Systems
Data Mining, Machine Learning
Principal @ Zaffra
Boutique Consulting
Author @ O'Reilly Media
5 published books on technology
Transform Curiosity Into Insight
An open source project
http://bit.ly/MiningTheSocialWeb2E
Inherently accessible
Virtual machine & IPython Notebook UX
Turn-key code templates for
bootstrapping data science experiments
Think of the book as "premium" support
for the OSS project
Investigative Journalist
"A person whose
profession it is to
discover the truth and
to identify lapses from
it in whatever media
may be available."
Data Science
Data => Actionable Information
Highly interdisciplinary
Nascent
Necessary
http://wikipedia.org/wiki/Data_science
Digital Signal Explosion
A model for the world: signals and sinks
Growth in data exhaust is accelerating
Digital fingerprints
Software is eating the world
Data mining opportunities galore...
Digital Data Stats
100 terabytes of data are uploaded to Facebook daily.
Brands and organizations on Facebook receive 34,722 Likes every minute of the day.
According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day.
30 billion pieces of content are shared on Facebook every month.
Data production will be 44 times greater in 2020 than it was in 2009.
According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.
See http://wikibon.org/blog/big-data-statistics
Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
But Why Is It All the Rage?
It satisfies fundamental human desires
We want to be heard
We want to satisfy our curiosity
We want it easy
We want it now
A (Social) Interest Graph
[Figure: graph whose nodes include Roberto, Mercedes, Jorge, Ana, Nina, U2, and Juan Luis Guerra]
A (Political) Interest Graph
[Figure: graph whose nodes include Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, and Rodolfo Hernández]
Social Media Dimensions
Facebook
Account Types: People & Pages
Mutual Connections
"Likes"
"Shares"
"Comments"
Extensive Privacy Controls
Twitter
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
Why Does This Matter?
"If you can measure it, you can improve it"
Modeling Behavior
Predictive Analysis
Recommending Content
Swaying political situations might just be the ultimate value proposition for
social media
Social Media Analysis Framework
Four Steps To Success
Aspire
Acquire
Analyze
Summarize
Let's step through a trivial example...
(1) Aspire
Let's frame a trivial hypothesis to illustrate the four steps...
Frame a hypothesis about some real world phenomenon
For example: "Johnny Araya is a more popular candidate than Rodolfo
Hernández"
Let's use social media as a basis of investigation
(2) Acquire
Collect the data that you need to test the hypothesis
How?
Use Facebook and Twitter APIs to harvest data about each candidate
Go after low hanging fruit before something more complex
You don't even need to write code to do this (yet)
They're both on Facebook
http://facebook.com/ElDoctor2014
http://facebook.com/JohnnyArayaMonge
(3) Analyze
Count, Filter, and Rank the Data
Johnny Araya:
~50k Facebook likes
~14k Twitter followers
Rodolfo Hernández:
~37k Facebook likes
745 Twitter followers
Johnny Araya is indeed more popular in social media
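The count-and-rank step can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the dictionary layout and the helper name `rank_by` are assumptions, and the numbers are the approximate figures quoted above rather than live API results.

```python
# Approximate counts quoted on the slide (not fetched live).
popularity = {
    'Johnny Araya': {'facebook_likes': 50000, 'twitter_followers': 14000},
    'Rodolfo Hernández': {'facebook_likes': 37000, 'twitter_followers': 745},
}

def rank_by(metric, data):
    """Return candidate names sorted from highest to lowest on one metric."""
    return sorted(data, key=lambda name: data[name][metric], reverse=True)

for metric in ('facebook_likes', 'twitter_followers'):
    print(metric, rank_by(metric, popularity))
```

Keeping the ranking in one small function makes it easy to add more candidates or metrics later.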
(4) Summarize
Present the data in a concise and easily understood manner
Charts
Tables
Simple visualizations
Some examples...
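A simple visualization does not need a plotting library. The sketch below renders a text bar chart; the function name `text_bar_chart` and its `width` parameter are hypothetical, and the values are the approximate Facebook likes quoted earlier in the talk.

```python
# Approximate Facebook likes quoted earlier in the talk.
likes = [('Johnny Araya', 50000), ('Rodolfo Hernández', 37000)]

def text_bar_chart(rows, width=40):
    """Render (label, value) pairs as lines of '#' bars scaled to `width`."""
    largest = max(value for _, value in rows)
    label_width = max(len(label) for label, _ in rows)
    lines = []
    for label, value in rows:
        # Scale each bar relative to the largest value in the data.
        bar = '#' * int(round(width * value / float(largest)))
        lines.append('{0} {1} {2}'.format(label.ljust(label_width), bar, value))
    return lines

print('\n'.join(text_bar_chart(likes)))
```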
Recall the previous hypothesis:
"Johnny Araya is a more popular candidate than Rodolfo Hernández"
What do we know now that we didn't before?
The current state of each candidate's Twitter and Facebook popularity
Let's explore a slightly more complex hypothesis...
Reflect and Refine...
(1) Aspire
Redefine the hypothesis:
For example: "Johnny Araya has a more effective social media strategy than
Rodolfo Hernández"
Presumably because of his superior social media status at the moment
(2) Acquire
Collect the data that you need to test the hypothesis
How? Use APIs to harvest data about each candidate
Let's consider any Facebook posts for 2013
import json
import requests

for candidate in ['JohnnyArayaMonge', 'ElDoctor2014']:
    # Get the data (XXX stands in for a valid access token)
    url = ('https://graph.facebook.com/{0}?'
           'fields=posts.limit(500)&access_token=XXX').format(candidate)
    content = requests.get(url).json()
    # Save the data
    with open(candidate + '.json', 'w') as f:
        f.write(json.dumps(content))

Python Source Code
(3) Analyze
Count, Filter, and Rank the Data
Some more Python source code to crunch the numbers
Extract Facebook likes and shares this year
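The number crunching might look roughly like the sketch below. The talk does not show this code, so treat it as an assumption-laden illustration: it assumes each post is a dictionary with `created_time` and `type` keys plus optional `likes` and `shares` entries that carry a `count`, which may not match the actual Graph API response. The function name `summarize_posts` and the sample data are hypothetical.

```python
from collections import Counter

def summarize_posts(posts, total_page_likes, since='2013-01-01'):
    """Count, filter, and rank posts created on or after `since`."""
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    recent = [p for p in posts if p['created_time'] >= since]
    likes = sum(p.get('likes', {}).get('count', 0) for p in recent)
    shares = sum(p.get('shares', {}).get('count', 0) for p in recent)
    n = len(recent)
    avg_likes = likes / float(n) if n else 0.0
    return {
        'num_posts': n,
        'num_prior_posts': len(posts) - n,
        'total_post_likes': likes,
        'total_post_shares': shares,
        'avg_likes_per_post': avg_likes,
        # Average likes per post as a percentage of the page's total likes.
        'avg_likes_pct': 100.0 * avg_likes / total_page_likes,
        'post_types': Counter(p['type'] for p in recent).most_common(),
    }

# Hypothetical sample standing in for the list of posts in a saved file.
sample = [
    {'created_time': '2013-03-15T00:40:21+0000', 'type': 'photo',
     'likes': {'count': 300}, 'shares': {'count': 20}},
    {'created_time': '2013-04-02T10:00:00+0000', 'type': 'link',
     'likes': {'count': 100}},
    {'created_time': '2012-12-30T09:00:00+0000', 'type': 'photo'},
]
vitals = summarize_posts(sample, total_page_likes=1000)
print(vitals)
```

Expressing average likes per post as a share of total page likes gives a rough engagement rate that can be compared across pages of different sizes.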
Facebook Vitals
ElDoctor2014
Total Likes 37495
Num Posts since Jan 1, 2013 (of 500 possible) 436
Total Post Likes 155473
Total Post Shares 9684
Oldest Post in Batch 2013-03-15T00:40:21+0000
Num posts prior to Jan 1, 2013 0
Avg likes/post 356.589449541 (0.951032003044%)
Avg shares/post 22.2110091743 (0.059237256099%)
Post Types [(u'photo', 286), (u'link', 77), (u'status', 40), (u'video', 32), (u'swf', 1)]
JohnnyArayaMonge
Total Likes 50301
Num Posts since Jan 1, 2013 (of 500 possible) 205
Total Post Likes 176161
Total Post Shares 7542
Oldest Post in Batch 2013-01-01T07:18:43+0000
Num posts prior to Jan 1, 2013 190
Avg likes/post 859.32195122 (1.70835957778%)
Avg shares/post 36.7902439024 (0.0731401838978%)
Post Types [(u'photo', 149), (u'status', 38), (u'link', 13), (u'video', 5)]
Metric                        Araya         Hernández
Total Likes                   50,301        37,495
Posts since 1 Jan 13          205           436
Num Prior Posts               190+          0
Earliest Post                 1 Jan 2013    15 March 2013
Post Likes since 1 Jan 13     176,161       155,473
Post Shares since 1 Jan 13    7,542         9,684
Avg Likes per Post            859           356
Avg Shares per Post           36            22
Recall the hypothesis:
"Johnny Araya has a more effective social media strategy than Rodolfo
Hernández because he has more Facebook and Twitter popularity"
What do we know now?
Hernández has Facebook vitals that are quite competitive with Araya
However, Hernández only joined Facebook ~6 months ago!
It would appear that Hernández has the more effective strategy
What is he doing to rise in popularity so quickly?
Reflect and Refine...
Past ~2 Months on Facebook
Candidate                             Aug 2013 FB Likes    Sept 2013 FB Likes    % Change
Johnny Araya                          50,301               53,809                6.97%
Otto Guevara Guth                     24,146               27,675                14.62%
José María Villalta Florez-Estrada    27,262               35,169                29.00%
Dr. Rodolfo Hernández                 37,495               38,298                2.14%
Luis Guillermo Solís Rivera           5,334                6,763                 26.79%
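The percent-change column is straightforward to recompute. The sketch below uses the Facebook like counts from this slide; the names `fb_likes` and `pct_change` are illustrative choices, not taken from the talk.

```python
# Facebook like counts from the slide: (Aug 2013, Sept 2013).
fb_likes = {
    'Johnny Araya': (50301, 53809),
    'Otto Guevara Guth': (24146, 27675),
    'José María Villalta Florez-Estrada': (27262, 35169),
    'Dr. Rodolfo Hernández': (37495, 38298),
    'Luis Guillermo Solís Rivera': (5334, 6763),
}

def pct_change(old, new):
    """Percentage growth from `old` to `new`."""
    return 100.0 * (new - old) / old

# Rank candidates by growth rate rather than by absolute size.
ranked = sorted(fb_likes, key=lambda name: pct_change(*fb_likes[name]),
                reverse=True)
for name in ranked:
    print('{0:36s} {1:6.2f}%'.format(name, pct_change(*fb_likes[name])))
```

Ranking by growth rather than by raw totals brings out the smaller but faster-growing pages.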
Past ~3 Months on Twitter
Candidate                             Aug 2013    Sept 2013    % Change
Johnny Araya                          14,573      15,506       6.40%
Otto Guevara Guth                     114         159          39.47%
José María Villalta Florez-Estrada    8,160       8,990        10.17%
Dr. Rodolfo Hernández                 745         858          15.17%
Luis Guillermo Solís Rivera           1,192       1,487        24.75%
Facebook and Twitter Compared
Candidate                             % FB Change    % Twitter Change
Johnny Araya                          6.97%          6.40%
Otto Guevara Guth                     14.62%         39.47%
José María Villalta Florez-Estrada    29.00%         10.17%
Dr. Rodolfo Hernández                 2.14%          15.17%
Luis Guillermo Solís Rivera           26.79%         24.75%
Your Imagination Is the Only Limit
Analyze the comments that people are leaving on Facebook pages
Try to ascertain common Facebook fans or Twitter followers
amongst candidates
Deduce demographics from social media by synthesizing public data
Theorize about potential "reach" or "influence" using social media
Analyze data in realtime
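As one example from the list above, finding common followers between two candidates reduces to set arithmetic once the follower IDs are collected. The sketch below is hypothetical: the function name `overlap` and the follower IDs are made up for illustration, and collecting real IDs would require the relevant API.

```python
def overlap(a, b):
    """Return the shared members of two follower sets and their Jaccard similarity."""
    a, b = set(a), set(b)
    common = a & b
    union = a | b
    # Jaccard similarity: size of the intersection over size of the union.
    jaccard = len(common) / float(len(union)) if union else 0.0
    return common, jaccard

# Hypothetical follower IDs for two candidates.
followers_a = [1, 2, 3, 4, 5]
followers_b = [4, 5, 6]
common, jaccard = overlap(followers_a, followers_b)
print(sorted(common), jaccard)
```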
Thinking about Reach
Think about "liking" and "following" as opt-ins to feeds
Remember: Interest Graphs
Arriving at effective metrics is trickier than it initially seems
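To see why reach metrics are tricky, consider one possible definition (an assumption here, since the talk does not spell out its formula): add up the follower counts of each of an account's followers. The sketch below compares that naive sum with a count of distinct second-degree accounts on a small hypothetical graph; all names and data are invented for illustration.

```python
# Hypothetical follower graph: account -> set of accounts that follow it.
followers_of = {
    'candidate': {'ana', 'jorge', 'nina'},
    'ana': {'jorge', 'luis', 'mercedes'},
    'jorge': {'ana', 'luis'},
    'nina': set(),
}

def theoretical_reach(account, graph):
    """Sum of each follower's follower count (double-counts shared accounts)."""
    return sum(len(graph.get(f, ())) for f in graph[account])

def unique_reach(account, graph):
    """Distinct second-degree accounts, counting each person once."""
    reached = set()
    for f in graph[account]:
        reached |= graph.get(f, set())
    reached.discard(account)
    return len(reached)

print(theoretical_reach('candidate', followers_of))
print(unique_reach('candidate', followers_of))
```

The naive sum counts `luis` twice because both `ana` and `jorge` are followed by him, so the two figures diverge even on this tiny graph.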
Potential Twitter Influence
Metric                 Araya    Hernández
Followers              ~14k     ~750
Theoretical Reach      ~40M     ~550k
Reach (10)             490      673
Reach (100)            289      702
Reach (1000)           2782     X
Reach (10,000)         2832     X
"Suspect" Followers    3,246    94
See also http://wp.me/p3QiJd-2a
Realtime Analysis
Monitor Twitter's firehose for realtime data using filters such as #Syria
Keep in mind the sheer volume of data can be considerable
Analysis at MiningTheSocialWeb.com
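The filtering idea can be sketched without touching Twitter at all. The example below processes a simulated stream one tweet at a time, which is how high-volume data is usually handled; the function name `filter_stream`, the tweet dictionaries, and the naive substring match are all assumptions, and a real client would replace the list with a live API connection.

```python
from collections import Counter

def filter_stream(tweets, hashtag):
    """Yield only the tweets whose text mentions `hashtag` (case-insensitive)."""
    needle = hashtag.lower()
    for tweet in tweets:
        # Naive substring match; '#syria' would also match '#syrian'.
        if needle in tweet['text'].lower():
            yield tweet

# Hypothetical tweets standing in for a live stream.
simulated_stream = [
    {'user': 'a', 'text': 'Breaking news on #Syria'},
    {'user': 'b', 'text': 'Lunch time'},
    {'user': 'a', 'text': 'More on #syria tonight'},
]

# Tally matches per user without holding the whole stream in memory.
counts = Counter(t['user'] for t in filter_stream(simulated_stream, '#Syria'))
print(counts)
```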
Closing Remarks
Software is the gift that keeps on giving
Code it up once, run it ad infinitum...
Code designed for one account will work for other accounts
Analysis is all about knowing what to count
Coding it up is just the dirty work
Start somewhere and then iteratively explore...then exploit
Aspire to Do Great Things
Predicting demographic data such as age or gender is possible for some
languages
Time and space are fundamentals for grounding online discussions in
reality
Twitter is about as good as it gets for realtime topical analysis
Think of the world as signal producers and signal collectors
Monitoring breaking news events like #Syria