Mining Social Web Data Like a Pro: Four Steps to Success

GDA Presentation - Quito Ecuador - 20 Sept 2013

Published in: Technology, Business

  1. Mining Social Web Data Like a Pro: Four Steps to Success
     - Presented by Matthew A. Russell
     - "Data Journalism and Interactivity" - GDA Seminar
     - Quito, Ecuador - 20 September 2013
  2. Hello (Hola)
     - Trained as a Computer Scientist
     - CTO @ Digital Reasoning Systems
     - Data Mining, Machine Learning
     - Principal @ Zaffra
     - Boutique Consulting
     - Author @ O'Reilly Media
     - 5 published books on technology
  3. [Image only]
  4. Transform Curiosity Into Insight
     - An open source project
     - http://bit.ly/MiningTheSocialWeb2E
     - Inherently accessible
     - Virtual machine & IPython Notebook UX
     - Turn-key code templates for bootstrapping data science experiments
     - Think of the book as "premium" support for the OSS project
  5. Why Not Spanish? (¿Por qué no Español?)
  6. Investigative Journalist
     - "A person whose profession it is to discover the truth and to identify lapses from it in whatever media may be available."
  7. Data Science
     - Data => Actionable Information
     - Highly interdisciplinary
     - Nascent
     - Necessary
     - http://wikipedia.org/wiki/Data_science
  8. Digital Signal Explosion
     - A model for the world: signals and sinks
     - Growth in data exhaust is accelerating
     - Digital fingerprints
     - Software is eating the world
     - Data mining opportunities galore...
  9. Digital Data Stats
     - 100 terabytes of data are uploaded daily to Facebook.
     - Brands and organizations on Facebook receive 34,722 Likes every minute of the day.
     - According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day.
     - 30 billion pieces of content are shared on Facebook every month.
     - Data production will be 44 times greater in 2020 than it was in 2009.
     - According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.
     - See http://wikibon.org/blog/big-data-statistics
  10. Social Media Is All the Rage
      - World population: ~7B people
      - Facebook: 1.15B users
      - Twitter: 500M users
      - Google+: 343M users
      - LinkedIn: 238M users
      - ~200M+ blogs (conservative estimate)
  11. But Why Is It All the Rage?
      - It satisfies fundamental human desires
      - We want to be heard
      - We want to satisfy our curiosity
      - We want it easy
      - We want it now
  12. Social Network Mechanics
      - [Diagram: a network of people: Roberto, Mercedes, Jorge, Ana, Nina]
  13. Interest Graph Mechanics
      - [Diagram: Roberto, Mercedes, Jorge, Ana, and Nina linked to the interests U2 and Juan Luis Guerra]
  14. A (Social) Interest Graph
      - [Diagram: Roberto, Mercedes, Jorge, Ana, and Nina linked to the interests U2 and Juan Luis Guerra]
  15. A (Political) Interest Graph
      - [Diagram: Roberto, Mercedes, Jorge, Ana, and Nina linked to Johnny Araya and Rodolfo Hernández]
  16. Social Media Dimensions
      - Facebook
        - Account Types: People & Pages
        - Mutual Connections
        - "Likes"
        - "Shares"
        - "Comments"
        - Extensive Privacy Controls
      - Twitter
        - Account Types: "Anything"
        - "Following" Relationships
        - Favorites
        - Retweets
        - Replies
        - (Almost) No Privacy Controls
  17. Why Does This Matter?
      - "If you can measure it, you can improve it"
      - Modeling Behavior
      - Predictive Analysis
      - Recommending Content
      - Swaying political situations might just be the ultimate value proposition for social media
  18. Social Media Analysis Framework
      - Four Steps To Success
        - Aspire
        - Acquire
        - Analyze
        - Summarize
      - Let's step through a trivial example...
  19. (1) Aspire
      - Let's frame a trivial hypothesis to illustrate the four steps...
      - Frame a hypothesis about some real world phenomenon
      - For example: "Johnny Araya is a more popular candidate than Rodolfo Hernández"
      - Let's use social media as a basis of investigation
  20. (2) Acquire
      - Collect the data that you need to test the hypothesis
      - How? Use Facebook and Twitter APIs to harvest data about each candidate
      - Go after low hanging fruit before something more complex
      - You don't even need to write code to do this (yet)
  21. They're both on Facebook
      - http://facebook.com/ElDoctor2014
      - http://facebook.com/JohnnyArayaMonge
  22. They're both on Twitter
      - @Johnny_Araya
      - @ElDoctor2014
  23. (3) Analyze
      - Count, Filter, and Rank the Data
      - Johnny Araya: ~50k Facebook likes; ~14k Twitter followers
      - Rodolfo Hernández: ~37k Facebook likes; 745 Twitter followers
      - Johnny Araya is indeed more popular in social media
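
The counting and ranking in this step can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's code: the figures are the approximate ones quoted on the slide, and the `popularity` dictionary and `rank_by` helper are hypothetical names introduced here.

```python
# Approximate figures quoted on the slide (September 2013).
popularity = {
    "Johnny Araya": {"facebook_likes": 50000, "twitter_followers": 14000},
    "Rodolfo Hernández": {"facebook_likes": 37000, "twitter_followers": 745},
}


def rank_by(metric, data):
    """Return candidate names sorted from highest to lowest on one metric."""
    return sorted(data, key=lambda name: data[name][metric], reverse=True)


facebook_ranking = rank_by("facebook_likes", popularity)
twitter_ranking = rank_by("twitter_followers", popularity)

for label, ranking in [("Facebook likes", facebook_ranking),
                       ("Twitter followers", twitter_ranking)]:
    print(label + ": " + " > ".join(ranking))
```

Keeping the raw counts in one structure means the same ranking helper can be reused for any further metric that gets collected later.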
  24. (4) Summarize
      - Present the data in a concise and easily understood manner
      - Charts
      - Tables
      - Simple visualizations
      - Some examples...
  25. Social Media Popularity: Araya vs Hernández
      - [Two charts, "Twitter Popularity" and "Facebook Popularity", each showing the Araya and Hernández percentages]
  26. Social Media Popularity: Araya vs Hernández
      - [Chart: Twitter followers and Facebook fans for Araya and Hernández; axis from 0 to 60,000]
  27. Social Media Popularity: Araya vs Hernández
      - [Chart: Twitter followers and Facebook fans for Araya and Hernández; logarithmic axis from 1 to 100,000]
  28. Twitter Popularity
      - [Chart only]
  29. Facebook Popularity
      - Facebook Likes for Costa Rican Presidential Candidates
        - JohnnyArayaMonge: 35%
        - ottoguevaraguth: 17%
        - luisguillermosolisr: 3%
        - villaltaJM: 19%
        - ElDoctor2014: 26%
  30. Reflect and Refine...
      - Recall the previous hypothesis: "Johnny Araya is a more popular candidate than Rodolfo Hernández"
      - What do we know now that we didn't before?
      - The current state of each candidate's Twitter and Facebook popularity
      - Let's explore a slightly more complex hypothesis...
  31. (1) Aspire
      - Redefine the hypothesis:
      - For example: "Johnny Araya has a more effective social media strategy than Rodolfo Hernández"
      - Presumably because of his superior social media status at the moment
  32. (2) Acquire
      - Collect the data that you need to test the hypothesis
      - How? Use APIs to harvest data about each candidate
      - Let's consider any Facebook posts for 2013
  33. Python Source Code

          import json
          import requests

          for candidate in ['JohnnyArayaMonge', 'ElDoctor2014']:
              # Get the data (XXX stands in for a valid access token)
              url = ('https://graph.facebook.com/{0}?'
                     'fields=posts.limit(500)&access_token=XXX').format(candidate)
              content = requests.get(url).json()

              # Save the data
              with open(candidate + '.json', 'w') as f:
                  f.write(json.dumps(content))

  34. (3) Analyze
      - Count, Filter, and Rank the Data
      - Some more Python source code to crunch the numbers
      - Extract Facebook likes and shares this year
  35. Facebook Vitals
      - ElDoctor2014
        - Total Likes: 37495
        - Num Posts since Jan 1, 2013 (of 500 possible): 436
        - Total Post Likes: 155473
        - Total Post Shares: 9684
        - Oldest Post in Batch: 2013-03-15T00:40:21+0000
        - Num posts prior to Jan 1, 2013: 0
        - Avg likes/post: 356.589449541 (0.951032003044%)
        - Avg shares/post: 22.2110091743 (0.059237256099%)
        - Post Types: [(u'photo', 286), (u'link', 77), (u'status', 40), (u'video', 32), (u'swf', 1)]
      - JohnnyArayaMonge
        - Total Likes: 50301
        - Num Posts since Jan 1, 2013 (of 500 possible): 205
        - Total Post Likes: 176161
        - Total Post Shares: 7542
        - Oldest Post in Batch: 2013-01-01T07:18:43+0000
        - Num posts prior to Jan 1, 2013: 190
        - Avg likes/post: 859.32195122 (1.70835957778%)
        - Avg shares/post: 36.7902439024 (0.0731401838978%)
        - Post Types: [(u'photo', 149), (u'status', 38), (u'link', 13), (u'video', 5)]
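
The averages in the vitals above can be reproduced from the totals alone. The sketch below is not the code used in the talk; it assumes, based on how the reported numbers work out, that the percentage in parentheses is the per-post average divided by the page's total likes. The `vitals` dictionary and `per_post_average` helper are hypothetical names.

```python
# Totals copied from the "Facebook Vitals" slide.
vitals = {
    "ElDoctor2014": {"total_likes": 37495, "num_posts": 436,
                     "post_likes": 155473, "post_shares": 9684},
    "JohnnyArayaMonge": {"total_likes": 50301, "num_posts": 205,
                         "post_likes": 176161, "post_shares": 7542},
}


def per_post_average(total, num_posts, page_likes):
    """Return (average per post, that average as a percentage of page likes)."""
    average = total / num_posts
    return average, 100.0 * average / page_likes


summary = {}
for page, v in vitals.items():
    summary[page] = {
        "avg_likes": per_post_average(
            v["post_likes"], v["num_posts"], v["total_likes"]),
        "avg_shares": per_post_average(
            v["post_shares"], v["num_posts"], v["total_likes"]),
    }

for page, s in summary.items():
    print(page)
    print("  Avg likes/post  {0:.2f} ({1:.3f}%)".format(*s["avg_likes"]))
    print("  Avg shares/post {0:.2f} ({1:.3f}%)".format(*s["avg_shares"]))
```

Normalizing by page likes makes the two pages comparable even though one has a larger audience than the other.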
  36. (4) Summarize
      - Present the data in a concise and easily understood manner
      - Like a table...
  37. Facebook metrics by candidate

      Metric                       Araya        Hernández
      Total Likes                  50,301       37,495
      Posts since 1 Jan 13         205          436
      Num Prior Posts              190+         0
      Earliest Post                1 Jan 2013   15 March 2013
      Post Likes since 1 Jan 13    176,161      155,473
      Post Shares since 1 Jan 13   7,542        9,684
      Avg Likes per Post           859          356
      Avg Shares per Post          36           22

  38. [Repeat of the table on slide 37]
  39. Reflect and Refine...
      - Recall the hypothesis: "Johnny Araya has a more effective social media strategy than Rodolfo Hernández because he has more Facebook and Twitter popularity"
      - What do we know now?
      - Hernández has Facebook vitals that are quite competitive with Araya
      - However, Hernández only joined Facebook ~6 months ago!
      - It would appear that Hernández has the more effective strategy
      - What is he doing to rise in popularity so quickly?
  40. Comparison of Facebook Content
  41. Other Candidates
  42. Johnny Araya FB Posts
  43. Rodolfo Hernández FB Posts
  44. [Image only]
  45. Past ~2 Months on Facebook

      Candidate                            Aug 2013 FB Likes   Sept 2013 FB Likes   % Change
      Johnny Araya                         50,301              53,809               6.97%
      Otto Guevara Guth                    24,146              27,675               14.62%
      José María Villalta Florez-Estrada   27,262              35,169               29.00%
      Dr. Rodolfo Hernández                37,495              38,298               2.14%
      Luis Guillermo Solís Rivera          5,334               6,763                26.79%

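
The "% Change" column follows directly from the two monthly counts. A minimal sketch, using the figures from the slide; `facebook_likes` and `percent_change` are illustrative names, not part of the original deck:

```python
# (August 2013 likes, September 2013 likes) copied from the slide.
facebook_likes = {
    "Johnny Araya": (50301, 53809),
    "Otto Guevara Guth": (24146, 27675),
    "José María Villalta Florez-Estrada": (27262, 35169),
    "Dr. Rodolfo Hernández": (37495, 38298),
    "Luis Guillermo Solís Rivera": (5334, 6763),
}


def percent_change(before, after):
    """Relative growth from `before` to `after`, expressed as a percentage."""
    return 100.0 * (after - before) / before


changes = {name: round(percent_change(before, after), 2)
           for name, (before, after) in facebook_likes.items()}

# Rank candidates from fastest to slowest growth.
for name in sorted(changes, key=changes.get, reverse=True):
    print("{0:<36} {1:6.2f}%".format(name, changes[name]))

fastest_growing = max(changes, key=changes.get)
```

Sorting by growth rather than by absolute likes highlights the candidates gaining momentum, which raw totals hide.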
  46. Past ~3 Months on Twitter

      Candidate                            Aug 2013   Sept 2013   % Change
      Johnny Araya                         14,573     15,506      6.40%
      Otto Guevara Guth                    114        159         39.47%
      José María Villalta Florez-Estrada   8,160      8,990       10.17%
      Dr. Rodolfo Hernández                745        858         15.17%
      Luis Guillermo Solís Rivera          1,192      1,487       24.75%

  47. Facebook and Twitter Compared

      Candidate                            % FB Change   % Twitter Change
      Johnny Araya                         6.97%         6.40%
      Otto Guevara Guth                    14.62%        39.47%
      José María Villalta Florez-Estrada   29.00%        10.17%
      Dr. Rodolfo Hernández                2.14%         15.17%
      Luis Guillermo Solís Rivera          26.79%        24.75%

  48. Your Imagination Is the Only Limit
      - Analyze the comments that people are leaving on Facebook pages
      - Try to ascertain common Facebook fans or Twitter followers amongst candidates
      - Deduce demographics from social media by synthesizing public data
      - Theorize about potential "reach" or "influence" using social media
      - Analyze data in realtime
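
Finding fans or followers that candidates have in common reduces to set intersection once the follower IDs have been collected. The sketch below uses made-up IDs purely for illustration; how the IDs are harvested is left out, and the overlap ratio (Jaccard similarity) is one possible measure, not one named in the talk.

```python
# Hypothetical follower IDs, for illustration only.
followers = {
    "Johnny_Araya": {101, 102, 103, 104, 105},
    "ElDoctor2014": {104, 105, 106},
}

araya = followers["Johnny_Araya"]
hernandez = followers["ElDoctor2014"]

# Accounts that follow both candidates.
common = araya & hernandez

# Jaccard similarity: shared followers relative to all distinct followers.
jaccard = len(common) / len(araya | hernandez)

print("Common followers:", sorted(common))
print("Overlap (Jaccard): {0:.2f}".format(jaccard))
```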
  49. Thinking about Reach
      - Think about "liking" and "following" as opt-ins to feeds
      - Remember: Interest Graphs
      - Arriving at effective metrics is trickier than it initially seems
  50. Potential Twitter Influence

      Metric                Araya    Hernández
      Followers             ~14k     ~750
      Theoretical Reach     ~40M     ~550k
      Reach (10)            490      673
      Reach (100)           289      702
      Reach (1000)          2782     X
      Reach (10,000)        2832     X
      "Suspect" Followers   3,246    94

      See also http://wp.me/p3QiJd-2a
  51. Potential Influence
  52. Who are Candidates Following?
  53. What are Candidates Tweeting?
  54. Realtime Analysis
      - Monitor Twitter's firehose for realtime data using filters such as #Syria
      - Keep in mind the sheer volume of data can be considerable
      - Analysis at MiningTheSocialWeb.com
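
Filtering a stream by hashtag can be sketched without committing to any particular client library. The example below is an illustration under stated assumptions: tweets are plain dictionaries shaped like Twitter's tweet JSON (an `entities` object holding a `hashtags` list), `sample_stream` is made-up data standing in for a live connection, and the function names are hypothetical. A generator is used so that a high-volume stream is processed one tweet at a time.

```python
from collections import Counter


def has_hashtag(tweet, tag):
    """True if the tweet's hashtag entities include `tag` (case-insensitive)."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return any(h.get("text", "").lower() == tag.lower() for h in hashtags)


def filter_stream(tweets, tag):
    """Lazily yield matching tweets so the whole stream is never held in memory."""
    for tweet in tweets:
        if has_hashtag(tweet, tag):
            yield tweet


# Hypothetical sample standing in for a live stream.
sample_stream = [
    {"text": "Latest on #Syria",
     "entities": {"hashtags": [{"text": "Syria"}]}},
    {"text": "Good morning",
     "entities": {"hashtags": []}},
    {"text": "#syria #UN vote",
     "entities": {"hashtags": [{"text": "syria"}, {"text": "UN"}]}},
]

matched = list(filter_stream(sample_stream, "Syria"))

# Count every hashtag that co-occurs in the matched tweets.
hashtag_counts = Counter(
    h["text"].lower() for t in matched for h in t["entities"]["hashtags"]
)
print(len(matched), "matching tweets;", hashtag_counts.most_common())
```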
  55. Mapping #Syria Tweets
      - See http://wp.me/p3QiJd-1t
  56. Temporal Analysis on #Syria
  57. Analyzing #Syria Tweet Entities
  58. Closing Remarks
      - Software is the gift that keeps on giving
      - Code it up once, run it ad infinitum...
      - Code designed for one account will work for other accounts
      - Analysis is all about knowing what to count
      - Coding it up is just the dirty work
      - Start somewhere and then iteratively explore...then exploit
  59. Aspire to Do Great Things
      - Predicting demographic data such as age or gender is possible for some languages
      - Time and space are fundamentals for grounding online discussions in reality.
      - Twitter is about as good as it gets for realtime topical analysis
      - Think of the world as signal producers and signal collectors
      - Monitoring breaking news events like #Syria
  60. The Tip of the Iceberg
  61. Stay in Touch
      - Website: http://MiningTheSocialWeb.com
      - Twitter: @ptwobrussell
      - FB: http://facebook.com/MiningTheSocialWeb
      - LinkedIn: http://linkedin.com/in/ptwobrussell
      - Email: ptwobrussell@gmail.com