Mining Social Web Data Like a Pro: Four Steps to Success


Published on

GDA Presentation - Quito Ecuador - 20 Sept 2013

Published in: Technology, Business


  1. Mining Social Web Data Like a Pro: Four Steps to Success. Presented by Matthew A. Russell. "Data Journalism and Interactivity" - GDA Seminar, Quito, Ecuador - 20 September 2013
  2. Hola: Trained as a computer scientist; CTO @ Digital Reasoning Systems (data mining, machine learning); Principal @ Zaffra (boutique consulting); Author @ O'Reilly Media (5 published books on technology)
  3. [No text on slide]
  4. Transform Curiosity Into Insight: An open source project (http://bit.ly/MiningTheSocialWeb2E); inherently accessible; virtual machine & IPython Notebook UX; turn-key code templates for bootstrapping data science experiments; think of the book as "premium" support for the OSS project
  5. Why not in Spanish? (¿Por qué no Español?)
  6. Investigative Journalist: "A person whose profession it is to discover the truth and to identify lapses from it in whatever media may be available."
  7. Data Science: Data => actionable information; highly interdisciplinary; nascent; necessary. See http://wikipedia.org/wiki/Data_science
  8. Digital Signal Explosion: A model for the world (signals and sinks); growth in data exhaust is accelerating; digital fingerprints; software is eating the world; data mining opportunities galore...
  9. Digital Data Stats: 100 terabytes of data are uploaded daily to Facebook; brands and organizations on Facebook receive 34,722 Likes every minute of the day; according to Twitter's own research in early 2012, it sees roughly 175 million tweets every day; 30 billion pieces of content are shared on Facebook every month; data production will be 44 times greater in 2020 than it was in 2009; according to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years. See http://wikibon.org/blog/big-data-statistics
  10. Social Media Is All the Rage: World population: ~7B people; Facebook: 1.15B users; Twitter: 500M users; Google+: 343M users; LinkedIn: 238M users; ~200M+ blogs (conservative estimate)
  11. But Why Is It All the Rage? It satisfies fundamental human desires: we want to be heard; we want to satisfy our curiosity; we want it easy; we want it now
  12. Social Network Mechanics [diagram; nodes: Roberto, Mercedes, Jorge, Ana, Nina]
  13. Interest Graph Mechanics [diagram; nodes: Roberto, Mercedes, Jorge, Ana, Nina, U2, Juan Luis Guerra]
  14. A (Social) Interest Graph [diagram; nodes: Roberto, Mercedes, Jorge, Ana, Nina, U2, Juan Luis Guerra]
  15. A (Political) Interest Graph [diagram; nodes: Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández]
  16. Social Media Dimensions: Facebook - account types: people & pages; mutual connections; "Likes", "Shares", "Comments"; extensive privacy controls. Twitter - account types: "anything"; "following" relationships; favorites, retweets, replies; (almost) no privacy controls
  17. Why Does This Matter? "If you can measure it, you can improve it"; modeling behavior; predictive analysis; recommending content; swaying political situations might just be the ultimate value proposition for social media
  18. Social Media Analysis Framework - Four Steps To Success: Aspire; Acquire; Analyze; Summarize. Let's step through a trivial example...
  19. (1) Aspire: Let's frame a trivial hypothesis to illustrate the four steps... Frame a hypothesis about some real-world phenomenon. For example: "Johnny Araya is a more popular candidate than Rodolfo Hernández". Let's use social media as a basis of investigation
  20. (2) Acquire: Collect the data that you need to test the hypothesis. How? Use Facebook and Twitter APIs to harvest data about each candidate; go after low-hanging fruit before something more complex; you don't even need to write code to do this (yet)
  21. They're both on Facebook: http://facebook.com/ElDoctor2014 and http://facebook.com/JohnnyArayaMonge
  22. They're both on Twitter: @Johnny_Araya and @ElDoctor2014
  23. (3) Analyze: Count, filter, and rank the data. Johnny Araya: ~50k Facebook likes, ~14k Twitter followers; Rodolfo Hernández: ~37k Facebook likes, 745 Twitter followers. Johnny Araya is indeed more popular in social media
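The count-and-rank step on this slide can be sketched in a few lines of Python. The figures are the approximate ones quoted on the slide; the dictionary layout and the `rank_by` helper are illustrative assumptions, not code from the talk.

```python
# Approximate audience sizes quoted on the slide (assumed layout, for illustration).
popularity = {
    'Johnny Araya': {'facebook_likes': 50000, 'twitter_followers': 14000},
    'Rodolfo Hernández': {'facebook_likes': 37000, 'twitter_followers': 745},
}


def rank_by(metric):
    """Return candidate names sorted from most to least popular on one metric."""
    return sorted(popularity, key=lambda name: popularity[name][metric], reverse=True)


facebook_ranking = rank_by('facebook_likes')
twitter_ranking = rank_by('twitter_followers')

for label, ranking in [('Facebook likes', facebook_ranking),
                       ('Twitter followers', twitter_ranking)]:
    print(label + ': ' + ' > '.join(ranking))
```

Keeping the counts in one dictionary means further candidates or metrics can be added without changing the ranking logic.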
  24. (4) Summarize: Present the data in a concise and easily understood manner: charts; tables; simple visualizations. Some examples...
  25. Social Media Popularity: Araya vs Hernández [two charts: Twitter Popularity and Facebook Popularity, each showing Araya % vs Hernández %]
  26. Social Media Popularity: Araya vs Hernández [bar chart: Twitter followers and Facebook fans per candidate; linear axis, 0 to 60,000]
  27. Social Media Popularity: Araya vs Hernández [bar chart: Twitter followers and Facebook fans per candidate; logarithmic axis, 1 to 100,000]
  28. Twitter Popularity
  29. Facebook Popularity [chart: Facebook Likes for Costa Rican Presidential Candidates]: JohnnyArayaMonge 35%; o0oguevaraguth 17%; luisguillermosolisr 3%; villaltaJM 19%; ElDoctor2014 26%
  30. Reflect and Refine... Recall the previous hypothesis: "Johnny Araya is a more popular candidate than Rodolfo Hernández". What do we know now that we didn't before? The current state of each candidate's Twitter and Facebook popularity. Let's explore a slightly more complex hypothesis...
  31. (1) Aspire: Redefine the hypothesis. For example: "Johnny Araya has a more effective social media strategy than Rodolfo Hernández", presumably because of his superior social media status at the moment
  32. (2) Acquire: Collect the data that you need to test the hypothesis. How? Use APIs to harvest data about each candidate. Let's consider any Facebook posts for 2013
  33. Python Source Code

          import json
          import requests

          for candidate in ['JohnnyArayaMonge', 'ElDoctor2014']:
              # Get the data (XXX is a placeholder for a real access token)
              url = ('https://graph.facebook.com/{0}'
                     '?fields=posts.limit(500)&access_token=XXX').format(candidate)
              content = requests.get(url).json()

              # Save the data
              with open(candidate + '.json', 'w') as f:
                  f.write(json.dumps(content))
  34. (3) Analyze: Count, filter, and rank the data; some more Python source code to crunch the numbers; extract Facebook likes and shares this year
  35. Facebook Vitals

          ElDoctor2014
            Total Likes                                     37495
            Num Posts since Jan 1, 2013 (of 500 possible)   436
            Total Post Likes                                155473
            Total Post Shares                               9684
            Oldest Post in Batch                            2013-03-15T00:40:21+0000
            Num posts prior to Jan 1, 2013                  0
            Avg likes/post                                  356.589449541 (0.951032003044%)
            Avg shares/post                                 22.2110091743 (0.059237256099%)
            Post Types                                      [(u'photo', 286), (u'link', 77), (u'status', 40), (u'video', 32), (u'swf', 1)]

          JohnnyArayaMonge
            Total Likes                                     50301
            Num Posts since Jan 1, 2013 (of 500 possible)   205
            Total Post Likes                                176161
            Total Post Shares                               7542
            Oldest Post in Batch                            2013-01-01T07:18:43+0000
            Num posts prior to Jan 1, 2013                  190
            Avg likes/post                                  859.32195122 (1.70835957778%)
            Avg shares/post                                 36.7902439024 (0.0731401838978%)
            Post Types                                      [(u'photo', 149), (u'status', 38), (u'link', 13), (u'video', 5)]
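The number crunching behind these vitals is not shown in the deck. The following is a minimal Python 3 sketch of how such figures might be derived. It assumes each post is a dictionary with `created_time`, `type`, and optional `likes.count` and `shares.count` fields; the real Graph API response may be laid out differently. The function name `facebook_vitals` and the sample posts are hypothetical. The percentage in parentheses on the slide appears to be the average likes per post divided by the page's total likes, and the sketch follows that reading.

```python
from collections import Counter


def facebook_vitals(posts, page_likes, since='2013-01-01'):
    """Summarize likes and shares for posts created on or after `since`."""
    # ISO 8601 timestamps in one time zone sort lexicographically,
    # so a plain string comparison is enough to filter by date here.
    recent = [p for p in posts if p['created_time'] >= since]
    n = len(recent)
    total_likes = sum(p.get('likes', {}).get('count', 0) for p in recent)
    total_shares = sum(p.get('shares', {}).get('count', 0) for p in recent)
    avg_likes = total_likes / n if n else 0.0
    return {
        'num_posts': n,
        'num_prior_posts': len(posts) - n,
        'total_post_likes': total_likes,
        'total_post_shares': total_shares,
        'avg_likes_per_post': avg_likes,
        'avg_shares_per_post': total_shares / n if n else 0.0,
        # Average likes per post as a percentage of the page's total likes.
        'avg_likes_pct_of_page_likes': 100.0 * avg_likes / page_likes if page_likes else 0.0,
        'post_types': Counter(p['type'] for p in recent).most_common(),
    }


# Made-up sample posts, for illustration only (not data from the talk).
sample_posts = [
    {'created_time': '2013-03-15T00:40:21+0000', 'type': 'photo',
     'likes': {'count': 300}, 'shares': {'count': 20}},
    {'created_time': '2013-04-01T12:00:00+0000', 'type': 'link',
     'likes': {'count': 100}},
    {'created_time': '2012-12-31T23:59:59+0000', 'type': 'status',
     'likes': {'count': 50}},
]

vitals = facebook_vitals(sample_posts, page_likes=1000)
print(vitals)
```

Because the function takes a list of posts rather than a file name, the same code can be reused for any page whose data has been saved in the assumed layout.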
  36. (4) Summarize: Present the data in a concise and easily understood manner. Like a table...
  37. Facebook summary table

          Metric                        Araya         Hernández
          Total Likes                   50,301        37,495
          Posts since 1 Jan 13          205           436
          Num Prior Posts               190+          0
          Earliest Post                 1 Jan 2013    15 March 2013
          Post Likes since 1 Jan 13     176,161       155,473
          Post Shares since 1 Jan 13    7,542         9,684
          Avg Likes per Post            859           356
          Avg Shares per Post           36            22
  39. Reflect and Refine... Recall the hypothesis: "Johnny Araya has a more effective social media strategy than Rodolfo Hernández because he has more Facebook and Twitter popularity". What do we know now? Hernández has Facebook vitals that are quite competitive with Araya's; however, Hernández only joined Facebook ~6 months ago! It would appear that Hernández has the more effective strategy. What is he doing to rise in popularity so quickly?
  40. Comparison of Facebook Content
  41. Other Candidates
  42. Johnny Araya FB Posts
  43. Rodolfo Hernández FB Posts
  44. [No text on slide]
  45. Past ~2 Months on Facebook

          Candidate                             Aug 2013 FB Likes    Sept 2013 FB Likes    % Change
          Johnny Araya                          50,301               53,809                6.97%
          Otto Guevara Guth                     24,146               27,675                14.62%
          José María Villalta Florez-Estrada    27,262               35,169                29.00%
          Dr. Rodolfo Hernández                 37,495               38,298                2.14%
          Luis Guillermo Solís Rivera           5,334                6,763                 26.79%

  46. Past ~3 Months on Twitter

          Candidate                             Aug 2013    Sept 2013    % Change
          Johnny Araya                          14,573      15,506       6.40%
          Otto Guevara Guth                     114         159          39.47%
          José María Villalta Florez-Estrada    8,160       8,990        10.17%
          Dr. Rodolfo Hernández                 745         858          15.17%
          Luis Guillermo Solís Rivera           1,192       1,487        24.75%

  47. Facebook and Twitter Compared

          Candidate                             % FB Change    % Twitter Change
          Johnny Araya                          6.97%          6.40%
          Otto Guevara Guth                     14.62%         39.47%
          José María Villalta Florez-Estrada    29.00%         10.17%
          Dr. Rodolfo Hernández                 2.14%          15.17%
          Luis Guillermo Solís Rivera           26.79%         24.75%
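The "% Change" figures can be reproduced with a one-line formula, (new - old) / old. The sketch below applies it to the Facebook like counts from the "Past ~2 Months on Facebook" slide; the `percent_change` helper and the dictionary layout are illustrative assumptions.

```python
def percent_change(old, new):
    """Relative change from `old` to `new`, expressed in percent."""
    return 100.0 * (new - old) / old


# (Aug 2013, Sept 2013) Facebook like counts taken from the slide.
facebook_likes = {
    'Johnny Araya': (50301, 53809),
    'Otto Guevara Guth': (24146, 27675),
    'José María Villalta Florez-Estrada': (27262, 35169),
    'Dr. Rodolfo Hernández': (37495, 38298),
    'Luis Guillermo Solís Rivera': (5334, 6763),
}

changes = {name: percent_change(aug, sept)
           for name, (aug, sept) in facebook_likes.items()}

# Rank candidates by growth, fastest first.
for name, pct in sorted(changes.items(), key=lambda item: item[1], reverse=True):
    print('{0}: {1:.2f}%'.format(name, pct))
```

Ranking by growth rather than by absolute likes puts Villalta and Solís ahead of Araya and Hernández, which is the contrast the comparison slide draws out.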
  48. Your Imagination Is the Only Limit: Analyze the comments that people are leaving on Facebook pages; try to ascertain common Facebook fans or Twitter followers amongst candidates; deduce demographics from social media by synthesizing public data; theorize about potential "reach" or "influence" using social media; analyze data in realtime
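Finding common followers amongst candidates comes down to set intersection once the follower IDs have been collected. A minimal sketch, assuming the IDs are already in hand; the candidate labels and ID values below are hypothetical placeholders, not real data.

```python
# Hypothetical follower ID sets; real IDs would come from the Twitter API.
followers = {
    'candidate_a': {101, 102, 103, 104},
    'candidate_b': {103, 104, 105},
}

# Followers that both candidates share.
common = followers['candidate_a'] & followers['candidate_b']

# Jaccard similarity: shared followers as a fraction of all distinct followers.
all_followers = followers['candidate_a'] | followers['candidate_b']
jaccard = len(common) / len(all_followers)

print(sorted(common), jaccard)
```

Jaccard similarity is one common way to normalize the overlap so that accounts of very different sizes can be compared; it is offered here as an option, not as the method used in the talk.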
  49. Thinking about Reach: Think about "liking" and "following" as opt-ins to feeds; remember: interest graphs; arriving at effective metrics is trickier than it initially seems
  50. Potential Twitter Influence

          Metric                 Araya    Hernández
          Followers              ~14k     ~750
          Theoretical Reach      ~40M     ~550k
          Reach (10)             490      673
          Reach (100)            289      702
          Reach (1000)           2782     X
          Reach (10,000)         2832     X
          "Suspect" Followers    3,246    94

          See also http://wp.me/p3QiJd-2a
  51. Potential Influence
  52. Who are Candidates Following?
  53. What are Candidates Tweeting?
  54. Realtime Analysis: Monitor Twitter's firehose for realtime data using filters such as #Syria; keep in mind the sheer volume of data can be considerable; analysis at MiningTheSocialWeb.com
  55. Mapping #Syria Tweets. See http://wp.me/p3QiJd-1t
  56. Temporal Analysis on #Syria
  57. Analyzing #Syria Tweet Entities
  58. Closing Remarks: Software is the gift that keeps on giving; code it up once, run it ad infinitum...; code designed for one account will work for other accounts; analysis is all about knowing what to count; coding it up is just the dirty work; start somewhere and then iteratively explore...then exploit
  59. Aspire to Do Great Things: Predicting demographic data such as age or gender is possible for some languages; time and space are fundamentals for grounding online discussions in reality; Twitter is about as good as it gets for realtime topical analysis; think of the world as signal producers and signal collectors; monitoring breaking news events like #Syria
  60. The Tip of the Iceberg
  61. Stay in Touch: Website: http://MiningTheSocialWeb.com; Twitter: @ptwobrussell; FB: http://facebook.com/MiningTheSocialWeb; LinkedIn: http://linkedin.com/in/ptwobrussell; Email: ptwobrussell@gmail.com