
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)



Sunday 9:55 a.m.–10:45 a.m.

Why Twitter Is All the Rage: A Data Miner's Perspective

Presenter: Matthew Russell

Audience level: Novice

In order to be successful, technology must amplify a meaningful aspect of our human experience, and Twitter’s success largely has been dependent on its ability to do this quite well. Although you could describe Twitter as just a “free, high-speed, global text-messaging service,” that would be to miss the much larger point that Twitter scratches some of the most fundamental itches of our humanity.

This talk explains why Twitter is "all the rage" by examining Twitter in light of fundamental questions about our humanity:

* We want to be heard
* We want to satisfy our curiosity
* We want it easy
* We want it now

This session examines how Twitter addresses these questions and presents its underlying conceptual architecture as an interest graph.

Even if you have minimal programming skills, you'll come away able to think about data mining on Twitter in more effective ways and to apply a powerful collection of easily adaptable recipes that fully exploit the 5 kilobytes of metadata that decorates those 140 characters that you commonly think of as a tweet. Learn how to access Twitter's API, search for tweets, discover trending topics, process tweets in real time from the firehose, and much more.
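To make the "metadata that decorates those 140 characters" concrete, here is a minimal sketch of pulling the commonly used fields out of a single tweet. It assumes the tweet has already been fetched and parsed into a Python dictionary shaped like a Twitter REST API v1.1 response. The `sample_tweet` payload and the `extract_entities` helper are hypothetical and hand-written for illustration; they are not taken from the talk, and a real tweet carries many more fields.

```python
# Hypothetical, hand-written payload shaped like a Twitter REST API v1.1
# tweet; every value here is invented for illustration only.
sample_tweet = {
    "created_at": "Sun Feb 23 15:55:00 +0000 2014",
    "text": "Mining the interest graph at #PyTN with @ptwobrussell http://t.co/example",
    "retweet_count": 3,
    "user": {"screen_name": "example_user", "followers_count": 42},
    "entities": {
        "hashtags": [{"text": "PyTN"}],
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "urls": [{"expanded_url": "http://example.com/talk"}],
    },
}


def extract_entities(tweet):
    """Return the 'easy to get at' fields of one tweet dictionary.

    Missing keys fall back to empty values so that partial payloads
    do not raise KeyError.
    """
    entities = tweet.get("entities", {})
    return {
        "author": tweet.get("user", {}).get("screen_name"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "retweet_count": tweet.get("retweet_count", 0),
    }


if __name__ == "__main__":
    print(extract_entities(sample_tweet))
```

Working from a plain dictionary keeps the sketch independent of any particular client library; the same function could be applied to each tweet returned by a search or read from a stream.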

Published in: Technology


  1. Why Twitter Is All The Rage: A Data Miner's Perspective. Matthew A. Russell - @ptwobrussell - PyTN - 23 February 2014
  2. Overview: Intro; Twitter as a Platform for Data Science; Applications of Firehose Analysis (#Syria circa last); Understanding the Amazon Prime Air Reaction (IPython Notebook Walk Through); Q&A
  3. Intro
  4. Hello, My Name Is ... Matthew: Background in Computer Science (data mining & machine learning); CTO @ Digital Reasoning Systems (data mining, machine learning); Author @ O'Reilly Media (5 published books on technology); Principal @ Zaffra (selective boutique consulting)
  5. Transforming Curiosity Into Insight: An open source software (OSS) project; A book; Accessible to (virtually) everyone; Virtual machine with turn-key coding templates for data science experiments; Think of the book as "premium" support for the OSS project
  6. Mining the Social Web ToC: Chapter 1 - Mining Twitter; Chapter 2 - Mining Facebook; Chapter 3 - Mining LinkedIn; Chapter 4 - Mining Google+; Chapter 5 - Mining Web Pages; Chapter 6 - Mining Mailboxes; Chapter 7 - Mining GitHub; Chapter 8 - Mining the Semantically Marked-Up Web; Chapter 9 - Twitter Cookbook
  7. Anatomy of Each Chapter: Brief Intro; Objectives; API Primer; Analysis Technique(s); Data Visualization; Recap; Suggested Exercises; Recommended Resources
  8. Opportunities for Data Alchemy: A model for the world: signal and sinks; Growth in data exhaust is accelerating; Digital fingerprints of the "real world" are accumulating; Lots of opportunities for motivated Python hackers; "Software is eating the world"
  9. Social Media Is All the Rage: World population: 7B people; Facebook: 1B+ users; Twitter: 650M users; Google+: 500M users; LinkedIn: 260M users; 250M+ blogs (conservatively?)
  10. But what does it all mean, Basil? It's a platform for data science and the frontier for predictive analytics: Understanding world events; Swaying political elections; Modeling human behavior; Analyzing sentiment; Making intelligent recommendations
  11. Twitter & Data Science
  12. Data Science: Data => Actionable information; Highly interdisciplinary; Nascent; Necessary
  13. Another View of Data Science
  14.
  15. Twitter Is All the Rage: It satisfies fundamental human desires (We want to be heard, We want to satisfy our curiosity, We want it easy, We want it now); Accessible, rich, and (mostly) "open" data; RESTful APIs and JSON responses; Great proving ground for predictive analytics about the real world
  16. Twitter's Network Dynamics: ~650M curious users; A collective consciousness; Real-time communication; Short, sweet, ... and fast; Asymmetric Following Model; An interest graph
  17. Twitter Primitives: Account Types: "Anything"; "Following" Relationships; Favorites; Retweets; Replies; (Almost) No Privacy Controls
  18. Twitter and Facebook Compared (Twitter vs. Facebook): Account Types: "Anything" vs. Account Types: People & Pages; "Following" Relationships vs. Mutual Connections; Favorites vs. "Likes"; Retweets vs. "Shares"; Replies vs. "Comments"; (Almost) No Privacy Controls vs. Extensive Privacy Controls
  19. What's in a Tweet? 140 Characters ... Plus ~5KB of metadata: Authorship; Time & location; Tweet "entities"; Replying, retweeting, favoriting, etc.
  20. What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters: @usermentions; #hashtags; URLs (multiple variations); (financial) symbols (stock tickers); media
  21. API Requests: RESTful requests; Everything is a "resource"; You GET, PUT, POST, and DELETE resources; Standard HTTP "verbs"; Example: GET screen_name=SocialWebMining; Streaming API filters; JSON responses; Cursors (not quite pagination)
  22. Data Mining: Low Hanging Fruit: "Know thy data..."; Start with simple stats: Count, Compare, Filter, Rank; Then, apply more complex analyses
  23. A Starting Point: Histograms: A chart that is handy for frequency analysis; They look like bar charts...except they're not bar charts; Each value on the x-axis is a range (or "bin") of values; Not categorical data; Each value on the y-axis is the combined frequency of values in each range
  24. Example: Histogram of Retweets
  25. Social Network Mechanics: Roberto Mercedes Jorge Nina Ana
  26. Interest Graph Mechanics: U2 Roberto Mercedes Juan Luis Luís Guerra Ana Jorge Nina
  27. A (Social) Interest Graph: U2 Roberto Mercedes Juan Luis Luís Guerra Ana Jorge Nina
  28. A (Political) Interest Graph: Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina
  29. Measuring Influence Is Trickier Than It Looks: Spam bot accounts that effectively are zombies and can’t be harnessed for any utility at all; Inactive or abandoned accounts that can’t influence or be influenced since they are not in use; Accounts that follow so many other accounts that the likelihood of getting noticed (and thus influencing) is practically zero; The network effects of retweets by accounts that are active and can be influenced to spread a message; See also
  30. Justin Bieber vs Tea Party
  31. Realtime Analysis: #Syria: Monitor Twitter's firehose for realtime data using filters such as #Syria; Keep in mind the sheer volume of data can be considerable; Fuller analysis at
  32. #Syria: Who? See
  33. #Syria: Who? See
  34. #Syria: Who? See
  35. #Syria: What? See
  36. #Syria: What? See
  37. #Syria: Where? See
  38. #Syria: When? See
  39. #Syria: Why? That's for you (as the data scientist) to decide; Quantitative automation can amplify human intelligence; Qualitative analysis still requires human intelligence
  40. Twitter Firehose Analysis with pandas
  41. MTSW Virtual Machine Experience: Goal: Make it easy to transform curiosity into insight; Vagrant-based virtual machine; VirtualBox or AWS; IPython Notebook User Experience; Point-and-click GUI; 100+ turn-key examples and templates; Social web mining for the masses
  42. Social Media Analysis Framework: A memorable four-step process to guide data science experiments: Aspire, Acquire, Analyze, Summarize
  43. Goals: To understand how to capture data from Twitter's firehose; To understand basic pandas usage for tweets; To work through a data science experiment with a systematic 4-step process; To better understand the emotional reaction to the Amazon Prime Air announcement; To introduce some tools for data science
  44. Useful Links: Website Twitter Data Mining Round Up All Source Code in IPython Notebook format (GitHub)
  45. Q&A
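
The "simple stats" slide (count, compare, filter, rank) and the histogram slide can be illustrated with a short sketch using only the Python standard library. The `tweets` data, the `rank_hashtags` and `histogram` helpers, and the bin width of 10 are all assumptions invented for this example. The talk itself works with pandas in an IPython Notebook, so treat this as a stand-in for the idea rather than the presenter's code.

```python
from collections import Counter

# Invented sample data for illustration: (hashtag, retweet_count) pairs.
tweets = [
    ("Syria", 0), ("Syria", 12), ("PyTN", 3), ("Syria", 150),
    ("PrimeAir", 7), ("PyTN", 0), ("PrimeAir", 48), ("Syria", 5),
    ("PrimeAir", 1),
]


def rank_hashtags(pairs, top_n=3):
    """Count how often each hashtag occurs and rank by frequency."""
    return Counter(tag for tag, _ in pairs).most_common(top_n)


def histogram(values, bin_width):
    """Map the lower edge of each bin to the number of values in that bin.

    Each key stands for the half-open range [edge, edge + bin_width),
    which is what distinguishes a histogram from a bar chart over
    categorical data.
    """
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    # Filter: keep only the tweets that were retweeted at least once.
    retweeted = [rt for _, rt in tweets if rt > 0]
    print(rank_hashtags(tweets))
    print(histogram(retweeted, bin_width=10))
```

Retweet counts tend to be heavily skewed toward small values, so in practice the bin width (or a logarithmic scale) is worth experimenting with before drawing conclusions from the shape of the chart.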