Mining Social Web APIs with IPython Notebook - Data Day Texas 2014


Slides from a 2-hour workshop at Data Day Texas 2014 on how to mine social web APIs. The workshop focused on extracting insight from Twitter data and was split into two hour-long segments. The first segment built familiarity with Twitter's API; the second used pandas to extract insight from tweets captured from the firehose via the Streaming API.

Published in: Technology

  1. Mining Social Web APIs with IPython Notebook: Matthew A. Russell - @ptwobrussell - Data Day Texas - 11 January 2014
  2. Intro
  3. Hello, My Name Is ... Matthew: background in Computer Science (data mining & machine learning); CTO @ Digital Reasoning Systems (data mining, machine learning); Author @ O'Reilly Media (5 published books on technology); Principal @ Zaffra (selective boutique consulting)
  4. Transforming Curiosity Into Insight: an open source software (OSS) project; a book; accessible to (virtually) everyone; virtual machine with turn-key coding templates for data science experiments; think of the book as "premium" support for the OSS project
  5. The Social Web Is All the Rage: world population ~7B people; Facebook 1.15B users; Twitter 500M users; Google+ 343M users; LinkedIn 238M users; ~200M+ blogs (conservative estimate)
  6. Overview: Intro (5 mins); Module 1 - Virtual Machine & IPython Notebook Overview (10 mins); Module 2 - Twitter Intro/Overview (45 mins); Module 3 - Twitter Firehose Analysis with pandas (45 mins); Module 4 - Overview of other MTSW IPython Notebooks (5 mins); Wrap Up/Final Q&A (10 mins)
  7. Workshop Objective: to send you away as a social web hacker (hands-on experience hacking on Twitter data; empowered to walk away ready to hack on Facebook, LinkedIn, Google+, etc.; broad working knowledge of popular social web APIs); to have fun and learn a few things
  8. Just a Few More Things: this workshop is an adaptation of Chapters 1+9 from Mining the Social Web, 2nd Edition; more of a guided hacking session where you follow along (vs a lecture); designed to be very hands-on, not a lecture; I'm available 24/7 this week (and beyond) to help you be successful
  9. Assumptions: at some point in your life, you have programmed with Python, worked with JSON, and made requests and processed responses to/from web servers; or you want to learn to do these things now, and you're a quick learner
  10. Module 1: Virtual Machine Setup
  11. Why do you need a VM? To save time, because installation and configuration management is harder than it first appears; so that you can focus on the task at hand instead; so that I can support you regardless of your hardware and operating system
  12. But I can do all of that myself... True, if you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand. At least give it a shot before resorting to your own devices, so that you don't have to install specific versions of ~40 Python packages, including scientific computing tools that require underlying C/C++ code to be compiled, which requires specific versions of developer libraries to be installed. You get the idea...
  13. The Virtual Machine Experience: Vagrant, a nice abstraction around virtual machine providers (one ring to rule them all: Virtualbox, VMWare, AWS, ...); IPython Notebook, the easiest way to program with Python (a better REPL/interpreter, great for hacking)
  14. What happens when you vagrant up? Vagrant follows the instructions in your Vagrantfile: starts up a Virtualbox instance; uses Chef to provision it (installs OS patches/updates, installs MTSW software dependencies); starts IPython Notebook server on port 8888
  15. Why Should I Use IPython Notebook? Because it's great for hacking (and hacking is usually the first step); because it's great for collaboration (sharing/publishing results is trivial); because the UX is as easy as working in a notepad (think of it as "executable paper")
  16.
  17.
  18. VM Quick Start Instructions: go to [link]; follow the instructions; and watch the screencasts! Basically: install Virtualbox & Vagrant; run "vagrant up" in a terminal to start a guest VM; then go to http://localhost:8888 in your host machine's web browser
  19. What Could Be Easier? A hosted version of the VM! But only for a few hours during this workshop, because it costs money to run these servers. Go to [link] and pick a machine. Please do not share the URLs outside of this workshop! With a cherry on top...
  20. A Hosted Virtual Machine: Is it free? Perhaps... sign up for the AWS free tier at [link], but not right now; do it later. See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes
  21. One More Thing: there's a new alpha product from O'Reilly Media that hosts IPython Notebooks and other software to enhance reading experiences; I can share out "invites" with any interested volunteers
  22. Module 2: Twitter Intro/Overview
  23. Objectives: be able to identify Twitter primitives; understand tweet metadata and how to use it; learn how to extract entities such as user mentions, hashtags, and URLs; apply techniques for performing frequency analysis with Python; be able to plot histograms of Twitter data with IPython Notebook; learn about a Twitter cookbook that you can easily adapt
  24. Twitter Primitives: accounts (types: "anything"); "following" relationships; favorites; retweets; replies; (almost) no privacy controls
  25. API Requests: RESTful requests (everything is a "resource"; you GET, PUT, POST, and DELETE resources; standard HTTP "verbs"; example: GET screen_name=SocialWebMining); Streaming API filters; JSON responses; cursors (not quite pagination)
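
The "cursors (not quite pagination)" point on the slide above can be illustrated with a small loop. This is a minimal sketch, assuming the convention used by Twitter's v1.1 cursored endpoints: the first request passes a cursor of -1, each response carries a `next_cursor`, and a `next_cursor` of 0 means there is nothing left to fetch. `FAKE_PAGES` and `fetch_page` are hypothetical stand-ins for a real API call so the example runs offline.

```python
# Hypothetical canned responses keyed by cursor value. A real client would
# issue an authenticated HTTP GET instead of looking pages up in this dict.
FAKE_PAGES = {
    -1: {"ids": [101, 102, 103], "next_cursor": 7001},
    7001: {"ids": [104, 105], "next_cursor": 7002},
    7002: {"ids": [106], "next_cursor": 0},
}


def fetch_page(cursor):
    """Stand-in for one API request; returns the parsed JSON response."""
    return FAKE_PAGES[cursor]


def iterate_ids(fetch, max_pages=10):
    """Follow cursors until the response signals the end with next_cursor == 0."""
    cursor = -1  # -1 asks for the first page
    ids = []
    for _ in range(max_pages):  # cap the number of requests made
        response = fetch(cursor)
        ids.extend(response["ids"])
        cursor = response["next_cursor"]
        if cursor == 0:
            break
    return ids


all_ids = iterate_ids(fetch_page)
print(all_ids)
```

Capping the loop with `max_pages` is a design choice for this sketch: it keeps a rate-limited API from being hit indefinitely when a result set is very large.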
  26. Twitter is an Interest Graph (graph figure; node labels: Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, Nina)
  27. What's in a Tweet? 140 characters... plus ~5KB of metadata! Authorship; time & location; tweet "entities"; replying, retweeting, favoriting, etc.
  28. What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters: @usermentions; #hashtags; URLs (multiple variations); (financial) symbols (stock tickers); media
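
The entity types listed on the slide above can be pulled out of a tweet's JSON with a few list comprehensions. This is a minimal sketch, assuming the `entities` layout of Twitter's v1.1 tweet objects (`user_mentions`, `hashtags`, `urls`, `symbols`, `media`). The sample tweet is hand-written for illustration and is not real data; `extract_entities` is a hypothetical helper name.

```python
# Hand-written sample shaped like a v1.1 tweet object; not real data.
sample_tweet = {
    "text": "Learning to mine @SocialWebMining data with #python http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "python"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://example.com/post"}],
        "symbols": [],
    },
}


def extract_entities(tweet):
    """Return the 'easy to get at' entities from one tweet dictionary."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # "media" is only present when a tweet carries photos or similar.
        "media_urls": [m["media_url"] for m in entities.get("media", [])],
    }


extracted = extract_entities(sample_tweet)
print(extracted)
```

Using `.get(..., [])` throughout means a tweet with a missing entity type yields an empty list rather than raising a `KeyError`.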
  29. Data Mining Is Often Just... counting, comparing, filtering, ranking
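
The four operations on the slide above map closely onto Python's `collections.Counter`. A minimal sketch follows; the hashtag list is made up for illustration.

```python
from collections import Counter

# Made-up hashtags, as if collected from a batch of tweets.
hashtags = ["python", "data", "python", "pandas", "data", "python", "ipython"]

counts = Counter(hashtags)                                   # counting
python_beats_data = counts["python"] > counts["data"]        # comparing
frequent = {tag: n for tag, n in counts.items() if n >= 2}   # filtering
ranked = counts.most_common(2)                               # ranking

print(counts)
print(python_beats_data)
print(frequent)
print(ranked)
```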
  30. Histograms: a chart that is handy for frequency analysis; they look like bar charts...except they're not bar charts; each value on the x-axis is a range (or "bin") of values (not categorical data); each value on the y-axis is the combined frequency of values in each range
  31. Example: Histogram of Retweets
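
The binning behind a histogram can be sketched without any plotting library. The example below is an illustration only: the retweet counts are made up, `histogram` is a hypothetical helper, and the text bars stand in for the chart a notebook would draw.

```python
def histogram(values, bin_width):
    """Count how many integer values fall in each bin [edge, edge + bin_width)."""
    if not values:
        return {}
    counts = {}
    for value in values:
        edge = (value // bin_width) * bin_width  # lower edge of this value's bin
        counts[edge] = counts.get(edge, 0) + 1
    first, last = min(counts), max(counts)
    # Include empty bins so gaps in the data stay visible on the x-axis.
    return {edge: counts.get(edge, 0)
            for edge in range(first, last + bin_width, bin_width)}


BIN_WIDTH = 10
retweet_counts = [0, 3, 7, 12, 15, 18, 25, 41, 44, 49]  # made-up data

hist = histogram(retweet_counts, BIN_WIDTH)
for edge, freq in hist.items():
    print(f"{edge:>3}-{edge + BIN_WIDTH - 1:<3} {'#' * freq}")
```

Each key is a range of retweet counts rather than a category, which is the difference between a histogram and a bar chart that the slides point out.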
  32. Social Media Analysis Framework: a memorable four-step process to guide data science experiments. Aspire (to test a hypothesis, i.e. answer a question); Acquire (get the data); Analyze (count things); Summarize (plot the results)
  33. Exercises: Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook. Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook: fill in Example 1-1 with credentials and begin work (see [link] for a helpful video), execute each example sequentially, and customize queries, explore tweet metadata, count tweet entities, etc. Explore the "Chapter 9 (Twitter Cookbook)" notebook; in particular, check out Example 9-8 (Twitter's Streaming API)
  34. Module 3: Twitter Firehose Analysis with pandas
  35. Objectives: to understand how to capture data from Twitter's firehose; to understand basic pandas usage for tweets; to work through a data science experiment with a systematic 4-step process
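
As a rough idea of what "basic pandas usage for tweets" can look like, the sketch below loads a few tweet-like records into a DataFrame, indexes them by time, and counts tweets per minute. It is not taken from the workshop notebook: the records are hand-written, and the `created_at` strings assume the timestamp format used by Twitter's v1.1 API.

```python
import pandas as pd

# Hand-written records in the v1.1 created_at format; not real data.
tweets = [
    {"created_at": "Mon Dec 02 00:00:05 +0000 2013", "text": "Drones? #PrimeAir"},
    {"created_at": "Mon Dec 02 00:00:40 +0000 2013", "text": "Delivery by air"},
    {"created_at": "Mon Dec 02 00:01:15 +0000 2013", "text": "Is this real? #PrimeAir"},
    {"created_at": "Mon Dec 02 00:03:30 +0000 2013", "text": "Wow"},
]

df = pd.DataFrame(tweets)

# Parse the timestamp strings and use them as a time index.
df["created_at"] = pd.to_datetime(df["created_at"],
                                  format="%a %b %d %H:%M:%S %z %Y")
df = df.set_index("created_at")

# Tweets per minute; minutes with no tweets show up as 0.
per_minute = df["text"].resample("1min").count()

# How many tweets mention the hashtag.
hashtag_mentions = int(df["text"].str.contains("#PrimeAir").sum())

print(per_minute)
print(hashtag_mentions)
```

Indexing by time first is what makes `resample` available, which is a convenient way to turn a stream of tweets into a time series.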
  36. Social Media Analysis Framework: remember Aspire, Acquire, Analyze, Summarize
  37. Understanding the Reaction to Amazon Prime Air: open up the notebook entitled "__Understanding the Reaction to Amazon Prime Air.ipynb" and follow along; or visit [link] and follow along if you're just joining us
  38. Module 4: Overview of other MTSW IPython Notebooks
  39. Mining the Social Web ToC: Chapter 1 - Mining Twitter; Chapter 2 - Mining Facebook; Chapter 3 - Mining LinkedIn; Chapter 4 - Mining Google+; Chapter 5 - Mining Web Pages; Chapter 6 - Mining Mailboxes; Chapter 7 - Mining GitHub; Chapter 8 - Mining the Semantically Marked-Up Web; Chapter 9 - Twitter Cookbook
  40. A Recommendation: bookmark [link]; take note of Mining the Social Web under "Books"; notice lots of other terrific notebooks, too
  41. Wrap Up / Final Q&A
  42. Helpful Links & Free Stuff: Mining the Social Web 2E Chapter 1 (Chimera); Source Code (GitHub) (numbered examples); Screencasts (Vimeo)