Mining Social Web APIs with IPython Notebook (Strata 2013)


Slides from my Strata / Hadoop World 2013 (NYC) hands-on workshop.

Workshop Description from

Social web properties such as Twitter, Facebook, LinkedIn, and Google+ have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from Mining the Social Web (2nd Edition) with IPython Notebook.

Each module begins with a brief period in which attendees customize the module's notebook with their own account credentials. The remainder of the module is devoted to learning what data is available from the API and to exercises that demonstrate analysis of the data, all from a pre-populated IPython Notebook. Even attendees with minimal programming experience should be able to walk away from this workshop with a working knowledge of the material and with sample code that is designed to be easily repurposed.

Time will be set aside at the end of each module’s follow-along presentation for attendees to hack on the code, discuss examples, and ask any lingering questions.

Published in: Technology

  1. Mining Social Web APIs with IPython Notebook Matthew A. Russell - @ptwobrussell - New York City - 28 October 2013
  2. Intro
  3. Hello, My Name Is ... Matthew Background in Computer Science Data mining & machine learning CTO @ Digital Reasoning Systems Data mining; machine learning Author @ O'Reilly Media 5 published books on technology Principal @ Zaffra Selective boutique consulting
  4. Transforming Curiosity Into Insight An open source software (OSS) project A book Accessible to (virtually) everyone Virtual machine with turn-key coding templates for data science experiments Think of the book as "premium" support for the OSS project
  5. The Social Web Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+ 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate)
  6. Overview Intro (5 mins) Module 1 - Virtual Machine Setup (10 mins) Module 2 - Mining Twitter (40 mins) Module 3 - Mining Facebook (35 mins) BREAK (30 mins) Module 4 - Mining LinkedIn (40 mins) Module 5 - Open Hack (40 mins) Final Q&A; Wrap Up (10 mins)
  7. Module Format ~10-15 minutes of exposition I talk; you listen ~25-30 minutes of independent (or collaborative) work You hack while I walk around and help you ~5 minutes of Q&A You ask; I try to answer
  8. Workshop Objective To send you away as a social web hacker Broad working knowledge of popular social web APIs Hands-on experience hacking on social web data with a common toolkit Not to listen to me talk to you for 3 hours
  9. Just a Few More Things This workshop is... An adaptation of Mining the Social Web, 2nd Edition More of a guided hacking session where you follow along (vs a preso) Wider than it is deep There's only so much you can do in a few hours I'm available 24/7 this week (and beyond) to help you be successful
  10. Assumptions At some point in your life, you have Programmed with Python Worked with JSON Made requests and processed responses to/from web servers Or you want to learn to do these things now... And you're a quick learner
  11. Module 1: Virtual Machine Setup
  12. Why do you need a VM? To save time Because installation and configuration management is harder than it first appears So that you can focus on the task at hand instead So that I can support you regardless of your hardware and operating system
  13. But I can do all of that myself... True... If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages Including scientific computing tools that require underlying C/C++ code to be compiled Which requires specific versions of developer libraries to be installed You get the idea...
  14. The Virtual Machine Experience Vagrant A nice abstraction around virtual machine providers One ring to rule them all Virtualbox, VMWare, AWS, ... IPython Notebook The easiest way to program with Python A better REPL (interpreter) Great for hacking
  15. What happens when you vagrant up? Vagrant follows the instructions in your Vagrantfile Starts up a Virtualbox instance Uses Chef to provision it Installs OS patches/updates Installs MTSW software dependencies Starts IPython Notebook server on port 8888
  16. Why Should I Use IPython Notebook? Because it's great for hacking And hacking is usually the first step Because it's great for collaboration Sharing/publishing results is trivial Because the UX is as easy as working in a notepad Think of it as "executable paper"
  19. VM Quick Start Instructions Go to Follow the instructions And watch the screencasts! Basically: Install Virtualbox & Vagrant Run "vagrant up" in a terminal to start a guest VM Then, go to http://localhost:8888 on your host machine's web browser
  20. What Could Be Easier? A hosted version of the VM! But only for a few hours during this workshop Because it costs money to run these servers Go to <the URL provided in the session> and pick a machine Do not share the URLs outside of this workshop! Please don't try to hack the machines I'll verbally provide the connection details (port and password)
  21. A Hosted Virtual Machine Yes, please. Is it free? Perhaps... ...Sign-up for the AWS free tier at But not right now. Do it later Standby for the step-by-step instructions on how to do it I'll publish a post on it in the next day or so
  23. Module 2: Mining Twitter
  24. Objectives Be able to identify Twitter primitives Understand tweet metadata and how to use it Learn how to extract entities such as user mentions, hashtags, and URLs from tweets Apply techniques for performing frequency analysis with Python Be able to plot histograms of Twitter data with IPython Notebook
  25. Twitter Primitives Accounts Types: "Anything" "Following" Relationships Favorites Retweets Replies (Almost) No Privacy Controls
  26. API Requests RESTful requests Everything is a "resource" You GET, PUT, POST, and DELETE resources Standard HTTP "verbs" Example: GET screen_name=SocialWebMining Streaming API filters JSON responses Cursors (not quite pagination)
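
The "cursors (not quite pagination)" point on the API Requests slide can be sketched as a loop. This is a minimal sketch, assuming the cursoring convention of Twitter's v1.1 REST API, in which the first request uses cursor -1 and each response carries a `next_cursor` that is 0 once the results are exhausted. `iterate_cursored`, `fake_fetch_page`, and `FAKE_PAGES` are hypothetical names that stand in for an authenticated API call so that the block runs without credentials.

```python
def iterate_cursored(fetch_page, max_pages=10):
    """Yield ids from a cursored endpoint until next_cursor comes back as 0."""
    cursor = -1  # -1 requests the first page
    pages = 0
    while cursor != 0 and pages < max_pages:
        response = fetch_page(cursor)
        for item in response["ids"]:
            yield item
        cursor = response["next_cursor"]  # 0 means there is nothing left
        pages += 1


# Hypothetical in-memory pages standing in for real API responses.
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 1001},
    1001: {"ids": [4, 5], "next_cursor": 1002},
    1002: {"ids": [6], "next_cursor": 0},
}


def fake_fetch_page(cursor):
    """Pretend to call the API: look the page up by its cursor."""
    return FAKE_PAGES[cursor]


follower_ids = list(iterate_cursored(fake_fetch_page))
print(follower_ids)
```

The `max_pages` cap is a design choice for rate-limited APIs: it bounds the number of requests one call can make.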
  27. Twitter is an Interest Graph Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina
  28. What's in a Tweet? 140 Characters ... ... Plus ~5KB of metadata! Authorship Time & location Tweet "entities" Replying, retweeting, favoriting, etc.
  29. What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters @usermentions #hashtags URLs multiple variations (financial) symbols stock tickers media
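
To make the entities slide concrete, here is a minimal sketch of pulling entities out of a tweet. It assumes the v1.1 tweet JSON layout, where an `entities` object holds `user_mentions`, `hashtags`, `urls`, and `symbols` lists. `extract_entities` is a hypothetical helper, and `sample_tweet` is a hand-written, heavily abbreviated stand-in for a real API response.

```python
def extract_entities(tweet):
    """Collect the 'easy to get at' entities from a tweet dictionary."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
    }


# Made-up example data; a real tweet carries far more metadata.
sample_tweet = {
    "text": "Hacking with @ptwobrussell at #Strata on $TWTR data http://example.com/slides",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "Strata"}],
        "urls": [{"expanded_url": "http://example.com/slides"}],
        "symbols": [{"text": "TWTR"}],
    },
}

found = extract_entities(sample_tweet)
print(found["hashtags"], found["screen_names"])
```

Using `.get` with an empty default keeps the helper from failing on tweets that lack one of the entity types.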
  30. Data Mining Is... Counting Comparing Filtering Ranking
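
The four verbs on this slide map neatly onto Python's standard library. The following is a small sketch using `collections.Counter`; the hashtag lists are made up for illustration and do not come from the workshop data.

```python
from collections import Counter

# Made-up hashtags, as if already extracted from two batches of tweets.
my_hashtags = ["Strata", "bigdata", "strata", "Python", "BigData", "strata", "hadoop"]
other_hashtags = ["strata", "python", "nyc"]

# Counting: tally each hashtag after normalizing case.
counts = Counter(tag.lower() for tag in my_hashtags)
other_counts = Counter(tag.lower() for tag in other_hashtags)

# Comparing: Counter intersection keeps the items both batches share.
shared = sorted((counts & other_counts).keys())

# Filtering: keep only the hashtags that occur more than once.
frequent = {tag: n for tag, n in counts.items() if n > 1}

# Ranking: order by frequency and keep the top two.
ranked = counts.most_common(2)

print(shared, frequent, ranked)
```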
  31. Histograms A chart that is handy for frequency analysis They look like bar charts...except they're not bar charts Each value on the x-axis is a range (or "bin") of values Not categorical data Each value on the y-axis is the combined frequency of values in each range
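
The binning idea behind a histogram can be sketched without any plotting library. This is a minimal sketch, assuming non-negative integer values and equal-width bins; `histogram` is a hypothetical helper and the retweet counts are made up. In the notebook itself, matplotlib's `plt.hist` would do the binning and draw the chart in one call.

```python
def histogram(values, bin_width):
    """Return {bin_start: frequency} for equal-width bins, including empty bins."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width  # left edge of the value's bin
        counts[start] = counts.get(start, 0) + 1
    lowest, highest = min(counts), max(counts)
    # Walk every bin between the extremes so that gaps show up as zeros.
    return {start: counts.get(start, 0)
            for start in range(lowest, highest + bin_width, bin_width)}


retweet_counts = [0, 1, 1, 3, 7, 12, 15, 18, 22, 41]  # made-up data
hist = histogram(retweet_counts, bin_width=10)

# Crude text rendering: one row per bin, one '#' per value in the bin.
for start, frequency in hist.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * frequency}")
```

Each x-axis entry here is a range such as 0-9 rather than a category, which is the distinction from a bar chart that the slide draws.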
  32. Plotting with IPython Notebook
  33. Example: Histogram of Retweets
  34. Social Media Analysis Framework A memorable four step process to guide data science experiments: Aspire To test a hypothesis (answer a question) Acquire Get the data Analyze Count things Summarize Plot the results
  35. Exercises Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook Fill in Example 1-1 with credentials and begin work Execute each example sequentially Customize queries Explore tweet metadata; count tweet entities; plot histograms of results Explore the "Chapter 9 (Twitter Cookbook)" notebook Think of it as a collection of building blocks
  36. Module 3: Mining Facebook
  37. Objectives Be able to identify Facebook primitives Learn about Facebook’s Social Graph API and how to make API requests Understand how Open Graph protocol extends Facebook's Social Graph API Be able to analyze likes from Facebook pages and friends
  38. Facebook Primitives Account Types: People & Pages Mutual Connections Likes Shares Comments Extensive Privacy Controls
  39. API Requests Social Graph API requests Not RESTful but easy to learn and use Special "field expansion" syntax Example: GET fields=id,name,friends.fields(likes.limit(10)) JSON responses Traditional pagination
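
The "field expansion" syntax on this slide nests one field's options inside another, which is easy to get wrong by hand. Below is a minimal sketch that assembles the slide's example string programmatically. `expand` is a hypothetical helper, not part of any Facebook SDK, and the sketch only builds the `fields` query parameter; it does not assume an endpoint URL or make a request.

```python
from urllib.parse import urlencode


def expand(field, subfields=(), limit=None):
    """Render one field in field-expansion syntax, e.g. likes.limit(10)."""
    rendered = field
    if subfields:
        rendered += ".fields(" + ",".join(subfields) + ")"
    if limit is not None:
        rendered += ".limit(" + str(limit) + ")"
    return rendered


# Rebuild the example from the slide, innermost field first.
likes = expand("likes", limit=10)
friends = expand("friends", subfields=[likes])
fields = ",".join(["id", "name", friends])
print(fields)

# Keep commas and parentheses readable when encoding the query string.
query = urlencode({"fields": fields}, safe=",()")
print(query)
```

Building the string from the inside out mirrors how the nested request reads: ten likes for each friend, alongside the `id` and `name` of the starting node.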
  40. Facebook is an Interest Graph Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina
  41. Facebook API Explorer Go to Really, go there right now...
  42. Retrieve Your Likes
  43. Facebook Permissions
  44. Facebook Permissions
  45. Explore Facebook Pages Names of pages MiningTheSocialWeb CrossFit OReilly Web URLs (OGP extensions to Facebook's Social Graph)
  46. Social Media Analysis Framework Recall the same four step process to guide data science experiments: Aspire Acquire Analyze Summarize
  47. Embedded Visualizations with IPython NB
  48. Social Network Diagram with D3
  49. Exercises Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook Paste the value and execute the cell just before Example 2-1 Execute examples sequentially (try to at least make it to Example 2-10) Analyze your likes, your friends and likes from pages of interest If you have time... Remaining examples
  50. Module 4: Mining LinkedIn
  51. Objectives Learn about LinkedIn’s Developer Platform Understand how clustering works A fundamental type of machine learning Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location Visualize geographic data with cartograms
  52. LinkedIn Primitives Account Types: People, Companies The data seems "more closely held" than Facebook or Twitter No FOAF visibility Richest data source Profile descriptions from mutual connections A little messier than it first appears Not necessarily a bad thing
  53. API Requests (Weirdly) RESTful Requests Not really RESTful Field selector syntax,last-name,headline,picture-url) XML responses CSV address book download
  54. Is LinkedIn an Interest Graph? Fundamentally: yes. But not so much at the developer API level Less trivial to find some of the "pivots" No Skills API (yet) But the data is there (mostly in profile descriptions) for your direct connections Companies, job titles, job descriptions Lots of richness is tucked away in human language data
  55. Clustering An unsupervised machine learning technique Think: an algorithm that organizes the data into partitions
  56. Example: Clustered Job Titles
  57. 3 Steps to Clustering Your Data Normalization Compare (similarity/distance measurement) n-grams, edit distance, and Jaccard are common, but your imagination is the limit Why can't you just compare everything to everything? Dimensionality Reduction Ideally, your clustering algorithm will mitigate the pain k-means is among the most common clustering techniques in use
  58. Jaccard Similarity
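
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. Here is a minimal sketch applied to job titles; the `tokens` normalizer and the sample titles are illustrative assumptions, not the workshop's own code or data.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def tokens(title):
    """Very light normalization: lowercase, drop commas, split on whitespace."""
    return title.lower().replace(",", " ").split()


senior = tokens("Senior Software Engineer")
plain = tokens("Software Engineer")
cto = tokens("Chief Technology Officer")

print(jaccard(senior, plain))  # two shared words out of three distinct words
print(jaccard(senior, cto))    # no shared words
```

A score near 1 suggests two titles belong in the same cluster, and a score of 0 means they share no tokens at all.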
  59. k-Means Explained 1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk. 2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons. 3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.) 4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
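
The four steps above can be sketched in plain Python. This is a minimal sketch rather than a production implementation: it assumes points are tuples of numbers, it draws the starting centroids from the data points themselves (one common variation on "pick k points in the data space"), and it accepts optional `initial_centroids` so that a run can be reproduced. The function and parameter names are illustrative.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance; enough for nearest-centroid comparisons."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, initial_centroids=None, max_iterations=100, seed=None):
    """Return (centroids, assignments) following the four steps on the slide."""
    rng = random.Random(seed)
    # Step 1: pick k starting centroids (here: k distinct data points).
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each of the n points to its nearest centroid (k*n comparisons).
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Made-up data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2, seed=42)
print(centroids, labels)
```

Because the starting centroids are random, which cluster ends up labeled 0 or 1 can differ between seeds; only the grouping is meaningful.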
  60. k-Means: Initialize
  61. k-Means: Step 1
  62. k-Means: Step 2
  63. k-Means: Step 3
  64. k-Means: (Fast-Forward) Step 9
  65. Geocoding Transforming a location to a set of coordinates Nashville, TN => (36.16783905029297, -86.77816009521484) A harder problem than it first appears The Bing API is especially generous Requires an account sign up: Use the API key with the geopy package
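
A minimal sketch of the geocoding step follows. It assumes the third-party geopy package, whose `geopy.geocoders.Bing` class takes an API key and whose `geocode` method returns an object with `latitude` and `longitude` attributes, or `None` when nothing matches. `geocode_all`, `make_bing_geocoder`, and `fake_geocode` are hypothetical names; the fake geocoder lets the block run without a key or network access, and its one known location reuses the coordinates from the slide.

```python
from collections import namedtuple


def geocode_all(locations, geocode):
    """Map each distinct location string to a (lat, lon) tuple, or None if unresolved."""
    results = {}
    for text in locations:
        key = text.strip()
        if key in results:
            continue  # do not spend an API call on a repeated location
        match = geocode(key)
        results[key] = (match.latitude, match.longitude) if match else None
    return results


def make_bing_geocoder(api_key):
    """Return geopy's Bing geocode function (needs geopy and a Bing Maps key)."""
    from geopy.geocoders import Bing  # imported lazily; third-party dependency
    return Bing(api_key=api_key).geocode


# Stand-in geocoder so that the sketch runs offline.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(text):
    known = {"Nashville, TN": FakeLocation(36.16783905029297, -86.77816009521484)}
    return known.get(text)


coords = geocode_all(["Nashville, TN", "Nashville, TN ", "Nowhere"], fake_geocode)
print(coords)
```

With a real key, passing `make_bing_geocoder("YOUR_KEY")` in place of `fake_geocode` would send the same lookups to the Bing service.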
  66. Cartograms
  67. Unless you use a Dorling Cartogram
  68. Social Media Analysis Framework Remember: Use the same four step process to guide data science experiments: Aspire Acquire Analyze Summarize
  69. Exercises Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples Download your connections as a CSV file from export-settings and save them to your VM A deviation from instructions in Example 3-6 is necessary for remote VMs See Create a Bing Maps portal account and get your API key for Examples 3-8 and beyond Try clustering your contacts in Example 3-12 Try Example 3-13 (visualizing data in Google Earth) at home...
  70. Social Media Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+ 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate)
  71. Module 5: Open Hack
  72. Objectives To work on "loose ends" or areas of interest from previous modules To hack on code in notebooks not yet encountered To set up the virtual machine on your own box if you haven't yet To collaborate/talk and otherwise make the most of our togetherness
  73. Social Media Analysis Framework Remember: Aspire Acquire Analyze Summarize
  74. Recommendations Set up your own development environment if you haven't already Appendix A Text Mining & Natural Language Processing Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages) Graph Mining Chapter 7 (Mining GitHub) Analyzing Semantic Markup Chapter 8 (Mining the Semantically Marked-Up Web)
  75. Final Q&A; Wrap Up
  76. Free Stuff Mining the Social Web 2E Chapter 1 (Chimera) Source Code (GitHub) (numbered examples) Screencasts (Vimeo)