Mining Social Web APIs with IPython Notebook (Strata 2013)

Matthew Russell, Chief Technology Officer at Digital Reasoning Systems
1

Mining Social Web APIs
with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
New York City - 28 October 2013
2

Intro
3

Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
4

Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project
5

The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
6

Overview
Intro (5 mins)
Module 1 - Virtual Machine Setup (10 mins)
Module 2 - Mining Twitter (40 mins)
Module 3 - Mining Facebook (35 mins)
BREAK (30 mins)
Module 4 - Mining LinkedIn (40 mins)
Module 5 - Open Hack (40 mins)
Final Q&A; Wrap Up (10 mins)
7

Module Format
~10-15 minutes of exposition
I talk; you listen

~25-30 minutes of independent (or collaborative) work
You hack while I walk around and help you

~5 minutes of Q&A
You ask; I try to answer
8

Workshop Objective

To send you away as a social web hacker
Broad working knowledge of popular social web APIs
Hands-on experience hacking on social web data with a common toolkit

Not to listen to me talk to you for 3 hours
9

Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs a preso)
Wider than it is deep
There's only so much you can do in a few hours

I'm available 24/7 this week (and beyond) to help you be successful
10

Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers

Or you want to learn to do these things now...
And you're a quick learner
11

Module 1: Virtual Machine Setup
12

Why do you need a VM?
To save time
Because installation and configuration management is harder than it first
appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
system
13

But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
be compiled
Which requires specific versions of developer libraries to be installed

You get the idea...
14

The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
Virtualbox, VMWare, AWS, ...

IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
15

What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a Virtualbox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
16

Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step

Because it's great for collaboration
Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad
Think of it as "executable paper"
17
18
19

VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!

Basically:
Install Virtualbox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
20

What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers

Go to <the URL provided in the session> and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
I'll verbally provide the connection details (port and password)
21

A Hosted Virtual Machine
Yes, please.
Is it free?
Perhaps...
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later

Stand by for the step-by-step instructions on how to do it
I'll publish a post on it in the next day or so
22
23

Module 2: Mining Twitter
24

Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
25

Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
26

API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters
JSON responses
Cursors (not quite pagination)
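
The request shape and cursor handling on this slide can be sketched in Python. This is a minimal sketch, not the workshop's notebook code: build_url, iterate_cursored, and fake_fetch are hypothetical names, and real requests need OAuth credentials, which are left out here. The cursor convention shown (start at -1, stop when next_cursor is 0) is how Twitter's v1.1 cursored endpoints behave.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Build the URL for a GET request against a v1.1 resource."""
    return "{}/{}.json?{}".format(API_ROOT, resource, urlencode(params))


def iterate_cursored(fetch, key):
    """Yield items from a cursored endpoint.

    `fetch` is any callable that takes a cursor value and returns the
    decoded JSON response as a dict (hypothetical; inject your own).
    """
    cursor = -1  # -1 asks for the first batch of results
    while cursor != 0:  # a next_cursor of 0 means nothing is left
        response = fetch(cursor)
        for item in response[key]:
            yield item
        cursor = response["next_cursor"]


# Stand-in for a real authenticated request, for illustration only.
_FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 42},
    42: {"ids": [4, 5], "next_cursor": 0},
}


def fake_fetch(cursor):
    return _FAKE_PAGES[cursor]


timeline_url = build_url("statuses/user_timeline", screen_name="SocialWebMining")
follower_ids = list(iterate_cursored(fake_fetch, "ids"))
print(timeline_url)
print(follower_ids)
```

Passing the fetch function in keeps the loop testable without network access; in the notebooks, an authenticated client would play that role.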
27

Twitter is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, Nina]
28

What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
29

What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations

(financial) symbols
stock tickers

media
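
Pulling entities out of a tweet might look like the sketch below. It assumes the tweet has already been decoded from JSON into a dict; sample_tweet is hand-written for illustration (not real data) and extract_entities is a hypothetical helper name. The field names (user_mentions, hashtags, urls, symbols, media) follow the v1.1 "entities" structure.

```python
# A hand-written stand-in for one decoded tweet (not real data).
sample_tweet = {
    "text": "Reading #MTSW with @ptwobrussell http://t.co/abc $AAPL",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "MTSW"}],
        "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "AAPL"}],
    },
}


def extract_entities(tweet):
    """Return the entities of a tweet as plain lists of strings."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # "media" only appears when something is attached to the tweet.
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


extracted = extract_entities(sample_tweet)
print(extracted)
```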
30

Data Mining Is...

Counting
Comparing
Filtering
Ranking
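
Those four verbs map almost directly onto Python's collections.Counter. A minimal sketch, using a made-up list of hashtags rather than data pulled from the API:

```python
from collections import Counter

# Made-up hashtags, as they might come out of a batch of tweets.
hashtags = ["Strata", "python", "strata", "BigData", "Python", "strata"]

# Counting: tally each hashtag, ignoring case.
counts = Counter(tag.lower() for tag in hashtags)

# Comparing: which of two hashtags shows up more often?
strata_beats_python = counts["strata"] > counts["python"]

# Filtering: keep only hashtags seen more than once.
frequent = {tag: n for tag, n in counts.items() if n > 1}

# Ranking: the two most common hashtags, most frequent first.
ranked = counts.most_common(2)

print(counts, strata_beats_python, frequent, ranked)
```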
31

Histograms

A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
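
To make the binning concrete, here is a small sketch that groups values into fixed-width ranges and prints a text histogram. The retweet counts are made up, and histogram is a hypothetical helper written for illustration; in the notebook you would more likely hand the raw values to a plotting library such as matplotlib and let it do the binning.

```python
BIN_WIDTH = 10


def histogram(values, bin_width):
    """Count how many values fall into each [start, start + bin_width) range."""
    bins = {}
    for value in values:
        start = (value // bin_width) * bin_width
        bins[start] = bins.get(start, 0) + 1
    return dict(sorted(bins.items()))


# Made-up retweet counts for nine tweets.
retweet_counts = [0, 3, 7, 12, 15, 18, 41, 44, 95]

bins = histogram(retweet_counts, BIN_WIDTH)
for start, frequency in bins.items():
    # x-axis: the range of values; y-axis: how many values landed in it
    print("{:>3}-{:<3} {}".format(start, start + BIN_WIDTH - 1, "#" * frequency))
```

Note that this sketch simply omits empty bins; a plotted histogram would show them as gaps.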
32

Plotting with IPython Notebook
33

Example: Histogram of Retweets
34

Social Media Analysis Framework
A memorable four step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)

Acquire
Get the data

Analyze
Count things

Summarize
Plot the results
35

Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
36

Module 3: Mining Facebook
37

Objectives

Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how the Open Graph protocol extends Facebook's Social Graph API

Be able to analyze likes from Facebook pages and friends
38

Facebook Primitives

Account Types: People & Pages
Mutual Connections
Likes
Shares
Comments
Extensive Privacy Controls
39

API Requests
Social Graph API requests
Not RESTful but easy to learn and use
Special "field expansion" syntax
Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10))

JSON responses
Traditional pagination
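
A sketch of the field expansion request and of walking paginated results. The names build_graph_url, iterate_pages, and fake_fetch are hypothetical, the page contents are made up, and a real request would also need an access token, which is omitted. The sketch assumes list responses carry their items under "data" and a link to the next page under "paging" and "next", which is how Graph API list responses are generally shaped.

```python
from urllib.parse import urlencode

GRAPH_ROOT = "http://graph.facebook.com"


def build_graph_url(object_id, fields):
    """Build a Graph API URL, leaving the field expansion syntax readable."""
    query = urlencode({"fields": fields}, safe=",.()")
    return "{}/{}/?{}".format(GRAPH_ROOT, object_id, query)


def iterate_pages(fetch, first_url):
    """Yield items page by page, following each response's "next" link.

    `fetch` is any callable that takes a URL and returns the decoded
    JSON response as a dict (hypothetical; inject your own).
    """
    url = first_url
    while url:
        response = fetch(url)
        for item in response.get("data", []):
            yield item
        url = response.get("paging", {}).get("next")


# Stand-in for real HTTP requests, for illustration only.
_FAKE_RESPONSES = {
    "page-1": {"data": [{"name": "CrossFit"}], "paging": {"next": "page-2"}},
    "page-2": {"data": [{"name": "OReilly"}]},
}


def fake_fetch(url):
    return _FAKE_RESPONSES[url]


profile_url = build_graph_url(
    "ptwobrussell", "id,name,friends.fields(likes.limit(10))"
)
like_names = [like["name"] for like in iterate_pages(fake_fetch, "page-1")]
print(profile_url)
print(like_names)
```

Unlike Twitter's cursors, each response here hands back a complete URL for the next page, so the loop never has to rebuild the request itself.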
40

Facebook is an Interest Graph
[Figure: graph diagram with nodes labeled Johnny Araya, Roberto, Mercedes, Rodolfo Hernández, Ana, Jorge, Nina]
41

Facebook API Explorer

Go to https://developers.facebook.com/tools/explorer
Really, go there right now...
42

Retrieve Your Likes
43

Facebook Permissions
44

Facebook Permissions
45

Explore Facebook Pages
Names of pages
MiningTheSocialWeb
CrossFit
OReilly

Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500
46

Social Media Analysis Framework

Recall the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
47

Embedded Visualizations with IPython NB
48

Social Network Diagram with D3
49

Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2
(Mining Facebook)" notebook
Paste the value and execute the cell just before Example 2-1
Execute examples sequentially (try to at least make it to Example 2-10)
Analyze your likes, your friends and likes from pages of interest
If you have time...
Remaining examples
50

Module 4: Mining LinkedIn
51

Objectives
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning

Be able to employ geocoding services to arrive at a set of coordinates
from a textual reference to a location
Visualize geographic data with cartograms
52

LinkedIn Primitives
Account Types: People, Companies
The data seems "more closely held" than Facebook or Twitter
No FOAF visibility
Richest data source
Profile descriptions from mutual connections
A little messier than it first appears
Not necessarily a bad thing
53

API Requests

(Weirdly) RESTful Requests
Not really RESTful
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)

XML responses
CSV address book download
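
Once the address book is downloaded, Python's csv module can load it. The sketch below counts connections per company; the inline sample data is made up, count_column is a hypothetical helper, and the column names ("Company", "Job Title") are assumptions about the export format, so check the header row of your own file.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for a downloaded connections file.
SAMPLE_CSV = """First Name,Last Name,Company,Job Title
Ada,Example,Acme Inc.,Chief Technology Officer
Bob,Example,Acme Inc.,Software Engineer
Cy,Example,Initech,CTO
"""


def count_column(csv_file, column):
    """Count the non-empty values found in one column of a CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column].strip() for row in reader if row.get(column))


company_counts = count_column(io.StringIO(SAMPLE_CSV), "Company")
print(company_counts.most_common())
```

With a real export, replace the io.StringIO(...) argument with an open file handle.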
54

Is LinkedIn an Interest Graph?
Fundamentally: yes. But not so much at the developer API level
Less trivial to find some of the "pivots"
No Skills API (yet)
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data
55

Clustering

An unsupervised machine learning technique
Think: an algorithm that organizes the data into partitions
56

Example: Clustered Job Titles
57

3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use
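
The first two steps can be sketched for job titles as follows: normalize each title into a set of words, then compare two sets with Jaccard similarity (the size of the intersection divided by the size of the union). The abbreviation table and function names are illustrative assumptions, not the book's code.

```python
# A tiny, made-up abbreviation table; a real one would be much longer.
ABBREVIATIONS = {
    "cto": "chief technology officer",
    "sr.": "senior",
    "sr": "senior",
}


def normalize(title):
    """Lowercase a job title, expand abbreviations, and return its words as a set."""
    words = title.lower().replace(",", " ").split()
    expanded = " ".join(ABBREVIATIONS.get(word, word) for word in words)
    return set(expanded.split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0  # a choice made for this sketch: two empty sets share nothing
    return len(a & b) / len(a | b)


same = jaccard(normalize("CTO"), normalize("Chief Technology Officer"))
close = jaccard(normalize("Sr. Software Engineer"), normalize("Software Engineer"))
different = jaccard(normalize("Software Engineer"), normalize("Sales Manager"))
print(same, close, different)
```

Comparing every title to every other title takes on the order of n squared comparisons, which is why a clustering algorithm that avoids doing so becomes important as the data grows.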
58

Jaccard Similarity
59

k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to
compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Ki, effectively
creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and
reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each
iteration of the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between
iterations. Generally speaking, relatively few iterations are required for convergence.
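
The four steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a production implementation: the points are made up, kmeans and distance are hypothetical names, and Step 1 samples its starting values from the data points themselves (a common variation) rather than from anywhere in the data space.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Step 1: pick k starting values (here: k of the data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons).
        new_assignments = [
            min(range(k), key=lambda i: distance(p, centroids[i])) for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its cluster.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Two made-up, well-separated groups of 2D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster ends up labeled 0 and which 1 depends on the random starting values, so compare cluster membership rather than label numbers.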
60

k-Means: Initialize
61

k-Means: Step 1
62

k-Means: Step 2
63

k-Means: Step 3
64

k-Means: (Fast-Forward) Step 9
65

Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
66

Cartograms
67

Unless you use a Dorling Cartogram
68

Social Media Analysis Framework

Remember: Use the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
69

Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API
connection and follow along with the first few examples
Download your connections as a CSV file from http://www.linkedin.com/people/export-settings
and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code

Create a Bing Maps portal account and get your API key for Examples 3-8 and
beyond
Try clustering your contacts in Example 3-12
Try Example 3-13 (visualizing data in Google Earth) at home...
70

Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
71

Module 5: Open Hack
72

Objectives

To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To set up the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness
73

Social Media Analysis Framework

Remember:
Aspire
Acquire
Analyze
Summarize
74

Recommendations
Set up your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
Analyzing Semantic Markup
Chapter 8 (Mining the Semantically Marked-Up Web)
75

Final Q&A; Wrap Up
76

Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts
Mining Social Web APIs with IPython Notebook (Strata 2013)

  • 1. 1 Mining Social Web APIs with IPython Notebook Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com New York City - 28 October 2013
  • 3. 3 Hello, My Name Is ... Matthew Background in Computer Science Data mining & machine learning CTO @ Digital Reasoning Systems Data mining; machine learning Author @ O'Reilly Media 5 published books on technology Principal @ Zaffra Selective boutique consulting
  • 4. 4 Transforming Curiosity Into Insight An open source software (OSS) project http://bit.ly/MiningTheSocialWeb2E A book http://bit.ly/135dHfs Accessible to (virtually) everyone Virtual machine with turn-key coding templates for data science experiments Think of the book as "premium" support for the OSS project
  • 5. 5 The Social Web Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+ 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate)
  • 6. 6 Overview Intro (5 mins) Module 1 - Virtual Machine Setup (10 mins) Module 2 - Mining Twitter (40 mins) Module 3 - Mining Facebook (35 mins) BREAK (30 mins) Module 4 - Mining LinkedIn (40 mins) Module 5 - Open Hack (40 mins) Final Q&A; Wrap Up (10 mins)
  • 7. 7 Module Format ~10-15 minutes of exposition I talk; you listen ~25-30 minutes of independent (or collaborative) work You hack while I walk around and help you ~5 minutes of Q&A You ask; I try to answer
  • 8. 8 Workshop Objective To send you away as a social web hacker Broad working knowledge popular social web APIs Hands-on experience hacking on social web data with a common toolkit Not to listen to me talk to you for 3 hours
  • 9. 9 Just a Few More Things This workshop is... An adaptation of Mining the Social Web, 2nd Edition More of a guided hacking session where you follow along (vs a preso) Wider than it is deeper There's only so much you can do in a few hours I'm available 24/7 this week (and beyond) to help you be successful
  • 10. 10 Assumptions At some point in your life, you have Programmed with Python Worked with JSON Made requests and processed responses to/from web servers Or you want to learn to do these things now... And you're a quick learner
  • 11. 11 Module 1: Virtual Machine Setup
  • 12. 12 Why do you need a VM? To save time Because installation and configuration management is harder than it first appears So that you can focus on the task at hand instead So that I can support you regardless of your hardware and operating system
  • 13. 13 But I can do all of that myself... True... If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages Including scientific computing tools that require underlying C/C++ code to be compiled Which requires specific versions of developer libraries to be installed You get the idea...
  • 14. 14 The Virtual Machine Experience Vagrant A nice abstraction around virtual machine providers One ring to rule them all Virtualbox, VMWare, AWS, ... IPython Notebook The easiest way to program with Python A better REPL (interpreter) Great for hacking
  • 15. 15 What happens when you vagrant up? Vagrant follows the instructions in your Vagrantfile Starts up a Virtualbox instance Uses Chef to provision it Installs OS patches/updates Installs MTSW software dependencies Starts IPython Notebook server on port 8888
  • 16. 16 Why Should I Use IPython Notebook? Because it's great for hacking And hacking is usually the first step Because it's great for collaboration Sharing/publishing results is trivial Because the UX is as easy as working in a notepad Think of it as "executable paper"
  • 17. 17
  • 18. 18
  • 19. 19 VM Quick Start Instructions Go to http://MiningTheSocialWeb.com/quick-start/ Follow the instructions And watch the screencasts! Basically: Install Virtualbox & Vagrant Run "vagrant up" in a terminal to start a guest VM Then, go to http://localhost:8888 on your host machine's web browser
  • 20. 20 What Could Be Easier? A hosted version of the VM! But only for a few hours during this workshop Because it costs money to run these servers Go to <the URL provided in the session> and pick a machine Do not share the URLs outside of this workshop! Please don't try to hack the machines I'll verbally provide the connection details (port and password)
  • 21. 21 A Hosted Virtual Machine Yes, please. Is it free? Perhaps... ...Sign-up for the AWS free tier at http://aws.amazon.com/free/ But not right now. Do it later Standby for the step-by-step instructions on how to do it I'll publish a post on it in the next day or so
  • 22. 22
  • 24. 24 Objectives Be able to identify Twitter primitives Understand tweet metadata and how to use it Learn how to extract entities such as user mentions, hashtags, and URLs from tweets Apply techniques for performing frequency analysis with Python Be able to plot histograms of Twitter data with IPython Notebook
  • 25. 25 Twitter Primitives Accounts Types: "Anything" "Following" Relationships Favorites Retweets Replies (Almost) No Privacy Controls
  • 26. 26 API Requests RESTful requests Everything is a "resource" You GET, PUT, POST, and DELETE resources Standard HTTP "verbs" Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json? screen_name=SocialWebMining Streaming API filters JSON responses Cursors (not quite pagination)
  • 27. 27 Twitter is an Interest Graph [Diagram; node labels: Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina]
  • 28. 28 What's in a Tweet? 140 Characters ... ... Plus ~5KB of metadata! Authorship Time & location Tweet "entities" Replying, retweeting, favoriting, etc.
  • 29. 29 What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters @usermentions #hashtags URLs multiple variations (financial) symbols stock tickers media
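As a sketch of why entities are the "easy to get at" data: in a v1.1 tweet, they arrive pre-parsed under the `entities` key of the JSON, so no text parsing is needed. The function name `extract_entities` and the sample tweet below are made up for illustration; only the key names (`user_mentions`, `hashtags`, `urls`, `symbols`) follow the tweet JSON layout.

```python
def extract_entities(tweet):
    """Collect screen names, hashtags, URLs, and symbols from one tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
    }


# Hand-written sample that mimics the shape of a tweet (not real API output).
sample_tweet = {
    "text": "Hacking with @ptwobrussell at #strataconf http://t.co/abc $AAPL",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "strataconf"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "AAPL"}],
    },
}

print(extract_entities(sample_tweet))
```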
  • 31. 31 Histograms A chart that is handy for frequency analysis They look like bar charts...except they're not bar charts Each value on the x-axis is a range (or "bin") of values Not categorical data Each value on the y-axis is the combined frequency of values in each range
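To make the binning idea concrete, here is a small sketch in plain Python. The `histogram` helper and the retweet counts are invented for illustration; in the notebooks you would more likely hand the values to a plotting library, which does the binning for you.

```python
from collections import Counter


def histogram(values, bin_width):
    """Group numeric values into fixed-width bins.

    Returns a sorted list of (bin_start, frequency) pairs. Bins that
    contain no values are simply absent from the result.
    """
    bins = Counter((value // bin_width) * bin_width for value in values)
    return sorted(bins.items())


# Made-up retweet counts for a handful of tweets.
retweet_counts = [0, 1, 3, 7, 12, 15, 18, 22, 41]

bin_width = 10
for start, frequency in histogram(retweet_counts, bin_width):
    # One row per bin: the range it covers and a bar for its frequency.
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * frequency}")
```

Each row covers a range of values rather than a single category, which is what separates a histogram from a bar chart.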
  • 34. 34 Social Media Analysis Framework A memorable four step process to guide data science experiments: Aspire To test a hypothesis (answer a question) Acquire Get the data Analyze Count things Summarize Plot the results
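The "Analyze: count things" step often needs nothing more than `collections.Counter` from the standard library. A minimal sketch, with a made-up list of hashtags standing in for the data from the "Acquire" step:

```python
from collections import Counter

# Stand-in for hashtags acquired from a batch of tweets (invented data).
hashtags = ["python", "strata", "python", "data", "strata", "python"]

# Analyze: count how often each hashtag occurs.
counts = Counter(hashtags)

# Summarize: report the most frequent items first.
top_two = counts.most_common(2)
for tag, frequency in top_two:
    print(f"{tag:<10} {frequency}")
```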
  • 35. 35 Exercises Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook Fill in Example 1-1 with credentials and begin work Execute each example sequentially Customize queries Explore tweet metadata; count tweet entities; plot histograms of results Explore the "Chapter 9 (Twitter Cookbook)" notebook Think of it as a collection of building blocks
  • 37. 37 Objectives Be able to identify Facebook primitives Learn about Facebook’s Social Graph API and how to make API requests Understand how Open Graph protocol extends Facebook's Social Graph API Be able to analyze likes from Facebook pages and friends
  • 38. 38 Facebook Primitives Account Types: People & Pages Mutual Connections Likes Shares Comments Extensive Privacy Controls
  • 39. 39 API Requests Social Graph API requests Not RESTful but easy to learn and use Special "field expansion" syntax Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10)) JSON responses Traditional pagination
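The field expansion syntax nests one field selection inside another, which is easier to see when the query string is built piece by piece. This is only a sketch: the `expand` helper is hypothetical, and it covers just the `.fields(...)` and `.limit(...)` modifiers that appear in the slide's example. A real request also needs an access token.

```python
def expand(field, subfields=None, limit=None):
    """Render one field in field-expansion syntax.

    `subfields` is a list of already-rendered fields to nest inside
    `field`; `limit` caps how many items are returned for `field`.
    """
    rendered = field
    if subfields:
        rendered += ".fields(" + ",".join(subfields) + ")"
    if limit is not None:
        rendered += ".limit(%d)" % limit
    return rendered


likes = expand("likes", limit=10)                # likes.limit(10)
friends = expand("friends", subfields=[likes])   # friends.fields(likes.limit(10))
fields = ",".join(["id", "name", friends])

url = "http://graph.facebook.com/ptwobrussell/?fields=" + fields
print(url)
```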
  • 40. 40 Facebook is an Interest Graph [Diagram; node labels: Johnny Araya Roberto Mercedes Rodolfo Hernández Ana Jorge Nina]
  • 41. 41 Facebook API Explorer Go to https://developers.facebook.com/tools/explorer Really, go there right now...
  • 45. 45 Explore Facebook Pages Names of pages MiningTheSocialWeb CrossFit OReilly Web URLs (OGP extensions to Facebook's Social Graph) http://www.imdb.com/title/tt0117500
  • 46. 46 Social Media Analysis Framework Recall the same four step process to guide data science experiments: Aspire Acquire Analyze Summarize
  • 49. 49 Exercises Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook Paste the value and execute the cell just before Example 2-1 Execute examples sequentially (try to at least make it to Example 2-10) Analyze your likes, your friends and likes from pages of interest If you have time... Remaining examples
  • 51. 51 Objectives Learn about LinkedIn’s Developer Platform Understand how clustering works A fundamental type of machine learning Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location Visualize geographic data with cartograms
  • 52. 52 LinkedIn Primitives Account Types: People, Companies The data seems "more closely held" than Facebook or Twitter No FOAF visibility Richest data source Profile descriptions from mutual connections A little messier than it first appears Not necessarily a bad thing
  • 53. 53 API Requests (Weirdly) RESTful Requests Not really RESTful Field selector syntax http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url) XML responses CSV address book download
  • 54. 54 Is LinkedIn an Interest Graph? Fundamentally: yes. But not so much at the developer API level Less trivial to find some of the "pivots" No Skills API (yet) But the data is there (mostly in profile descriptions) for your direct connections Companies, job titles, job descriptions Lots of richness is tucked away in human language data
  • 55. 55 Clustering An unsupervised machine learning technique Think: an algorithm that organizes the data into partitions
  • 57. 57 3 Steps to Clustering Your Data Normalization Compare (similarity/distance measurement) n-grams, edit distance, and Jaccard are common, but your imagination is the limit Why can't you just compare everything to everything? Dimensionality Reduction Ideally, your clustering algorithm will mitigate the pain k-means is among the most common clustering techniques in use
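As an illustration of the first two steps (normalize, then compare), here is a sketch of Jaccard similarity applied to job titles. The `normalize` rules are a deliberately crude assumption (lowercase, drop periods, split on whitespace) and the titles are invented; real data usually needs more careful cleanup, such as expanding abbreviations.

```python
def normalize(title):
    """Reduce a job title to a list of lowercase tokens."""
    return title.lower().replace(".", "").split()


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Treat two empty sets as identical.
    return len(a & b) / len(a | b)


cto = normalize("Chief Technology Officer")
ceo = normalize("Chief Executive Officer")

# Two shared tokens out of four distinct tokens overall.
similarity = jaccard(cto, ceo)
print(similarity)
```

Comparing every title with every other title takes on the order of n² comparisons, which is why the slide asks why you can't just compare everything to everything.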
  • 59. 59 k-Means Explained 1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk. 2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons. 3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.) 4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
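The four steps can be sketched in a few lines of plain Python (3.8+ for `math.dist`). Two simplifying assumptions: the initial values are sampled from the data points themselves rather than from anywhere in the data space, and a cluster that ends up empty keeps its previous centroid. The function name, the seed, and the sample points are all invented for illustration.

```python
import math
import random


def kmeans(points, k, seed=0, max_iterations=100):
    """Cluster `points` (tuples of numbers) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: pick k initial values (here: k distinct data points).
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda i: math.dist(point, centroids[i]))
            for point in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 3: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, assignments


# Two visibly separate groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Which group gets label 0 depends on the random initial pick, so compare labels with each other rather than against fixed numbers.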
  • 65. 65 Geocoding Transforming a location to a set of coordinates Nashville, TN => (36.16783905029297, -86.77816009521484) A harder problem than it first appears The Bing API is especially generous Requires an account sign up: http://bingmapsportal.com Use the API key with the geopy package
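A sketch of how a batch of location strings could be geocoded. The helper `geocode_all` is hypothetical; it accepts any object with a `geocode(query)` method that returns something with `latitude` and `longitude` attributes, or `None` when nothing matches, which is how geopy geocoders behave. The API key in the usage comment is a placeholder.

```python
def geocode_all(geocoder, place_names):
    """Map each place name to a (latitude, longitude) tuple.

    Names the geocoder cannot resolve map to None, since textual
    references to locations are often ambiguous or incomplete.
    """
    coordinates = {}
    for name in place_names:
        location = geocoder.geocode(name)
        if location is None:
            coordinates[name] = None
        else:
            coordinates[name] = (location.latitude, location.longitude)
    return coordinates


# Assumed usage with geopy installed and your own Bing Maps key:
#   from geopy.geocoders import Bing
#   coords = geocode_all(Bing(api_key="YOUR_BING_MAPS_KEY"), ["Nashville, TN"])
```

Passing the geocoder in as an argument keeps the helper independent of any one service, so a different geopy geocoder could be swapped in.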
  • 67. 67 Unless you use a Dorling Cartogram
  • 68. 68 Social Media Analysis Framework Remember: Use the same four step process to guide data science experiments: Aspire Acquire Analyze Summarize
  • 69. 69 Exercises Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM A deviation from instructions in Example 3-6 is necessary for remote VMs See http://bit.ly/mtsw-ch03-helper-code Create a Bing Maps portal account and get your API key for Examples 3-8 and beyond Try clustering your contacts in Example 3-12 Try Example 3-13 (visualizing data in Google Earth) at home...
  • 70. 70 Social Media Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+ 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate)
  • 72. 72 Objectives To work on "loose ends" or areas of interest from previous modules To hack on code in notebooks not yet encountered To set up the virtual machine on your own box if you haven't yet To collaborate/talk and otherwise make the most of our togetherness
  • 73. 73 Social Media Analysis Framework Remember: Aspire Acquire Analyze Summarize
  • 74. 74 Recommendations Set up your own development environment if you haven't already Appendix A Text Mining & Natural Language Processing Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages) Graph Mining Chapter 7 (Mining GitHub) Analyzing Semantic Markup Chapter 8 (Mining the Semantically Marked-Up Web)
  • 76. 76 Free Stuff http://MiningTheSocialWeb.com Mining the Social Web 2E Chapter 1 (Chimera) http://bit.ly/13XgNWR Source Code (GitHub) http://bit.ly/MiningTheSocialWeb2E http://bit.ly/1fVf5ej (numbered examples) Screencasts (Vimeo) http://bit.ly/mtsw2e-screencasts