A presentation to the Nashville Data Science Meetup that introduces Mining the Social Web as an Open Source Software project/book, its virtual machine experience, the codebase, and a brief primer on data mining with Twitter
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
Transforming Curiosity Into Insight
An open source software (OSS) project
A (rewritten) book
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
Table of Contents (1/2)
Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking
About, and More
Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and
Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human
Language, Summarize Blog Posts, and More
Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and
Table of Contents (2/2)
Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs,
Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing
over RDF, and More
Chapter 9 - Twitter Cookbook
Appendix A - Information About This Book's Virtual Machine Experience
Appendix B - OAuth Primer
Appendix C - Python and IPython Notebook Tips & Tricks
Anatomy of Each Chapter
Why do you need a VM?
To save time
Because installation and configuration management is harder than it first
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
Arguably, it's even a best practice for a dev environment
But I can do all of that myself...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
Which requires specific versions of developer libraries to be installed
You get the idea...
The Virtual Machine Experience
A nice abstraction around virtual machine providers
One ring to rule them all
VirtualBox, VMware, AWS, ...
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a VirtualBox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Install VirtualBox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 in your host machine's web browser
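If the browser shows nothing at http://localhost:8888, a quick check from the host can tell you whether anything is listening on that port yet. This is only a minimal sketch: the function name notebook_server_is_up is made up for illustration, and a successful TCP connection shows that some server is accepting connections, not that it is specifically IPython Notebook.

```python
import socket


def notebook_server_is_up(host="localhost", port=8888, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; the defaults mirror the
    port 8888 mentioned in the quick start steps.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling notebook_server_is_up() after "vagrant up" finishes should return True once provisioning is done; False usually means the guest is still starting or the port is not being forwarded.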
An (AWS) Hosted Virtual Machine
Is it free?
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later
See this blog post for some inspiration on how to easily build your own
AMI from Vagrant boxes
Virtual Machine and IPython
Demonstration of Virtual Machine
Your first "vagrant up"
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
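As a taste of the frequency analysis objective, here is a minimal sketch using collections.Counter from the Python standard library. The hashtag list is made-up sample data, not real Twitter output.

```python
from collections import Counter

# Made-up sample data standing in for hashtags pulled from tweets
hashtags = ["#python", "#data", "#Python", "#nashville", "#data", "#python"]

# Normalize case so "#Python" and "#python" count as the same item
counts = Counter(tag.lower() for tag in hashtags)

# The two most frequent items, as (item, frequency) pairs
top_two = counts.most_common(2)
print(top_two)
```

Counter is just a dictionary of item-to-frequency, so the same pattern works for words, screen names, or any other hashable value.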
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?
Streaming API filters
Cursors (not quite pagination)
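The resource and cursor ideas above can be sketched in a few lines. Treat this as an illustration under stated assumptions: resource_url and iterate_cursor are hypothetical helper names, fetch_page stands in for an authenticated HTTP call (real requests need OAuth), and the loop assumes the v1.1 cursoring convention of starting at -1 and stopping when next_cursor is 0. The pages are faked so the example runs offline.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def resource_url(resource, **params):
    """Build the URL for a GET on a resource such as 'statuses/user_timeline'."""
    url = f"{API_ROOT}/{resource}.json"
    return f"{url}?{urlencode(params)}" if params else url


def iterate_cursor(fetch_page, start_cursor=-1):
    """Yield ids from a cursored resource, one page at a time.

    fetch_page is a hypothetical callable: it takes a cursor and returns
    a dict shaped like {"ids": [...], "next_cursor": <int>}.
    """
    cursor = start_cursor
    while cursor != 0:  # assumption: a next_cursor of 0 means no more pages
        page = fetch_page(cursor)
        yield from page["ids"]
        cursor = page["next_cursor"]


# Faked pages keyed by cursor value (made-up data, no network access)
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 42},
    42: {"ids": [3], "next_cursor": 0},
}

all_ids = list(iterate_cursor(FAKE_PAGES.__getitem__))
print(resource_url("statuses/user_timeline", screen_name="example", count=5))
print(all_ids)
```

Unlike page numbers, a cursor is an opaque token handed back by the previous response, which is why the loop cannot jump ahead and has to walk the pages in order.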
Twitter is an Interest Graph
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Time & location
Replying, retweeting, favoriting, etc.
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
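Because entities arrive already parsed in the tweet's metadata, pulling them out is mostly dictionary access. A minimal sketch, assuming the v1.1 tweet JSON layout (an "entities" key holding "user_mentions", "hashtags", and "urls" lists); extract_entities is a hypothetical helper and the sample tweet is made up.

```python
def extract_entities(tweet):
    """Return screen names, hashtag texts, and expanded URLs from one tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


# Made-up tweet, trimmed to the fields this sketch reads
sample_tweet = {
    "text": "Talking #datascience with @example http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "example"}],
        "hashtags": [{"text": "datascience"}],
        "urls": [{"expanded_url": "http://example.com/slides"}],
    },
}

print(extract_entities(sample_tweet))
```

Using .get with empty defaults keeps the helper from failing on tweets that carry no entities at all.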
Data Mining Is...
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
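To make the binning idea concrete, here is a minimal text-only sketch that needs no plotting library. The histogram function is a hypothetical helper, the retweet counts are made-up sample data, and it assumes non-negative integer values and a non-empty input. In an IPython Notebook you would more likely hand the raw values to matplotlib's hist function and let it choose the bins.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's starting value to the number of values in that bin.

    Bins are fixed-width ranges [start, start + bin_width); empty bins
    between the lowest and highest are kept with a frequency of 0.
    """
    counts = Counter(v // bin_width for v in values)
    lowest, highest = min(counts), max(counts)
    return {i * bin_width: counts.get(i, 0) for i in range(lowest, highest + 1)}


# Made-up retweet counts for ten tweets
retweet_counts = [0, 1, 1, 3, 7, 12, 15, 18, 25, 40]
bins = histogram(retweet_counts, 10)

# x-axis: the range of each bin; y-axis: how many values fell into it
for start, freq in bins.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * freq}")
```

Keeping the empty 30-39 bin is what separates this from a bar chart of categories: the x-axis is a continuous scale, so gaps in the data show up as gaps in the plot.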
Social Media Analysis Framework
A memorable four step process to guide data science experiments:
To test a hypothesis (answer a question)
Get the data
Plot the results
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks