Mining the Social Web for Fun and Profit: A Getting Started Guide

A presentation to the Front Range PyData Meetup on how to get started with Mining the Social Web.

Transcript

  • 1. Mining the Social Web for Fun and Profit: A Getting Started Guide
      • Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
      • Front Range PyData Meetup - 21 May 2014
  • 2. Overview
      • Intro (5 mins)
      • Virtual Machine Experience (10 mins)
      • Virtual Machine and IPython Notebook Demonstration (10 mins)
      • Mining Twitter: A Primer (20 mins)
      • Wrap Up/Final Q&A (10 mins)
  • 3. Intro
  • 4. Hello, My Name Is ... Matthew
      • Background in Computer Science: Data mining & machine learning
      • CTO @ Digital Reasoning Systems: Data mining; machine learning
      • Author @ O'Reilly Media: 5 published books on technology
      • Principal @ Zaffra: Selective boutique consulting
  • 5. Transforming Curiosity Into Insight
      • An open source software (OSS) project: http://bit.ly/MiningTheSocialWeb2E
      • A (rewritten) book: http://bit.ly/135dHfs
      • Accessible to (virtually) everyone
      • Virtual machine with turn-key coding templates for data science experiments
      • Think of the book as "premium" support for the OSS project
  • 6. The Social Web Is All the Rage
      • World population: ~7B people
      • Facebook: 1.15B users
      • Twitter: 500M users
      • Google+: 343M users
      • LinkedIn: 238M users
      • ~200M+ blogs (conservative estimate)
  • 7. Table of Contents (1/2)
      • Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
      • Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
      • Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
      • Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More
      • Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
      • Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More
  • 8. Table of Contents (2/2)
      • Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
      • Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
      • Chapter 9 - Twitter Cookbook
      • Appendix A - Information About This Book's Virtual Machine Experience
      • Appendix B - OAuth Primer
      • Appendix C - Python and IPython Notebook Tips & Tricks
  • 9. Anatomy of Each Chapter
      • Brief Intro
      • Objectives
      • API Primer
      • Analysis Technique(s)
      • Data Visualization
      • Recap
      • Suggested Exercises
      • Recommended Resources
  • 10. The Virtual Machine Experience
  • 11. Why do you need a VM?
      • To save time
      • Because installation and configuration management is harder than it first appears
      • So that you can focus on the task at hand instead
      • So that I can support you regardless of your hardware and operating system
      • Arguably, it's even a best practice for a dev environment
  • 12. But I can do all of that myself...
      • True...
      • If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand
      • At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages
      • Including scientific computing tools that require underlying C/C++ code to be compiled
      • Which requires specific versions of developer libraries to be installed
      • You get the idea...
  • 13. The Virtual Machine Experience
      • Vagrant
          • A nice abstraction around virtual machine providers
          • One ring to rule them all
          • Virtualbox, VMWare, AWS, ...
      • IPython Notebook
          • The easiest way to program with Python
          • A better REPL (interpreter)
          • Great for hacking
  • 14. What happens when you vagrant up?
      • Vagrant follows the instructions in your Vagrantfile
      • Starts up a Virtualbox instance
      • Uses Chef to provision it
      • Installs OS patches/updates
      • Installs MTSW software dependencies
      • Starts IPython Notebook server on port 8888
  • 15. Why Should I Use IPython Notebook?
      • Because it's great for hacking
      • And hacking is usually the first step
      • Because it's great for collaboration
      • Sharing/publishing results is trivial
      • Because the UX is as easy as working in a notepad
      • Think of it as "executable paper"
  • 18. VM Quick Start Instructions
      • Go to http://MiningTheSocialWeb.com/quick-start/
      • Follow the instructions
      • And watch the screencasts!
      • Basically:
          • Install Virtualbox & Vagrant
          • Run "vagrant up" in a terminal to start a guest VM
          • Then, go to http://localhost:8888 on your host machine's web browser
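A quick way to confirm that the last quick-start step is likely to work is to check whether anything is listening on port 8888 of the host before opening the browser. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical and is not part of the Mining the Social Web project, and a successful connection only shows that some server is on that port, not that it is the IPython Notebook server.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:8888")
    else:
        print("Nothing on localhost:8888 yet; the VM may still be provisioning")
```

If the check fails right after "vagrant up" returns, waiting a little and retrying is usually more informative than restarting the VM.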
  • 19. An (AWS) Hosted Virtual Machine
      • Is it free? Perhaps...
      • ...Sign-up for the AWS free tier at http://aws.amazon.com/free/
      • But not right now. Do it later
      • See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes: http://wp.me/p3QiJd-3T
  • 20. Virtual Machine and IPython Notebook Demonstration
  • 21. Demonstration of Virtual Machine
      • http://nbviewer.ipython.org
      • http://MiningTheSocialWeb.com/quick-start/
      • Your first "vagrant up"
  • 22. Mining Twitter: A Primer
  • 23. Objectives
      • Be able to identify Twitter primitives
      • Understand tweet metadata and how to use it
      • Learn how to extract entities such as user mentions, hashtags, and URLs from tweets
      • Apply techniques for performing frequency analysis with Python
      • Be able to plot histograms of Twitter data with IPython Notebook
  • 24. Twitter Primitives
      • Accounts
      • Types: "Anything"
      • "Following" Relationships
      • Favorites
      • Retweets
      • Replies
      • (Almost) No Privacy Controls
  • 25. API Requests
      • RESTful requests
      • Everything is a "resource"
      • You GET, PUT, POST, and DELETE resources
      • Standard HTTP "verbs"
      • Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
      • Streaming API filters
      • JSON responses
      • Cursors (not quite pagination)
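The example request and the cursor idea on slide 25 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's code: `build_url` and `iterate_cursor` are hypothetical helpers, no network call is made, and a real v1.1 request would also need OAuth credentials (the topic of Appendix B). The cursor loop assumes the convention that cursoring starts at -1 and stops when the response's `next_cursor` is 0; the pages below are made up.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Build a GET URL for a v1.1 resource such as 'statuses/user_timeline'."""
    url = f"{API_ROOT}/{resource}.json"
    if params:
        # Sort parameters so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    return url


def iterate_cursor(fetch_page, key):
    """Yield items from a cursored resource.

    fetch_page is a hypothetical callable: it takes a cursor value and returns
    a dict holding a list under `key` plus a 'next_cursor' field.
    """
    cursor = -1  # assumed starting cursor
    while cursor != 0:  # a next_cursor of 0 means there are no more pages
        page = fetch_page(cursor)
        yield from page[key]
        cursor = page["next_cursor"]


if __name__ == "__main__":
    print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))

    # Made-up pages standing in for real API responses.
    fake_pages = {
        -1: {"ids": [1, 2], "next_cursor": 42},
        42: {"ids": [3], "next_cursor": 0},
    }
    print(list(iterate_cursor(fake_pages.__getitem__, "ids")))
```

Passing the page-fetching function in as an argument keeps the loop testable without credentials; in practice it would wrap an authenticated HTTP call.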
  • 26. Twitter is an Interest Graph
      • Figure node labels: Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández
  • 27. What's in a Tweet?
      • 140 Characters ...
      • ... Plus ~5KB of metadata!
      • Authorship
      • Time & location
      • Tweet "entities"
      • Replying, retweeting, favoriting, etc.
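To make the metadata on slide 27 concrete, here is a small sketch that reads authorship, time, location, and reply/retweet information from a tweet represented as a Python dict. The sample tweet is hypothetical and heavily trimmed; the field names follow the v1.1 JSON layout as far as this sketch assumes it, and `summarize_metadata` is an illustrative helper rather than anything from the book.

```python
from datetime import datetime

# Hypothetical, trimmed-down tweet; a real one carries far more fields.
sample_tweet = {
    "text": "Slides from tonight's talk #PyData",
    "created_at": "Wed May 21 18:00:00 +0000 2014",
    "user": {"screen_name": "example_user", "followers_count": 1200},
    "coordinates": None,
    "retweet_count": 3,
    "favorite_count": 5,
    "in_reply_to_screen_name": None,
}


def summarize_metadata(tweet):
    """Pick out authorship, time, location, and interaction metadata."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "author": tweet["user"]["screen_name"],
        "created": created,
        "has_location": tweet.get("coordinates") is not None,
        "retweets": tweet.get("retweet_count", 0),
        "is_reply": tweet.get("in_reply_to_screen_name") is not None,
    }


if __name__ == "__main__":
    for field, value in summarize_metadata(sample_tweet).items():
        print(f"{field}: {value}")
```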
  • 28. What are Tweet Entities?
      • Essentially, the "easy to get at" data in the 140 characters
      • @usermentions
      • #hashtags
      • URLs: multiple variations
      • (financial) symbols: stock tickers
      • media
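The entity types listed on slide 28 arrive already parsed under a tweet's `entities` field, so extracting them is mostly a matter of walking a dict. The sketch below assumes the v1.1 layout (`user_mentions`, `hashtags`, `urls`, `symbols`, and, only when media is attached, `media`); the sample tweet, its values, and the helper name `extract_entities` are all hypothetical.

```python
def extract_entities(tweet):
    """Return the tweet's entities as plain lists of strings."""
    entities = tweet.get("entities", {})
    return {
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # The media key is assumed to be absent when nothing is attached.
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Hypothetical tweet used only for illustration.
sample_tweet = {
    "text": "Thanks @example_user! Slides: http://t.co/abc123 #PyData $EXMPL",
    "entities": {
        "user_mentions": [{"screen_name": "example_user"}],
        "hashtags": [{"text": "PyData"}],
        "urls": [{"url": "http://t.co/abc123",
                  "expanded_url": "http://example.com/slides"}],
        "symbols": [{"text": "EXMPL"}],
    },
}

if __name__ == "__main__":
    for kind, values in extract_entities(sample_tweet).items():
        print(kind, values)
```

Using `.get(..., [])` throughout means a tweet with no entities at all still yields empty lists instead of raising `KeyError`.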
  • 29. Data Mining Is...
      • Counting
      • Comparing
      • Filtering
      • Ranking
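The four verbs on slide 29 map onto very little Python. The following sketch uses `collections.Counter` from the standard library on two made-up hashtag samples; the data and variable names are hypothetical and exist only to show one operation per verb.

```python
from collections import Counter

# Hypothetical hashtags collected from two different searches.
sample_a = ["python", "pydata", "python", "datascience", "pydata", "python"]
sample_b = ["python", "rstats", "datascience", "rstats"]

# Counting: how often does each hashtag appear?
counts_a = Counter(sample_a)
counts_b = Counter(sample_b)

# Comparing: which hashtags show up in both samples?
shared = sorted(set(counts_a) & set(counts_b))

# Filtering: keep only hashtags seen at least twice in sample A.
frequent = {tag: n for tag, n in counts_a.items() if n >= 2}

# Ranking: the two most common hashtags in sample A.
ranked = counts_a.most_common(2)

if __name__ == "__main__":
    print("shared:", shared)
    print("frequent:", frequent)
    print("ranked:", ranked)
```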
  • 30. Histograms
      • A chart that is handy for frequency analysis
      • They look like bar charts...except they're not bar charts
      • Each value on the x-axis is a range (or "bin") of values
      • Not categorical data
      • Each value on the y-axis is the combined frequency of values in each range
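The binning described on slide 30 can be shown without any plotting library. This is a minimal sketch, assuming non-negative integer values and fixed-width bins; the retweet counts are made up, and `bin_counts` and `render` are hypothetical helpers. Inside IPython Notebook a plotting library such as matplotlib would normally draw the chart; the text rendering here only illustrates what the x-axis ranges and y-axis frequencies mean.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Map each bin's lower edge to how many values fall in [edge, edge + bin_width)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))  # empty bins are simply omitted


def render(hist, bin_width):
    """Draw one row per bin, with one '#' per value in that bin."""
    rows = []
    for edge, n in hist.items():
        rows.append(f"{edge:>4}-{edge + bin_width - 1:<4} {'#' * n}")
    return "\n".join(rows)


# Hypothetical retweet counts for eleven tweets.
sample_retweets = [0, 1, 3, 7, 12, 15, 18, 22, 4, 9, 31]

if __name__ == "__main__":
    print(render(bin_counts(sample_retweets, 10), 10))
```

Changing `bin_width` changes the shape of the chart without changing the data, which is the practical difference from a bar chart over categories.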
  • 31. Example: Histogram of Retweets
  • 32. Social Media Analysis Framework
      • A memorable four step process to guide data science experiments:
          • Aspire: To test a hypothesis (answer a question)
          • Acquire: Get the data
          • Analyze: Count things
          • Summarize: Plot the results
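One way to picture the four steps on slide 32 is as four small pieces of code, one per step. The sketch below is only an illustration: the hypothesis, the records returned by `acquire`, and all function names are hypothetical, and the "plot" is reduced to a text table so the example stays self-contained.

```python
from statistics import mean

# Aspire: state the question before touching any data.
hypothesis = "Tweets with a hashtag are retweeted more than tweets without one"


def acquire():
    """Acquire: return made-up (has_hashtag, retweet_count) records."""
    return [(True, 12), (True, 7), (False, 2), (False, 5), (True, 20), (False, 1)]


def analyze(records):
    """Analyze: compare the average retweet count of the two groups."""
    with_tag = [rt for has_tag, rt in records if has_tag]
    without_tag = [rt for has_tag, rt in records if not has_tag]
    return {"with_hashtag": mean(with_tag), "without_hashtag": mean(without_tag)}


def summarize(results):
    """Summarize: format the results (a chart would go here in a notebook)."""
    return "\n".join(f"{label:<16} {value:6.2f}" for label, value in results.items())


if __name__ == "__main__":
    print(hypothesis)
    print(summarize(analyze(acquire())))
```

With six invented records the output says nothing about real tweets; the point is only the shape of the workflow.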
  • 33. Recommended Exercises
      • Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
      • Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
      • Fill in Example 1-1 with credentials and begin work
      • Execute each example sequentially
      • Customize queries
      • Explore tweet metadata; count tweet entities; plot histograms of results
      • Explore the "Chapter 9 (Twitter Cookbook)" notebook
      • Think of it as a collection of building blocks
  • 34. Final Q&A; Wrap Up
  • 35. Recommended Resources
      • http://MiningTheSocialWeb.com
      • Mining the Social Web 2E Chapter 1 (Chimera): http://bit.ly/13XgNWR
      • Source Code (GitHub): http://bit.ly/MiningTheSocialWeb2E and http://bit.ly/1fVf5ej (numbered examples)
      • Screencasts (Vimeo): http://bit.ly/mtsw2e-screencasts
