Sunday 9:55 a.m.–10:45 a.m.
Why Twitter Is All the Rage: A Data Miner's Perspective
Presenter: Matthew Russell
Audience level: Novice
Description:
In order to be successful, technology must amplify a meaningful aspect of our human experience, and Twitter’s success largely has been dependent on its ability to do this quite well. Although you could describe Twitter as just a “free, high-speed, global text-messaging service,” that would be to miss the much larger point that Twitter scratches some of the most fundamental itches of our humanity.
Abstract:
This talk explains why Twitter is "all the rage" by examining Twitter in light of fundamental questions about our humanity:
* We want to be heard
* We want to satisfy our curiosity
* We want it easy
* We want it now
This session examines Twitter's ability to satisfy these desires and presents its underlying conceptual architecture as an interest graph.
Even if you have minimal programming skills, you'll come away empowered with the ability to think about data mining on Twitter in more effective ways and apply a powerful collection of easily adaptable recipes to fully exploit the 5 kilobytes of metadata that decorates those 140 characters that you commonly think of as a tweet. Learn how to access Twitter's API, search for tweets, discover trending topics, process tweets in real-time from the firehose, and much more.
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)
1. 1
Why Twitter Is All The Rage:
A Data Miner's Perspective
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
PyTN - 23 February 2014
2. 2
Overview
Intro
Twitter as a Platform for Data Science
Applications of Firehose Analysis (#Syria circa last)
Understanding the Amazon Prime Air Reaction (IPython Notebook Walk Through)
Q&A
4. 4
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
5. 5
Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project
6. 6
Mining the Social Web ToC
Chapter 1 - Mining Twitter
Chapter 2 - Mining Facebook
Chapter 3 - Mining LinkedIn
Chapter 4 - Mining Google+
Chapter 5 - Mining Web Pages
Chapter 6 - Mining Mailboxes
Chapter 7 - Mining GitHub
Chapter 8 - Mining the Semantically Marked-Up Web
Chapter 9 - Twitter Cookbook
7. 7
Anatomy of Each Chapter
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
8. 8
Opportunities for Data Alchemy
A model for the world: signal and sinks
Growth in data exhaust is accelerating
Digital fingerprints of the "real world" are accumulating
Lots of opportunities for motivated Python hackers
"Software is eating the world"
9. 9
Social Media Is All the Rage
World population: 7B people
Facebook: 1B+ users
Twitter: 650M users
Google+: 500M users
LinkedIn: 260M users
250M+ blogs (conservatively?)
10. 10
But what does it all mean, Basil?
It's a platform for data science and the frontier for predictive analytics
Understanding world events
Swaying political elections
Modeling human behavior
Analyzing sentiment
Making intelligent recommendations
15. 15
Twitter Is All the Rage
It satisfies fundamental human desires
We want to be heard
We want to satisfy our curiosity
We want it easy
We want it now
Accessible, rich, and (mostly) "open" data
RESTful APIs and JSON responses
Great proving ground for predictive analytics about the real world
16. 16
Twitter's Network Dynamics
~650M curious users
A collective consciousness
Real-time communication
Short, sweet, ... and fast
Asymmetric Following Model
An interest graph
19. 19
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
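As a minimal sketch of the metadata listed above, the snippet below pulls authorship, time, location, and interaction counts out of a tweet that has already been decoded from JSON. The sample tweet is hand-written for illustration and far smaller than a real payload; the field names follow the v1.1 tweet format, and summarize_tweet is a hypothetical helper name.

```python
from datetime import datetime

# Hand-written sample shaped like a v1.1 tweet; all values are made up.
sample_tweet = {
    "created_at": "Sun Feb 23 15:55:00 +0000 2014",
    "text": "Hello from #PyTN",
    "user": {"screen_name": "example_user", "followers_count": 42},
    "coordinates": {"type": "Point", "coordinates": [-86.78, 36.16]},
    "retweet_count": 3,
    "favorite_count": 1,
    "in_reply_to_screen_name": None,
}


def summarize_tweet(tweet):
    """Collect the 'who, when, where' metadata from a decoded tweet."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    # "coordinates" is null for most tweets; when present it is a
    # GeoJSON point in [longitude, latitude] order.
    coords = tweet.get("coordinates")
    lon, lat = coords["coordinates"] if coords else (None, None)
    return {
        "author": tweet["user"]["screen_name"],
        "created": created,
        "lat": lat,
        "lon": lon,
        "retweets": tweet.get("retweet_count", 0),
        "favorites": tweet.get("favorite_count", 0),
        "is_reply": tweet.get("in_reply_to_screen_name") is not None,
    }


summary = summarize_tweet(sample_tweet)
print(summary["author"], summary["created"].isoformat(), summary["lat"], summary["lon"])
```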
20. 20
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
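Because entities arrive already parsed, no regular expressions are needed to get at them. The sketch below assumes a tweet decoded from a v1.1 JSON response; the sample tweet and the extract_entities helper are illustrative only, not part of any library.

```python
# Hand-written sample; a real tweet carries many more fields.
sample_tweet = {
    "text": "@SocialWebMining $AMZN drones? #PrimeAir http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "PrimeAir"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://example.com/story"}],
        "symbols": [{"text": "AMZN"}],
    },
}


def extract_entities(tweet):
    """Return the entity values of a decoded tweet as plain lists."""
    entities = tweet.get("entities", {})
    return {
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        # URLs come in several variations; prefer the expanded form
        # and fall back to the shortened one.
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # The "media" key is only present when something is attached.
        "media": [m.get("media_url") for m in entities.get("media", [])],
    }


print(extract_entities(sample_tweet))
```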
21. 21
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
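A rough sketch of two ideas from this slide: composing the URL for a resource, and walking a cursored result set. Real requests also need OAuth credentials, which are omitted here, so the example substitutes a dictionary of fake pages for the network call. build_url, walk_cursor, and FAKE_PAGES are hypothetical names; the cursor convention (start at -1, stop when next_cursor is 0) is the one used by v1.1 resources such as followers/ids.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Compose the URL for a GET request against a v1.1 resource."""
    query = urlencode(sorted(params.items()))
    return f"{API_ROOT}/{resource}.json?{query}"


def walk_cursor(fetch_page, items_key):
    """Yield every item from a cursored resource.

    fetch_page(cursor) must return one decoded JSON page containing
    the items under items_key plus a "next_cursor" value.
    """
    cursor = -1  # -1 asks for the first page
    while cursor != 0:  # 0 means there are no more pages
        page = fetch_page(cursor)
        yield from page[items_key]
        cursor = page["next_cursor"]


# Stand-in for the network: two fake pages of follower ids.
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 77},
    77: {"ids": [3], "next_cursor": 0},
}

print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))
print(list(walk_cursor(FAKE_PAGES.__getitem__, "ids")))
```

Unlike page numbers, a cursor is an opaque value returned by the previous response, so pages must be fetched in order.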
22. 22
Data Mining: Low Hanging Fruit
"Know thy data..."
Start with simple stats:
Count
Compare
Filter
Rank
Then, apply more complex analyses
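The four simple operations can be sketched with nothing more than the standard library. The hashtag data below is invented for illustration, and the threshold of 2 is an arbitrary choice.

```python
from collections import Counter

# Hashtags per tweet (made-up data for illustration).
hashtags_by_tweet = [
    ["Syria", "news"],
    ["syria"],
    ["PrimeAir", "Amazon"],
    ["Syria"],
]

# Count: tally each hashtag, ignoring case.
counts = Counter(tag.lower() for tags in hashtags_by_tweet for tag in tags)

# Compare: what share of all hashtag uses does one tag account for?
total = sum(counts.values())
syria_share = counts["syria"] / total

# Filter: keep only tags that appear at least twice.
frequent = {tag: n for tag, n in counts.items() if n >= 2}

# Rank: order tags from most to least common.
ranked = counts.most_common(3)

print(counts["syria"], syria_share, frequent, ranked[0])
```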
23. 23
A Starting Point: Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
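To make the binning concrete, here is a small text-only sketch that groups integer values into fixed-width ranges and prints one bar per range. The retweet counts are invented, histogram is a hypothetical helper, and a plotting library would normally do this work.

```python
from collections import Counter


def histogram(values, bin_width):
    """Return (bin_start, frequency) pairs for non-negative integers.

    Each bin covers bin_start through bin_start + bin_width - 1.
    Empty bins between the lowest and highest are kept so that gaps
    in the data stay visible.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    return [(start, counts[start])
            for start in range(first, last + bin_width, bin_width)]


# Made-up retweet counts for ten tweets.
retweet_counts = [0, 1, 3, 4, 7, 12, 15, 18, 22, 45]

for start, freq in histogram(retweet_counts, 10):
    label = f"{start}-{start + 9}"
    print(f"{label:<6} {'#' * freq}")
```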
29. 29
Measuring Influence Is Trickier Than It Looks
Spam bot accounts that effectively are zombies and can’t be harnessed for any utility
at all
Inactive or abandoned accounts that can’t influence or be influenced since they are
not in use
Accounts that follow so many other accounts that the likelihood of getting noticed (and
thus influencing) is practically zero
The network effects of retweets by accounts that are active and can be influenced to
spread a message
See also http://wp.me/p3QiJd-2a
31. 31
Realtime Analysis: #Syria
Monitor Twitter's firehose for realtime data using filters such as #Syria
Keep in mind the sheer volume of data can be considerable
Fuller analysis at http://wp.me/p3QiJd-1I
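The streaming API delivers one JSON document per line, interleaved with blank keep-alive lines and occasional non-tweet notices. The sketch below shows how such a stream might be consumed, using a short list of hand-written lines in place of a live, authenticated connection. matching_tweets is a hypothetical helper, and the hashtag check repeats locally what a server-side track filter would already have done.

```python
import json


def matching_tweets(lines, hashtag):
    """Yield decoded tweets from a line-delimited stream that carry hashtag."""
    wanted = hashtag.lstrip("#").lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive line
        message = json.loads(line)
        if "text" not in message:
            continue  # not a tweet, e.g. a limit or delete notice
        tags = [h["text"].lower()
                for h in message.get("entities", {}).get("hashtags", [])]
        if wanted in tags:
            yield message


# Stand-in for the live stream (made-up content).
sample_lines = [
    '{"text": "Latest on #Syria", "entities": {"hashtags": [{"text": "Syria"}]}}',
    "",
    '{"limit": {"track": 5}}',
    '{"text": "Lunch time", "entities": {"hashtags": []}}',
]

for tweet in matching_tweets(sample_lines, "#Syria"):
    print(tweet["text"])
```

Writing the consumer as a generator keeps memory use flat, which matters given the volume noted above.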
39. 39
#Syria: Why?
That's for you (as the data scientist) to decide
Quantitative automation can amplify human intelligence
Qualitative analysis still requires human intelligence
41. 41
MTSW Virtual Machine Experience
Goal: Make it easy to transform curiosity into insight
Vagrant-based virtual machine
Virtualbox or AWS
IPython Notebook User Experience
Point-and-click GUI
100+ turn-key examples and templates
Social web mining for the masses
42. 42
Social Media Analysis Framework
A memorable four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
43. 43
Goals
To understand how to capture data from Twitter's firehose
To understand basic pandas usage for tweets
To work through a data science experiment with a systematic 4-step
process
To better understand the emotional reaction to the Amazon Prime Air
announcement
To introduce some tools for data science