1

Why Twitter Is All The Rage:
A Data Miner's Perspective
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
PyTN - 23 February 2014
2

Overview

Intro
Twitter as a Platform for Data Science
Applications of Firehose Analysis (#Syria circa last)
Understanding the Amazon Prime Air Reaction (IPython Notebook Walk Through)
Q&A
3

Intro
4

Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
5

Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding templates for data science experiments
Think of the book as "premium" support for the OSS project
6

Mining the Social Web ToC
Chapter 1 - Mining Twitter
Chapter 2 - Mining Facebook
Chapter 3 - Mining LinkedIn
Chapter 4 - Mining Google+
Chapter 5 - Mining Web Pages
Chapter 6 - Mining Mailboxes
Chapter 7 - Mining GitHub
Chapter 8 - Mining the Semantically Marked-Up Web
Chapter 9 - Twitter Cookbook
7

Anatomy of Each Chapter
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
8

Opportunities for Data Alchemy

A model for the world: signal and sinks
Growth in data exhaust is accelerating
Digital fingerprints of the "real world" are accumulating
Lots of opportunities for motivated Python hackers
"Software is eating the world"
9

Social Media Is All the Rage
World population: 7B people
Facebook: 1B+ users
Twitter: 650M users
Google+: 500M users
LinkedIn: 260M users
250M+ blogs (conservatively?)
10

But what does it all mean, Basil?
It's a platform for data science and the frontier for predictive analytics
Understanding world events
Swaying political elections
Modeling human behavior
Analyzing sentiment
Making intelligent recommendations
11

Twitter & Data Science
12

Data Science

Data => Actionable information
Highly interdisciplinary
Nascent
Necessary

http://wikipedia.org/wiki/Data_science
13

Another View of Data Science
14
15

Twitter Is All the Rage
It satisfies fundamental human desires
We want to be heard
We want to satisfy our curiosity
We want it easy
We want it now
Accessible, rich, and (mostly) "open" data
RESTful APIs and JSON responses
Great proving ground for predictive analytics about the real world
16

Twitter's Network Dynamics
~650M curious users
A collective consciousness
Real-time communication
Short, sweet, ... and fast
Asymmetric Following Model
An interest graph
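
A minimal sketch of the asymmetric following model, assuming follow relationships are stored as directed (follower, followed) pairs. The account names and the helper functions are hypothetical and only illustrate that following someone does not imply being followed back.

```python
# Hypothetical follow edges as (follower, followed) pairs; names are illustrative only.
follows = {
    ("ana", "u2"),
    ("jorge", "u2"),
    ("ana", "jorge"),
    ("jorge", "ana"),
    ("nina", "ana"),
}


def following(user):
    """Accounts that `user` follows (outgoing edges)."""
    return {dst for src, dst in follows if src == user}


def followers(user):
    """Accounts that follow `user` (incoming edges)."""
    return {src for src, dst in follows if dst == user}


def mutual(user):
    """Accounts that `user` follows and that also follow `user` back."""
    return following(user) & followers(user)


print(sorted(following("ana")))  # ['jorge', 'u2']
print(sorted(followers("ana")))  # ['jorge', 'nina']
print(sorted(mutual("ana")))     # ['jorge']
```

In this toy data, "u2" is followed by two accounts but follows no one, which is the interest-graph pattern: the edge expresses interest, not a mutual connection.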
17

Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
18

Twitter and Facebook Compared
Twitter | Facebook
Account Types: "Anything" | Account Types: People & Pages
"Following" Relationships | Mutual Connections
Favorites | "Likes"
Retweets | "Shares"
Replies | "Comments"
(Almost) No Privacy Controls | Extensive Privacy Controls
19

What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
20

What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations

(financial) symbols
stock tickers

media
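
A rough sketch of pulling entities out of a tweet, assuming the status has already been parsed from JSON into a Python dict shaped like a Twitter API v1.1 status object. The sample tweet is made up, and `extract_entities` is a hypothetical helper, not part of any library.

```python
# A trimmed, made-up tweet shaped like a Twitter API v1.1 status object.
tweet = {
    "text": "Reading about $AMZN drones with @SocialWebMining #PrimeAir http://t.co/example",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "PrimeAir"}],
        "urls": [
            {"url": "http://t.co/example", "expanded_url": "http://example.com/drones"}
        ],
        "symbols": [{"text": "AMZN"}],
    },
}


def extract_entities(status):
    """Collect the entity values from a status dict, tolerating missing keys."""
    entities = status.get("entities", {})
    return {
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        # URLs come in several variations; prefer the expanded form when present.
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


result = extract_entities(tweet)
print(result["hashtags"])  # ['PrimeAir']
print(result["symbols"])   # ['AMZN']
```

Because the entities arrive pre-parsed in the metadata, no regular expressions over the 140 characters are needed.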
21

API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters
JSON responses
Cursors (not quite pagination)
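
As an illustration of the "everything is a resource" idea, the sketch below only assembles the URL for the example request above; it does not send it. `build_request_url` is a hypothetical helper, and a real call to the 1.1 API would also need OAuth credentials, which are left out here.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_request_url(resource, **params):
    """Build the URL for a resource such as 'statuses/user_timeline'."""
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    url = f"{API_ROOT}/{resource}.json"
    return f"{url}?{query}" if query else url


url = build_request_url("statuses/user_timeline", screen_name="SocialWebMining")
print("GET", url)
```

The response to such a request would be JSON, which Python's standard `json` module can parse into dicts and lists.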
22

Data Mining: Low Hanging Fruit
"Know thy data..."
Start with simple stats:
Count
Compare
Filter
Rank
Then, apply more complex analyses
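
The four simple operations can be sketched with nothing more than the standard library. The tweets below are invented and reduced to three assumed fields (`user`, `hashtags`, `retweet_count`) for illustration.

```python
from collections import Counter

# Invented sample data; real tweets carry far more metadata.
tweets = [
    {"user": "ana", "hashtags": ["Syria"], "retweet_count": 12},
    {"user": "jorge", "hashtags": ["Syria", "news"], "retweet_count": 0},
    {"user": "ana", "hashtags": ["news"], "retweet_count": 3},
    {"user": "nina", "hashtags": ["Syria"], "retweet_count": 40},
]

# Count: how often does each hashtag appear?
hashtag_counts = Counter(tag for t in tweets for tag in t["hashtags"])

# Filter: keep only tweets that were retweeted at least once.
retweeted = [t for t in tweets if t["retweet_count"] > 0]

# Compare: what share of all tweets was retweeted?
share_retweeted = len(retweeted) / len(tweets)

# Rank: most common hashtags and most retweeted tweets first.
top_hashtags = hashtag_counts.most_common(2)
top_tweets = sorted(tweets, key=lambda t: t["retweet_count"], reverse=True)

print(top_hashtags)           # [('Syria', 3), ('news', 2)]
print(share_retweeted)        # 0.75
print(top_tweets[0]["user"])  # nina
```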
23

A Starting Point: Histograms

A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
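
A small sketch of the binning step behind a histogram, assuming a list of made-up retweet counts and a fixed bin width of 10. The `histogram` function is a hypothetical helper; plotting libraries do this for you, but the logic is the same.

```python
from collections import Counter

# Made-up retweet counts for ten tweets.
retweet_counts = [0, 1, 1, 3, 7, 12, 18, 25, 26, 41]
BIN_WIDTH = 10  # assumed bin size; each bin covers [start, start + BIN_WIDTH)


def histogram(values, bin_width):
    """Map each bin's starting value to the number of values falling in it."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Print a rough text rendering: one row per non-empty bin.
for start, freq in histogram(retweet_counts, BIN_WIDTH).items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * freq}")
```

Note that this sketch omits empty bins (here, 30-39), whereas a plotted histogram would show them as gaps on the x-axis.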
24

Example: Histogram of Retweets
25

Social Network Mechanics

[Diagram: a social network whose nodes are Roberto, Mercedes, Jorge, Nina, and Ana]
26

Interest Graph Mechanics
[Diagram: an interest graph whose nodes are U2, Juan Luis Guerra, Roberto, Mercedes, Ana, Jorge, and Nina]
27

A (Social) Interest Graph
[Diagram: an interest graph whose nodes are U2, Juan Luis Guerra, Roberto, Mercedes, Ana, Jorge, and Nina]
28

A (Political) Interest Graph
[Diagram: an interest graph whose nodes are Johnny Araya, Rodolfo Hernández, Roberto, Mercedes, Ana, Jorge, and Nina]
29

Measuring Influence Is Trickier Than It Looks

Spam bot accounts that effectively are zombies and can’t be harnessed for any utility at all
Inactive or abandoned accounts that can’t influence or be influenced since they are not in use
Accounts that follow so many other accounts that the likelihood of getting noticed (and thus influencing) is practically zero
The network effects of retweets by accounts that are active and can be influenced to spread a message
See also http://wp.me/p3QiJd-2a
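
One way to act on these caveats is to discount followers who are unlikely to be reached before counting them. The sketch below is only an illustration: the follower records, the `suspected_bot` flag, and both thresholds are assumptions made up for this example, not values from the talk or from Twitter's API.

```python
from datetime import datetime, timedelta

# Arbitrary, assumed thresholds; tune them against real data.
MAX_FOLLOWING = 2000                # beyond this, a tweet is unlikely to be noticed
MAX_INACTIVE = timedelta(days=90)   # longer silence counts as abandoned


def is_reachable(follower, now):
    """Return True if a follower plausibly sees and can spread a message."""
    if follower["suspected_bot"]:
        return False
    if now - follower["last_tweet_at"] > MAX_INACTIVE:
        return False
    if follower["following_count"] > MAX_FOLLOWING:
        return False
    return True


now = datetime(2014, 2, 23)
followers = [
    {"name": "ana", "suspected_bot": False,
     "last_tweet_at": datetime(2014, 2, 20), "following_count": 150},
    {"name": "bot42", "suspected_bot": True,
     "last_tweet_at": datetime(2014, 2, 22), "following_count": 90},
    {"name": "dormant", "suspected_bot": False,
     "last_tweet_at": datetime(2012, 1, 1), "following_count": 40},
    {"name": "followsall", "suspected_bot": False,
     "last_tweet_at": datetime(2014, 2, 21), "following_count": 50000},
]

reachable = [f["name"] for f in followers if is_reachable(f, now)]
print(f"{len(reachable)} of {len(followers)} followers look reachable: {reachable}")
```

A raw follower count would report four followers here, while the filtered view suggests only one can actually be influenced.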
30

Justin Bieber vs Tea Party
31

Realtime Analysis: #Syria

Monitor Twitter's firehose for realtime data using filters such as #Syria
Keep in mind the sheer volume of data can be considerable
Fuller analysis at http://wp.me/p3QiJd-1I
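
The real streaming API applies the filter on Twitter's side and needs authenticated access, so the sketch below is only a local stand-in: `fake_stream` yields a few invented messages, and the hypothetical `track` generator shows the shape of consuming a stream lazily and keeping the matches.

```python
def track(stream, keyword):
    """Yield statuses whose text contains `keyword`, ignoring case."""
    needle = keyword.lower()
    for status in stream:
        # Some stream messages are not tweets and carry no text; skip them.
        if needle in status.get("text", "").lower():
            yield status


def fake_stream():
    """Stand-in for a live connection; yields invented messages one at a time."""
    yield {"text": "Latest from #Syria tonight"}
    yield {"text": "Lunch time"}
    yield {"text": "Thoughts on #syria and the UN"}
    yield {"notice": "a non-tweet message without text"}


matches = [status["text"] for status in track(fake_stream(), "#Syria")]
print(len(matches), "matching tweets")
```

Processing statuses one at a time, rather than collecting them all first, matters once the volume becomes considerable.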
32

#Syria: Who?

See http://wp.me/p3QiJd-1I
33

#Syria: Who?

See http://wp.me/p3QiJd-1I
34

#Syria: Who?

See http://wp.me/p3QiJd-1I
35

#Syria: What?

See http://wp.me/p3QiJd-1I
36

#Syria: What?

See http://wp.me/p3QiJd-1I
37

#Syria: Where?

See http://wp.me/p3QiJd-1I
38

#Syria: When?

See http://wp.me/p3QiJd-1I
39

#Syria: Why?

That's for you (as the data scientist) to decide
Quantitative automation can amplify human intelligence
Qualitative analysis still requires human intelligence
40

Twitter Firehose Analysis with pandas
41

MTSW Virtual Machine Experience
Goal: Make it easy to transform curiosity into insight
Vagrant-based virtual machine
VirtualBox or AWS
IPython Notebook User Experience
Point-and-click GUI
100+ turn-key examples and templates
Social web mining for the masses
42

Social Media Analysis Framework

A memorable four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
43

Goals
To understand how to capture data from Twitter's firehose
To understand basic pandas usage for tweets
To work through a data science experiment with a systematic 4-step process
To better understand the emotional reaction to the Amazon Prime Air announcement
To introduce some tools for data science
44

Useful Links
Website
http://MiningTheSocialWeb.com
Twitter Data Mining Round Up
http://wp.me/p3QiJd-5H

All Source Code in IPython Notebook format (GitHub)
http://bit.ly/MiningTheSocialWeb2E
45

Q&A
