The Art of Social Media Analysis with Twitter & Python

krishna sankar
@ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130

Intro
•  House Rules (1 of 2)
  o  Doesn’t assume any knowledge of Twitter API
  o  Goal: Everybody on the same page & get a working knowledge of Twitter API
  o  To bootstrap your exploration into Social Network Analysis & Twitter
  o  Simple programs, to illustrate usage & data manipulation
[Diagram: tutorial topic map. API, Objects, …; Twitter Network Analysis Pipeline; NLP, NLTK, Sentiment Analysis; @mention network; Cliques, social graph; Retweet analytics, Information contagion; Growth, weak ties; #tag Network]
We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level
Intro
•  House Rules (2 of 2)
  o  Am using the requests library
  o  There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
  o  Many areas to explore – not enough time. So decided to focus on social graph, cliques & networkx

About Me
•  Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
  o  Co-chair – 2012 IEEE Precision Time Synchronization
    •  http://www.ispcs.org/2012/index.html
  o  Blog : http://doubleclix.wordpress.com/
  o  Quora : http://www.quora.com/Krishna-Sankar
•  Prior Gigs
  o  Lead Architect (Egnyte)
  o  Distinguished Engineer (CSCO)
  o  Employee #64439 (CSCO) to #39 (Egnyte) & now #9 !
•  Current Focus:
  o  Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
  o  Big Data (more of variety, variability, context & graphs, than volume or velocity – so far !)
  o  Overlay based semantic search & ranking
•  Other related Presentations
  o  http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
  o  http://goo.gl/0SQDV The Art of Big Data (Detailed)
  o  http://goo.gl/EaUKH The Hitchhiker’s Guide to Kaggle OSCON 2011 Tutorial

Twitter Tips – A Baker’s Dozen
1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter
  o  I was getting numbers when I was expecting screen_names in user objects.
  o  Was ready to send blasting e-mails to Twitter team. Decided to check one more time and found that my parameter key was wrong - screen_name instead of user_id
  o  Always test with one or two records before a long run ! - learned the hard way
3.  Twitter APIs are very powerful – consistent use can bear huge data
  o  In a week, you can pull in 4-5 million users & some tweets !
  o  Night runs are far faster & more error-free
4.  Use a NoSQL data store as a command buffer & data buffer
  o  Would make it easy to work with Twitter at scale
  o  I use MongoDB
  o  Keep the schema simple & no fancy transformation
    •  And as far as possible same as the (JSON) response
  o  Use NoSQL CLI for trimming records et al
[Figure label: The End As The Beginning]


5.  Always use a big data pipeline
  o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
  o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use functional approach for a scalable pipeline
  o  Compose your big data pipeline with well defined granular functions, each doing only one thing
  o  Don’t overload the functional components (i.e. no collect, unroll & store as a single component)
  o  Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
    •  This did create some trouble for me, as we will see later
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
  o  The equivalent of the traditional ETL
  o  Validation stage & validation routines are important
    •  Cannot expect perfect runs
    •  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
  o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs !
  o  There will be a separate printout of the control numbers that will be kept in the operations files

9.  Program defensively
  o  More so for REST-based Big Data analytics systems
  o  Expect failures at the transport layer & accommodate for them
10.  Have Erlang-style supervisors in your pipeline
  o  Fail fast & move on
  o  Don’t linger and try to fix errors that cannot be controlled at that layer
  o  A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
  o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
  o  I have an example in part 2
11.  Data will never be perfect
  o  Know your data & accommodate for its idiosyncrasies
    •  for example: 0 followers, protected users, 0 friends, …
12.  Check Point frequently (preferably after every API call) & have a re-startable command buffer cache
  o  See a MongoDB example in Part 2
13.  Don’t bombard the URL
  o  Wait a few seconds between successive calls. This will end up with a scalable system, eventually
  o  I found 10 seconds to be the sweet spot. 5 seconds gave retry error. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in !
14.  Always measure the elapsed time of your API runs & processing
  o  Kind of early warning when something is wrong
15.  Develop incrementally; don’t fail to check “cut & paste” errors
16.  The Twitter big data pipeline has lots of opportunities for parallelism
  o  Leverage data parallelism frameworks like MapReduce
  o  But first :
    §  Prototype as a linear system,
    §  Optimize and tweak the functional modules & cache strategies,
    §  Note down stages and tasks that can be parallelized and
    §  Then parallelize them
  o  For the example project, we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out, as we progress through the tutorial
17.  Pay attention to handoffs between stages
  o  They might require transformation – for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
  o  But resist the urge to overload collect with transform
  o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
  o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18.  Have a good log management system to capture and wade through logs
19.  Understand the underlying network characteristics for the inference you want to make
  o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
  o  Twitter Network is more of an Interest Network
  o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
  o  But, others like Cliques and Bipartite Graphs do
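
A minimal sketch of the unroll/flatten stage from tip 17, kept separate from collect. The document shape and the field names (user_id, follower_ids, follower_id) are assumptions made for illustration, not the schema used in the tutorial code.

```python
def unroll_followers(collected_docs):
    """Yield one flat document per (user, follower) pair.

    Assumes each collected document looks like
    {"user_id": 1, "follower_ids": [2, 3]}; the field names are hypothetical.
    """
    for doc in collected_docs:
        for follower_id in doc.get("follower_ids", []):
            yield {"user_id": doc["user_id"], "follower_id": follower_id}


# The collect stage may store the same user as several arrays (one per API page).
collected = [
    {"user_id": 1, "follower_ids": [10, 11]},
    {"user_id": 1, "follower_ids": [12]},
    {"user_id": 2, "follower_ids": []},
]
flat = list(unroll_followers(collected))
```

Because the stage is a generator, it can sit between a store and a model without holding the whole list in memory, and it can be rerun on its own after a failed crawl.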

Twitter Gripes
1.  Need richer APIs for #tags
  o  Somewhat similar to users viz. followers, friends et al
  o  Might make sense to make #tags a top level object with its own semantics
2.  HTTP Error Return is not uniform
  o  Returns 400 Bad Request instead of 420
  o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  “following” vs. “friends_count” i.e. “following” is a dummy variable.
  o  There are a few like this, most probably for backward compatibility
5.  Parameter Validation is not uniform
  o  Gives “404 Not found” instead of “406 Not Acceptable” or “413 Too Long” or “416 Range Unacceptable”
6.  Overall more validation would help
  o  Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out

A Fork
[Diagram: a fork in the road. One branch: NLP, NLTK & deep into Tweets, Sentiment Analysis; the other branch: the Social Graph]
•  Not enough time for both
•  I chose the Social Graph route

A minute about Twitter as platform & its evolution
•  “.. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you’re on Twitter.com or elsewhere on the web” - Michael
  o  https://dev.twitter.com/blog/delivering-consistent-twitter-experience
•  “The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community.” – Chenda, CBS news
My Wish & Hope
•  I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
•  I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
  o  I don’t think showing Tweets in LinkedIn took anything away from the Twitter experience
  o  LinkedIn experience & Twitter experience are different & distinct. Showing tweets in LinkedIn didn’t change that
•  I sincerely hope that the platform grows with a rich developer eco system
•  Orthogonally extensible platform is essential
•  Of course, along with a congruent user experience – “ … core Twitter consumption experience through consistent tools”

Setup
•  For Hands on Today
  o  Python 2.7.3
  o  easy_install -v requests
    •  http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
  o  easy_install -v requests-oauth
  o  Hands on programs at https://github.com/xsankar/oscon2012-handson
•  For advanced data science with social graphs
  o  easy_install -v networkx
  o  easy_install -v numpy
  o  easy_install -v nltk
    •  Not for this tutorial, but good for sentiment analysis et al
  o  MongoDB
    •  I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
  o  graphviz - http://www.graphviz.org/; easy_install pygraphviz
  o  easy_install pydot

Problem Domain For this tutorial
•  Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
  o  Not for Twitter based apps for real time tweets
  o  Not web sites with real time tweets
•  By looking at the domain in aggregate to derive inferences & actionable recommendations
•  Which also means, you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend but dig deeper before pronouncing a trend)

Agenda
I.  Mechanics : Twitter API (1:30 PM - 3:00 PM)
  o  Essential Fundamentals (Rate Limit, HTTP Codes et al)
  o  Objects
  o  API
  o  Hands-on (2:45 PM - 3:00 PM)
II.  Break (3:00 PM - 3:30 PM)
III.  Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
  o  Underlying Concepts
  o  Social Graph Analysis of @clouderati
    §  Stages, Strategies & Tasks
    §  Code Walk thru

Twitter API : Read These First
•  Using Twitter Brand
  o  New logo & associated guidelines : https://twitter.com/about/logos
  o  Twitter Rules : https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
  o  Developer Rules of the road https://dev.twitter.com/terms/api-terms
•  Read These Links First
  1.  https://dev.twitter.com/docs/things-every-developer-should-know
  2.  https://dev.twitter.com/docs/faq
  3.  Field Guide to Objects https://dev.twitter.com/docs/platform-objects
  4.  Security https://dev.twitter.com/docs/security-best-practices
  5.  Media Best Practices : https://dev.twitter.com/media
  6.  Consolidated Page : https://dev.twitter.com/docs
  7.  Streaming APIs https://dev.twitter.com/docs/streaming-apis
  8.  How to Appeal (Not that you all would need it !) https://support.twitter.com/articles/72585
•  Only One version of Twitter APIs

API Status Page

•  https://dev.twitter.com/status

•  https://dev.twitter.com/issues

•  https://dev.twitter.com/discussions

http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter

Open This First
•  Install pre-req as per the setup slide
•  Run
  o  oscon2012_open_this_first.py
  o  To test connectivity – “canary query”
•  Run
  o  oscon2012_rate_limit_status.py
  o  Use http://www.epochconverter.com to check reset_time
•  Formats xml, json, atom & rss

Twitter API
•  Twitter REST: Core Data, Core Twitter Objects
  o  Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet
  o  Rate Limit : 150/350
•  Twitter Search & Trend
  o  Keywords, Specific User, Trends
  o  Rate Limit : Complexity & Frequency
•  Streaming: Near-realtime, High Volume; follow users, topics, data mining
  o  Public Streams, User Streams, Site Streams, Firehose

Rate Limits
•  By API type & Authentication Mode

API         No authC                  authC     Error
REST        150/hr                    350/hr    400
Search      Complexity & Frequency    -N/A-     420
Streaming   Up to 1%
Firehose    none                      none

Rate Limit Header
{
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "149",
  "x-ratelimit-reset": "1340467358",
  "x-runtime": "0.04144",
  "x-transaction": "2b49ac31cf8709af",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}
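
A small sketch of reading the rate limit headers shown above, written for Python 3. The function name rate_limit_state and the now_epoch value are assumptions for illustration; the header names and values are the ones from the slide. It also does the epoch conversion that epochconverter.com would do by hand.

```python
from datetime import datetime, timezone


def rate_limit_state(headers, now_epoch):
    """Return (remaining calls, seconds until reset, reset time in UTC)."""
    remaining = int(headers["x-ratelimit-remaining"])
    reset_epoch = int(headers["x-ratelimit-reset"])
    wait_seconds = max(0, reset_epoch - int(now_epoch))
    reset_utc = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
    return remaining, wait_seconds, reset_utc


headers = {
    "x-ratelimit-limit": "150",
    "x-ratelimit-remaining": "149",
    "x-ratelimit-reset": "1340467358",
}
# now_epoch is passed in (a made-up value here) so the example is repeatable;
# a real run would pass time.time().
remaining, wait_seconds, reset_utc = rate_limit_state(headers, 1340467000)
print(remaining, wait_seconds, reset_utc.isoformat())
```

With the requests library the same function should work on r.headers directly, since that mapping is case-insensitive.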

Rate Limit-ed Header
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "150",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:48:25 GMT",
  "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
  "server": "tfe",
  …
  "status": "400 Bad Request",
  "vary": …,
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341363230",
  "x-runtime": "0.01126"
}

Rate Limit Example
•  Run
  o  oscon2012_rate_limit_02.py
•  It iterates through a list to get followers
•  List is 2072 long
•  Last time, it gave me 5 min. Now the reset timer is 1 hour
•  150 calls, not authenticated
{
  …
  "date": "Wed, 04 Jul 2012 00:54:16 GMT",
  "status": "200 OK",
  "vary": …,
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.02768",
  "x-transaction": "f1bafd60112dddeb",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:55:04 GMT",

And Rate Limit kicked-in
  …
  "status": "400 Bad Request",
  "transfer-encoding": "chunked",
  "vary": …,
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.01342"
}

API with OAuth
•  OAuth: “api_identified”, 350 calls, 1 hr reset
{
  …
  "date": "Wed, 04 Jul 2012 01:32:01 GMT",
  "etag": ""dd419c02ed00fc6b2a825cc27wbe040"",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "200 OK",
  "vary": …,
  "x-access-level": "read",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341369121",
  "x-runtime": "0.05539",
  "x-transaction": "9f8508fe4c73a407",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}

•  Rate Limit resets during consecutive calls (reset time moves +1 hour)
{
  …
  "date": "Thu, 05 Jul 2012 14:56:05 GMT",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "133",
  "x-ratelimit-reset": "1341500165",
  …
}
******** 2416
{
  …
  "date": "Thu, 05 Jul 2012 14:56:18 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341503776",
******** 2417

Unexplained Errors
•  While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST
•  Got around by “Trap & wait 5 seconds”
•  Night Runs are relatively error free
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?
user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44
614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C38854
7381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8
962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C
17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C
42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C
8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%
2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%
2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%
2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C2648155
56%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C362260
09%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C446
14626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C886
54836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C
13727232%2C199803906%2C220435108%2C268531201
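
One way to sketch the "Trap & wait 5 seconds" workaround. The helper call_with_retry and the flaky_call stand-in are hypothetical; this is not the code from the tutorial repository. It catches any exception so it can run without a network; in a real crawl one would probably narrow it to requests.exceptions.ConnectionError, the error in the traceback above.

```python
import time


def call_with_retry(call, retries=3, wait_seconds=5, sleep=time.sleep):
    """Run call(); on an exception, wait and try again, up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: fail fast and let a higher layer decide
            sleep(wait_seconds)


# Stand-in for a flaky API call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("simulated connection error")
    return "ok"


waits = []  # record the waits instead of really sleeping
result = call_with_retry(flaky_call, sleep=waits.append)
```

Passing sleep in as a parameter keeps the wait policy testable; the default is the real time.sleep.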

A Day in the life of Twitter Rate Limit
{
  …
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": …,
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",      <- Missed by 4 min!
  "x-runtime": "0.01918"
}
Error, sleeping
{
  …
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",         <- OK after 5 min sleep
  …

Strategies
I have no exotic strategies, so far !
1.  Obvious : Track elapsed time & sleep when rate limit kicks in
2.  Combine authenticated & non-authenticated calls
3.  Use multiple API types
4.  Cache
5.  Store & get only what is needed
6.  Checkpoint & buffer request commands
7.  Distributed data parallelism – for example AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit
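
A rough sketch of strategy 6, a checkpointed and restartable command buffer. The tutorial keeps its buffer in MongoDB; this example swaps in a plain JSON file so that it runs without a database. The class name CommandBuffer, the file name and the user ids are all made up for illustration.

```python
import json
import os
import tempfile


class CommandBuffer:
    """A restartable queue of pending work, checkpointed to a JSON file."""

    def __init__(self, path, commands=()):
        self.path = path
        if os.path.exists(path):
            with open(path) as f:        # restart: resume from the checkpoint
                self.pending = json.load(f)
        else:
            self.pending = list(commands)
            self._checkpoint()

    def _checkpoint(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)

    def next(self):
        return self.pending[0] if self.pending else None

    def done(self):
        """Mark the head command finished and checkpoint right away."""
        self.pending.pop(0)
        self._checkpoint()


path = os.path.join(tempfile.mkdtemp(), "commands.json")
buf = CommandBuffer(path, commands=[101, 102, 103])  # hypothetical user ids
buf.done()                                           # 101 was fetched
resumed = CommandBuffer(path)                        # simulate a restart
```

Checkpointing after every completed command costs one small write per API call, which is cheap next to a rate-limited request.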

Authentication
•  Three modes
  o  Anonymous
  o  HTTP Basic Auth
  o  OAuth
•  As of Aug 31, 2010, only Anonymous or OAuth are supported
•  OAuth enables the user to authorize an application without sharing credentials
•  Also has the ability to revoke
•  Twitter supports OAuth 1.0a
•  OAuth 2.0 is the new standard, much simpler
  o  No timeframe for Twitter support, yet

OAuth Pragmatics
•  Helpful Links
  o  https://dev.twitter.com/docs/auth/oauth
  o  https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
  o  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
  o  http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
•  Discussion on OAuth internal mechanisms is better left for another day
•  For headless applications to get OAuth token, go to https://dev.twitter.com/apps
•  Create an application & get four credential pieces
  o  Consumer Key, Consumer Secret, Access Token & Access Token Secret
•  All the frameworks have support for OAuth. So plug in these values & use the framework’s calls
•  I used requests-oauth library like so:

requests-oauth
import requests
from oauth_hook import OAuthHook  # module name as shipped by requests-oauth

# Get client using the token, key & secret from dev.twitter.com/apps
def get_oauth_client():
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

# Anonymous call, straight through requests
def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if cursor is needed {"cursor":-1,"user_id":scr_name}
    r = requests.get(url, params=payload)
    return r

# Use the client instead of requests
def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if cursor is needed {"cursor":-1,"user_id":scr_name}
    r = client.get(url, params=payload)
    return r

Ref: http://pypi.python.org/pypi/requests-oauth

OAuth Authorize screen
•  The user authenticates with Twitter & grants access to Forbes Social
•  Forbes Social doesn’t have the user’s credentials, but uses OAuth to access the user’s account

HTTP status Codes
•  0 Never made it to Twitter Servers - Library error
•  200 OK
•  304 Not Modified
•  400 Bad Request
  o  Check error message for explanation
  o  REST Rate Limit !
•  401 UnAuthorized
  o  Beware – you could get this for other reasons as well.
•  403 Forbidden
  o  Hit Update Limit (> max Tweets/day, following too many people)
•  404 Not Found
•  406 Not Acceptable
•  413 Too Long
•  416 Range Unacceptable
•  420 Enhance Your Calm
  o  Rate Limited
•  500 Internal Server Error
•  502 Bad Gateway
  o  Down for maintenance
•  503 Service Unavailable
  o  Overloaded “Fail whale”
•  504 Gateway Timeout
  o  Overloaded
https://dev.twitter.com/docs/error-codes-responses
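
A sketch of how a crawler might turn the codes above into a coarse action. The grouping is an assumed policy for illustration, not something Twitter documents, and classify_status is a hypothetical name. Note that 400 is ambiguous: the REST API uses it for rate limiting, but it can also be a genuinely bad request, so the error message still needs a look.

```python
def classify_status(code):
    """Map an HTTP status code to a coarse pipeline action (assumed policy)."""
    if code in (200, 304):
        return "ok"
    if code in (400, 420):
        return "rate_limited"   # REST returns 400, Search 420, per the slides
    if code in (0, 500, 502, 503, 504):
        return "retry"          # library/transport error or overloaded servers
    return "skip"               # 401, 403, 404, 406, 413, 416, ...


actions = {code: classify_status(code) for code in (200, 400, 404, 420, 503)}
```

A supervisor layer could sleep until the reset time on "rate_limited", wait a few seconds on "retry", and log and move on for "skip".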

HTTP Status Code - Example
•  Detailed error message in JSON ! I like this
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "91",
  "content-type": "application/json; charset=utf-8",
  "date": "Sat, 23 Jun 2012 00:06:56 GMT",
  "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
  "server": "tfe",
  …
  "status": "401 Unauthorized",
  "vary": …,
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "0",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1340413616",
  "x-runtime": "0.01997"
}
{
  "errors": [
    {
      "code": 53,
      "message": "Basic authentication is not supported"
    }
  ]
}

HTTP Status Code – Confusing Example
•  GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
•  Spelling Mistake
  o  Should be screen_name
•  But confusing error !
•  Should be 406 Not Acceptable or 413 Too Long, showing parameter error
{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}

HTTP Status Code - Example
•  Sometimes, the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12
•  Clearly a parameter error (i.e. more parameters)
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  …
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}

Twitter Platform Objects
[Diagram: Users follow Friends and are followed by Followers. Users post Status Updates (Tweets); Tweets are temporally ordered into a TimeLine. Tweets embed Entities (@ user_mentions, urls, media, # hashtags) and Places]
https://dev.twitter.com/docs/platform-objects

Tweets
•  A.k.a Status Updates
•  Interesting fields
  o  coordinates <- geo location
  o  created_at
  o  entities (will see later)
  o  id, id_str
  o  possibly_sensitive
  o  user (will see later)
    •  perspectival attributes embedded within a child object of an unlike parent – hard to maintain at scale
    •  https://dev.twitter.com/docs/faq#6981
  o  withheld_in_countries
    •  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets

A word about id, id_str
•  June 1, 2010
  o  Snowflake, the id generator service
  o  “The full ID is composed of a timestamp, a worker number, and a sequence number”
  o  JavaScript had problems handling numbers > 53 bits
  o  "id": 819797
  o  "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
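
A quick illustration of why id_str exists. The id below is made up (it is just the first integer that needs more than 53 bits); the point is what happens when such a number passes through a double-precision float, which is the only number type JavaScript has.

```python
import json

big_id = 2 ** 53 + 1   # 9007199254740993, needs more than 53 bits

# Forcing the id through a double silently changes it.
as_double = int(float(big_id))

# Python ints are arbitrary precision, so both fields survive json parsing;
# id_str is the one that is safe to hand to JavaScript clients.
tweet = json.loads('{"id": 9007199254740993, "id_str": "9007199254740993"}')
```

So Python code can use either field, but anything that round-trips ids through a browser is safer with id_str.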

Tweets - example
•  Let us run oscon2012-tweets.py
•  Example of tweet
  o  coordinates
  o  id
  o  id_str

Users
•  followers_count
•  geo_enabled
•  id, id_str
•  name, screen_name
•  protected
•  status, statuses_count
•  withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users

Users – Let us run some examples
•  Run
  o  oscon_2012_users.py
    •  Lookup users by screen_name
  o  oscon12_first_20_ids.py
    •  Lookup users by user_id
•  Inspect the results
  o  id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
•  Can use information for customizing the user’s screen in your web app

Entities
•  Metadata & Contextual Information
•  You can parse them, but Entities parse them out as structured data
•  REST API/Search API – include_entities=1
•  Streaming API – included by default
•  hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper

Entities
•  Run
  o  oscon2012_entities.py
•  Inspect hashtags, urls et al

Places
•  attributes
•  bounding_box
•  id (as a string!)
•  country
•  name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes

Places
•  Can search for tweets near a place like so:
•  Get latlong of conv center [45.52929,-122.66289]
  o  Tweets near that place
•  Tweets near San Jose [37.395715,-122.102308]
•  We will not go further into this here. But very useful

Timelines
•  Collections of tweets ordered by time
•  Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
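
A sketch of max_id / since_id navigation. The real call would be a timeline request with those two parameters; here a fake_fetch stand-in (hypothetical, as is walk_timeline) plays the API so the loop can be run offline. The idea: each page is newest-first, so the next page asks for ids at or below the lowest id seen minus one.

```python
def walk_timeline(fetch_page, since_id=None):
    """Collect tweets newest-first by moving max_id below the oldest id seen."""
    tweets, max_id = [], None
    while True:
        page = fetch_page(max_id=max_id, since_id=since_id)
        if not page:
            return tweets
        tweets.extend(page)
        max_id = min(t["id"] for t in page) - 1   # max_id is inclusive


# Stand-in for the API: ten tweets with ids 1..10, two per page.
def fake_fetch(max_id=None, since_id=None, count=2):
    ids = [i for i in range(10, 0, -1)
           if (max_id is None or i <= max_id)
           and (since_id is None or i > since_id)]
    return [{"id": i} for i in ids[:count]]


all_ids = [t["id"] for t in walk_timeline(fake_fetch)]
new_ids = [t["id"] for t in walk_timeline(fake_fetch, since_id=7)]
```

Storing the highest id from the last run and passing it as since_id is what keeps the next run from re-fetching tweets already in the store.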

Other Objects & APIs
•  Lists
•  Notifications
•  Friendships/exists to see if one follows the other

Hands-on Exercise (15 min)
•  Setup environment – slide #14
•  Sanity Check Environment & Libraries
  o  oscon2012_open_this_first.py
  o  oscon2012_rate_limit_status.py
•  Get objects (show calls)
  o  Lookup users by screen_name - oscon12_users.py
  o  Lookup users by id - oscon12_first_20_ids.py
  o  Lookup tweets - oscon12_tweets.py
  o  Get entities - oscon12_entities.py
•  Inspect the results
•  Explore a little bit
•  Discussion

Twitter REST API
•  https://dev.twitter.com/docs/api
•  What we have been using so far is the REST API
•  Request-Response
•  Anonymous or OAuth
•  Rate Limited :
  o  150/350

Twitter Trends
•  oscon2012-trends.py
•  Trends/weekly, Trends/monthly
•  Let us run some examples
  o  oscon2012_trends_daily.py
  o  oscon2012_trends_weekly.py
•  Trends & hashtags
  o  #hashtag euro2012
  o  http://hashtags.org/euro2012
  o  http://sproutsocial.com/insights/2011/08/twitter-hashtags/
  o  http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
  o  Top 10 : http://twittercounter.com/pages/100, http://twitaholic.com/

Brand Rank w/ Twitter
•  Walk Through & results of following
  o  oscon2012_brand_01.py
•  Followed 10 user-brands for a few days to find growth
•  Brand Rank
  o  Growth of a brand w.r.t the industry
  o  Surge in popularity – could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
•  API : url='https://api.twitter.com/1/users/lookup.json'
•  payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}

Tech Brands
•  Google I/O showed a spike on 6/27-6/28
•  OReillyMedia shares some spike
•  Looking at a few days worth of data, our best inference is that “oscon doesn’t track with googleio”
•  “Clouderati doesn’t track at all”
•  Clouderati is very stable

World of Soccer
•  FOXSoccer, UEFAcom track each other
•  The numbers seldom decrease. So calculating -ve velocity will not work
•  OTOH, if you see a -ve velocity, investigate

World of Basketball
•  NBA, MiamiHeat, okcthunder track each other
•  Used % rather than absolute numbers to compare
•  The hike on 7/6 to 7/10 is interesting.
Rising Tide …
•  For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
•  Is a rising (Twitter) tide lifting all (well, almost all) ?

Trivia : Search API
•  Search (search.twitter.com)
  o  Built by Summize which was acquired by Twitter in 2008
  o  Summize described itself as “sentiment mining”

Search API
•  Very simple
  o  GET http://search.twitter.com/search.json?q=<blah>
•  Based on search criteria
•  “The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets”
•  Recent = Last 6-9 days worth of tweets
•  Anonymous Call
•  Rate Limit
  o  Not No. of calls/hour, but Complexity & Frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search

Search API
•  Filters
  o  Search URL encoded
  o  @ = %40, # = %23
  o  emoticons :) and :(
  o  http://search.twitter.com/search.atom?q=sometimes+%3A)
  o  http://search.twitter.com/search.atom?q=sometimes+%3A(
•  Location Filters, date filters
•  Content searches
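
A sketch of building an encoded search URL with the standard library rather than by hand. The import shown is the Python 3 location; on the Python 2.7 used in this tutorial, quote_plus lives in urllib. The helper search_url and the example queries are made up. Note that quote_plus also encodes the parentheses (%29, %28), which is slightly stricter than the hand-written URLs above.

```python
from urllib.parse import quote_plus   # Python 3; on 2.7: from urllib import quote_plus


def search_url(query, fmt="json"):
    """Build a Search API URL with the query string percent-encoded."""
    return "http://search.twitter.com/search.%s?q=%s" % (fmt, quote_plus(query))


happy = search_url("sometimes :)", fmt="atom")
tagged = search_url("@ksankar #oscon")
```

When calling through requests, passing params={"q": query} does the same encoding, so this helper is mostly useful for logging or debugging the exact URL.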

Streaming API
•  Not request response; but stream
•  Twitter frameworks have the support
•  Rate Limit : Up to 1%
•  Stall warning if the client is falling behind
•  Good Documentation Links
  o  https://dev.twitter.com/docs/streaming-apis/connecting
  o  https://dev.twitter.com/docs/streaming-apis/parameters
  o  https://dev.twitter.com/docs/streaming-apis/processing

Firehose
•  ~ 400 million public tweets/day
•  If you are working with Twitter firehose, I envy you !
•  If you hit real limits, then explore the firehose route
•  AFAIK, it is not cheap, but worth it

API Best Practices
1.  Use JSON
2.  Use user_id rather than screen_name
  o  user_id is constant while screen_name can change
3.  max_id and since_id
  o  For example direct messages, if you have last message use since_id for search
  o  max_id sets how far to go back
4.  Cache as much as you can
5.  Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices, in the reference section, at the end of this presentation
These are gathered from various books, blogs & other media I used for this tutorial. See Reference (at the end) for the sources

Questions ?

Part II
Twitter Network Analysis (SNA)

1.  Collect
2.  Store
3.  Transform & Analyze
4.  Model & Reason
5.  Predict, Recommend & Visualize

Tip 1: Implement as a staged pipeline, never a monolith
Tip 3: Keep the schema simple; don't be afraid to transform
Validate Dataset & re-crawl/refresh
Most important & the ugliest slide in this deck !

Trivia
•  Social Network Analysis originated as Sociometry & the social network was called a sociogram
•  Back then, Facebook was called SocioBinder!
•  Jacob Levy Moreno is considered the originator
o  NYTimes, April 3, 1933, P. 17

Twitter Networks - Definitions
•  Nodes
o  Users
o  #tags
•  Edges
o  Follows
o  Friends
o  @mentions
o  #tags
•  Directed

Twitter Networks - Definitions
•  In-degree
o  Followers
•  Out-degree
o  Friends/Follow
•  Centrality Measures
•  Hubs & Authorities
o  Hubs/Directories tell us where Authorities are
o  “Of Mortals & Celebrities” is more “Twitter-style”
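In-degree and out-degree can be counted directly from a list of directed follow edges. A small sketch in plain Python; the user names and edges are made up for illustration.

```python
from collections import defaultdict

# Hypothetical directed "follows" edges: (follower, followed)
follows = [("alice", "celeb"), ("bob", "celeb"), ("carol", "celeb"),
           ("alice", "bob"), ("celeb", "bob")]

in_degree = defaultdict(int)    # number of followers
out_degree = defaultdict(int)   # number of friends (accounts followed)
for follower, followed in follows:
    out_degree[follower] += 1
    in_degree[followed] += 1

users = sorted(set(in_degree) | set(out_degree))
for user in users:
    print(user, "followers:", in_degree[user], "friends:", out_degree[user])
```

An account with a high in-degree and a low out-degree is the "celebrity" end of the scale; the reverse looks more like a hub or directory.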

Twitter Networks - Properties
•  Concepts From Citation Networks
o  Cocitation
•  Papers cited together by the same papers
•  Common Followers
o  C & G (Followed by F & H)
o  Bibliographic Coupling
•  Cite the same papers
•  Common Friends (i.e. follow same person)
o  D, E, F & H follow C
o  H & F follow C & G
•  So H & F have high coupling
•  Hence, if H follows A, we can recommend F to follow A
[Figure: directed follow graph with nodes A to N]
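The cocitation and coupling example can be worked through with sets. This sketch uses only the follow edges named on the slide (D, E, F & H follow C; H & F follow G); the function names are my own.

```python
# Follow edges from the slide: who each account follows (its "friends")
friends = {
    "D": {"C"},
    "E": {"C"},
    "F": {"C", "G"},
    "H": {"C", "G"},
}


def coupling(a, b):
    """Bibliographic coupling: accounts that both a and b follow."""
    return friends.get(a, set()) & friends.get(b, set())


def cocitation(x, y):
    """Cocitation: accounts that follow both x and y."""
    return {u for u, followed in friends.items()
            if x in followed and y in followed}


def recommend(target, peer):
    """Suggest to `target` what a highly coupled `peer` follows but target does not."""
    return friends.get(peer, set()) - friends.get(target, set())


friends["H"].add("A")   # the slide's scenario: H follows A

print(sorted(coupling("F", "H")))    # shared friends of F and H
print(sorted(cocitation("C", "G")))  # common followers of C and G
print(sorted(recommend("F", "H")))   # what to suggest to F
```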

•  Bipartite/Affiliation Networks
o  Two disjoint subsets
o  The bipartite concept is very relevant to the Twitter social graph
o  Membership in Lists
•  lists vs. users bipartite graph
o  Common #Tags in Tweets
•  #tags vs. members bipartite graph
o  @mention together
•  ? Can this be a bipartite graph
•  ? How would we fold this ?
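One way to fold a bipartite graph is to project it onto one side: two users get an edge when they share a #tag (or a list), weighted by how many they share. A sketch with made-up data; `fold` is an illustrative helper, not code from the tutorial.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical affiliation data: #tag -> users who tweeted it
tag_members = {
    "#cloud": {"ann", "bob", "cy"},
    "#python": {"bob", "cy"},
    "#oscon": {"ann", "dee"},
}


def fold(affiliations):
    """Project the bipartite graph onto users; weight = number of shared groups."""
    weights = defaultdict(int)
    for members in affiliations.values():
        for u, v in combinations(sorted(members), 2):
            weights[(u, v)] += 1
    return dict(weights)


user_graph = fold(tag_members)
for pair, weight in sorted(user_graph.items()):
    print(pair, weight)
```

Swapping keys and values in the input (user -> #tags) gives the other projection, a #tag-to-#tag graph.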

Other Metrics & Mechanisms
•  Kronecker Graph Models
o  Kronecker product is a way of generating self-similar matrices
o  Prof. Leskovec et al. define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
o  Application: Generating models for analysis, prediction, anomaly detection et al.
•  Erdős-Rényi Random Graphs
o  Easy to build a G(n,p) graph
o  Assumes equal likelihood of an edge between any two nodes
o  In a Twitter social network, we can create a more realistic expected distribution (adding the
“social reality” dimension) by inspecting the #tags & @mentions
•  Network Diameter
•  Weak Ties
•  Follower velocity (+ve & -ve), Association strength
o  Unfollow not a reliable measure
o  But an interesting property to investigate when it happens

Not covered here, but potential for an encore !
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
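Both generators are small enough to sketch in plain Python. This is an illustration of the definitions only (the deterministic Kronecker product of adjacency matrices and the G(n,p) model), not the fitted models from the referenced work; the 2 x 2 initiator matrix is an arbitrary example.

```python
import random
from itertools import combinations


def kronecker(a, b):
    """Kronecker product of two square 0/1 adjacency matrices (lists of lists)."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]


def gnp(n, p, seed=None):
    """G(n,p): each of the n*(n-1)/2 possible edges appears with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


initiator = [[1, 1],
             [1, 0]]
k2 = kronecker(initiator, initiator)   # 4 x 4 matrix repeating the initiator's pattern
for row in k2:
    print(row)

edges = gnp(10, 0.3, seed=42)
print(len(edges), "edges out of", 10 * 9 // 2, "possible")
```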

•  Twitter != LinkedIn, Twitter != Facebook
•  Twitter Network == Interest Network
•  Be cognizant of the above when you apply traditional network properties to Twitter
•  For example,
o  Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for Cliques
o  Is diameter a reliable measure for a Twitter Network ?
•  Probably not
o  Do cut sets make sense ?
•  Probably not
o  But citation network principles do apply; we can learn from cliques
o  Bipartite graphs do make sense

Cliques (1 of 2)
•  “Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other”
•  Cohesive subgroup, closely connected
•  Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
•  k-plex clique to discover subgroups in a sparse network; 1-plex being the perfect clique

Ref: Networks, An Introduction - Newman
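The k-plex condition is easy to check for a candidate group: in a group of n members, everyone must be connected to at least n-k of the others. A minimal sketch on a made-up undirected graph; it tests the degree condition only and does not check maximality.

```python
def is_k_plex(adj, subset, k):
    """True if every member of `subset` is adjacent to at least len(subset) - k members.

    `adj` maps each node to the set of its neighbours in an undirected graph.
    Maximality (whether another node could still be added) is not checked.
    """
    members = set(subset)
    need = len(members) - k
    return all(len(adj[node] & members) >= need for node in members)


# Hypothetical undirected graph: a 4-node group missing the single edge a-d
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
group = ["a", "b", "c", "d"]

print(is_k_plex(adj, group, 1))            # is the group a perfect clique?
print(is_k_plex(adj, group, 2))            # is it a near-clique (2-plex)?
print(is_k_plex(adj, ["a", "b", "c"], 1))  # the triangle a-b-c
```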

Cliques (2 of 2)
•  k-core: at least k others in the subset; (n-k)-plex
•  k-clique: no more than k distance away
o  Path inside or outside the subset
o  k-clan or k-club (path inside the subset)
•  We will apply k-plex Cliques for one of our hands-on

Ref: Networks, An Introduction - Newman
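A k-core can be found by pruning: keep removing nodes that have fewer than k neighbours among the nodes still left. A short sketch on a made-up graph, using this standard pruning approach rather than any code from the tutorial.

```python
def k_core(adj, k):
    """Return the nodes of the k-core by repeatedly dropping low-degree nodes."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if len(adj[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive


# Hypothetical undirected graph: a triangle a-b-c with a pendant chain c-d-e
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

print(sorted(k_core(adj, 2)))
print(sorted(k_core(adj, 3)))
```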

Sentiment Analysis
•  Sentiment Analysis is an important & interesting area of work on the Twitter platform
o  Collect Tweets
o  Opinion Estimation - pass through a Classifier, Sentiment Lexicons
•  Naïve Bayes/Max Entropy Classifier/SVM
o  Aggregated Text Sentiment/Moving Average
•  I chose not to dive deeper because of time constraints
o  Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
•  Next 3 Slides have a couple of interesting examples

Sentiment Analysis
•  Twitter Mining for Airline Sentiment
•  Opinion Lexicon: +ve 2000, -ve 4800

http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
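A lexicon-based score can be sketched in a few lines: count positive words, subtract negative words, then smooth with a moving average. The word lists and tweets below are toy placeholders I made up; a real run would load the full opinion lexicon and live tweets.

```python
import re

# Toy lexicon for illustration only; the opinion lexicon cited above has
# roughly 2000 positive and 4800 negative words.
POSITIVE = {"great", "love", "awesome", "thanks"}
NEGATIVE = {"delayed", "lost", "awful", "rude"}


def score(tweet):
    """Lexicon score: (# positive words) - (# negative words)."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def moving_average(values, window):
    """Simple moving average over a sliding window of fixed size."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Made-up tweets, not real data
tweets = [
    "Love the crew, great flight",
    "Flight delayed again, awful",
    "Thanks for the upgrade",
    "They lost my bag and were rude",
]
scores = [score(t) for t in tweets]
print(scores)
print(moving_average(scores, 2))
```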

Need I say more ?
“A bit of clever math can uncover interesting patterns that are not visible to the human eye”

http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf

Interesting Vectors of Exploration

1.  Find trending #tags & then related #tags, using cliques over co-#tag-citation, which infers topics related to trending topics
2.  Related #tag topics over a set of tweets by a user or group of users
3.  Analysis - In/Out flow, Tweet Flow
–  Frequent @mention
4.  Find affiliation networks by List memberships, #tags or frequent @mentions

Interesting Vectors of Exploration

5.  Use centrality measures to determine mortals vs. celebrities
6.  Classify Tweet networks/cliques based on message passing characteristics
–  Tweets vs. Retweets, No. of retweets,…
7.  Retweet Network
–  Measure Influence by retweet count & frequency
–  Information contagion by looking at different retweet network subcomponents: who, when, how much,…
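As a starting point for the related-#tag idea, co-occurrence of #tags within the same tweet can be counted like this. The tweets are made up and the regular expression is a simplification of real hashtag rules.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#\w+")   # simplified hashtag pattern (assumption)


def co_tag_counts(tweets):
    """Count how often each pair of #tags appears in the same tweet."""
    pairs = Counter()
    for text in tweets:
        tags = sorted({t.lower() for t in HASHTAG.findall(text)})
        pairs.update(combinations(tags, 2))
    return pairs


# Made-up tweets for illustration
tweets = [
    "Notes from #oscon on #python and #cloud",
    "#cloud pricing talk at #oscon",
    "Learning #python",
]
counts = co_tag_counts(tweets)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The resulting weighted pairs are the edges of a #tag graph, which is where clique finding would come in.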

Twitter Network Graph Analysis
An Example

Analysis Story Board
•  @clouderati is a popular cloud related Twitter account
•  Goals:
[In this tutorial]
o  Analyze the social graph characteristics of the users who are following the account
•  Dig one level deep, to the followers & friends, of the followers of @clouderati
o  How many cliques ? How strong are they ?
o  Does the @mention support the clique inferences ?
[For you to explore !!]
o  What are the retweet characteristics ?
o  What does the #tag network graph look like ?

Twitter Analysis Pipeline Story Board
Stages, Strategies, APIs & Tasks
Stage 3
o  Get distinct user list applying the set(union(list)) operation
Stage 4
o  Get & Store User details (distinct user list)
o  Unroll
Note: Needed a command buffer to manage scale (~980,000 users)
Note: Unroll stage took time & missteps
Stage 5
o  For each @clouderati follower
o  Find friend=follower - set intersection
Stage 6
o  Create social graph
o  Apply network theory
o  Infer cliques & other properties
@clouderati Twitter Social Graph
•  Stats (Retrospect after the runs):
o  Stage 1
•  @clouderati has 2072 followers
o  Stage 2
•  Limiting followers to 5,000 per user
o  Stage 3
•  Digging 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
o  MongoDB of the cache and intermediate datasets ~10 GB
o  The database was hosted at AWS (Hi Mem XLarge, m2.xlarge), 8 X 15 GB, Raid 10, opened to Internet with DB authentication
Code & Run Walk Through
Stage 1
o  Get @clouderati Followers
o  Store in MongoDB
o  Code:
§  oscon_2012_user_list_spider_01.py
o  Challenges:
§  Nothing fancy
§  Get the record and store
o  Interesting Points:
§  Would have had to recurse through a REST cursor if there were more than 5000 followers
§  @clouderati has 2072 followers
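For accounts with more than 5000 followers, the cursor walk would look roughly like this. It is a sketch, not the code in oscon_2012_user_list_spider_01.py: `fetch_page` is a hypothetical callable, and I am assuming the cursored response shape of `ids` plus `next_cursor`, with -1 as the first cursor and 0 marking the last page.

```python
def all_follower_ids(fetch_page):
    """Yield every follower id by following next_cursor until it is 0.

    `fetch_page(cursor)` is a hypothetical callable that returns a dict
    shaped like {'ids': [...], 'next_cursor': int}.
    """
    cursor = -1   # -1 asks for the first page
    while cursor != 0:
        page = fetch_page(cursor)
        for user_id in page["ids"]:
            yield user_id
        cursor = page["next_cursor"]


# In-memory stand-in for the API (illustration only): three small pages
PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 10},
    10: {"ids": [3, 4], "next_cursor": 20},
    20: {"ids": [5], "next_cursor": 0},
}
follower_ids = list(all_follower_ids(PAGES.__getitem__))
print(follower_ids)
```

A loop is used instead of recursion so that a very long follower list cannot exhaust the call stack.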

Code & Run Walk Through
Stage 2
o  Crawl 1 level deep
o  Get friends & followers
o  Validate, re-crawl & refresh
o  Code:
§  oscon_2012_user_list_spider_02.py
§  oscon_2012_twitter_utils.py
§  oscon_2012_mongo.py
§  oscon_2012_validate_dataset.py
o  Challenges:
§  Multiple runs, errors et al !
o  Interesting Points:
§  Set operation between two mongo collections for restart buffer
§  Protected users, some had 0 followers, or 0 friends
§  Interesting operations for validate, re-crawl and refresh
§  Added “status_code” to differentiate protected users
§  {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
§  Getting friends & followers of 2000 users is the hardest (or so I thought,
until I got through the next stage!)
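The restart buffer idea can be sketched with plain dictionaries standing in for the two MongoDB collections. The collection names, ids and the "200 OK" value are my assumptions for illustration; only the "401 Unauthorized" marker for protected users comes from the slide.

```python
# Stand-ins for two MongoDB collections, keyed by user id (illustration only)
to_crawl = {11: {}, 12: {}, 13: {}, 14: {}}
crawled = {
    11: {"status_code": "200 OK"},             # assumed success marker
    13: {"status_code": "401 Unauthorized"},   # protected user
}

# Restart buffer: ids still pending = to_crawl minus crawled
pending = sorted(set(to_crawl) - set(crawled))

# Validation: pick out protected users so they are not re-crawled forever
protected = sorted(uid for uid, doc in crawled.items()
                   if doc["status_code"].startswith("401"))

print("pending:", pending)
print("protected:", protected)
```

After a failed run, only the pending ids need to be fetched again, which is what makes multiple runs tolerable.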
