The Art of Social Media Analysis with Twitter & Python-OSCON 2012
1. The Art of Social Media Analysis with Twitter & Python
krishna sankar
@ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130
2. Intro
o House Rules (1 of 2)
o Doesn't assume any knowledge of the Twitter API
o Goal: everybody on the same page & a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter Analysis
o Simple programs, to illustrate usage & data manipulation
[Sidebar – Twitter Network Analysis Pipeline: we will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users one level down. Topics: NLP, NLTK, sentiment analysis; @mention cliques, social network graph; retweet analytics, growth, information contagion; #tag network, weak ties]
3. Intro
o House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but I wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o Many areas to explore – not enough time. So decided to focus on the social graph, cliques & networkx
[Sidebar – same Twitter Network Analysis Pipeline diagram as the previous slide]
4. About Me
• Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
o Co-chair – 2012 IEEE Precision Time Synchronization
• http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar
• Prior Gigs
o Lead Architect (Egnyte)
o Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
• Current Focus:
o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
o Big Data (more of variety, variability, context & graphs, than volume or velocity – so far!)
o Overlay-based semantic search & ranking
• Other related Presentations
o http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
o http://goo.gl/0SQDV The Art of Big Data (Detailed)
o http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
5. Twitter Tips – A Baker's Dozen
1. Twitter APIs are (more or less) congruent & symmetric
2. Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
o I was getting numbers when I was expecting screen_names in user objects.
o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
o Always test with one or two records before a long run! – learned the hard way
3. Twitter APIs are very powerful – consistent use can yield huge data
o In a week, you can pull in 4-5 million users & some tweets!
o Night runs are far faster & error-free
4. Use a NOSQL data store as a command buffer & data buffer
o Would make it easy to work with Twitter at scale
o I use MongoDB
o Keep the schema simple & no fancy transformation
• And as far as possible the same as the (json) response
o Use the NOSQL CLI for trimming records
6. Twitter Tips – A Baker's Dozen
5. Always use a big data pipeline
o Collect – Store – Transform & Analyze – Model & Reason – Predict, Recommend & Visualize
o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Use a functional approach for a scalable pipeline
o Compose your big data pipeline with well-defined granular functions, each doing only one thing
o Don't overload the functional components (i.e. no collect, unroll & store as a single component)
o Have well-defined functional components with appropriate caching, buffering, checkpoints & restart techniques
• This did create some trouble for me, as we will see later
7. Crawl-Store-Validate-Recrawl-Refresh cycle
o The equivalent of the traditional ETL
o The validation stage & validation routines are important
• Cannot expect perfect runs
• Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them
o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
o There will be a separate printout of the control numbers that will be kept in the operations files
7. Twitter Tips – A Baker's Dozen
9. Program defensively
o more so for REST-based big data analytics systems
o Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
o Fail fast & move on
o Don't linger and try to fix errors that cannot be controlled at that layer
o A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls
o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
o I have an example in part 2
11. Data will never be perfect
o Know your data & accommodate for its idiosyncrasies
• for example: 0 followers, protected users, 0 friends, …
8. Twitter Tips – A Baker's Dozen
12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
o See a MongoDB example in Part 2
13. Don't bombard the URL
o Wait a few seconds between successful calls. This will end up with a scalable system, eventually
o I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing
o Kind of an early warning when something is wrong
15. Develop incrementally; don't fail to check "cut & paste" errors
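Tips 12 & 13 can be combined into one small loop. A minimal sketch, assuming a plain in-memory list stands in for the real MongoDB-backed command buffer (the `run_commands`/`do_call` names are mine, not from the deck): each pending API call is a "command", checkpointed as done right after it succeeds, so a crashed run can be restarted and will skip work already completed.

```python
import time

# Hypothetical sketch of a re-startable command buffer (tips 12 & 13).
# In the real pipeline each command document lives in MongoDB; here a
# plain dict per command keeps the sketch self-contained.

def run_commands(commands, do_call, wait_secs=0):
    """Run pending commands, marking each as done (the checkpoint)."""
    for cmd in commands:
        if cmd.get("done"):            # processed in an earlier run - skip
            continue
        cmd["result"] = do_call(cmd["user_id"])
        cmd["done"] = True             # checkpoint: persist this in real life
        if wait_secs:
            time.sleep(wait_secs)      # tip 13: don't bombard the URL
    return commands

buffer = [{"user_id": 42}, {"user_id": 99, "done": True}]
run_commands(buffer, do_call=lambda uid: {"id": uid})
```

Because the checkpoint is written after every call, re-running the same buffer is idempotent: already-done commands are skipped and only the gap is filled in.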
9. Twitter Tips – A Baker's Dozen
16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce
o But first:
§ Prototype as a linear system,
§ Optimize and tweak the functional modules & cache strategies,
§ Note down stages and tasks that can be parallelized and
§ Then parallelize them
o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Pay attention to handoffs between stages
o They might require transformation – for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform
o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
o Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
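The unroll/flatten stage from tip 17 is just a generator. A minimal sketch, assuming the collect stage stored one array of follower ids per crawled account (the `unroll` name and sample data are mine):

```python
# Hypothetical sketch of tip 17: collect stores follower ids as one array
# per account; a separate unroll/flatten stage turns each array into one
# document per user, the shape the model stage needs for aggregation.

def unroll(collected):
    """Transform {account: [follower_ids]} into per-follower documents."""
    for account, follower_ids in collected.items():
        for fid in follower_ids:
            yield {"follower_id": fid, "follows": account}

docs = list(unroll({"clouderati": [11, 22, 33]}))
```

Keeping this as its own granular stage (rather than folding it into collect) is exactly the "don't overload the functional components" rule from tip 6.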
10. Twitter Tips – A Baker's Dozen
19. Understand the underlying network characteristics for the inference you want to make
o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
o The Twitter Network is more of an Interest Network
o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
o But others, like Cliques and Bipartite Graphs, do
11. Twitter Gripes
1. Need richer APIs for #tags
o Somewhat similar to users viz. followers, friends et al
o Might make sense to make #tags a top-level object with its own semantics
2. HTTP error returns are not uniform
o Returns 400 Bad Request instead of 420
o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. "following" vs. "friends_count" – i.e. "following" is a dummy variable.
o There are a few like this, most probably for backward compatibility
5. Parameter validation is not uniform
o Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
6. Overall, more validation would help
o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
12. A Fork
• NLP – a deep dive into Tweets w/ NLTK
o Sentiment Analysis
• Not enough time for both
• I chose the Social Graph route
13. A minute about Twitter as a platform & its evolution
https://dev.twitter.com/blog/delivering-consistent-twitter-experience
"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." – Chenda, CBS News
".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" – Michael
My Wish & Hope
• I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
• I did like the fact that tweets were part of LinkedIn. I still used Twitter more than LinkedIn
o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
• I sincerely hope that the platform grows with a rich developer ecosystem
• An orthogonally extensible platform is essential
• Of course, along with a congruent user experience – "… core Twitter consumption experience through consistent tools"
14. Setup – For Hands-on Today
o Python 2.7.3
o easy_install -v requests
• http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
o easy_install -v requests-oauth
o Hands-on programs at https://github.com/xsankar/oscon2012-handson
• For advanced data science with social graphs
o easy_install -v networkx
o easy_install -v numpy
o easy_install -v nltk
• Not for this tutorial, but good for sentiment analysis et al
o MongoDB
• I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
o graphviz – http://www.graphviz.org/; easy_install pygraphviz
o easy_install pydot
16. Problem Domain for this tutorial
• Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
o Not for Twitter-based apps for real-time tweets
o Not web sites with real-time tweets
• By looking at the domain in aggregate to derive inferences & actionable recommendations
• Which also means you need to be deliberate & systemic (i.e. not treat a fluctuation as a trend, but dig deeper before pronouncing a trend)
17. Agenda
I. Mechanics: Twitter API (1:30 PM - 3:00 PM)
o Essential Fundamentals (Rate Limit, HTTP Codes et al)
o Objects
o API
o Hands-on (2:45 PM - 3:00 PM)
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
o Underlying Concepts
o Social Graph Analysis of @clouderati
§ Stages, Strategies & Tasks
§ Code Walk-thru
19. Twitter API: Read These First
• Using the Twitter Brand
o New logo & associated guidelines: https://twitter.com/about/logos
o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
• Read These Links First
1. https://dev.twitter.com/docs/things-every-developer-should-know
2. https://dev.twitter.com/docs/faq
3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
4. Security: https://dev.twitter.com/docs/security-best-practices
5. Media Best Practices: https://dev.twitter.com/media
6. Consolidated Page: https://dev.twitter.com/docs
7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
8. How to Appeal (not that you all would need it!): https://support.twitter.com/articles/72585
• Only one version of the Twitter APIs
20. API Status Page
• https://dev.twitter.com/status
• https://dev.twitter.com/issues
• https://dev.twitter.com/discussions
22. Open This First
• Install pre-reqs as per the setup slide
• Run
o oscon2012_open_this_first.py
o To test connectivity – a "canary query"
• Run
o oscon2012_rate_limit_status.py
o Use http://www.epochconverter.com to check reset_time
• Formats: xml, json, atom & rss
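The rate-limit check can be sketched without the (long-retired) v1 endpoint by injecting the fetch call. This is a hypothetical reconstruction, not the deck's actual script: `rate_limit_status` parses the fields the v1 response carried (`remaining_hits`, `reset_time_in_seconds`) and converts the epoch timestamp the way epochconverter.com would.

```python
import datetime

# Hypothetical sketch of what oscon2012_rate_limit_status.py checks.
# `fetch` is injectable so the parsing can be exercised with a stub.

def rate_limit_status(fetch):
    status = fetch("https://api.twitter.com/1/account/rate_limit_status.json")
    reset = datetime.datetime.utcfromtimestamp(status["reset_time_in_seconds"])
    return status["remaining_hits"], reset

# stub standing in for requests.get(url).json()
stub = lambda url: {"remaining_hits": 150, "reset_time_in_seconds": 1341546334}
remaining, reset_at = rate_limit_status(stub)  # 150 calls left, resets Jul 6 2012
```

With a live API you would pass `lambda url: requests.get(url).json()` instead of the stub.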
23. Twitter API
[Diagram: the Twitter API families around the Core Twitter Objects –
REST: core data, build profile, create/post tweets, reply, favorite, re-tweet. Rate limit: 150/350.
Search: keywords, specific user, trends. Rate limit: complexity & frequency.
Streaming: near-realtime, high volume; follow users, topics, data mining; Public Streams, User Streams, Site Streams, Firehose.]
25. Rate Limits
• By API type & Authentication Mode

API        | No authC                | authC  | Error
REST       | 150/hr                  | 350/hr | 400
Search     | Complexity & Frequency  | -N/A-  | 420
Streaming  | Up to 1%                |        |
Firehose   | none                    | none   |
33. Unexplained Errors
While trying to get details of 1,000,000 users, I get this error – usually 10 PM-6 AM PST:

Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C… (long user_id list truncated)

• Got around it by "Trap & wait 5 seconds"
• Night runs are relatively error free
34. A Day in the Life of the Twitter Rate Limit
• {
•   …
•   "date": "Fri, 06 Jul 2012 03:41:09 GMT",
•   "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
•   "server": "tfe",
•   "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
•   "status": "400 Bad Request",
•   "vary": "Accept-Encoding",
•   "x-ratelimit-class": "api_identified",
•   "x-ratelimit-limit": "350",
•   "x-ratelimit-remaining": "0",          <- Missed by 4 min!
•   "x-ratelimit-reset": "1341546334",
•   "x-runtime": "0.01918"
• }
• Error, sleeping
• {
•   …
•   "date": "Fri, 06 Jul 2012 03:46:12 GMT",
•   …
•   "status": "200 OK",
•   …
•   "x-ratelimit-class": "api_identified",
•   "x-ratelimit-limit": "350",
•   "x-ratelimit-remaining": "349",        <- OK after 5 min sleep
•   …
35. Strategies
I have no exotic strategies, so far!
1. Obvious: track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism – for example AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit
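Strategy 1 falls out of the `x-ratelimit-*` headers shown on the previous slide. A minimal sketch (the `seconds_until_reset` helper is mine, not from the deck): sleep only when the remaining count hits zero, and sleep exactly until the reset epoch instead of a fixed guess.

```python
# Hypothetical sketch of strategy 1: compute the sleep from the
# x-ratelimit-* response headers rather than a fixed interval.

def seconds_until_reset(headers, now):
    """Return 0 if calls remain, else seconds to sleep until the window resets."""
    remaining = int(headers.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0
    reset_epoch = int(headers["x-ratelimit-reset"])
    return max(0, reset_epoch - now)

# e.g. the 400 response from the previous slide: remaining=0, reset in 5 min
hdrs = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341546334"}
sleep_for = seconds_until_reset(hdrs, now=1341546034)  # 300 seconds
```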
37. Authentication
• Three modes
o Anonymous
o HTTP Basic Auth
o OAuth
• As of Aug 31, 2010, only Anonymous or OAuth are supported
• OAuth enables the user to authorize an application without sharing credentials
• Also has the ability to revoke
• Twitter supports OAuth 1.0a
• OAuth 2.0 is the new standard, much simpler
o No timeframe for Twitter support, yet
38. OAuth Pragmatics
• Helpful Links
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
• Discussion of OAuth internal mechanisms is better left for another day
• For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
• Create an application & get the four credential pieces
o Consumer Key, Consumer Secret, Access Token & Access Token Secret
• All the frameworks have support for OAuth. So plug in these values & use the framework's calls
• I used the requests-oauth library like so:
39. requests-oauth

def get_oauth_client():
    # Get the client using the token, key & secret from dev.twitter.com/apps
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret,
                           consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)

def get_followers_with_oauth(user_id, client):
    # Use the client instead of requests
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)

Ref: http://pypi.python.org/pypi/requests-oauth
40. OAuth Authorize screen
• The user authenticates with Twitter & grants access to Forbes Social
• Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account
42. HTTP Status Codes
• 0 Never made it to Twitter servers – library error
• 200 OK
• 304 Not Modified
• 400 Bad Request
o Check the error message for explanation
o REST Rate Limit!
• 401 Unauthorized
o Beware – you could get this for other reasons as well
• 403 Forbidden
o Hit update limit (> max Tweets/day, following too many people)
• 404 Not Found
• 406 Not Acceptable
• 413 Too Long
• 416 Range Unacceptable
• 420 Enhance Your Calm
o Rate Limited
• 500 Internal Server Error
• 502 Bad Gateway
o Down for maintenance
• 503 Service Unavailable
o Overloaded – "Fail whale"
• 504 Gateway Timeout
o Overloaded
https://dev.twitter.com/docs/error-codes-responses
44. HTTP Status Code – Confusing Example
• GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
• Spelling mistake
o Should be screen_name
• But a confusing error!
• Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
• Response headers:
{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
• Body:
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
45. HTTP Status Code – Example
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  …
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm="https://api.twitter.com"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. more parameters than allowed)
47. Twitter Platform Objects
[Diagram: Users follow / are followed by Friends & Followers; Users create Status Updates (Tweets); Tweets embed Entities (@ user_mentions, urls, media, # hashtags) & Places; Tweets are temporally ordered in a TimeLine]
https://dev.twitter.com/docs/platform-objects
48. Tweets
• A.k.a Status Updates
• Interesting fields
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
• perspectival attributes embedded within a child object of an unlike parent – hard to maintain at scale
• https://dev.twitter.com/docs/faq#6981
o withheld_in_countries
• https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
49. A word about id, id_str
• June 1, 2010
o Snowflake, the id generator service
o "The full ID is composed of a timestamp, a worker number, and a sequence number"
o JavaScript had problems handling numbers > 53 bits
o "id": 819797
o "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
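The 53-bit problem is easy to demonstrate. An IEEE-754 double (JavaScript's only number type) cannot represent every integer above 2**53, which is why Snowflake-era ids ship twice, as a number and as a string:

```python
# Why id_str exists: a double loses integer precision above 2**53,
# so large snowflake ids get mangled unless carried as strings.
big_id = 2 ** 53 + 1                      # 9007199254740993

assert float(big_id) != big_id            # the double rounds it away
assert str(big_id) == "9007199254740993"  # the string keeps every digit
```

Python ints are arbitrary precision, so `"id"` survives a Python `json.loads`; in JavaScript it would not, which is exactly the client Twitter had trouble with. Moral: always use `id_str` when the id crosses a JSON boundary.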
50. Tweets – example
• Let us run oscon2012-tweets.py
• Example of a tweet
o coordinates
o id
o id_str
52. Users – Let us run some examples
• Run
o oscon_2012_users.py
• Lookup users by screen_name
o oscon12_first_20_ids.py
• Lookup users by user_id
• Inspect the results
o id, name, status, status_count, protected, followers (for top 10 followers), withheld users
• Can use the information for customizing the user's screen in your web app
53. Entities
• Metadata & contextual information
• You could parse them out of the tweet text yourself, but Entities give them to you as structured data
• REST API/Search API – include_entities=1
• Streaming API – included by default
• hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
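A minimal sketch of working with the structured data, assuming the entity shapes documented at the platform-objects link above (the `extract_entities` helper and the sample tweet are mine, not from the deck):

```python
# Hypothetical sketch: pulling structured data out of a tweet's
# "entities" field instead of parsing the tweet text.

def extract_entities(tweet):
    ent = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in ent.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in ent.get("urls", [])],
        "mentions": [m["screen_name"] for m in ent.get("user_mentions", [])],
    }

tweet = {"text": "See you at #oscon @clouderati http://t.co/x",
         "entities": {"hashtags": [{"text": "oscon"}],
                      "urls": [{"expanded_url": "http://oscon.com/"}],
                      "user_mentions": [{"screen_name": "clouderati"}]}}
info = extract_entities(tweet)
```

The `.get(…, [])` defaults matter in practice: per tip 11, real data is never perfect, and tweets without entities should yield empty lists, not exceptions.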
54. Entities
• Run
o oscon2012_entities.py
• Inspect hashtags, urls et al
55. Places
• attributes
• bounding_box
• id (as a string!)
• country
• name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
56. Places
• Can search for tweets near a place like so:
• Get the latlong of the convention center [45.52929, -122.66289]
o Tweets near that place
• Tweets near San Jose [37.395715, -122.102308]
• We will not go further here. But very useful
57. Timelines
• Collections of tweets ordered by time
• Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
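The max_id navigation described at that link can be sketched as a loop: each page asks for tweets strictly older than the lowest id seen so far. This is a hypothetical illustration with an injectable `fetch_page` (the stub and names are mine), so the pagination logic runs without the API:

```python
# Hypothetical sketch of max_id navigation: ids decrease as tweets get
# older, so "next page" means ids <= (lowest id seen so far) - 1.

def walk_timeline(fetch_page):
    tweets, max_id = [], None
    while True:
        page = fetch_page(max_id)
        if not page:
            return tweets
        tweets.extend(page)
        max_id = min(t["id"] for t in page) - 1   # next page: strictly older

# stub timeline: ids 10..1, served three at a time
def stub(max_id, ids=list(range(10, 0, -1))):
    older = [i for i in ids if max_id is None or i <= max_id]
    return [{"id": i} for i in older[:3]]

all_tweets = walk_timeline(stub)   # 10 tweets, newest to oldest
```

With a live API, `fetch_page` would be a `user_timeline.json` call passing `max_id` as a parameter (and `since_id` to bound the other end).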
58. Other Objects & APIs
• Lists
• Notifications
• Friendships/exists to see if one follows the other
59. Twitter Platform Objects
[Recap of the platform-objects diagram: Users follow / are followed by Friends & Followers; Tweets embed Entities (@ user_mentions, urls, media, # hashtags) & Places; Tweets are temporally ordered in a TimeLine]
https://dev.twitter.com/docs/platform-objects
60. Hands-on Exercise (15 min)
• Setup environment – slide #14
• Sanity check environment & libraries
o oscon2012_open_this_first.py
o oscon2012_rate_limit_status.py
• Get objects (show calls)
o Lookup users by screen_name - oscon12_users.py
o Lookup users by id - oscon12_first_20_ids.py
o Lookup tweets - oscon12_tweets.py
o Get entities - oscon12_entities.py
• Inspect the results
• Explore a little bit
• Discussion
62. Twitter API
[Recap of the API-families diagram: REST (core data, build profile, create/post tweets; rate limit 150/350), Search (keywords, specific user, trends; rate limit: complexity & frequency), Streaming (near-realtime, high volume; Public/User/Site Streams, Firehose)]
63. Twitter REST API
• https://dev.twitter.com/docs/api
• What we were doing was the REST API
• Request-Response
• Anonymous or OAuth
• Rate Limited:
o 150/350
64. Twitter Trends
• oscon2012-trends.py
• Trends/weekly, Trends/monthly
• Let us run some examples
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py
• Trends & hashtags
o #hashtag euro2012
o http://hashtags.org/euro2012
o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
65. Brand Rank w/ Twitter
• Walk through & results of the following
o oscon2012_brand_01.py
• Followed 10 user-brands for a few days to find growth
• Brand Rank
o Growth of a brand w.r.t. the industry
o A surge in popularity could be due to –ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
• API: url='https://api.twitter.com/1/users/lookup.json'
• payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
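Once the daily follower counts are stored, the comparison itself is tiny. A hypothetical sketch (the `pct_growth` helper and the sample counts are mine; the later slides note that % works better than absolute numbers for comparing brands of different sizes):

```python
# Hypothetical sketch of the Brand Rank computation: day-over-day
# percentage growth per brand, from stored daily follower counts.

def pct_growth(counts):
    """Day-over-day % growth for a list of follower counts."""
    return [round(100.0 * (b - a) / a, 2) for a, b in zip(counts, counts[1:])]

daily = {"oscon": [9800, 9900, 10100],        # made-up counts
         "clouderati": [2072, 2073, 2074]}
growth = {brand: pct_growth(c) for brand, c in daily.items()}
```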
67. Brand Rank w/ Twitter – Tech Brands
• Google I/O showed a spike on 6/27-6/28
• OReillyMedia shares some of that spike
• Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
• "Clouderati doesn't track at all"
68. Brand Rank w/ Twitter – World of Soccer
• FOXSoccer & UEFAcom track each other
• The numbers seldom decrease, so calculating –ve velocity will not work
• OTOH, if you see a –ve velocity, investigate
69. Brand Rank w/ Twitter – World of Basketball
• NBA, MiamiHeat & okcthunder track each other
• Used % rather than absolute numbers to compare
• The hike from 7/6 to 7/10 is interesting.
70. Brand Rank w/ Twitter – Rising Tide …
• For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
• Is a rising (Twitter) tide lifting all (well, almost all)?
71. Trivia: Search API
• Search (search.twitter.com)
o Built by Summize, which was acquired by Twitter in 2008
o Summize described itself as "sentiment mining"
72. Search API
• Very simple
o GET http://search.twitter.com/search.json?q=<blah>
• Based on a search criteria
• "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
• Recent = last 6-9 days' worth of tweets
• Anonymous call
• Rate Limit
o Not no. of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
73. Search API
• Filters
o Search terms are URL-encoded
o @ = %40, # = %23
o emoticons :) and :(
o http://search.twitter.com/search.atom?q=sometimes+%3A)
o http://search.twitter.com/search.atom?q=sometimes+%3A(
• Location filters, date filters
• Content searches
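The encodings above don't need to be done by hand; Python's standard library produces them. A small sketch (the variable names are mine):

```python
# The emoticon searches are just URL encoding: ":" is %3A, "(" is %28,
# ")" is %29, "@" is %40 and "#" is %23. The stdlib does it for you.
try:
    from urllib.parse import quote_plus        # Python 3
except ImportError:
    from urllib import quote_plus              # Python 2, as in the deck

q_happy = quote_plus("sometimes :)")    # 'sometimes+%3A%29'
q_tag = quote_plus("#oscon")            # '%23oscon'
url = "http://search.twitter.com/search.json?q=" + q_happy
```

(With the requests library, passing a `params` dict does the same encoding automatically.)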
74. Streaming API
• Not request-response, but a stream
• The Twitter frameworks have the support
• Rate Limit: up to 1%
• Stall warning if the client is falling behind
• Good documentation links
o https://dev.twitter.com/docs/streaming-apis/connecting
o https://dev.twitter.com/docs/streaming-apis/parameters
o https://dev.twitter.com/docs/streaming-apis/processing
75. Firehose
• ~400 million public tweets/day
• If you are working with the Twitter firehose, I envy you!
• If you hit real limits, then explore the firehose route
• AFAIK it is not cheap, but worth it
76. API Best Practices
1. Use JSON
2. Use user_id rather than screen_name
o user_id is constant, while screen_name can change
3. max_id and since_id
o For example with direct messages: if you have the last message, use since_id for the search
o max_id sets how far to go back
4. Cache as much as you can
5. Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the reference section, at the end of this presentation.
These are gathered from various books, blogs & other media I used for this tutorial. See References (at the end) for the sources.
77. Twitter API – Questions?
[Recap of the API-families diagram: REST (core data, build profile, create/post tweets; rate limit 150/350), Search (keywords, specific user, trends; rate limit: complexity & frequency), Streaming (near-realtime, high volume; Public/User/Site Streams, Firehose)]
79. [Diagram: 1. Collect -> 2. Store -> 3. Transform & Analyze -> 4. Model & Reason -> 5. Predict, Recommend & Visualize; Validate the dataset & re-crawl/refresh]
Most important & the ugliest slide in this deck!
Tip: Implement a staged pipeline, never a monolith
Tip: Keep the schema simple; don't be afraid to transform
80. Trivia
• Social Network Analysis originated as Sociometry & the social network was called a sociogram
• Back then, Facebook was called SocioBinder!
• Jacob Levy Moreno is considered the originator
o NYTimes, April 3, 1933, p. 17
82. Twitter Networks – Definitions
• In-degree
o Followers
• Out-degree
o Friends/Follow
• Centrality Measures
• Hubs & Authorities
o Hubs/Directories tell us where Authorities are
o "Of Mortals & Celebrities" is more "Twitter-style"
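The definitions map directly onto a directed graph. A minimal sketch with networkx (the toy graph and account names are made up), where an edge u -> v means "u follows v":

```python
import networkx as nx

# In-degree = followers, out-degree = friends, on a toy follow graph.
G = nx.DiGraph()
G.add_edges_from([("alice", "celeb"), ("bob", "celeb"),
                  ("carol", "celeb"), ("celeb", "alice")])

followers = G.in_degree("celeb")    # 3 - celebrity-ish
friends = G.out_degree("celeb")     # 1
```

The "mortals vs. celebrities" split later in the deck is essentially a threshold on the ratio of these two numbers.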
83. Twitter Networks – Properties
• Concepts from Citation Networks
o Cocitation
• Common papers that cite a paper
• Common Followers
o C & G (followed by F & H)
o Bibliographic Coupling
• Cite the same papers
• Common Friends (i.e. follow the same person)
o D, E, F & H
[Diagram: an example follow graph with nodes A-N]
84. Twitter Networks – Properties
• Concepts from Citation Networks
o Cocitation
• Common papers that cite a paper
• Common Followers
o C & G (followed by F & H)
o Bibliographic Coupling
• Cite the same papers
• Common Friends (i.e. follow the same person)
o D, E, F & H follow C
o H & F follow C & G
• So H & F have high coupling
• Hence, if H follows A, we can recommend that F follow A
[Diagram: the same example follow graph, nodes A-N]
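Both citation-network ideas reduce to set intersections over the follow relation. A sketch on the slide's example (the `follows` encoding and helper names are mine; edges point from follower to followed):

```python
# Sketch of cocitation & bibliographic coupling with plain sets.
# follows[u] is the set of accounts u follows.
follows = {"H": {"C", "G", "A"}, "F": {"C", "G"}, "D": {"C"}, "E": {"C"}}

def coupling(u, v):
    """Bibliographic coupling: friends (followees) u and v share."""
    return follows.get(u, set()) & follows.get(v, set())

def cocitation(x, y):
    """Cocitation: followers that x and y share."""
    return {u for u, out in follows.items() if x in out and y in out}

assert coupling("H", "F") == {"C", "G"}    # high coupling, as on the slide
assert cocitation("C", "G") == {"H", "F"}  # common followers of C & G
# Since H & F are highly coupled and H follows A,
# A is a reasonable follow recommendation for F.
```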
85. Twitter Networks – Properties
• Bipartite/Affiliation Networks
o Two disjoint subsets
o The bipartite concept is very relevant to the Twitter social graph
o Membership in Lists
• lists vs. users bipartite graph
o Common #Tags in Tweets
• #tags vs. members bipartite graph
o @mention together
• ? Can this be a bipartite graph
• ? How would we fold this?
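"Folding" an affiliation network means projecting the bipartite graph onto one of its two node sets. A sketch with networkx for the lists-vs-users case (the list names and users are made up): two users become linked when they share a list membership.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Lists-vs-users affiliation network, folded onto the users.
B = nx.Graph()
B.add_edges_from([("cloud-folks", "alice"), ("cloud-folks", "bob"),
                  ("python-folks", "bob"), ("python-folks", "carol")])

users = bipartite.projected_graph(B, ["alice", "bob", "carol"])
# alice-bob share "cloud-folks"; bob-carol share "python-folks";
# alice and carol share nothing, so no edge.
```

The same fold works for the #tags-vs-members graph; for weighted overlap counts, `bipartite.weighted_projected_graph` is the variant to reach for.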
86. Other Metrics & Mechanisms
• Kronecker Graph Models
o The Kronecker product is a way of generating self-similar matrices
o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
o Application: generating models for analysis, prediction, anomaly detection et al
• Erdos-Renyi Random Graphs
o Easy to build a Gn,p graph
o Assumes equal likelihood of edges between two nodes
o In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
• Network Diameter
• Weak Ties
• Follower velocity (+ve & –ve), association strength
o Unfollow is not a reliable measure
o But an interesting property to investigate when it happens
Not covered here, but potential for an encore!
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
87. Twitter Networks – Properties
• Twitter != LinkedIn, Twitter != Facebook
• Twitter Network == Interest Network
• Be cognizant of the above when you apply traditional network properties to Twitter
• For example,
o Six degrees of separation doesn't make sense (most of the time) in Twitter – except maybe for Cliques
o Is diameter a reliable measure for a Twitter Network?
• Probably not
o Do cut sets make sense?
• Probably not
o But citation network principles do apply; we can learn from cliques
o Bipartite graphs do make sense
88. Cliques (1 of 2)
• "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
• Cohesive subgroup, closely connected
• Near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
• Use k-plex cliques to discover subgroups in a sparse network; a 1-plex being the perfect clique
Ref: Networks, An Introduction – Newman
89. Cliques (2 of 2)
• k-core – at least k others in the subset; (n-k)-plex
• k-clique – no more than k distance away
o Path inside or outside the subset
o k-clan or k-club (path inside the subset)
• We will apply k-plex Cliques for one of our hands-on
Ref: Networks, An Introduction – Newman
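The k-core is found by iteratively peeling low-degree nodes — a minimal sketch, with an illustrative toy graph (networkx's `k_core` is the library version):

```python
def k_core(edges, k):
    """Iteratively peel nodes of degree < k; the survivors form the k-core
    (every remaining node has at least k neighbours inside the subset)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for v in [v for v in adj if len(adj[v]) < k]:
            for nbr in adj.pop(v):
                if nbr in adj:
                    adj[nbr].discard(v)
            changed = True
    return set(adj)

# A triangle (a,b,c) with a pendant node d hanging off c
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(k_core(edges, 2))  # the triangle survives; d is peeled off
```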
90. Sentiment Analysis
• Sentiment Analysis is important & interesting work on the Twitter platform
o Collect Tweets
o Opinion Estimation – pass thru Classifier, Sentiment Lexicons
• Naïve Bayes/Max Entropy Classifier/SVM
o Aggregated Text Sentiment/Moving Average
• I chose not to dive deeper because of time constraints
o Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
• The next 3 slides have a couple of interesting examples
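The lexicon-plus-moving-average pipeline above can be sketched in a few lines. This is a toy illustration, not the tutorial's code — the tiny lexicon and sample tweets are made up; real work would use a curated sentiment lexicon and a trained classifier (e.g. NLTK's Naïve Bayes):

```python
# Hypothetical toy lexicon mapping words to sentiment scores
LEXICON = {"love": 1, "great": 1, "awesome": 1, "hate": -1, "awful": -1, "bad": -1}

def tweet_score(text):
    """Crude opinion estimate: sum lexicon scores of the words in a tweet."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def moving_average(scores, window):
    """Smooth per-tweet scores into an aggregated sentiment signal."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

tweets = ["love this awesome cloud", "awful outage today", "great recovery"]
scores = [tweet_score(t) for t in tweets]
print(scores)                     # [2, -1, 1]
print(moving_average(scores, 2))  # [0.5, 0.0]
```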
92. Need I say more ?
“A bit of clever math can uncover interesting patterns that are not visible to the human eye”
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
95. Interesting Vectors of Exploration
1. Find trending #tags & then related #tags – using cliques over co-#tag-citation, which infers topics related to trending topics
2. Related #tag topics over a set of tweets by a user or group of users
3. Analysis of In/Out flow, Tweet Flow
– Frequent @mention
4. Find affiliation networks by List memberships, #tags or frequent @mentions
96. Interesting Vectors of Exploration
5. Use centrality measures to determine mortals vs. celebrities
6. Classify Tweet networks/cliques based on message passing characteristics
– Tweets vs. Retweets, no. of retweets,…
7. Retweet Network
– Measure influence by retweet count & frequency
– Information contagion by looking at different retweet network subcomponents – who, when, how much,…
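Vector 5 (mortals vs. celebrities) is the simplest centrality measure — degree centrality, sketched below with an illustrative follower graph (networkx's `degree_centrality` is the library equivalent):

```python
from collections import Counter

def degree_centrality(edges):
    """Fraction of the other nodes each node is connected to; high values
    flag the 'celebrity' accounts, low values the mortals."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(deg)
    return {v: d / (n - 1) for v, d in deg.items()}

# Hypothetical follower graph: 'celeb' is connected to everyone else
edges = [("celeb", u) for u in ("u1", "u2", "u3", "u4")] + [("u1", "u2")]
cent = degree_centrality(edges)
print(max(cent, key=cent.get))  # 'celeb'
```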
98. Analysis Story Board
• @clouderati is a popular cloud related Twitter account
• Goals (in this tutorial):
o Analyze the social graph characteristics of the users who are following the account
o Dig one level deep, to the followers & friends of the followers of @clouderati
o How many cliques ? How strong are they ?
o Does the @mention support the clique inferences ?
• For you to explore !!
o What are the retweet characteristics ?
o What does the #tag network graph look like ?
99. Twi5er Analysis Pipeline Story Board
Stages, Strategies, APIs & Tasks
o Stage 3 : Get the distinct user list, applying the set(union(list)) operation
o Stage 4 : For each follower of @clouderati, get & store user details (distinct user list)
• Note: Needed a command buffer to manage missteps
o Stage 5 : Unroll the friend=follower set; find the intersection
• Note: Unroll stage took time to scale (~980,000 users)
o Stage 6 : Create the social network graph; apply network theory; infer cliques & other properties
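Stage 3's set(union(list)) operation is a one-liner in Python — a minimal sketch with hypothetical per-user id lists standing in for the MongoDB documents:

```python
# Hypothetical follower/friend id lists pulled per user from MongoDB
follower_lists = [
    [101, 102, 103],
    [102, 104],
    [103, 104, 105],
]

# Stage 3 distills the crawl into one distinct user list via set union
distinct_users = set().union(*follower_lists)
print(sorted(distinct_users))  # [101, 102, 103, 104, 105]
```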
100. @clouderati Twitter Social Graph
• Stats (retrospect after the runs):
o Stage 1
• @clouderati has 2072 followers
o Stage 2
• Limiting followers to 5,000 per user
o Stage 3
• Digging the 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
o MongoDB cache and intermediate datasets ~10 GB
o The database was hosted at AWS (Hi-Mem XLarge – m2.xlarge), 8 x 15 GB, RAID 10, opened to the Internet with DB authentication
101. Code & Run Walk Through
Stage 1 : Get @clouderati Followers; Store in MongoDB
o Code:
§ oscon_2012_user_list_spider_01.py
o Challenges:
§ Nothing fancy – get the record and store
o Interesting Points:
§ Would have had to recurse through a REST cursor if there were more than 5000 followers
§ @clouderati has 2072 followers
102. Code & Run Walk Through
Stage 2 : Crawl 1 level deep; Get friends & followers; Validate, re-crawl & refresh
o Code:
§ oscon_2012_user_list_spider_02.py
§ oscon_2012_twitter_utils.py
§ oscon_2012_mongo.py
§ oscon_2012_validate_dataset.py
o Challenges:
§ Multiple runs, errors et al !
§ Protected users; some had 0 followers, or 0 friends
o Interesting Points:
§ Set operation between two mongo collections for the restart buffer
§ Interesting operations for validate, re-crawl and refresh
§ Added “status_code” to differentiate protected users
§ {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
§ Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
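The restart-buffer set operation between two mongo collections reduces to a set difference once the ids are in memory. A minimal sketch with hypothetical id sets — in the real pipeline these would come from something like `set(db.targets.distinct('id'))` and `set(db.crawled.distinct('id'))`:

```python
# Hypothetical id sets standing in for two mongo collections
target_ids = {1, 2, 3, 4, 5}   # users we want crawled
crawled_ids = {1, 2, 4}        # users already fetched before the crash/restart

# The restart buffer is simply the set difference: users still to crawl
restart_buffer = target_ids - crawled_ids
print(sorted(restart_buffer))  # [3, 5]
```

Recomputing this on every restart makes the crawl idempotent: re-running the spider only touches the users that are still missing.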