All Things Cloud Developer Meetup.
Filtering From the Firehose: Real Time Social Media Streaming with Jim Moffitt from Gnip. Gnip is the world's largest and most trusted provider of social data.
Learn about collecting and filtering social media data with streaming APIs. Jim will cover best practices, use case examples and live demos of filtering data from Twitter.
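The core idea of filtering a firehose can be sketched in a few lines. This is a hypothetical example of keyword filtering over tweet-like records, analogous to the track/filter rules offered by streaming APIs such as Twitter's; the record shapes and function names are illustrative, not the real Gnip/Twitter API.

```python
def matches(rules, text):
    """Return True if any rule keyword appears in the text (case-insensitive)."""
    text = text.lower()
    return any(rule.lower() in text for rule in rules)

def filter_stream(stream, rules):
    """Yield only the records whose 'text' field matches a filter rule."""
    for record in stream:
        if matches(rules, record.get("text", "")):
            yield record

# A tiny stand-in for the firehose
firehose = [
    {"id": 1, "text": "Flood warning issued for Boulder Creek"},
    {"id": 2, "text": "Great coffee this morning"},
    {"id": 3, "text": "River gauge shows rising flood levels"},
]

hits = list(filter_stream(firehose, ["flood"]))
print([r["id"] for r in hits])  # → [1, 3]
```

A real client would apply the same predicate to a long-lived HTTP stream rather than a list, but the filtering logic is the same.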
Presentation given by Sungwook Yoon, MapR Data Scientist
Topics Covered:
Advanced Persistent Threat (APT)
Big Data + Threat Intelligence
Hadoop + Spark Solution
Example Detection Algorithm Development Scenarios (most of them are still open problems)
44CON 2014: Using Hadoop for malware, network, forensics and log analysis - Michael Boman
The number of new malware samples exceeds a hundred thousand a day, network speeds are measured in multiples of ten gigabits per second, computer systems have terabytes of storage, and the log files just keep piling up. By using Hadoop you can tackle these problems in a whole different way, and “Too Much Data to Process” will be a thing of the past.
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose - Allen Day, PhD
Architecting R into the Storm Application Development Process
~~~~~
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
In this presentation, Allen will build a bridge from basic real-time business goals to the technical design of solutions. We will take an example of a real-world use case, compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution.
Python has long been a premier, flexible, and powerful open-source language that is easy to learn, easy to use, and has powerful libraries for data manipulation and analysis. Its simple, easy-to-learn syntax is very accessible to new programmers and will feel familiar to those coming from Matlab, C/C++, Java, or Visual Basic. Python is general purpose and comparatively easy to learn, and its adoption for analytical and quantitative computing continues to grow. For over a decade, Python has been used in scientific computing and highly quantitative domains such as finance, oil and gas, physics, and signal processing.
Social Security Company Nexgate's Success Relies on Apache Cassandra - DataStax Academy
The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we've ever scanned, forever, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This is how DataStax and Cassandra make it easy.
In 2009, Imperva published a report on 32 million breached passwords entitled "Consumer Password Worst Practices." Since then, successive breaches have highlighted consumers' inability to make sufficient password choices. Enterprises can no longer rely on employees, partners or consumers when it comes to password security. Instead, responsibility rests on enterprises to put in place proper password security policies and procedures as a part of a comprehensive data security discipline. Passwords should be viewed by security teams as highly valuable data - even if PCI or other security mandates don't apply. This paper guides enterprises to rectify poor password management practices.
Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Histor... - Ian Milligan
This was the second part of a joint presentation I did with Jimmy Lin (Maryland) at the "Web Archiving Collaboration: New Tools and Models" conference at Columbia University, New York NY on 4 June 2015.
Floods of Twitter Data - StampedeCon 2016 - StampedeCon
The Twitter data firehose delivers hundreds of millions of Tweets every day. This data flood comes with many ‘big data’ challenges in terms of both data volumes and velocities. This presentation will focus on tools that help you find your data ‘signal’ of interest, and will include several demos that focus on using Twitter for flood early-warning systems. These demos will highlight the real-time, public broadcast nature of Twitter, examples of real-time firehose filtering, as well as recent Internet of Things (IoT) Twitter integrations.
C* Summit 2013: Dude, Where's My Tweet? Taming the Twitter Firehose by Andrew... - DataStax Academy
Gnip ingests and must serve out hundreds of millions of social activities every day, and social platforms are only growing. This makes the scalability of applications essential for Gnip. Enter Cassandra. Problem solved, right? Not exactly; Gnip's relationship with Cassandra was not all rainbows and unicorns. In this session we will walk you through why we began looking at Cassandra as a data store in the first place and the valuable lessons we learned with Cassandra that have made it an invaluable part of our infrastructure.
Everything We Wish We Knew About Twitter When We Started
A look at the basics of getting started with Twitter, how to grow your following and your engagement, and how to get the most value and fun out of a truly amazing network.
We Are Social's comprehensive new report covers internet, social media and mobile usage statistics from all over the world. It contains more than 350 infographics, including global snapshots, regional overviews, and in-depth profiles of 30 of the world's largest economies. For a more insightful analysis of these numbers, please visit http://bit.ly/SDMW2015
Big Data made easy in the era of the Cloud - Demi Ben-Ari
A talk about the ease of use and handling of Big Data technologies in the cloud, using Google Cloud Platform, Amazon Web Services, and the tools around them, showing the common problems and how we can solve them with simple tools.
Big Data to SMART Data: Process scenario
A scenario for implementing a process that transforms raw data into exploitable, representative data, covering stream processing, distributed systems, messaging, storage in a NoSQL environment, and graphical data visualization within a Big Data ecosystem, using the following technologies:
Apache Storm, Apache Zookeeper, Apache Kafka, Apache Cassandra, Apache Spark and Data-Driven Documents (D3.js).
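The ingest-process-store chain this scenario describes can be modeled in miniature. This is an illustrative sketch (not the talk's actual code), with an in-memory queue standing in for a Kafka topic and a dict standing in for a Cassandra table; all names are assumptions.

```python
from collections import deque

topic = deque()   # stands in for a Kafka topic
table = {}        # stands in for a Cassandra table

def produce(event):
    """Producer / Storm spout side: append the event to the topic."""
    topic.append(event)

def process_and_store():
    """Storm bolt side: consume each event, enrich it, then persist it."""
    while topic:
        event = topic.popleft()
        event["processed"] = True
        table[event["key"]] = event

produce({"key": "sensor-1", "value": 42})
produce({"key": "sensor-2", "value": 7})
process_and_store()
print(sorted(table))  # → ['sensor-1', 'sensor-2']
```

In the real architecture each stage runs as an independent, distributed component, but the data flow between them follows this same produce/consume/persist shape.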
Kostas Tzoumas - Stream Processing with Apache Flink® - Ververica
In this talk the basics of Apache Flink are covered: why the project exists, where it came from, what gap it fills, how it differs from other stream processing projects, what it is being used for, and where it is headed. In short, streaming data is the new trend, and for very good reasons. Most data is produced continuously, and it makes sense that it is processed and analysed continuously. Whether the need is more real-time products, adopting micro-services, or building continuous applications, stream processing technology offers to simplify the data infrastructure stack and reduce the latency to decisions.
Apache Flink: Real-World Use Cases for Streaming Analytics - Slim Baltagi
This face-to-face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Big Data analytics, and in particular real-time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiple verticals such as financial services, healthcare, advertising, oil and gas, retail, and telecommunications.
In this talk, you will learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...) - Confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client-side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC-first architecture.
Topics we’ll discuss on decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end-to-end walkthrough of the processes that allow non-programmers to write and deploy event-driven data flows.
– Showing, end to end, the use of dynamic event processing that creates other stream processes, via a dynamic control-plane topology pattern and a broadcast state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
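The "shared state via RocksDB + Kafka Streams" pattern in the list above can be sketched simply: a local store (a dict here, RocksDB in Kafka Streams) is rebuilt by replaying a changelog of key/value events, so state survives restarts. This is a hedged illustration under assumed names, not Tinder's or Kafka Streams' actual code.

```python
# A changelog of (key, value) events, as a Kafka changelog topic would hold.
changelog = [
    ("user-1", {"swipes": 10}),
    ("user-2", {"swipes": 3}),
    ("user-1", {"swipes": 11}),  # a later event overwrites earlier state
]

def restore(log):
    """Replay the changelog into a local store; the last write per key wins."""
    store = {}
    for key, value in log:
        store[key] = value
    return store

state = restore(changelog)
print(state["user-1"]["swipes"])  # → 11
```

Because the log is the source of truth, any instance can rebuild an identical local store, which is what makes the pattern work for stateful stream processors.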
Streaming data presents new challenges for statistics and machine learning on extremely large data sets. Tools such as Apache Storm, a stream processing framework, can power a range of data analytics but lack advanced statistical capabilities. These slides are from the ApacheCon talk, which discussed developing streaming algorithms with the flexibility of both Storm and R, a statistical programming language.
In the talk I discussed why and how to use Storm and R to develop streaming algorithms; in particular I focused on:
• Streaming algorithms
• Online machine learning algorithms
• Use cases showing how to process hundreds of millions of events a day in (near) real time
See: https://apacheconna2015.sched.org/event/09f5a1cc372860b008bce09e15a034c4#.VUf7wxOUd5o
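As a minimal example of the kind of online algorithm discussed above (my own sketch, not from the talk): Welford's method maintains a running mean and variance over a stream in O(1) memory, so it fits naturally inside a Storm bolt or an R streaming prototype.

```python
class RunningStats:
    """Welford's online algorithm for mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; defined only once we have at least two points.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 6.0]:
    stats.update(x)
print(stats.mean, stats.variance())  # → 4.0 4.0
```

Unlike a batch computation, each event updates the statistics immediately, which is exactly what lets hundreds of millions of events a day be summarized in (near) real time.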
Independent of the source of the data, the integration of event streams into an enterprise architecture is becoming more and more important in a world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis later; you have to be able to include part of your analytics right after you consume the event streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the last 3 years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks, such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations for event and stream processing, describe the differences you might find between the more traditional CEP and the more modern Stream Processing solutions, and show that a combination of both brings the most value.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... - Yael Garten
2017 Strata + Hadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... - Shirshanka Das
Data Gloveboxes: A Philosophy of Data Science Data Security - DataWorks Summit
Data scientists often have access to very sensitive material: data! Today's data scientists need a way to interact with toxic data, where spilling even a small amount could be destructive to a company. Securing compute clusters to work like the nuclear gloveboxes of old is one technique to limit data exfiltration and ensure data production is regularized, reliable and secure.
This talk will cover the philosophy and implementation of:
Data Dropbox: data goes in blindly but can be verified via checksums; data directionality is enforced. HDFS is used as a model, and the state of HBase is discussed.
Data Glovebox: one can manipulate data as desired but cannot exfiltrate it except via very specific, controlled processes; the Oozie Git action is a step in this direction.
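The "verified via checksums" idea behind the Data Dropbox can be illustrated in a few lines: record a SHA-256 digest on ingest, so any later reader can confirm the stored bytes were not altered. This is my own sketch with assumed function names, not the talk's implementation.

```python
import hashlib

def ingest(dropbox, name, data):
    """Store the payload together with its SHA-256 checksum."""
    dropbox[name] = (data, hashlib.sha256(data).hexdigest())

def verify(dropbox, name):
    """Re-hash the stored payload and compare it with the recorded digest."""
    data, digest = dropbox[name]
    return hashlib.sha256(data).hexdigest() == digest

box = {}
ingest(box, "sample.bin", b"malware sample bytes")
print(verify(box, "sample.bin"))  # → True
```

In HDFS this verification happens at the block level automatically, which is one reason the talk holds it up as a model.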
Managing Your Black Friday Logs - Voxxed Luxembourg - David Pilato
Monitoring a complex application is no easy task, but with the right tools it is not rocket science. Nevertheless, peak periods such as Black Friday sales or the Christmas season can push your application to the limits of what it can handle, or worse, make it crash. Because the system is under heavy load, it generates even more logs, which can in turn strain your monitoring system.
In this session, I will cover best practices for using the Elastic Stack to centralize and monitor your logs. I will also share some tips and tricks to help you get through your Black Fridays without trouble!
We will look at:
* Monitoring architectures
* Finding the optimal size for the _bulk API
* Distributing the load
* Sizing indices and shards
* Optimizing disk I/O
You will leave the session with best practices for building a monitoring system with the Elastic Stack, plus advanced tuning to optimize ingest and search performance.
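The "find the optimal _bulk size" advice above usually comes down to batching documents by both count and payload size before sending them to Elasticsearch. This is a hedged sketch; the caps are placeholders to tune, not recommendations, and the real request would serialize each batch in the _bulk newline-delimited format.

```python
import json

def batches(docs, max_docs=500, max_bytes=5_000_000):
    """Group documents into batches capped by document count and payload bytes."""
    batch, size = [], 0
    for doc in docs:
        line = json.dumps(doc)
        # Start a new batch when either cap would be exceeded.
        if batch and (len(batch) >= max_docs or size + len(line) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += len(line)
    if batch:
        yield batch

docs = [{"msg": f"log line {i}"} for i in range(5)]
print([len(b) for b in batches(docs, max_docs=2)])  # → [2, 2, 1]
```

Benchmarking a few cap values against your own cluster is the usual way to find the sweet spot, since it depends on document size and hardware.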
Similar to Filtering From the Firehose: Real Time Social Media Streaming
Product owners and app developers are frequently tasked with designing API integrations between their apps and the cloud services used by their customers, partners or employees. Follow this 10-step guide, reflecting a common pattern for interactive integrations between your app and other cloud services.
Learn why (and how) leading SaaS providers are turning their products into platforms with the power of API integration. Innovative companies, such as PactSafe, Slack and Intercom, are making integration easier and accessible by shifting the burden of integration off of their customers.
The State of API Integration Report 2017 helps to address the proliferation challenge by providing trends, insights on ease of integration, data on where the industry is strong, and where it is going next. The data presented here comes from the Cloud Elements platform of API integrations with research provided by ProgrammableWeb and Datanyze, as well as industry experts. It will help all developers navigate the recent explosion of APIs and the implications of API integrations to work more efficiently in 2017 and beyond.
Cloud Elements | State of API Integration Report 2018 - Cloud Elements
The State of API Integration 2018 Report contains a full breakdown on the current state of the API industry, a look at what’s trending and why, and a look ahead to where we believe API integration is headed. This year’s report builds on observations from 2017, with the help of over 400 API enthusiasts who took the State of API Integration Survey at the end of last year.
Building Event Driven API Services Using WebhooksCloud Elements
Presented at 'All Things API' in Denver, CO by Travis McChesney, Director of Engineering at Cloud Elements.
How do you build and use user-defined callback URLs (known as webhooks) to notify your users of events that occurred on your system, or use those URLs to get remote notifications from API-connected systems you use?
Using Webhooks is becoming more common as APIs become essential to all programming models. We will cover four common usage models: API capture, TCP Tunneling, Dynamic DNS and Remote Development.
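One practical detail of consuming webhooks is authenticating the callback. This is an illustrative sketch, not any specific vendor's scheme: many providers sign the payload with a shared secret using HMAC, so the receiver can verify that the event really came from the sender. The secret and payload here are assumptions.

```python
import hashlib
import hmac

SECRET = b"shared-webhook-secret"  # hypothetical secret agreed with the provider

def sign(payload: bytes) -> str:
    """Compute an HMAC-SHA256 signature of the webhook payload."""
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    """Check the signature; compare_digest avoids timing side channels."""
    return hmac.compare_digest(sign(payload), signature)

body = b'{"event": "document.created"}'
sig = sign(body)
print(verify(body, sig))         # → True
print(verify(b"tampered", sig))  # → False
```

Receivers typically read the signature from a request header and reject any delivery that fails this check before processing the event.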
Mark Geene, CEO/Co-founder of Cloud Elements, presented "Lean Product Development" at Fort Collins Startup Week 2014. Check out the presentation for information on how to build a Lean startup. Based on principles from 'Lean Startup' by Eric Ries, 'Running Lean' by Ash Maurya and '500 Startups' by Dave McClure.
'Scalable Logging and Analytics with LogStash' - Cloud Elements
Rich Viet, Principal Engineer at Cloud Elements, presents 'Scalable Logging and Analytics with LogStash' at the All Things API meetup in Denver, CO.
Learn more about scalable logging and analytics using Logstash. This will be an overview of Logstash components, including getting started, indexing, storing, and getting information from logs.
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (for example, for searching).
The Entrepreneurial Methodology: How engineers can harness the madness in a n... - Cloud Elements
All Things API meetup @ Galvanize, Denver CO
The Entrepreneurial Methodology:
How engineers can harness the madness in a non-technical start-up
Speaker: Brandon Vogt: VP Enterprise Architecture, Inspirato
• How to be proud of your code today, and into the future
• Getting 80% of the core technology right with 20% of the requirements
• Luck vs. strategy
• Iterative development and constant communication
• Pride – It commeth before a fall; It’s not personal – it’s business
• Requirements – Change daily; thoughts while shaving; Founder’s prerogative
• People – Experienced in the Business Concept or Industry; learn from previous failures or successes
• Quick Turns – Fail Fast
• Communication – Daily, Hourly – Sit close together; know each other
• Priorities – Strategic, Weekly, Daily
• User design first vs. A perfect database design
• MINDCRAFT and CODERCRAFT - how do you get a team of master carpenter’s to build an outhouse
• We live in a distributed world with niche software options; “The Daddy of Them All” and “This isn’t your daddy’s rodeo”.
The Cloud Elements Documents Hub is the first API that unifies Document Management across the industry’s leading cloud document and file storage services.
When provisioned via Cloud Elements, Box, Dropbox, Google Drive, Microsoft SharePoint and SkyDrive automatically plug in with our enterprise-class monitoring and logging console - providing real-time visibility into the performance and availability of these services.
Cloud Elements’ “one-to-many” approach allows you write to one API and connect to all the leading services in the Documents Hub. A uniform API provides the ability to search, store, retrieve and manage documents and files across leading services.
Our Elements support your multi-tenant application. One Element manages connections with an unlimited number of “instances” of each service. So you can have thousands of Google Drive or Box accounts connecting with your application.
Data normalization across API interactionsCloud Elements
With Vineet Joshi, CTO and Co-founder of Cloud Elements
Vineet discusses the normalization of data and similar domain models so it can uniformly act with end points that consume data of the same types.
Doing this declaratively instead of programmatically, the benefit is that once you have declared the transformation configuration of a given type of data, interaction between different endpoints is possible for the same type of data in endpoint specific formats.
Lean Product Development for Startups- Denver Startup Week Cloud Elements
Mark Geene, CEO/Co-founder of Cloud Elements presented "Lean Product Development for Startups" at Denver Startup Week 2013. Check out the presentation for information on how to build a Lean startup. Based on principles from 'Lean Startup' by Eric Ries, 'Running Lean' by Ash Maurya and '500 Startups' by Dave McClure.
All Things Cloud Meetup with John Henning, Technical Evangelist at Salesforce.com
"AppExchange for Developers: Monetizing Enterprise Apps on the Salesforce.com Platform"
Learn about the huge opportunity for creating a software business building and selling business apps using the Salesforce.com platform as a service. Includes Force.com technology, platform for business apps and the ISV program.
Cloud Elements CEO, Mark Geene's presentation for Startup Founder 101 event. July 9, 2013 at Galvanize Denver, CO. Lean product management principles, Startup Metrics for Pirates, Agile MVP planning and using Pivotal Tracker.
Money & Bitcoin & the Cloud: It's all just data streams, anyway!Cloud Elements
Michael Schonfeld, developer evangelist at Dwolla. Money and the Cloud: It's all just data streams, anyway. Moving money should be just as easy as moving bytes. So why do we keep struggling with 40 year old COBOL-written financial systems? Could this be why Bitcoins have picked up all this momentum? Come take an inside look at how Dwolla is changing a long-stagnated banking industry that is governed by ancient misguided notions of profit.
API Versioning in the Cloud. Presented to the Galvanize gSchool on 6/14/2013 by Travis McChesney, Senior Engineer- Cloud Elements. Content Negotiation, URI, URI Parameter, Cloud Elements Demo.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Filtering From the Firehose: Real Time Social Media Streaming
1. Filtering from the Firehose!
Real-time streaming of social network data
Jim Moffitt – Developer Advocate @gnip
@jimmoffitt
2. Who is this guy and what is he going to talk about?
• Introduction
• Social media firehoses
  • Data sources
  • Use-cases
• Needle in the haystack
  • Filtering from the firehose
  • Example use-case
• Server-side
  • Apache Kafka
  • Apache Cassandra
• Client-side
  • HTTP streaming code examples
  • Live streaming and search
3. What is a firehose?
• Continuous stream of flexibly structured (JSON) social media activities in near-real time.
• Potentially extreme amounts of data.
5. Accessing Social Data for Analytics
• Public APIs — Pros: it's free. Cons: rate limits, not guaranteed.
• Crawling/Scraping — Pros: open access. Cons: TOS issues, high latency, fragile.
• Licensed access (publisher provides data "firehose") — Pros: no rate limits, compliant, reliable. Cons: financial investment, not all publishers are covered.
6. Example firehose volumes
Publisher            Daily Activity
Twitter              450 M
Tumblr               96 M + 54 M votes
Foursquare           4.3 M
Disqus               1.9 M
Wordpress Comments   1.4 M
Wordpress Posts      0.6 M
GetGlue              0.6 M
7. Daily Tweet Activity Count
[Chart: tweets per day by month, 2006 through 2011 — roughly 5k/day in 2006, 200k/day in 2007, 1.6M/day in 2008, 25M/day in 2009, 80M/day in 2010, and 250M/day in 2011.]
8. Use-cases for Social Media Analysis
• Sales & Marketing
• Brand monitoring
• Customer Service
• Public Relations
• Emergency Response
• All kinds of academic research…
9. So you are building something around social media?
Some business considerations:
• Objective – what are the questions that you are trying to answer?
• Timeframe – real-time or historical use-case (or both)?
• Coverage – do I need all the data or some statistical sample?
• Licensing and Terms of Service
• Budgets
  • Data costs.
  • Software development.
  • Infrastructure (bandwidth, servers, storage).
10. So you are building something around social media?
Some technical considerations:
• Data transfer protocols: RESTful or 'keep-alive' streaming?
• What software language?
• Bandwidth: what does your peak volume need to be?
• Data storage
  • How and where are you storing the data?
  • What metadata do you need to store?
• Redundant streams?
11. What data comes with a tweet?
{
  "id": "tag:search.twitter.com,2005:388326436685103105",
  "objectType": "activity",
  "actor": {
    "objectType": "person",
    "id": "id:twitter.com:17200003",
    "link": "http://www.twitter.com/jimmoffitt",
    "displayName": "jimmoffitt",
    "postedTime": "2008-11-05T23:06:37.000Z",
    "image": "https://si0.twimg.com/profile_images/3678478654/6aac91cc6bd5711b82c83ebab0a55de0_normal.jpeg",
    "summary": "Once studied snow hydrology. Recently developed real-time weather monitoring and flood warning software. Have started a new adventure at an amazing company...",
    "links": [{"href": null, "rel": "me"}],
    "friendsCount": 69, "followersCount": 71, "listedCount": 1, "statusesCount": 189,
    "twitterTimeZone": "Mountain Time (US & Canada)",
    "verified": false, "utcOffset": "-21600",
    "preferredUsername": "jimmoffitt", "languages": ["en"],
    "location": {"objectType": "place", "displayName": "Longmont, Colorado"},
    "favoritesCount": 17
  },
  "verb": "post",
  "postedTime": "2013-10-10T15:33:31.000Z",
  "generator": {"displayName": "TweetDeck", "link": "http://www.tweetdeck.com"},
  "provider": {"objectType": "service", "displayName": "Twitter", "link": "http://www.twitter.com"},
  "link": "http://twitter.com/jimmoffitt/statuses/388326436685103105",
  "body": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
  "object": {
    "objectType": "note",
    "id": "object:search.twitter.com,2005:388326436685103105",
    "summary": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip",
    "link": "http://twitter.com/jimmoffitt/statuses/388326436685103105",
    "postedTime": "2013-10-10T15:33:31.000Z"
  },
  "favoritesCount": 0,
  "twitter_entities": {
    "hashtags": [], "symbols": [],
    "urls": [{"url": "http://t.co/EQSCWMW4hL", "expanded_url": "http://meetu.ps/1Fywpg", "display_url": "meetu.ps/1Fywpg", "indices": [80, 102]}],
    "user_mentions": [{"screen_name": "gnip", "name": "Gnip, Inc.", "id": 16958875, "id_str": "16958875", "indices": [103, 108]}]
  },
  "twitter_filter_level": "medium",
  "twitter_lang": "en",
  "retweetCount": 0,
  "gnip": {
    "matching_rules": [{"value": "\"All Things Cloud\"", "tag": null}, {"value": "from:jimmoffitt", "tag": null}],
    "urls": [{"url": "http://t.co/EQSCWMW4hL", "expanded_url": "http://www.meetup.com/All-things-Cloud-PaaS-SaaS-PaaS-XaaS/events/124584092/"}],
    "klout_score": 49,
    "klout_profile": {
      "topics": [{"klout_topic_id": "10000000000000000020", "displayName": "Tablets", "link": "http://klout.com/topic/id/10000000000000000020"}],
      "klout_user_id": "26177177599171892",
      "link": "http://klout.com/user/id/26177177599171892"
    },
    "language": {"value": "en"},
    "profileLocations": [{
      "objectType": "place",
      "geo": {"type": "point", "coordinates": [-105.10193, 40.16721]},
      "address": {"country": "United States", "countryCode": "US", "locality": "Longmont", "region": "Colorado"},
      "displayName": "Longmont, Colorado, United States"
    }]
  }
}
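The payload above is an Activity Streams JSON document, so pulling out the fields most workflows need (author, body, matched rules, profile geo) is a dictionary walk. A minimal sketch in Python — `summarize_activity` is an illustrative helper, not a Gnip library function, and the sample document is abbreviated from the payload above:

```python
import json

def summarize_activity(activity: dict) -> dict:
    """Extract commonly used fields from a Gnip Activity Streams tweet."""
    gnip = activity.get("gnip", {})
    locations = gnip.get("profileLocations", [])
    return {
        "actor": activity.get("actor", {}).get("preferredUsername"),
        "body": activity.get("body"),
        "matching_rules": [r["value"] for r in gnip.get("matching_rules", [])],
        "profile_location": locations[0]["displayName"] if locations else None,
    }

# Abbreviated sample based on the payload above
sample = json.loads("""
{"actor": {"preferredUsername": "jimmoffitt"},
 "body": "Looking forward to this \\"All Things Cloud\\" meet-up",
 "gnip": {"matching_rules": [{"value": "\\"All Things Cloud\\"", "tag": null}],
          "profileLocations": [{"displayName": "Longmont, Colorado, United States"}]}}
""")
print(summarize_activity(sample))
```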
12. Methods for filtering data
• Token filter (e.g. "pizza", "beer")
• Substrings (contains:sport)
• Exact phrases ("all things cloud")
• Operators: metadata (geo, language, profiles, account stats, ...)
• Operators: sampling (e.g. sample:10%)
• Publisher-specific operators: hashtags, user mentions/from/to, retweets, ...

Examples:
(pizza beer)
"all things cloud" profile_region:colorado
twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
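Gnip evaluates these rules server-side, but the core ideas — all tokens must match, quoted phrases match as a unit, a leading "-" negates — can be illustrated with a toy matcher. This is a deliberately simplified sketch, not the real PowerTrack rule grammar (no OR groups, no operators):

```python
def matches(rule_tokens, text):
    """Toy rule evaluator: every token must be present in the text;
    tokens prefixed with '-' must be absent. Quoted phrases are
    matched as substrings. Not the real PowerTrack grammar."""
    text_lc = text.lower()
    words = set(text_lc.split())
    for token in rule_tokens:
        negated = token.startswith("-")
        token = token.lstrip("-").strip('"').lower()
        # Phrases (containing spaces) match as substrings; bare tokens match whole words
        present = token in text_lc if " " in token else token in words
        if present == negated:
            return False
    return True

tweet = "All things cloud meetup tonight: pizza and beer provided"
print(matches(["pizza", "beer"], tweet))                    # True
print(matches(['"all things cloud"', "-baseball"], tweet))  # True
```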
13. Example use-case: Early-warning systems
Is there a Twitter 'signal' around local rain and flood events?

Business logic:
rain OR raining OR rained OR pouring OR weather OR hail OR lightning OR contains:flood OR "cats and dogs" OR wxreport OR contains:storm OR contains:precip

See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
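With a rule like the one above in place, the "signal" question reduces to counting matched activities per time bucket and comparing the curve against rain gauge data. A minimal bucketing sketch (the timestamps are illustrative; the format string matches the `postedTime` field shown earlier):

```python
from collections import Counter
from datetime import datetime

def hourly_counts(posted_times):
    """Bucket ISO-8601 activity timestamps (postedTime field) into hourly counts."""
    buckets = Counter()
    for ts in posted_times:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
        buckets[dt.strftime("%Y-%m-%d %H:00")] += 1
    return buckets

# Illustrative timestamps from matched activities
times = ["2013-09-12T01:05:00.000Z", "2013-09-12T01:40:00.000Z",
         "2013-09-12T02:10:00.000Z"]
print(hourly_counts(times))
```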
14. Social media and early-warning systems
There are generally three methods for geo-referencing Twitter data:
• Activity Location: tweets that are geo-tagged.
• Mentioned Location: parsing the tweet message for geographic location.
• Profile Location: parsing the Twitter account profile location provided by the user.

Share of geo-referenced tweets by method:
• User account profile: 82%
• Tweet text: 17%
• Tweet geo-tagging: 1%

See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
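The three methods above can be tried in order of precision: an explicit geo-tag beats a place name in the text, which beats the account profile. A sketch of that precedence, assuming the Activity Streams field names shown in the earlier payload; the place list and `geo_reference` helper are illustrative (a real system would use a proper geo parser for mentioned locations):

```python
def geo_reference(activity: dict):
    """Return the most precise available location source for a tweet:
    activity geo-tag > mentioned location in text > profile location."""
    if activity.get("geo"):
        return "activity"
    # Mentioned location: here we just check against a toy place list
    known_places = {"longmont", "boulder", "denver"}
    body = activity.get("body", "").lower()
    if any(place in body for place in known_places):
        return "mentioned"
    if activity.get("gnip", {}).get("profileLocations"):
        return "profile"
    return None

print(geo_reference({"body": "Flooding in Boulder right now"}))
```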
15. Social media and early-warning systems
• Profile Location (old):
  bio_location_contains:louisville -(bio_location_contains:"co " OR bio_location_contains:colorado) -(bio_location_contains:"tn " OR bio_location_contains:tennessee)
• Profile Location (new):
  profile_locality:louisville profile_region:kentucky

See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
16. Social media and early-warning systems
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
17. Social media and early-warning systems
See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
18. Apache Kafka @ Gnip
Kafka is used to help manage streaming traffic with the outside world.
First application was with outbound streams: Gnip → Customer.
Helps provide an "on-disk" buffer for client streams. Write data to disk for a short period. If a client disconnects, when they reconnect their data buffer is "backfilled."
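The "backfill" behavior described above — messages are retained for a short period so a reconnecting client can resume where it left off — is what Kafka's offset-based, retained log provides. An in-memory sketch of the idea (a toy model, not Gnip's implementation or the Kafka API):

```python
from collections import deque

class BufferedStream:
    """Toy model of a retention buffer: messages get monotonically increasing
    offsets, and a reconnecting client replays everything after its last offset."""
    def __init__(self, retention=1000):
        self.buffer = deque(maxlen=retention)  # holds (offset, message) pairs
        self.next_offset = 0

    def publish(self, message):
        self.buffer.append((self.next_offset, message))
        self.next_offset += 1

    def read_from(self, last_seen_offset):
        """Backfill: return all retained messages the client missed."""
        return [m for off, m in self.buffer if off > last_seen_offset]

stream = BufferedStream()
for i in range(5):
    stream.publish(f"activity-{i}")
print(stream.read_from(2))  # client last saw offset 2 → activity-3, activity-4
```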
19. Apache Kafka @ Gnip
Next applied to inbound publisher streams: Publisher → Gnip.
Buffers incoming data and helps manage massive volume spikes. Spikes are isolated to this ingest tier. Downstream applications read data as fast as they can.
20. Apache Cassandra @ Gnip
Serves a moving window of Twitter data (currently 30 days). Will grow.
Chosen for its:
• Write-speeds
• Reliability
• Redundancy
• Scalability
21. Apache Cassandra @ Gnip
• Serves a variety of data services, products and use-cases.
• For Search we have an Apache Lucene index helping to quickly find Cassandra data.
• Nearly 50 Cassandra servers across test/staging/production environments.
22. Streaming social media
# List your current filtering rules
curl -ujmoffitt@gnipcentral.com https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json

# Add a rule
curl -v -X POST -ujmoffitt@gnipcentral.com "https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json" -d '{"rules":[{"tag":"demo","value":"weather OR rain OR snow"}]}'

# Connect to the stream
curl --compressed -v -ujmoffitt@gnipcentral.com "https://stream.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev.json"
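The streaming endpoint in the last command holds the HTTP connection open and delivers one JSON activity per line, with blank keep-alive lines in between. A consumer sketch in Python — the parsing logic is separated from the connection so it can be exercised without a live stream; the commented `requests` usage is a common pattern for such endpoints, not Gnip-specific client code:

```python
import json

def parse_stream(lines):
    """Yield parsed activities from an iterable of raw stream lines,
    skipping the blank keep-alive lines the endpoint sends."""
    for line in lines:
        if not line.strip():
            continue  # keep-alive heartbeat
        yield json.loads(line)

# Against a live endpoint this would be driven by something like:
#   import requests
#   resp = requests.get(url, auth=(user, password), stream=True)
#   for activity in parse_stream(resp.iter_lines(decode_unicode=True)):
#       handle(activity)

fake_stream = ['{"verb": "post", "body": "rain in Boulder"}',
               "",
               '{"verb": "post", "body": "snow"}']
for activity in parse_stream(fake_stream):
    print(activity["body"])
```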
23. Code examples
Search GitHub for "Twitter Stream" – 793 repository results.
• Python streaming connection
• Ruby streaming connection (using the 'curb' libcurl gem)
• Ruby streaming connection (using the EventMachine gem)