Filtering From the Firehose: Real Time Social Media Streaming
 

All Things Cloud Developer Meetup.
Filtering From the Firehose: Real Time Social Media Streaming with Jim Moffitt from Gnip. Gnip is the world's largest and most trusted provider of social data.
Learn about collecting and filtering social media data with streaming APIs. Jim will cover best practices, use case examples and live demos of filtering data from Twitter.

Presentation Transcript

  • Filtering from the Firehose: Real-time streaming of social network data. Jim Moffitt – Developer Advocate @gnip @jimmoffitt
  • Who is this guy and what is he going to talk about?
      • Introduction
      • Social media firehoses
      • Data sources
      • Use-cases
      • Needle in the haystack
      • Filtering from the firehose
      • Example use-case
      • Server-side
      • Apache Kafka
      • Apache Cassandra
      • Client-side
      • HTTP streaming code examples
      • Live streaming and search
  • What is a firehose?
      • Continuous stream of flexibly structured (JSON) social media activities in near-real time.
      • Potentially extreme amounts of data.
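As a rough illustration of what consuming such a stream involves, the sketch below reads a line-delimited JSON feed and yields one activity at a time. It assumes (this is not stated on the slide) that each activity arrives on its own line and that blank lines are keep-alive heartbeats; `iter_activities` and the sample data are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def iter_activities(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed activity per non-empty line of a line-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if not line:
            # Assumed keep-alive heartbeat: a blank line carrying no data.
            continue
        yield json.loads(line)


# Simulated stream: two activities separated by a keep-alive line.
sample = io.StringIO(
    '{"verb": "post", "body": "first"}\r\n'
    '\r\n'
    '{"verb": "share", "body": "second"}\r\n'
)
activities = list(iter_activities(sample))
print(len(activities), "activities read")
```

Processing the stream as a generator keeps memory use flat, which matters when the volume is potentially extreme.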
  • Available firehoses and public APIs
  • Accessing Social Data for Analytics
      • Crawling/Scraping. Pros: it's free. Cons: TOS issues, high latency, fragile.
      • Public APIs. Pros: open access. Cons: rate limits, not guaranteed.
      • Licensed Access (publisher provides data "firehose"). Pros: no rate limits, compliant, reliable. Cons: financial investment, not all publishers are covered.
  • Example firehose volumes
      Publisher             Daily Activity
      Twitter               450 M
      Tumblr                96 M + 54 M votes
      Foursquare            4.3 M
      Disqus                1.9 M
      Wordpress Comments    1.4 M
      Wordpress Posts       0.6 M
      GetGlue               0.6 M
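To put the daily figures in streaming terms, the following back-of-the-envelope sketch converts a few of them to average activities per second. The numbers are the ones on the slide (Tumblr counted without its votes); the helper name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily activity counts taken from the slide (Tumblr votes excluded).
daily_activities = {
    "Twitter": 450_000_000,
    "Tumblr": 96_000_000,
    "Foursquare": 4_300_000,
    "Disqus": 1_900_000,
}


def average_per_second(daily_count: int) -> float:
    """Average rate, assuming activity were spread evenly over the day."""
    return daily_count / SECONDS_PER_DAY


rates = {name: average_per_second(count) for name, count in daily_activities.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} activities/second on average")
```

These are averages only. Real traffic is bursty, so a consumer sized for the average rate will fall behind during peaks.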
  • Daily Tweet Activity Count [Figure: tweets per day plotted by date, one panel per year from 2006 to 2011; the y-axis scale grows from 5 k in 2006 to 250 M in 2011.]
  • Use-cases for Social Media Analysis
      • Sales & Marketing
      • Brand monitoring
      • Customer Service
      • Public Relations
      • Emergency Response
      • All kinds of academic research…
  • So you are building something around social media? Some business considerations:
      • Objective – what are the questions that you are trying to answer?
      • Timeframe – real-time or historical use-case (or both)?
      • Coverage – do I need all the data or some statistical sample?
      • Licensing and Terms of Service
      • Budgets
          • Data costs.
          • Software development.
          • Infrastructure (bandwidth, servers, storage).
  • So you are building something around social media? Some technical considerations:
      • Data transfer protocols: RESTful or 'keep-alive' Streaming?
      • What software language?
      • Bandwidth: what does your peak volume need to be?
      • Data storage
          • How and where are you storing the data?
          • What metadata do you need to store?*
      • Redundant streams?
  • What data comes with a tweet?
    {"id":"tag:search.twitter.com,2005:388326436685103105","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:17200003","link":"http://www.twitter.com/jimmoffitt","displayName":"jimmoffitt","postedTime":"2008-11-05T23:06:37.000Z","image":"https://si0.twimg.com/profile_images/3678478654/6aac91cc6bd5711b82c83ebab0a55de0_normal.jpeg","summary":"Once studied snow hydrology. Recently developed real-time weather monitoring and flood warning software. Have started a new adventure at an amazing company...","links":[{"href":null,"rel":"me"}],"friendsCount":69,"followersCount":71,"listedCount":1,"statusesCount":189,"twitterTimeZone":"Mountain Time (US & Canada)","verified":false,"utcOffset":"-21600","preferredUsername":"jimmoffitt","languages":["en"],"location":{"objectType":"place","displayName":"Longmont, Colorado"},"favoritesCount":17},"verb":"post","postedTime":"2013-10-10T15:33:31.000Z","generator":{"displayName":"TweetDeck","link":"http://www.tweetdeck.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/jimmoffitt/statuses/388326436685103105","body":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip","object":{"objectType":"note","id":"object:search.twitter.com,2005:388326436685103105","summary":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http://t.co/EQSCWMW4hL @gnip","link":"http://twitter.com/jimmoffitt/statuses/388326436685103105","postedTime":"2013-10-10T15:33:31.000Z"},"favoritesCount":0,"twitter_entities":{"hashtags":[],"symbols":[],"urls":[{"url":"http://t.co/EQSCWMW4hL","expanded_url":"http://meetu.ps/1Fywpg","display_url":"meetu.ps/1Fywpg","indices":[80,102]}],"user_mentions":[{"screen_name":"gnip","name":"Gnip, Inc.","id":16958875,"id_str":"16958875","indices":[103,108]}]},"twitter_filter_level":"medium","twitter_lang":"en","retweetCount":0,"gnip":{"matching_rules":[{"value":"\"All Things Cloud\"","tag":null},{"value":"from:jimmoffitt","tag":null}],"urls":[{"url":"http://t.co/EQSCWMW4hL","expanded_url":"http://www.meetup.com/All-things-Cloud-PaaS-SaaS-PaaS-XaaS/events/124584092/"}],"klout_score":49,"klout_profile":{"topics":[{"klout_topic_id":"10000000000000000020","displayName":"Tablets","link":"http://klout.com/topic/id/10000000000000000020"}],"klout_user_id":"26177177599171892","link":"http://klout.com/user/id/26177177599171892"},"language":{"value":"en"},"profileLocations":[{"objectType":"place","geo":{"type":"point","coordinates":[-105.10193,40.16721]},"address":{"country":"United States","countryCode":"US","locality":"Longmont","region":"Colorado"},"displayName":"Longmont, Colorado, United States"}]}}
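A consumer rarely needs every field in that payload. The sketch below pulls out a handful of them from a trimmed copy of the activity shown on the slide. The field names come from the slide; the `summarize` helper and the choice of which fields to keep are illustrative assumptions, and the coordinate pair appears to be ordered longitude, latitude.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the activity on the slide; only the fields used below.
raw = '''{"verb":"post","postedTime":"2013-10-10T15:33:31.000Z",
"actor":{"preferredUsername":"jimmoffitt","followersCount":71},
"body":"Looking forward to this \\"All Things Cloud\\" meet-up",
"gnip":{"matching_rules":[{"value":"from:jimmoffitt","tag":null}],
"profileLocations":[{"geo":{"type":"point","coordinates":[-105.10193,40.16721]}}]}}'''


def summarize(activity: dict) -> dict:
    """Reduce an activity to the fields a simple analysis might store."""
    gnip = activity.get("gnip", {})
    locations = gnip.get("profileLocations", [])
    coordinates = locations[0]["geo"]["coordinates"] if locations else None
    posted = datetime.strptime(
        activity["postedTime"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "user": activity["actor"]["preferredUsername"],
        "followers": activity["actor"].get("followersCount"),
        "posted": posted,
        "text": activity["body"],
        "rules": [rule["value"] for rule in gnip.get("matching_rules", [])],
        "coordinates": coordinates,
    }


summary = summarize(json.loads(raw))
print(summary["user"], summary["posted"].isoformat(), summary["rules"])
```

Deciding up front which metadata to keep is one way to control storage costs, since the full payload is many times larger than the tweet text.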
  • Methods for filtering data
      • Token filter (e.g. "pizza", "beer")
      • Substrings (contains:sport)
      • Exact phrases ("all things cloud")
      • Operators: metadata (geo, language, profiles, account stats, ...)
      • Operators: sampling (e.g. sample:10%)
      • Publisher-specific Operators: hashtags, user mentions/from/to, retweets, ...
    Examples:
      (pizza beer) "all things cloud" profile_region:colorado
      twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
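To make the logic of the second example rule concrete, here is a deliberately simplified client-side matcher. It is not Gnip's rule engine and does not parse rule syntax; the `Rule` structure and `matches` function are hypothetical, and terms are compared as case-insensitive whole words or phrases.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    all_of: List[str] = field(default_factory=list)   # every term must appear
    any_of: List[str] = field(default_factory=list)   # at least one must appear (if given)
    none_of: List[str] = field(default_factory=list)  # none of these may appear


def term_in_text(term: str, text: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def matches(rule: Rule, text: str) -> bool:
    if not all(term_in_text(t, text) for t in rule.all_of):
        return False
    if rule.any_of and not any(term_in_text(t, text) for t in rule.any_of):
        return False
    return not any(term_in_text(t, text) for t in rule.none_of)


# twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
twins_rule = Rule(
    all_of=["twins"],
    any_of=["baseball", "minnesota", "sports", "small market"],
    none_of=["cute", "baby", "olsen", "olson"],
)

print(matches(twins_rule, "The Twins are a small market baseball team"))
print(matches(twins_rule, "Cute twins baby photos"))
```

The negated group is what separates the baseball team from the far more common everyday uses of the word.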
  • Example use-case: Early-warning systems
    Is there a Twitter 'signal' around local rain and flood events?
    Business logic:
      rain OR raining OR rained OR pouring OR weather OR hail OR lightning OR contains:flood OR "cats and dogs" OR wxreport OR contains:storm OR contains:precip
    See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • Social media and early-warning systems
    There are generally three methods for geo-referencing Twitter data:
      • Activity Location: tweets that are geo-tagged.
      • Mentioned Location: parsing the tweet message for geographic location.
      • Profile Location: parsing the Twitter Account Profile location provided by the user.
    Share by method:
      • User account profile: 82%
      • Tweet text: 17%
      • Tweet geo-tagging: 1%
    See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
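One way to combine the three methods is to try them in order of precision and fall back when a method yields nothing. The sketch below does that under several assumptions: the top-level `geo` field for geo-tagged tweets, the tiny `KNOWN_PLACES` gazetteer, and the precedence order are all illustrative choices, while `gnip.profileLocations` follows the payload shown earlier in the talk.

```python
from typing import Optional, Tuple

# Hypothetical gazetteer; a real system would use a proper place-name database.
KNOWN_PLACES = ("louisville", "longmont", "denver")


def georeference(activity: dict) -> Optional[Tuple[str, str]]:
    """Return (method, location) using the most precise source available."""
    # 1. Activity location: the tweet itself is geo-tagged (assumed "geo" field).
    geo = activity.get("geo")
    if geo and geo.get("coordinates"):
        return ("activity", ",".join(str(c) for c in geo["coordinates"]))

    # 2. Mentioned location: a known place name appears in the tweet text.
    body = activity.get("body", "").lower()
    for place in KNOWN_PLACES:
        if place in body:
            return ("mentioned", place)

    # 3. Profile location: derived from the user's account profile.
    profile = activity.get("gnip", {}).get("profileLocations", [])
    if profile:
        return ("profile", profile[0]["displayName"])

    return None


print(georeference({"body": "Flooding in Louisville right now"}))
```

Given the percentages on the slide, most activities would end up at the third step, which is why profile-based rules get so much attention in this use-case.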
  • Social media and early-warning systems
      • Profile Location (old):
          bio_location_contains:louisville -(bio_location_contains:"co " OR bio_location_contains:colorado) -(bio_location_contains:"tn " OR bio_location_contains:tennessee)
      • Profile Location (new):
          profile_locality:louisville profile_region:kentucky
    See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • Social media and early-warning systems. See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • Social media and early-warning systems. See http://blog.gnip.com/tweeting-in-the-rain Parts 1, 2 & 3
  • Apache Kafka @ Gnip
    Kafka is used to help manage streaming traffic with the outside world.
    First application was with outbound streams: Gnip → Customer.
    Helps provide an "on-disk" buffer for client streams. Write data to disk for a short period. If a client disconnects, when they reconnect their data buffer is "backfilled."
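The backfill idea can be illustrated without Kafka at all. The sketch below is an in-memory stand-in, not Gnip's implementation and not the Kafka API: it retains recent activities for a fixed period and, on reconnect, returns whatever the client missed. The class name, method names, and retention value are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class BackfillBuffer:
    """Keeps recent activities so a reconnecting client can catch up."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._items: Deque[Tuple[float, str]] = deque()

    def _expire(self, now: float) -> None:
        # Drop anything older than the retention window.
        cutoff = now - self.retention_seconds
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def append(self, timestamp: float, activity: str) -> None:
        self._items.append((timestamp, activity))
        self._expire(timestamp)

    def backfill(self, disconnected_at: float, now: float) -> List[str]:
        """Activities that arrived after the disconnect and are still retained."""
        self._expire(now)
        return [activity for ts, activity in self._items if ts > disconnected_at]


buffer = BackfillBuffer(retention_seconds=300)
for ts, activity in [(0, "a"), (100, "b"), (200, "c"), (400, "d")]:
    buffer.append(ts, activity)

# Client dropped at t=150 and reconnected at t=400.
missed = buffer.backfill(disconnected_at=150, now=400)
print(missed)
```

A client that stays away longer than the retention window loses data, which is the trade-off behind writing to disk only "for a short period."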
  • Apache Kafka @ Gnip
    Next applied to inbound Publisher streams: Publisher → Gnip.
    Buffers incoming data and helps manage massive volume spikes.
    Spikes are isolated to this ingest tier.
    Downstream applications read data as fast as they can.
  • Apache Cassandra @ Gnip
    Serves a moving window of Twitter data (currently 30 days). Will grow.
    Chosen for its
      • Write-speeds
      • Reliability
      • Redundancy
      • Scalability
  • Apache Cassandra @ Gnip
      • Serves a variety of data services, products and use-cases.
      • For Search we have an Apache Lucene index helping to quickly find Cassandra data.
      • Nearly 50 Cassandra servers across test/staging/production environments.
  • Streaming social media
      curl -ujmoffitt@gnipcentral.com https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json
      curl -v -X POST -ujmoffitt@gnipcentral.com "https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json" -d '{"rules":[{"tag":"demo","value":"weather OR rain OR snow"}]}'
      curl --compressed -v -ujmoffitt@gnipcentral.com "https://stream.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev.json"
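The same calls can be prepared from Python's standard library. The sketch below only builds the requests and does not send them. The URLs and rule payload are taken from the curl commands; the credentials are placeholders, and the JSON content type and helper names are assumptions rather than documented requirements.

```python
import base64
import json
import urllib.request

RULES_URL = "https://api.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev/rules.json"
STREAM_URL = "https://stream.gnip.com:443/accounts/jim/publishers/twitter/streams/track/dev.json"


def basic_auth_header(user: str, password: str) -> str:
    """HTTP Basic credentials, the equivalent of curl's -u option."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_add_rules_request(user: str, password: str, rules: list) -> urllib.request.Request:
    """POST a set of rules, mirroring the second curl command."""
    payload = json.dumps({"rules": rules}).encode("utf-8")
    return urllib.request.Request(
        RULES_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": basic_auth_header(user, password),
            "Content-Type": "application/json",  # assumed; curl -d sends a form type by default
        },
    )


def build_stream_request(user: str, password: str) -> urllib.request.Request:
    """GET the stream and ask for compression, mirroring curl --compressed."""
    return urllib.request.Request(
        STREAM_URL,
        headers={
            "Authorization": basic_auth_header(user, password),
            "Accept-Encoding": "gzip",
        },
    )


# Placeholder credentials for illustration only.
rules_request = build_add_rules_request(
    "user@example.com", "secret", [{"tag": "demo", "value": "weather OR rain OR snow"}]
)
stream_request = build_stream_request("user@example.com", "secret")
print(rules_request.get_method(), rules_request.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the call. Note that `urllib` does not decompress gzip responses automatically, so a real streaming client would need to handle that itself.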
  • Code examples
      • Search GitHub for "Twitter Stream": We've found 793 repository results
      • Python Streaming Connection: HERE
      • Ruby Streaming Connection (using 'curb' libcurl gem): HERE
      • Ruby Streaming Connection (using EventMachine gem): HERE
  • Live Search Demo
      https://search-demo.prod.gnip.com:8443
      https://github.com/gnip/gnip-search-demo
  • Questions?