Filtering From the Firehose: Real Time Social Media Streaming


All Things Cloud Developer Meetup.
Filtering From the Firehose: Real Time Social Media Streaming with Jim Moffitt from Gnip. Gnip is the world's largest and most trusted provider of social data.
Learn about collecting and filtering social media data with streaming APIs. Jim will cover best practices, use case examples and live demos of filtering data from Twitter.

Published in: Technology, Business

Filtering From the Firehose: Real Time Social Media Streaming

  1. 1. Filtering from the Firehose: Real-time streaming of social network data. Jim Moffitt – Developer Advocate, @gnip, @jimmoffitt
  2. 2. Who is this guy and what is he going to talk about?
     • Introduction
     • Social media firehoses
     • Data sources
     • Use-cases
     • Needle in the haystack
     • Filtering from the firehose
     • Example use-case
     • Server-side
     • Apache Kafka
     • Apache Cassandra
     • Client-side
     • HTTP streaming code examples
     • Live streaming and search
  3. 3. What is a firehose?
     • Continuous stream of flexibly structured (JSON) social media activities in near-real time.
     • Potentially extreme amounts of data.
  4. 4. Available firehoses and public APIs
  5. 5. Accessing Social Data for Analytics: Pros and Cons
     • Public APIs. Pros: free. Cons: rate limits, not guaranteed.
     • Crawling/scraping. Pros: open access. Cons: TOS issues, high latency, fragile.
     • Licensed access (publisher provides data "firehose"). Pros: no rate limits, compliant, reliable. Cons: financial investment, not all publishers are covered.
  6. 6. Example firehose volumes (publisher: daily activity)
     • Twitter: 450 M
     • Tumblr: 96 M + 54 M votes
     • Foursquare: 4.3 M
     • Disqus: 1.9 M
     • WordPress Comments: 1.4 M
     • WordPress Posts: 0.6 M
     • GetGlue: 0.6 M
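
To put these daily totals in operational terms, the sketch below converts them to average activities per second. It is only a back-of-the-envelope calculation: the counts are the ones listed on the slide, the function and variable names are illustrative, and real traffic is bursty, so peak rates will sit well above these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily activity counts as listed on the slide.
# Tumblr: only the 96 M figure is used; the separate 54 M votes are excluded.
DAILY_ACTIVITY = {
    "Twitter": 450_000_000,
    "Tumblr": 96_000_000,
    "Foursquare": 4_300_000,
    "Disqus": 1_900_000,
    "WordPress Comments": 1_400_000,
    "WordPress Posts": 600_000,
    "GetGlue": 600_000,
}


def average_per_second(daily_count: int) -> float:
    """Average rate implied by a daily total, assuming an even spread over the day."""
    return daily_count / SECONDS_PER_DAY


if __name__ == "__main__":
    for publisher, count in DAILY_ACTIVITY.items():
        print(f"{publisher:20s} {average_per_second(count):10.1f} activities/sec")
```

At the slide's Twitter volume this works out to roughly 5,200 activities per second on average, which is a useful floor when sizing bandwidth and storage.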
  7. 7. Daily Tweet Activity Count [Figure: tweets per day by date, January through December, one panel per year for 2006 to 2011. The y-axis scale tops out at 5 k in 2006, 200 k in 2007, 1.6 M in 2008, 25 M in 2009, 80 M in 2010, and 250 M in 2011.]
  8. 8. Use-cases for Social Media Analysis
     • Sales & Marketing
     • Brand monitoring
     • Customer Service
     • Public Relations
     • Emergency Response
     • All kinds of academic research…
  9. 9. So you are building something around social media? Some business considerations:
     • Objective – what are the questions that you are trying to answer?
     • Timeframe – real-time or historical use-case (or both)?
     • Coverage – do I need all the data or some statistical sample?
     • Licensing and Terms of Service
     • Budgets
         • Data costs.
         • Software development.
         • Infrastructure (bandwidth, servers, storage).
  10. 10. So you are building something around social media? Some technical considerations:
     • Data transfer protocols: RESTful or 'keep-alive' Streaming?
     • What software language?
     • Bandwidth: what does your peak volume need to be?
     • Data storage
         • How and where are you storing the data?
         • What metadata do you need to store?*
     • Redundant streams?
  11. 11. What data comes with a tweet?
     {"id":",2005:388326436685103105","objectType":"activity","actor":{"objectType":"person","id":"17200003","link":"http://","displayName":"jimmoffitt","postedTime":"2008-11-05T23:06:37.000Z","image":"https://","summary":"Once studied snow hydrology. Recently developed real-time weather monitoring and flood warning software. Have started a new adventure at an amazing company...","links":[{"href":null,"rel":"me"}],"friendsCount":69,"followersCount":71,"listedCount":1,"statusesCount":189,"twitterTimeZone":"Mountain Time (US & Canada)","verified":false,"utcOffset":"-21600","preferredUsername":"jimmoffitt","languages":["en"],"location":{"objectType":"place","displayName":"Longmont, Colorado"},"favoritesCount":17},"verb":"post","postedTime":"2013-10-10T15:33:31.000Z","generator":{"displayName":"TweetDeck","link":"http://"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://"},"link":"http://","body":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http:// @gnip","object":{"objectType":"note","id":",2005:388326436685103105","summary":"Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http:// @gnip","link":"http://","postedTime":"2013-10-10T15:33:31.000Z"},"favoritesCount":0,"twitter_entities":{"hashtags":[],"symbols":[],"urls":[{"url":"http://","expanded_url":"http:// 1Fywpg","display_url":"","indices":[80,102]}],"user_mentions":[{"screen_name":"gnip","name":"Gnip, Inc.","id":16958875,"id_str":"16958875","indices":[103,108]}]},"twitter_filter_level":"medium","twitter_lang":"en","retweetCount":0,"gnip":{"matching_rules":[{"value":"\"All Things Cloud\"","tag":null},{"value":"from:jimmoffitt","tag":null}],"urls":[{"url":"http://","expanded_url":"http://-things-Cloud-PaaS-SaaS-PaaS-XaaS/events/124584092/"}],"klout_score":49,"klout_profile":{"topics":[{"klout_topic_id":"10000000000000000020","displayName":"Tablets","link":"http:// 10000000000000000020"}],"klout_user_id":"26177177599171892","link":"http://"},"language":{"value":"en"},"profileLocations":[{"objectType":"place","geo":{"type":"point","coordinates":[-105.10193,40.16721]},"address":{"country":"United States","countryCode":"US","locality":"Longmont","region":"Colorado"},"displayName":"Longmont, Colorado, United States"}]}}
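
As a rough illustration of working with an activity like this one, the sketch below parses a trimmed excerpt of the slide's JSON and pulls out a few fields that commonly drive filtering and analysis. The field names come from the sample itself; the `summarize` helper and the choice of fields are illustrative assumptions, not part of any Gnip library.

```python
import json

# Trimmed excerpt of the activity shown on the slide (most fields omitted for brevity).
ACTIVITY_JSON = r'''
{
  "verb": "post",
  "postedTime": "2013-10-10T15:33:31.000Z",
  "actor": {
    "preferredUsername": "jimmoffitt",
    "followersCount": 71,
    "location": {"objectType": "place", "displayName": "Longmont, Colorado"}
  },
  "body": "Looking forward to this \"All Things Cloud\" meet-up in Denver next Tuesday 10/15 http:// @gnip",
  "twitter_entities": {"user_mentions": [{"screen_name": "gnip"}]},
  "gnip": {
    "matching_rules": [
      {"value": "\"All Things Cloud\"", "tag": null},
      {"value": "from:jimmoffitt", "tag": null}
    ],
    "language": {"value": "en"}
  }
}
'''


def summarize(activity: dict) -> dict:
    """Hypothetical helper: extract a handful of fields, tolerating missing keys."""
    actor = activity.get("actor", {})
    gnip = activity.get("gnip", {})
    mentions = activity.get("twitter_entities", {}).get("user_mentions", [])
    return {
        "user": actor.get("preferredUsername"),
        "posted": activity.get("postedTime"),
        "profile_location": actor.get("location", {}).get("displayName"),
        "mentions": [m["screen_name"] for m in mentions],
        "matched_rules": [r["value"] for r in gnip.get("matching_rules", [])],
        "language": gnip.get("language", {}).get("value"),
    }


if __name__ == "__main__":
    print(summarize(json.loads(ACTIVITY_JSON)))
```

Using `.get()` with defaults keeps the helper from failing on activities that lack optional metadata, which matters because the structure is flexible from one activity to the next.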
  12. 12. Methods for filtering data
     • Token filter (e.g. "pizza", "beer")
     • Substrings (contains:sport)
     • Exact phrases ("all things cloud")
     • Operators: metadata (geo, language, profiles, account stats, ...)
     • Operators: sampling (e.g. sample:10%)
     • Publisher-specific Operators: hashtags, user mentions/from/to, retweets, ...
     Examples:
         (pizza beer) "all things cloud" profile_region:colorado
         twins (baseball OR minnesota OR sports OR "small market") -(cute OR baby OR olsen OR olson)
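
The difference between token, substring and exact-phrase matching can be sketched in a few lines. This is a simplified illustration written for this transcript, not Gnip's rule engine: the tokenizer, the helper names, and the hand-translated `twins_rule` function are all assumptions, and the metadata and sampling operators are left out.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word-like tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9_#@']+", text.lower())


def matches_token(text: str, token: str) -> bool:
    """Token filter: the whole word must appear."""
    return token.lower() in tokenize(text)


def matches_substring(text: str, fragment: str) -> bool:
    """Substring filter (contains:): the fragment may appear inside a longer word."""
    return fragment.lower() in text.lower()


def matches_phrase(text: str, phrase: str) -> bool:
    """Exact phrase: the phrase tokens must appear consecutively and in order."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return n > 0 and any(words[i:i + n] == target for i in range(len(words) - n + 1))


def twins_rule(text: str) -> bool:
    """Hand translation of the slide's second example rule."""
    required = matches_token(text, "twins")
    any_of = (
        any(matches_token(text, t) for t in ("baseball", "minnesota", "sports"))
        or matches_phrase(text, "small market")
    )
    excluded = any(matches_token(text, t) for t in ("cute", "baby", "olsen", "olson"))
    return required and any_of and not excluded


if __name__ == "__main__":
    print(twins_rule("The Twins won the baseball game"))   # required + any_of, no exclusions
    print(twins_rule("Cute baby twins love baseball"))     # hits the negated group
```

Note how "sport" matches "sportsmanship" as a substring but not as a token, which is why the slide lists the two as separate filtering methods.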
  13. 13. Example use-case: Early-warning systems
     Is there a Twitter 'signal' around local rain and flood events?
     Business logic:
         rain OR raining OR rained OR pouring OR weather OR hail OR lightning OR contains:flood OR "cats and dogs" OR wxreport OR contains:storm OR contains:precip
     See http://-in-the-rain Parts 1, 2 & 3
  14. 14. Social media and early-warning systems
     There are generally three methods for geo-referencing Twitter data:
     • Activity Location: tweets that are geo-tagged.
     • Mentioned Location: parsing the tweet message for geographic location.
     • Profile Location: parsing the Twitter Account Profile location provided by the user.
     • User account profile: 82%
     • Tweet text: 17%
     • Tweet geo-tagging: 1%
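
The three geo-referencing methods can be sketched as three independent checks on a single activity. This is a deliberately naive illustration: the top-level `geo` key for geo-tagged tweets is an assumption (it does not appear in the sample activity shown earlier), the mentioned-location check is a plain text search against a caller-supplied list of place names, and `location_sources` is a hypothetical helper.

```python
def location_sources(activity: dict, known_places: list) -> list:
    """Return which of the three geo-referencing methods could apply to one activity."""
    sources = []

    # Activity location: assumed to appear as a top-level "geo" object on geo-tagged tweets.
    if activity.get("geo"):
        sources.append("activity")

    # Mentioned location: naive case-insensitive search of the message text.
    body = activity.get("body", "").lower()
    if any(place.lower() in body for place in known_places):
        sources.append("mentioned")

    # Profile location: free-text location from the user's account profile.
    if activity.get("actor", {}).get("location", {}).get("displayName"):
        sources.append("profile")

    return sources


if __name__ == "__main__":
    example = {
        "body": "Heavy rain in Louisville tonight",
        "actor": {"location": {"displayName": "Louisville, KY"}},
    }
    print(location_sources(example, ["Louisville"]))
```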
  15. 15. Social media and early-warning systems
     • Profile Location (old):
         bio_location_contains:louisville -(bio_location_contains:"co " OR bio_location_contains:colorado) -(bio_location_contains:"tn " OR bio_location_contains:tennessee)
     • Profile Location (new):
         profile_locality:louisville profile_region:kentucky
  16. 16. Social media and early-warning systems
  17. 17. Social media and early-warning systems
  18. 18. Apache Kafka @ Gnip
     Kafka is used to help manage streaming traffic with the outside world.
     First application was with outbound streams: Gnip → Customer.
     Helps provide an "on-disk" buffer for client streams. Data is written to disk for a short period. If a client disconnects, their data buffer is "backfilled" when they reconnect.
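
The backfill idea can be sketched without Kafka itself: keep recent messages in a retained log, give each one an offset, and let a reconnecting client resume from the last offset it saw. The class below is an in-memory stand-in written for illustration. It is not the Kafka API, it does not write to disk, and it retains by message count rather than by time; `RetainedLog` and its methods are hypothetical names.

```python
from collections import deque


class RetainedLog:
    """In-memory stand-in for a short-retention log with per-message offsets."""

    def __init__(self, max_messages: int):
        # Once full, the deque silently drops the oldest message on each append.
        self._messages = deque(maxlen=max_messages)
        self._next_offset = 0

    def append(self, message: str) -> int:
        """Store a message and return the offset assigned to it."""
        offset = self._next_offset
        self._messages.append((offset, message))
        self._next_offset += 1
        return offset

    def read_from(self, offset: int) -> list:
        """Return every retained (offset, message) pair at or after the given offset."""
        return [(o, m) for o, m in self._messages if o >= offset]


if __name__ == "__main__":
    log = RetainedLog(max_messages=1000)
    for i in range(3):
        log.append(f"tweet-{i}")
    resume_at = 2                      # client saw offsets 0 and 1, then disconnected
    log.append("tweet-3")              # traffic keeps arriving while the client is away
    print(log.read_from(resume_at))    # on reconnect, the missed messages are backfilled
```

If the client stays away longer than the retention window, the oldest messages are gone and the backfill is only partial, which is the trade-off of buffering for "a short period".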
  19. 19. Apache Kafka @ Gnip
     Next applied to inbound Publisher streams: Publisher → Gnip.
     Buffers incoming data and helps manage massive volume spikes.
     Spikes are isolated to this ingest tier.
     Downstream applications read data as fast as they can.
  20. 20. Apache Cassandra @ Gnip
     Serves a moving window of Twitter data (currently 30 days). Will grow.
     Chosen for its
     • Write-speeds
     • Reliability
     • Redundancy
     • Scalability
  21. 21. Apache Cassandra @ Gnip
     • Serves a variety of data services, products and use-cases.
     • For Search we have an Apache Lucene index helping to quickly find Cassandra data.
     • Nearly 50 Cassandra servers across test/staging/production environments.
  22. 22. Streaming social media
     curl - https:// streams/track/dev/rules.json
     curl -v -X POST - "https://" -d '{"rules":[{"tag":"demo","value":"weather OR rain OR snow"}]}'
     curl --compressed -v - "https://"
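
The last curl command, a long-lived compressed HTTP connection, can be approximated in Python with only the standard library. This is a minimal sketch under stated assumptions: the stream URL and credentials are placeholders (the slide's URLs are truncated), HTTP Basic authentication is assumed, the server is assumed to honor gzip, and activities are assumed to arrive as one JSON object per line with blank lines as keep-alives. `iter_activities` and `read_chunks` are hypothetical helper names.

```python
import base64
import json
import urllib.request
import zlib


def iter_activities(compressed_chunks):
    """Yield one dict per JSON line from an iterable of gzip-compressed byte chunks."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
    buffer = b""
    for chunk in compressed_chunks:
        buffer += decompressor.decompress(chunk)
        # A chunk can end mid-activity, so only complete lines are parsed.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line:  # blank lines are treated as keep-alive heartbeats
                yield json.loads(line)


def read_chunks(url, username, password, chunk_size=4096):
    """Open the stream and yield raw (still compressed) chunks as they arrive."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={
        "Authorization": f"Basic {credentials}",
        "Accept-Encoding": "gzip",  # this sketch assumes the server replies with gzip
    })
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read1(chunk_size)  # returns available bytes without waiting to fill
            if not chunk:
                break
            yield chunk


# Usage (placeholder URL and credentials, not a real endpoint):
#   for activity in iter_activities(read_chunks("https://example.invalid/stream.json", "user", "pass")):
#       print(activity.get("body"))
```

Separating the network read from the decompress-and-split step keeps the parsing logic testable without a live connection, and a production client would also need reconnect logic with backoff.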
  23. 23. Code examples
     • Search GitHub for "Twitter Stream": "We've found 793 repository results"
     • Python Streaming Connection: HERE
     • Ruby Streaming Connection (using 'curb' libcurl gem): HERE
     • Ruby Streaming Connection (using EventMachine gem): HERE
  24. 24. Live Search Demo
     https://search-
     https://-search-demo
  25. 25. Questions?