Twitter Big Data

How to access & analyse Twitter big data. Full working example using Storm and RedStorm in Ruby & JRuby. Code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/

Published in: Technology

  1. Twitter Big Data
     Colin Surprenant, @colinsurprenant, Lead ninja
     Warning: this presentation contains a few rage faces
  2. Twitter, Spring 2012
     ● 350 million tweets/day
     ● 140 million active users
     ● >1 million applications using the API
  3. Daily Twitter @ Needium
     50 000 000 processed tweets
     5 000 opportunities
     500 messages sent
     100 GB data
  4. Needium Geo
  5. Anatomy of a tweet
     What's in a tweet? timestamp, user, message, avatar
     Anything else?
  6. How to get the tweets?
     ● Streaming API: subscribe to realtime feeds moving forward
     ● REST Search API: search request on past data (1 week)
  7. Streaming API: public statuses from all users
     ● status/filter: track/location/follow
       ○ 5000 follow user ids
       ○ 400 track keywords
       ○ 25 location boxes
       ○ rate limited
     ● status/sample
       ○ 1% of all public statuses (message id mod 100)
       ○ two status/sample streams will result in the same data
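The "message id mod 100" note explains why two status/sample connections return identical data: the selection depends only on the status id, not on the connection. A minimal sketch of that idea follows. The method name and the choice of residue 0 are assumptions for illustration; the slide does not say which residue Twitter keeps.

```ruby
# Hypothetical model of status/sample: a status is kept when its id falls in
# one fixed residue class modulo 100. The residue (0) is an assumption.
SAMPLE_MODULUS = 100
SAMPLE_RESIDUE = 0

def in_sample?(status_id)
  status_id % SAMPLE_MODULUS == SAMPLE_RESIDUE
end

# Two independent "connections" reading the same id range.
def sample_stream(ids)
  ids.select { |id| in_sample?(id) }
end

ids = (1..10_000).to_a
stream_a = sample_stream(ids)
stream_b = sample_stream(ids)

puts "sampled #{stream_a.size} of #{ids.size} statuses"
puts "streams identical: #{stream_a == stream_b}"
```

Because the filter is deterministic, opening more sample streams adds no coverage; it only duplicates the same 1%.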
  8. Streaming API: per-user streams
     ● User Streams
       ○ all data required to update a user's display
       ○ requires the user's OAuth token
       ○ statuses from followings, direct messages, mentions
       ○ cannot open a large number of user streams from the same host
     ● Site Streams
       ○ multiplexing of multiple User Streams
  9. Streaming API: Firehose
     Need more/full data? Only through partners
     ● gnip.com
     ● datasift.com
     ● filtering/tracking
     ● partial to full Firehose
     What's the catch?
  10. Streaming API: Firehose
      Base Twitter data license: $0.10 per 1000 tweets
      Approx. ~$1 million/month for the full Firehose
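The ~$1 million/month figure can be checked from numbers already in the deck: 350 million tweets/day (slide 2) at $0.10 per 1000 tweets. The sketch below assumes a 30-day month, which is not stated in the slides, and works in integer cents to avoid floating-point rounding.

```ruby
# Figures from the slides.
TWEETS_PER_DAY       = 350_000_000
PRICE_CENTS_PER_1000 = 10          # $0.10 per 1000 tweets
DAYS_PER_MONTH       = 30          # assumption: not stated in the deck

# Monthly license cost in whole dollars for a given daily tweet volume.
def firehose_monthly_cost_dollars(tweets_per_day, cents_per_1000, days)
  cents = tweets_per_day / 1000 * cents_per_1000 * days
  cents / 100
end

cost = firehose_monthly_cost_dollars(TWEETS_PER_DAY, PRICE_CENTS_PER_1000, DAYS_PER_MONTH)
puts "full Firehose: about $#{cost} per month"
```

That works out to $35,000 per day, or $1,050,000 over 30 days, in line with the slide's rounded figure.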
  11. Streaming API: Firehose
      Startup?
  12. Search API
      ● REST API (http request/response)
        ○ search query
        ○ geocode (lat, long, radius)
        ○ result type (mixed/recent/popular)
        ○ since id
      ● max 100 rpp and 1500 results
      ● rate limited (~1 request/sec)
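The Search API parameters listed above can be assembled into a query string as sketched below. The parameter names (`q`, `geocode`, `result_type`, `since_id`, `rpp`, `page`) are my recollection of the 2012-era Search API and should be treated as assumptions; `page` in particular does not appear on the slide. No endpoint URL is included, since the deck does not give one.

```ruby
require 'uri'

MAX_RPP     = 100   # from the slide: max 100 results per page
MAX_RESULTS = 1500  # from the slide: max 1500 results per query

# Build the query string for one search request.
# Parameter names are assumed, not taken from the deck.
def search_query(q:, lat:, long:, radius:, result_type: 'recent',
                 since_id: nil, rpp: MAX_RPP, page: 1)
  params = {
    'q'           => q,
    'geocode'     => "#{lat},#{long},#{radius}",
    'result_type' => result_type,
    'rpp'         => [rpp, MAX_RPP].min, # never ask for more than the cap
    'page'        => page
  }
  params['since_id'] = since_id if since_id
  URI.encode_www_form(params)
end

# Number of pages needed to exhaust the result cap at a given page size.
def max_pages(rpp = MAX_RPP)
  (MAX_RESULTS.to_f / rpp).ceil
end

puts search_query(q: '#montreal', lat: 45.5, long: -73.6, radius: '25km')
puts "pages per query at full page size: #{max_pages}"
```

At 100 results per page the 1500-result cap is reached after 15 requests, so at roughly one request per second a single query can be drained in about 15 seconds.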
  13. Twitter Geo
      NO simple way to grab ALL tweets for a given region
  14. Twitter Geo: Streaming API
      ● status/filter + location (bounding box)
        ○ only tweets with explicit coordinates
        ○ < 10% of all tweets
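To make the limitation concrete, here is a small sketch of bounding-box matching: a tweet can only match when it carries explicit coordinates, so everything else is dropped regardless of where the author is. The tweet hash layout, the `[long, lat]` ordering, and the approximate Montreal box are illustrative assumptions, not values from the deck.

```ruby
# A bounding box given by its south-west and north-east corners.
BoundingBox = Struct.new(:sw_long, :sw_lat, :ne_long, :ne_lat) do
  def include?(long, lat)
    long.between?(sw_long, ne_long) && lat.between?(sw_lat, ne_lat)
  end
end

# Rough box around Montreal (illustrative values only).
MONTREAL_BOX = BoundingBox.new(-73.98, 45.41, -73.47, 45.70)

# Keep only tweets whose explicit coordinates fall inside the box.
# Tweets without coordinates can never match.
def filter_by_box(tweets, box)
  tweets.select do |tweet|
    coords = tweet[:coordinates]
    coords && box.include?(*coords)
  end
end

tweets = [
  { text: 'downtown',        coordinates: [-73.5673, 45.5017] }, # inside
  { text: 'in Toronto',      coordinates: [-79.38, 43.65] },     # outside
  { text: 'no geo, in MTL',  coordinates: nil }                  # dropped
]

filter_by_box(tweets, MONTREAL_BOX).each { |t| puts t[:text] }
```

The third tweet is the problem case the slide points at: it may well come from Montreal, but without coordinates the location filter never sees it.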
  15. Twitter Geo: Streaming API
      ● Firehose
        ○ < 10% of all tweets contain explicit coordinates
        ○ must do reverse geocoding on user profile location
        ○ user profile location is free form
  16. Twitter Geo: Search API
      ● geocode (lat, long, radius)
      ● tweets with explicit coordinates
      ● tweets reverse geocoded from user profile location
      ● location field: free form text (Montreal / Montreal,Qc / Mtl / Mourial)
      ● false positives
      ● REST API: not for frequent polling
      ● rate limited (1 req/sec/ip)
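The free-form location problem can be illustrated with a naive alias matcher. This is only a sketch under stated assumptions: the alias list is taken from the slide's examples, the method names are made up, and real reverse geocoding would need far more than token matching. It also shows how false positives arise.

```ruby
# Aliases taken from the slide's examples, lowercased.
MONTREAL_ALIASES = %w[montreal mtl mourial].freeze

# Lowercase and strip accents so "Montréal" and "Montreal" compare equal.
def normalize_location(text)
  text.to_s.downcase.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Naive match: any token of the profile location equals a known alias.
def montreal?(profile_location)
  tokens = normalize_location(profile_location).split(/[^a-z]+/).reject(&:empty?)
  tokens.any? { |token| MONTREAL_ALIASES.include?(token) }
end

[
  'Montreal', 'Montreal,Qc', 'Mtl', 'Mourial', 'Montréal',
  'Montreal, Wisconsin', # false positive: matches, but is a different place
  'Toronto'
].each do |location|
  puts format('%-22s %s', location, montreal?(location))
end
```

Tightening the rule (for example requiring a province token) reduces false positives but starts missing users who only wrote "Mtl", which is the trade-off the slide hints at.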
  17. Twitter Geo: Solutions?
      That's your job!
      But seriously?
  18. Twitter Geo
      ● search API intelligent polling farm
        ○ adjust polling interval to minimize polling in relation to traffic
      ● streaming API status/filter/follow reader farm?
        ○ find N relevant users from city, # stream readers = N / 5000
        ○ must do reverse geocoding
        ○ user list dynamic update
      ● TOS gray zone
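Both ideas on this slide reduce to small calculations. The sketch below shows one way they might look: the reader count is N / 5000 rounded up, and the polling interval halves when a page comes back full and doubles when it comes back empty. The halving/doubling policy and the 300-second ceiling are assumptions for illustration; the deck only says the interval is adjusted in relation to traffic.

```ruby
FOLLOW_LIMIT = 5000   # from the slides: 5000 follow user ids per stream
MAX_RPP      = 100    # from the slides: max 100 results per page
MIN_INTERVAL = 1.0    # seconds, from the ~1 request/sec rate limit
MAX_INTERVAL = 300.0  # seconds, assumed ceiling

# Number of status/filter follow streams needed to cover N users.
def stream_readers_needed(user_count)
  (user_count.to_f / FOLLOW_LIMIT).ceil
end

# Next polling interval for one search query, given how many new
# results the last poll returned (assumed policy, see lead-in).
def next_interval(current, new_results)
  factor =
    if new_results >= MAX_RPP then 0.5 # full page: traffic is high, poll sooner
    elsif new_results.zero?   then 2.0 # nothing new: back off
    else 1.0                           # steady traffic: keep the pace
    end
  (current * factor).clamp(MIN_INTERVAL, MAX_INTERVAL)
end

puts "readers for 12,000 users: #{stream_readers_needed(12_000)}"
puts "busy query:  10s -> #{next_interval(10.0, 100)}s"
puts "quiet query: 10s -> #{next_interval(10.0, 0)}s"
```

Clamping to the minimum keeps each poller within the stated rate limit no matter how busy a query gets.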
  19. Storm
      Distributed and fault-tolerant realtime computation
      https://github.com/nathanmarz/storm
  20. Storm: the promise
      ● Guaranteed data processing
      ● Horizontal scalability
      ● Fault-tolerance
      ● No intermediate message brokers
      ● Higher level abstraction than message passing
      ● Just works
  21. RedStorm
      JRuby integration & DSL for Storm
      Simplicity of Ruby + power of Storm
      https://github.com/colinsurprenant/redstorm
  22. Storm: typical use cases
      Stream processing
      Continuous computation
  23. Storm concepts: Streams
      Unbounded sequence of tuples
  24. Storm concepts: Spouts
      Source of streams
  25. Storm concepts: Bolts
      Process input streams and produce new streams
  26. Storm concepts: Topology
      Network of spouts and bolts
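The four concepts above can be mimicked in plain Ruby with lazy enumerators. This is an analogy only: it uses no Storm or RedStorm API, runs in a single process, and has none of Storm's distribution or fault tolerance. All names are made up for the illustration.

```ruby
# Spout: source of an unbounded stream of tuples (an endless counter here).
def number_spout
  Enumerator.new do |out|
    n = 0
    loop do
      n += 1
      out << { number: n }
    end
  end.lazy
end

# Bolts: each consumes an input stream and produces a new stream.
SQUARE_BOLT = ->(stream) { stream.map { |t| t.merge(square: t[:number]**2) } }
EVEN_BOLT   = ->(stream) { stream.select { |t| t[:square].even? } }

# Topology: the wiring of a spout into a chain of bolts.
def build_topology
  EVEN_BOLT.call(SQUARE_BOLT.call(number_spout))
end

# The stream is unbounded, so only a finite prefix is pulled from it.
build_topology.first(3).each { |tuple| puts tuple.inspect }
```

Laziness matters here: without `.lazy` the first `map` would try to consume the endless spout and never return.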
  27. Storm: what Storm does
      ● Distributes code
      ● Robust process management
      ● Monitors topologies and reassigns failed tasks
      ● Provides reliability by tracking tuple trees
      ● Routing and partitioning of stream
  28. Tweitgeist
      Live top 10 trending hashtags on Twitter
      DEMO
      Code: https://github.com/colinsurprenant/tweitgeist
      Live demo: http://tweitgeist.needium.com
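The core of a "top 10 trending hashtags" computation is hashtag extraction, a count over a sliding time window, and a ranking. The sketch below shows those three steps in plain Ruby. It is not the Tweitgeist code: the class and method names, the window handling, and the ASCII-only hashtag pattern are all assumptions made for illustration.

```ruby
HASHTAG_PATTERN = /#(\w+)/ # ASCII word characters only (simplification)

# Pull lowercased hashtags out of a tweet message.
def extract_hashtags(message)
  message.scan(HASHTAG_PATTERN).flatten.map(&:downcase)
end

# Counts hashtag occurrences seen within the last `window_seconds`.
class RollingCounter
  def initialize(window_seconds)
    @window = window_seconds
    @events = [] # [timestamp, hashtag] pairs in arrival order
  end

  def add(hashtag, now)
    @events << [now, hashtag]
  end

  # Drop events that fell out of the window, then count the rest.
  def counts(now)
    @events.reject! { |timestamp, _| timestamp <= now - @window }
    @events.each_with_object(Hash.new(0)) { |(_, tag), acc| acc[tag] += 1 }
  end
end

# Rank by descending count, breaking ties alphabetically.
def top_hashtags(counts, limit = 10)
  counts.sort_by { |tag, count| [-count, tag] }.first(limit)
end

counter = RollingCounter.new(60)
[[0, 'Trying #Storm'], [10, 'Hello #Ruby'], [30, '#storm rocks'],
 [50, 'More #STORM'], [70, '#ruby again']].each do |time, message|
  extract_hashtags(message).each { |tag| counter.add(tag, time) }
end

p top_hashtags(counter.counts(80))
```

At time 80 with a 60-second window, the two earliest events have expired, so `storm` is counted twice and `ruby` once.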
  29. Tweitgeist: topology
      ● TwitterStream (Spout): stream reader
      ● ExtractMessage (Bolt): message extract
      ● ExtractHashtag (Bolt): hashtag extract
      ● RollingCount (Bolt): rolling counter
      ● Rank (Bolt): ranking
      ● Merge (Bolt): merging
      Diagram labels: Twitter, Redis queue (×2), UI; groupings shown: shuffle (×2), fields on hashtag (×2), global
      ● Shuffle grouping: tuples are randomly distributed across the bolt's tasks
      ● Fields grouping: the stream is partitioned by the fields specified in the grouping
      ● Global grouping: the entire stream goes to a single one of the bolt's tasks
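The three groupings differ only in how a tuple is mapped to one of a bolt's tasks. The sketch below models that choice in plain Ruby. It is not Storm's implementation: the method names are made up, CRC32 stands in for whatever hash Storm uses, and picking task 0 for the global grouping is an assumption.

```ruby
require 'zlib'

# Shuffle grouping: pick any task at random.
def shuffle_grouping(_tuple, task_count, rng = Random.new)
  rng.rand(task_count)
end

# Fields grouping: tuples with equal values for the chosen fields always
# land on the same task (CRC32 is a stand-in hash, not Storm's).
def fields_grouping(tuple, task_count, fields)
  key = fields.map { |field| tuple.fetch(field) }.join("\u0000")
  Zlib.crc32(key) % task_count
end

# Global grouping: every tuple goes to one single task (task 0 assumed).
def global_grouping(_tuple, _task_count)
  0
end

tuples = %w[storm ruby storm jruby storm].map { |tag| { hashtag: tag } }
tuples.each do |tuple|
  puts format('%-6s shuffle=%d fields=%d global=%d',
              tuple[:hashtag],
              shuffle_grouping(tuple, 4),
              fields_grouping(tuple, 4, [:hashtag]),
              global_grouping(tuple, 4))
end
```

This is why a per-hashtag counter needs a fields grouping on the hashtag: with a shuffle grouping the occurrences of one hashtag would be spread over several tasks and each would hold only a partial count.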
  30. Tweitgeist: topology definition
