Your SlideShare is downloading. ×
0
Twitter                    stream processing                                        Colin SurprenantBig Data Montreal     ...
Twitter - Spring 2012● 350 million   tweets/day● 140 million   active users● >1 million    applications using API
Daily Twitter @ Needium50 000 000 processed tweets     5 000 opportunities       500 messages sent   100GB data
Anatomy of a tweetWhats in a tweet?                                  timestamp                 user                       ...
How to get the tweets?● Streaming API  Subscribe to realtime feeds moving forward● REST Search API  Search request on past...
Streaming APIpublic statuses from all users● status/filter    track/location/follow     ○ 5000 follow user ids     ○ 400 t...
Streaming APIper user streams● User Streams    ○   all data required to update a users display    ○   requires users OAuth...
Streaming APIFirehoseneed more/full data?           only through partners● gnip.com● datasift.com●   filtering/tracking●  ...
Search API● REST API (http request/response)  ○   search query  ○   geocode (lat, long, radius)  ○   result type (mixed/re...
StormDistributed and fault-tolerant realtime computation                         https://github.com/nathanmarz/storm
Storm                     The promise●   Guaranteed data processing●   Horizontal scalability●   Fault-tolerance●   No int...
RedStormStorm: Java + ClosureRedStorm: JRuby integration & DSL for Storm                                     +Ruby + Storm...
StormTypical use cases    Stream          Continuous    Distributed  processing        computation      RPC
StormConcepts                       Streams  Tuple        Tuple    Tuple    Tuple   Tuple           Unbounded sequence of ...
StormConcepts                           Spouts                                            Tuple                           ...
StormConcepts                                          Bolts  Tuple          Tuple                  Tuple                 ...
StormConcepts                  Topology           Network of spouts and bolts
StormConcepts   bolt A   Shuffle       bolt B                           All                                              b...
Storm                   What Storm does●   Distributes code●   Robust process management●   Monitors topologies and reassi...
TweitgeistLive top 10 trending hashtags on Twitter      DEMO                  https://github.com/colinsurprenant/tweitgeis...
TweitgeistArchitecture                      Partitioned    Partitioned                      counting       ranking        ...
TweitgeistArchitecture● Parallel computation across partitions of the stream● Scalable architecture● Work for full Twitter...
TweitgeistStorm Topology               TwitterStream             ExtractMessage             ExtractHashtag                ...
TweitgeistTopology definition
TweitgeistWhere                          Code        https://github.com/colinsurprenant/tweitgeist                     Liv...
Twitter Stream Processing
Upcoming SlideShare
Loading in...5
×

Twitter Stream Processing

7,428

Published on

Process the Twitter stream using Storm & Redstorm with Ruby & JRuby. Full working demo, code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/

Published in: Technology

Transcript of "Twitter Stream Processing"

  1. 1. Twitter stream processing Colin SurprenantBig Data Montreal @colinsurprenantApril 2012 Lead ninja
  2. 2. Twitter - Spring 2012● 350 million tweets/day● 140 million active users● >1 million applications using API
  3. 3. Daily Twitter @ Needium50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data
  4. 4. Anatomy of a tweetWhats in a tweet? timestamp user message avatarAnything else?
  5. 5. How to get the tweets?● Streaming API Subscribe to realtime feeds moving forward● REST Search API Search request on past data (1 week)
  6. 6. Streaming APIpublic statuses from all users● status/filter track/location/follow ○ 5000 follow user ids ○ 400 track keywords ○ 25 location boxes ○ rate limited● status/sample ○ 1% of all public statuses (message id mod 100) ○ two status/sample streams will result in same data
  7. 7. Streaming APIper user streams● User Streams ○ all data required to update a users display ○ requires users OAuth token ○ statuses from followings, direct messages, mentions ○ cannot open large number of user streams from same host● Site Streams ○ multiplexing of multiple User Streams
  8. 8. Streaming APIFirehoseneed more/full data? only through partners● gnip.com● datasift.com● filtering/tracking● $$$ partial to full FirehoseWhats the catch? lots of it
  9. 9. Search API● REST API (http request/response) ○ search query ○ geocode (lat, long, radius) ○ result type (mixed/recent/popular) ○ since id● max 100 rpp and 1500 results● rate limited (~1 request/sec)
  10. 10. StormDistributed and fault-tolerant realtime computation https://github.com/nathanmarz/storm
  11. 11. Storm The promise● Guaranteed data processing● Horizontal scalability● Fault-tolerance● No intermediate message brokers● Higher level abstraction than message passing● Just work
  12. 12. RedStormStorm: Java + ClosureRedStorm: JRuby integration & DSL for Storm +Ruby + Storm on JVM https://github.com/colinsurprenant/redstorm
  13. 13. StormTypical use cases Stream Continuous Distributed processing computation RPC
  14. 14. StormConcepts Streams Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  15. 15. StormConcepts Spouts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Source of streams
  16. 16. StormConcepts Bolts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Processes input streams and produce new streams
  17. 17. StormConcepts Topology Network of spouts and bolts
  18. 18. StormConcepts bolt A Shuffle bolt B All bolt A bolt B random+fair distribution replication to all tasks across tasks Grouping bolt A Fields bolt B bolt A Global bolt B field X field Y partition on specified field entire stream to single task
  19. 19. Storm What Storm does● Distributes code● Robust process management● Monitors topologies and reassigns failed tasks● Provides reliability by tracking tuple trees● Routing and partitioning of stream
  20. 20. TweitgeistLive top 10 trending hashtags on Twitter DEMO https://github.com/colinsurprenant/tweitgeist Live demo: http://tweitgeist.needium.com
  21. 21. TweitgeistArchitecture Partitioned Partitioned counting ranking Parallel extraction Merging Partitioned Partitioned counting ranking Queue Queue N partitions N partitions Twitter stream Frontend
  22. 22. TweitgeistArchitecture● Parallel computation across partitions of the stream● Scalable architecture● Work for full Twitter firehose with very little mods
  23. 23. TweitgeistStorm Topology TwitterStream ExtractMessage ExtractHashtag RollingCount Rank Merge Spout Bolt Bolt Bolt Bolt Bolt Twitter field hashtag field hashtag UI global shuffle shuffle Redis Redis queue queue stream message hashtag rolling ranking merging reader extract extract counter Shuffle grouping: Tuples are randomly distributed across the bolts tasks Fields grouping: The stream is partitioned by the fields specified in the grouping Global grouping: The entire stream goes to a single one of the bolts tasks
  24. 24. TweitgeistTopology definition
  25. 25. TweitgeistWhere Code https://github.com/colinsurprenant/tweitgeist Live demo http://tweitgeist.needium.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×