Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

This talk, given at Devoxx Paris 2014, gives an overview of the lambda architecture and of possible alternatives for implementing it.

Published in: Technology

  1. 1. @fdouetteau#lambdataiku Lambda Architecture @fdouetteau Dataiku, www.dataiku.com Florian Douetteau, CEO Dataiku
  2. 2. @fdouetteau#lambdataiku Topics For Today • WHAT is a lambda architecture? • Examples - Principle • Motivation – Hard Points • HOW do you build a lambda architecture? • Component by component
  3. 3. @fdouetteau#lambdataiku Lambda EVENTS PROCESS STATE SERVE
  4. 4. @fdouetteau#lambdataiku ƛ : SOME USE CASES • Online Advertising • Keep track of number of displays / clicks per position / campaign • Recommender Systems • Keep track of product displays / views / clicks / buys • Statistical Time Line • Keep track of number of tweets per hashtag / hour
  5. 5. @fdouetteau#lambdataiku SQL WAY EVENTS PROCESS STATE SERVE USER1 ITEM1 VIEW USER1 ITEM2 BUY INSERT OR UPDATE VIEWS SET pageviews = pageviews + 1 WHERE user=USER1 … RDBMS SQL
  6. 6. @fdouetteau#lambdataiku Functional Programming Append Only EVENTS PROCESS STATE (APPEND ONLY) SERVE newstate = Fagg(oldstate, Fstore(events)) result = F(state, lastevents, scope)
  7. 7. @fdouetteau#lambdataiku E.g. counting twitter hashtags EVENTS PROCESS STATE SERVE Fmap ( ) = { (#tag, time) -> count } FReduce( hashmap, hashmap ) = fuse count in maps FDisplay( hashmap, events ) = FReduce(hashmap, Fmap(events)) TWEET COUNTS (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 NEW TWEETS TABLE 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar
  8. 8. @fdouetteau#lambdataiku E.g. counting twitter hashtags in “SQL” SERVE TWEET COUNTS TABLE (2014-02-31 13, #foo) -> 8 (2014-02-31 13, #foo2) -> 3 (2014-02-31 13, #foo3) -> 3 (2014-02-31 13, #foo4) -> 1 NEW TWEETS TABLE 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar PARTIAL TWEET COUNT TABLE (2014-02-31 13, #foo) -> 1 (2014-02-31 14, #foo) -> 3 (2014-02-31 14, #foo) -> 3 (2014-02-31 14, #foo) -> NEW TWEET COUNT TABLE (2014-02-31 13, #foo) -> 9 (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 CREATE … AS SELECT time, tag, COUNT(*) GROUP BY TIME, TAG CREATE AS SELECT time, tag, SUM(counts) FROM ( oldtable … UNION partialtable) GROUP BY TIME, TAG SELECT time, tag, SUM(c) FROM ( SELECT time, tag, c FROM oldtable WHERE tag = … UNION SELECT time, tag, c FROM partialtable WHERE tag=… ) INSERT VALUES … RENAME TABLE … EXECUTE EVERY 5 MINUTES EXECUTE EVERY HOUR
  9. 9. @fdouetteau#lambdataiku ƛ : PRINCIPLE EVENTS BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION
  10. 10. @fdouetteau#lambdataiku Backtype Story Capture events and logs from twitter 25TB binary data 100 billion records 400 QPS average Scale 1 -> 150 on peak Took off with a team of 3 engineers with seed funding in 2008 Christopher Golda Michael Montano Nathan Marz Acquired by Twitter (to power twitter trends …) in 2011 Cascalog Storm ElephantDB
  11. 11. @fdouetteau#lambdataiku TWITTER HASHTAGS 2014-02-31 13:14 #foo bar BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 COMPUTE EVERY 5 MINUTES HASHTAG COUNTS FOR THE LAST 5 MINUTES (IN MEMORY) COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)
  12. 12. @fdouetteau#lambdataiku RECOMMENDER SYSTEM BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION USER1 ITEM1 VIEW USER1 ITEM2 BUY USER1 ITEM1 VIEW USER1 ITEM1 VIEW ITEM-ITEM SIMILARITY MATRIX USER -> [ ITEM1, … ITEMn] RECOMMENDATION
  13. 13. @fdouetteau#lambdataiku THREE KEY DRIVERS FOR LAMBDA ARCH
  14. 14. @fdouetteau#lambdataiku DRIVER 1: Support Smooth Evolution 2014-02-31 13:14 #foo bar BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar (2014-02-31 13:14, #foo) -> 3 (2014-02-31 13:14, #foo) -> 3 (1) RECOMPUTE NEW VERSION ON BATCH WHILE KEEPING THE OLD ONE (2014-02-31 13, #foo) -> 3 (2) THEN UPDATE THE ONLINE VERSION
  15. 15. @fdouetteau#lambdataiku DRIVER 2: Real-Time System Offline 2014-02-31 13:14 #foo bar BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION 2014-02-31 13:14 #foo bar 2014-02-31 13:14 #foo bar (2014-02-31 13, #foo) -> 3 (2014-02-31 13, #foo) -> 3 COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK) FALLBACK TO PARTIAL RESULT WHEN REAL-TIME GRID IS OFFLINE
  16. 16. @fdouetteau#lambdataiku DRIVER 3 : CAN'T RECOMPUTE BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION USER1 ITEM1 VIEW USER1 ITEM2 BUY USER1 ITEM1 VIEW USER1 ITEM1 VIEW ITEM-ITEM SIMILARITY MATRIX USER -> [ ITEM1, … ITEMn] RECOMMENDATION
  17. 17. @fdouetteau#lambdataiku PAIN POINTS
  18. 18. @fdouetteau#lambdataiku PAIN POINT 1 : EXACTLY ONCE 2014-02-31 13:14 #foo bar 2014-02-31 13:15 toto 2014-02-31 13:15 tutu 2014-02-31 13:16 #two … … Retry
  19. 19. @fdouetteau#lambdataiku PAIN POINT 2 : DYNAMIC SCALE START AT 100 events per second HOW TO GROW TO 10k events per second without rebuilding everything ?
  20. 20. @fdouetteau#lambdataiku PAIN POINT 3 : SCHEMA CHANGE BATCH VIEW REAL-TIME RESULT BATCH PROC REAL-TIME PROC FEDERATION EVENTS V1 EVENTS V2 MIX OF VERSION 1 AND VERSION 2 !!!!
  21. 21. @fdouetteau#lambdataiku TOOLS AND FRAMEWORK
  22. 22. @fdouetteau#lambdataiku Lambda Architecture Building Blocks Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing
  23. 23. @fdouetteau#lambdataiku Components Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing STORM HDFS MapRed HBASE MEMCACHE MONGODB WEBAPP RABBITMQ FLUME
  24. 24. @fdouetteau#lambdataiku Components Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing
  25. 25. @fdouetteau#lambdataiku Message Queues Kestrel (Single Node) Kafka (LinkedIn, Distributed) RabbitMQ ActiveMQ Micro-Batch, State in Processor Persistent Event, State in Queue, Rich Routing
  26. 26. @fdouetteau#lambdataiku TOPOLOGY : SINGLE PIPE Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing STORM STORM
  27. 27. @fdouetteau#lambdataiku Storm Developed in 2008-2009 at BackType First open source release in 2011 BOLT TUPLE TUPLE TUPLE SPOUT TUPLE
  28. 28. @fdouetteau#lambdataiku Topologies SPOUT SPOUT BOLT BOLT BOLT BOLT This one likely to write in a State This one too
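  29. 29. @fdouetteau#lambdataiku Parse Tweet Bolt
          // Storm 0.9.x package names; later Apache releases use org.apache.storm.*
          import java.util.Map;
          import backtype.storm.task.OutputCollector;
          import backtype.storm.task.TopologyContext;
          import backtype.storm.topology.OutputFieldsDeclarer;
          import backtype.storm.topology.base.BaseRichBolt;
          import backtype.storm.tuple.*;
          public class HashTagParseBolt extends BaseRichBolt {
              private OutputCollector _collector;
              public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                  _collector = collector;
              }
              public void execute(Tuple tweet) {
                  // Assumption: "hashtags" is a space-separated string field, "time" is another field.
                  for (String hashtag : tweet.getStringByField("hashtags").split(" ")) {
                      _collector.emit(new Values(tweet.getValueByField("time"), hashtag));
                  }
              }
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("time", "hashtag"));
              }
          }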
  30. 30. @fdouetteau#lambdataiku Topologies Tweet Spout Parse Tweet Bolt Count Hashtags Bolt Store in Flat File Tweet
  31. 31. @fdouetteau#lambdataiku BALANCING CLUSTER NODE PROCESS EXECUTOR TASK TASK ONE PER TOPOLOGY PER SPOUT OR BOLT EXECUTOR TASK NODE PROCESS REBALANCE
  32. 32. @fdouetteau#lambdataiku (Optional) RELIABILITY • When emitting a tuple from an existing tuple, trace its origin • “Ack” or “Fail” each tuple • If a tuple or its dependent tuples are not fully “acked”: REPLAY
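  33. 33. @fdouetteau#lambdataiku Reliable Parse Tweet
          // Storm 0.9.x package names; later Apache releases use org.apache.storm.*
          import java.util.Map;
          import backtype.storm.task.OutputCollector;
          import backtype.storm.task.TopologyContext;
          import backtype.storm.topology.OutputFieldsDeclarer;
          import backtype.storm.topology.base.BaseRichBolt;
          import backtype.storm.tuple.*;
          public class HashTagParseBolt extends BaseRichBolt {
              private OutputCollector _collector;
              public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                  _collector = collector;
              }
              public void execute(Tuple tweet) {
                  // Assumption: "hashtags" is a space-separated string field, "time" is another field.
                  for (String hashtag : tweet.getStringByField("hashtags").split(" ")) {
                      // Anchor each output on the input tuple so a downstream failure replays it.
                      _collector.emit(tweet, new Values(tweet.getValueByField("time"), hashtag));
                  }
                  _collector.ack(tweet); // input fully processed by this bolt
              }
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("time", "hashtag"));
              }
          }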
  34. 34. @fdouetteau#lambdataiku TOPOLOGY 2 : SHARE RT Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing TRIDENT TRIDENT TRIDENT
  35. 35. @fdouetteau#lambdataiku TRIDENT • Higher Level Operations • Use Storm as an RPC Framework • State “Management”
  36. 36. @fdouetteau#lambdataiku From Schema To Storm Topology
  37. 37. @fdouetteau#lambdataiku How is exactly-once implemented? {user=paul, item=car, event=imp} {user=pierre, item=car, event=imp} {user=1, item=car, event=imp} {user=paul, item=car, event=imp} {user=pierre, item=car, event=imp} {user=pierre, item=car, event=imp} … txid=1 txid=3 txid=2
  38. 38. @fdouetteau#lambdataiku Exactly-Once in state paul -> { car: 2, txid=2 } pierre -> {car : 5, txid=3 } paul -> { car: 3, txid=3 } pierre -> {car : 5, txid=3 } {user=paul, item=car, event=imp} {user=pierre, item=car, event=imp} {user=pierre, item=car, event=imp} txid=3 Keep Track of last transaction in state Transaction does not apply to newer state parts
  39. 39. @fdouetteau#lambdataiku TOPOLOGY 1 : SHARE STATE Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing USE A SINGLE NOSQL SERVICE FOR ALL USE CASES
  40. 40. @fdouetteau#lambdataiku REDIS VARIANT Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing REDIS REDIS REDIS REDIS ALSO USE THE NOSQL AS A MESSAGE QUEUE
  41. 41. @fdouetteau#lambdataiku TOPOLOGY 3 : SHARED PROCESSING Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing
  42. 42. @fdouetteau#lambdataiku SummingBird Single Scala specification that can run in “Batch” or “Real-Time” mode Single Scala Code Run on Storm Topology Run on Cascading (Batch)
  43. 43. @fdouetteau#lambdataiku Tweet SummingBird
          object TweetHashTagCount {
            implicit val timeOf: TimeExtractor[Status] = TimeExtractor(_.getCreatedAt.getTime)
            implicit val batcher = Batcher.ofHours(1)
            ….
            def hashTagCount[P <: Platform[P]](
              source: Producer[P, Status],
              store: P#Store[String, Long]) =
              source
                .filter(_.getText != null)
                .flatMap { tweet: Status => tweet.getHashTags.map(_ -> 1L) }
                .sumByKey(store)
          }
  44. 44. @fdouetteau#lambdataiku Putting this together SUMMINGBIRD CASCADING MAP REDUCE TRIDENT STORM RT STORES (NoSQL, etc.) BATCH STORES (HDFS …) Distributed Batch Computation SQL Level Abstraction Distributed RT Computation COMMON ABSTRACTION STATE RPC
  45. 45. @fdouetteau#lambdataiku WEB-SCALE VARIANT Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing Insert in Mongo Insert in Mongo Mongo MapReduce Mongo Collection Mongo Mongo Aggregation
  46. 46. @fdouetteau#lambdataiku HADOOPY VARIANT Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing INSERT IN HBASE HIVE /MAP REDUCE HBASE HBASE HBASE Queries
  47. 47. @fdouetteau#lambdataiku Integrated Publish Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing
  48. 48. @fdouetteau#lambdataiku SploutSQL
  49. 49. @fdouetteau#lambdataiku SPARK VARIANT Message Queue Batch State Batch Pump Real-Time State Real-Time Views Service Federated View Batch Views Service Batch Processing Real-Time Processing SPARK STREAMING HDFS SPARK MEMORY
  50. 50. @fdouetteau#lambdataiku QUESTIONS QUESTION QUEUE florian.douetteau@dataiku.com MAIL MY MEMORY ANSWER AUDIENCE HAPPY ANSWER TO MAIL Batch Processing Real-Time Processing
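
Slides 6, 7 and 11 describe the core idea as three functions: Fmap turns events into partial counts, FReduce fuses two count maps, and FDisplay serves the batch view fused with the events that the batch has not absorbed yet. The sketch below illustrates that principle in plain Java, without Storm or Hadoop. The class name, the method names and the "hour #tag" key format are assumptions made for illustration; the tweet strings reuse the sample format from the slides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout: this illustrates the principle on the slides,
// not the API of Storm, Trident or SummingBird.
class HashtagLambdaSketch {

    // Fmap: turn raw tweets ("<date> <HH:mm> <text...>") into {"<date> <HH> #tag" -> count}.
    static Map<String, Long> fmap(List<String> tweets) {
        Map<String, Long> counts = new HashMap<>();
        for (String tweet : tweets) {
            String[] parts = tweet.trim().split("\\s+");
            if (parts.length < 3 || parts[1].length() < 2) {
                continue; // no text after the timestamp, or malformed line
            }
            String hour = parts[0] + " " + parts[1].substring(0, 2);
            for (int i = 2; i < parts.length; i++) {
                if (parts[i].startsWith("#")) {
                    counts.merge(hour + " " + parts[i], 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // FReduce: fuse two count maps by summing the counts key by key.
    static Map<String, Long> freduce(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> merged = new HashMap<>(a);
        b.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    // FDisplay: federated result = batch view fused with events not yet batched.
    static Map<String, Long> fdisplay(Map<String, Long> batchView, List<String> recentTweets) {
        return freduce(batchView, fmap(recentTweets));
    }

    public static void main(String[] args) {
        // Batch view computed earlier (e.g. hourly, on disk).
        Map<String, Long> batchView = fmap(Arrays.asList(
                "2014-02-31 13:02 #foo bar",
                "2014-02-31 13:05 #foo baz"));
        // Events that arrived since the last batch run (e.g. kept in memory).
        List<String> recent = Arrays.asList("2014-02-31 13:14 #foo bar");
        System.out.println(fdisplay(batchView, recent));
    }
}
```

Because FReduce is a key-wise sum, the batch view and the real-time view can be computed independently and fused only at serving time, which is what the FEDERATION box in the diagrams does. Here the two batch tweets and the one recent tweet fuse into a count of 3 for #foo in hour 13.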

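
Slides 37 and 38 explain exactly-once counting: events are grouped into batches, each batch carries a txid, and every state entry remembers the txid of the last batch applied to it, so a replayed batch does not count twice. A minimal sketch of that bookkeeping follows, reusing the paul/pierre values from slide 38. The class and method names are hypothetical, and the sketch assumes that txids start at 1, that batches are applied in txid order, and that a replayed batch contains the same events as the original; it is not Trident's actual state API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "keep track of the last transaction in state".
class TxidCounterSketch {

    // One state cell: the running count and the id of the last batch applied to it.
    static final class Cell {
        long count;
        long txid;
        Cell(long count, long txid) {
            this.count = count;
            this.txid = txid;
        }
    }

    final Map<String, Cell> state = new HashMap<>();

    // Apply one batch of events (one user key per event).
    // Calling it again with the same txid, e.g. after a replay, changes nothing.
    void applyBatch(long txid, List<String> users) {
        // Pre-aggregate so that each key is updated once per batch.
        Map<String, Long> delta = new HashMap<>();
        for (String user : users) {
            delta.merge(user, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> e : delta.entrySet()) {
            // txid 0 means "no batch applied yet" (assumes real txids start at 1).
            Cell cell = state.computeIfAbsent(e.getKey(), k -> new Cell(0, 0));
            if (cell.txid >= txid) {
                continue; // this batch is already reflected in this part of the state
            }
            cell.count += e.getValue();
            cell.txid = txid;
        }
    }

    public static void main(String[] args) {
        TxidCounterSketch counter = new TxidCounterSketch();
        counter.state.put("paul", new Cell(2, 2));   // paul -> { car: 2, txid=2 }
        counter.state.put("pierre", new Cell(5, 3)); // pierre -> { car: 5, txid=3 }
        counter.applyBatch(3, Arrays.asList("paul", "pierre", "pierre"));
        System.out.println("paul=" + counter.state.get("paul").count
                + " pierre=" + counter.state.get("pierre").count);
    }
}
```

With the slide's starting state, applying the txid=3 batch moves paul from 2 to 3 and leaves pierre at 5, because pierre's entry was already written by txid=3 before the failure. Storing the txid next to the value is what lets a partially applied batch be retried safely.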