Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview
This talk, given at Devoxx Paris 2014, gives an overview of the lambda architecture and of possible alternatives for its implementation.
    Presentation Transcript

    • @fdouetteau #lambdataiku Lambda Architecture. Florian Douetteau (@fdouetteau), CEO, Dataiku, www.dataiku.com
    • Topics For Today • WHAT is a lambda architecture • Examples - Principle • Motivation - Hard Points • HOW do you build a lambda architecture? • Component by component
    • Lambda: EVENTS -> PROCESS -> STATE -> SERVE
    • ƛ : SOME USE CASES • Online Advertising: keep track of the number of displays / clicks per position / campaign • Recommender Systems: keep track of product displays / views / clicks / buys • Statistical Time Line: keep track of the number of tweets per hashtag / hour
    • SQL WAY (EVENTS -> PROCESS -> STATE -> SERVE). Events: USER1 ITEM1 VIEW; USER1 ITEM2 BUY. Process: INSERT OR UPDATE VIEWS SET pageviews = pageviews + 1 WHERE user=USER1 … State: SQL RDBMS
    • Functional Programming, Append Only (EVENTS -> PROCESS -> STATE (APPEND ONLY) -> SERVE): newstate = Fagg(oldstate, Fstore(events)); result = F(state, lastevents, scope)
    • E.g. counting twitter hashtags (EVENTS -> PROCESS -> STATE -> SERVE): Fmap(events) = { (#tag, time) -> count }; Freduce(hashmap, hashmap) = fuse the counts of both maps; Fdisplay(hashmap, events) = Freduce(hashmap, Fmap(events)). TWEET COUNTS: (2014-02-31 13, #foo) -> 3 (row repeated on the slide). NEW TWEETS TABLE: 2014-02-31 13:14 #foo bar (row repeated on the slide)
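The three functions on the hashtag-counting slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `HashtagCounts`, the "hour tag" string key and the fixed-width line format (copied from the slide's sample rows) are hypothetical choices, not something the talk prescribes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Append-only hashtag counting: count maps are merged into new maps, never updated in place. */
final class HashtagCounts {

    /** Fmap: raw tweet lines ("yyyy-MM-dd HH:mm text...") -> partial map of "hour tag" -> count. */
    static Map<String, Long> fmap(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String hour = line.substring(0, 13);           // assumed layout: "2014-02-31 13"
            for (String token : line.split("\\s+")) {
                if (token.startsWith("#")) {
                    counts.merge(hour + " " + token, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    /** Freduce: fuse two count maps into a fresh map, leaving both inputs untouched. */
    static Map<String, Long> freduce(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> merged = new HashMap<>(a);
        b.forEach((key, n) -> merged.merge(key, n, Long::sum));
        return merged;
    }

    /** Fdisplay: stored counts combined with the events that are not folded in yet. */
    static Map<String, Long> fdisplay(Map<String, Long> stored, List<String> newLines) {
        return freduce(stored, fmap(newLines));
    }
}
```

Because `freduce` only adds counts, partial maps can be merged in any order, which is what lets the stored state stay append-only.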
    • E.g. counting twitter hashtags in “SQL” (SERVE). TWEET COUNTS TABLE: (2014-02-31 13, #foo) -> 8; (2014-02-31 13, #foo2) -> 3; (2014-02-31 13, #foo3) -> 3; (2014-02-31 13, #foo4) -> 1. NEW TWEETS TABLE: 2014-02-31 13:14 #foo bar (row repeated on the slide). PARTIAL TWEET COUNT TABLE: (2014-02-31 13, #foo) -> 1; (2014-02-31 14, #foo) -> 3; (2014-02-31 14, #foo) -> 3; (2014-02-31 14, #foo) -> NEW TWEET COUNT TABLE: (2014-02-31 13, #foo) -> 9; (2014-02-31 13, #foo) -> 3; (2014-02-31 13, #foo) -> 3; (2014-02-31 13, #foo) -> 3. Queries: CREATE … AS SELECT time, tag, COUNT(*) GROUP BY time, tag; CREATE AS SELECT time, tag, SUM(counts) FROM ( oldtable … UNION ALL partialtable ) GROUP BY time, tag; SELECT time, tag, SUM(c) FROM ( SELECT time, tag, c FROM oldtable WHERE tag = … UNION ALL SELECT time, tag, c FROM partialtable WHERE tag = … ) GROUP BY time, tag; INSERT VALUES …; RENAME TABLE … Schedules: EXECUTE EACH 5 MINUTES; EXECUTE EACH HOUR
    • ƛ : PRINCIPLE. EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; BATCH VIEW + REAL-TIME RESULT -> FEDERATION
    • Backtype Story: Capture events and logs from Twitter. 25 TB of binary data, 100 billion records, 400 QPS on average, scale 1 -> 150 on peak. Took off with a team of 3 engineers (Christopher Golda, Michael Montano, Nathan Marz) with seed funding in 2008. Acquired by Twitter (powers Twitter trends …) in 2011. Cascalog, Storm, ElephantDB
    • TWITTER HASHTAGS. Events: 2014-02-31 13:14 #foo bar (repeated). REAL-TIME PROC: COMPUTE EVERY 5 MINUTES HASHTAG COUNTS FOR THE LAST 5 MINUTES (IN MEMORY) -> REAL-TIME RESULT: (2014-02-31 13, #foo) -> 3. BATCH PROC: COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK) -> BATCH VIEW: (2014-02-31 13, #foo) -> 3. Both feed the FEDERATION
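The federation step on this slide can be illustrated with a minimal Java sketch. It assumes (this is not stated on the slide) that the real-time view holds only events that the batch view has not absorbed yet, so the two counts can simply be added. `FederatedView` and its method names are hypothetical.

```java
import java.util.Map;

/** Federation layer: answers a query by combining the batch view and the real-time view. */
final class FederatedView {
    private final Map<String, Long> batchView;     // recomputed every hour in the slide's example
    private final Map<String, Long> realTimeView;  // recent events only, kept in memory

    FederatedView(Map<String, Long> batchView, Map<String, Long> realTimeView) {
        this.batchView = batchView;
        this.realTimeView = realTimeView;
    }

    /** Total for one key; correct only if the two views cover disjoint sets of events. */
    long count(String key) {
        return batchView.getOrDefault(key, 0L) + realTimeView.getOrDefault(key, 0L);
    }
}
```

If the real-time side is unavailable, passing an empty map for `realTimeView` degrades the answer to the batch-only, partial result.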
    • RECOMMENDER SYSTEM. Events: USER1 ITEM1 VIEW; USER1 ITEM2 BUY; USER1 ITEM1 VIEW; USER1 ITEM1 VIEW. BATCH PROC -> BATCH VIEW; REAL-TIME PROC -> REAL-TIME RESULT; FEDERATION. Labels on the slide: ITEM-ITEM SIMILARITY MATRIX; USER -> [ ITEM1, … ITEMn ]; RECOMMENDATION
    • THREE KEY DRIVERS FOR LAMBDA ARCH
    • DRIVER 1: Support Smooth Evolution. Events: 2014-02-31 13:14 #foo bar (repeated). Existing views: (2014-02-31 13:14, #foo) -> 3. (1) RECOMPUTE NEW VERSION ON BATCH WHILE KEEPING THE OLD ONE: (2014-02-31 13, #foo) -> 3. (2) THEN UPDATE THE ONLINE VERSION
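One way to picture the two steps of Driver 1 is an atomic swap of an immutable view. The sketch below is an assumption about how such a swap could be done in a single JVM; `SwappableBatchView` and `publish` are hypothetical names, and a real deployment would swap tables or files in the serving store instead.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Serves one immutable batch view and lets a fully recomputed view replace it atomically. */
final class SwappableBatchView {
    private final AtomicReference<Map<String, Long>> live;

    SwappableBatchView(Map<String, Long> initial) {
        this.live = new AtomicReference<>(Map.copyOf(initial));
    }

    /** Readers always see either the old view or the new one, never a half-built mix. */
    long get(String key) {
        return live.get().getOrDefault(key, 0L);
    }

    /** Step (2): publish a view recomputed elsewhere in step (1); returns the previous view. */
    Map<String, Long> publish(Map<String, Long> recomputed) {
        return live.getAndSet(Map.copyOf(recomputed));
    }
}
```

The old view stays online for the whole recomputation, and the returned previous view can be kept around for rollback.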
    • DRIVER 2: Real-Time System Offline. Events: 2014-02-31 13:14 #foo bar (repeated). BATCH PROC: COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK) -> BATCH VIEW: (2014-02-31 13, #foo) -> 3. FALLBACK TO PARTIAL RESULT WHEN REAL-TIME GRID IS OFFLINE
    • DRIVER 3: CAN'T RECOMPUTE. Events: USER1 ITEM1 VIEW; USER1 ITEM2 BUY; USER1 ITEM1 VIEW; USER1 ITEM1 VIEW. BATCH PROC -> BATCH VIEW; REAL-TIME PROC -> REAL-TIME RESULT; FEDERATION. Labels on the slide: ITEM-ITEM SIMILARITY MATRIX; USER -> [ ITEM1, … ITEMn ]; RECOMMENDATION
    • PAIN POINTS
    • PAIN POINT 1 : EXACTLY ONCE. Events: 2014-02-31 13:14 #foo bar; 2014-02-31 13:15 toto; 2014-02-31 13:15 tutu; 2014-02-31 13:16 #two … … Retry
    • PAIN POINT 2 : DYNAMIC SCALE. START AT 100 events per second. HOW TO GROW TO 10k events per second without rebuilding everything?
    • PAIN POINT 3 : SCHEMA CHANGE. [building-blocks diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION] EVENTS V1, EVENTS V2: MIX OF VERSION 1 AND VERSION 2 !!!!
    • TOOLS AND FRAMEWORK
    • Lambda Architecture Building Blocks: Message Queue; Batch Pump; Batch State; Batch Processing; Batch Views Service; Real-Time Processing; Real-Time State; Real-Time Views Service; Federated View
    • Components [building-blocks diagram], annotated with example technologies: STORM, HDFS, MapRed, HBASE, MEMCACHE, MONGODB, WEBAPP, RABBITMQ, FLUME
    • Components [building-blocks diagram]
    • Message Queues: Kestrel (Single Node); Kafka (LinkedIn, Distributed); RabbitMQ; ActiveMQ. Two styles: Micro-Batch, State in Processor; Persistent Event, State in Queue, Rich Routing
    • TOPOLOGY : SINGLE PIPE [building-blocks diagram] STORM, STORM
    • Storm: Developed in 2008-2009 at BackType. First open source release in 2011. Diagram: SPOUT, BOLT, TUPLE
    • Topologies: SPOUT, SPOUT, BOLT, BOLT, BOLT, BOLT. “This one is likely to write to a State.” “This one too.”
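    • Parse Tweet Bolt:
        import java.util.List;
        import java.util.Map;
        import backtype.storm.task.*;                      // OutputCollector, TopologyContext
        import backtype.storm.topology.*;                  // OutputFieldsDeclarer
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.*;                     // Fields, Tuple, Values

        public class HashTagParseBolt extends BaseRichBolt {
            OutputCollector _collector;

            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                _collector = collector;
            }

            public void execute(Tuple tweet) {
                // Assumes the spout emits a "hashtags" field holding a List<String> and a "time" field.
                for (String hashtag : (List<String>) tweet.getValueByField("hashtags")) {
                    _collector.emit(new Values(tweet.getValueByField("time"), hashtag));
                }
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("time", "hashtag"));
            }
        }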
    • Topologies: Tweet Spout -> Parse Tweet Bolt -> Count HashTags Bolt -> Store in Flat File (Tweet)
    • BALANCING: CLUSTER > NODE > PROCESS (ONE PER TOPOLOGY) > EXECUTOR (PER SPOUT OR BOLT) > TASK. REBALANCE
    • (Optional) RELIABILITY • When emitting a tuple from an existing tuple, trace its origin • “Ack” or “Fail” each tuple • If a tuple or its dependent tuples are not fully “acked”: REPLAY
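    • Reliable Parse Tweet:
        import java.util.List;
        import java.util.Map;
        import backtype.storm.task.*;                      // OutputCollector, TopologyContext
        import backtype.storm.topology.*;                  // OutputFieldsDeclarer
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.*;                     // Fields, Tuple, Values

        public class HashTagParseBolt extends BaseRichBolt {
            OutputCollector _collector;

            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                _collector = collector;
            }

            public void execute(Tuple tweet) {
                // Assumes the spout emits a "hashtags" field holding a List<String> and a "time" field.
                for (String hashtag : (List<String>) tweet.getValueByField("hashtags")) {
                    // Anchor each emitted tuple to the input tuple so failures can be traced back.
                    _collector.emit(tweet, new Values(tweet.getValueByField("time"), hashtag));
                }
                _collector.ack(tweet);
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("time", "hashtag"));
            }
        }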
    • TOPOLOGY 2 : SHARE RT [building-blocks diagram] TRIDENT, TRIDENT, TRIDENT
    • TRIDENT • Higher Level Operations • Use Storm as an RPC Framework • State “Management”
    • From Schema To Storm Topology
    • How is exactly-once implemented? Events: {user=paul, item=car, event=imp}; {user=pierre, item=car, event=imp}; {user=1, item=car, event=imp}; {user=paul, item=car, event=imp}; {user=pierre, item=car, event=imp}; {user=pierre, item=car, event=imp}; … Batch labels on the slide: txid=1, txid=3, txid=2
    • Exactly-Once in state. State before: paul -> { car: 2, txid=2 }; pierre -> { car: 5, txid=3 }. Batch txid=3: {user=paul, item=car, event=imp}; {user=pierre, item=car, event=imp}; {user=pierre, item=car, event=imp}. State after: paul -> { car: 3, txid=3 }; pierre -> { car: 5, txid=3 }. Keep track of the last transaction in state. A transaction does not apply to newer state parts
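The txid check on this slide can be sketched as follows. This is a simplified illustration, not Trident's actual state API: `TransactionalCounts`, `Entry` and `applyBatch` are hypothetical names, and the sketch assumes that batches are applied in txid order and that a replayed batch carries the same events as the original attempt.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-key counts that remember the last transaction id applied to each key. */
final class TransactionalCounts {

    /** Stored value: the count plus the txid of the last batch folded into it. */
    static final class Entry {
        final long count;
        final long txid;

        Entry(long count, long txid) {
            this.count = count;
            this.txid = txid;
        }
    }

    private final Map<String, Entry> state = new HashMap<>();

    /** Applies one batch of per-key increments; keys already at this txid (or newer) are skipped. */
    void applyBatch(long txid, Map<String, Long> increments) {
        increments.forEach((key, delta) -> {
            Entry current = state.get(key);
            if (current != null && current.txid >= txid) {
                return; // this part of the state already includes the batch: replay is a no-op
            }
            long base = (current == null) ? 0L : current.count;
            state.put(key, new Entry(base + delta, txid));
        });
    }

    long count(String key) {
        Entry e = state.get(key);
        return e == null ? 0L : e.count;
    }
}
```

Replaying the slide's batch txid=3 against paul -> {2, txid=2} and pierre -> {5, txid=3} updates only paul, which matches the "after" state shown on the slide.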
    • TOPOLOGY 1 : SHARE STATE [building-blocks diagram] USE A SINGLE NOSQL SERVICE FOR ALL USE CASES
    • REDIS VARIANT [building-blocks diagram] REDIS, REDIS, REDIS, REDIS. ALSO USE THE NOSQL AS A MESSAGE QUEUE
    • TOPOLOGY 3 : SHARED PROCESSING [building-blocks diagram]
    • SummingBird: a single Scala specification that can run in “Batch” or “Real-Time” mode. Single Scala Code -> Run on Storm Topology; Run on Cascading (Batch)
    • Tweet SummingBird:
        import com.twitter.summingbird._
        import com.twitter.summingbird.batch.Batcher
        import twitter4j.Status

        object TweetHashTagCount {
          implicit val timeOf: TimeExtractor[Status] = TimeExtractor(_.getCreatedAt.getTime)
          implicit val batcher = Batcher.ofHours(1)
          ….
          def hashTagCount[P <: Platform[P]](
            source: Producer[P, Status],
            store: P#Store[String, Long]) =
            source
              .filter(_.getText != null)
              // getHashtagEntities is the twitter4j accessor; the slide wrote getHashTags
              .flatMap { tweet: Status => tweet.getHashtagEntities.map(_.getText -> 1L) }
              .sumByKey(store)
        }
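The idea of writing the job once and running it in both modes can be shown without SummingBird itself. The Java sketch below is a loose analogy, not SummingBird's API: `SingleSpec`, `HASHTAGS`, `runBatch` and `onEvent` are hypothetical names. Both runners share the same extraction logic and the same additive merge, which is why they end up with the same counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** One job definition, two ways of running it. */
final class SingleSpec {

    /** The job logic, written once: extract the hashtags of one tweet text. */
    static final Function<String, List<String>> HASHTAGS = text -> {
        List<String> tags = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.startsWith("#")) {
                tags.add(token);
            }
        }
        return tags;
    };

    /** "Batch" runner: folds a complete data set into a fresh store. */
    static Map<String, Long> runBatch(Function<String, List<String>> job, List<String> tweets) {
        Map<String, Long> store = new HashMap<>();
        for (String tweet : tweets) {
            onEvent(job, store, tweet);
        }
        return store;
    }

    /** "Real-time" runner: folds a single event into an existing store as it arrives. */
    static void onEvent(Function<String, List<String>> job, Map<String, Long> store, String tweet) {
        for (String key : job.apply(tweet)) {
            store.merge(key, 1L, Long::sum);
        }
    }
}
```

Calling `onEvent` once per tweet on a long-lived store gives the same map as one `runBatch` call over the same tweets, because summing counts per key does not depend on how the input is chunked.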
    • Putting this together: SUMMINGBIRD (COMMON ABSTRACTION); CASCADING (SQL Level Abstraction); MAP REDUCE (Distributed Batch Computation); TRIDENT (STATE, RPC); STORM (Distributed RT Computation); RT STORES (NoSQL, etc.); BATCH STORES (HDFS …)
    • WEB-SCALE VARIANT [building-blocks diagram] Insert in Mongo; Insert in Mongo; Mongo MapReduce; Mongo Collection; Mongo; Mongo Aggregation
    • HADOOPY VARIANT [building-blocks diagram] INSERT IN HBASE; HIVE / MAP REDUCE; HBASE; HBASE; HBASE Queries
    • Integrated Publish [building-blocks diagram]
    • SploutSQL
    • SPARK VARIANT [building-blocks diagram] SPARK STREAMING; HDFS; SPARK; MEMORY
    • QUESTIONS: QUESTION QUEUE; florian.douetteau@dataiku.com (MAIL); MY MEMORY; ANSWER; AUDIENCE; HAPPY; ANSWER TO MAIL; Batch Processing; Real-Time Processing