Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!

Lambda Architecture is a useful framework for thinking about the design of big data applications. The framework was initially developed at Twitter. In this presentation you will learn, based on concrete examples, how to build and deploy scalable and fault-tolerant applications, with a focus on Big Data and Hadoop.

This presentation was delivered at the OOP conference, Munich, Feb 2016

  1. Lambda Architecture: The Best Way to Build Scalable and Reliable Applications! (© 2016 MapR Technologies)
  2. {“about” : “me”} Tugdual “Tug” Grall (@tgrall)
     • MapR: Technical Evangelist
     • MongoDB: Technical Evangelist
     • Couchbase: Technical Evangelist
     • eXo: CTO
     • Oracle: Developer/Product Manager, mainly Java/SOA
     • Developer in consulting firms
     • Web: @tgrall, http://tgrall.github.io, tgrall
     • NantesJUG co-founder
     • Pet project: http://www.resultri.com
     • tug@mapr.com, tugdual@gmail.com
  3. Big Data & Hadoop In Production
  4. Data Warehouse Optimization
  5. Data Hub
     • Choose the best “connector”: File, Sqoop, ETL, …
     • Use the aggregated data: in your applications, to update other systems, as an Open Data API, …
     • Diagram labels: Customer DB, Customer DB, Logs, …, Hadoop, NoSQL
  6. Financial Services
     • Diagram labels: Fraud detection, Personalized offers, Fraud investigation tool, Fraud investigator, Fraud model, Recommendations table, Clickstream analysis, Online transactions, MapR Distribution for Hadoop, Analytics, Real-time Operational Applications, Interactive marketer
  7. Fault Tolerance
  8. Fault Tolerance: hardware, software, developer?
  9. Human fault tolerance
  13. Lambda Architecture to the rescue (λ)
  14. A little bit of history…
      • Defined by Nathan Marz
      • Ex-BackType, Twitter
      • In a new startup
      • Creator of …
        – Storm
        – Cascalog
        – ElephantDB
  15. Lambda Architecture Requirements
      • Fault-tolerant against both hardware failures and human errors
      • Supports a variety of use cases, including low-latency querying as well as updates
      • Linear scale-out capabilities
      • Extensible, so that the system is manageable and can accommodate newer features easily
  17. Lambda Architecture (diagram labels)
      • NEW DATA STREAM
      • BATCH LAYER: immutable master data, batch recompute, precompute views
      • SPEED LAYER: process stream, increment views, real-time views (View 1, View 2, … View N)
      • SERVING LAYER: batch views (View 1, View 2, … View N), merge
      • QUERY
  18. Data Ingestion
      • All data entering the system are dispatched to both:
        – the batch layer
        – the speed layer
      • Diagram labels: NEW DATA STREAM, BATCH LAYER, SPEED LAYER
  19. Batch Layer
      • Manages the master dataset, an immutable, append-only set of raw data
      • Precomputes the results of arbitrary query functions; these results are called batch views
      • Diagram labels: BATCH LAYER, immutable master data, batch recompute, precompute views, batch views (View 1, View 2, … View N)
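
The two batch-layer responsibilities can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the in-memory list, the function names `append_event` and `recompute_batch_view`, and the choice of "events per action" as the view are all hypothetical stand-ins, not part of the talk. The sample rows reuse flight events shown later in the deck.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the immutable master dataset.
master_dataset = []

def append_event(event):
    """Append a raw event. Existing records are never updated or deleted."""
    master_dataset.append(tuple(event))

def recompute_batch_view(events):
    """Rebuild a batch view from scratch: number of events per action."""
    return Counter(action for _timestamp, _airport, _flight, action in events)

append_event(("2016-02-04T10:00:00", "MUC", "EY123", "take-off"))
append_event(("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"))
append_event(("2016-02-04T10:07:00", "AMS", "BA99", "take-off"))
append_event(("2016-02-04T10:09:00", "LHR", "LH17", "landing"))

# The view is always derived from the full master dataset, never patched in place.
batch_view = recompute_batch_view(master_dataset)
```

Because the view is a pure function of the raw events, a bug in `recompute_batch_view` can be fixed and the view rebuilt without losing data.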
  20. Speed Layer
      • Serves requests that are subject to low-latency requirements
      • Uses fast, incremental algorithms and deals with recent data only
      • Diagram labels: SPEED LAYER, process stream, increment views, real-time views (View 1, View 2, … View N)
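
As a rough sketch of what "incremental" means here, the snippet below updates a view one event at a time instead of rescanning all data. The names `realtime_view` and `on_event` are hypothetical, and a real speed layer would run inside a stream processor rather than a Python loop.

```python
from collections import Counter

# Hypothetical real-time view: events per action, for recent data only.
realtime_view = Counter()

def on_event(event):
    """Fold a single new event into the view without rescanning older data."""
    _timestamp, _airport, _flight, action = event
    realtime_view[action] += 1

recent_events = [
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
]
for event in recent_events:
    on_event(event)
```

Each update costs constant time, which is what keeps latency low; the trade-off is that mistakes in the update logic accumulate until the batch layer recomputes the view.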
  21. Serving Layer
      • Indexes the batch views so that they can be queried ad hoc with low latency
      • Diagram labels: SERVING LAYER, query, merge, batch views (View 1, View 2, … View N), real-time views (View 1, View 2, … View N)
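
The merge step in the diagram can be illustrated as follows. This sketch assumes the views hold additive counts and that the real-time view covers only events the batch view has not absorbed yet; the function name `merged_count` and all numbers are made up for illustration.

```python
def merged_count(key, batch_view, realtime_view):
    """Answer a query by adding the batch count and the real-time count."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

# Hypothetical view contents, for illustration only.
batch_view = {"take-off": 120, "landing": 115}
realtime_view = {"take-off": 3}

takeoffs = merged_count("take-off", batch_view, realtime_view)
landings = merged_count("landing", batch_view, realtime_view)
```

Views that are not simple sums (distinct counts, for example) need a merge function suited to their type; addition is just the simplest case.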
  22. Lambda Architecture: Compensate (diagram labels: Batch, time, not absorbed, now)
  23. Lambda Architecture: Immutable Data + Views (http://openflights.org)
  24. Lambda Architecture: Immutable Data + Views
      Immutable master dataset:
      timestamp            airport  flight  action
      2016-02-04T10:00:00  MUC      EY123   take-off
      2016-02-04T10:05:00  BRU      SAS45   take-off
      2016-02-04T10:07:00  AMS      BA99    take-off
      2016-02-04T10:09:00  LHR      LH17    landing
      2016-02-04T10:10:00  CDG      AF03    landing
      2016-02-04T10:10:00  FCO      AZ501   take-off
  25. Lambda Architecture: Immutable Data + Views
      Master dataset:
      timestamp            airport  flight  action
      2016-02-04T10:00:00  MUC      EY123   take-off
      2016-02-04T10:05:00  BRU      SAS45   take-off
      2016-02-04T10:07:00  AMS      BA99    take-off
      2016-02-04T10:09:00  LHR      LH17    landing
      2016-02-04T10:10:00  CDG      AF03    landing
      2016-02-04T10:10:00  FCO      AZ501   take-off
      View "air-borne": 2307
      View "air-borne per airline":
      airline  planes
      AF       59
      AZ       23
      BA       167
      EY       19
      LH       201
      SAS      28
      View "airport load":
      airport  planes
      AMS      69
      CDG      44
      BRU      31
      FCO      10
      HEL      17
      LHR      101
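
To make the relationship between the event log and the three views concrete, here is a minimal Python sketch that derives them from the six events listed on the slide. The totals on the slide (2307 planes and so on) come from a full history that is not in the transcript, so this sketch only computes the net change contributed by those six rows. The helper `airline_of` assumes the airline code is the leading run of letters in the flight number, which is a simplification made for this example.

```python
from collections import Counter
from itertools import takewhile

# The six events shown on the slide: (timestamp, airport, flight, action).
EVENTS = [
    ("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    ("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    ("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    ("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
]

def airline_of(flight):
    """Assumption: the airline code is the leading letters of the flight number."""
    return "".join(takewhile(str.isalpha, flight))

def compute_views(events):
    """Derive the net change of each view from raw events."""
    airborne = 0
    per_airline = Counter()
    airport_load = Counter()
    for _timestamp, airport, flight, action in events:
        delta = 1 if action == "take-off" else -1
        airborne += delta                         # planes in the air
        per_airline[airline_of(flight)] += delta  # planes in the air per airline
        airport_load[airport] -= delta            # planes on the ground per airport
    return airborne, per_airline, airport_load

airborne, per_airline, airport_load = compute_views(EVENTS)
```

Since the events are never modified, any new view (delays per airline, say) can be added later by writing another function over the same log.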
  26. Implementation
  27. Lambda Architecture (same diagram as slide 17)
  28. Batch Layer: View Generation
      • Diagram labels: Events, “Raw” Storage (Master Data), Processing, Aggregated Data (View 1, View 2)
  30. Apache Spark
      • Cluster computing platform
      • Extends the “MapReduce” model with:
        – Streaming
        – Interactive analytics
      • Runs in memory
  31. Spark components
      • Spark SQL
      • Spark Streaming (Streaming)
      • MLlib (Machine Learning)
      • GraphX (Graph Computation)
      • Spark Core (General execution engine)
      • Mesos, Hadoop YARN
      • Distributed File System (HDFS, MapR-FS, S3, …)
  32. Spark Jobs
      • Driver program (application):
          sc = new SparkContext
          rDD = sc.textFile("hdfs://…")
          rDD.map
      • Cluster manager
      • Worker: executor running tasks
      • Worker: executor running tasks
  33. Spark Resilient Distributed Datasets (“RDD”)
      • sc.textFile loads the data into the Sensor RDD, split into partitions P1 to P4
      • Partitions are spread across worker (W) executors: one holds P4, one holds P1 and P3, one holds P2
      • Sample records:
          P1  8213034705, 95, 2.927373, jake7870, 0……
          P2  8213034705, 115, 2.943484, Davidbresler2, 1….
          P3  8213034705, 100, 2.951285, gladimacowgirl, 58…
          P4  8213034705, 117, 2.998947, daysrus, 95….
  34. Spark Resilient Distributed Datasets
      • A transformation such as filter() turns an RDD into a new RDD
      • An action such as count() turns an RDD into a value
  35. Transformations
      • Process an RDD, return an RDD
      • Examples:
        – map(): one value => another value
        – mapToPair(): one value => a tuple
        – filter(): filters values/tuples on a given condition
        – groupByKey(): groups values by key
        – reduceByKey(): aggregates values by key
        – join(), cogroup(), …: join RDDs
  36. Actions
      • Process an RDD, return a value
      • Examples:
        – count(): counts the number of items in the dataset
        – first(): returns the first entry
        – take(n): returns an array of the first n elements
        – foreach(): applies a function to each element
        – collect(): returns all elements
        – saveAsTextFile(): saves each element to files
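
The split between transformations (RDD in, RDD out) and actions (RDD in, value out) can be mimicked with a toy class. `MiniRDD` below is a hypothetical, single-machine stand-in written for this example; it is not the Spark API, it is neither distributed nor lazy, and its method names only loosely mirror the ones listed above.

```python
class MiniRDD:
    """Toy in-memory stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, items):
        self._items = list(items)

    # Transformations: each returns a new MiniRDD and leaves the original untouched.
    def map(self, fn):
        return MiniRDD(fn(x) for x in self._items)

    def filter(self, predicate):
        return MiniRDD(x for x in self._items if predicate(x))

    def reduce_by_key(self, fn):
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return MiniRDD(merged.items())

    # Actions: each returns a plain value.
    def count(self):
        return len(self._items)

    def collect(self):
        return list(self._items)

lines = MiniRDD(["MUC take-off", "BRU take-off", "LHR landing"])
pairs = lines.map(lambda line: (line.split()[1], 1))     # transformation
totals = pairs.reduce_by_key(lambda a, b: a + b)         # transformation
takeoffs = lines.filter(lambda line: line.endswith("take-off")).count()  # action
```

Chaining works because every transformation hands back the same type, while an action ends the chain with a result the driver program can use.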
  37. Speed Layer
      • Diagram labels: Events, Processing, Real Time View 1, Real Time View 2, NoSQL
  38. Serving Layer: Aggregated Data
      • Views are stored in a read/write database:
        – Apache HBase
        – MapR DB Binary & JSON
        – Cassandra
        – MongoDB
        – Elasticsearch
        – …
  39. Serving Layer
      • Diagram labels: Events, Processing, Real Time View, Aggregated Batch View, Query/Visualisation (SQL query, Dataviz)
  40. // Join MapR-DB Table, Parquet and MongoDB collection
      > SELECT u.name, b.category, count(1) nb_review
        FROM mongo.yelp.`user` u,
             dfs.yelp.`review.parquet` r,
             (select business_id, flatten(categories) category from maprdb.`business`) b
        WHERE u.user_id = r.user_id
          AND b.business_id = r.business_id
        GROUP BY u.user_id, u.name, b.category
        ORDER BY nb_review DESC
        LIMIT 10;
      +-----------+--------------+------------+
      | name      | category     | nb_review  |
      +-----------+--------------+------------+
      | Rand      | Restaurants  | 1086       |
      | J         | Restaurants  | 661        |
      | Aileen    | Restaurants  | 499        |
      | Michael   | Restaurants  | 496        |
      +-----------+--------------+------------+
  41. Events Capture?
  42. Events Capture
      • Diagram labels: Customer DB, API, Logs, …, Streaming, Streams, Files
  43. What is Spark Streaming?
      • Enables scalable, high-throughput, fault-tolerant stream processing of live data
      • Extension of the core Spark
      • Diagram labels: Data Sources, Data Sinks
  44. Spark Streaming Architecture
      • Divides the data stream into batches of X seconds (micro-batching)
      • The result is called a DStream: a sequence of RDDs
      • Diagram: the input data stream enters Spark Streaming and is cut by batch interval into RDD batches: data from time 0 to 1 (RDD @ time 1), data from time 1 to 2 (RDD @ time 2), data from time 2 to 3 (RDD @ time 3)
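
Micro-batching itself is easy to sketch without Spark. The function below is a plain-Python illustration, not Spark Streaming code: `micro_batches`, the `(timestamp_seconds, payload)` event shape, and the sample values are all assumptions made for the example. Unlike a real stream processor, it works on a finite list and skips intervals that contain no events.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events into consecutive time batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Events in [k * interval, (k + 1) * interval) land in batch k.
        batches[int(timestamp // batch_interval)].append(payload)
    return [batches[index] for index in sorted(batches)]

# Hypothetical stream: three one-second batches, as in the diagram.
stream = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (2.1, "d"), (2.7, "e")]
batches = micro_batches(stream, batch_interval=1)
```

Each inner list plays the role of one RDD in the DStream; a smaller `batch_interval` lowers latency at the cost of more, smaller batches.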
  45. What are Apache Kafka & MapR Streams?
      • Publish-subscribe messaging
      • Fast
      • Scalable
      • Durable
      • Distributed
  46. Summary
  47. Lambda Architecture (same diagram as slide 17, annotated with the technologies used: NoSQL, Distributed File System, NoSQL, Streams)
  48. Lambda Architecture in Action
      • Diagram labels: Geolocation (×4), Real-time stream, Online alerts, Batch processing (MapReduce), Tax reduction reporting, Shortest path graph algorithm (Titan on MapR-DB), Route optimization, …
  49. Lambda Architecture
      • Fault-tolerant
      • Use the batch layer to precompute complex/large data set queries
      • Use the speed layer to deal with “near real time” use cases
      • Linear scale-out capabilities
      • When errors occur: recompute data from the master dataset as needed
  51. Q&A
      • @tgrall, maprtech, tug@mapr.com
      • Engage with us! MapR, maprtech, mapr-technologies
