Lambda architecture


A concise description of the fundamental concepts of the Lambda Architecture and of how this architectural approach works for Big Data problems and even near-real-time systems.

Published in: Technology

  1. Lambda Architecture
     A solution for Big Data
     Mario A. Santini
  2. A solution born at Twitter
     Nathan Marz, author of Big Data:
  3. How big is big?
     ● ~1,5 M users, ~2,2 nodes (
     ● Wikipedia 32 M pages, 20 M users (
     ● Facebook 1.3 G users (
     ● Twitter 645 M users (
     ● But also:
       – Monitoring systems
       – Any near-real-time system
  4. Lambda
     query = function(allData);
  5. [Diagram] Input data → Lambda Architecture → Query
  6. [Diagram] All Data → Batch Layer → Batch Views → Serving Layer → Query
  7. Batch Layer
     ● Stores an immutable input data set
     ● Continuously recomputes the batch views
     ● Simple and distributed
  8. Serving Layer
     ● Indexes the batch views
     ● Provides access to the batch views
     ● Updated by the Batch Layer
     ● Trivial read-only database:
       – Quick
       – Very simple
  9. Batch Layer + Serving Layer
     ● Robust and fault tolerant
     ● Scalable
     ● General
     ● Extensible
     ● Allows ad hoc queries
     ● Minimal maintenance
     ● Debuggable
  10. What's missing?
      While the Batch Layer computes the query over the full data set, a pretty big chunk of data has just arrived and been stored. Should we wait a couple of hours to query this data?
  11. Speed Layer
      [Diagram] New Data → Speed Layer → Near-real-time views → Query
  12. All together now!
      [Diagram] All Data → Batch Layer → Batch Views → Serving Layer → Query
      [Diagram] New Data → Speed Layer → Near-real-time views → Query
  13. How should all this mess work?
      ● All new data is sent to both the batch layer and the speed layer (the data is raw and immutable, append-only)
      ● The batch layer continuously recomputes the query functions over the whole data set to produce the batch views
      ● The serving layer indexes the batch views
      ● In the end, the data in the batch views is a couple of hours old
  14. How should all this mess work?
      ● The speed layer processes only the new data
      ● It uses a fast read/write database and incremental processing algorithms
      ● It produces the near-real-time views
      ● The query merges the results from the near-real-time views and the batch views
  15. Batch Layer - tools
      ● Hadoop
        – YARN: framework for job scheduling and cluster management
        – MapReduce: parallel processing of huge amounts of data, based on YARN
        – HDFS: distributed file system with high-throughput access to application data
        – And more...
  16. Serving Layer - tools
      ● ElephantDB
        – Read-only database, very small, very fast
      ● Anything with the same features would work here
      ● Cloudera Impala
  17. Speed Layer - tools
      ● Storm project
        – Very fast distributed computation system
      ● Apache HBase
      ● MongoDB
  18. Query - tools
      ● Cloudera Impala
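
The flow described in the slides (query = function(allData), batch views recomputed over all data, near-real-time views updated incrementally, results merged at query time) can be illustrated with a minimal in-memory sketch. It is not taken from the slides: the page-view counting use case and the names `batch_view`, `SpeedView`, and `query` are assumptions made for illustration, and plain Python collections stand in for Hadoop, the serving database, and Storm.

```python
from collections import Counter
from typing import Iterable

# Hypothetical event type: one page view, identified by its URL.
Event = str


def batch_view(all_data: Iterable[Event]) -> Counter:
    """Batch layer: recompute the view from scratch over the whole data set."""
    return Counter(all_data)


class SpeedView:
    """Speed layer: incremental view over data the batch view has not seen yet."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, event: Event) -> None:
        # Incremental algorithm: adjust the view per event, no recomputation.
        self.counts[event] += 1


def query(url: str, batch: Counter, speed: SpeedView) -> int:
    """Query: merge the batch view and the near-real-time view."""
    return batch[url] + speed.counts[url]


# Master data set: raw, immutable events, append-only.
master = ["/home", "/about", "/home"]

# The batch layer finishes a run; its view only covers data present at that time.
batch = batch_view(master)
speed = SpeedView()

# New data is sent to both layers.
for event in ["/home", "/contact"]:
    master.append(event)   # stored for the next batch run
    speed.update(event)    # visible to queries right away

home_views = query("/home", batch, speed)
contact_views = query("/contact", batch, speed)

# Once the next batch run covers the new data, the speed view can be discarded.
next_batch = batch_view(master)
next_speed = SpeedView()
```

The merged result matches what `function(allData)` would return on the full data set, even though the batch view is stale; this only holds here because counts can be merged by simple addition, which is an assumption of this example rather than a general property of every query.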