The Lambda architecture combines a batch layer, which processes all incoming data to generate batch views served with high latency; a speed layer, which processes only recent data to compensate for that latency with low-latency real-time views; and a serving layer, which merges batch and real-time views to answer queries. This document presents an example use case in which RabbitMQ handles data ingestion, Apache Spark performs batch processing, Apache Spark Streaming implements the speed layer, Apache Shark powers the serving layer, and results are stored in Cassandra and presented using Tomcat and D3.
2. Introduction - Lambda Architecture
• Lambda Architecture (introduced by Nathan Marz) is a
generic, scalable, and fault-tolerant data-processing
architecture designed to satisfy the needs of a robust
system that is:
– Fault-tolerant against both hardware failures and human
mistakes; mistakes are corrected via recomputation.
– Able to serve a wide range of workloads and use cases
in which low-latency reads and updates are required.
– Built on history-optimized, immutable data storage
("immutability changes everything").
– Linearly scalable, scaling out rather than up.
4. LA High-level perspective (continued)
• All data entering the system is dispatched to both the
batch layer and the speed layer for processing.
• The batch layer has two functions: (i) managing the
master dataset (an immutable, append-only set of raw
data), and (ii) pre-computing the batch views.
• The serving layer indexes the batch views so that they
can be queried in a low-latency, ad-hoc way.
• The speed layer compensates for the high latency of
updates to the serving layer and deals with recent data
only.
• Any incoming query can be answered by merging
results from batch views and real-time views.
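The merge step above can be sketched in plain Python. This is an illustrative sketch, not the actual serving-layer code: the view structures, the `answer_query` helper, and the page-count example are all assumptions made for the illustration.

```python
# Minimal sketch of the Lambda query merge, assuming both views map a
# key (here, a page URL) to a count. All names are illustrative.

def answer_query(batch_view, realtime_view, key):
    """Merge the precomputed batch view with the real-time view.

    The batch view covers everything up to the last batch run; the
    real-time view covers only data that arrived since then.
    """
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = {"/home": 1000}   # precomputed by the batch layer
realtime_view = {"/home": 7}   # maintained by the speed layer

print(answer_query(batch_view, realtime_view, "/home"))  # 1007
```

Once a new batch run completes, the corresponding portion of the real-time view is discarded, so the merge always counts each event exactly once.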
5. Lambda use case
• Data ingestion – queue & pub/sub models are a
natural fit; RabbitMQ is used
• Use Apache Spark in the Batch Layer and Jenkins as
the scheduler
• Use Apache Spark Streaming in the Speed Layer; use
Cassandra to store the real-time results
• Adopt Apache Shark in the Serving Layer
• In the Presentation Layer, use Tomcat and D3
• (Refer to the next slide for the diagram)
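The ingestion fan-out in this pipeline sends every incoming event to both the batch layer and the speed layer. In the real deployment this would be done with RabbitMQ; the sketch below stands in plain `queue.Queue` objects so the idea is runnable without a broker, and all names are illustrative assumptions.

```python
# Sketch of the Lambda fan-out: each event is dispatched to BOTH the
# batch layer and the speed layer. RabbitMQ would do this in the real
# pipeline; plain queues are used here so no broker is needed.
from queue import Queue

batch_queue = Queue()   # feeds the Spark batch jobs (master dataset)
speed_queue = Queue()   # feeds Spark Streaming

def ingest(event):
    """Publish one event to both layers (pub/sub fan-out)."""
    batch_queue.put(event)
    speed_queue.put(event)

for e in ({"page": "/home"}, {"page": "/docs"}):
    ingest(e)

print(batch_queue.qsize(), speed_queue.qsize())  # 2 2
```

With RabbitMQ, the same effect is achieved by publishing to a fanout exchange with one queue bound per layer.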
7. Apache Spark
• Hadoop integration
• Spark interactive Shell
• The Spark analytics suite includes:
– Interactive query analysis (Shark)
– Large-scale graph processing and analysis (Bagel)
– Real-time analysis (Spark Streaming)
– Machine learning library (MLlib)
• Resilient Distributed Datasets (RDDs)
– Distributed collections of objects that can be cached in memory across a cluster
of compute nodes
– Fault tolerance is built in: RDDs are automatically rebuilt from their lineage if
something goes wrong
• Distributed Operators
• Spark is already used in production
• The Spark codebase is small and extensible
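The RDD programming style above can be illustrated with the classic word count. In Spark itself this would be `sc.textFile(path).flatMap(...).map(...).reduceByKey(...)`; the sketch below mirrors that dataflow in plain Python so it runs without a cluster, and the input lines are made up for the example.

```python
# Plain-Python sketch of the Spark RDD word count dataflow; the real
# version chains flatMap -> map -> reduceByKey over an RDD.
from collections import Counter
from itertools import chain

lines = ["spark makes batch views",
         "spark streaming makes realtime views"]

# flatMap: split each line into words, flattening the result
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["spark"], counts["views"])  # 2 2
```

In Spark, each stage of this chain produces a new RDD, and the recorded lineage of transformations is what allows lost partitions to be rebuilt automatically.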
8. Apache Shark
Shark is a component of Spark, an open-source, distributed, and fault-
tolerant in-memory analytics system that can be installed on the
same cluster as Hadoop.
In particular, Shark is fully compatible with Hive and
supports HiveQL, Hive data formats, and user-defined functions. In
addition, Shark can be used to query data in HDFS, HBase, and
Amazon S3.
• Interactive SQL system for Hadoop
• In-memory column store and column compression
• Control over data partitioning => fast, distributed joins
• Fault tolerance
• SQL “optimizer”
• Machine-learning support
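A serving-layer query in this setup would be ordinary HiveQL submitted to Shark. The fragment below is a hypothetical example: the table and column names are invented for illustration, and the `_cached` table-name suffix follows Shark's convention for keeping a table in its in-memory column store.

```sql
-- Hypothetical HiveQL run through Shark; table and column names are
-- illustrative. The _cached suffix asks Shark to keep the result in
-- its in-memory column store.
CREATE TABLE pageviews_cached AS
SELECT page, COUNT(*) AS views
FROM   raw_logs
GROUP BY page;

-- Subsequent ad-hoc queries hit the cached, columnar copy
SELECT page, views
FROM   pageviews_cached
ORDER BY views DESC
LIMIT 10;
```

Because Shark speaks HiveQL, existing Hive queries and UDFs can be reused against the cached batch views without modification.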