Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. In Lambda architecture, the system involves three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries, and each comes with its own set of requirements.
In batch layer, it aims at perfect accuracy by being able to process the all available big dataset which is an immutable, append-only set of raw data using distributed processing system. Output will be typically stored in a read-only database with result completely replacing existing precomputed views. Apache Hadoop, Pig, and HIVE are
the de facto batch-processing system.
In speed layer, the data is processed in streaming fashion, and the real-time views are provided by the most recent data. As a result, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate as the views provided by batch layer's views created with full dataset, so they will be eventually replaced by the batch layer's views. Traditionally, Apache Storm is
used in this layer.
In serving layer, the result from batch layer and speed layer will be stored here, and it responds to queries in a low-latency and ad-hoc way.
One of the lambda architecture examples in machine learning context is building the fraud detection system. In speed layer, the incoming streaming data can be used for online learning to update the model learnt in batch layer to incorporate the recent events. After a while, the model can be rebuilt using the full dataset.
Why Spark for lambda architecture? Traditionally, different
technologies are used in batch layer and speed layer. If your batch system is implemented with Apache Pig, and your speed layer is implemented with Apache Storm, you have to write and maintain the same logics in SQL and in Java/Scala. This will very quickly becomes a maintenance nightmare. With Spark, we have an unified development framework for batch and speed layer at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will
discuss about the development, testing, maintenance, and deployment of lambda architecture system with Apache Spark.