Lambda architecture

A brief description of the fundamental concepts and of how this architectural approach works for Big Data problems and even real-time systems.

Lambda Architecture
A solution for Big Data

Mario A. Santini

A solution born at Twitter

Nathan Marz
Author of Big Data:
http://www.manning.com/marz/

When is big really big?
● OpenStreetMap.org: ~1.5 M users, ~2.2 G nodes (http://j.mp/OSM-stats)
● Wikipedia: 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook: 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter: 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
– Monitoring systems
– Any near real time system

[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query]

Batch Layer
● Stores an immutable input data set
● Continuously computes the batch views
● Simple & distributed
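The two responsibilities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page-view use case, the `PageView` event shape, and the names `MasterDataset` and `compute_batch_view` are hypothetical and do not come from the deck.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class PageView(NamedTuple):
    """One raw, immutable fact (hypothetical event shape)."""
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: events can be added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: list[PageView] = []

    def append(self, event: PageView) -> None:
        self._events.append(event)

    def scan(self) -> tuple[PageView, ...]:
        # Return an immutable snapshot so callers cannot modify the data set.
        return tuple(self._events)


def compute_batch_view(dataset: MasterDataset) -> dict[str, int]:
    """Recompute the view from scratch over all data (no incremental state)."""
    return dict(Counter(event.url for event in dataset.scan()))


master = MasterDataset()
master.append(PageView("/home", 1))
master.append(PageView("/docs", 2))
master.append(PageView("/home", 3))
batch_view = compute_batch_view(master)
print(batch_view)
```

Recomputing from scratch on every run is what keeps this layer simple: a bug in the view logic can be fixed and the view rebuilt, because the raw events are never modified.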

Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the batch layer
● A trivial read-only database:
– Quick
– Very simple
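As a rough sketch of what "trivial read-only database" can mean, the Python below keeps an indexed copy of the latest batch view and replaces it wholesale when a new one is published. The class name `ServingLayer` and its `publish` and `get` methods are hypothetical, chosen only for this illustration.

```python
from types import MappingProxyType
from typing import Mapping


class ServingLayer:
    """Read-only key/value store over the latest batch view."""

    def __init__(self) -> None:
        self._index: Mapping[str, int] = MappingProxyType({})

    def publish(self, batch_view: Mapping[str, int]) -> None:
        # Called only by the batch layer: build a fresh index and swap it in.
        # Copying first means later changes to `batch_view` cannot leak in.
        self._index = MappingProxyType(dict(batch_view))

    def get(self, key: str, default: int = 0) -> int:
        # Random reads by key; there is no write path for clients.
        return self._index.get(key, default)


serving = ServingLayer()
serving.publish({"/home": 2, "/docs": 1})
print(serving.get("/home"))     # 2
print(serving.get("/missing"))  # 0
```

Because clients never write, there are no random writes to coordinate, which is a large part of why such a store can stay quick and very simple.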

Batch Layer + Serving Layer
● Robust and fault tolerant
● Scalable
● General
● Extensible
● Allows ad hoc queries
● Minimal maintenance
● Debuggable

What's missing?
While the batch layer computes the query over the full
data set, a sizable chunk of new data has already arrived
and been stored.
Should we wait a couple of hours before we can query this
data?

Speed Layer
[Diagram: New Data → Speed Layer → Near real time views → Query]

All together now!
[Diagram: All Data → Batch Layer → Batch Views (Serving Layer) → Query; New Data → Speed Layer → Near real time views → Query]

How should all this mess work?
● All new data is sent to both the batch layer and the speed layer (the data is raw, immutable, and append-only)
● The batch layer continuously precomputes the query functions over the whole data set to produce the batch views
● The serving layer indexes the batch views
● In the end, the data in the batch views is a couple of hours old
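The staleness in the last point can be made concrete with a toy sketch in Python. Everything here is an assumption made for illustration: the `ingest` and `run_batch` helpers, the in-memory lists standing in for real storage, and the idea of modelling a batch run as a snapshot of the log taken when the run starts.

```python
from collections import Counter

master_log = []    # append-only, feeds the batch layer
speed_inbox = []   # recent events, feeds the speed layer


def ingest(event):
    """Send every new raw event to both layers."""
    master_log.append(event)
    speed_inbox.append(event)


def run_batch(snapshot_size):
    """Recompute the batch view over the first `snapshot_size` events."""
    return Counter(master_log[:snapshot_size])


ingest("/home")
ingest("/docs")
snapshot = len(master_log)          # batch run starts here
ingest("/home")                     # arrives while the batch run is in progress
batch_view = run_batch(snapshot)    # batch run finishes

print(batch_view["/home"])          # 1: the late event is not in the batch view
print(len(speed_inbox))             # 3: the speed layer has seen every event
```

The event that arrived during the run is safely stored, but it only shows up in a batch view after the next run completes.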

How should all this mess work? (continued)
● The speed layer processes only the new data
● It uses a fast read/write database and incremental processing algorithms
● It produces the near real time views
● A query merges the real time and batch view results to produce its answer
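A minimal sketch of the speed layer and of the query-time merge, in Python. The `SpeedLayer` class, its `on_event` and `expire` methods, and the `query` function are hypothetical names; the sketch also assumes a counting query, for which merging the two views is a plain sum. Other query functions would need their own merge logic.

```python
from collections import Counter


class SpeedLayer:
    """Keeps near real time counts for events not yet covered by a batch view."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, url):
        # Incremental update: touch one counter instead of rescanning the data.
        self.realtime_view[url] += 1

    def expire(self):
        # Assumption: once a batch run has absorbed these events,
        # the real time state for them can be dropped.
        self.realtime_view.clear()


def query(url, batch_view, realtime_view):
    """Merge both views; for counts the merge is a simple sum."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


batch_view = {"/home": 100, "/docs": 40}   # a couple of hours old
speed = SpeedLayer()
for url in ["/home", "/home", "/new"]:
    speed.on_event(url)

print(query("/home", batch_view, speed.realtime_view))  # 102
print(query("/new", batch_view, speed.realtime_view))   # 1
print(query("/docs", batch_view, speed.realtime_view))  # 40
```

The real time view only ever holds the recent slice of data, so it stays small even though the full data set keeps growing.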

Batch Layer - tools
● Hadoop
– YARN: a framework for job scheduling and cluster resource management
– MapReduce: a programming model for parallel processing of huge amounts of data, running on YARN
– HDFS: a distributed file system with high-throughput access to application data
– And even more...
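To give a feel for the MapReduce programming model, here is a single-process word count in plain Python. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and only mirror the three stages that a real framework distributes across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (key, value) pairs for one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key."""
    return key, sum(values)


lines = ["big data is big", "data is immutable"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Each `map_phase` call depends only on its own record and each `reduce_phase` call only on its own key, which is what lets a framework run them in parallel on different machines.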

Serving Layer - tools
● ElephantDB
– Read-only database, very small, very fast
● Anything that has the same features will do here
● Cloudera Impala

Speed Layer - tools
● Storm project
– Very fast distributed computation system
● Apache HBase
● MongoDB