Big Data Real Time Architectures
Lambda, Kappa motivation and practical applications
@dmarcous
Problems
Volume
Variety
Velocity
Solutions
Batch processing
NoSQL
Stream processing
More Problems?
● Machines FAIL
● Humans make mistakes
● We want everything in real time!
○ We can’t do everything in real time :(
● We might think of a new way to analyse old data
● We might want to take a look at older versions of the raw / aggregated data
● Looking at raw data is cool, looking at aggregated data is cooler, looking at indexed data with ad-hoc filters is the coolest. What if we want them all on the same set of data?
Big Data Timeline
2003: Inception
2006: 1st Generation
2010: 2nd Generation
2012: 3rd Generation
Batch processing
◦ Large amount of static data
◦ Scalable solution
◦ Volume
Real-time processing
◦ Computing streaming data
◦ Low latency
◦ Velocity
Hybrid computation
◦ Lambda Architecture
◦ Kappa Architecture
Lambda Architecture
● Nathan Marz (Twitter)
● How to beat the CAP theorem
○ http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
● Concepts:
○ Immutable data
○ Everything can be re-run
○ Use the best tool for each purpose
○ Query = Function(All Data)
○ Real-time results may be inaccurate; batch will fix any mistakes
● Layers
○ Batch
○ Speed
○ Serving
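The three layers and the "Query = Function(All Data)" idea can be sketched in a few lines of Python. This is a minimal illustration, not Marz's reference design: the page-view events, `batch_view`, `SpeedLayer` and `query` are all hypothetical names, and in-memory structures stand in for a distributed store.

```python
from collections import Counter

# Master dataset: append-only list of immutable events (hypothetical page views).
master_dataset = [("home", 1), ("about", 1), ("home", 1)]


def batch_view(events):
    """Batch layer: recompute the whole view from scratch over all data."""
    counts = Counter()
    for page, n in events:
        counts[page] += n
    return counts


class SpeedLayer:
    """Speed layer: counts only events the last batch run has not covered yet."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        page, n = event
        self.counts[page] += n

    def reset(self):
        # Called once a new batch view covers these events.
        self.counts.clear()


def query(page, batch, speed):
    """Serving layer: merge the precomputed batch view with the real-time delta."""
    return batch[page] + speed.counts[page]


batch = batch_view(master_dataset)
speed = SpeedLayer()
for event in [("home", 1), ("contact", 1)]:
    master_dataset.append(event)  # raw data is always kept
    speed.on_event(event)         # and also counted in real time

print(query("home", batch, speed))

# Next batch cycle: recompute everything, then drop the speed-layer state.
batch = batch_view(master_dataset)
speed.reset()
print(query("home", batch, speed))
```

Because the batch view is always rebuilt from the raw events, any error in the speed layer disappears at the next batch cycle.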
Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile
- large HDD/SSD batch processing: slow, but large and persistent
Proposed by Nathan Marz
Lambda Pitfalls
● Data duplication
○ Columnar + Compressed
○ Don’t be cheap...
● Too many tools!
○ Stay on 1 platform - Hadoop/YARN
● Do I really need to write everything twice? (Cross DB ORM)
○ Frameworks
■ Twitter Summingbird (MR + Storm)
■ Apache Spark (batch / Streaming)
■ Google Dataflow
● No place for ad-hoc analysis
○ Add more specialised data sources
■ Solr / Elasticsearch
● Incremental algorithms are HARD - stream process based on smart thresholds (= history)
○ Mix it up - key-value access during the speed process
● A new event may be related to an old one, which might be related to an older one…
○ Add graph processing (GraphX / Giraph / Titan)
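One possible reading of the "smart thresholds (= history)" pitfall, sketched in Python: the batch side derives a per-key threshold from the full history and publishes it to a key-value store, so the speed process only needs a lookup per event. The sensor names, the mean + k * standard deviation rule, and the plain `dict` standing in for a store such as HBase or Cassandra are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def build_thresholds(history, k=3.0):
    """Batch side: derive one alert threshold per key from its full history."""
    return {key: mean(vals) + k * pstdev(vals) for key, vals in history.items()}


def speed_process(events, kv_store):
    """Speed side: one key-value lookup per event, no history kept in the stream."""
    for key, value in events:
        threshold = kv_store.get(key)
        if threshold is not None and value > threshold:
            yield key, value, threshold


# Hypothetical historical readings, as the batch layer would see them.
history = {
    "sensor-a": [10, 12, 11, 9, 10],
    "sensor-b": [100, 105, 95],
}
kv_store = build_thresholds(history)

# Incoming stream; "sensor-c" has no history yet, so it is skipped.
stream = [("sensor-a", 12), ("sensor-a", 20), ("sensor-b", 110), ("sensor-c", 5)]
for key, value, threshold in speed_process(stream, kv_store):
    print(f"{key}: {value} exceeds {threshold:.2f}")
```

The trade-off is that thresholds are only as fresh as the last batch run, which fits the Lambda view that batch is the source of truth.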
Kappa Architecture
● Jay Kreps (LinkedIn)
● Questioning The Lambda Architecture
○ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
● Concepts
○ Retain the full log of the data
○ Processing = new instance of the same stream
○ Input - choose where to start reading from the log (now, 1 day ago, 1 year ago...)
○ Real time is accurate!
○ Re-processing only when code changes
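The Kappa concepts above can be sketched with an in-memory log. This is a toy model, assuming a hypothetical `Log` class in place of a real retained log such as a Kafka topic; `run_job` and the two transform functions are illustrative names as well.

```python
class Log:
    """Append-only log that retains every record (stand-in for a retained topic)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        yield from self._records[offset:]


def run_job(log, transform, start_offset=0):
    """One stream-processing instance: fold the log into a view from start_offset."""
    view = {}
    for user, amount in log.read_from(start_offset):
        view[user] = view.get(user, 0) + transform(amount)
    return view


log = Log()
for record in [("ann", 5), ("bob", 3), ("ann", -2), ("ann", 4)]:
    log.append(record)


def v1(amount):
    """Original code: sum every amount."""
    return amount


def v2(amount):
    """Changed code: ignore negative amounts."""
    return max(amount, 0)


view_v1 = run_job(log, v1)                    # the job currently serving queries
view_v2 = run_job(log, v2)                    # code changed: replay from offset 0
recent = run_job(log, v1, start_offset=2)     # or start reading later in the log
print(view_v1, view_v2, recent)
```

Once the new instance has caught up, queries switch to its view and the old instance can be dropped; there is no separate batch code path.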
Lambda vs. Kappa
● Common
○ Greek letters
○ Real time processing at scale
○ Immutable architectures
■ “Replay” possible
○ Born out of need
○ Both use materialised views / indexed results for serving
● Different
○ Processing paradigm
■ Lambda: Batch + Streaming
■ Kappa: Streaming
○ Re-processing paradigm
■ Lambda: Every batch cycle
■ Kappa: Only when code changes
○ Reliability
■ Lambda: Batch is reliable, streaming is approximate
■ Kappa: Streaming with consistency (exactly once)
○ Resource consumption
■ Lambda: Query = Function(All data)
■ Kappa: Incremental algorithms, running on deltas
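The resource-consumption difference can be illustrated with a running mean: recomputing over all data versus updating a small state from each delta. A minimal sketch; `mean_full` and `IncrementalMean` are hypothetical names, and a mean is chosen only because it has an easy incremental form.

```python
def mean_full(all_data):
    """Query = Function(All data): cost grows with the size of the whole dataset."""
    return sum(all_data) / len(all_data)


class IncrementalMean:
    """Incremental algorithm: keeps a small state and only touches each delta."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, delta):
        for x in delta:
            self.count += 1
            self.total += x
        return self.total / self.count


data = [2, 4]
inc = IncrementalMean()
print(mean_full(data), inc.update(data))

# New records arrive: the full recompute re-reads everything, the incremental one does not.
delta = [6]
data.extend(delta)
print(mean_full(data), inc.update(delta))
```

Not every computation has such a simple incremental form, which is the "incremental algorithms are hard" point raised against streaming-only designs.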
Tooling
● Data Ingestion
○ Kafka
○ Apache Flume
○ Samza
● Batch
○ MR (Hive, Pig etc.)
○ Tez
○ Spark
○ Dataflow (=Google Flume)
● Stream
○ Storm
○ Spark Streaming
○ Samza
○ Dataflow (=Google Flume)
○ Flink
● Serving
○ DBs
■ ElephantDB
■ SploutSQL
■ HBase / Cassandra
○ Queries
■ Impala
■ Presto
■ BigQuery
Users
● Lambdas
○ Twitter
○ Spotify (music recommendations)
○ LivePerson
○ Inneractive
● Kappas
○ LinkedIn
○ Yahoo
● Platforms
○ Oryx2 (Cloudera)
■ Lambda ML Platform using Kafka + Spark
○ Novelti.io (Previously Lambdoop)
■ Streaming intelligence for everything (mainly IoT)
More Architectures?
● Zeta Architecture
○ Includes cluster management
■ Monitoring
■ Scheduling
■ Container system etc.
○ Inspired by Google
● iot-a
○ Internet of Things
○ Layered
■ MQ (Kafka - RT)
■ DB (HBase - Interactive)
■ DFS (Batch)
● Mu Architecture
○ Lambda with only 1 set of aggregated views
Appendix - Videos
● Lambda
○ http://www.infoq.com/interviews/marz-lambda-architecture
● Kappa
○ http://www.kappa-architecture.com/
