SlideShare a Scribd company logo
1 of 49
Lambda Architecture
with Apache Spark
IMAGE
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
A lot of customers implemented
successful Hadoop-based M/R pipelines
which are operating today
Examples from Real Life
• Oozie workflow, operates daily and processes up to
150 TB to generate analytics
• bash managed workflow, operates daily and processes
up to 8 TB to generate analytics
It Is 2016 Now!
• Making decisions faster is more valuable
• Kafka, Storm, Trident, Samza, Spark, Flink, Parquet,
Avro, Cloud providers, etc.
Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/
https://www.manning.com/books/big-data
https://www.manning.com/books/big-data
Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-computes the batch views
Serving layer
• indexes the batch views so that they can be queried in ad-hoc with
low-latency
Speed layer
• deals with recent data only
http://lambda-architecture.net/
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Relevance of Data
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
query =
real time view =
batch view =
function(batch view, real time view)
function(real time view, new data)
function(all data)
Trade-offs
Full recomputation vs. partial recomputation
e.g. using Bloom filters
Recomputational algorithms vs. incremental algorithms
Additive algorithms vs. approximation algorithms
e.g. HyperLogLog for count-distinct problem
Implementation of Lambda Architecture
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Can be considered as an integrated solution
for processing on all lambda architecture layers
Apache Spark: a Brief History
Why Apache Spark?
As of mid 2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Total Contributors
Core Concepts
automatically distribute data across cluster
and
parallelize operations performed on them
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% users consider most important part of Spark
Spark Streaming
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Streaming Architecture
• micro-batch architecture
• series of batch computations on small chunks of data
• batch interval is configurable
• exactly once semantics
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Streaming Architecture
http://spark.apache.org/docs/latest/streaming-programming-guide.html
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
DStream as a Continuous Series of RDDs
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
Provide hashtags statistics
used in a #morningatlohika tweets
All time till today + right now
Sample Application
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Batch View
apache –
architecture –
aws –
java –
jeeconf –
lambda –
morningatlohika –
simpleworkflow –
spark –
6
12
3
4
7
6
15
14
5
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Real-time View
“Cool presentation by @tmatyashovsky about
#lambda #architecture using #apache #spark
at #morningatlohika”
apache –
architecture –
morningatlohika –
lambda –
spark –
1
1
1
1
1
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Batch View + Real-time View
apache –
architecture –
aws –
java –
jeeconf –
lambda –
morningatlohika –
simpleworkflow –
spark –
7
13
3
4
7
7
16
14
6
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #morningatlohika tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on a fly
* Stream from file system (used for testing) can be used as a backup
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
http://shop.oreilly.com/product/0636920028512.do
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
.stream("s3://logs")
logs.groupBy(logs.user_id)
.agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk
Instead of Epilogue
http://milinda.pathirage.org/kappa-architecture.com/
http://milinda.pathirage.org/kappa-architecture.com/
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
Thank you!
References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly
Media)
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://milinda.pathirage.org/kappa-architecture.com/
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gN
http://milinda.pathirage.org/kappa-architecture.com/

More Related Content

More from Taras Matyashovsky

More from Taras Matyashovsky (9)

Confession of an Engineer
Confession of an EngineerConfession of an Engineer
Confession of an Engineer
 
Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversary
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic application
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
Morning at Lohika
Morning at LohikaMorning at Lohika
Morning at Lohika
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 

Recently uploaded

Recently uploaded (20)

SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 

Lambda Architecture with Apache Spark

Editor's Notes

  1. Receiver: Task that collects data from the input source and represents it as RDDs Is launched automatically for each input source Replicates data to another executor for fault tolerance
  2. Cluster Manager: Standalone, Apache Mesos, Hadoop Yarn Cluster Manager should be chosen and configured properly Monitoring via web UI(s) and metrics Web UI: master web UI worker web UI driver web UI - available only during execution history server - spark.eventLog.enabled = true Metrics based on Coda Hale Metrics library. Can be reported via HTTP, JMX, and CSV files.
  3. Spark 2.0: Project Tungsten 2.0 Whole stage code generation Optimized input / output -> Parquet + built-in cache Spark Streaming DataFrame API unified with Dataset API