Lambda Architecture with Apache Spark

About Me
https://ua.linkedin.com/in/tarasmatyashovsky

Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting

Many customers have implemented
successful Hadoop-based MapReduce pipelines
that are still operating today

Examples from Real Life
• An Oozie workflow that runs daily and processes up to
150 TB to generate analytics
• A Bash-managed workflow that runs daily and processes
up to 8 TB to generate analytics

It Is 2016 Now!
• Making decisions faster is more valuable
• Kafka, Storm, Trident, Samza, Spark, Flink, Parquet,
Avro, Cloud providers, etc.

Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop

Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/

https://www.manning.com/books/big-data

Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-computes the batch views
Serving layer
• indexes the batch views so that they can be queried ad hoc with
low latency
Speed layer
• deals with recent data only
http://lambda-architecture.net/

https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Relevance of Data
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
query = function(batch view, real time view)
real time view = function(real time view, new data)
batch view = function(all data)
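
The three relations on this slide can be sketched in plain Python. This is a minimal illustration, not the talk's implementation: the event format (a list of string keys) and the function names `batch_view`, `update_realtime_view` and `query` are assumptions made for the example.

```python
from collections import Counter

def batch_view(all_data):
    """batch view = function(all data): recompute counts from the full master dataset."""
    return Counter(all_data)

def update_realtime_view(realtime_view, new_data):
    """real time view = function(real time view, new data): incremental update."""
    updated = Counter(realtime_view)  # copy, so the previous view is left untouched
    updated.update(new_data)
    return updated

def query(batch, realtime, key):
    """query = function(batch view, real time view): combine both views at read time."""
    return batch[key] + realtime[key]

# Hypothetical data: events already absorbed by the batch layer...
master_dataset = ["spark", "lambda", "spark"]
batch = batch_view(master_dataset)

# ...followed by two recent arrivals handled by the speed layer.
realtime = Counter()
realtime = update_realtime_view(realtime, ["spark"])
realtime = update_realtime_view(realtime, ["kafka"])

print(query(batch, realtime, "spark"))  # 3
print(query(batch, realtime, "kafka"))  # 1
```

Once the batch layer recomputes its view over data that includes the recent events, the corresponding part of the real-time view can be discarded.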

Trade-offs
• Full recomputation vs. partial recomputation,
e.g. using Bloom filters
• Recomputational algorithms vs. incremental algorithms
• Additive algorithms vs. approximation algorithms,
e.g. HyperLogLog for the count-distinct problem
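
To see why count-distinct is singled out, consider this small sketch with exact sets. The user IDs and event counts are made up for illustration; exact sets stand in here for a mergeable sketch such as HyperLogLog, which trades exactness for a small, fixed-size summary.

```python
# Hypothetical user IDs seen by the batch layer and by the speed layer.
batch_users = {"u1", "u2", "u3"}
realtime_users = {"u3", "u4"}

# Event counts are additive: per-view totals can simply be summed.
batch_events, realtime_events = 10, 2
total_events = batch_events + realtime_events

# Distinct counts are not additive: u3 appears in both views,
# so summing the per-view counts overstates the answer.
naive_distinct = len(batch_users) + len(realtime_users)

# Merging the underlying sets gives the exact answer, but needs memory
# proportional to the number of distinct users. Mergeable sketches such
# as HyperLogLog approximate this union in small, fixed space.
exact_distinct = len(batch_users | realtime_users)

print(total_events)    # 12
print(naive_distinct)  # 5
print(exact_distinct)  # 4
```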

Implementation of Lambda Architecture

Apache Spark can be considered an integrated solution
for processing on all Lambda Architecture layers

Why Apache Spark?
As of mid-2014,
Spark is the most active Big Data project by total contributors
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east

Core Concepts
Automatically distribute data across the cluster
and
parallelize the operations performed on it

Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

Spark Streaming
Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% of users consider it the most important part of Spark
http://spark.apache.org/docs/latest/streaming-programming-guide.html

Streaming Architecture
• micro-batch architecture
• series of batch computations on small chunks of data
• batch interval is configurable
• exactly-once semantics
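
The micro-batch idea can be illustrated without Spark. The sketch below is a toy under stated assumptions, not Spark Streaming's implementation: `micro_batches` is a hypothetical helper that groups timestamped events into fixed batch intervals, after which each group can be processed as an ordinary small batch.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive fixed-size intervals.

    Returns a list of (interval_start, [values]) pairs ordered by time.
    Intervals that received no events are omitted in this toy version.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        interval_start = int(timestamp // batch_interval) * batch_interval
        buckets[interval_start].append(value)
    return sorted(buckets.items())

# Hypothetical events arriving over roughly three seconds.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1)

# Each micro-batch is processed like a small batch job; here we just count.
for interval_start, values in batches:
    print(interval_start, len(values))
```

A shorter batch interval lowers latency but produces more, smaller batches to schedule, which is why the interval is left configurable.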

Streaming Architecture

https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers

http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

DStream as a Continuous Series of RDDs
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

Sample Application
Provide statistics on hashtags
used in #morningatlohika tweets
All time till today + right now
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv

Batch View
apache – 6
architecture – 12
aws – 3
java – 4
jeeconf – 7
lambda – 6
morningatlohika – 15
simpleworkflow – 14
spark – 5

Real-time View
“Cool presentation by @tmatyashovsky about
#lambda #architecture using #apache #spark
at #morningatlohika”
apache – 1
architecture – 1
morningatlohika – 1
lambda – 1
spark – 1

Batch View + Real-time View
apache – 7
architecture – 13
aws – 3
java – 4
jeeconf – 7
lambda – 7
morningatlohika – 16
simpleworkflow – 14
spark – 6
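
The merge of the batch and real-time views can be reproduced with a few lines of plain Python. This is a sketch using the numbers and the tweet from the slides; the `extract_hashtags` helper and its regular expression are assumptions for illustration and are not taken from the sample application.

```python
import re
from collections import Counter

# Batch view from the slides: hashtag counts for all time until today.
batch_view = Counter({
    "apache": 6, "architecture": 12, "aws": 3, "java": 4, "jeeconf": 7,
    "lambda": 6, "morningatlohika": 15, "simpleworkflow": 14, "spark": 5,
})

tweet = ("Cool presentation by @tmatyashovsky about "
         "#lambda #architecture using #apache #spark at #morningatlohika")

def extract_hashtags(text):
    """Return lower-cased hashtags found in the text, without the leading '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

# Real-time view: counts for tweets that arrived after the batch view was built.
realtime_view = Counter(extract_hashtags(tweet))

# Query: merge both views on the fly by adding counts per hashtag.
merged_view = batch_view + realtime_view

for hashtag in sorted(merged_view):
    print(hashtag, merged_view[hashtag])
```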

Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #morningatlohika tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on the fly
* A stream from the file system (used for testing) can serve as a backup

Demo Time

http://shop.oreilly.com/product/0636920028512.do

http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics

Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark

Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
    .stream("s3://logs")
logs.groupBy(logs.user_id)
    .agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk

http://milinda.pathirage.org/kappa-architecture.com/

Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
Thank you!

References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gNk
