2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

Learn more about Advanced Analytics at http://www.alpinenow.com
Lambda Architecture with Apache Spark
DB Tsai
dbtsai@alpinenow.com
Machine Learning Engineering Lead @ Alpine Data Labs
Next.ML Conference
Jan 17, 2015

Lambda Architecture
•  Batch layer: manages the entire dataset, an immutable,
append-only set of raw data, using a distributed
processing system.
•  Speed layer: processes data in a streaming fashion with low
latency; its real-time views are built from the most
recent data.
•  Serving layer: stores the results from the batch layer and
the speed layer, and answers ad-hoc queries with low
latency.
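
The three layers can be illustrated with a small plain-Python sketch. It uses no Spark; the page-view events and the names `batch_view`, `realtime_view` and `query` are assumptions made for illustration, not part of the talk.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute a view from the full, immutable dataset."""
    return Counter(event["page"] for event in master_dataset)

def realtime_view(recent_events):
    """Speed layer: the same aggregation over events the batch run has not seen yet."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, realtime):
    """Serving layer: answer a query by merging the two views."""
    return batch[page] + realtime[page]

# Hypothetical data: three events already in the master dataset, one fresh event.
master_dataset = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
recent_events = [{"page": "/home"}]

batch = batch_view(master_dataset)
realtime = realtime_view(recent_events)
home_views = query("/home", batch, realtime)  # 2 from the batch view + 1 from the real-time view
```

Both views run the same aggregation, which is why it helps when one codebase can serve both layers.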

Lambda Architecture
https://www.mapr.com/developercentral/lambda-architecture

Traditional Lambda Architecture
•  Traditionally, different technologies are used in the batch
layer and the speed layer.
•  If your batch system is implemented with Apache Pig, and
your speed layer is implemented with Apache Storm, you
have to write and maintain the same logic in Pig Latin and in
Java/Scala.
•  This very quickly becomes a maintenance nightmare.

Unified Development Framework

Batch Layer
•  Spark lets users iterate
over the data quickly by using
the in-memory cache.
•  In memory, logistic regression
runs up to 100x faster than on
Hadoop M/R.
•  We’re able to train exact
models without doing any
approximation.

Apache Spark Utilizing in-memory Cache for M/R job
•  Iterative algorithms scan through the data each time.
•  With Spark, data is cached in memory after the first iteration.
•  Quasi-Newton methods enhance the in-memory benefits.
[Chart: running times of 921s versus 97s on 150m rows]

Speed Layer
•  Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live
data streams.
•  Spark Streaming receives streaming input and divides the data
into batches, which are then processed by the Spark engine.
•  As a result, developers can maintain the same Java/Scala code
in the batch and speed layers.
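
The micro-batching idea can be sketched in plain Python without Spark. The timestamped events, the one-second interval, and the names `discretize` and `total` are assumptions made for this illustration.

```python
from collections import defaultdict

def discretize(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    # Intervals that received no events are simply skipped in this sketch.
    return [batches[index] for index in sorted(batches)]

def total(values):
    """Business logic shared by the batch layer and the speed layer."""
    return sum(values)

events = [(0.2, 5), (0.9, 1), (1.1, 4), (2.5, 7), (2.7, 3)]

# Batch layer: run the function once over all the data.
batch_result = total(value for _, value in events)

# Speed layer: run the very same function on each one-second micro-batch.
stream_results = [total(batch) for batch in discretize(events, interval=1.0)]
```

Because `total` is written once and reused on every micro-batch, the two layers cannot drift apart.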

MapReduce Review
•  MapReduce – Simplified Data Processing on Large
Clusters, 2004.
•  Scales Linearly
•  Data Locality
•  Fault Tolerance in Data and Computation

Hard Disk Failures from Google’s 2007 Study
•  1.7% of disks failed in the first
year of their life.
•  Three-year-old disks were
failing at a rate of 8.6%.
•  For a hypothetical eight-disk server, the probability that
no disk fails in the first year is about 87% (0.983^8, assuming
independent failures).
•  The key contributions of the MapReduce framework are not
the actual map and reduce functions, but the scalability and
fault-tolerance achieved with commodity hardware.
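
The eight-disk estimate can be checked with a few lines of Python. The sketch assumes that disks fail independently and that the quoted annual failure rates apply to every disk in the server; `prob_no_failure` is a name chosen for illustration.

```python
def prob_no_failure(annual_failure_rate, disks):
    """Probability that none of `disks` independent disks fails within a year."""
    return (1.0 - annual_failure_rate) ** disks

first_year = prob_no_failure(0.017, 8)   # disks in their first year of life
third_year = prob_no_failure(0.086, 8)   # three-year-old disks

print(f"first-year disks: {first_year:.1%} chance that no disk fails")
print(f"three-year-old disks: {third_year:.1%} chance that no disk fails")
```

Under these assumptions an eight-disk server of first-year disks survives the year without a failure about 87% of the time, and one of three-year-old disks a little under half the time. This is why fault tolerance has to be built into the framework.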

Hadoop MapReduce Review
•  Mapper: Loads the data and emits a set of key-value pairs.
•  Reducer: Collects the key-value pairs with the same key,
processes them, and outputs the result.
•  Combiner: Can reduce shuffle traffic by combining key-value
pairs locally before they go to the reducer.
•  Good: Built-in fault tolerance, scalable, and production-proven
in industry.
•  Bad: Optimized for disk IO without leveraging memory well;
iterative algorithms go through disk IO again and again;
the primitive API makes it hard to write clean code.
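
The roles of mapper, combiner and reducer can be sketched in plain Python. This is a single-process toy model, not Hadoop code; `run_job` and the sample splits are assumptions made for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Pre-aggregate pairs on the mapper side to shrink shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reducer(key, values):
    """Combine all values that share a key."""
    return key, sum(values)

def run_job(splits, use_combiner):
    """Run map, optional combine, shuffle and reduce; also count the shuffled pairs."""
    shuffled = defaultdict(list)
    pairs_shuffled = 0
    for split in splits:  # each split stands in for one mapper's input
        pairs = [pair for line in split for pair in mapper(line)]
        if use_combiner:
            pairs = combiner(pairs)
        pairs_shuffled += len(pairs)
        for key, value in pairs:
            shuffled[key].append(value)
    counts = dict(reducer(key, values) for key, values in shuffled.items())
    return counts, pairs_shuffled

splits = [["a b a", "a c"], ["b b", "c a"]]
plain_counts, plain_traffic = run_job(splits, use_combiner=False)
combined_counts, combined_traffic = run_job(splits, use_combiner=True)
```

Both runs produce the same counts, but the run with the combiner moves fewer pairs through the shuffle step.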

Spark MapReduce
•  Spark also uses MapReduce as a programming model, but
with much richer APIs in Java, Scala, and Python.
•  With Scala’s expressive APIs, programs need 5-10x less code.
•  Not just a distributed computation framework, Spark provides
several pre-built components that help users implement
applications faster and more easily.
- Spark Streaming
- Spark SQL
- MLlib (Machine Learning)
- GraphX (Graph Processing)

Hadoop M/R vs Spark M/R
•  Hadoop
•  Spark

Supervised Learning
•  Binary Classification: linear SVMs (SGD), logistic regression (L-
BFGS and SGD), decision trees, random forests (Spark 1.2), and
naïve Bayes.
•  Multiclass Classification: Decision trees, naïve Bayes (coming
soon - multinomial logistic regression in GLMNET)
•  Regression: linear least squares (SGD), Lasso (SGD + soft-
threshold), ridge regression (SGD), decision trees, and random
forests (Spark 1.2)
•  Currently, the regularization in the linear models penalizes all the
weights including the intercept, which is undesirable in some use
cases. Alpine has a GLMNET implementation using OWLQN that
exactly reproduces the results of R’s GLMNET package while remaining
scalable. We’re in the process of contributing it to MLlib.
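
The soft-threshold step mentioned for Lasso, and the idea of leaving the intercept unpenalized, can be sketched as follows. This is a generic illustration in plain Python, not MLlib's or Alpine's implementation; `soft_threshold`, `shrink` and the sample numbers are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink a value toward zero by `threshold` (the proximal step for an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def shrink(weights, intercept, threshold):
    """Apply L1 shrinkage to the weights but leave the intercept untouched."""
    return [soft_threshold(w, threshold) for w in weights], intercept

# Hypothetical model state after a gradient step.
weights, intercept = shrink([0.8, -0.05, -1.2], intercept=2.0, threshold=0.1)
```

Small weights are set exactly to zero, which is what makes Lasso solutions sparse, while the intercept keeps its value.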

Unsupervised Learning
•  K-Means
•  Collaborative filtering (ALS)
•  SVD
•  PCA
•  Feature extraction and transformation
http://spark.apache.org/docs/1.2.0/mllib-guide.html

Resilient Distributed Datasets (RDDs)
•  An RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
•  RDDs can be created by parallelizing an existing
collection in your driver program, or by referencing a dataset
in an external storage system, such as a shared
filesystem, HDFS, HBase, Hive, or any data source
offering a Hadoop InputFormat.
•  RDDs can be cached in memory or on disk.

RDD Persistence/Cache
•  An RDD can be persisted using its persist() or cache()
methods. The first time it is computed in an action, it
will be kept in memory on the nodes. Spark’s cache is
fault-tolerant: if any partition of an RDD is lost, it will
automatically be recomputed using the transformations
that originally created it.
•  Each persisted RDD can be stored using a different storage
level, allowing you, for example, to persist the dataset on
disk, persist it in memory but as serialized Java objects
(to save space), replicate it across nodes, or store it
off-heap in Tachyon.

RDD Operations - two types of operations
•  Transformations: Create a new dataset from an existing
one. They are lazy, in that they do not compute their
results right away. By default, each transformed RDD may
be recomputed each time you run an action on it. You
may also persist an RDD in memory using the persist (or
cache) method, in which case Spark will keep the
elements around on the cluster for much faster access
the next time you query it. (Note: after transformations, the
data can be unevenly distributed across executors; this can
be addressed with repartition.)
•  Actions: Return a value to the driver program after
running a computation on the dataset.
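
Lazy transformations, actions and caching can be modelled with a toy class in plain Python. `LazyDataset` is a stand-in invented for this sketch; it mimics the behaviour described above and is not Spark's RDD implementation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are lazy, actions trigger computation."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable that yields the elements
        self._persist = False
        self._cache = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    def map(self, func):           # transformation: only records what to do
        return LazyDataset(lambda: (func(x) for x in self._elements()))

    def filter(self, func):        # transformation: only records what to do
        return LazyDataset(lambda: (x for x in self._elements() if func(x)))

    def cache(self):
        self._persist = True
        return self

    def _elements(self):
        if self._persist:
            if self._cache is None:          # first action fills the cache
                self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()               # otherwise recompute from the lineage

    def collect(self):             # action
        return list(self._elements())

    def count(self):               # action
        return sum(1 for _ in self._elements())

calls = []

def expensive_square(x):
    calls.append(x)                # record every evaluation
    return x * x

squares = LazyDataset.from_list([1, 2, 3, 4]).map(expensive_square)
calls_after_transformation = len(calls)    # map() alone ran nothing
squares.count()
squares.collect()
calls_without_cache = len(calls)           # both actions recomputed the map

calls.clear()
cached = LazyDataset.from_list([1, 2, 3, 4]).map(expensive_square).cache()
cached.count()
evens = cached.filter(lambda x: x % 2 == 0).collect()
calls_with_cache = len(calls)              # the second action reused the cache
```

Without `cache()` every action re-runs the whole chain of transformations; with it, the mapped values are computed once and reused.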

Transformations
•  map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•  filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•  flatMap(func) - Similar to map, but each input item can be
mapped to 0 or more output items (so func should return a Seq
rather than a single item).
•  mapPartitions(func) - Similar to map, but runs separately on
each partition (block) of the RDD, so func must be of type
Iterator<T> => Iterator<U> when running on an RDD of type T.
http://spark.apache.org/docs/latest/programming-guide.html#transformations
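
The four transformations can be imitated on a list of partitions in plain Python. The helper names (`rdd_map` and so on) and the sample data are assumptions made for illustration; they mirror the semantics described above rather than Spark's actual API.

```python
def rdd_map(parts, func):
    """Apply func to every element, partition by partition."""
    return [[func(x) for x in part] for part in parts]

def rdd_filter(parts, func):
    """Keep only the elements for which func returns True."""
    return [[x for x in part if func(x)] for part in parts]

def rdd_flat_map(parts, func):
    """func returns a sequence per element; the sequences are flattened."""
    return [[y for x in part for y in func(x)] for part in parts]

def rdd_map_partitions(parts, func):
    """func receives an iterator over a whole partition and returns an iterable."""
    return [list(func(iter(part))) for part in parts]

partitions = [["to be", "or not"], ["to be"]]   # two partitions of text lines

lengths = rdd_map(partitions, len)
with_be = rdd_filter(partitions, lambda line: "be" in line)
words = rdd_flat_map(partitions, str.split)
words_per_partition = rdd_map_partitions(words, lambda it: [sum(1 for _ in it)])
```

`rdd_map_partitions` calls the function once per partition instead of once per element, which is useful when each call has a setup cost.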

Actions
•  reduce(func) - Aggregate the elements of the dataset
using a function func (which takes two arguments and
returns one). The function should be commutative and
associative so that it can be computed correctly in
parallel.
•  collect() - Return all the elements of the dataset as an
array at the driver program. This is usually useful after a
filter or other operation that returns a sufficiently small
subset of the data.
•  count(), first(), take(n), saveAsTextFile(path), etc.
http://spark.apache.org/docs/latest/programming-guide.html#actions
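
Why the function passed to reduce should be commutative and associative can be seen in a small plain-Python model. `distributed_reduce` is a name invented for this sketch; it reduces each partition locally and then combines the partial results.

```python
from functools import reduce

def distributed_reduce(partitions, func):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [1, 2, 3, 4, 5, 6]
two_parts = [data[:3], data[3:]]
three_parts = [data[:2], data[2:4], data[4:]]

def add(a, b):
    return a + b      # commutative and associative

def sub(a, b):
    return a - b      # neither commutative nor associative

add_two = distributed_reduce(two_parts, add)
add_three = distributed_reduce(three_parts, add)
sub_two = distributed_reduce(two_parts, sub)
sub_three = distributed_reduce(three_parts, sub)
```

Addition gives the same answer however the data is partitioned, while subtraction gives a result that depends on the partitioning.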

Computing the mean of data
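One way to compute a mean in the map/reduce style is to map every element to a (sum, count) pair and reduce the pairs component-wise. The plain-Python sketch below uses no Spark; the function name `mean` and the sample partitions are assumptions.

```python
from functools import reduce

def mean(partitions):
    """Compute the mean through (sum, count) pairs so that it can be reduced in parallel."""
    # map: turn every element into a (sum, count) pair
    pairs = [(float(x), 1) for part in partitions for x in part]
    # reduce: add the pairs component-wise (commutative and associative)
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
    return total / count

partitions = [[1, 2, 3], [4, 5]]
overall_mean = mean(partitions)

# Averaging the per-partition means is not the same when partition sizes differ.
mean_of_means = sum(sum(part) / len(part) for part in partitions) / len(partitions)
```

Carrying the count along with the sum is what keeps the result independent of how the data is partitioned.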

Lab 1)

Spark Streaming: Discretized Streams
•  DStream is the basic abstraction provided by Spark
Streaming over Spark’s RDDs.
•  Each RDD in a DStream contains data from a certain
interval. Any operation applied on a DStream translates
to operations on the underlying RDDs internally.

Word Count in Batch Processing

Word Count in Streaming Processing
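The relationship between the batch and streaming word counts can be sketched in plain Python. It uses no Spark; `word_count` and the sample lines are assumptions made for illustration.

```python
from collections import Counter

def word_count(lines):
    """Split lines into words and count them (flatMap, map and reduceByKey in spirit)."""
    return Counter(word for line in lines for word in line.split())

# Batch processing: one pass over the whole dataset.
corpus = ["spark streaming", "spark sql", "spark"]
batch_counts = word_count(corpus)

# Streaming: each micro-batch is counted independently with the same function.
micro_batches = [["spark streaming"], ["spark sql", "spark"]]
stream_counts = [word_count(batch) for batch in micro_batches]
```

Each micro-batch yields its own counts; keeping a running total across batches requires state that is carried from one batch to the next.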

Lab 2)

Lab 2)
•  Need another bash shell in the Docker container to run Netcat as a
data server.
•  In production, people often use Kafka as the data server.
•  docker ps // to find the ID of the running container
•  docker exec -it <container ID> bash // to launch a new shell

UpdateStateByKey Operation
The updateStateByKey operation allows you to maintain
arbitrary state while continuously updating it with new
information.
•  Define the state - The state can be of arbitrary data type.
•  Define the state update function - Specify with a function
how to update the state using the previous state and the
new values from the input stream.
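
The two steps can be modelled in plain Python. The sketch below imitates one batch step of a per-key state update; it is not Spark's implementation, and `update_state_by_key`, `running_count` and the sample batches are assumptions made for illustration.

```python
from collections import defaultdict

def update_state_by_key(state, batch, update_func):
    """Apply update_func(new_values, previous_state) to every key for one batch."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    new_state = {}
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:          # in this sketch, returning None drops the key
            new_state[key] = updated
    return new_state

def running_count(new_values, previous):
    """State update function: the state is an integer count per key."""
    return (previous or 0) + sum(new_values)

state = {}
batches = [[("spark", 1), ("sql", 1)], [("spark", 1), ("spark", 1)]]
for batch in batches:
    state = update_state_by_key(state, batch, running_count)
```

Keys that receive no new values in a batch still pass through the update function, so their state is carried forward.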

UpdateStateByKey Operation

Computing the Mean of Streaming Data
•  The current sum and count at time t have to be accessible
at time (t + 1) to compute the new mean of the stream.
•  Without updateStateByKey, the operations at time t
and (t + 1) are independent.
•  A checkpoint directory has to be configured so that
the state persists from one batch to the next.
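
Carrying the sum and count from one batch to the next can be sketched in plain Python. There is no Spark and no checkpointing here; `update_sum_count` and the sample batches are assumptions made for illustration.

```python
def update_sum_count(new_values, previous):
    """State update function: the state is a (sum, count) pair."""
    total, count = previous or (0.0, 0)
    return total + sum(new_values), count + len(new_values)

state = None
running_means = []
per_batch_means = []
for batch in [[1.0, 2.0, 3.0], [5.0], [4.0, 6.0]]:
    state = update_sum_count(batch, state)         # state at time t feeds time t + 1
    running_means.append(state[0] / state[1])      # mean of everything seen so far
    per_batch_means.append(sum(batch) / len(batch))  # what a stateless job would report
```

The running mean uses the carried state, while the per-batch means ignore everything that arrived earlier.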

Computing the Mean of Streaming Data

Lab 3)

Online Learning Example

Thank you.
