TechLabs by
Discovering Machine Learning, Redis, and Spark
Maturin BADO
@mccstanmbg
github.com/mccstan
SPARK
Spark: Introduction
Outline
❏ Data processing today
❏ Spark, Hadoop, MapReduce
❏ Spark ecosystem
❏ Spark basics
Data processing today
Data-intensive applications
Definition:
“We call an application data-intensive if data is its primary challenge—the
quantity of data, the complexity of data, or the speed at which it is changing—as
opposed to compute-intensive, where CPU cycles are the bottleneck.”
Martin Kleppmann
Data processing today
Today's apps need to:
❏ Store data (databases)
❏ Cache data (caches)
❏ Search data (search indexes)
❏ Handle messages asynchronously (stream processing)
❏ Process data in batches (batch processing)
Spark, Hadoop, MapReduce
Spark, Hadoop, MapReduce
Spark: main differences with MapReduce
❏ Spark loads most of the dataset in memory
❏ Implements caching mechanisms that reduce reads from disk (see the sketch after this list)
❏ Is much faster than MapReduce, notably thanks to its job scheduling
❏ Does not implement its own data distribution technology, but
can run on top of Hadoop clusters (HDFS)
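A minimal PySpark sketch of the caching point above (the file path is hypothetical): the filtered RDD is kept in memory after the first action, so later actions do not re-read the file from disk.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "caching-demo")

# Hypothetical input file; could be a local path or HDFS.
lines = sc.textFile("hdfs:///data/access.log")

# Keep the filtered dataset in memory once it has been computed.
errors = lines.filter(lambda line: "ERROR" in line).cache()

# First action: reads from disk and populates the cache.
print(errors.count())

# Second action: served from the in-memory cache, no re-read from disk.
print(errors.filter(lambda line: "timeout" in line).count())

sc.stop()
```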
Spark ecosystem: open source
Spark ecosystem: features
Spark ecosystem: deployment
Spark basics: RDD
RDD: Resilient Distributed Dataset
❏ The primary Spark abstraction
❏ A fault-tolerant collection of elements
❏ Partitioned and immutable
❏ Two types of operations: transformations and actions
❏ Transformations are lazy: nothing is computed until an action is called (see the sketch below)
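A minimal sketch of these RDD basics in PySpark (data and names are illustrative): transformations only build the lineage, and the work happens when an action runs.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics")

# An immutable RDD partitioned into 4 slices.
numbers = sc.parallelize(range(1, 11), 4)

# Transformations (lazy): nothing is executed yet.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the actual distributed computation.
print(even_squares.collect())      # [4, 16, 36, 64, 100]
print(even_squares.count())        # 5
print(numbers.getNumPartitions())  # 4

sc.stop()
```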
Spark basics: An execution flow
Spark Streaming
Outline
❏ Why in-stream processing?
❏ Runtime and Programming Model
❏ Spark Streaming: Overview
❏ Benefits of Discretized Stream Processing
❏ Processing flow
❏ Transform operations
❏ Window operations
Why in-stream processing?
Why in-stream processing?
Runtime and Programming Model
Native Streaming
Runtime and Programming Model
Micro-batch Streaming
Spark Streaming: Overview
Benefits of Discretized Stream Processing
Dynamic load balancing
Benefits of Discretized Stream Processing
Fast failure and straggler recovery
Benefits of Discretized Stream Processing
❏ Unification of batch, streaming, and interactive analytics
❏ Advanced analytics like machine learning and interactive SQL
❏ Streaming + SQL and DataFrames (see the sketch after this list)
❏ Streaming + MLlib
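A hedged sketch of the Streaming + SQL and DataFrames combination, using the common pattern of turning each micro-batch RDD into a DataFrame (the socket source, port, and schema are assumptions):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sql")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

words = ssc.socketTextStream("localhost", 9999) \
           .flatMap(lambda line: line.split(" "))

def process(time, rdd):
    if rdd.isEmpty():
        return
    # Turn the micro-batch RDD into a DataFrame and query it with SQL.
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(rdd.map(lambda w: Row(word=w)))
    df.createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()

# Run the query on every batch of the stream.
words.foreachRDD(process)

ssc.start()
ssc.awaitTermination()
```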
Spark Streaming: Processing flow
Spark Streaming: DStreams
Discretized Streams (DStreams):
❏ The basic Spark Streaming abstraction
❏ A continuous series of RDDs, one per batch interval (see the sketch below)
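A minimal sketch (the socket source and port are assumptions): the DStream below produces one RDD of lines per 5-second batch interval.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-basics")  # at least 2 threads: receiver + processing
ssc = StreamingContext(sc, 5)                    # batch interval: 5 seconds

# A DStream: a continuous series of RDDs, one RDD per 5-second batch.
lines = ssc.socketTextStream("localhost", 9999)

# Print the first elements of each batch's RDD.
lines.pprint()

ssc.start()
ssc.awaitTermination()
```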
Spark Streaming: Transformations
Transform operations: any operation applied to a DStream translates into
operations on its underlying RDDs (see the sketch below)
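A brief sketch of that idea (the stream source is an assumption): map is applied batch by batch to each underlying RDD, and transform exposes each RDD directly so any RDD operation can be used.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-transformations")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)

# A DStream transformation: applied to every underlying RDD of the stream.
upper = lines.map(lambda line: line.upper())

# transform() hands each batch's RDD to an arbitrary RDD-to-RDD function.
sorted_lines = lines.transform(lambda rdd: rdd.sortBy(lambda line: len(line)))

upper.pprint()
sorted_lines.pprint()

ssc.start()
ssc.awaitTermination()
```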
Spark Streaming: Transformations
Window operations:
Spark Streaming: Time abstractions
❏ Batch interval: how often the input stream is divided into micro-batches (RDDs)
❏ Sliding interval: how often a windowed computation is performed
❏ Window size: how much of the stream each windowed computation covers (see the sketch below)
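A hedged sketch tying the three time abstractions together: batch interval of 10 s, window size of 30 s, sliding interval of 10 s (the socket source is an assumption).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "windowed-counts")
ssc = StreamingContext(sc, 10)   # batch interval: 10 seconds
ssc.checkpoint("checkpoint")     # required by windowed/stateful operations

pairs = ssc.socketTextStream("localhost", 9999) \
           .flatMap(lambda line: line.split(" ")) \
           .map(lambda word: (word, 1))

# Word counts over the last 30 seconds (window size),
# recomputed every 10 seconds (sliding interval).
windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # add counts entering the window
    lambda a, b: a - b,   # subtract counts leaving the window
    windowDuration=30,
    slideDuration=10)

windowed_counts.pprint()

ssc.start()
ssc.awaitTermination()
```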
Spark Streaming: Some examples (sketched below)
❏ Word count
❏ Stateless operation: counts words for every batch
❏ Basic error count
❏ Stateless operation: uses a filter such as contains("ERROR")
❏ Cumulative error count
❏ Stateful operation: counts errors from the beginning of the processing
❏ Windowed error count
❏ Stateful operation: counts errors over a sliding window of time
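Minimal PySpark sketches of the four examples above, gathered in one script (the socket source and intervals are assumptions); the full versions live in the repos referenced below.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-examples")
ssc = StreamingContext(sc, 10)   # 10-second batches
ssc.checkpoint("checkpoint")     # needed by the stateful and windowed examples

lines = ssc.socketTextStream("localhost", 9999)

# 1) Word count: stateless, counts words independently for every batch.
lines.flatMap(lambda l: l.split(" ")) \
     .map(lambda w: (w, 1)) \
     .reduceByKey(lambda a, b: a + b) \
     .pprint()

# 2) Basic error count: stateless, a simple filter on each batch.
errors = lines.filter(lambda l: "ERROR" in l)
errors.count().pprint()

# 3) Cumulative error count: stateful, errors since the start of the processing.
def update_total(new_values, total):
    return sum(new_values) + (total or 0)

errors.map(lambda l: ("ERROR", 1)).updateStateByKey(update_total).pprint()

# 4) Windowed error count: errors over the last 60 seconds, updated every 10 seconds.
errors.countByWindow(60, 10).pprint()

ssc.start()
ssc.awaitTermination()
```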
The git repos
https://github.com/SoatGroup/spark-streaming-java-examples
https://github.com/SoatGroup/spark-streaming-python

Discover Spark and Spark Streaming