Fully Fault-tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming
1. Fully Fault-tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming
Akhil Das
akhil@sigmoidanalytics.com
2. About Me and Sigmoid
● GitHub: github.com/akhld
● Twitter: @AkhlD
● Email: akhil@sigmoidanalytics.com
3. Overview
● Apache Spark
● Spark Streaming
● High Availability Mesos Cluster
● Running Spark Streaming over a High Availability Mesos Cluster
● Simple Fault-tolerant Streaming Pipeline
● Scaling the pipeline
4. Apache Spark
[Diagram: the Spark stack]
Resilient Distributed Datasets (RDDs): a big collection of data which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
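These properties can be seen in a toy example (a sketch; it assumes an existing SparkContext `sc`, e.g. inside spark-shell):

```scala
// Sketch assuming an existing SparkContext `sc` (e.g. from spark-shell).
val rdd = sc.parallelize(1 to 1000000)   // distributed, immutable collection
val evens = rdd.filter(_ % 2 == 0)       // lazily evaluated: nothing runs yet
// Type inferred: `evens` is an RDD[Int] without any annotation.
evens.cache()                            // cacheable: keep in memory once computed
println(evens.count())                   // an action finally triggers the computation
```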
5. Why Spark Streaming?
Many big-data applications need to process large data streams in near-real time:
Monitoring Systems
Alert Systems
Computing Systems
7. What is Spark Streaming?
Framework for large scale stream processing
➔ Created at UC Berkeley by Tathagata Das (TD)
➔ Scales to 100s of nodes
➔ Can achieve second scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.
8. Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
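The batch-like API above can be sketched with a word count over 1-second micro-batches (a hedged sketch: the socket source, host, and port are placeholders, and an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: 1-second micro-batches over a hypothetical socket source.
// Each batch becomes an RDD and is processed with RDD-style operations.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999) // placeholder host/port
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()          // results are emitted batch by batch
ssc.start()
ssc.awaitTermination()
```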
10. Mesos High Availability Cluster
[Diagram: a quorum of Mesos masters (one Leader plus Standby masters). Slave 1 through Slave N make resource Offers through the leader to each framework's scheduler; the frameworks (a SparkStreamingJob with its driver program and scheduler, and a HadoopJob) launch executors and tasks on the slaves.]
12. Spark Streaming over a HA Mesos Cluster
● To use Mesos from Spark, you need a Spark binary package available at a
location accessible to Mesos (HTTP/S3/HDFS), and a Spark driver program
configured to connect to Mesos.
● Configuring the driver program to connect to Mesos:
val sconf = new SparkConf()
.setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
.setAppName("MyStreamingApp")
.set("spark.executor.uri","hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
.set("spark.mesos.coarse", "true")
.set("spark.cores.max", "30")
.set("spark.executor.memory", "10g")
val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
...
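From here, the `ssc` configured above can be wired to a source and started. A hedged sketch using the receiver-based Kafka API (`KafkaUtils.createStream` from the spark-streaming-kafka package); the ZooKeeper address, consumer group, and topic name are placeholders:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: attach a Kafka receiver to the StreamingContext configured above.
// ZooKeeper quorum, consumer group, and topic are placeholders.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",            // ZooKeeper quorum (placeholder)
  "my-consumer-group",       // consumer group id (placeholder)
  Map("events" -> 4))        // topic -> number of receiver threads (placeholder)

kafkaStream.map(_._2)        // drop the Kafka key, keep the message value
           .count()
           .print()

ssc.start()
ssc.awaitTermination()
```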
13. Spark Streaming Fault-tolerance
Real-time stream processing systems must be operational 24/7, which requires
them to recover from all kinds of failures in the system.
● Spark and its RDD abstraction are designed to seamlessly handle failures of any
worker node in the cluster.
● In Streaming, driver failures can be recovered from by checkpointing the application state.
● Write Ahead Logs (WAL) & acknowledgements can ensure zero data loss.
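A sketch of how driver recovery and the WAL can be wired up, reusing the `sconf` from the earlier slide (the checkpoint path and the body of the setup function are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: recoverable driver via checkpointing (paths are placeholders).
val checkpointDir = "hdfs://Sigmoid/checkpoints/myStreamingApp"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sconf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // periodically persist the DAG and state
  // ... define the streams and the computation here ...
  ssc
}

// On (re)start: recover from the checkpoint if one exists, else build fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)

// WAL: persist received data to fault-tolerant storage before processing,
// so a receiver or driver crash loses no acknowledged data.
// sconf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

ssc.start()
ssc.awaitTermination()
```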
14. Simple Fault-tolerant Streaming Infra
[Diagram: a Kafka cluster feeding a SparkStreamingJob (executors and tasks) running on the nodes of a High Availability Mesos cluster, with processed results written to storage (HDFS/DB).]
15. Scaling the pipeline
[Diagram: the same pipeline (Kafka cluster, Spark Streaming on a High Availability Mesos cluster, storage in HDFS/DB) scaled out across many nodes.]
Understanding the bottlenecks:
- Network: 1 Gbps
- # Cores/Slave: 4
- Disk IO: 100 MB/s on SSD
Goal:
- Receive & process data at 1M events/second
Choosing the correct # of resources:
- Since a single slave can handle up to 100 MB/s of network and disk IO, a
minimum of 6 slaves could take me to ~600 MB/s
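The sizing above can be written as a back-of-the-envelope calculation (a sketch using only the numbers from the slide; the per-slave cap is the slower of network and disk):

```scala
// Back-of-the-envelope cluster sizing from the slide's numbers.
object ClusterSizing {
  val networkMBps = 125.0  // 1 Gbps is roughly 125 MB/s
  val diskMBps    = 100.0  // SSD throughput from the slide

  // A slave can only sustain the slower of its network and disk paths.
  val perSlaveMBps: Double = math.min(networkMBps, diskMBps) // 100 MB/s

  def slavesFor(targetMBps: Double): Int =
    math.ceil(targetMBps / perSlaveMBps).toInt

  def main(args: Array[String]): Unit = {
    // ~600 MB/s aggregate needs 6 slaves, matching the slide.
    println(slavesFor(600.0))
  }
}
```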