Fully fault tolerant real time data pipeline with docker and mesos

Rahul Kumar
Technical Lead
LinuxCon / ContainerCon - Berlin, Germany

Agenda
Data Pipeline
Mesos + Docker
Reactive Data Pipeline

Goal
Analyzing data always has great benefits and is one of the
greatest challenges for an organization.

Today’s businesses generate massive amounts of digital data,
which is cumbersome to store, transport, and analyze.

Building distributed systems and offloading
workloads to commodity clusters is one of
the better approaches to solving the data
problem.

Characteristics of a Distributed System
❏ Resource Sharing
❏ Openness
❏ Concurrency
❏ Scalability
❏ Fault Tolerance
❏ Transparency

Manually Scale Frameworks & Install services

Complex
Very Limited
Inefficient
Low Utilization

Static Partitioning: a Blocker for a Fault-Tolerant
Data Pipeline

Failures make it even more complex to
manage

Apache Mesos
“Apache Mesos abstracts CPU, memory, storage,
and other compute resources away from machines
(physical or virtual), enabling fault-tolerant and
elastic distributed systems to easily be built and run
effectively.”

Mesos Features
Scalability: scale up to 10,000s of nodes
Fault-tolerant: replicated master and slaves using ZooKeeper
Docker support: Support for Docker containers
Native Container: Linux Native isolation between tasks with Linux Containers
Scheduling: Multi-resource scheduling (memory, CPU, disk, and ports)
API supports: Java, Python and C++ APIs for developing new parallel applications
Monitoring: Web UI for viewing cluster state

Docker Containerizer
Mesos adds support for launching tasks that contain Docker
images.
Users can launch a Docker image either as a Task or as an
Executor.
To enable the Docker Containerizer when running mesos-agent,
“docker” must be set as one of the containerizers options:
mesos-agent --containerizers=docker,mesos

Mesos Frameworks
Aurora: Aurora was developed at Twitter and later migrated to the Apache project.
Aurora is a framework that keeps services running across a shared pool of
machines and is responsible for keeping them running forever.
Marathon: A container orchestration framework for Mesos. Marathon helps
run other frameworks on Mesos. Marathon also runs application containers
such as Jetty, JBoss Server, and Play Server.
Chronos: A fault-tolerant job scheduler for Mesos. It was developed at Airbnb as
a replacement for cron.
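
To make the Marathon and Docker Containerizer descriptions above more concrete, here is a minimal sketch of how a Docker-based service could be described for Marathon. The field names follow Marathon's JSON app definition format as commonly documented, so they should be checked against the Marathon version in use. The `DockerApp` type, the app id, and the image name are hypothetical placeholders, not part of the original talk.

```scala
// Hypothetical description of a containerized service; every value is a placeholder.
final case class DockerApp(id: String, image: String, cpus: Double, memMb: Int, instances: Int)

object MarathonAppJson {
  // Render the app as a JSON document in the shape Marathon is assumed to expect
  // for Docker tasks. No JSON escaping is done, so inputs must be plain strings.
  def render(app: DockerApp): String =
    s"""{
       |  "id": "${app.id}",
       |  "cpus": ${app.cpus},
       |  "mem": ${app.memMb},
       |  "instances": ${app.instances},
       |  "container": {
       |    "type": "DOCKER",
       |    "docker": { "image": "${app.image}", "network": "HOST" }
       |  }
       |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val app = DockerApp("/stream-ingest", "example/stream-ingest:latest", 0.5, 512, 3)
    println(render(app))
  }
}
```

Such a document would typically be submitted to Marathon's REST API, which then asks Mesos agents running with the Docker Containerizer enabled to start the requested number of instances and restart them if they fail.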

Resilient Distributed Datasets
(RDDs)
- Big collection of data
which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
Spark Stack

Many big-data applications need to process large data streams in near-real time
Monitoring Systems
Alert Systems
Computing Systems
Why Spark Streaming?

Taken from Apache Spark.
What is Spark Streaming?

Framework for large scale stream processing
➔ Created at UC Berkeley
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.
What is Spark Streaming?

Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs
and processes them using RDD operations
- Finally, the processed results of the RDD
operations are returned in batches
Spark Streaming
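
The micro-batch model described above can be sketched in plain Scala, without Spark, to show the idea. This is an illustration only and not Spark's implementation: the `Event` type, the `MicroBatchSketch` object, and the sample timestamps are assumptions made for the example.

```scala
// A timestamped record from a hypothetical live stream.
final case class Event(timestampMs: Long, value: Int)

object MicroBatchSketch {
  // Assign each event to the batch whose time window contains it, ordered by window.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)

  // A deterministic batch job: the same input batch always yields the same result.
  def processBatch(batch: Seq[Event]): Int = batch.map(_.value).sum

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, 1), Event(900, 2), Event(1200, 3), Event(2500, 4))
    toBatches(events, batchIntervalMs = 1000).foreach { case (window, batch) =>
      println(s"batch $window -> ${processBatch(batch)}")
    }
  }
}
```

Because each batch job is deterministic, a batch lost to a failure can be recomputed from its input and produce the same result, which is what makes this model a good fit for a fault-tolerant pipeline.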

Point of Failure
Simple Streaming Pipeline

● To use Mesos from Spark, you need a Spark binary package available in a
place accessible (http/s3/hdfs) by Mesos, and a Spark driver program
configured to connect to Mesos.
● Configuring the driver program to connect to Mesos:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Point the driver at the ZooKeeper-managed Mesos masters.
val sconf = new SparkConf()
  .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
  .setAppName("MyStreamingApp")
  .set("spark.executor.uri", "hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "30")
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(sconf)
// One-second batch interval.
val ssc = new StreamingContext(sc, Seconds(1))
...
Spark Streaming over a HA Mesos Cluster

Real-time stream processing systems must be operational 24/7, which
requires them to recover from all kinds of failures in the system.
● Spark and its RDD abstraction are designed to seamlessly handle the failure of any worker node in
the cluster.
● In Streaming, driver failure can be recovered from by checkpointing application state.
● Write Ahead Logs (WAL) & Acknowledgements can ensure zero data loss.
Spark Streaming Fault-tolerance
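
The checkpointing and write-ahead log points above can be sketched as follows. This is a minimal outline under stated assumptions, not the speaker's actual setup: the checkpoint path and the `RecoverableStreamingApp` object are hypothetical, the master URL is assumed to be supplied at submit time, and the stream definitions are omitted, so at least one input stream and output operation would have to be added before the context can start.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableStreamingApp {
  // Hypothetical checkpoint location; it should live on fault-tolerant storage such as HDFS.
  val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("MyStreamingApp")
      // Ask receivers to persist incoming data to a write-ahead log.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

  // Called only when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(buildConf(), Seconds(1))
    ssc.checkpoint(checkpointDir)
    // Define input streams and transformations here (omitted in this sketch).
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from checkpoint data after a driver restart, or create a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this structure, a driver that is restarted by the cluster manager picks up from the last checkpoint instead of starting from an empty state.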

Simple Fault-tolerant Streaming Infra

● Figure out the bottleneck: CPU, memory, I/O, or network
● If parsing is involved, use the parser that gives
the highest performance.
● Proper data modeling
● Compression, serialization
Creating a scalable pipeline

Thank You
@rahul_kumar_aws
