SlideShare a Scribd company logo
Fully Fault Tolerant Real Time
Data Pipeline with Docker and
Mesos
Rahul Kumar
Technical Lead
LinuxCon  /  ContainerCon  -­ Berlin,  Germany
Agenda
Data Pipeline
Mesos + Docker
Reactive Data Pipeline
Goal
Analyzing data always have great benefits and is one of the
greatest challenge for an organization.
Today’s business generates massive amount of digital data.
which is cumbersome to store, transport and analyze
Making distributed system and off-
loading workload to commodity
clusters is one of the better approach to
solve data problem
Characteristics Of a distributed
system
❏ Resource Sharing
❏ Openness
❏ Concurrency
❏ Scalability
❏ Fault Tolerance
❏ Transparency
Collect
Store
Process
Analyze
Data Center
Manually Scale Frameworks & Install services
Complex
Very Limited
Inefficient
Low Utilization
Static Partitioning Blocker for Fault
Tolerant data pipeline
Failure make it even more complex to
manage
Apache Mesos
“Apache Mesos abstracts CPU, memory, storage,
and other compute resources away from machines
(physical or virtual), enabling fault-tolerant and
elastic distributed systems to easily be built and run
effectively.”
Mesos Features
Scalability: scale up to 10,000s of nodes
Fault-tolerant: replicated master and slaves using ZooKeeper
Docker support: Support for Docker containers
Native Container: Linux Native isolation between tasks with Linux Containers
Scheduling: Multi-resource scheduling (memory, CPU, disk, and ports)
API supports: Java, Python and C++ APIs for developing new parallel applications
Monitoring: Web UI for viewing cluster state
Resource Isolation
Docker  Containerizer
Mesos adds the support for launching tasks that contains Docker
images
Users can either launch a Docker image as a Task, or as an
Executor.
To run the mesos-agent to enable the Docker Containerizer,
“docker” must be set as one of the containerizers option
mesos-agent --containerizers=docker,mesos
Mesos Frameworks
Aurora: Aurora was developed at Twitter and the migrated to Apache Project later.
Aurora is a framework that keeps service running across a shared pool of
machines, and responsible for keeping them running forever.
Marathon: It is a framework for container orchestration for Mesos. Marathon helps
to run other framework on Mesos. Marathon also runs other application container
such as Jetty, JBoss Server, Play Server.
Chronos: Fault tolerance job scheduler for Mesos, It was developed at Airbnb as
replacement of cron.
Resilient Distributed Datasets
(RDDs)
- Big collection of data
which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
Spark Stack
Many big-data applications need to process large data streams in near-real time
Monitoring  Systems
Alert  Systems
Computing  Systems
Why  Spark  Streaming?
Taken  from  Apache  Spark.
What  is  Spark  Streaming?
Framework  for  large  scale  stream  processing  
➔ Created  at  UC  Berkeley  
➔ Scales  to  100s  of  nodes
➔ Can  achieve  second  scale  latencies
➔ Provides  a  simple  batch-­like  API  for  implementing  complex  algorithm
➔ Can  absorb  live  data  streams  from  Kafka,  Flume,  ZeroMQ,  Kinesis  etc.
What  is  Spark  Streaming?
Run  a  streaming  computation  as  a  series  of  very  small,  deterministic  batch  jobs
-­ Chop  up  the  live  stream  into  batches  of  X  seconds
-­ Spark  treats  each  batch  of  data  as  RDDs  
and  processes  them  using  RDD  operations
-­ Finally,  the  processed  results  of  the  RDD  
operations  are  returned  in  batches
Spark  Streaming
Point  of  Failure
Simple  Streaming  Pipeline
● To  use  Mesos  from  Spark,  you  need  a  Spark  binary  package  available  in  a  
place  accessible  (http/s3/hdfs)  by  Mesos,  and  a  Spark  driver  program  
configured  to  connect  to  Mesos.
● Configuring  the  driver  program  to  connect  to  Mesos:
val sconf  =  new  SparkConf()
.setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
.setAppName("MyStreamingApp")
.set("spark.executor.uri","hdfs://Sigmoid/executors/spark-­1.3.0-­bin-­hadoop2.4.tgz")
.set("spark.mesos.coarse", "true")
.set("spark.cores.max", "30")
.set("spark.executor.memory",  "10g")
val  sc  =  new  SparkContext(sconf)
val  ssc  =  new  StreamingContext(sc, Seconds(1))
...
Spark  Streaming  over  a  HA  Mesos  Cluster
Real-­time  stream  processing  systems  must  be  operational  24/7,  which  
requires  them  to  recover  from  all  kinds  of  failures  in  the  system.
● Spark  and  its  RDD  abstraction  is  designed  to  seamlessly  handle  failures  of  any  worker  nodes  in  
the  cluster.  
● In  Streaming,  driver  failure  can  be  recovered  with  checkpointing  application  state.
● Write  Ahead  Logs  (WAL)  &  Acknowledgements  can  ensure  0  data  loss.
Spark  Streaming  Fault-­tolerance
Simple  Fault-­tolerant  Streaming  Infra
● Figure out the bottleneck : CPU, Memory, IO, Network
● If parsing is involved, use the one which gives
high performance.
● Proper Data modeling
● Compression, Serialization
Creating  a  scalable  pipeline
Thank You
@rahul_kumar_aws
LinuxCon  /  ContainerCon  -­ Berlin,  Germany

More Related Content

What's hot

Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
DataStax
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Lucidworks
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
DataStax Academy
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
Codemotion Dubai
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
Anirvan Chakraborty
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
Legacy Typesafe (now Lightbend)
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
Knoldus Inc.
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
Patricia Gorla
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Patrick Di Loreto
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
DataStax Academy
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
yieldbot
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
Szilveszter Molnár
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
Claudiu Barbura
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and PythonBuilding a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
SingleStore
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 

What's hot (20)

Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and PythonBuilding a Real-Time Data Pipeline with Spark, Kafka, and Python
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 

Viewers also liked

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
ReactiveStream-meetup-Jan102015ppt
ReactiveStream-meetup-Jan102015pptReactiveStream-meetup-Jan102015ppt
ReactiveStream-meetup-Jan102015pptRahul Kumar
 
Building High Scalable Distributed Framework on Apache Mesos
Building High Scalable Distributed Framework on Apache MesosBuilding High Scalable Distributed Framework on Apache Mesos
Building High Scalable Distributed Framework on Apache MesosRahul Kumar
 
Composing and Scaling Data Platforms-2015
Composing and Scaling Data Platforms-2015Composing and Scaling Data Platforms-2015
Composing and Scaling Data Platforms-2015Rahul Kumar
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
Giivee The
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
Emanuele Bezzi
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
Tao Li
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and Beyond
QuantUniversity
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
Uwe Printz
 
Apache Toree
Apache ToreeApache Toree
Apache Toree
Asim Jalis
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
Uwe Printz
 
Apache Spark
Apache SparkApache Spark
Apache Spark
Uwe Printz
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
QuantUniversity
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
Mohit Jain
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep Learning
Asim Jalis
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
Anton Kirillov
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Apache kudu
Apache kuduApache kudu
Apache kudu
Asim Jalis
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 

Viewers also liked (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
ReactiveStream-meetup-Jan102015ppt
ReactiveStream-meetup-Jan102015pptReactiveStream-meetup-Jan102015ppt
ReactiveStream-meetup-Jan102015ppt
 
Building High Scalable Distributed Framework on Apache Mesos
Building High Scalable Distributed Framework on Apache MesosBuilding High Scalable Distributed Framework on Apache Mesos
Building High Scalable Distributed Framework on Apache Mesos
 
Composing and Scaling Data Platforms-2015
Composing and Scaling Data Platforms-2015Composing and Scaling Data Platforms-2015
Composing and Scaling Data Platforms-2015
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and Beyond
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Apache Toree
Apache ToreeApache Toree
Apache Toree
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep Learning
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 

Similar to Fully fault tolerant real time data pipeline with docker and mesos

Containerized Data Persistence on Mesos
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on Mesos
Joe Stein
 
OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating SystemOSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
NETWAYS
 
MANTL Data Platform, Microservices and BigData Services
MANTL Data Platform, Microservices and BigData ServicesMANTL Data Platform, Microservices and BigData Services
MANTL Data Platform, Microservices and BigData Services
Cisco DevNet
 
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Akhil Das
 
Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and docker
Vu Nguyen Duy
 
Enabling Microservices Frameworks to Solve Business Problems
Enabling Microservices Frameworks to Solve  Business ProblemsEnabling Microservices Frameworks to Solve  Business Problems
Enabling Microservices Frameworks to Solve Business Problems
Ken Owens
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
DataStax
 
DevOps in Age of Kubernetes
DevOps in Age of KubernetesDevOps in Age of Kubernetes
DevOps in Age of Kubernetes
Mesosphere Inc.
 
Choosing PaaS: Cisco and Open Source Options: an overview
Choosing PaaS:  Cisco and Open Source Options: an overviewChoosing PaaS:  Cisco and Open Source Options: an overview
Choosing PaaS: Cisco and Open Source Options: an overview
Cisco DevNet
 
Scaling ArangoDB on Mesosphere DCOS
Scaling ArangoDB on Mesosphere DCOSScaling ArangoDB on Mesosphere DCOS
Scaling ArangoDB on Mesosphere DCOS
Max Neunhöffer
 
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Joe Stein
 
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps.com
 
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupDatacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Paco Nathan
 
The New Stack Container Summit Talk
The New Stack Container Summit TalkThe New Stack Container Summit Talk
The New Stack Container Summit Talk
The New Stack
 
Doing Dropbox the Native Cloud Native Way
Doing Dropbox the Native Cloud Native WayDoing Dropbox the Native Cloud Native Way
Doing Dropbox the Native Cloud Native Way
Minio
 
#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"
#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"
#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"
epamspb
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Episode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OSEpisode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OS
Mesosphere Inc.
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
Mesosphere Inc.
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
Samir Bessalah
 

Similar to Fully fault tolerant real time data pipeline with docker and mesos (20)

Containerized Data Persistence on Mesos
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on Mesos
 
OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating SystemOSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
 
MANTL Data Platform, Microservices and BigData Services
MANTL Data Platform, Microservices and BigData ServicesMANTL Data Platform, Microservices and BigData Services
MANTL Data Platform, Microservices and BigData Services
 
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
 
Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and docker
 
Enabling Microservices Frameworks to Solve Business Problems
Enabling Microservices Frameworks to Solve  Business ProblemsEnabling Microservices Frameworks to Solve  Business Problems
Enabling Microservices Frameworks to Solve Business Problems
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
 
DevOps in Age of Kubernetes
DevOps in Age of KubernetesDevOps in Age of Kubernetes
DevOps in Age of Kubernetes
 
Choosing PaaS: Cisco and Open Source Options: an overview
Choosing PaaS:  Cisco and Open Source Options: an overviewChoosing PaaS:  Cisco and Open Source Options: an overview
Choosing PaaS: Cisco and Open Source Options: an overview
 
Scaling ArangoDB on Mesosphere DCOS
Scaling ArangoDB on Mesosphere DCOSScaling ArangoDB on Mesosphere DCOS
Scaling ArangoDB on Mesosphere DCOS
 
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
 
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
 
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupDatacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
 
The New Stack Container Summit Talk
The New Stack Container Summit TalkThe New Stack Container Summit Talk
The New Stack Container Summit Talk
 
Doing Dropbox the Native Cloud Native Way
Doing Dropbox the Native Cloud Native WayDoing Dropbox the Native Cloud Native Way
Doing Dropbox the Native Cloud Native Way
 
#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"
#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"
#ITsubbotnik Spring 2017: Roman Dimitrenko "Building Paas with the HashiStack"
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Episode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OSEpisode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OS
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 

Recently uploaded

Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Peter Caitens
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
NaapbooksPrivateLimi
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 

Recently uploaded (20)

Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 

Fully fault tolerant real time data pipeline with docker and mesos

  • 1. Fully Fault Tolerant Real Time Data Pipeline with Docker and Mesos Rahul Kumar Technical Lead LinuxCon  /  ContainerCon  -­ Berlin,  Germany
  • 2. Agenda Data Pipeline Mesos + Docker Reactive Data Pipeline
  • 3. Goal Analyzing data always have great benefits and is one of the greatest challenge for an organization.
  • 4. Today’s business generates massive amount of digital data.
  • 5. which is cumbersome to store, transport and analyze
  • 6. Making distributed system and off- loading workload to commodity clusters is one of the better approach to solve data problem
  • 7.
  • 8. Characteristics Of a distributed system ❏ Resource Sharing ❏ Openness ❏ Concurrency ❏ Scalability ❏ Fault Tolerance ❏ Transparency
  • 11. Manually Scale Frameworks & Install services
  • 13. Static Partitioning Blocker for Fault Tolerant data pipeline
  • 14. Failure make it even more complex to manage
  • 15. Apache Mesos “Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.”
  • 16.
  • 17. Mesos Features Scalability: scale up to 10,000s of nodes Fault-tolerant: replicated master and slaves using ZooKeeper Docker support: Support for Docker containers Native Container: Linux Native isolation between tasks with Linux Containers Scheduling: Multi-resource scheduling (memory, CPU, disk, and ports) API supports: Java, Python and C++ APIs for developing new parallel applications Monitoring: Web UI for viewing cluster state
  • 18.
  • 20.
  • 21.
  • 22. Docker  Containerizer Mesos adds the support for launching tasks that contains Docker images Users can either launch a Docker image as a Task, or as an Executor. To run the mesos-agent to enable the Docker Containerizer, “docker” must be set as one of the containerizers option mesos-agent --containerizers=docker,mesos
  • 23.
  • 24. Mesos Frameworks Aurora: Aurora was developed at Twitter and the migrated to Apache Project later. Aurora is a framework that keeps service running across a shared pool of machines, and responsible for keeping them running forever. Marathon: It is a framework for container orchestration for Mesos. Marathon helps to run other framework on Mesos. Marathon also runs other application container such as Jetty, JBoss Server, Play Server. Chronos: Fault tolerance job scheduler for Mesos, It was developed at Airbnb as replacement of cron.
  • 25. Resilient Distributed Datasets (RDDs) - Big collection of data which is: - Immutable - Distributed - Lazily evaluated - Type Inferred - Cacheable Spark Stack
  • 26. Many big-data applications need to process large data streams in near-real time Monitoring  Systems Alert  Systems Computing  Systems Why  Spark  Streaming?
  • 27. Taken  from  Apache  Spark. What  is  Spark  Streaming?
  • 28. Framework  for  large  scale  stream  processing   ➔ Created  at  UC  Berkeley   ➔ Scales  to  100s  of  nodes ➔ Can  achieve  second  scale  latencies ➔ Provides  a  simple  batch-­like  API  for  implementing  complex  algorithm ➔ Can  absorb  live  data  streams  from  Kafka,  Flume,  ZeroMQ,  Kinesis  etc. What  is  Spark  Streaming?
  • 29. Run  a  streaming  computation  as  a  series  of  very  small,  deterministic  batch  jobs -­ Chop  up  the  live  stream  into  batches  of  X  seconds -­ Spark  treats  each  batch  of  data  as  RDDs   and  processes  them  using  RDD  operations -­ Finally,  the  processed  results  of  the  RDD   operations  are  returned  in  batches Spark  Streaming
  • 30. Point  of  Failure Simple  Streaming  Pipeline
  • 31.
  • 32. ● To  use  Mesos  from  Spark,  you  need  a  Spark  binary  package  available  in  a   place  accessible  (http/s3/hdfs)  by  Mesos,  and  a  Spark  driver  program   configured  to  connect  to  Mesos. ● Configuring  the  driver  program  to  connect  to  Mesos: val sconf  =  new  SparkConf() .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos") .setAppName("MyStreamingApp") .set("spark.executor.uri","hdfs://Sigmoid/executors/spark-­1.3.0-­bin-­hadoop2.4.tgz") .set("spark.mesos.coarse", "true") .set("spark.cores.max", "30") .set("spark.executor.memory",  "10g") val  sc  =  new  SparkContext(sconf) val  ssc  =  new  StreamingContext(sc, Seconds(1)) ... Spark  Streaming  over  a  HA  Mesos  Cluster
  • 33. Real-­time  stream  processing  systems  must  be  operational  24/7,  which   requires  them  to  recover  from  all  kinds  of  failures  in  the  system. ● Spark  and  its  RDD  abstraction  is  designed  to  seamlessly  handle  failures  of  any  worker  nodes  in   the  cluster.   ● In  Streaming,  driver  failure  can  be  recovered  with  checkpointing  application  state. ● Write  Ahead  Logs  (WAL)  &  Acknowledgements  can  ensure  0  data  loss. Spark  Streaming  Fault-­tolerance
  • 35.
  • 36. ● Figure out the bottleneck : CPU, Memory, IO, Network ● If parsing is involved, use the one which gives high performance. ● Proper Data modeling ● Compression, Serialization Creating  a  scalable  pipeline
  • 37. Thank You @rahul_kumar_aws LinuxCon  /  ContainerCon  -­ Berlin,  Germany