SlideShare a Scribd company logo
Four Things to Know about
Reliable Spark Streaming
Dean Wampler, Typesafe
Tathagata Das, Databricks
Agenda for today
• The Stream Processing Landscape
• How Spark Streaming Works - A Quick Overview
• Features in Spark Streaming that Help Prevent
Data Loss
• Design Tips for Successful Streaming
Applications
The Stream
Processing Landscape
Stream Processors
Stream Storage
Stream Sources
MQTT$
How Spark Streaming Works:
A Quick Overview
Spark Streaming
Scalable, fault-tolerant stream processing system
File systems
Databases
Dashboards
Flume
Kinesis
HDFS/S3
Kafka
Twitter
High-level API
joins, windows, …
often 5x less code
Fault-tolerant
Exactly-once semantics,
even for stateful ops
Integration
Integrates with MLlib, SQL,
DataFrames, GraphX
Spark Streaming
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
9
data streams
receivers
batches results
Word Count with Kafka
val!context!=!new!StreamingContext(conf,!Seconds(1))!
val!lines!=!KafkaUtils.createStream(context,!...)!
10
entry point of streaming
functionality
create DStream
from Kafka data
Word Count with Kafka
val!context!=!new!StreamingContext(conf,!Seconds(1))!
val!lines!=!KafkaUtils.createStream(context,!...)!
val!words!=!lines.flatMap(_.split("!"))!
11
split lines into words
Word Count with Kafka
val!context!=!new!StreamingContext(conf,!Seconds(1))!
val!lines!=!KafkaUtils.createStream(context,!...)!
val!words!=!lines.flatMap(_.split("!"))!
val!wordCounts!=!words.map(x!=>!(x,!1))!
!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
wordCounts.print()!
context.start()!
12
print some counts on screen
count the words
start receiving and
transforming the data
Word Count with Kafka
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(new!SparkConf(),!Seconds(1))!
!!!!val!lines!=!KafkaUtils.createStream(context,!...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1)).reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
13
Features in Spark Streaming
that Help Prevent Data Loss
15
A Deeper View of
Spark Streaming
Any Spark Application
16
Spark
Driver
User code runs in
the driver process
YARN / Mesos /
Spark Standalone
cluster
Any Spark Application
17
Spark
Driver
User code runs in
the driver process
YARN / Mesos /
Spark Standalone
cluster
Spark
Executor
Spark
Executor
Spark
Executor
Driver launches
executors in
cluster
Any Spark Application
18
Spark
Driver
User code runs in
the driver process
YARN / Mesos /
Spark Standalone
cluster
Tasks sent to
executors for
processing data
Spark
Executor
Spark
Executor
Spark
Executor
Driver launches
executors in
cluster
Spark Streaming Application: Receive data
19
Executor
Executor
Driver runs
receivers as long
running tasks Receiver Data stream
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Spark Streaming Application: Receive data
20
Executor
Executor
Driver runs
receivers as long
running tasks Receiver Data stream
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Receiver divides
stream into blocks and
keeps in memory
Data Blocks!
Spark Streaming Application: Receive data
21
Executor
Executor
Driver runs
receivers as long
running tasks Receiver Data stream
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Receiver divides
stream into blocks and
keeps in memory
Data Blocks!
Blocks also
replicated to
another executor Data Blocks!
Spark Streaming Application: Process data
22
Executor
Executor
Receiver
Data Blocks!
Data Blocks!
Every batch
interval, driver
launches tasks to
process the blocks
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Spark Streaming Application: Process data
23
Executor
Executor
Receiver
Data Blocks!
Data Blocks!
Data
store
Every batch
interval, driver
launches tasks to
process the blocks
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Fault Tolerance
and Reliability
Failures? Why care?
Many streaming applications need zero data loss guarantees
despite any kind of failures in the system
At least once guarantee – every record processed at least once
Exactly once guarantee – every record processed exactly once
Different kinds of failures – executor and driver
Some failures and guarantee requirements need additional
configurations and setups
25
Executor
Receiver
Data Blocks!
What if an executor fails?
26
Executor
Failed Ex.
Receiver
Blocks!
Blocks!
Driver
If executor fails,
receiver is lost and
all blocks are lost
Executor
Receiver
Data Blocks!
What if an executor fails?
Tasks and receivers restarted by Spark
automatically, no config needed
27
Executor
Failed Ex.
Receiver
Blocks!
Blocks!
Driver
If executor fails,
receiver is lost and
all blocks are lost
Receiver
Receiver
restarted
Tasks restarted
on block replicas
What if the driver fails?
28
Executor
Blocks!How do we
recover?
When the driver
fails, all the
executors fail
All computation,
all received
blocks are lost
Executor
Receiver
Blocks!
Failed Ex.
Receiver
Blocks!
Failed
Executor
Blocks!
Driver
Failed
Driver
Recovering Driver w/ DStream Checkpointing
DStream Checkpointing:
Periodically save the DAG of DStreams to
fault-tolerant storage
29
Executor
Blocks!
Executor
Receiver
Blocks!
Active
Driver
Checkpoint info
to HDFS / S3
Recovering Driver w/ DStream Checkpointing
30
Failed driver can be restarted
from checkpoint information
Failed
Driver
Restarted
Driver
DStream Checkpointing:
Periodically save the DAG of DStreams to
fault-tolerant storage
Recovering Driver w/ DStream Checkpointing
31
Failed driver can be restarted
from checkpoint information
Failed
Driver
Restarted
Driver
New
Executor
New
Executor
Receiver
New executors
launched and
receivers
restarted
DStream Checkpointing:
Periodically save the DAG of DStreams to
fault-tolerant storage
Recovering Driver w/ DStream Checkpointing
1.  Configure automatic driver restart
All cluster managers support this
2.  Set a checkpoint directory in a HDFS-compatible file
system
!streamingContext.checkpoint(hdfsDirectory)!
3.  Slightly restructure of the code to use checkpoints for
recovery
32
Configurating Automatic Driver Restart
Spark Standalone – Use spark-submit with “cluster” mode and “--
supervise”
See http://spark.apache.org/docs/latest/spark-standalone.html
YARN – Use spark-submit in “cluster” mode
See YARN config “yarn.resourcemanager.am.max-attempts”
Mesos – Marathon can restart applications or use the “--supervise” flag.
33
Restructuring code for Checkpointing
34
val!context!=!new!StreamingContext(...)!
val!lines!=!KafkaUtils.createStream(...)!
val!words!=!lines.flatMap(...)!
...!
context.start()!
Create
+
Setup
Start
Restructuring code for Checkpointing
35
val!context!=!new!StreamingContext(...)!
val!lines!=!KafkaUtils.createStream(...)!
val!words!=!lines.flatMap(...)!
...!
context.start()!
Create
+
Setup
Start
def!creatingFunc():!StreamingContext!=!{!!
!!!val!context!=!new!StreamingContext(...)!!!
!!!val!lines!=!KafkaUtils.createStream(...)!
!!!val!words!=!lines.flatMap(...)!
!!!...!
!!!context.checkpoint(hdfsDir)!
}!
Put all setup code into a function that returns a new StreamingContext
Restructuring code for Checkpointing
36
val!context!=!new!StreamingContext(...)!
val!lines!=!KafkaUtils.createStream(...)!
val!words!=!lines.flatMap(...)!
...!
context.start()!
Create
+
Setup
Start
def!creatingFunc():!StreamingContext!=!{!!
!!!val!context!=!new!StreamingContext(...)!!!
!!!val!lines!=!KafkaUtils.createStream(...)!
!!!val!words!=!lines.flatMap(...)!
!!!...!
!!!context.checkpoint(hdfsDir)!
}!
Put all setup code into a function that returns a new StreamingContext
Get context setup from HDFS dir OR create a new one with the function
val!context!=!
StreamingContext.getOrCreate(!
!!hdfsDir,!creatingFunc)!
context.start()!
Restructuring code for Checkpointing
StreamingContext.getOrCreate():
If HDFS directory has checkpoint info
recover context from info
else
call creatingFunc() to create
and setup a new context
Restarted process can figure out whether
to recover using checkpoint info or not
37
def!creatingFunc():!StreamingContext!=!{!!
!!!val!context!=!new!StreamingContext(...)!!!
!!!val!lines!=!KafkaUtils.createStream(...)!
!!!val!words!=!lines.flatMap(...)!
!!!...!
!!!context.checkpoint(hdfsDir)!
}!
val!context!=!
StreamingContext.getOrCreate(!
!!hdfsDir,!creatingFunc)!
context.start()!
Received blocks lost on Restart!
38
Failed
Driver
Restarted
Driver
New
Executor
New Ex.
Receiver
No Blocks!
In-memory blocks of
buffered data are
lost on driver restart
Recovering data with Write Ahead Logs
Write Ahead Log (WAL): Synchronously
save received data to fault-tolerant storage
39
Executor
Blocks saved
to HDFS
Executor
Receiver
Blocks!
Active
Driver
Data stream
Recovering data with Write Ahead Logs
40
Failed
Driver
Restarted
Driver
New
Executor
New Ex.
Receiver
Blocks!
Blocks recovered
from Write Ahead Log
Write Ahead Log (WAL): Synchronously
save received data to fault-tolerant storage
Recovering data with Write Ahead Logs
1.  Enable checkpointing, logs written in checkpoint directory
3.  Enabled WAL in SparkConf configuration
!!!!!sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",!"true")!
3.  Receiver should also be reliable
Acknowledge source only after data saved to WAL
Unacked data will be replayed from source by restarted receiver
5.  Disable in-memory replication (already replicated by HDFS)
!Use StorageLevel.MEMORY_AND_DISK_SER for input DStreams
41
RDD Checkpointing
•  Stateful stream processing can lead to long RDD lineages
•  Long lineage = bad for fault-tolerance, too much recomputation
•  RDD checkpointing saves RDD data to the fault-tolerant storage to
limit lineage and recomputation
•  More:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
42
Fault-tolerance Semantics
43
Zero data loss = every stage processes each
event at least once despite any failure
Sources
Transforming
Sinks
Outputting
Receiving
Fault-tolerance Semantics
44
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
Receiving
Outputting
End-to-end semantics:
At-least once
Fault-tolerance Semantics
45
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
Receiving
Outputting
Exactly once, if outputs are idempotent or transactional
End-to-end semantics:
At-least once
Fault-tolerance Semantics
46
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
At least once, w/ Checkpointing + WAL + Reliable receiversReceiving
Outputting
Exactly once, if outputs are idempotent or transactional
End-to-end semantics:
At-least once
Fault-tolerance Semantics
47
Exactly once receiving with new Kafka Direct approach
Treats Kafka like a replicated log, reads it like a file
Does not use receivers
No need to create multiple DStreams and union them
No need to enable Write Ahead Logs
!
@val!directKafkaStream!=!KafkaUtils.createDirectStream(...)!
!
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Sources
Transforming
Sinks
Outputting
Receiving
Fault-tolerance Semantics
48
Exactly once receiving with new Kafka Direct approach
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
Exactly once, if outputs are idempotent or transactional
End-to-end semantics:
Exactly once!
Design Tips for Successful
Streaming Applications
Areas for consideration
• Enhance resilience with additional
components.
• Mini-batch vs. per-message handling.
• Exploit Reactive Streams.
• Use Storm, Akka, Samza, etc. for handling
individual messages, especially with sub-
second latency requirements.
• Use Spark Streaming’s mini-batch model for
the Lambda architecture and highly-scalable
analytics.
Mini-batch vs. per-message handling
• Consider Kafka or Kinesis for resilient buffering
in front of Spark Streaming.
•  Buffer for traffic spikes.
•  Re-retrieval of data if an RDD partition is lost and must be
reconstructed from the source.
• Going to store the raw data anyway?
•  Do it first, then ingest to Spark from that storage.
Enhance Resiliency with Additional
Components.
• Spark Streaming v1.5 will have support for back pressure to more
easily build end-to-end reactive applications
Exploit Reactive Streams
• Spark Streaming v1.5 will have support for back
pressure to more easily build end-to-end reactive
applications
• Backpressure from consumer to producer:
•  Prevents buffer overflows.
•  Avoids unnecessary throttling.
Exploit Reactive Streams
• Spark Streaming v1.4? Buffer with Akka
Streams:
Exploit Reactive Streams
Event
Source
Akka
App
Spark
Streaming App
Event
Event
Event
Event
feedback
(back pressure)
• Spark Streaming v1.4 has a rate limit property:
•  spark.streaming.receiver.maxRate
•  Consider setting it for long-running streaming apps with a
variable input flow rate.
• Have a graph of Reactive Streams? Consider using an
Akka app to buffer the data fed to Spark Streaming over
a socket (until 1.5…).
Exploit Reactive Streams
Thank you!
Dean Wampler, Typesafe
Tathagata Das, Databricks
Databricks is
available as a hosted
platform on 
AWS with a monthly
subscription. 
What to do next?
Start with a free trial
Typesafe now offers
certified support for
Spark, Mesos &
DCOS, read more
about it
READ MORE
REACTIVE PLATFORM

Full Lifecycle Support for Play, Akka, Scala and Spark
Give your project a boost with Reactive Platform:
• Monitor Message-Driven Apps
• Resolve Network Partitions Decisively
• Integrate Easily with Legacy Systems
• Eliminate Incompatibility & Security Risks
• Protect Apps Against Abuse
• Expert Support from Dedicated Product Teams
Enjoy learning? See about the availability of
on-site training for Scala, Akka, Play and Spark!
Learn more about our offersCONTACT US

More Related Content

What's hot

Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
Databricks
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Databricks
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best PracticesOracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Markus Michalewicz
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
ScyllaDB
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Flink Forward
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
Yaroslav Tkachenko
 

What's hot (20)

Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best PracticesOracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 

Viewers also liked

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
Helena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
Rahul Kumar
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
Rahul Kumar
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Patrick Di Loreto
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Spark Summit
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Spark Summit
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
Joel Koshy
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
Prakash Chockalingam
 

Viewers also liked (20)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 

Similar to Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
DataWorks Summit
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
phanleson
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lightbend
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
Dhafer Malouche
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
Databricks
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
Prakash Chockalingam
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Apex
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
Claudiu Barbura
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 

Similar to Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks (20)

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 

More from Legacy Typesafe (now Lightbend)

The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
Legacy Typesafe (now Lightbend)
 
Reactive Design Patterns
Reactive Design PatternsReactive Design Patterns
Reactive Design Patterns
Legacy Typesafe (now Lightbend)
 
Revitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with MicroservicesRevitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with Microservices
Legacy Typesafe (now Lightbend)
 
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreTypesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Legacy Typesafe (now Lightbend)
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Legacy Typesafe (now Lightbend)
 
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Legacy Typesafe (now Lightbend)
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Legacy Typesafe (now Lightbend)
 
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Legacy Typesafe (now Lightbend)
 
Microservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with TechnologyMicroservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with Technology
Legacy Typesafe (now Lightbend)
 
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
Legacy Typesafe (now Lightbend)
 
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Legacy Typesafe (now Lightbend)
 
Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)
Legacy Typesafe (now Lightbend)
 
Going Reactive in Java with Typesafe Reactive Platform
Going Reactive in Java with Typesafe Reactive PlatformGoing Reactive in Java with Typesafe Reactive Platform
Going Reactive in Java with Typesafe Reactive Platform
Legacy Typesafe (now Lightbend)
 
Why Play Framework is fast
Why Play Framework is fastWhy Play Framework is fast
Why Play Framework is fast
Legacy Typesafe (now Lightbend)
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
Legacy Typesafe (now Lightbend)
 

More from Legacy Typesafe (now Lightbend) (15)

The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
 
Reactive Design Patterns
Reactive Design PatternsReactive Design Patterns
Reactive Design Patterns
 
Revitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with MicroservicesRevitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with Microservices
 
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreTypesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
 
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive Platform
 
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
 
Microservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with TechnologyMicroservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with Technology
 
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
 
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
 
Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)
 
Going Reactive in Java with Typesafe Reactive Platform
Going Reactive in Java with Typesafe Reactive PlatformGoing Reactive in Java with Typesafe Reactive Platform
Going Reactive in Java with Typesafe Reactive Platform
 
Why Play Framework is fast
Why Play Framework is fastWhy Play Framework is fast
Why Play Framework is fast
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
 

Recently uploaded

Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
KrzysztofKkol1
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
varshanayak241
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
NaapbooksPrivateLimi
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
Jelle | Nordend
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 

Recently uploaded (20)

Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 

Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks

  • 1. Four Things to Know about Reliable Spark Streaming Dean Wampler, Typesafe Tathagata Das, Databricks
  • 2. Agenda for today • The Stream Processing Landscape • How Spark Streaming Works - A Quick Overview • Features in Spark Streaming that Help Prevent Data Loss • Design Tips for Successful Streaming Applications
  • 7. How Spark Streaming Works: A Quick Overview
  • 8. Spark Streaming Scalable, fault-tolerant stream processing system File systems Databases Dashboards Flume Kinesis HDFS/S3 Kafka Twitter High-level API joins, windows, … often 5x less code Fault-tolerant Exactly-once semantics, even for stateful ops Integration Integrates with MLlib, SQL, DataFrames, GraphX
  • 9. Spark Streaming Receivers receive data streams and chop them up into batches Spark processes the batches and pushes out the results 9 data streams receivers batches results
  • 10. Word Count with Kafka val!context!=!new!StreamingContext(conf,!Seconds(1))! val!lines!=!KafkaUtils.createStream(context,!...)! 10 entry point of streaming functionality create DStream from Kafka data
  • 11. Word Count with Kafka val!context!=!new!StreamingContext(conf,!Seconds(1))! val!lines!=!KafkaUtils.createStream(context,!...)! val!words!=!lines.flatMap(_.split("!"))! 11 split lines into words
  • 12. Word Count with Kafka val!context!=!new!StreamingContext(conf,!Seconds(1))! val!lines!=!KafkaUtils.createStream(context,!...)! val!words!=!lines.flatMap(_.split("!"))! val!wordCounts!=!words.map(x!=>!(x,!1))! !!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! wordCounts.print()! context.start()! 12 print some counts on screen count the words start receiving and transforming the data
  • 13. Word Count with Kafka object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(new!SparkConf(),!Seconds(1))! !!!!val!lines!=!KafkaUtils.createStream(context,!...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1)).reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }! 13
  • 14. Features in Spark Streaming that Help Prevent Data Loss
  • 15. 15 A Deeper View of Spark Streaming
  • 16. Any Spark Application 16 Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster
  • 17. Any Spark Application 17 Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster Spark Executor Spark Executor Spark Executor Driver launches executors in cluster
  • 18. Any Spark Application 18 Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster Tasks sent to executors for processing data Spark Executor Spark Executor Spark Executor Driver launches executors in cluster
  • 19. Spark Streaming Application: Receive data 19 Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }!
  • 20. Spark Streaming Application: Receive data 20 Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }! Receiver divides stream into blocks and keeps in memory Data Blocks!
  • 21. Spark Streaming Application: Receive data 21 Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }! Receiver divides stream into blocks and keeps in memory Data Blocks! Blocks also replicated to another executor Data Blocks!
  • 22. Spark Streaming Application: Process data 22 Executor Executor Receiver Data Blocks! Data Blocks! Every batch interval, driver launches tasks to process the blocks Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }!
  • 23. Spark Streaming Application: Process data 23 Executor Executor Receiver Data Blocks! Data Blocks! Data store Every batch interval, driver launches tasks to process the blocks Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }!
  • 25. Failures? Why care? Many streaming applications need zero data loss guarantees despite any kind of failures in the system At least once guarantee – every record processed at least once Exactly once guarantee – every record processed exactly once Different kinds of failures – executor and driver Some failures and guarantee requirements need additional configurations and setups 25
  • 26. Executor Receiver Data Blocks! What if an executor fails? 26 Executor Failed Ex. Receiver Blocks! Blocks! Driver If executor fails, receiver is lost and all blocks are lost
  • 27. Executor Receiver Data Blocks! What if an executor fails? Tasks and receivers restarted by Spark automatically, no config needed 27 Executor Failed Ex. Receiver Blocks! Blocks! Driver If executor fails, receiver is lost and all blocks are lost Receiver Receiver restarted Tasks restarted on block replicas
  • 28. What if the driver fails? 28 Executor Blocks!How do we recover? When the driver fails, all the executors fail All computation, all received blocks are lost Executor Receiver Blocks! Failed Ex. Receiver Blocks! Failed Executor Blocks! Driver Failed Driver
  • 29. Recovering Driver w/ DStream Checkpointing DStream Checkpointing: Periodically save the DAG of DStreams to fault-tolerant storage 29 Executor Blocks! Executor Receiver Blocks! Active Driver Checkpoint info to HDFS / S3
  • 30. Recovering Driver w/ DStream Checkpointing 30 Failed driver can be restarted from checkpoint information Failed Driver Restarted Driver DStream Checkpointing: Periodically save the DAG of DStreams to fault-tolerant storage
  • 31. Recovering Driver w/ DStream Checkpointing 31 Failed driver can be restarted from checkpoint information Failed Driver Restarted Driver New Executor New Executor Receiver New executors launched and receivers restarted DStream Checkpointing: Periodically save the DAG of DStreams to fault-tolerant storage
  • 32. Recovering Driver w/ DStream Checkpointing 1.  Configure automatic driver restart All cluster managers support this 2.  Set a checkpoint directory in a HDFS-compatible file system !streamingContext.checkpoint(hdfsDirectory)! 3.  Slightly restructure of the code to use checkpoints for recovery 32
  • 33. Configurating Automatic Driver Restart Spark Standalone – Use spark-submit with “cluster” mode and “-- supervise” See http://spark.apache.org/docs/latest/spark-standalone.html YARN – Use spark-submit in “cluster” mode See YARN config “yarn.resourcemanager.am.max-attempts” Mesos – Marathon can restart applications or use the “--supervise” flag. 33
  • 34. Restructuring code for Checkpointing 34 val!context!=!new!StreamingContext(...)! val!lines!=!KafkaUtils.createStream(...)! val!words!=!lines.flatMap(...)! ...! context.start()! Create + Setup Start
  • 35. Restructuring code for Checkpointing 35 val!context!=!new!StreamingContext(...)! val!lines!=!KafkaUtils.createStream(...)! val!words!=!lines.flatMap(...)! ...! context.start()! Create + Setup Start def!creatingFunc():!StreamingContext!=!{!! !!!val!context!=!new!StreamingContext(...)!!! !!!val!lines!=!KafkaUtils.createStream(...)! !!!val!words!=!lines.flatMap(...)! !!!...! !!!context.checkpoint(hdfsDir)! }! Put all setup code into a function that returns a new StreamingContext
  • 36. Restructuring code for Checkpointing 36 val!context!=!new!StreamingContext(...)! val!lines!=!KafkaUtils.createStream(...)! val!words!=!lines.flatMap(...)! ...! context.start()! Create + Setup Start def!creatingFunc():!StreamingContext!=!{!! !!!val!context!=!new!StreamingContext(...)!!! !!!val!lines!=!KafkaUtils.createStream(...)! !!!val!words!=!lines.flatMap(...)! !!!...! !!!context.checkpoint(hdfsDir)! }! Put all setup code into a function that returns a new StreamingContext Get context setup from HDFS dir OR create a new one with the function val!context!=! StreamingContext.getOrCreate(! !!hdfsDir,!creatingFunc)! context.start()!
  • 37. Restructuring code for Checkpointing StreamingContext.getOrCreate(): If HDFS directory has checkpoint info recover context from info else call creatingFunc() to create and setup a new context Restarted process can figure out whether to recover using checkpoint info or not 37 def!creatingFunc():!StreamingContext!=!{!! !!!val!context!=!new!StreamingContext(...)!!! !!!val!lines!=!KafkaUtils.createStream(...)! !!!val!words!=!lines.flatMap(...)! !!!...! !!!context.checkpoint(hdfsDir)! }! val!context!=! StreamingContext.getOrCreate(! !!hdfsDir,!creatingFunc)! context.start()!
  • 38. Received blocks lost on Restart! 38 Failed Driver Restarted Driver New Executor New Ex. Receiver No Blocks! In-memory blocks of buffered data are lost on driver restart
  • 39. Recovering data with Write Ahead Logs Write Ahead Log (WAL): Synchronously save received data to fault-tolerant storage 39 Executor Blocks saved to HDFS Executor Receiver Blocks! Active Driver Data stream
  • 40. Recovering data with Write Ahead Logs 40 Failed Driver Restarted Driver New Executor New Ex. Receiver Blocks! Blocks recovered from Write Ahead Log Write Ahead Log (WAL): Synchronously save received data to fault-tolerant storage
  • 41. Recovering data with Write Ahead Logs 1.  Enable checkpointing, logs written in checkpoint directory 3.  Enabled WAL in SparkConf configuration !!!!!sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",!"true")! 3.  Receiver should also be reliable Acknowledge source only after data saved to WAL Unacked data will be replayed from source by restarted receiver 5.  Disable in-memory replication (already replicated by HDFS) !Use StorageLevel.MEMORY_AND_DISK_SER for input DStreams 41
  • 42. RDD Checkpointing •  Stateful stream processing can lead to long RDD lineages •  Long lineage = bad for fault-tolerance, too much recomputation •  RDD checkpointing saves RDD data to the fault-tolerant storage to limit lineage and recomputation •  More: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing 42
  • 43. Fault-tolerance Semantics 43 Zero data loss = every stage processes each event at least once despite any failure Sources Transforming Sinks Outputting Receiving
  • 44. Fault-tolerance Semantics 44 Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost Receiving Outputting End-to-end semantics: At-least once
  • 45. Fault-tolerance Semantics 45 Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost Receiving Outputting Exactly once, if outputs are idempotent or transactional End-to-end semantics: At-least once
  • 46. Fault-tolerance Semantics 46 Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost At least once, w/ Checkpointing + WAL + Reliable receiversReceiving Outputting Exactly once, if outputs are idempotent or transactional End-to-end semantics: At-least once
  • 47. Fault-tolerance Semantics 47 Exactly once receiving with new Kafka Direct approach Treats Kafka like a replicated log, reads it like a file Does not use receivers No need to create multiple DStreams and union them No need to enable Write Ahead Logs ! @val!directKafkaStream!=!KafkaUtils.createDirectStream(...)! ! https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html http://spark.apache.org/docs/latest/streaming-kafka-integration.html Sources Transforming Sinks Outputting Receiving
  • 48. Fault-tolerance Semantics 48 Exactly once receiving with new Kafka Direct approach Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost Exactly once, if outputs are idempotent or transactional End-to-end semantics: Exactly once!
  • 49. Design Tips for Successful Streaming Applications
  • 50. Areas for consideration • Enhance resilience with additional components. • Mini-batch vs. per-message handling. • Exploit Reactive Streams.
  • 51. • Use Storm, Akka, Samza, etc. for handling individual messages, especially with sub- second latency requirements. • Use Spark Streaming’s mini-batch model for the Lambda architecture and highly-scalable analytics. Mini-batch vs. per-message handling
  • 52. • Consider Kafka or Kinesis for resilient buffering in front of Spark Streaming. •  Buffer for traffic spikes. •  Re-retrieval of data if an RDD partition is lost and must be reconstructed from the source. • Going to store the raw data anyway? •  Do it first, then ingest to Spark from that storage. Enhance Resiliency with Additional Components.
  • 53. • Spark Streaming v1.5 will have support for back pressure to more easily build end-to-end reactive applications Exploit Reactive Streams
  • 54. • Spark Streaming v1.5 will have support for back pressure to more easily build end-to-end reactive applications • Backpressure from consumer to producer: •  Prevents buffer overflows. •  Avoids unnecessary throttling. Exploit Reactive Streams
  • 55. • Spark Streaming v1.4? Buffer with Akka Streams: Exploit Reactive Streams Event Source Akka App Spark Streaming App Event Event Event Event feedback (back pressure)
  • 56. • Spark Streaming v1.4 has a rate limit property: •  spark.streaming.receiver.maxRate •  Consider setting it for long-running streaming apps with a variable input flow rate. • Have a graph of Reactive Streams? Consider using an Akka app to buffer the data fed to Spark Streaming over a socket (until 1.5…). Exploit Reactive Streams
  • 57. Thank you! Dean Wampler, Typesafe Tathagata Das, Databricks
  • 58. Databricks is available as a hosted platform on  AWS with a monthly subscription.  What to do next? Start with a free trial Typesafe now offers certified support for Spark, Mesos & DCOS, read more about it READ MORE
  • 59. REACTIVE PLATFORM
 Full Lifecycle Support for Play, Akka, Scala and Spark Give your project a boost with Reactive Platform: • Monitor Message-Driven Apps • Resolve Network Partitions Decisively • Integrate Easily with Legacy Systems • Eliminate Incompatibility & Security Risks • Protect Apps Against Abuse • Expert Support from Dedicated Product Teams Enjoy learning? See about the availability of on-site training for Scala, Akka, Play and Spark! Learn more about our offersCONTACT US

Editor's Notes

  1. You’ve heard from TD about Streaming internals and how to make your Spark Streaming apps resilient. Let’s expand the view a bit with additional tips.