SlideShare a Scribd company logo
C H A P T E R 1 0 : S P A R K S T R E A M I N G
Learning Spark
by Holden Karau et. al.
Overview: Spark Streaming
 A Simple Example
 Architecture and
Abstraction
 Transformations
 Stateless
 Stateful
 Output Operations
 Input Sources
 Core Sources
 Additional Sources
 Multiple Sources and Cluster
Sizing
 24/7 Operation
 Checkpointing
 Driver Fault Tolerance
 Worker Fault Tolerance
 Receiver Fault Tolerance
 Processing Guarantees
 Streaming UI
 Performance
Considerations
 Batch and Window Sizes
 Level of Parallelism
 Garbage Collection and
Memory Usage
 Conclusion
10.1 A Simple Example
 Before we dive into the details of Spark Streaming,
let’s consider a simple example. We will receive a
stream of newline-delimited lines of text from a
server running at port 7777, filter only the lines that
contain the word error, and print them.
 Spark Streaming programs are best run as
standalone applications built using Maven or sbt.
Spark Streaming, while part of Spark, ships as a
separate Maven artifact and has some additional
imports you will want to add to your project.
10.2 Architecture and Abstraction
Edx and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
10.2 Architecture and Abstraction (cont.)
10.3 Transformations
 Stateless
 the processing of each batch does not depend on the data of its
previous batches
 include the common RDD transformations like map(), filter(),
and reduceByKey()
 Stateful
 use data or intermediate results from previous batches to
compute the results of the current batch
 include transformations based on:
 sliding windows
 tracking state across time
10.3.1 Stateless Transformations
10.3.2 Stateless Transformations
 Windowed Transformation
 compute results across a longer time period than the
StreamingContext’s batch interval, by combining results from
multiple batches
A windowed stream with a window duration of
3 batches and a slide duration of 2 batches;
every two time steps, we compute a result over
the previous 3 time steps
10.3.2 Stateless Transformations (cont.)
 UpdateStateByKey transformation
 updateStateByKey() maintains state across the batches in a
DStream by providing access to a state variable for DStreams
of key/value pairs
 update(events, oldState)  returns a newState
 events is a list of events that arrived in the current batch (may be
empty)
 oldState is an optional state object, stored within an Option; it
might be missing if there was no previous state for the key
 newState is also an Option; we can return an empty Option to
specify that we want to delete the state
10.4 Output Operations
 Specify what needs to be done with the final transformed
data in a stream
 print()
 save()
 Saving DStream to text files in Scala
ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")
 Saving SequenceFiles from a DStream in Scala
val writableIpAddressRequestCount = ipAddressRequestCount.map {
(ip, count) => (new Text(ip), new LongWritable(count)) }
writableIpAddressRequestCount.saveAsHadoopFiles[
SequenceFileOutputFormat[Text,
LongWritable]]("outputDir", "txt")
10.5 Input Sources
 Spark Streaming has built-in support for a number
of different data sources.
 “core” sources are built into the Spark Streaming Maven
artifact
 others are available through additional artifacts
 Eg: spark-streaming-kafka.
10.5.1 Core Sources
 Stream of files
 allows a stream to be created from files written in a directory of a
Hadoop-compatible filesystem
 needs to have a consistent date format for the directory names and
the files have to be created atomically
 Eg: Streaming text files written to a directory in Scala
val logData = ssc.textFileStream(logDirectory)
 Akka actor stream
 allows using Akka actors as a source for streaming
 To construct an actor stream:
 create an Akka actor
 implement the org.apache.spark.streaming.receiver.ActorHelper
interface
10.5.2 Additional Sources
 Apache Kafka
 Apache Plume
 Push-based receiver
 Pull-based receiver
 Custom input sources
10.5.3 Multiple Sources and Cluster Sizing
 We can combine multiple DStreams using operations
like union()  combine data from multiple input
DStreams
 The receivers are executed in the Spark cluster to use
multiple ones
 Each receiver runs as a long-running task within Spark’s
executors, and hence occupies CPU cores allocated to the
application
 Note: Do not run Spark Streaming programs locally
with master config‐ ured as "local" or "local[1]”
10.6 “24/7” Operations
 Spark provides strong fault tolerance guarantees.
 As long as the input data is stored reliably, Spark Streaming
will always compute the correct result from it, offering “exactly
once” semantics, even if workers or the driver fail.
 To run Spark Streaming applications 24/7
1. setting up checkpointing to a reliable storage system, such as
HDFS or Amazon S3
2. worry about the fault tolerance of the driver program and of
unreliable input sources
10.6.1 Checkpointing
 Main mechanism needs to be set up for fault
tolerance
 Allows periodically saving data about the application
to a reliable storage system, such as HDFS or
Amazon S3  for use in recovering
 Two purposes:
 Limiting the state that must be recomputed on failure
 Providing fault tolerance for the driver
10.6.2 Driver Fault Tolerance
 Requires creating our StreamingContext, which
takes in the checkpoint directory
 use the StreamingContext.getOrCreate() function
 Write initialization code using getOrCreate(), need to
actually restart your driver program when it crashes
10.6.3 Worker Fault Tolerance
 Spark Streaming uses the same techniques as Spark
for its fault tolerance.
 All the data received from external sources is
replicated among the Spark workers
 All RDDs created through transformations of this
replicated input data are tolerant to failure of a
worker node, as the RDD lineage allows the system
to recompute the lost data all the way from the
surviving replica of the input data.
10.6.4 Receiver Fault Tolerance
 Spark Streaming restarts the failed receivers on
other nodes in the cluster
 Receivers provide the guarantees:
 All data read from a reliable filesystem (e.g., with
StreamingContext.hadoop Files) is reliable, because the underlying
filesystem is replicated.
 For unreliable sources such as Kafka, push-based Flume, or
Twitter, Spark repli‐ cates the input data to other nodes, but it
can briefly lose data if a receiver task is down.
10.6.5 Processing Guarantees
 Spark Streaming provide exactly- once semantics for
all transformations
 Even if a worker fails and some data gets reprocessed, the final
transformed result (that is, the transformed RDDs) will be the
same as if the data were processed exactly once.
 When the transformed result is to be pushed to
external systems using out‐ put operations, the task
pushing the result may get executed multiple times
due to failures, and some data can get pushed
multiple times.
10.7 Streaming UI
 UI page that lets us look at what applications are
doing. (typically http:// <driver>:4040)
10.8 Performance Considerations
 Batch in window sizes
 Level of parallelism
 Garbage Collection and Memory Usage
10.8.1 Batch and Window Sizes
 Minimum batch size Spark Streaming can use: 500
milliseconds
 The best approach:
 start with a larger batch size (around 10 seconds)
 work your way down to a smaller batch size.
 If the processing times reported in the Streaming UI
remain consistent, then you can continue to decrease
the batch size
 Note: if they are increasing you may have reached the limit for
your application.
10.8.2 Level of Parallelism
 Increasing the parallelism - a common way to reduce
the processing time of batches
 3 ways:
 Increasing the number of receivers
 Explicitly repartitioning received data
 Increasing parallelism in aggregation
10.8.3 Garbage Collection and Memory Usage
 Java’s garbage collection - an aspect that can cause
problems
 To minimize large pauses due to GC  enabling
Java’s Concurrent Mark- Sweep garbage collector.
 The Concurrent Mark-Sweep garbage collector does consume
more resources overall, but introduces fewer pauses.
 To reduce GC pressure
 Cache RDDs in serialized form
 Use Kryo serialization
 Use an LRU cache
Edx and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
10.9 Conclusion
 In this chapter, we have seen how to work with
streaming data using DStreams.
 Since DStreams are composed of RDDs, the
techniques and knowledge you have gained from the
earlier chapters remains applicable for streaming
and real-time applications.
 In the next chapter, we will look at machine learning
with Spark.

More Related Content

What's hot

Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
phanleson
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
Knoldus Inc.
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
Rahul Kumar
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Databricks
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
Rahul Kumar
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
Gerard Maas
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
caidezhi655
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Databricks
 
Continuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueContinuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert Xue
Databricks
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark Summit
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 

What's hot (20)

Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Continuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueContinuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert Xue
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 

Viewers also liked

Flexibility in the Family Office Structure is Key - Sean Cortis News Release
Flexibility in the Family Office Structure is Key - Sean Cortis News ReleaseFlexibility in the Family Office Structure is Key - Sean Cortis News Release
Flexibility in the Family Office Structure is Key - Sean Cortis News Release
Investments Network marcus evans
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operations
phanleson
 
Mobile Security - Wireless hacking
Mobile Security - Wireless hackingMobile Security - Wireless hacking
Mobile Security - Wireless hacking
phanleson
 
Authentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless ProtocolsAuthentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless Protocols
phanleson
 
Session 1 Tp1
Session 1 Tp1Session 1 Tp1
Session 1 Tp1phanleson
 
enterprise java bean
enterprise java beanenterprise java bean
enterprise java bean
Jitender Singh Lodhi
 
Hibernate Tutorial
Hibernate TutorialHibernate Tutorial
Hibernate TutorialRam132
 
Firewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth FirewallsFirewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth Firewalls
phanleson
 
Hacking web applications
Hacking web applicationsHacking web applications
Hacking web applications
phanleson
 
Introduction to hibernate
Introduction to hibernateIntroduction to hibernate
Introduction to hibernatehr1383
 
Intro To Hibernate
Intro To HibernateIntro To Hibernate
Intro To Hibernate
Amit Himani
 
Hibernate performance tuning
Hibernate performance tuningHibernate performance tuning
Hibernate performance tuning
Sander Mak (@Sander_Mak)
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
Rahul Jain
 

Viewers also liked (15)

Flexibility in the Family Office Structure is Key - Sean Cortis News Release
Flexibility in the Family Office Structure is Key - Sean Cortis News ReleaseFlexibility in the Family Office Structure is Key - Sean Cortis News Release
Flexibility in the Family Office Structure is Key - Sean Cortis News Release
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operations
 
Mobile Security - Wireless hacking
Mobile Security - Wireless hackingMobile Security - Wireless hacking
Mobile Security - Wireless hacking
 
Authentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless ProtocolsAuthentication in wireless - Security in Wireless Protocols
Authentication in wireless - Security in Wireless Protocols
 
Session 1 Tp1
Session 1 Tp1Session 1 Tp1
Session 1 Tp1
 
COM Introduction
COM IntroductionCOM Introduction
COM Introduction
 
enterprise java bean
enterprise java beanenterprise java bean
enterprise java bean
 
Hibernate Tutorial
Hibernate TutorialHibernate Tutorial
Hibernate Tutorial
 
Firewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth FirewallsFirewall - Network Defense in Depth Firewalls
Firewall - Network Defense in Depth Firewalls
 
Hacking web applications
Hacking web applicationsHacking web applications
Hacking web applications
 
JPA and Hibernate
JPA and HibernateJPA and Hibernate
JPA and Hibernate
 
Introduction to hibernate
Introduction to hibernateIntroduction to hibernate
Introduction to hibernate
 
Intro To Hibernate
Intro To HibernateIntro To Hibernate
Intro To Hibernate
 
Hibernate performance tuning
Hibernate performance tuningHibernate performance tuning
Hibernate performance tuning
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
 

Similar to Learning spark ch10 - Spark Streaming

Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Michael Spector
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow engine
dmoebius
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
DataWorks Summit
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
Joe Stein
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Spark Summit
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
Nan Zhu
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rulli
mfrancis
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage maker
PhilipBasford
 
weblogic perfomence tuning
weblogic perfomence tuningweblogic perfomence tuning
weblogic perfomence tuning
prathap kumar
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systemsSri Prasanna
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
Knoldus Inc.
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward
 

Similar to Learning spark ch10 - Spark Streaming (20)

Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow engine
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rulli
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage maker
 
weblogic perfomence tuning
weblogic perfomence tuningweblogic perfomence tuning
weblogic perfomence tuning
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systems
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
 

More from phanleson

E-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server AttacksE-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server Attacks
phanleson
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
phanleson
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBase
phanleson
 
Learning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlibLearning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlib
phanleson
 
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about LibertagiaHướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
phanleson
 
Lecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLLecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLphanleson
 
Lecture 4 - Adding XTHML for the Web
Lecture  4 - Adding XTHML for the WebLecture  4 - Adding XTHML for the Web
Lecture 4 - Adding XTHML for the Web
phanleson
 
Lecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many PurposesLecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many Purposes
phanleson
 
SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19
phanleson
 
Lecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service DevelopmentLecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service Development
phanleson
 
Lecture 15 - Technical Details
Lecture 15 - Technical DetailsLecture 15 - Technical Details
Lecture 15 - Technical Details
phanleson
 
Lecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange PatternsLecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange Patterns
phanleson
 
Lecture 9 - SOA in Context
Lecture 9 - SOA in ContextLecture 9 - SOA in Context
Lecture 9 - SOA in Context
phanleson
 
Lecture 07 - Business Process Management
Lecture 07 - Business Process ManagementLecture 07 - Business Process Management
Lecture 07 - Business Process Management
phanleson
 
Lecture 04 - Loose Coupling
Lecture 04 - Loose CouplingLecture 04 - Loose Coupling
Lecture 04 - Loose Coupling
phanleson
 
Lecture 2 - SOA
Lecture 2 - SOALecture 2 - SOA
Lecture 2 - SOA
phanleson
 
Lecture 3 - Services
Lecture 3 - ServicesLecture 3 - Services
Lecture 3 - Services
phanleson
 
Lecture 01 - Motivation
Lecture 01 - MotivationLecture 01 - Motivation
Lecture 01 - Motivation
phanleson
 

More from phanleson (18)

E-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server AttacksE-Commerce Security - Application attacks - Server Attacks
E-Commerce Security - Application attacks - Server Attacks
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBase
 
Learning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlibLearning spark ch11 - Machine Learning with MLlib
Learning spark ch11 - Machine Learning with MLlib
 
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about LibertagiaHướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
Hướng Dẫn Đăng Ký LibertaGia - A guide and introduciton about Libertagia
 
Lecture 1 - Getting to know XML
Lecture 1 - Getting to know XMLLecture 1 - Getting to know XML
Lecture 1 - Getting to know XML
 
Lecture 4 - Adding XTHML for the Web
Lecture  4 - Adding XTHML for the WebLecture  4 - Adding XTHML for the Web
Lecture 4 - Adding XTHML for the Web
 
Lecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many PurposesLecture 2 - Using XML for Many Purposes
Lecture 2 - Using XML for Many Purposes
 
SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19SOA Course - SOA governance - Lecture 19
SOA Course - SOA governance - Lecture 19
 
Lecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service DevelopmentLecture 18 - Model-Driven Service Development
Lecture 18 - Model-Driven Service Development
 
Lecture 15 - Technical Details
Lecture 15 - Technical DetailsLecture 15 - Technical Details
Lecture 15 - Technical Details
 
Lecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange PatternsLecture 10 - Message Exchange Patterns
Lecture 10 - Message Exchange Patterns
 
Lecture 9 - SOA in Context
Lecture 9 - SOA in ContextLecture 9 - SOA in Context
Lecture 9 - SOA in Context
 
Lecture 07 - Business Process Management
Lecture 07 - Business Process ManagementLecture 07 - Business Process Management
Lecture 07 - Business Process Management
 
Lecture 04 - Loose Coupling
Lecture 04 - Loose CouplingLecture 04 - Loose Coupling
Lecture 04 - Loose Coupling
 
Lecture 2 - SOA
Lecture 2 - SOALecture 2 - SOA
Lecture 2 - SOA
 
Lecture 3 - Services
Lecture 3 - ServicesLecture 3 - Services
Lecture 3 - Services
 
Lecture 01 - Motivation
Lecture 01 - MotivationLecture 01 - Motivation
Lecture 01 - Motivation
 

Recently uploaded

Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 

Recently uploaded (20)

Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 

Learning spark ch10 - Spark Streaming

  • 1. C H A P T E R 1 0 : S P A R K S T R E A M I N G Learning Spark by Holden Karau et. al.
  • 2. Overview: Spark Streaming  A Simple Example  Architecture and Abstraction  Transformations  Stateless  Stateful  Output Operations  Input Sources  Core Sources  Additional Sources  Multiple Sources and Cluster Sizing  24/7 Operation  Checkpointing  Driver Fault Tolerance  Worker Fault Tolerance  Receiver Fault Tolerance  Processing Guarantees  Streaming UI  Performance Considerations  Batch and Window Sizes  Level of Parallelism  Garbage Collection and Memory Usage  Conclusion
  • 3. 10.1 A Simple Example  Before we dive into the details of Spark Streaming, let’s consider a simple example. We will receive a stream of newline-delimited lines of text from a server running at port 7777, filter only the lines that contain the word error, and print them.  Spark Streaming programs are best run as standalone applications built using Maven or sbt. Spark Streaming, while part of Spark, ships as a separate Maven artifact and has some additional imports you will want to add to your project.
  • 4. 10.2 Architecture and Abstraction
  • 5. Edx and Coursera Courses  Introduction to Big Data with Apache Spark  Spark Fundamentals I  Functional Programming Principles in Scala
  • 6. 10.2 Architecture and Abstraction (cont.)
  • 7. 10.3 Transformations  Stateless  the processing of each batch does not depend on the data of its previous batches  include the common RDD transformations like map(), filter(), and reduceByKey()  Stateful  use data or intermediate results from previous batches to compute the results of the current batch  include transformations based on:  sliding windows  tracking state across time
  • 9. 10.3.2 Stateless Transformations  Windowed Transformation  compute results across a longer time period than the StreamingContext’s batch interval, by combining results from multiple batches A windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every two time steps, we compute a result over the previous 3 time steps
  • 10. 10.3.2 Stateless Transformations (cont.)  UpdateStateByKey transformation  updateStateByKey() maintains state across the batches in a DStream by providing access to a state variable for DStreams of key/value pairs  update(events, oldState)  returns a newState  events is a list of events that arrived in the current batch (may be empty)  oldState is an optional state object, stored within an Option; it might be missing if there was no previous state for the key  newState is also an Option; we can return an empty Option to specify that we want to delete the state
  • 11. 10.4 Output Operations  Specify what needs to be done with the final transformed data in a stream  print()  save()  Saving DStream to text files in Scala ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")  Saving SequenceFiles from a DStream in Scala val writableIpAddressRequestCount = ipAddressRequestCount.map { (ip, count) => (new Text(ip), new LongWritable(count)) } writableIpAddressRequestCount.saveAsHadoopFiles[ SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")
  • 12. 10.5 Input Sources  Spark Streaming has built-in support for a number of different data sources.  “core” sources are built into the Spark Streaming Maven artifact  others are available through additional artifacts  Eg: spark-streaming-kafka.
  • 13. 10.5.1 Core Sources  Stream of files  allows a stream to be created from files written in a directory of a Hadoop-compatible filesystem  needs to have a consistent date format for the directory names and the files have to be created atomically  Eg: Streaming text files written to a directory in Scala val logData = ssc.textFileStream(logDirectory)  Akka actor stream  allows using Akka actors as a source for streaming  To construct an actor stream:  create an Akka actor  implement the org.apache.spark.streaming.receiver.ActorHelper interface
  • 14. 10.5.2 Additional Sources  Apache Kafka  Apache Plume  Push-based receiver  Pull-based receiver  Custom input sources
  • 15. 10.5.3 Multiple Sources and Cluster Sizing  We can combine multiple DStreams using operations like union()  combine data from multiple input DStreams  The receivers are executed in the Spark cluster to use multiple ones  Each receiver runs as a long-running task within Spark’s executors, and hence occupies CPU cores allocated to the application  Note: Do not run Spark Streaming programs locally with master config‐ ured as "local" or "local[1]”
  • 16. 10.6 “24/7” Operations  Spark provides strong fault tolerance guarantees.  As long as the input data is stored reliably, Spark Streaming will always compute the correct result from it, offering “exactly once” semantics, even if workers or the driver fail.  To run Spark Streaming applications 24/7 1. setting up checkpointing to a reliable storage system, such as HDFS or Amazon S3 2. worry about the fault tolerance of the driver program and of unreliable input sources
  • 17. 10.6.1 Checkpointing  Main mechanism needs to be set up for fault tolerance  Allows periodically saving data about the application to a reliable storage system, such as HDFS or Amazon S3  for use in recovering  Two purposes:  Limiting the state that must be recomputed on failure  Providing fault tolerance for the driver
  • 18. 10.6.2 Driver Fault Tolerance  Requires creating our StreamingContext, which takes in the checkpoint directory  use the StreamingContext.getOrCreate() function  Write initialization code using getOrCreate(), need to actually restart your driver program when it crashes
  • 19. 10.6.3 Worker Fault Tolerance  Spark Streaming uses the same techniques as Spark for its fault tolerance.  All the data received from external sources is replicated among the Spark workers  All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
  • 20. 10.6.4 Receiver Fault Tolerance  Spark Streaming restarts the failed receivers on other nodes in the cluster  Receivers provide the guarantees:  All data read from a reliable filesystem (e.g., with StreamingContext.hadoop Files) is reliable, because the underlying filesystem is replicated.  For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark repli‐ cates the input data to other nodes, but it can briefly lose data if a receiver task is down.
  • 21. 10.6.5 Processing Guarantees  Spark Streaming provide exactly- once semantics for all transformations  Even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.  When the transformed result is to be pushed to external systems using out‐ put operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times.
  • 22. 10.7 Streaming UI  UI page that lets us look at what applications are doing. (typically http:// <driver>:4040)
  • 23. 10.8 Performance Considerations  Batch in window sizes  Level of parallelism  Garbage Collection and Memory Usage
  • 24. 10.8.1 Batch and Window Sizes  Minimum batch size Spark Streaming can use: 500 milliseconds  The best approach:  start with a larger batch size (around 10 seconds)  work your way down to a smaller batch size.  If the processing times reported in the Streaming UI remain consistent, then you can continue to decrease the batch size  Note: if they are increasing you may have reached the limit for your application.
  • 25. 10.8.2 Level of Parallelism  Increasing the parallelism - a common way to reduce the processing time of batches  3 ways:  Increasing the number of receivers  Explicitly repartitioning received data  Increasing parallelism in aggregation
  • 26. 10.8.3 Garbage Collection and Memory Usage  Java’s garbage collection - an aspect that can cause problems  To minimize large pauses due to GC  enabling Java’s Concurrent Mark- Sweep garbage collector.  The Concurrent Mark-Sweep garbage collector does consume more resources overall, but introduces fewer pauses.  To reduce GC pressure  Cache RDDs in serialized form  Use Kryo serialization  Use an LRU cache
  • 27. Edx and Coursera Courses  Introduction to Big Data with Apache Spark  Spark Fundamentals I  Functional Programming Principles in Scala
  • 28. 10.9 Conclusion  In this chapter, we have seen how to work with streaming data using DStreams.  Since DStreams are composed of RDDs, the techniques and knowledge you have gained from the earlier chapters remains applicable for streaming and real-time applications.  In the next chapter, we will look at machine learning with Spark.

Editor's Notes

  1. Spark Streaming uses a “micro-batch” architecture, where the streaming computa‐ tion is treated as a continuous series of batch computations on small batches of data. Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval. The batch interval is typically between 500 milliseconds and several seconds, as config‐ ured by the application developer. Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.
  2. Limiting the state that must be recomputed on failure. As discussed in “Architec‐ ture and Abstraction” on page 186, Spark Streaming can recompute state using the lineage graph of transformations, but checkpointing controls how far back it must go. Providing fault tolerance for the driver. If the driver program in a streaming application crashes, you can launch it again and tell it to recover from a check‐ point, in which case Spark Streaming will read how far the previous run of the program got in processing the data and take over from there.