Stream processing from
single node to a cluster
What are we going to talk about?
• What is stream processing?
• What are the challenges?
• Reactive streams
• Implementing reactive streams with Akka streams
• Spark streaming
• Questions?
What is a stream?
• A sequence of data elements that becomes available over time
• Can be finite (not interesting)
• List of items
• or infinite
• A live video stream
• Web analytics stream
• IoT event stream
• Processed one by one
• So what is the best way to process a stream?
Synchronous processing
• Items in the stream are processed one by one
• Every processing action blocks and waits to finish
• Plus: easy to implement
• Minus: can’t handle load
Asynchronous processing
• Items in the stream are stored in a buffer
• The consumer fetches items from the buffer at its own pace
• Plus: the publisher is no longer blocked
• Minus: what happens if the buffer fills up?
Solving the fast publisher problem
1. Increase the buffer size
• A temporary solution
• Good for peaks
• May cause an out-of-memory (OOM) error
2. Drop messages and signal the publisher to resend
• Messages are “wasted”
• TCP works this way
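The "drop when full" strategy above can be sketched with a bounded queue from the JDK. This is only an illustration, assuming a publisher that produces faster than anything consumes; the class and method names (`BoundedBufferDemo`, `countDropped`) are made up for this example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedBufferDemo {

    /**
     * Offers the items 0..produced-1 to a buffer of the given capacity
     * while nobody consumes, and returns how many items were dropped.
     */
    static int countDropped(int capacity, int produced) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 0; i < produced; i++) {
            // offer() never blocks: it returns false when the buffer is full
            if (!buffer.offer(i)) {
                dropped++; // a real publisher would have to resend this item later
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        System.out.println("dropped = " + countDropped(4, 10));
    }
}
```

A larger capacity only delays the first drop; it does not remove the problem, which is what motivates the demand-based approach of reactive streams.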
Reactive streams
• Ask the publisher for a specific number of messages
• No out-of-memory errors
• No wasted messages
• Part of the JDK since Java 9 (java.util.concurrent.Flow):
• Processor
• Publisher
• Subscriber
• Subscription
Reactive streams
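// Simplified excerpt of java.util.concurrent.Flow (JDK 9 and later)
public final class Flow {
    // Produces items and hands each subscriber a Subscription
    @FunctionalInterface
    public static interface Publisher<T> {
        public void subscribe(Subscriber<? super T> subscriber);
    }
    // Receives the subscription, the items, and the terminal signal
    public static interface Subscriber<T> {
        public void onSubscribe(Subscription subscription);
        public void onNext(T item);
        public void onError(Throwable throwable);
        public void onComplete();
    }
    // The link between the two: request(n) asks for n more items
    public static interface Subscription {
        public void request(long n);
        public void cancel();
    }
    // Both a Subscriber and a Publisher, e.g. a transformation step
    public static interface Processor<T,R> extends Subscriber<T>, Publisher<R> {}
}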
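To see how these interfaces cooperate, here is a minimal sketch of a subscriber that requests one item at a time, fed by the JDK's `SubmissionPublisher`. The class names (`OneAtATimeDemo`, `OneAtATimeSubscriber`) and the `run` helper are assumptions made for this example and are not part of the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class OneAtATimeDemo {

    /** A subscriber that never has more than one item of outstanding demand. */
    static class OneAtATimeSubscriber implements Flow.Subscriber<Integer> {
        private final List<Integer> received = new ArrayList<>();
        final CompletableFuture<List<Integer>> done = new CompletableFuture<>();
        private Flow.Subscription subscription;

        @Override
        public void onSubscribe(Flow.Subscription subscription) {
            this.subscription = subscription;
            subscription.request(1); // initial demand: exactly one item
        }

        @Override
        public void onNext(Integer item) {
            received.add(item);
            subscription.request(1); // ready for the next item
        }

        @Override
        public void onError(Throwable throwable) {
            done.completeExceptionally(throwable);
        }

        @Override
        public void onComplete() {
            done.complete(received);
        }
    }

    /** Publishes 1..count and returns what the subscriber received, in order. */
    static List<Integer> run(int count) {
        OneAtATimeSubscriber subscriber = new OneAtATimeSubscriber();
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) {
                // submit() blocks when the subscriber's buffer is full
                publisher.submit(i);
            }
        } // close() signals onComplete once the pending items are delivered
        return subscriber.done.join();
    }

    public static void main(String[] args) {
        System.out.println(run(5));
    }
}
```

The subscriber controls the pace: the publisher delivers an item only after `request` has been called, so no item is dropped and the subscriber is never handed more than it asked for.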
Akka streams
• High-level stream API that implements reactive streams
• Based on the Akka actor toolkit
[Diagram: Actor A sends a “Hello” message to Actor B]
Talk streams to me
• Graph – a description of how the stream is processed, composed of
processing stages
• Processing stage – the basic unit of the graph; it may transform,
receive or emit elements, and must not block
• Source – a processing stage with a single output; it emits
elements when the downstream stages are ready
• Sink – a processing stage with a single input; it requests and
accepts data
• Flow – a processing stage with a single input and a single output
Demo
Runnable Graph
• Runnable Graph = Source + Flow + Sink
• Executed by calling run()
• Nothing is processed until run() is called
• Materialization is when the materializer takes the stream “recipe”
and actually executes it
• How? Remember the Akka actors? The stages are run by actors behind the scenes
Complex stream graphs
• We want the lines of the file to reach two different flows
• This is called “Broadcast” in Akka streams
• The “~>” operator is used as a connector in the GraphDSL
• Once the graph is fully connected we can return ClosedShape
[Diagram: stream graph with the stages File, Lines mapper, Word counter, Cleaner, Print top words and Longest line]
Demo
Batching
• There are some cases where we want to collect several items and only
then apply our business logic
• Aggregative logic
• Batch writes to a DB
• We can use batch(max, seedFunction)(aggFunction): while the downstream
is backpressuring, it aggregates incoming elements, up to max elements
• max – the maximum number of elements in a batch
• seed – a function that creates a batch from a single element
• aggFunction – combines the existing batch with the next element
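The roles of max, seed and aggFunction can be modelled in a few lines of plain Java. This is a simplified model and not the Akka API: it assumes a list of elements that piled up while the downstream was busy and folds them into batches of at most `max` elements. The names `BatchSketch`, `batch` and `sums` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

public class BatchSketch {

    /** Folds the pending elements into batches of at most max elements each. */
    static <T, S> List<S> batch(List<T> pending, int max,
                                Function<T, S> seed,
                                BiFunction<S, T, S> aggregate) {
        List<S> batches = new ArrayList<>();
        S current = null;
        int size = 0;
        for (T element : pending) {
            // The first element of a batch goes through seed, the rest through aggregate
            current = (size == 0) ? seed.apply(element)
                                  : aggregate.apply(current, element);
            size++;
            if (size == max) { // the batch is full: emit it and start a new one
                batches.add(current);
                size = 0;
            }
        }
        if (size > 0) {
            batches.add(current); // emit the last, partially filled batch
        }
        return batches;
    }

    /** Example use: each batch is the sum of its elements. */
    static List<Integer> sums(List<Integer> pending, int max) {
        return BatchSketch.<Integer, Integer>batch(pending, max, x -> x, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sums(List.of(1, 2, 3, 4, 5), 2));
    }
}
```

In the real operator batching happens only under backpressure; when the downstream keeps up, elements pass through one at a time as single-element batches.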
To summarize
• Backpressure enables us to handle a stream in an efficient manner
• Akka streams implements the reactive streams API using Source,
Flow, Sink and Graph
• A graph is a blueprint (“recipe”) of processing stages
• We can build complex flows using the GraphDSL
• We can also batch elements
Stream processing requirements
• What if I need the same logic for stream processing and
batch processing?
• I want to run a cluster of stream processors
• I want it to recover from failures automatically
• I want it to handle multiple stream sources out of the box
• I want a high-level API
Spark streaming
• A Spark module for building scalable, fault-tolerant stream
processing applications
[Figure taken from the official Spark documentation]
Remember Spark?
• Spark is a cluster computing engine.
• It provides high-level APIs in Scala, Java, Python and R.
• The basic abstraction in Spark is the RDD.
• RDD stands for Resilient Distributed Dataset.
• It is a distributed collection of items whose source may be, for
example, Hadoop (HDFS), Kafka, Kinesis …
D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
[Diagram: a text file split into partitions by line range, the first ending at line 100, the next at line 200, then line 300, and so on]
RDD Actions
• Return values by evaluating the RDD (not lazy):
• collect() – returns an array containing all the elements of the
RDD. This is the main method that evaluates the RDD.
• count() – returns the number of elements in the RDD.
• first() – returns the first element of the RDD.
• foreach(f) – applies the function f to each element of the
RDD.
RDD Transformations
• Return a pointer to a new RDD that carries the transformation metadata (lazy)
• map(func) – returns a new distributed dataset formed by passing
each element of the source through the function func.
• filter(func) – returns a new dataset formed by selecting those
elements of the source on which func returns true.
• flatMap(func) – similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
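The split between lazy transformations and evaluating actions has a close analogue in `java.util.stream`, which can be tried without a cluster. The sketch below is an analogy in plain Java and not the Spark API; `LazyPipelineSketch`, `words` and `mapCalls` are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {

    // Counts how often the map step actually runs
    static final AtomicInteger mapCalls = new AtomicInteger();

    /** Builds the pipeline only; nothing is computed here. */
    static Stream<String> words(List<String> lines) {
        return lines.stream()
                .map(line -> {                        // like map(func)
                    mapCalls.incrementAndGet();
                    return line.trim();
                })
                .filter(line -> !line.isEmpty())      // like filter(func)
                .flatMap(line -> Arrays.stream(line.split("\\s+"))); // like flatMap(func)
    }

    public static void main(String[] args) {
        Stream<String> pipeline = words(List.of(" foo bar ", "", "hello"));
        System.out.println("map calls before the terminal operation: " + mapCalls.get());
        // collect() plays the role of an action: it forces the evaluation
        System.out.println(pipeline.collect(Collectors.toList()));
        System.out.println("map calls after the terminal operation: " + mapCalls.get());
    }
}
```

As with an RDD, building the pipeline only records what should happen; the work starts when a result is requested.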
Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time, usually into batches of a few seconds
• DStream – a sequence of RDD slices, each containing the
items that arrived during one time slice
[Figure taken from the official Spark documentation]
Example
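import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// conf, checkpoint and folder are assumed to be defined elsewhere
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
ssc.checkpoint(checkpoint.toString)
// Every new file in the folder is read line by line; each line is parsed as an Int
val dstream: DStream[Int] =
  ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()
ssc.start()            // nothing is processed before start()
ssc.awaitTermination()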
DStream operations
• Similar to RDD operations, with small changes
• map(func) – returns a new DStream by applying func to every
element of the original stream.
• filter(func) – returns a new DStream formed by selecting those
elements of the source stream on which func returns true.
• reduce(func) – returns a new DStream of single-element
RDDs by applying the reduce func to every source RDD.
Using your existing batch logic
• transform(func) – an operation that creates a new DStream by
applying func to each RDD of the DStream.
dstream.transform(existingBusinessFunction)
Updating the state
• None of the operations so far keeps state
• How do I accumulate results across batches?
• updateStateByKey(updateFunc) – a transformation that
creates a new key-value DStream in which the value of each key is
updated from its previous state and the new values.
// newValues: the values that arrived for a key in the current batch
// runningCount: the previous state for that key (None the first time the key is seen)
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
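To make the state update concrete, here is the same logic as a plain-Java model that can be run without Spark. It assumes one key whose values arrive in three micro-batches; `RunningCountSketch`, `update` and `replay` are hypothetical names, and `java.util.Optional` stands in for the optional state.

```java
import java.util.List;
import java.util.Optional;

public class RunningCountSketch {

    /** New state = previous state (0 if there is none yet) + sum of the batch. */
    static Optional<Integer> update(List<Integer> newValues, Optional<Integer> runningCount) {
        int batchSum = newValues.stream().mapToInt(Integer::intValue).sum();
        return Optional.of(runningCount.orElse(0) + batchSum);
    }

    /** Applies update() batch after batch, as a stream of micro-batches would. */
    static Optional<Integer> replay(List<List<Integer>> batches) {
        Optional<Integer> state = Optional.empty(); // the key has not been seen yet
        for (List<Integer> batch : batches) {
            state = update(batch, state);
        }
        return state;
    }

    public static void main(String[] args) {
        // Three micro-batches for one key: two events, one event, no events
        System.out.println(replay(List.of(List.of(1, 1), List.of(1), List.of())));
    }
}
```

The state survives the empty batch unchanged, which is why stateful streams need data checkpoints to recover it after a failure.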
Checkpoints
• Checkpointing – periodically saves the data needed to recover
from failures to reliable storage (HDFS/S3/…)
• Metadata checkpoints
• Configuration of the streaming context
• DStream definitions and operations
• Incomplete batches
• Data checkpoints
• Saving stateful RDD data
Checkpoints
• To enable checkpointing:
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application:
• StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext)
Working with foreachRDD
• A common practice is to use foreachRDD(func) to push
data to an external system.
• Don’t do:
dstream.foreachRDD { rdd =>
  // Created on the driver, so it would have to be serialized and sent to the executors
  val myExternalResource = ...
  rdd.foreachPartition { partition =>
    myExternalResource.save(partition)
  }
}
Working with foreachRDD
• Instead do:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Created on the executor, once per partition
    val myExternalResource = ...
    myExternalResource.save(partition)
  }
}
To summarize
• Spark streaming provides a high-level micro-batch API
• It is distributed because it is built on RDDs
• It is fault tolerant thanks to checkpoints
• You can keep state that is updated over time
• Use foreachRDD carefully
Questions?

Stream processing from single node to a cluster

  • 1. Stream processing from single node to a cluster
  • 2. What are we going to talk about? • What is stream processing? • What are the challenges? • Reactive streams • Implementing reactive streams with Akka streams • Spark streaming • Questions ?
  • 3. What is a stream? • A sequence of data elements that becomes available over time • Can be finite (not interesting) • List of items • or infinite • A live video stream • Web analytics stream • IOT event stream • Processed one by one • So what is the best way to process a stream?
  • 4. Synchronous processing • Items in the stream are processed one by one • Every processing action blocks and waits to finish • Plus: easy to implement • Minus: can’t handle load
  • 5. A-Synchronous processing • Items in the stream are stored in a buffer • The consumer fetches items from the buffer in his own time • Plus: not blocking any more • Minus: what happens if the buffer fills up ?
  • 6. Solving the fast publisher problem
      1. Increase the buffer size
          • Temporary solution
          • Good for peaks
          • May cause an OOM error
      2. Drop messages and signal the publisher to resend
          • Messages are "wasted"
          • TCP works this way
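The second option above can be illustrated with a small sketch. It assumes a bounded in-memory buffer that nobody drains (the worst case of a fast publisher) and simply counts the messages that have to be dropped. The class and method names are hypothetical, and resend signalling is not modeled.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a fixed-size buffer in front of a consumer that is stalled.
class DroppingBuffer {
    // Offers items 1..produced to a buffer of the given capacity (must be >= 1)
    // and returns how many items were dropped because the buffer was full.
    static int countDropped(int capacity, int produced) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 1; i <= produced; i++) {
            // offer() does not block: it returns false when the buffer is full
            if (!buffer.offer(i)) {
                dropped++;
            }
        }
        return dropped;
    }
}
```

A larger capacity only postpones the first drop, which is why increasing the buffer size helps with peaks but not with a publisher that is permanently faster than the consumer.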
  • 7. Reactive streams
      • Ask the publisher for a specific number of messages
      • No out-of-memory errors
      • No messages wasted
      • Part of the Java 9 JDK:
          • Processor
          • Publisher
          • Subscriber
          • Subscription
  • 8. Reactive streams (nested interfaces of java.util.concurrent.Flow, written as in the Javadoc)
          @FunctionalInterface
          public static interface Flow.Publisher<T> {
              public void subscribe(Flow.Subscriber<? super T> subscriber);
          }
          public static interface Flow.Subscriber<T> {
              public void onSubscribe(Flow.Subscription subscription);
              public void onNext(T item);
              public void onError(Throwable throwable);
              public void onComplete();
          }
          public static interface Flow.Subscription {
              public void request(long n);
              public void cancel();
          }
          public static interface Flow.Processor<T,R> extends Flow.Subscriber<T>, Flow.Publisher<R> {}
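A minimal sketch of how these interfaces express backpressure, assuming the JDK's built-in SubmissionPublisher as the publisher. The subscriber class and the consume helper are illustrative names, not part of the talk: the subscriber requests one item at a time, so it is never handed more than it asked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Hypothetical subscriber: asks for one item at a time, so the publisher
// can never push more than the subscriber has requested.
class DemandDrivenSubscriber implements Flow.Subscriber<Integer> {
    private final List<Integer> received = new ArrayList<>();
    private final CountDownLatch done = new CountDownLatch(1);
    private Flow.Subscription subscription;

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        this.subscription = subscription;
        subscription.request(1);          // initial demand: one item
    }

    @Override
    public void onNext(Integer item) {
        received.add(item);               // "process" the item
        subscription.request(1);          // then signal demand for the next one
    }

    @Override
    public void onError(Throwable throwable) {
        done.countDown();
    }

    @Override
    public void onComplete() {
        done.countDown();
    }

    // Publishes 1..count and returns what the subscriber received, in order.
    static List<Integer> consume(int count) {
        DemandDrivenSubscriber subscriber = new DemandDrivenSubscriber();
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) {
                publisher.submit(i);      // blocks if the subscriber's buffer is full
            }
        }                                 // close() signals onComplete after delivery
        try {
            subscriber.done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return subscriber.received;
    }
}
```

Requesting a larger n in onSubscribe would trade some memory for fewer demand signals; requesting one at a time keeps the sketch easy to follow.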
  • 9. Akka streams
      • High-level stream API that implements reactive streams
      • Based on the Akka actor toolkit
      (Diagram: Actor A sends a "Hello" msg to Actor B)
  • 10. Talk streams to me
      • Graph - a description of how the stream is processed, composed of processing stages
      • Processing stage - the basic unit of the graph; may transform, receive or emit elements; must not block
      • Source - a processing stage with a single output; emits elements when the downstream stages are ready
      • Sink - a processing stage with a single input; requests and accepts data
      • Flow - a processing stage with a single input and a single output
  • 11. Demo
  • 12. Runnable Graph
      • Runnable Graph = Source + Flow + Sink
      • Executed by calling run()
      • Nothing runs until run() is called
      • Materialization is when the materializer takes the stream "recipe" and actually executes it
      • How? Remember the Akka actors?
  • 13. Complex stream graphs
      • We want the lines of the file to go to two different flows
      • This is called "Broadcast" in Akka streams
      • The "~>" sign is used as a connector in the GraphDSL
      • Once the graph is connected we can return a closed shape
      (Diagram labels: File Lines mapper Word counter Cleaner Print Top words Longest Line)
  • 14. Demo
  • 15. Batching
      • There are some cases when we want to collect several items and only then apply our business logic
          • Aggregative logic
          • Batch writes to a DB
      • We can use batch(max, seedFunction)(aggFunction): under backpressure, it aggregates elements until max elements have been collected
          • max - defines the maximum number of elements
          • seedFunction - a function that creates a batch from a single element
          • aggFunction - combines the existing batch with the next element
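The roles of max, the seed function and the aggregate function can be sketched without Akka. The code below is plain Java, not the Akka streams API: it assumes that all elements in pending arrived while the downstream was busy, and folds them into batches of at most max elements. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java model of the seed/aggregate idea; this is not Akka code.
class BatchSketch {
    // Folds the pending elements into batches holding at most max elements each (max >= 1).
    static <T, S> List<S> batch(List<T> pending, int max,
                                Function<T, S> seed,
                                BiFunction<S, T, S> aggregate) {
        List<S> batches = new ArrayList<>();
        S current = null;
        int size = 0;
        for (T element : pending) {
            if (size == 0) {
                current = seed.apply(element);               // first element starts a batch
                size = 1;
            } else {
                current = aggregate.apply(current, element); // later elements are merged in
                size++;
            }
            if (size == max) {                               // batch is full: emit it
                batches.add(current);
                size = 0;
            }
        }
        if (size > 0) {
            batches.add(current);                            // emit the last, partial batch
        }
        return batches;
    }
}
```

For example, summing the pending elements 1 to 5 with max = 2 yields the batches 3, 7 and 5. In a real backpressured stream a batch is also emitted as soon as the downstream signals demand, which this sketch does not model.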
  • 16. To summarize
      • Backpressure enables us to handle streams in an efficient manner
      • Akka streams implement the reactive streams API using Source, Flow, Graph, Sink
      • A Graph is a blueprint ("recipe") of processing stages
      • We can build complex flows using the Graph DSL
      • We can also batch
  • 17. Stream processing requirements
      • What if I need to have the same logic for stream processing and batch processing?
      • I want to run a cluster of stream processors
      • I want it to recover from failure automatically
      • Handle multiple stream sources out of the box
      • High-level API
  • 18. Spark streaming
      • A Spark module for building scalable, fault-tolerant stream processing
      (Figure taken from the official Spark documentation)
  • 19. Remember Spark?
      • Spark is a cluster computing engine.
      • Provides high-level APIs in Scala, Java, Python and R.
      • The basic abstraction in Spark is the RDD.
      • Stands for: Resilient Distributed Dataset.
      • It is a distributed collection of items whose source may be, for example, Hadoop (HDFS), Kafka, Kinesis …
  • 20. D is for Partitioned
      • A partition is a sub-collection of data that should fit into memory
      • Partition + transformation = Task
      • This is the distributed part of the RDD
      • Partitions are recomputed in case of failure - Resilient
      (Diagram: the lines of a text file split into partitions, e.g. starting at lines 100, 200 and 300)
  • 21. RDD Actions
      • Return values by evaluating the RDD (not lazy):
      • collect() - returns an array containing all the elements of the RDD. This is the main method that evaluates the RDD.
      • count() - returns the number of elements in the RDD.
      • first() - returns the first element of the RDD.
      • foreach(f) - applies the function f to each element of the RDD.
  • 22. RDD Transformations
      • Return a pointer to a new RDD with transformation metadata
      • map(func) - returns a new distributed dataset formed by passing each element of the source through the function func.
      • filter(func) - returns a new dataset formed by selecting those elements of the source on which func returns true.
      • flatMap(func) - similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
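The split between lazy transformations and evaluating actions can be illustrated by analogy. The sketch below does not use Spark: it assumes that java.util.stream behaves similarly enough for illustration, with flatMap, filter and map only describing the pipeline and collect playing the role of the action. The class name and the counter parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: a java.util.stream pipeline stands in for an RDD lineage.
class LazyPipelineSketch {
    // Describes "split lines into words, drop empty ones, map to length".
    // mapCalls counts how many times the map function has actually run.
    static Stream<Integer> wordLengths(List<String> lines, AtomicInteger mapCalls) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // 0..n words per line
                .filter(word -> !word.isEmpty())
                .map(word -> {
                    mapCalls.incrementAndGet();
                    return word.length();
                });
    }
}
```

Calling wordLengths runs none of the functions; the counter only moves once a terminal operation such as collect is invoked on the returned stream, much as an RDD is only computed when an action is called.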
  • 23. Micro batching with Spark Streaming
      • Takes a partitioned stream of data
      • Slices it up by time - usually seconds
      • DStream - composed of RDD slices, each containing a collection of items
      (Figure taken from the official Spark documentation)
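Slicing a stream by time can be sketched in plain Java. This is not Spark code: it assumes each item carries an arrival timestamp in milliseconds and assigns it to the batch whose interval contains that timestamp. The names are hypothetical, and intervals with no items are simply absent from the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of time-based micro-batching; this is not the Spark API.
class MicroBatchSketch {
    // Groups items by batch index = arrival time / batch interval (batchMillis > 0).
    static Map<Long, List<String>> slice(long[] timestampsMillis, String[] items, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();   // sorted by batch index
        for (int i = 0; i < items.length; i++) {
            long batchIndex = timestampsMillis[i] / batchMillis;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(items[i]);
        }
        return batches;
    }
}
```

With a 1000 ms interval, items arriving at 0, 400, 1200, 2500 and 2900 ms fall into three batches: two items, one item and two items. Each such slice corresponds to one RDD of the DStream.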
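  • 24. Example
          import org.apache.spark.streaming.{Seconds, StreamingContext}
          import org.apache.spark.streaming.dstream.DStream
          // conf, checkpoint and folder are defined elsewhere
          val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
          ssc.checkpoint(checkpoint.toString())
          val dstream: DStream[Int] = ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
          dstream.print()
          ssc.start()
          ssc.awaitTermination()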
  • 25. DStream operations
      • Similar to RDD operations, with small changes
      • map(func) - returns a new DStream by applying func to every element of the original stream.
      • filter(func) - returns a new DStream formed by selecting those elements of the source stream on which func returns true.
      • reduce(func) - returns a new DStream of single-element RDDs by applying the reduce func to every source RDD.
  • 26. Using your existing batch logic
      • transform(func) - an operation that creates a new DStream by applying func to the RDDs of the DStream.
          dstream.transform(existingBuisnessFunction)
  • 27. Updating the state
      • All the operations so far didn't have state
      • How do I accumulate results with the current batch?
      • updateStateByKey(updateFunc) - a transformation that creates a new key-value DStream where the value for each key is updated according to the previous state and the new values.
          def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
            // Add this batch's values to the previous count, or start from them if there is no state yet
            runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
          }
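How the state evolves from batch to batch can be sketched in plain Java. This is a model of the idea, not the Spark API: state is a map from key to running count, and each incoming batch maps a key to the new values seen for it. All names are hypothetical. Keys that are absent from a batch are left untouched here, which gives the same result as adding an empty sum.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Plain-Java model of per-key state updates across micro-batches; not Spark code.
class RunningCountSketch {
    // Same logic as the update function on the slide: previous count plus the batch sum.
    static Optional<Integer> updateFunction(List<Integer> newValues, Optional<Integer> runningCount) {
        int batchSum = newValues.stream().mapToInt(Integer::intValue).sum();
        return Optional.of(runningCount.orElse(0) + batchSum);
    }

    // Applies one batch to the current state and returns the new state.
    static Map<String, Integer> applyBatch(Map<String, Integer> state,
                                           Map<String, List<Integer>> batch) {
        Map<String, Integer> next = new HashMap<>(state);
        batch.forEach((key, values) ->
                updateFunction(values, Optional.ofNullable(state.get(key)))
                        .ifPresent(updated -> next.put(key, updated)));
        return next;
    }
}
```

Starting from an empty state, a batch with two values of 1 for key "a" and one for key "b" gives a=2, b=1; a second batch with a single 1 for "a" gives a=3, b=1.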
  • 28. Checkpoints
      • Checkpoints - periodically save the data needed to recover from failures to reliable storage (HDFS/S3/…)
      • Metadata checkpoints
          • Configuration of the streaming context
          • DStream definition and operations
          • Incomplete batches
      • Data checkpoints
          • Saving stateful RDD data
  • 29. Checkpoints
      • To configure checkpoint usage:
          streamingContext.checkpoint(directory)
      • To create a recoverable streaming application:
          StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
  • 30. Working with foreachRDD
      • A common practice is to use foreachRDD(func) to push data to an external system.
      • Don't do:
          dstream.foreachRDD { rdd =>
            val myExternalResource = ... // Created on the driver, so it would have to be serialized and shipped to the executors
            rdd.foreachPartition { partition =>
              myExternalResource.save(partition)
            }
          }
  • 31. Working with foreachRDD
      • Instead do:
          dstream.foreachRDD { rdd =>
            rdd.foreachPartition { partition =>
              val myExternalResource = ... // Created on the executor
              myExternalResource.save(partition)
            }
          }
  • 32. To summarize
      • Spark streaming provides a high-level micro-batch API
      • It is distributed because it is built on RDDs
      • It is fault tolerant thanks to checkpoints
      • You can have state that is updated over time
      • Use foreachRDD carefully