Stream processing from
single node to a cluster
What are we going to talk about?
• What is stream processing?
• What are the challenges?
• Reactive streams
• Implementing reactive streams with Akka streams
• Spark streaming
• Questions?
What is a stream?
• A sequence of data elements that becomes available over time
• Can be finite (not interesting)
• List of items
• or infinite
• A live video stream
• Web analytics stream
• IoT event stream
• Processed one by one
• So what is the best way to process a stream?
Synchronous processing
• Items in the stream are processed one by one
• Each processing action blocks until it finishes
• Plus: easy to implement
• Minus: can’t handle load
Asynchronous processing
• Items in the stream are stored in a buffer
• The consumer fetches items from the buffer at its own pace
• Plus: no longer blocking
• Minus: what happens if the buffer fills up?
Solving the fast publisher problem
1. Increase the buffer size
• A temporary solution
• Good for absorbing peaks
• May cause an OOM error
2. Drop messages and signal the publisher to resend
• Messages are “wasted”
• TCP works this way
Reactive streams
• Ask the publisher for a specific number of messages
• No out-of-memory errors
• No messages wasted
• Part of the Java 9 JDK (java.util.concurrent.Flow):
• Processor
• Publisher
• Subscriber
• Subscription
Reactive streams
@FunctionalInterface
public static interface Flow.Publisher<T> {
    public void subscribe(Flow.Subscriber<? super T> subscriber);
}

public static interface Flow.Subscriber<T> {
    public void onSubscribe(Flow.Subscription subscription);
    public void onNext(T item);
    public void onError(Throwable throwable);
    public void onComplete();
}

public static interface Flow.Subscription {
    public void request(long n);
    public void cancel();
}

public static interface Flow.Processor<T,R> extends Flow.Subscriber<T>, Flow.Publisher<R> {}
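Backpressure comes from the Subscription: a subscriber only receives what it has requested. A minimal sketch in Scala against the JDK Flow API (the PrintSubscriber name is illustrative; SubmissionPublisher is the JDK's reference Publisher implementation):

import java.util.concurrent.{Flow, SubmissionPublisher}

// A subscriber that requests one element at a time, so the publisher
// can never flood it – backpressure in its simplest form
class PrintSubscriber extends Flow.Subscriber[Int] {
  private var subscription: Flow.Subscription = _

  override def onSubscribe(s: Flow.Subscription): Unit = {
    subscription = s
    subscription.request(1)            // signal initial demand
  }
  override def onNext(item: Int): Unit = {
    println(s"got $item")
    subscription.request(1)            // ask for the next element only when ready
  }
  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("done")
}

val publisher = new SubmissionPublisher[Int]()   // delivers asynchronously on the common pool
publisher.subscribe(new PrintSubscriber)
(1 to 5).foreach(publisher.submit)               // submit respects the subscriber's demand
publisher.close()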
Akka streams
• A high-level streaming API that implements Reactive Streams
• Based on the Akka actor toolkit
Diagram: Actor A sends a “Hello msg” to Actor B
Talk streams to me
• Graph – a description of how the stream is processed, composed of
processing stages
• Processing stage – the basic unit of the graph; it may transform,
receive or emit elements – and must not block
• Source – a processing stage with a single output – emits
elements when the downstream stages are ready
• Sink – a processing stage with a single input – requests and
accepts data
• Flow – a processing stage with a single input and output (see the
sketch below)
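A minimal sketch of these building blocks in Scala, assuming Akka Streams 2.6+ (where the ActorSystem provides the materializer); the stage names are illustrative:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system: ActorSystem = ActorSystem("streams-demo")

val numbers = Source(1 to 10)              // Source: emits elements when demand arrives
val doubler = Flow[Int].map(_ * 2)         // Flow: one input, one output
val printer = Sink.foreach[Int](println)   // Sink: consumes (here, prints) elements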
Demo
Runnable Graph
• Runnable Graph = Source + Flow + Sink
• Executed by calling run()
• Until run() is called, the graph does not run
• Materialization is when the materializer takes the stream “recipe”
and actually executes it (see the sketch below)
• How? Remember the Akka actors?
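Continuing the sketch above: connecting the stages yields a blueprint, and nothing happens until it is materialized by run():

import akka.Done
import akka.stream.scaladsl.{Keep, RunnableGraph}
import scala.concurrent.Future

// Source + Flow + Sink produce a RunnableGraph – still only a blueprint
val graph: RunnableGraph[Future[Done]] =
  numbers.via(doubler).toMat(printer)(Keep.right)

// run() hands the blueprint to the materializer, which starts the underlying
// actors and returns the materialized value – here a Future that completes
// when the sink has consumed the whole stream
val done: Future[Done] = graph.run()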
Complex stream graphs
• We want the lines of the file to reach two different flows
• This is called a “Broadcast” in Akka Streams
• The “~>” operator is used as a connector in the GraphDSL
• Once the graph is fully connected we can return ClosedShape (see the
sketch after the diagram)
Diagram: the file’s lines are broadcast to two branches – a word counter feeding the top words, and a longest-line stage – with cleaner and print stages along the way
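A sketch of such a graph in the GraphDSL, reusing the ActorSystem from the earlier sketch; the file name, framing settings and branch logic are illustrative, not the exact demo code:

import akka.NotUsed
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, FileIO, Flow, Framing, GraphDSL, RunnableGraph, Sink}
import akka.util.ByteString
import java.nio.file.Paths

val wordCountGraph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  // File → lines
  val lines = FileIO.fromPath(Paths.get("input.txt"))
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024, allowTruncation = true))
    .map(_.utf8String)

  val broadcast   = builder.add(Broadcast[String](2))   // one input, two outputs
  val wordCounter = Flow[String].map(line => line.split("\\s+").length)
  val longestLine = Flow[String].fold("")((longest, line) => if (line.length > longest.length) line else longest)

  lines ~> broadcast                                     // every line goes to both branches
  broadcast ~> wordCounter ~> Sink.foreach[Int](n => println(s"words: $n"))
  broadcast ~> longestLine ~> Sink.foreach[String](l => println(s"longest: $l"))

  ClosedShape                                            // all ports are connected
})

wordCountGraph.run()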
Demo
Batching
• There are cases when we want to collect several items and only
then apply our business logic
• Aggregative logic
• Batch writes to a DB
• We can use batch(max, seedFunction)(aggFunction) – in case of
backpressure it aggregates elements, up to max of them (see the sketch below)
• max – the maximal number of elements in a batch
• seed – a function that creates a batch from a single element
• aggFunction – combines the existing batch with the next element
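A minimal sketch of the batch stage, assuming integer elements and a Vector as the batch type:

import akka.NotUsed
import akka.stream.scaladsl.Flow

// While the downstream is slow (backpressure), up to 100 incoming elements
// are aggregated into a single Vector and emitted as one item
val batched: Flow[Int, Vector[Int], NotUsed] =
  Flow[Int].batch(max = 100, seed = first => Vector(first))((batch, elem) => batch :+ elem)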
To summarize
• Backpressure enables us to handle streams in an efficient manner
• Akka Streams implements the Reactive Streams API using Source,
Flow, Graph, Sink
• A Graph is a blueprint (“recipe”) of processing stages
• We can build complex flows using the Graph DSL
• We also can batch
Stream processing requirements
• What if I need to have the same logic for stream processing and
batch processing?
• I want to run a cluster of stream processors
• I want it to recover from failures automatically
• Handle multiple stream sources out of the box
• High level API
Spark streaming
• A Spark module for building scalable, fault-tolerant stream
processing applications
(Diagram taken from the official Spark documentation)
Remember Spark?
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a distributed collection of items whose sources may be, for
example: Hadoop (HDFS), Kafka, Kinesis …
D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
Diagram: a text file’s lines split into partitions (lines 1–100, 101–200, 201–300, …)
RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the
RDD. This is the main method that evaluates the RDD.
•count() – returns the number of the elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the
RDD.
RDD Transformations
•Return a pointer to a new RDD carrying transformation metadata
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
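A small sketch combining the transformations and actions above, assuming an existing SparkContext named sc:

// Transformations are lazy – they only record metadata about the new RDD
val lines     = sc.parallelize(Seq("foo bar", "hello world", "spark streams"))
val words     = lines.flatMap(_.split(" "))    // 0..n output items per input line
val longWords = words.filter(_.length > 4)     // keep only the longer words
val upper     = longWords.map(_.toUpperCase)   // one output item per input item

// Actions evaluate the RDD
println(upper.count())                         // number of elements
println(upper.first())                         // first element
upper.collect().foreach(println)               // all elements, brought back to the driver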
Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time – usually seconds
• DStream – composed of time-sliced RDDs, each containing a collection
of items
(Diagram taken from the official Spark documentation)
Example
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// One micro-batch per second; conf, checkpoint and folder come from the surrounding code
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(checkpoint.toString())
// Watch a folder for new text files and parse each line as an Int
val dstream: DStream[Int] =
  ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()           // print the first elements of every batch
ssc.start()
ssc.awaitTermination()    // block until the streaming job is stopped
DStream operations
• Similar to RDD operations, with small changes
•map(func) - returns a new DStream by applying func to every
element of the original stream.
•filter(func) - returns a new DStream formed by selecting those
elements of the source stream on which func returns true.
•reduce(func) – returns a new DStream of single-element
RDDs by applying the reduce func on every source RDD
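A brief sketch, assuming lines is an existing DStream[String]:

val numbers  = lines.map(_.trim.toInt)    // transform every element of every batch
val positive = numbers.filter(_ > 0)      // keep only matching elements
val sums     = positive.reduce(_ + _)     // one single-element RDD per batch
sums.print()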
Using your existing batch logic
• transform(func) - an operation that creates a new DStream by
applying func to each of the DStream’s RDDs (see the sketch below)
dstream.transform(existingBusinessFunction)
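A sketch of reusing batch logic, where enrich stands in for any existing RDD-to-RDD business function:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// The same function can serve a plain batch job and the streaming job
def enrich(rdd: RDD[Int]): RDD[Int] = rdd.filter(_ > 0).map(_ * 10)

val enriched: DStream[Int] = dstream.transform(enrich _)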
Updating the state
• All the operations so far were stateless
• How do I combine accumulated results with the current batch?
• updateStateByKey(updateFunc) – a transformation that
creates a new key-value DStream where each value is
updated according to the previous state and the new values
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
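For example, a running count per word, assuming words is an existing DStream[String] (a sketch, not the demo code):

val pairs         = words.map(word => (word, 1))
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)   // requires checkpointing (next slide)
runningCounts.print()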
Checkpoints
• Checkpoints – periodically save the data needed to recover from
failures to reliable storage (HDFS/S3/…)
• Metadata checkpoints
• Configuration of the stream context
• DStream definition and operations
• Incomplete batches
• Data checkpoints
• saving stateful RDD data
Checkpoints
• To enable checkpointing:
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application (see the sketch below):
• StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext)
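A sketch of the recovery pattern; checkpointDir and the body of createContext are illustrative, and conf is assumed to exist as in the earlier example:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"    // any reliable storage path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // define the DStreams and their operations here
  ssc
}

// Fresh start: createContext() is called; after a failure the context is
// rebuilt from the checkpoint data instead
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()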
Working with foreachRDD
• A common practice is to use foreachRDD(func) to push
data to an external system.
• Don’t do:
dstream.foreachRDD { rdd =>
  val myExternalResource = ... // created on the driver; must be serialized
                               // and shipped to every executor
  rdd.foreachPartition { partition =>
    myExternalResource.save(partition)
  }
}
Working with foreachRDD
• Instead do:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val myExternalResource = ... // created on the executor, once per partition
    myExternalResource.save(partition)
  }
}
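A further refinement suggested by the Spark documentation is to reuse connections across partitions and batches through a lazily created, per-executor pool; ConnectionPool here is a hypothetical helper, not a Spark API:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val connection = ConnectionPool.getConnection()   // hypothetical pool, created lazily on each executor
    partition.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)       // return to the pool for future reuse
  }
}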
To summarize
• Spark Streaming provides a high-level micro-batch API
• It is distributed because it is built on RDDs
• It is fault tolerant thanks to checkpoints
• You can keep state that is updated over time
• Use foreachRDD carefully
Questions?
