Building data pipelines shouldn't be so hard; you just need to choose the right tools for the task.
We will review Akka Streams and Spark Streaming: how they work, how to use them, and when.
2. What are we going to talk about?
• What is stream processing?
• What are the challenges?
• Reactive streams
• Implementing reactive streams with Akka streams
• Spark streaming
• Questions?
3. What is a stream?
• A sequence of data elements that becomes available over time
• Can be finite (not interesting)
  • A list of items
• or infinite
  • A live video stream
  • Web analytics stream
  • IoT event stream
• Processed one by one
• So what is the best way to process a stream?
4. Synchronous processing
• Items in the stream are processed one by one
• Every processing action blocks and waits to finish
• Plus: easy to implement
• Minus: can’t handle load
5. Asynchronous processing
• Items in the stream are stored in a buffer
• The consumer fetches items from the buffer at its own pace
• Plus: no longer blocking
• Minus: what happens if the buffer fills up?
6. Solving the fast publisher problem
1. Increase the buffer size
• temporary solution
• good for short peaks
• May cause OOM error
2. Drop messages and signal the publisher to resend
• Messages are “wasted”
• TCP works this way
7. Reactive streams
• The subscriber asks the publisher for a specific number of messages
• No out-of-memory errors
• No messages wasted
• Part of the JDK since Java 9:
• Processor
• Publisher
• Subscriber
• Subscription
8. Reactive streams
@FunctionalInterface
public static interface Flow.Publisher<T> {
    public void subscribe(Flow.Subscriber<? super T> subscriber);
}

public static interface Flow.Subscriber<T> {
    public void onSubscribe(Flow.Subscription subscription);
    public void onNext(T item);
    public void onError(Throwable throwable);
    public void onComplete();
}

public static interface Flow.Subscription {
    public void request(long n);
    public void cancel();
}

public static interface Flow.Processor<T,R> extends Flow.Subscriber<T>, Flow.Publisher<R> {}
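A minimal sketch, in Scala, of how a subscriber drives backpressure with these interfaces by requesting one element at a time; the class name and element type are illustrative, and SubmissionPublisher is the Publisher implementation shipped with the JDK:

import java.util.concurrent.{Flow, SubmissionPublisher}

class PrintSubscriber extends Flow.Subscriber[String] {
  private var subscription: Flow.Subscription = _

  override def onSubscribe(s: Flow.Subscription): Unit = {
    subscription = s
    subscription.request(1)        // ask the publisher for exactly one element
  }
  override def onNext(item: String): Unit = {
    println(item)
    subscription.request(1)        // pull the next element only when ready for it
  }
  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("done")
}

val publisher = new SubmissionPublisher[String]()
publisher.subscribe(new PrintSubscriber)
publisher.submit("hello")          // delivered only as fast as the subscriber requests
publisher.close()

Because the subscriber controls demand via request(n), the publisher never floods it: no unbounded buffers and no dropped messages.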
9. Akka streams
• A high-level streaming API that implements Reactive Streams
• Based on the Akka actor toolkit
[Diagram: Actor A sends a "Hello" message to Actor B]
10. Talk streams to me
• Graph – a description of how the stream is processed, composed of
processing stages
• Processing stage – the basic unit of the graph; it may transform,
receive or emit elements, and must not block
• Source – a processing stage that has single output – emits
elements when the downstream stages are ready
• Sink – a processing stage with a single input – requests and
accepts data
• Flow - a processing stage with a single input and output
12. Runnable Graph
• Runnable Graph = Source + Flow + Sink
• Executed by calling run()
• Until run() is called, the graph does not run
• Materialization is when the materializer takes the stream “recipe”
and actually executes it, as sketched below.
• How? Remember the Akka actors?
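A minimal sketch of a runnable graph, assuming the Akka Streams 2.5-style Scala API; the stage names are illustrative:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system: ActorSystem = ActorSystem("streams-demo")
implicit val materializer: ActorMaterializer = ActorMaterializer()  // turns the recipe into running actors

val source = Source(1 to 100)                  // emits elements when downstream is ready
val double = Flow[Int].map(_ * 2)              // one input, one output
val sink   = Sink.foreach[Int](println)        // requests and accepts elements

source.via(double).to(sink).run()              // nothing happens until run() materializes the graph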
13. Complex stream graphs
• We want the lines of the file to reach two different flows
• This is called a Broadcast in Akka Streams
• The “~>” operator is used as a connector in the GraphDSL
• Once the graph is fully connected we can return ClosedShape (see the
sketch after the diagram below)
[Diagram: file source → lines mapper → broadcast → two branches (word
counter, cleaner) ending in “print top words” and “longest line”]
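A minimal sketch of such a broadcast graph with the GraphDSL; the stages mirror the diagram above, but their implementations here are illustrative rather than the original slides' code:

import akka.NotUsed
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}

val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  val lines     = Source(List("foo bar", "hello world"))   // stand-in for the file source
  val broadcast = builder.add(Broadcast[String](2))        // one input, two outputs

  val wordCounter = Flow[String].map(_.split("\\s+").length)
  val cleaner     = Flow[String].map(_.trim)

  val printTopWords = Sink.foreach[Int](n => println(s"words: $n"))
  val longestLine   = Sink.fold[String, String]("")((acc, l) => if (l.length > acc.length) l else acc)

  lines ~> broadcast ~> wordCounter ~> printTopWords
           broadcast ~> cleaner     ~> longestLine

  ClosedShape
})

graph.run()   // requires an implicit materializer / ActorSystem in scope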
15. Batching
• There are some cases where we want to collect several items and
only then apply our business logic
• Aggregative logic
• Batch writes to a DB
• We can use batch(max, seed)(aggFunction) – in case of
backpressure it aggregates elements, up to max elements (see the sketch below)
• max – the maximal number of elements in a batch
• seed – a function that creates the initial batch from a single element
• aggFunction – combines the existing batch with the next element
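A minimal sketch of batch(); the “database write” is just a println stand-in, and an implicit materializer/ActorSystem is assumed to be in scope:

import akka.stream.scaladsl.{Sink, Source}

Source(1 to 10000)
  .batch(max = 100, seed = first => Vector(first))((batch, elem) => batch :+ elem) // aggregates only while downstream is slow
  .runWith(Sink.foreach(batch => println(s"writing ${batch.size} rows")))          // stand-in for a batched DB write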
16. To summarize
• Backpressure enables us to handle streams in an efficient manner
• Akka Streams implements the Reactive Streams API using Source,
Flow, Graph and Sink
• A Graph is a blueprint (“recipe”) of processing stages
• We can build complex flows using the GraphDSL
• We can also batch elements
17. Stream processing requirements
• What if I need to have the same logic for stream processing and
batch processing?
• I want to run a cluster of stream processors
• I want it to recover from failures automatically
• Handle multiple stream sources out of the box
• High level API
18. Spark streaming
• A Spark module for building scalable, fault tolerant stream
processing
[Diagram taken from the official Spark documentation]
19. Remember Spark?
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a distributed collection of items whose sources may be, for
example: Hadoop (HDFS), Kafka, Kinesis, …
20. D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
[Diagram: the lines of a file split across partitions, e.g. lines 1–100,
101–200, 201–300]
21. RDD Actions
•Return values by evaluating the RDD (not lazy); see the sketch below:
•collect() – returns a list containing all the elements of the
RDD. This is the main method that evaluates the RDD.
•count() – returns the number of the elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the
RDD.
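A minimal sketch of these actions; sc is an assumed SparkContext:

val rdd = sc.parallelize(Seq(1, 2, 3, 4))

rdd.collect()          // Array(1, 2, 3, 4) – evaluates the whole RDD
rdd.count()            // 4
rdd.first()            // 1
rdd.foreach(println)   // runs the function on each element (on the executors)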
22. RDD Transformations
•Returns a pointer to a new RDD with the transformation metadata (lazy); see the sketch after this list
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
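A minimal sketch of these transformations; sc is an assumed SparkContext, and nothing is computed until an action such as collect() runs:

val lines = sc.parallelize(Seq("foo bar", "hello"))

val upper = lines.map(_.toUpperCase)          // map: one element in, one element out
val short = lines.filter(_.length < 6)        // filter: keep elements where the predicate is true
val words = lines.flatMap(_.split("\\s+"))    // flatMap: one line may produce many words

words.collect()                               // Array(foo, bar, hello)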
23. Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time – usually seconds
• DStream – composed of RDD slices, each containing a collection of
items
[Diagram taken from the official Spark documentation]
24. Example
// conf, checkpoint and folder are assumed to be defined earlier
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(conf, Seconds(1))    // 1-second micro-batches
ssc.checkpoint(checkpoint.toString)                 // where to store recovery data
val dstream: DStream[Int] =
  ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()                                     // print the first elements of each batch
ssc.start()                                         // start receiving and processing
ssc.awaitTermination()
25. DStream operations
• Similar to RDD operations, with small changes
•map(func) - returns a new DStream by applying func to every
element of the original stream.
•filter(func) - returns a new DStream formed by selecting those
elements of the source stream on which func returns true.
•reduce(func) – returns a new DStream of single-element
RDDs by applying the reduce func to every source RDD
26. Using your existing batch logic
• transform(func) - an operation that creates a new DStream by
applying func to the DStream's RDDs, as sketched below.
dstream.transform(existingBusinessFunction)
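A minimal sketch of reusing batch logic, assuming that logic is an RDD-level function; the names and types are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// an RDD => RDD function already used by an existing batch job
def existingBusinessFunction(rdd: RDD[String]): RDD[String] =
  rdd.filter(_.nonEmpty).map(_.toLowerCase)

def reuse(lines: DStream[String]): DStream[String] =
  lines.transform(existingBusinessFunction _)   // same logic, now applied to every micro-batch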
27. Updating the state
• All the operations so far didn't have state
• How do I accumulate results with the current batch?
• updateStateByKey(updateFunc) – a transformation that
creates a new key-value DStream where each key's value is
updated from the previous state and the new values (a usage sketch follows).
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
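A minimal sketch of wiring the update function; pairs is an assumed DStream of (word, count) pairs, and checkpointing must be enabled for stateful operations:

// pairs: DStream[(String, Int)], e.g. words.map(w => (w, 1))
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)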
28. Checkpoints
• Checkpoints – periodically saves to reliable storage
(HDFS/S3/…) necessary data to recover from failures
• Metadata checkpoints
• Configuration of the stream context
• DStream definition and operations
• Incomplete batches
• Data checkpoints
• saving stateful RDD data
29. Checkpoints
• To configure checkpoint usage :
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application:
• StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext)
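A minimal sketch of a recoverable application; the directory and creation-function names are illustrative, and conf is assumed to be defined elsewhere:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDirectory = "hdfs:///checkpoints/my-app"       // assumed reliable storage

def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  // ... define the DStreams and their operations here ...
  ssc.checkpoint(checkpointDirectory)
  ssc
}

// first run: creates the context; after a failure: rebuilds it from the checkpoint data
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)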
30. Working with foreachRDD
• A common practice is to use foreachRDD(func) to push
data to an external system.
• Don’t do this (the resource is created on the driver and must be
serialized to every executor):
dstream.foreachRDD { rdd =>
val myExternalResource = ... // Created on the driver
rdd.foreachPartition { partition =>
myExternalResource.save(partition)
}
}
31. Working with foreachRDD
• Instead do this (the resource is created locally on each executor):
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val myExternalResource = ... // Created on the executor
myExternalResource.save(partition)
}
}
32. To summarize
• Spark Streaming provides a high-level micro-batch API
• It is distributed because it builds on RDDs
• It is fault tolerant thanks to checkpoints
• You can keep state that is updated over time
• Use foreachRDD carefully