Spark Streaming
into context
David Martinez Rego
20th of October 2016
About me
• PhD in ML 2013: predictive maintenance
of windmills
• Lived in London since then
• Postdoc @ UCL
• Teaching and Mentoring @ UCL
internships inside financial institutions
• Consulting on data analytics
• Early Startup
Plethora of options?
Wishlist
• Easy to compose complex pipelines
• Easy scaling out
• Interoperable with a large ecosystem
• Low latency and high throughput
• Monitoring
Plethora of options?
Flume
• Its mechanism of scaling to different machines is
managed in an ad hoc way
• Nice to solve simple custom data gathering from
the exterior and throw it in the perimeter for further
processing.
Plethora of options?
Lessons learnt
• Each project has added some good ideas when
they were most needed
• Eventually, all platforms have absorbed the best
ideas from peers
• It seems that we have a winner, for now?
Time view
Pipelining
Composition
one at a time
spouts and bolts
RDD
one at a time
spouts and bolts
Storm basic model
[Diagram: a topology of two spouts feeding four bolts, with the connections labelled s.g.]
Guarantees and fault tolerance
[Diagram: anchoring and ACK between a spout and a bolt inside a topology]
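The anchoring and ACK idea on this slide can be illustrated with the XOR bookkeeping that Storm's acker is known for: every tuple id is XOR-ed into a running value once when it is anchored and once when it is acked, so the value returns to zero exactly when the whole tuple tree has been processed. This is only a minimal sketch in plain Python; the `Acker` class, its method names, and the ids are hypothetical and are not Storm's API.

```python
from typing import Dict


class Acker:
    """Tracks, per spout tuple, the XOR of all tuple ids still in flight."""

    def __init__(self) -> None:
        self.pending: Dict[str, int] = {}

    def anchor(self, root: str, tuple_id: int) -> None:
        # A new tuple joins the tree rooted at `root`.
        self.pending[root] = self.pending.get(root, 0) ^ tuple_id

    def ack(self, root: str, tuple_id: int) -> bool:
        # XOR-ing the same id a second time cancels it out.
        self.pending[root] ^= tuple_id
        done = self.pending[root] == 0
        if done:
            del self.pending[root]  # whole tree processed; the spout can be notified
        return done


acker = Acker()
acker.anchor("msg-1", 0x1A)        # spout emits the root tuple
acker.anchor("msg-1", 0x2B)        # a bolt emits a child anchored to it
first = acker.ack("msg-1", 0x1A)   # the bolt acks the root; the child is still pending
second = acker.ack("msg-1", 0x2B)  # a downstream bolt acks the child
```

In this sketch a child must be anchored before its parent is acked, otherwise the running value could reach zero too early.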
Lambda architecture
Time view
Pipelining
Composition
one at a time
spouts and bolts
RDD
one at a time
system, stream, stream task
Samza
Kappa architecture
Time view
Pipelining
Composition
one at a time
source, spouts, bolts and ack
RDD
one at a time
system, stream, stream task
RDD
Microbatch
Init + connect to source
pipeline
computation + state mgmt.
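The three stages listed on this slide can be sketched in plain Python, without any Spark dependency. The function names `micro_batches` and `run` and the word-count workload are hypothetical, chosen only for illustration. Batches are cut by element count here to keep the example deterministic, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(source: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) source into small fixed-size batches."""
    it = iter(source)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run(source: Iterable[str], batch_size: int = 3) -> Counter:
    state: Counter = Counter()  # state that survives across batches
    # Init + connect to source: here the "source" is any iterable of lines.
    for batch in micro_batches(source, batch_size):
        # Pipeline: an ordinary batch transformation applied to each micro-batch.
        words = [w.lower() for line in batch for w in line.split()]
        # Computation + state management: fold the batch result into the state.
        state.update(words)
    return state


counts = run(["a b", "b c", "c c", "a"], batch_size=2)
```

Nothing is emitted until a batch is complete, which is why the batch size (or interval) sets a floor on latency.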
Much better, but still…
• Introduces problems
1. Still no full equivalence between batch and streaming
2. Out-of-order management and early reporting have to be coded by hand
3. Custom windowing code needs to be mixed with business logic
4. Micro-batches impose a lower limit on latency
Spark: batch and streaming
Lambda architecture?
Out of order
Latency is unpredictable
Our aim
Final Spark (1)
Final Spark (2)
Batch vs. Streaming
[Diagram: data consumed as a stream versus data consumed as a batch]
A batch pipeline IS a
streaming pipeline applied to
a finite stream!
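The claim above can be illustrated with a small Python sketch: one pipeline function, written once against an iterator, runs unchanged on a finite collection (batch) and on an unbounded generator (streaming). The `pipeline` function and its even-number running sum are hypothetical examples, not taken from any framework.

```python
import itertools
from typing import Iterable, Iterator


def pipeline(events: Iterable[int]) -> Iterator[int]:
    """Keep even values and emit a running sum, one result per kept event."""
    total = 0
    for x in events:
        if x % 2 == 0:
            total += x
            yield total


# Batch: the input is a finite collection, so the pipeline terminates by itself.
batch_result = list(pipeline([1, 2, 3, 4, 6]))

# Streaming: the input is unbounded, so we can only ever observe a prefix.
stream_prefix = list(itertools.islice(pipeline(itertools.count(1)), 3))
```

Both runs produce the same first three results; the only difference is whether the input ever ends.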
Event time + Processing time
Processing time
Event time
Business logic
+
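The difference between the two notions of time shows up as soon as an event arrives late. The sketch below counts the same events in 10-second tumbling windows, once keyed by event time and once by processing time. The timestamps and the window length are made-up values for illustration.

```python
from collections import Counter
from typing import List, Tuple

# (event_time, processing_time) in seconds; the second event arrives late.
events: List[Tuple[int, int]] = [(1, 2), (4, 12), (11, 13), (12, 14)]

WINDOW = 10  # tumbling window length in seconds (assumed)


def window_start(t: int) -> int:
    """Map a timestamp to the start of the tumbling window that contains it."""
    return (t // WINDOW) * WINDOW


# Windowing on when the event happened.
by_event_time = Counter(window_start(e) for e, _ in events)

# Windowing on when the system saw the event.
by_processing_time = Counter(window_start(p) for _, p in events)
```

Keyed by event time, the late event is counted in the window where it happened; keyed by processing time, it is counted in the window where it happened to arrive, so the two sets of counts disagree.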
Plethora of options?
Beam/Dataflow
Apache Beam
Streaming API
Execution engine
http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
Apache Beam
Kostas Tzoumas, Data artisans
Tyler Akidau, Beam PMC
Other considerations
Maturity ? -
Ecosystem - -
Community -
Ops - -
Other considerations
• Flow of the experiment:
• Read an event from Kafka.
• Deserialize the JSON string.
• Filter out irrelevant events.
• Take a projection of the relevant fields.
• Join each event with its associated campaign (from Redis).
• Take a windowed count of events per campaign and store each window in Redis along with a last-updated timestamp (with late events).
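The steps of the experiment can be sketched end to end in plain Python. Everything concrete here is an assumption made for illustration: a list of JSON strings stands in for Kafka, plain dictionaries stand in for Redis, and the field names (`ad_id`, `event_type`, `event_time`), the "view" filter, and the 10-second window are hypothetical choices.

```python
import json
import time
from typing import Callable, Dict, Iterable, Tuple

WINDOW_MS = 10_000  # assumed window length in milliseconds

Store = Dict[Tuple[str, int], Dict[str, float]]


def process(raw_events: Iterable[str],
            ad_to_campaign: Dict[str, str],
            store: Store,
            now: Callable[[], float] = time.time) -> None:
    for raw in raw_events:                      # read an event (stand-in for Kafka)
        event = json.loads(raw)                 # deserialize the JSON string
        if event["event_type"] != "view":       # filter out irrelevant events
            continue
        ad_id = event["ad_id"]                  # projection of the relevant fields
        ts = int(event["event_time"])
        campaign = ad_to_campaign.get(ad_id)    # join (stand-in for a Redis lookup)
        if campaign is None:
            continue
        window = ts - ts % WINDOW_MS            # event-time window start
        entry = store.setdefault((campaign, window),
                                 {"count": 0, "last_updated": 0.0})
        entry["count"] += 1                     # windowed count per campaign
        entry["last_updated"] = now()           # last-updated timestamp


raw = [
    json.dumps({"ad_id": "a1", "event_type": "view", "event_time": 1000}),
    json.dumps({"ad_id": "a1", "event_type": "click", "event_time": 2000}),
    json.dumps({"ad_id": "a2", "event_type": "view", "event_time": 12000}),
    json.dumps({"ad_id": "a1", "event_type": "view", "event_time": 3000}),  # late
]
store: Store = {}
process(raw, {"a1": "c1", "a2": "c1"}, store)
```

Because windows are keyed by event time, the late event in the last line still updates the first window instead of being counted in the wrong one.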
Resources
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
• https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
• http://data-artisans.com/why-apache-beam/#more-710
• http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Spark Streaming
into context
Thanks for listening!
