A fascinating presentation on Storm and Spark Streaming in a high-velocity analytical environment by Guangyu Wu, Post Doctoral Researcher with CeADAR. This presentation was made to the Hadoop User Group (HUG) Ireland on Monday 14th December, 2015.
5. Spark
‣ Spark is a platform for distributed batch data processing.
‣ Spark includes a number of extensions: Spark Streaming, Spark
SQL, MLlib, GraphX.
‣ Spark runs batch jobs predominantly in memory.
‣ Spark Streaming manages to integrate stream processing
with batch processing by treating a data stream as
sequences of small batches of data points, or micro-batches.
‣ Spark Streaming maintains computation states.
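The micro-batch idea can be sketched in plain Python. This is only an illustration of the concept, not Spark Streaming's API: the function names, the one-second interval and the sample events are assumptions made for the example.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds.

    Intervals that contain no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    return [batches[key] for key in sorted(batches)]


def run(events, interval):
    """Apply a batch-style update to each micro-batch while keeping running state."""
    state = Counter()              # computation state carried across batches
    snapshots = []
    for batch in micro_batches(events, interval):
        state.update(batch)        # ordinary batch logic applied to a small batch
        snapshots.append(dict(state))
    return snapshots


# Hypothetical sample stream: (arrival time in seconds, value)
events = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (2.7, "a"), (2.9, "c")]
snapshots = run(events, interval=1.0)
```

Each entry in `snapshots` is the state after one micro-batch, which is why results only become visible at batch boundaries rather than per data point.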
6. Storm
‣ A Storm topology is composed of spouts (stream sources) and bolts (processing steps).
‣ Storm operates over individual data points.
‣ Storm is designed purely for stream processing.
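A rough sketch of the spout-and-bolt model, processing one data point at a time. The class and function names below are hypothetical and do not mirror Storm's actual API; a real topology would also distribute these components across worker processes.

```python
class SentenceSpout:
    """Hypothetical spout: the source that emits one tuple at a time."""

    def __init__(self, sentences):
        self._sentences = list(sentences)

    def emit(self):
        for sentence in self._sentences:
            yield sentence


class SplitBolt:
    """Hypothetical bolt: splits each incoming sentence into words."""

    def process(self, sentence):
        return sentence.split()


class CountBolt:
    """Hypothetical bolt: keeps a running count per word."""

    def __init__(self):
        self.counts = {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        return (word, self.counts[word])


def run_topology(spout, split_bolt, count_bolt):
    """Push every tuple through the bolts as soon as it is emitted."""
    emitted = []
    for sentence in spout.emit():
        for word in split_bolt.process(sentence):
            emitted.append(count_bolt.process(word))
    return emitted


out = run_topology(SentenceSpout(["win a prize", "win now"]), SplitBolt(), CountBolt())
```

Because each tuple flows through the whole chain immediately, an updated count is available after every single word instead of at the end of a batch.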
7. Trident
‣ Trident is a high level programming abstraction built on top of
Storm.
‣ It provides a number of useful functions such as aggregations
and filters.
‣ An application can be designed and implemented using these
high level abstractions and Trident converts the logic into a
standard Storm topology under the hood.
‣ Trident works over micro-batches of data.
‣ Trident also has built-in support for maintaining processing state
and state query.
8. Methodology
‣ Large static batches of messages: Hadoop and off-line batch processing in Spark
‣ Single messages: Storm
‣ Micro-batches of messages: Spark Streaming, Trident
Discretised streams
9. Continuous Clustering
‣ Use case: real-time SMS spam detection in mobile networks.
‣ Clustering SMS messages based on their content is a good
way to identify spam.
‣ Many similar spam messages are sent out over a short
period of time.
10. Continuous Clustering
‣ Problems with traditional clustering algorithms…
‣ work off-line over historical data
‣ require multiple passes over the data
‣ are not incrementally updatable
‣ are hard to scale to ‘big’ data
‣ CeADAR solution: we developed a novel single pass,
scalable data stream clustering algorithm implemented on
Storm.
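The slides do not describe the CeADAR algorithm itself, so the following is not that algorithm. It is a minimal sketch of the general single-pass idea, assuming a simple "leader" scheme with Jaccard similarity over word sets. The class name, the tokenisation and the 0.75 threshold are assumptions for illustration.

```python
def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class StreamClusterer:
    """Hypothetical single-pass clusterer: each message is seen exactly once."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.clusters = []  # each entry: [representative token set, message count]

    def add(self, message):
        """Assign the message to the most similar cluster, or start a new one."""
        toks = tokens(message)
        best_idx, best_sim = None, 0.0
        for idx, (representative, _count) in enumerate(self.clusters):
            sim = jaccard(toks, representative)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is not None and best_sim >= self.threshold:
            self.clusters[best_idx][1] += 1
            return best_idx
        self.clusters.append([toks, 1])
        return len(self.clusters) - 1


clusterer = StreamClusterer(threshold=0.75)
labels = [
    clusterer.add("you have won a free prize call now"),
    clusterer.add("you have won a free prize call today"),
    clusterer.add("are we still meeting for lunch"),
]
sizes = [count for _rep, count in clusterer.clusters]
```

The two near-identical messages share 7 of 9 distinct words, which clears the threshold, so they land in the same cluster. The linear scan over all clusters is the obvious scaling bottleneck in this sketch; a production system would need an index or partitioning scheme.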
12. Deployment
‣ Our compute cluster is composed of 4 machines.
‣ Each machine:
‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores
‣ 64 GB memory
‣ 1 TB disk
‣ Spark, Storm, Hadoop, Kafka, Redis
13. Continuous Clustering
‣ US tier 1 mobile operator
‣ ~500 messages/second average
‣ ~1,300 messages/second peak
‣ Near-exact matching: 35,913
‣ Matching threshold 75%: 8,160
14. Continuous Metrics
‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on
the task of computing a set of statistical metrics in real-time over a
continuous stream of data.
‣ Evaluation criteria:
‣ Throughput: the volume and velocity of data that can be processed
on different configurations and hardware.
‣ Latency: the time delay between a new data point being received and
the updated metrics being computed.
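The slides do not list which statistical metrics were computed. As one possible illustration, the sketch below assumes count, mean, variance, minimum and maximum, updated incrementally with Welford's method so that each new data point costs constant time. The class name and sample values are hypothetical.

```python
import math


class RunningStats:
    """Incrementally updated statistics over a stream (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0          # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x):
        """Fold one new data point into the metrics without storing the stream."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def variance(self):
        """Population variance of the points seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

In this framing, latency corresponds to the time between a value arriving and `update` returning, and throughput to how many `update` calls the platform can sustain per second.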
16. Continuous Metrics
‣ High level results overview
‣ Spark Streaming achieves the highest throughput, with
Storm at the other end with the lowest throughput.
‣ However, Storm achieves the best latency by a
considerable margin. Spark and Trident both exhibit
considerably higher latency which is due at least in part
to their micro-batch data processing approach.
‣ The evaluation produced many other insights, lessons
and recommendations relating to these real-time platforms.
17. Stream Converge
‣ Current project:
process and combine
heterogeneous data
streams from diverse
sources using Spark
Streaming.
18. Stream Converge
‣ Challenges:
‣ managing data streams of different frequencies.
‣ linking together events across different streams via
complex key relationships.
‣ handling out-of-order arrival of data.
‣ ……
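The slides do not say how Stream Converge addresses these challenges. As a generic illustration of one of them, out-of-order arrival, the sketch below assumes a watermark-style reorder buffer: events are held for a fixed delay and then released in timestamp order, and anything arriving later than that is set aside. The class name, the `max_delay` parameter and the sample events are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Hold events briefly so they can be released in timestamp order."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.late = []                      # events that arrived too late to reorder
        self._heap = []                     # min-heap of (timestamp, sequence, payload)
        self._seq = 0                       # tie-breaker so payloads are never compared
        self._watermark = float("-inf")     # events at or before this are final

    def push(self, ts, payload):
        """Add one event and return the events that are now safe to release."""
        if ts < self._watermark:
            self.late.append((ts, payload))
            return []
        heapq.heappush(self._heap, (ts, self._seq, payload))
        self._seq += 1
        self._watermark = max(self._watermark, ts - self.max_delay)
        released = []
        while self._heap and self._heap[0][0] <= self._watermark:
            event_ts, _seq, event_payload = heapq.heappop(self._heap)
            released.append((event_ts, event_payload))
        return released

    def flush(self):
        """Release everything still buffered, in timestamp order."""
        released = [(ts, payload) for ts, _seq, payload in sorted(self._heap)]
        self._heap.clear()
        return released


buffer = ReorderBuffer(max_delay=2)
arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "d"), (3, "x")]
ordered = []
for ts, payload in arrivals:
    ordered.extend(buffer.push(ts, payload))
ordered.extend(buffer.flush())
```

The trade-off is the usual one: a larger `max_delay` tolerates more disorder but adds latency, while a smaller one releases events sooner and pushes more of them into `late`.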