Real Time Analytics on High Velocity Streaming data by Guangyu Wu
Dec. 15, 2015
A fascinating presentation on Storm and Spark Streaming in a high velocity analytical environment by Guangyu Wu, Post Doctoral Researcher with CeADAR. This presentation was made to the Hadoop User Group (HUG) Ireland on Monday 14th December, 2015.
Spark
‣ Spark is a platform for distributed batch data processing.
‣ Spark includes a number of extensions: Spark Streaming, Spark
SQL, MLlib, GraphX.
‣ Spark runs batch jobs predominantly in memory.
‣ Spark Streaming manages to integrate stream processing
with batch processing by treating a data stream as
sequences of small batches of data points, or micro-batches.
‣ Spark Streaming maintains computation states.
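The micro-batch idea above can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark Streaming code: the names `micro_batches` and `running_counts`, the one-second interval and the sample events are all assumptions made for illustration. It cuts a time-ordered stream into fixed-interval batches, runs ordinary batch logic on each one, and carries a running state from batch to batch.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, word); hypothetical format


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive batches of `interval` seconds."""
    batch: List[str] = []
    batch_end = interval
    for timestamp, word in events:
        # Close every batch whose window ended before this event arrived.
        while timestamp >= batch_end:
            yield batch
            batch = []
            batch_end += interval
        batch.append(word)
    yield batch  # flush the final, possibly partial, batch


def running_counts(batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Apply a batch computation to each micro-batch and fold it into state."""
    state: Counter = Counter()
    for batch in batches:
        state.update(Counter(batch))  # ordinary batch logic on a small batch
        yield state.copy()            # snapshot of the state after this batch


events = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (2.7, "a"), (2.9, "c")]
snapshots = list(running_counts(micro_batches(events, interval=1.0)))
```

Each element of `snapshots` is the word count as it stood at the end of one batch interval, so results only become visible at batch boundaries.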
Storm
‣ A Storm topology is composed of spouts and bolts.
‣ Storm operates over individual data points.
‣ Storm is designed purely for stream processing.
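To make the spout and bolt roles concrete, here is a rough plain-Python analogy built from generators. It does not use the Storm API; `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names, and a real topology would run these components in parallel across a cluster. What the sketch does show is the per-tuple style: every data point flows through the whole chain as soon as it is emitted.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(lines: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the stream, emitting one tuple at a time."""
    for line in lines:
        yield line


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: consumes one sentence and emits one tuple per word."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: updates a running count and emits it for every incoming word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Wire the components together: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout(["free prize now", "claim prize"])))
emitted = list(topology)
```

Because the generators are lazy, each word is counted before the next sentence is even read, which mirrors processing over individual data points rather than batches.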
Trident
‣ Trident is a high level programming abstraction built on top of
Storm.
‣ It provides a number of useful functions such as aggregations
and filters.
‣ An application can be designed and implemented using these
high level abstractions and Trident converts the logic into a
standard Storm topology under the hood.
‣ Trident works over micro-batches of data.
‣ Trident also has built-in support for maintaining processing state
and state query.
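The sketch below imitates, in plain Python, the style of programming described above: the application is declared as a chain of high-level steps, the steps are executed over micro-batches, and the aggregate is kept as state that can be queried. The class `MicroBatchPipeline` and its methods are hypothetical and are not the Trident API; the record fields `sender` and `length` are likewise made up.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


class MicroBatchPipeline:
    """Hypothetical fluent builder: records steps, then runs them per batch."""

    def __init__(self) -> None:
        self._filters: List[Callable[[dict], bool]] = []
        self._key: Callable[[dict], str] = lambda record: "all"
        self.state: Dict[str, int] = defaultdict(int)  # survives across batches

    def filter(self, predicate: Callable[[dict], bool]) -> "MicroBatchPipeline":
        """Declare a filter step; nothing is executed yet."""
        self._filters.append(predicate)
        return self

    def group_by(self, key: Callable[[dict], str]) -> "MicroBatchPipeline":
        """Declare the grouping key used by the count aggregation."""
        self._key = key
        return self

    def process(self, batch: Iterable[dict]) -> None:
        """Run the recorded steps on one micro-batch and update the state."""
        for record in batch:
            if all(keep(record) for keep in self._filters):
                self.state[self._key(record)] += 1

    def query(self, key: str) -> int:
        """State query: read the current count for a key."""
        return self.state.get(key, 0)


pipeline = (
    MicroBatchPipeline()
    .filter(lambda r: r["length"] > 10)
    .group_by(lambda r: r["sender"])
)
pipeline.process([{"sender": "A", "length": 20}, {"sender": "B", "length": 5}])
pipeline.process([{"sender": "A", "length": 30}, {"sender": "B", "length": 12}])
```

After the two batches, `pipeline.query("A")` returns 2 and `pipeline.query("B")` returns 1, since the short message from B was filtered out.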
Methodology
‣ Large static batches of messages: Hadoop and off-line batch
processing in Spark
‣ Single messages: Storm
‣ Micro-batches of messages (discretised streams): Spark
Streaming, Trident
Continuous Clustering
‣ Use case: real-time SMS spam detection in mobile networks.
‣ Clustering SMS messages based on their content is a good
way to identify spam.
‣ Many similar spam messages are sent out over a short
period of time.
Continuous Clustering
‣ Problems with traditional clustering algorithms: they
‣ work off-line over historical data
‣ require multiple passes over the data
‣ are not incrementally updatable
‣ are hard to scale to ‘big’ data
‣ CeADAR solution: we developed a novel single-pass,
scalable data stream clustering algorithm implemented on
Storm.
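The slides do not describe the CeADAR algorithm itself, so the following is only a generic illustration of what single-pass, incrementally updatable clustering can look like. It is a simple leader-style scheme, not the authors' method: each message is compared once against existing cluster representatives using Jaccard similarity over word sets, and either joins the best match or starts a new cluster. The function names, the similarity measure and the 0.75 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass
class Cluster:
    representative: FrozenSet[str]  # word set of the first message in the cluster
    size: int = 1


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of distinct words that the two messages have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stream(messages: Iterable[str], threshold: float = 0.75) -> List[Cluster]:
    """Assign each message to a cluster in one pass, never revisiting old data."""
    clusters: List[Cluster] = []
    for text in messages:
        tokens = frozenset(text.lower().split())
        best, best_score = None, 0.0
        for cluster in clusters:  # linear scan: fine for a sketch, not for scale
            score = jaccard(tokens, cluster.representative)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best.size += 1  # incremental update of an existing cluster
        else:
            clusters.append(Cluster(tokens))
    return clusters


messages = [
    "you can win a free prize call now",
    "you can win a free prize call today",
    "meeting moved to 3pm see you there",
    "you can win a free prize call now",
]
clusters = cluster_stream(messages)
```

On this sample the three near-identical messages collapse into one cluster of size 3, while the unrelated message forms its own cluster. A scalable version would need an index in place of the linear scan over clusters.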
Deployment
‣ Our compute cluster is composed of 4 machines.
‣ Each machine:
‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores
‣ 64 GB memory
‣ 1 TB disk
‣ Spark, Storm, Hadoop, Kafka, Redis
Continuous Clustering
‣ US tier 1 mobile operator
‣ ~500 messages/second average
‣ ~1,300 messages/second peak
‣ Near-exact matching: 35,913
‣ Matching threshold 75%: 8,160
Continuous Metrics
‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on
the task of computing a set of statistical metrics in real-time over a
continuous stream of data.
‣ Evaluate and compare
‣ Throughput: the volume and velocity of data that can be processed
on different configurations and hardware.
‣ Latency: the time delay between a new data point being received and
the updated metrics being computed.
Storm vs Storm Trident vs Spark Streaming
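The slides do not list which statistical metrics were computed. As one plausible example, the sketch below maintains a running mean and variance with Welford's online algorithm, which updates in constant time per data point, and records latency and throughput in the sense defined above. The class `RunningStats`, the function `process` and the sample values are assumptions for illustration, and the timings it produces say nothing about the platforms that were evaluated.

```python
import time
from typing import Iterable, List, Tuple


class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two points have arrived."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def process(stream: Iterable[Tuple[float, float]]):
    """Update the metrics per data point and record latency and throughput."""
    stats = RunningStats()
    latencies: List[float] = []
    start = time.perf_counter()
    for received_at, value in stream:
        stats.update(value)
        # Latency: delay between receipt and the updated metrics being ready.
        latencies.append(time.perf_counter() - received_at)
    elapsed = time.perf_counter() - start
    throughput = stats.count / elapsed if elapsed > 0 else float("inf")
    return stats, latencies, throughput


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Each value is stamped with its receipt time when it is pulled from the stream.
stream = ((time.perf_counter(), v) for v in values)
stats, latencies, throughput = process(stream)
```

Because the update never revisits earlier data points, the same `RunningStats` object could sit inside a per-record or a per-micro-batch pipeline unchanged.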
Continuous Metrics
‣ High level results overview
‣ Spark Streaming achieves the highest throughput and
Storm the lowest.
‣ However, Storm achieves the best latency by a
considerable margin. Spark Streaming and Trident both exhibit
considerably higher latency, which is due at least in part
to their micro-batch data processing approach.
‣ The evaluation produced many other insights, learnings
and recommendations relating to these real-time platforms.
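Why micro-batching adds latency can be seen from a small idealised model, which is an assumption of this sketch and not a reproduction of the evaluation. Suppose processing itself takes no time: a per-record system then answers immediately, while a micro-batch system can only answer once the batch window containing the record has closed. The function names and the arrival pattern (one message every 10 ms for 10 s) are made up, and times are kept in integer milliseconds to avoid rounding effects.

```python
from typing import Iterable


def batch_release_time(arrival_ms: int, interval_ms: int) -> int:
    """End of the micro-batch window that contains `arrival_ms`."""
    return (arrival_ms // interval_ms + 1) * interval_ms


def mean_wait_ms(arrivals_ms: Iterable[int], interval_ms: int) -> float:
    """Average time a record waits for its batch window to close."""
    waits = [batch_release_time(t, interval_ms) - t for t in arrivals_ms]
    return sum(waits) / len(waits)


arrivals_ms = list(range(0, 10_000, 10))  # hypothetical: 100 messages/second

wait_1s_batches = mean_wait_ms(arrivals_ms, interval_ms=1000)   # 505.0
wait_100ms_batches = mean_wait_ms(arrivals_ms, interval_ms=100)  # 55.0
```

In this model the added delay is roughly half the batch interval, so shrinking the interval lowers latency, typically at the cost of more per-batch overhead.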
Stream Converge
‣ Current project:
process and combine
heterogeneous data
streams from diverse
sources using Spark
Streaming.
Stream Converge
‣ Challenges:
‣ managing data streams of different frequencies.
‣ linking together events across different streams via
complex key relationships.
‣ handling out-of-order arrival of data.
‣ ……
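The slides do not say how the project handles out-of-order data. One common approach, sketched here in plain Python as an assumption rather than as the project's design, is a reorder buffer driven by a watermark: events are held back until the stream has advanced `allowed_lateness` time units beyond them, then released in event-time order. Events that arrive after their slot has already been released are set aside as late. The function name `reorder`, the integer timestamps and the sample events are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (event time, payload); hypothetical format


def reorder(
    events: Iterable[Event], allowed_lateness: int, late: List[Event]
) -> Iterator[Event]:
    """Yield events in event-time order, tolerating bounded disorder."""
    buffer: List[Event] = []      # min-heap ordered by event time
    max_seen = float("-inf")
    watermark = float("-inf")     # everything at or before this has been released
    for event in events:
        event_time = event[0]
        if event_time < watermark:
            late.append(event)    # too late to be placed in order
            continue
        heapq.heappush(buffer, event)
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lateness
        # Release everything the watermark has passed, oldest first.
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    while buffer:                 # end of stream: flush what is left
        yield heapq.heappop(buffer)


arrived = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d"), (5, "e"), (2, "late")]
late_events: List[Event] = []
ordered = list(reorder(arrived, allowed_lateness=2, late=late_events))
```

A larger `allowed_lateness` tolerates more disorder but delays every result by the same amount, the same throughput-versus-latency trade-off seen with micro-batches.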