Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Real-time Analytics on High
Velocity Streaming Data
Guangyu Wu @CeADAR

CeADAR
‣ Application development & proof of concept
‣ Business-value driven
‣ Market pull/need-driven
‣ Website: http://ceadar.ie/
[Diagram: University, CeADAR, Enterprise]

CeADAR
Visualisation & Analytic Interfaces
• 'Beyond the desktop'
• Ease of interaction
• Changing user behaviour
• Passive analytics
Data Management for Analytics
• Reduce data management effort for analytics
• Data validation
• Relevance of events to relationships
• Data curation (determining useful data)
• Adaptive ETL (Extract, Transform, Load)
Advanced Analytics
• Causation challenge
• Live topic monitoring
• Social trending and contextualisation
• Continuous analytics
• Social identity fingerprinting

Overview
‣ Introduce different frameworks:
‣ Spark, Storm, Trident
‣ Continuous Clustering project
‣ Continuous Metrics project
‣ Stream Converge project

Spark
‣ Spark is a platform for distributed batch data processing.
‣ Spark includes a number of extensions: Spark Streaming, Spark
SQL, MLlib, GraphX.
‣ Spark runs batch jobs predominantly in memory.
‣ Spark Streaming integrates stream processing with batch
processing by treating a data stream as a sequence of small
batches of data points, or micro-batches.
‣ Spark Streaming maintains computation states.
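
The micro-batch idea can be sketched in plain Python. This is an illustration only and does not use the Spark API: the helper names `micro_batches` and `run_word_count` are hypothetical, and batches are cut by item count to keep the example deterministic, whereas Spark Streaming cuts them by a time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size items from an iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_word_count(stream, batch_size):
    """Process a stream batch by batch, keeping state across batches."""
    state = Counter()   # computation state that survives between batches
    snapshots = []
    for batch in micro_batches(stream, batch_size):
        batch_counts = Counter(batch)   # an ordinary batch job on one micro-batch
        state.update(batch_counts)      # fold the batch result into the state
        snapshots.append(dict(state))   # state as seen after this batch
    return snapshots


words = ["spam", "ham", "spam", "spam", "ham"]
snapshots = run_word_count(words, batch_size=2)
```

Each entry of `snapshots` is the running count after one micro-batch, which mirrors how a streaming job exposes updated state once per batch rather than once per data point.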

Storm
‣ A Storm topology is composed of spouts (stream sources) and bolts (processing steps).
‣ Storm operates over individual data points.
‣ Storm is designed purely for stream processing.
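
A rough sketch of the spout-and-bolt model, in plain Python rather than Storm's own API. The class names `SentenceSpout`, `SplitBolt` and `CountBolt` and the function `run_topology` are hypothetical; the point is only that each data point travels through the chain individually.

```python
class SentenceSpout:
    """Source of the stream: emits one tuple at a time."""

    def __init__(self, sentences):
        self._sentences = list(sentences)

    def emit(self):
        for sentence in self._sentences:
            yield sentence


class SplitBolt:
    """Consumes one sentence and emits one tuple per word."""

    def process(self, sentence):
        for word in sentence.split():
            yield word.lower()


class CountBolt:
    """Consumes one word at a time and keeps a running count."""

    def __init__(self):
        self.counts = {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        yield word, self.counts[word]


def run_topology(spout, split_bolt, count_bolt):
    """Wire spout -> split -> count and push every tuple through individually."""
    emitted = []
    for sentence in spout.emit():
        for word in split_bolt.process(sentence):
            emitted.extend(count_bolt.process(word))
    return emitted


count_bolt = CountBolt()
emitted = run_topology(SentenceSpout(["Win a prize", "win now"]), SplitBolt(), count_bolt)
```

In a real deployment the spouts and bolts run in parallel across machines; this single-process loop only shows the per-tuple data flow.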

Trident
‣ Trident is a high-level programming abstraction built on top of
Storm.
‣ It provides a number of useful functions such as aggregations
and filters.
‣ An application can be designed and implemented using these
high-level abstractions, and Trident converts the logic into a
standard Storm topology under the hood.
‣ Trident works over micro-batches of data.
‣ Trident also has built-in support for maintaining processing state
and for querying that state.

Methodology
‣ Large static batches of messages: Hadoop and off-line batch processing in Spark
‣ Single messages: Storm
‣ Micro-batches of messages (discretised streams): Spark Streaming, Trident

Continuous Clustering
‣ Use case: real-time SMS spam detection in mobile networks.
‣ Clustering SMS messages based on their content is a good
way to identify spam.
‣ Many similar spam messages are sent out over a short
period of time.

‣ Problems with traditional clustering algorithms. They:
‣ work off-line over historical data
‣ require multiple passes over the data
‣ are not incrementally updatable
‣ are hard to scale to 'big' data
‣ CeADAR solution: we developed a novel single-pass,
scalable data stream clustering algorithm implemented on
Storm.
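
The slides do not describe the CeADAR algorithm itself, so the sketch below swaps in a textbook technique, leader-style clustering, purely to show what "single pass" and "incrementally updatable" mean in practice. Everything in it is an assumption: the class name `SinglePassClusterer`, the use of Jaccard similarity over word sets, and the 0.75 threshold. It compares each message against every cluster leader, so it is not scalable as written.

```python
def tokens(text):
    """Lower-cased set of words in a message."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class SinglePassClusterer:
    """Leader clustering: each message is read once and never revisited."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold   # assumed similarity cut-off
        self.leaders = []            # token set of the first message of each cluster
        self.sizes = []              # number of messages assigned to each cluster

    def add(self, message):
        """Assign one message to a cluster and return the cluster id."""
        toks = tokens(message)
        best_id, best_sim = None, 0.0
        for cid, leader in enumerate(self.leaders):
            sim = jaccard(toks, leader)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= self.threshold:
            self.sizes[best_id] += 1      # join the most similar existing cluster
            return best_id
        self.leaders.append(toks)         # otherwise start a new cluster
        self.sizes.append(1)
        return len(self.leaders) - 1


clusterer = SinglePassClusterer(threshold=0.75)
labels = [
    clusterer.add("win a free prize now call 0800 today"),
    clusterer.add("win a free prize now call 0900 today"),
    clusterer.add("are you coming to dinner tonight"),
]
```

The two near-identical messages land in the same cluster, while the ordinary message starts its own. A cluster that grows quickly over a short period is the kind of signal the use case above associates with spam.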

Deployment
‣ Our compute cluster is composed of 4 machines.
‣ Each machine:
‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores
‣ 64 GB memory
‣ 1 TB disk
‣ Spark, Storm, Hadoop, Kafka, Redis

‣ US tier 1 mobile operator
‣ ~500 messages/second average
‣ ~1,300 messages/second peak
‣ Near-exact matching: 35,913
‣ Matching threshold 75%: 8,160

Continuous Metrics
‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on
the task of computing a set of statistical metrics in real-time over a
continuous stream of data.
‣ Evaluation criteria:
‣ Throughput: the volume and velocity of data that can be processed
on different configurations and hardware.
‣ Latency: the time delay between a new data point being received and
the updated metrics being computed.
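
The slides do not list which statistical metrics were computed. As an assumed example, the sketch below keeps a running count, mean, variance, minimum and maximum, updating them once per data point with Welford's method so that no past data needs to be stored. The class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Incrementally updated statistics over a stream of numbers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0           # sum of squared deviations from the current mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x):
        """Fold one new data point into the metrics in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)   # Welford's update step
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Latency in the sense defined above would be the time between a value arriving and `update` returning; throughput would be the number of `update` calls sustained per second.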

Sliding Windows
‣ By items
‣ By time
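
The two kinds of sliding window can be sketched as follows. The class names `CountWindow` and `TimeWindow` are hypothetical, the metric (a mean) is chosen only for illustration, and the time-based version assumes events arrive in timestamp order.

```python
from collections import deque


class CountWindow:
    """Sliding window by items: keeps the most recent `size` values."""

    def __init__(self, size):
        self.items = deque(maxlen=size)   # oldest value is evicted automatically

    def add(self, value):
        """Add a value and return the mean over the current window."""
        self.items.append(value)
        return sum(self.items) / len(self.items)


class TimeWindow:
    """Sliding window by time: keeps values newer than `span` before the latest one."""

    def __init__(self, span):
        self.span = span
        self.items = deque()   # (timestamp, value) pairs in arrival order

    def add(self, timestamp, value):
        """Add a timestamped value and return the mean over the current window."""
        self.items.append((timestamp, value))
        # Evict everything at or before (timestamp - span).
        while self.items[0][0] <= timestamp - self.span:
            self.items.popleft()
        return sum(v for _, v in self.items) / len(self.items)


by_items = CountWindow(size=3)
item_means = [by_items.add(v) for v in [1, 2, 3, 4]]

by_time = TimeWindow(span=10)
time_means = [by_time.add(t, v) for t, v in [(0, 10), (5, 20), (12, 30)]]
```

A window by items always holds the same number of data points regardless of arrival rate, while a window by time holds more points when the stream is fast and fewer when it is slow.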

Continuous Metrics
‣ High-level results overview
‣ Spark Streaming achieves the highest throughput and
Storm the lowest.
‣ However, Storm achieves the best latency by a
considerable margin. Spark Streaming and Trident both exhibit
considerably higher latency, due at least in part
to their micro-batch data processing approach.
‣ The evaluation produced many other insights, learnings
and recommendations relating to these real-time platforms.

Stream Converge
‣ Current project:
process and combine
heterogeneous data
streams from diverse
sources using Spark
Streaming.

Stream Converge
‣ Challenges:
‣ managing data streams of different frequency.
‣ linking together events across different streams via
complex key relationships.
‣ handling out-of-order arrival of data.
‣ ……
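
One common way to cope with out-of-order arrival is to buffer events briefly and release them in timestamp order once they are older than a watermark. The sketch below illustrates that general idea in plain Python; it is not the project's implementation and does not use the Spark Streaming API. The class name `ReorderBuffer` and the `max_delay` parameter are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Buffers events and releases them in timestamp order after a delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay        # how long to wait for stragglers
        self._heap = []                   # min-heap ordered by timestamp
        self._max_seen = float("-inf")    # newest timestamp observed so far
        self._seq = 0                     # tie-breaker for equal timestamps
        self.dropped = []                 # events that arrived too late

    def push(self, timestamp, payload):
        """Add one event; return the events that are now safe to emit, in order."""
        if timestamp < self._max_seen - self.max_delay:
            self.dropped.append((timestamp, payload))   # behind the watermark
            return []
        self._max_seen = max(self._max_seen, timestamp)
        heapq.heappush(self._heap, (timestamp, self._seq, payload))
        self._seq += 1
        watermark = self._max_seen - self.max_delay
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            ts, _, item = heapq.heappop(self._heap)
            ready.append((ts, item))
        return ready

    def flush(self):
        """Emit everything still buffered, in timestamp order."""
        ready = []
        while self._heap:
            ts, _, item = heapq.heappop(self._heap)
            ready.append((ts, item))
        return ready


buffer = ReorderBuffer(max_delay=2)
ordered = []
for ts, name in [(1, "a"), (3, "c"), (2, "b"), (6, "d"), (0, "late")]:
    ordered.extend(buffer.push(ts, name))
ordered.extend(buffer.flush())
```

The trade-off is the usual one: a larger `max_delay` tolerates more disorder but adds latency, while a smaller one emits sooner and drops more late events.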
