Real-time Analytics on High Velocity Streaming Data
Guangyu Wu @CeADAR
CeADAR
‣ Application development & proof of concept
‣ Business-value driven
‣ Market pull/need-driven
‣ Website: http://ceadar.ie/
University | CeADAR | Enterprise
CeADAR
Visualisation & Analytic Interfaces
• ‘Beyond the desktop’
• Ease of interaction
• Changing user behaviour
• Passive analytics
Data Management for Analytics
• Reduce data management effort for analytics
• Data validation
• Relevance of events to relationships
• Data curation (determining useful data)
• Adaptive ETL (Extract, Transform, Load)
Advanced Analytics
• Causation challenge
• Live topic monitoring
• Social trending and contextualisation
• Continuous analytics
• Social identity fingerprinting
Overview
‣ Introduce different frameworks:
‣ Spark, Storm, Trident
‣ Continuous Clustering project
‣ Continuous Metrics project
‣ Stream Converge project
Spark
‣ Spark is a platform for distributed batch data processing.
‣ Spark includes a number of extensions: Spark Streaming, Spark
SQL, MLlib, GraphX.
‣ Spark runs batch jobs predominantly in memory.
‣ Spark Streaming manages to integrate stream processing
with batch processing by treating a data stream as
sequences of small batches of data points, or micro-batches.
‣ Spark Streaming maintains computation states.
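The micro-batch idea above can be sketched in a few lines of plain Python. This is an illustration only and does not use the Spark API: the names `micro_batches` and `running_word_counts` are hypothetical, and batches are cut by item count rather than by a time interval so that the example stays deterministic.

```python
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a stream into consecutive small batches (by count, for determinism)."""
    batch: List[str] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


def running_word_counts(stream: Iterable[str], batch_size: int) -> Dict[str, int]:
    """Run an ordinary batch job on each micro-batch and fold it into kept state."""
    state: Dict[str, int] = {}
    for batch in micro_batches(stream, batch_size):
        # Batch step: count words inside this micro-batch only.
        batch_counts: Dict[str, int] = {}
        for word in batch:
            batch_counts[word] = batch_counts.get(word, 0) + 1
        # State step: merge the batch result into the state carried across batches.
        for word, n in batch_counts.items():
            state[word] = state.get(word, 0) + n
    return state
```

The same per-batch counting code could run unchanged on a large static batch, which is the sense in which stream and batch processing are integrated here.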
Storm
‣ A Storm topology is composed of spouts (data sources) and bolts (processing steps).
‣ Storm operates over individual data points.
‣ Storm is designed purely for stream processing.
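As a rough illustration of the spout and bolt roles, the toy pipeline below handles one data point at a time. It is a single-process sketch in plain Python, not the Storm API; `sentence_spout`, `split_bolt`, `CountBolt` and `run_topology` are hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: the source that emits data points into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(sentence: str) -> List[str]:
    """Bolt: takes one data point and emits zero or more new ones."""
    return sentence.lower().split()


class CountBolt:
    """Bolt with local state: updates a count for every single word it receives."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, word: str) -> Tuple[str, int]:
        self.counts[word] = self.counts.get(word, 0) + 1
        return word, self.counts[word]


def run_topology(sentences: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> split bolt -> count bolt and push items through one by one."""
    counter = CountBolt()
    emitted: List[Tuple[str, int]] = []
    for sentence in sentence_spout(sentences):
        for word in split_bolt(sentence):
            emitted.append(counter.process(word))
    return emitted
```

Each word produces an updated count as soon as it arrives; nothing waits for a batch to fill.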
Trident
‣ Trident is a high level programming abstraction built on top of
Storm.
‣ It provides a number of useful functions such as aggregations
and filters.
‣ An application can be designed and implemented using these
high level abstractions and Trident converts the logic into a
standard Storm topology under the hood.
‣ Trident works over micro-batches of data.
‣ Trident also has built-in support for maintaining processing state
and state query.
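The flavour of such a high-level abstraction can be hinted at with a small chained pipeline: filters are declared up front, each micro-batch is aggregated into persistent state, and the state can be queried. This is plain Python and deliberately not the Trident API; the class `MicroBatchPipeline` and its methods are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class MicroBatchPipeline:
    """Hypothetical declarative pipeline: filter, then keyed count into state."""

    def __init__(self) -> None:
        self._predicates: List[Callable[[str], bool]] = []
        self._state: Dict[str, int] = {}

    def filter(self, predicate: Callable[[str], bool]) -> "MicroBatchPipeline":
        """Declare a filter; returns self so that calls can be chained."""
        self._predicates.append(predicate)
        return self

    def process_batch(self, batch: Iterable[str]) -> None:
        """Apply all filters to one micro-batch and aggregate what survives."""
        kept = [x for x in batch if all(p(x) for p in self._predicates)]
        for item in kept:
            self._state[item] = self._state.get(item, 0) + 1

    def query(self, key: str) -> int:
        """State query: current count for a key, 0 if never seen."""
        return self._state.get(key, 0)


pipeline = MicroBatchPipeline().filter(lambda w: len(w) > 2).filter(str.isalpha)
pipeline.process_batch(["spam", "ok", "spam", "x1y"])
pipeline.process_batch(["spam", "eggs"])
```

The caller describes what to compute; how batches are scheduled and where state lives is left to the framework, which is the division of labour the slide describes.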
Methodology
‣ Large static batches of messages: Hadoop and off-line batch processing in Spark
‣ Single messages: Storm
‣ Micro-batches of messages: Spark Streaming, Trident
Discretised streams
Continuous Clustering
‣ Use case: real-time SMS spam detection in mobile networks.
‣ Clustering SMS messages based on their content is a good
way to identify spam.
‣ Many similar spam messages are sent out over a short
period of time.
Continuous Clustering
‣ Problems with traditional clustering algorithms. They:
‣ work off-line over historical data
‣ require multiple passes over the data
‣ are not incrementally updatable
‣ are hard to scale to ‘big’ data
‣ CeADAR solution: we developed a novel single pass,
scalable data stream clustering algorithm implemented on
Storm.
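The slides do not describe the CeADAR algorithm itself. As a generic stand-in, the sketch below shows what "single pass" and "incrementally updatable" mean using simple leader-style clustering: each message is compared once against existing cluster representatives and either joins the best match or starts a new cluster. The Jaccard token similarity, the 0.75 default threshold and the class name `SinglePassClusterer` are assumptions made for this illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class SinglePassClusterer:
    """Leader clustering: every message is seen exactly once, no re-scan."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold
        self.leaders: List[Set[str]] = []  # first message of each cluster
        self.sizes: List[int] = []         # number of messages per cluster

    def add(self, message: str) -> int:
        """Assign the message to a cluster and return the cluster id."""
        tokens = set(message.lower().split())
        best_id, best_sim = -1, 0.0
        for cid, leader in enumerate(self.leaders):
            sim = jaccard(tokens, leader)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= self.threshold:
            self.sizes[best_id] += 1
            return best_id
        # No sufficiently similar cluster: this message becomes a new leader.
        self.leaders.append(tokens)
        self.sizes.append(1)
        return len(self.leaders) - 1


clusterer = SinglePassClusterer(threshold=0.75)
ids = [clusterer.add(m) for m in [
    "win a free prize now",
    "win a free prize today now",
    "see you at lunch",
    "WIN a FREE prize now",
]]
```

A cluster that grows quickly over a short period would be a spam candidate in the use case above. Comparing against every leader is linear in the number of clusters, so a scalable version would need some form of indexing or partitioning, which this sketch omits.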
Continuous Clustering
Deployment
‣ Our compute cluster is composed of 4 machines.
‣ Each machine:
‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores
‣ 64 GB memory
‣ 1 TB disk
‣ Spark, Storm, Hadoop, Kafka, Redis
Continuous Clustering
‣ US tier 1 mobile operator
‣ ~500 messages/second average
‣ ~1,300 messages/second peak
‣ Near-exact matching: 35,913
‣ Matching threshold 75%: 8,160
Continuous Metrics
‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on
the task of computing a set of statistical metrics in real-time over a
continuous stream of data.
‣ Evaluation criteria:
‣ Throughput: the volume and velocity of data that can be processed
on different configurations and hardware.
‣ Latency: the time delay between a new data point being received and
the updated metrics being computed.
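Given these definitions, both quantities can be derived from two timestamps per data point: when it was received and when the updated metrics were available. The helper below is a minimal sketch of that arithmetic; the function name and the tuple layout are assumptions, not the measurement harness used in the project.

```python
from typing import List, Tuple


def throughput_and_latency(
    events: List[Tuple[float, float]],
) -> Tuple[float, float, float]:
    """events holds (received_at, metrics_updated_at) pairs in seconds.

    Returns (throughput in items/second, mean latency, maximum latency).
    """
    if not events:
        raise ValueError("need at least one event")
    latencies = [done - received for received, done in events]
    first_in = min(received for received, _ in events)
    last_out = max(done for _, done in events)
    elapsed = last_out - first_in
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    throughput = len(events) / elapsed
    mean_latency = sum(latencies) / len(latencies)
    return throughput, mean_latency, max(latencies)
```

Reporting the maximum alongside the mean matters because a platform can show good average latency while still delaying individual items badly.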
Sliding Windows
‣ By items
‣ By time
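The two window types can be sketched with a double-ended queue: an item-based window keeps the last N values, while a time-based window keeps everything newer than a fixed span. The running mean is used as the example metric; the class names are hypothetical, and the time-based version assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Tuple


class CountWindowMean:
    """Sliding window by items: mean of the most recent `size` values."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.values: Deque[float] = deque()
        self.total = 0.0

    def add(self, value: float) -> float:
        self.values.append(value)
        self.total += value
        if len(self.values) > self.size:
            self.total -= self.values.popleft()  # evict the oldest item
        return self.total / len(self.values)


class TimeWindowMean:
    """Sliding window by time: mean of values newer than `span_seconds`."""

    def __init__(self, span_seconds: float) -> None:
        self.span = span_seconds
        self.items: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self.total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        self.items.append((timestamp, value))
        self.total += value
        # Evict points that are `span` or more seconds older than the newest one.
        while self.items[0][0] <= timestamp - self.span:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total / len(self.items)
```

Both keep a running total, so each update costs amortised constant time instead of re-scanning the window.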
Continuous Metrics
‣ High level results overview
‣ Spark Streaming achieves the highest throughput, with
Storm at the other end with the lowest throughput.
‣ However, Storm achieves the best latency by a
considerable margin. Spark Streaming and Trident both exhibit
considerably higher latency, due at least in part
to their micro-batch data processing approach.
‣ The evaluation produced many other insights, learnings
and recommendations relating to these real-time platforms.
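One way to see why micro-batching adds latency is an idealised model in which a data point's result only becomes available once the batch interval containing it has closed. The function below computes that delay for a list of arrival times. It is a simplified model with hypothetical names and parameters, not a reproduction of the evaluation's measurements.

```python
import math
from typing import List


def micro_batch_latencies(
    arrivals: List[float], interval: float, processing_time: float = 0.0
) -> List[float]:
    """Latency per item when results appear only after the batch interval closes.

    arrivals and interval are in seconds; processing_time is a fixed cost
    added after the batch closes (an assumption of this model).
    """
    latencies = []
    for t in arrivals:
        # End of the interval [k * interval, (k + 1) * interval) containing t.
        batch_close = (math.floor(t / interval) + 1) * interval
        latencies.append(batch_close - t + processing_time)
    return latencies
```

With a 1 second interval, an item arriving at the start of the interval waits the full second, whereas a per-item system would pay only the processing time. Under this model the interval length is a direct trade-off between batching efficiency and latency.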
Stream Converge
‣ Current project: process and combine heterogeneous data streams
from diverse sources using Spark Streaming.
Stream Converge
‣ Challenges:
‣ managing data streams of different frequency.
‣ linking together events across different streams via
complex key relationships.
‣ handling out of order arrival of data.
‣ ……
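For the out-of-order challenge, one common general technique is to buffer events briefly and release them in event-time order once they are older than a watermark. The sketch below shows that idea in plain Python; it is not taken from the Stream Converge project, and `reorder` and `max_delay` are names assumed for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (event_time, payload)


def reorder(events: Iterable[Event], max_delay: float) -> Iterator[Event]:
    """Emit events in event-time order, tolerating lateness up to max_delay."""
    heap: List[Event] = []
    max_seen = float("-inf")
    for event in events:
        heapq.heappush(heap, event)
        max_seen = max(max_seen, event[0])
        # Anything at or before the watermark is assumed to be complete.
        watermark = max_seen - max_delay
        while heap and heap[0][0] <= watermark:
            yield heapq.heappop(heap)
    # End of stream: flush whatever is still buffered, in order.
    while heap:
        yield heapq.heappop(heap)


arrived = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (5, "e"), (4, "d")]
ordered = list(reorder(arrived, max_delay=2))
```

An event that arrives more than `max_delay` behind the newest event seen is still emitted, but possibly out of order, so the delay bound trades added latency against ordering guarantees.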

Real Time Analytics on High Velocity Streaming data by Guangyu Wu