SlideShare a Scribd company logo
1 of 18
Download to read offline
Real-time Analytics on High
Velocity Streaming Data
Guangyu Wu @CeADAR
CeADAR
‣ Application development & proof of concept
‣ Business-value driven
‣ Market pull/need-driven
‣ Website: http://ceadar.ie/
University	 CeADAR	 Enterprise
CeADAR
Visualisa'on	&	
Analy'c	Interfaces	
• ‘Beyond	the	desktop’	
• Ease	of	interac6on	
• Changing	user	
behaviour	
• Passive	analy6cs	
Data	Management	for	
Analy'cs	
• Reduce	data	
management	effort	
for	analy6cs	
• Data	valida6on	
• Relevance	of	events	
to	rela6onships	
• Data	cura6on	
(determining	useful	
data)	
• Adap6ve	ETL	
(Extract,	Transform,	
Load)	
Advanced	Analy'cs	
• Causa6on	challenge	
• Live	topic	monitoring	
• Social	trending	and	
contextualisa6on	
• Con'nuous	analy'cs	
• Social	Iden6ty	
fingerprin6ng
Overview
‣ Introduce different frameworks:
‣ Spark, Storm, Trident
‣ Continuous Clustering project
‣ Continuous Metrics project
‣ Stream Converge project
Spark
‣ Spark is a platform for distributed batch data processing.
‣ Spark includes a number extensions: Spark Streaming, Spark
SQL, MLlib, GraphX.
‣ Spark runs batch jobs predominantly in memory.
‣ Spark Streaming manages to integrate stream processing
with batch processing by treating a data stream as
sequences of small batches of data points, or micro-batches.
‣ Spark Streaming maintains computation states.
Storm
‣ A Storm topology is comprised of spouts and bolts.
‣ Storm operates over individual data points.
‣ Storm is designed purely for stream processing.
Trident
‣ Trident is a high level programming abstraction built on top of
Storm.
‣ It provides a number of useful functions such as aggregations
and filters.
‣ An application can be designed and implemented using these
high level abstractions and Trident converts the logic into a
standard Storm topology under the hood.
‣ Trident works over micro-batches of data.
‣ Trident also has built-in support for maintaining processing state
and state query.
Methodology
Large static batches
of messages
Hadoop and off-line
batch processing in
Spark
Single messages
Storm
Micro-batches of
messages
Spark Streaming,
Trident
Discretised streams
Continuous Clustering
‣ Use case: real-time SMS spam detection in mobile networks.
‣ Clustering SMS messages based on their content is a good
way to identify spam.
‣ Many similar spam messages are sent out over a short
period of time.
Continuous Clustering
‣ Problem with traditional clustering algorithms…
‣ work off-line over historical data
‣ require multiple passes over the data
‣ not incrementally updatable
‣ are hard to scale to ‘big’ data
‣ CeADAR solution: we developed a novel single pass,
scalable data stream clustering algorithm implemented on
Storm.
Continuous Clustering
Deployment
‣ Our compute cluster is composed of 4 machines.
‣ Each machine:
‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores
‣ 64G memory
‣ 1T disk
‣ Spark, Storm, Hadoop, Kafka, Redis
Continuous Clustering
‣ US tier 1 mobile operator
‣ ~500 messages/second average
‣ ~1,300 messages/second peak
35,913
Near-exact matching
8,160
Matching threshold 75%
Continuous Metrics
‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on
the task of computing a set of statistical metrics in real-time over a
continuous stream of data.
‣ Evaluate and compare
‣ Throughput: the volume and velocity of data that can be processed
on different configurations and hardware.
‣ Latency: the time delay between a new data point being received and
the updated metrics being computed.
vs vs
Sliding Windows
‣ By items
‣ By time
Continuous Metrics
‣ High level results overview
‣ Spark Streaming achieves the highest throughput, with
Storm at the other end with the lowest throughput.
‣ However, Storm achieves the best latency by a
considerable margin. Spark and Trident both exhibit
considerably higher latency which is due at least in part
to their micro-batch data processing approach.
‣ The evaluation produced many other insights, learnings
and recommendations relating to these real-time platforms.
Stream Converge
‣ Current project:
process and combine
heterogeneous data
streams from diverse
sources using Spark
Streaming.
Stream Converge
‣ Challenges:
‣ managing data streams of different frequency.
‣ linking together events across different streams via
complex key relationships.
‣ handling out of order arrival of data.
‣ ……

More Related Content

More from John Mulhall

cloud-migrations.pptx
cloud-migrations.pptxcloud-migrations.pptx
cloud-migrations.pptxJohn Mulhall
 
HUGIreland_VincentDeStocklin_DataScienceWorkflows
HUGIreland_VincentDeStocklin_DataScienceWorkflowsHUGIreland_VincentDeStocklin_DataScienceWorkflows
HUGIreland_VincentDeStocklin_DataScienceWorkflowsJohn Mulhall
 
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdfHUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdfJohn Mulhall
 
Introduction to Software - Coder Forge - John Mulhall
Introduction to Software - Coder Forge - John MulhallIntroduction to Software - Coder Forge - John Mulhall
Introduction to Software - Coder Forge - John MulhallJohn Mulhall
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningJohn Mulhall
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016
Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016
Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016John Mulhall
 
HUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesJohn Mulhall
 
HUG Ireland Event Presentation - In-Memory Databases
HUG Ireland Event Presentation - In-Memory DatabasesHUG Ireland Event Presentation - In-Memory Databases
HUG Ireland Event Presentation - In-Memory DatabasesJohn Mulhall
 
HUG_Ireland_BryanQuinnPresentation_20160111
HUG_Ireland_BryanQuinnPresentation_20160111HUG_Ireland_BryanQuinnPresentation_20160111
HUG_Ireland_BryanQuinnPresentation_20160111John Mulhall
 
HUG Ireland Event - Dama Ireland slides
HUG Ireland Event - Dama Ireland slidesHUG Ireland Event - Dama Ireland slides
HUG Ireland Event - Dama Ireland slidesJohn Mulhall
 
Periscope Getting Started-2
Periscope Getting Started-2Periscope Getting Started-2
Periscope Getting Started-2John Mulhall
 
AIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIB
AIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIBAIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIB
AIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIBJohn Mulhall
 
Sonra Intelligence Ltd
Sonra Intelligence LtdSonra Intelligence Ltd
Sonra Intelligence LtdJohn Mulhall
 

More from John Mulhall (14)

cloud-migrations.pptx
cloud-migrations.pptxcloud-migrations.pptx
cloud-migrations.pptx
 
HUGIreland_VincentDeStocklin_DataScienceWorkflows
HUGIreland_VincentDeStocklin_DataScienceWorkflowsHUGIreland_VincentDeStocklin_DataScienceWorkflows
HUGIreland_VincentDeStocklin_DataScienceWorkflows
 
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdfHUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
 
Introduction to Software - Coder Forge - John Mulhall
Introduction to Software - Coder Forge - John MulhallIntroduction to Software - Coder Forge - John Mulhall
Introduction to Software - Coder Forge - John Mulhall
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016
Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016
Hadoop User Group Ireland (HUG) Ireland - Eddie Baggot Presentation April 2016
 
HUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation Slides
 
HUG Ireland Event Presentation - In-Memory Databases
HUG Ireland Event Presentation - In-Memory DatabasesHUG Ireland Event Presentation - In-Memory Databases
HUG Ireland Event Presentation - In-Memory Databases
 
HUG_Ireland_BryanQuinnPresentation_20160111
HUG_Ireland_BryanQuinnPresentation_20160111HUG_Ireland_BryanQuinnPresentation_20160111
HUG_Ireland_BryanQuinnPresentation_20160111
 
HUG Ireland Event - Dama Ireland slides
HUG Ireland Event - Dama Ireland slidesHUG Ireland Event - Dama Ireland slides
HUG Ireland Event - Dama Ireland slides
 
Periscope Getting Started-2
Periscope Getting Started-2Periscope Getting Started-2
Periscope Getting Started-2
 
AIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIB
AIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIBAIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIB
AIB's road-to-Real-Time-Analytics - Tommy Mitchell and Kevin McTiernan of AIB
 
Sonra Intelligence Ltd
Sonra Intelligence LtdSonra Intelligence Ltd
Sonra Intelligence Ltd
 

Recently uploaded

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Recently uploaded (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Real Time Analytics on High Velocity Streaming data by Guangyu Wu