Performance Metrics for Big Data
Systems: Streaming Data Analytics
Faculty Development Program (FDP) on Performance Assessment of Computing Systems,
organized by the Department of CSE, JIIT-62, Noida, from 10th to 15th July 2017.
Dr. Shikha Mehta
JIIT, Sec 62, Noida
mehtshikha@gmail.com
Outline
• Introduction
• What is Data Streaming?
• Data at Rest vs Data in Motion
– Batch Processing vs Stream Processing
• Why Streaming Data Analytics?
– Streaming Data Challenges
• Performance Metrics for Streaming Data
• Technologies for Streaming Data Analytics
• Lambda and Kappa Architecture
• Hype Cycle
According to a new International Data Corporation (IDC)
Spending Guide, “worldwide spending on the Internet of Things (IoT) will
grow at a 17.0% compound annual growth rate (CAGR) from $698.6 billion in
2015 to nearly $1.3 trillion in 2019.”
Courtesy: https://www.digitaldefense.com/a-look-towards-2016-and-dangers-of-the-internet-of-things-iot/
Harnessing Big Data: Analytics
http://www.slideshare.net/sajjanvsl/final-presentation-45456729
Data at rest Vs Data in motion
Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
At Rest | In Motion
Data is fixed (a.k.a. bounded) | Data arrives continuously (a.k.a. unbounded)
The difference lies in when you analyze the data:
Analyzed after the event occurs | Analyzed as the event occurs
Finding stats about a group in a closed room | Finding stats about a group in a marathon
Analyzing last month's sales data to make strategic decisions | E-commerce order processing
What kind of Processing?
Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
wada ⇒ batch
pani puri ⇒ Streaming
Batch vs Stream Processing cont..
Courtesy: Streaming Analytics on AWS, Dmitri Tchikatilov, AdTech BD, AWS, dmitrit@amazon.com
Aspect | Batch Processing | Stream Processing
Data scope | Queries or processing over all or most of the data | Queries or processing over data in a rolling window, or over only the most recent data record
Data size | Large batches of data | Individual records or micro-batches of a few records
Performance | Latencies of minutes to hours | Requires latency on the order of seconds or milliseconds
Analytics | Complex analytics | Simple response functions, aggregates, and rolling metrics
What is Stream Processing?
• Imagine you are browsing:
• If you see an advert on a page, there will be an
AdViewEvent
• {UserId, AdId, Timestamp}
• If you clicked the ad, there will be another
AdClickEvent
• {UserId, AdId, Timestamp}
Courtesy: Coursera, course on Cloud Computing Applications
Stream Processing Cont..
Courtesy: Coursera, course on Cloud Computing Applications
Which is the most effective ad during last hour?
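A minimal sketch of how the question above could be answered over the two event types. It assumes that "effective" means click-through rate (clicks divided by views), that events arrive in timestamp order, and that a single `AdEvent` record with a `kind` field stands in for AdViewEvent and AdClickEvent; all of these names are hypothetical and not taken from the course material.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AdEvent:
    kind: str         # "view" or "click" (hypothetical encoding of the two event types)
    user_id: str
    ad_id: str
    timestamp: float  # seconds


class HourlyAdStats:
    """Keeps per-ad view and click counts for events inside a sliding time window."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window_seconds = window_seconds
        self.events = deque()   # events currently inside the window, oldest first
        self.views = Counter()
        self.clicks = Counter()

    def add(self, event: AdEvent) -> None:
        self.events.append(event)
        counter = self.views if event.kind == "view" else self.clicks
        counter[event.ad_id] += 1
        self._expire(event.timestamp)

    def _expire(self, now: float) -> None:
        # Forget events that have fallen out of the window.
        while self.events and self.events[0].timestamp <= now - self.window_seconds:
            old = self.events.popleft()
            counter = self.views if old.kind == "view" else self.clicks
            counter[old.ad_id] -= 1

    def most_effective_ad(self):
        # Assumption: effectiveness = clicks / views within the window.
        rates = {ad: self.clicks[ad] / n for ad, n in self.views.items() if n > 0}
        return max(rates, key=rates.get) if rates else None
```

Only the events of the last hour are held in memory, so the answer is updated as each event arrives rather than recomputed over the full history.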
Stream Processing Cont..
• Data streams: continuous flows of data generated at high speed in dynamic, time-changing environments.
• We need to maintain decision models in real time.
• Decision models must be capable of:
– incorporating new information at the speed the data arrives;
– detecting changes and adapting the decision models to the most recent information;
– forgetting outdated information.
• Unbounded training sets, dynamic models.
• In Practice: finite training sets, static models.
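The three capabilities listed above (incorporate, adapt, forget) can be illustrated with a very small "model": a running mean with a fading factor. This is only a sketch; the class name `FadingMean` and the parameter `alpha` are illustrative choices, not something named in the slides.

```python
class FadingMean:
    """Running mean in which older examples are gradually forgotten.

    alpha = 1.0 gives the ordinary mean; smaller values forget faster.
    """

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, x: float) -> None:
        # One example at a time, constant memory, constant time per update.
        self.weighted_sum = self.alpha * self.weighted_sum + x
        self.weight = self.alpha * self.weight + 1.0

    @property
    def value(self):
        return self.weighted_sum / self.weight if self.weight else None
```

If the stream switches from values near 0 to values near 10, the estimate follows the new level after a few updates, whereas an ordinary mean over all history would lag behind.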
Stream Processing Cont..
Courtesy: Ecmlpkdd2015 slides
1. One example at a time,
used at most once
2. Limited memory
3. Limited time
4. Anytime prediction
How to evaluate decision models that evolve over
time?
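One common answer in the stream-mining literature is prequential (interleaved test-then-train) evaluation: each example is first used to test the model and only then to train it. The slide does not name a method, so the sketch below is an assumption about what could be used here; `MajorityClassLearner` is a hypothetical toy learner included only to make the example runnable.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Yield the running accuracy after each (x, y) example of the stream."""
    correct = 0
    seen = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test on the example first ...
            correct += 1
        learner.learn(x, y)          # ... then use it for training, exactly once
        seen += 1
        yield correct / seen
```

Because every example is used at most once and the accuracy is available after each step, this scheme fits the four constraints listed above.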
Why Streaming Analytics?
Value Creation, Cost and the Challenge
• It's not cost-effective to store all data, especially if it is of low value or yet to be deemed of value (noise).
• But it's highly valuable to inspect and analyze all the data, to separate the signal from the noise or determine what needs to be persisted.
• There is value in identifying the signal in the past through offline analysis (which is actually required), but by then you've lost the chance to affect the present.
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
Top Client Challenges
• 80% of data is unstructured. Existing analytics cannot analyze streaming data such as video, audio, text and sensor data.
• Too much noise and too much low-value data. How to pre-process all data on the fly (megabytes or petabytes) and keep only what is required or valuable? Remember, more data means more cost and compliance pain.
• Data volumes double every year: too much to store and then analyze. How to analyze now, before the data is gone forever?
• Dashboard overload. Too much history and not enough future prediction. How to get ahead, plan and predict rather than react?
• Sometimes 1 minute is too late. How to quickly process, analyze and act on perishable data to lower costs? Not just batch/historical analysis.
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
Major Research challenges in
Streaming Data Analytics:
1. Concept Drift
2. Classification of stream data
3. Pre-processing of streams
4. Performance evaluation parameters for
stream data mining processes
5. Protecting data privacy
Courtesy: Krempl, Georg, et al. "Open challenges for data stream mining research." ACM SIGKDD explorations newsletter 16.1 (2014).
Performance Metrics for stream data
mining processes
[1] Bifet A., Read J., Žliobaitė I., Pfahringer B., Holmes G. (2013) Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. In: Blockeel H., Kersting K., Nijssen S., Železný F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg.
[2] Song M., Zhang L. (2008) Comparison of Cluster Representations from Partial Second- to Full Fourth-Order Cross Moments for Data Stream Clustering. In: ICDM '08, Eighth IEEE International Conference on Data Mining.
Task | Evaluation Parameter | Major Purpose | Value Significance
Classification | Kappa statistic [1] | Assesses performance on imbalanced data streams | Higher value means better performance
Classification | Temporal-Kappa statistic [1] | Assesses performance on temporally dependent data streams | Negative value means worse performance than the no-change (persistent) classifier
Clustering | Completeness [2] | Measures whether instances of the same class fall in the same cluster | Higher value means better clustering
Clustering | Purity [2] | Assesses the purity of the clusters in terms of containing instances of the same class | Higher value means better clustering
Clustering | SSQ [2] | Measures cluster cohesiveness | Lower value means better performance
Clustering | Silhouette coefficient [2] | Assesses compactness as well as separation of clusters | Higher value means better clustering
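The two classification metrics in the table can be sketched as follows. This follows the definitions as I understand them from [1]: Kappa compares accuracy against a chance classifier that respects the class frequencies, and Temporal-Kappa compares accuracy against a persistent classifier that always predicts the previous true label. The function names are illustrative, and the sketch does not guard against the degenerate case where the baseline itself is perfect (division by zero).

```python
from collections import Counter


def kappa(y_true, y_pred):
    """Kappa statistic: (p0 - pc) / (1 - pc)."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    # Probability that a chance classifier with the same label frequencies agrees.
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p0 - pc) / (1 - pc)


def kappa_temporal(y_true, y_pred):
    """Temporal Kappa: (p0 - pe) / (1 - pe), pe = accuracy of the persistent classifier."""
    # The first example is skipped: the persistent classifier has no previous label yet.
    triples = list(zip(y_true[1:], y_pred[1:], y_true[:-1]))
    n = len(triples)
    p0 = sum(t == p for t, p, _ in triples) / n
    pe = sum(t == prev for t, _, prev in triples) / n
    return (p0 - pe) / (1 - pe)
```

On a stream whose labels come in long runs, a classifier that always predicts one class can look acceptable on accuracy, score zero on Kappa, and score negative on Temporal Kappa, which is the "negative value means worse performance" case in the table.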
Performance Metrics for stream data
mining processes cont..
• Loss: measures how well the current model fits the actual state of nature.
• Memory used: learning algorithms run in fixed memory. We need to evaluate memory usage over time, and the impact on accuracy of staying within the available memory.
• Speed of processing examples: algorithms must process the examples at least as fast as they arrive.
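Memory and speed can be measured with the Python standard library alone. The sketch below is one possible harness, not a benchmark methodology from the slides; `measure` and `process` are hypothetical names, and `tracemalloc` only tracks memory allocated by Python itself.

```python
import time
import tracemalloc


def measure(stream, process):
    """Run process(example) over the stream and report throughput and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    count = 0
    for example in stream:
        process(example)
        count += 1
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return {
        "examples": count,
        "examples_per_second": count / elapsed if elapsed > 0 else float("inf"),
        "peak_bytes": peak,
    }
```

Comparing `examples_per_second` with the arrival rate of the stream shows whether the algorithm can keep up, and `peak_bytes` shows whether it stays within its memory budget.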
Apache Kafka
• A high-performance distributed publish-subscribe messaging system.
• Designed for processing real-time activity stream data.
• Initially developed at LinkedIn, now part of Apache.
• Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
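The publish-subscribe idea can be sketched in a few lines of plain Python. This is an in-memory toy and not Kafka's client API: `InMemoryBroker`, `publish` and `poll` are hypothetical names, and real brokers add partitioning, replication and persistence to disk.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy broker: one append-only log per topic, one read offset per consumer."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> list of messages
        self.offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, topic, consumer):
        """Return the messages this consumer has not seen yet."""
        log = self.logs[topic]
        start = self.offsets[(topic, consumer)]
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]
```

Because each consumer keeps its own offset into the shared log, publishers and subscribers stay decoupled and several subscribers can read the same messages independently.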
Courtesy: https://www.tutorialspoint.com/apache_kafka/
• Fast, scalable, durable and fault-tolerant.
Apache Storm
• A highly distributed real-time computation system.
• Created at BackType, which was acquired by Twitter.
• Twitter claims, “Over a million tuples processed per second per node.”
• Fast, scalable, reliable and fault-tolerant.
• Stream: an unbounded sequence of tuples.
• Primitives:
– Spouts: pull messages (the sources of streams)
– Bolts: perform the core functions of stream computing
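The spout and bolt roles can be pictured with plain Python generators. This is a conceptual sketch only and does not use Storm's API; the function names are hypothetical, and a real topology would run these stages in parallel across a cluster.

```python
def sentence_spout(sentences):
    """Spout: pulls messages from a source and emits them as one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(tuples):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = {}
    for (word,) in tuples:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])
```

Chaining them as `count_bolt(split_bolt(sentence_spout(source)))` processes one tuple at a time, so the source may be unbounded.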
Courtesy: http://www.tutorialspoint.com/apache_storm/
• Spark Streaming uses micro-batching to support continuous stream processing.
• It is an extension of Spark, which is a batch-processing system.
Courtesy: http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
• Developed in the AMPLab at UC Berkeley.
• In-memory computing capabilities deliver speed.
• Low latency
• High throughput
• Fault tolerant
• New programming model:
– Discretized streams (DStreams)
– Resilient Distributed Datasets (RDDs)
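Micro-batching can be sketched without Spark: the stream is cut into small batches and each batch is handed to ordinary batch code. The sketch groups by record count for simplicity, whereas Spark Streaming cuts batches by time interval; `micro_batches` is a hypothetical helper, not a Spark function.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) stream into lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Each micro-batch is then processed with plain batch logic, here a sum.
batch_sums = [sum(batch) for batch in micro_batches(range(7), 3)]
```

The design trade-off is latency: a result is available only once per batch, so smaller batches lower latency at the cost of more per-batch overhead.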
Spring XD
• Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
• The Spring XD framework supports streams for the ingestion of event-driven data that flows from a source to a sink through any number of processors.
Courtesy: https://github.com/spring-projects/spring-xd/wiki/About-Spring-XD
Comparison of Tools
Courtesy: https://www.slideshare.net/kamalika1912/big-data-analytics-for-real-time-systems
Comparison of Tools cont..
Commercial Stream processing
frameworks
• Google Cloud Dataflow
Courtesy: https://cloud.google.com/dataflow/
Commercial Stream processing
frameworks cont..
• Azure Stream Analytics
Courtesy: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
Lambda Architecture
Courtesy:https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
Lambda Architecture cont..
A. All data is sent to both the batch layer and the speed layer.
B. The master data set is an immutable, append-only set of data.
C. The batch layer pre-computes query functions from scratch; the results are called batch views. The batch layer constantly re-computes the batch views.
D. Batch views are indexed and stored in a scalable database so that particular values can be retrieved very quickly. New batch views are swapped in when they are available.
E. The speed layer compensates for the high latency of updates to the batch views.
F. The speed layer uses fast incremental algorithms and read/write databases to produce real-time views.
G. Queries are resolved by combining results from both the batch views and the real-time views.
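Steps A to G can be traced in a toy counting example. The sketch keeps everything in one process and in plain dictionaries, which is an assumption made for illustration; the class and method names are hypothetical, and a real deployment would place each layer in a separate system.

```python
class LambdaCounts:
    """Toy Lambda Architecture that counts occurrences of keys."""

    def __init__(self):
        self.master = []         # B: append-only master data set
        self.batch_view = {}     # C/D: precomputed batch view
        self.realtime_view = {}  # E/F: incrementally updated real-time view

    def ingest(self, key):
        # A: every record goes to both the batch layer and the speed layer.
        self.master.append(key)
        self.realtime_view[key] = self.realtime_view.get(key, 0) + 1

    def run_batch(self):
        # C: recompute the batch view from scratch over the whole master data set.
        view = {}
        for key in self.master:
            view[key] = view.get(key, 0) + 1
        # D: swap in the new batch view; the real-time view now only has to
        # cover data that arrives after this batch run.
        self.batch_view = view
        self.realtime_view = {}

    def query(self, key):
        # G: combine the batch view with the real-time view.
        return self.batch_view.get(key, 0) + self.realtime_view.get(key, 0)
```

Queries stay correct between batch runs because the real-time view covers exactly the records that the latest batch view has not yet absorbed.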
Courtesy:https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
Lambda Architecture cont..
Courtesy:https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
Lambda Architecture cont.. Example
Courtesy:https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
Lambda Architecture: Open Source
Frameworks
Kappa Architecture
Courtesy: Coursera, course on Cloud Computing Applications
Common Real-Time Analytics Use
Cases
• Sales Enrichment - Use of real-time events to predict what a consumer is interested in right now
– Data: Current search keywords, Transactions, Web pages visited, Mobility/Location, Weather, etc.
– Deliver a relevant coupon before they pass the store
– Display a relevant advert as they swipe a credit card at the gas pump
– Deliver a promotion to incentivize a change in behaviour
• Security/Fraud - Use of real-time context to determine if an action is, or is likely to be, fraudulent
– Data: Store browsing patterns, Location, Machine/Network activity, etc.
– Determine if an online session is fraudulent before a purchase transaction is submitted
– Identify and block a denial-of-service attack before it brings down any system
• Anomaly Prediction - Use of real-time events and context to predict anomalous behaviour before it occurs
– Data: Server logs, System metrics, Sensors, etc.
– Predict a network switch crash, so that all network data prior to the crash can be captured for root cause analysis
– Predict a black-ice or brake-failure event in a connected car
– Detect drilling dysfunction on an oil rig to prevent breakages and lost productivity
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Shikha fdp 62_14july2017

  • 1. Performance Metrics for Big Data Systems: Streaming Data Analytics. Faculty Development Program (FDP) on Performance Assessment of Computing Systems, organized by the Department of CSE, JIIT-62, Noida, from 10th to 15th July 2017. Dr. Shikha Mehta, JIIT, Sec 62, Noida, mehtshikha@gmail.com
  • 2. Outline • Introduction • What is data streaming? • Data at rest vs data in motion – Batch processing vs stream processing • Why streaming data analytics? – Streaming data challenges • Performance metrics for streaming data • Technologies for streaming data analytics • Lambda and Kappa architecture • Hype cycle
  • 4. According to a new International Data Corporation (IDC) Spending Guide, “worldwide spending on the Internet of Things (IoT) will grow at a 17.0% compound annual growth rate (CAGR) from $698.6 billion in 2015 to nearly $1.3 trillion in 2019.” Courtesy: https://www.digitaldefense.com/a-look-towards-2016-and-dangers-of-the-internet-of-things-iot/
  • 6. Harnessing Big Data: Analytics. http://www.slideshare.net/sajjanvsl/final-presentation-45456729
  • 7. Data at rest vs data in motion. Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
      Nature of the data. At rest: data is fixed (a.k.a. bounded). In motion: continuously incoming data (a.k.a. unbounded).
      When you analyze your data (this is where the difference lies). At rest: after the event occurs. In motion: as the event occurs.
      Analogy. At rest: finding stats about a group in a closed room. In motion: finding stats about a group in a marathon.
      Example. At rest: analyzing last month's sales data to make strategic decisions. In motion: e-commerce order processing.
  • 8. What kind of processing? wada ⇒ batch; pani puri ⇒ streaming. Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
  • 9. Batch vs Stream Processing cont.. Courtesy: Streaming Analytics on AWS, Dmitri Tchikatilov, AdTech BD, AWS, dmitrit@amazon.com
      Data scope. Batch: queries or processing over all or most of the data. Stream: queries or processing over data in a rolling window, or over only the most recent data record.
      Data size. Batch: large batches of data. Stream: individual records or micro-batches of a few records.
      Performance. Batch: latencies in minutes to hours. Stream: requires latency on the order of seconds or milliseconds.
      Analytics. Batch: complex analytics. Stream: simple response functions, aggregates, and rolling metrics.
  • 10. What is Stream Processing? • Imagine you are browsing: • If you see an advert on a page, there will be an AdViewEvent • {UserId, AdId, Timestamp} • If you click the ad, there will be another event, an AdClickEvent • {UserId, AdId, Timestamp} Courtesy: Coursera, course on Cloud Computing Applications
  • 11. Stream Processing Cont.. Which was the most effective ad during the last hour? Courtesy: Coursera, course on Cloud Computing Applications
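The "most effective ad during the last hour" question can be sketched as a sliding-window aggregate over the view and click events. The following is a minimal illustration, not code from the slides: the class names `AdEvent` and `SlidingWindowCTR`, the use of click-through rate as the measure of "effective", and the assumption that events arrive in timestamp order are all hypothetical choices made for the example.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AdEvent:
    kind: str         # "view" or "click" (stands in for AdViewEvent / AdClickEvent)
    user_id: str
    ad_id: str
    timestamp: float  # seconds; assumed to be non-decreasing across the stream


class SlidingWindowCTR:
    """Keeps per-ad view and click counts for the most recent `window_s` seconds."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.events = deque()    # events still inside the window, oldest first
        self.views = Counter()
        self.clicks = Counter()

    def _counter(self, kind: str) -> Counter:
        return self.views if kind == "view" else self.clicks

    def add(self, event: AdEvent) -> None:
        # Count the new event, then drop everything that has left the window.
        self.events.append(event)
        self._counter(event.kind)[event.ad_id] += 1
        self._expire(event.timestamp)

    def _expire(self, now: float) -> None:
        while self.events and self.events[0].timestamp <= now - self.window_s:
            old = self.events.popleft()
            self._counter(old.kind)[old.ad_id] -= 1

    def best_ad(self):
        # Click-through rate = clicks / views, computed only for ads with views.
        rates = {ad: self.clicks[ad] / n for ad, n in self.views.items() if n > 0}
        return max(rates, key=rates.get) if rates else None
```

Each event is touched once on arrival and once on expiry, so the answer is always available without rescanning the stream. Memory grows with the number of events inside the window rather than with the total history.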
  • 12. Stream Processing Cont.. • Data streams: continuous flows of data generated at high speed in dynamic, time-changing environments. • We need to maintain decision models in real time. • Decision models must be capable of: – incorporating new information at the speed at which data arrives; – detecting changes and adapting the decision models to the most recent information; – forgetting outdated information. • Unbounded training sets, dynamic models. • In practice: finite training sets, static models.
  • 13. Stream Processing Cont.. 1. One example at a time, used at most once 2. Limited memory 3. Limited time 4. Anytime prediction. How to evaluate decision models that evolve over time? Courtesy: Ecmlpkdd2015 slides
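One common answer to the evaluation question above is prequential (test-then-train) evaluation: each example is first used to test the current model and only then to update it, so every example is seen once and an accuracy estimate is available at any time. The slide does not name this technique; it is brought in here as an illustration. The sketch below uses a hypothetical `MajorityClassLearner` as a stand-in for a real incremental model.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental model: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Returns None before any label has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: predict on each example first, then learn from it.

    Returns the running accuracy after every example, so the quality of the
    evolving model can be inspected at any point in the stream.
    """
    correct = 0
    seen = 0
    history = []
    for x, y in stream:
        if learner.predict(x) == y:   # test on the example first
            correct += 1
        seen += 1
        history.append(correct / seen)
        learner.learn(x, y)           # then use it once for training
    return history
```

For example, `prequential_accuracy([(0, "a"), (1, "a"), (2, "b"), (3, "a")], MajorityClassLearner())` yields a running accuracy that ends at 0.5. In practice a sliding window or fading factor is often applied to the running average so that old mistakes do not mask recent behaviour.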
  • 14. Why Streaming Analytics? Value Creation, Cost and the Challenge • It's not cost-effective to store all data, especially if it is of low value or has yet to be deemed of value (noise) • But it's highly valuable to inspect and analyze all the data, to separate the signal from the noise or to determine what needs to be persisted • There is value in identifying the signal in the past through offline analysis (which is in fact required), but by then you have lost the chance to affect the present Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
  • 15. Top Client Challenges • 80% of data is unstructured. Existing analytics cannot analyze streaming data such as video, acoustic, text and sensor data. • Too much noise, too much low-value data. How to pre-process all data on the fly (megabytes or petabytes) and keep only what is required or valuable? Remember that more data means more cost and compliance pain. • Data volumes double every year: too much to store and then analyze. How to analyze now, before the data is gone forever? • Dashboard overload: too much history and not enough future prediction. How to get ahead, plan and predict rather than react? • Sometimes 1 minute is too late. How to quickly process, analyze and act on perishable data to lower costs? Not just batch/historical. Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
  • 16. Major Research Challenges in Streaming Data Analytics: 1. Concept drift 2. Classification of stream data 3. Pre-processing of streams 4. Performance evaluation parameters for stream data mining processes 5. Protecting data privacy. Courtesy: Krempl, Georg, et al. "Open challenges for data stream mining research." ACM SIGKDD Explorations Newsletter 16.1 (2014).
  • 17. Performance Metrics for stream data mining processes
      Classification:
        Kappa statistic [1]. Purpose: assess performance on imbalanced data streams. Higher value means better performance.
        Temporal Kappa statistic [1]. Purpose: assess performance on temporally dependent data streams. A negative value means the classifier performs worse than the baseline that simply repeats the previous label.
      Clustering:
        Completeness [2]. Purpose: measures whether instances of the same class fall in the same cluster. Higher value means better clustering.
        Purity [2]. Purpose: assesses how far each cluster contains instances of a single class. Higher value means better clustering.
        SSQ [2]. Purpose: measures cluster cohesiveness. Lower value means better performance.
        Silhouette coefficient [2]. Purpose: assesses both compactness and separation of clusters. Higher value means better clustering.
      [1] Bifet A., Read J., Žliobaitė I., Pfahringer B., Holmes G. (2013) Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. In: Blockeel H., Kersting K., Nijssen S., Železný F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg.
      [2] Song M., Zhang L. (2008) Comparison of Cluster Representations from Partial Second- to Full Fourth-Order Cross Moments for Data Stream Clustering. ICDM '08, Eighth IEEE International Conference on Data Mining.
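The two classification metrics can be sketched as follows. This is an illustrative implementation written for this transcript, based on the usual definitions: Kappa compares the observed accuracy p0 with the accuracy pc of a chance classifier that predicts with the same label frequencies, and Temporal Kappa compares p0 with the accuracy of a "persistent" classifier that always predicts the previous true label. The function names are hypothetical, and both functions are undefined (division by zero) when the baseline accuracy is exactly 1.

```python
from collections import Counter


def kappa(y_true, y_pred):
    """Kappa statistic: (p0 - pc) / (1 - pc)."""
    n = len(y_true)
    # p0: observed accuracy of the classifier.
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # pc: probability that a chance classifier with the same label
    # frequencies as the predictions agrees with the true labels.
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p0 - pc) / (1 - pc)


def kappa_temporal(y_true, y_pred):
    """Temporal Kappa: (p0 - pe) / (1 - pe), with pe from the persistent baseline."""
    # The first example has no predecessor, so it is skipped.
    triples = list(zip(y_true[1:], y_pred[1:], y_true[:-1]))
    n = len(triples)
    p0 = sum(t == p for t, p, _ in triples) / n
    # pe: accuracy of always predicting the previous true label.
    pe = sum(t == prev for t, _, prev in triples) / n
    return (p0 - pe) / (1 - pe)
```

With `y_true = [0, 0, 1, 1]` and `y_pred = [0, 0, 1, 0]`, accuracy is 0.75 and chance agreement is 0.5, so `kappa` gives 0.5. On a stream with long runs of the same label, a classifier that always predicts the majority class can still receive a negative `kappa_temporal`, because repeating the previous label would have done better.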
  • 18. Performance Metrics for stream data mining processes cont.. • Loss: measures how well the current model fits the current state of nature. • Memory used: learning algorithms run in fixed memory. We need to evaluate memory usage over time, and the impact on accuracy of the amount of memory available. • Speed of processing examples: algorithms must process examples at least as fast as they arrive.
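The memory and speed criteria can be measured with standard-library tools. The sketch below is an assumption-laden illustration: `profile_stream` and its `arrival_rate` parameter are hypothetical names, peak memory is taken from Python's `tracemalloc` (which only sees allocations made by the Python interpreter), and throughput is simply examples divided by wall-clock time.

```python
import time
import tracemalloc


def profile_stream(stream, process, arrival_rate=None):
    """Run `process` on every example and report throughput and peak memory.

    `arrival_rate` (examples per second, optional) is the rate at which data
    is assumed to arrive; if given, the result says whether processing keeps up.
    """
    tracemalloc.start()
    start = time.perf_counter()
    n = 0
    for example in stream:
        process(example)
        n += 1
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()  # (current, peak)
    tracemalloc.stop()

    rate = n / elapsed if elapsed > 0 else float("inf")
    report = {"examples": n, "examples_per_s": rate, "peak_bytes": peak_bytes}
    if arrival_rate is not None:
        # The algorithm must be at least as fast as the data arrives.
        report["keeps_up"] = rate >= arrival_rate
    return report
```

A call such as `profile_stream(my_stream, model.learn, arrival_rate=10_000)` would then show whether a learner (here a hypothetical `model`) sustains the expected input rate and how much memory it needed at its peak.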
  • 20. Apache Kafka • A high-performance distributed publish-subscribe messaging system. • Designed for processing real-time activity stream data. • Initially developed at LinkedIn, now part of Apache. • Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data. • Fast, scalable, durable, fault-tolerant. Courtesy: https://www.tutorialspoint.com/apache_kafka/
  • 21. Apache Storm • A highly distributed real-time computation system. • Originally developed at BackType, which was acquired by Twitter. • Twitter claims, “Over a million tuples processed per second per node.” • Fast, scalable, reliable and fault-tolerant. • Stream: unbounded sequence of tuples. • Primitives: – Spouts: pull messages – Bolts: perform the core functions of stream computing. Courtesy: http://www.tutorialspoint.com/apache_storm/
  • 22. Spark Streaming • Spark Streaming uses micro-batching to support continuous stream processing. • It is an extension of Spark, which is a batch-processing system. • Developed in the AMPLab at UC Berkeley. • In-memory computing capabilities deliver speed. • Low latency • High throughput • Fault tolerant • New programming model: – Discretized Streams (DStreams) – Resilient Distributed Datasets (RDDs). Courtesy: http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
  • 23. Spring XD • Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export. • The Spring XD framework supports streams for the ingestion of event-driven data from a source to a sink, passing through any number of processors. Courtesy: https://github.com/spring-projects/spring-xd/wiki/About-Spring-XD
  • 24. Comparison of Tools. Courtesy: https://www.slideshare.net/kamalika1912/big-data-analytics-for-real-time-systems
  • 25. Comparison of Tools cont..
  • 26. Commercial Stream Processing Frameworks • Google Cloud Dataflow. Courtesy: https://cloud.google.com/dataflow/
  • 27. Commercial Stream Processing Frameworks cont.. • Azure Stream Analytics. Courtesy: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
  • 30. Lambda Architecture cont.. Courtesy: https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
      A. All data is sent to both the batch layer and the speed layer.
      B. The master data set is an immutable, append-only set of data.
      C. The batch layer pre-computes query functions from scratch; the results are called batch views. The batch layer constantly re-computes the batch views.
      D. Batch views are indexed and stored in a scalable database so that particular values can be retrieved very quickly. New batch views are swapped in when they become available.
      E. The speed layer compensates for the high latency of updates to the batch views.
      F. The speed layer uses fast incremental algorithms and read/write databases to produce real-time views.
      G. Queries are resolved by getting results from both the batch views and the real-time views.
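Steps A to G can be sketched in a few lines for a simple per-key counting query. This is a toy, single-process illustration under stated assumptions, not a description of any real framework: the class `LambdaCounts` and its methods are hypothetical, in-memory `Counter` objects stand in for the serving database and the real-time store, and the batch run is assumed to absorb every event ingested so far.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda Architecture answering "how many times was `key` seen?"."""

    def __init__(self):
        self.master = []                # B: immutable, append-only master data set
        self.batch_view = Counter()     # D: indexed batch view (serving layer)
        self.realtime_view = Counter()  # F: incrementally updated real-time view

    def ingest(self, key):
        # A: every record goes to both the batch layer and the speed layer.
        self.master.append(key)
        self.realtime_view[key] += 1    # F: fast incremental update

    def run_batch(self):
        # C: recompute the batch view from scratch over the master data set.
        new_view = Counter(self.master)
        # D: swap in the new batch view once it is available.
        self.batch_view = new_view
        # E: the speed layer only has to cover data the batch view has not
        # absorbed yet, so its state can be discarded after the swap.
        self.realtime_view = Counter()

    def query(self, key):
        # G: merge the results from the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]
```

Between batch runs, `query` stays up to date because recent events are served from the real-time view; after `run_batch`, the same answer comes from the recomputed batch view. In a real deployment the batch run takes time, so the real-time view would only drop events that the finished batch view actually covers.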
  • 32. Lambda Architecture cont.. Example. Courtesy: https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
  • 33. Lambda Architecture: Open Source Frameworks
  • 34. Kappa Architecture. Courtesy: Coursera, course on Cloud Computing Applications
  • 35. Common Real-Time Analytics Use Cases. Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
      • Sales Enrichment: use of real-time events to predict what a consumer is interested in right now
        – Data: current search keywords, transactions, web pages visited, mobility/location, weather, etc.
        – Deliver a relevant coupon before they pass the store
        – Display a relevant advert as they swipe a credit card at the gas pump
        – Deliver a promotion to incentivize a change in behaviour
      • Security/Fraud: use of real-time context to determine whether an action is, or is likely to be, fraudulent
        – Data: store browsing patterns, location, machine/network activity, etc.
        – Determine whether an online session is fraudulent before a purchase transaction is submitted
        – Identify and block a denial-of-service attack before it brings down any system
      • Anomaly Prediction: use of real-time events and context to predict anomalous behaviour before it occurs
        – Data: server logs, system metrics, sensors, etc.
        – Predict a network switch crash, so that all network data prior to the crash can be captured for root cause analysis
        – Predict a black ice or brake failure event in a connected car
        – Detect drilling dysfunction on an oil rig to prevent breakages and lost productivity

Editor's Notes

  1. https://www.digitaldefense.com/a-look-towards-2016-and-dangers-of-the-internet-of-things-iot/
  2. Veracity. IBM coined Veracity as the fourth V, which represents the unreliability inherent in some sources of data. For example, customer sentiments in social media are uncertain in nature, since they entail human judgment. Yet they contain valuable information. Thus the need to deal with imprecise and uncertain data is another facet of big data, which is addressed using tools and analytics developed for management and mining of uncertain data.
     • Variability (and complexity). SAS introduced Variability and Complexity as two additional dimensions of big data. Variability refers to the variation in the data flow rates. Often, big data velocity is not consistent and has periodic peaks and troughs. Complexity refers to the fact that big data are generated through a myriad of sources. This imposes a critical challenge: the need to connect, match, cleanse and transform data received from different sources.
     • Value. Oracle introduced Value as a defining attribute of big data. Based on Oracle's definition, big data are often characterized by relatively “low value density”. That is, the data received in the original form usually has a low value relative to its volume. However, a high value can be obtained by analyzing large volumes of such data.
  3. introduction-to-realtime-data-processing-3-160213152050.pdf
  4. introduction-to-realtime-data-processing-3-160213152050.pdf
  5. Streaming Analytics on AWS Dmitri Tchikatilov AdTech BD, AWS dmitrit@amazon.com
  6. ibm
  7. ibm
  8. Albert Bifet, Jesse Read, Indrė Žliobaitė, Bernhard Pfahringer, and Geoff Holmes
  9. https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
  10. ibm