Introduction to streaming data, the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
1. Performance Metrics for Big Data Systems: Streaming Data Analytics
Faculty Development Program (FDP) on Performance Assessment of Computing Systems,
organized by the Department of CSE, JIIT-62, Noida, 10th - 15th July 2017
Dr. Shikha Mehta
JIIT, Sec 62, Noida
mehtshikha@gmail.com
2. Outline
• Introduction
• What is Data Streaming?
• Data at Rest vs Data in Motion
– Batch Processing vs Stream Processing
• Why Streaming Data Analytics?
– Streaming Data Challenges
• Performance Metrics for Streaming Data
• Technologies for Streaming Data Analytics
• Lambda and Kappa Architectures
• Hype Cycle
4. According to a new International Data Corporation (IDC) Spending Guide,
“worldwide spending on the Internet of Things (IoT) will grow at a 17.0%
compound annual growth rate (CAGR) from $698.6 billion in 2015 to nearly
$1.3 trillion in 2019.”
Courtesy: https://www.digitaldefense.com/a-look-towards-2016-and-dangers-of-the-internet-of-things-iot/
6. Harnessing Big Data: Analytics
Courtesy: http://www.slideshare.net/sajjanvsl/final-presentation-45456729
7. Data at Rest vs Data in Motion
Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
The difference lies in when you analyze your data.

At Rest:
– Data is fixed (a.k.a. bounded)
– Analyzed after the event occurs
– e.g. finding stats about a group in a closed room
– e.g. analyzing last month's sales data to make strategic decisions

In Motion:
– Continuously incoming data (a.k.a. unbounded)
– Analyzed as the event occurs
– e.g. finding stats about a group in a marathon
– e.g. e-commerce order processing
8. What kind of Processing?
Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
wada ⇒ batch (cooked in batches, ahead of demand)
pani puri ⇒ streaming (assembled one plate at a time, on demand)
9. Batch vs Stream Processing cont..
Courtesy: Streaming Analytics on AWS, Dmitri Tchikatilov, AdTech BD, AWS, dmitrit@amazon.com
Data scope:
– Batch: queries or processing over all or most of the data
– Stream: queries or processing over a rolling window or the most recent data record

Data size:
– Batch: large batches of data
– Stream: individual records or micro-batches of a few records

Performance:
– Batch: latencies of minutes to hours
– Stream: requires latency on the order of seconds or milliseconds

Analytics:
– Batch: complex analytics
– Stream: simple response functions, aggregates, and rolling metrics
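To make the contrast concrete, here is a minimal Python sketch (illustrative, not from the slides): the batch job queries all of the bounded data after the fact, while the stream processor maintains a rolling aggregate over only the most recent records of an unbounded stream.

```python
from collections import deque

# Batch: one query over all (bounded) data, run after the fact.
def batch_average(records):
    return sum(records) / len(records)

# Stream: a rolling aggregate over the most recent N records of an
# unbounded stream, updated as each event arrives.
class RollingAverage:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def update(self, value):
        self.window.append(value)  # oldest value falls out automatically
        return sum(self.window) / len(self.window)

print(batch_average([4, 8, 15, 16]))  # 10.75, over the full batch
roller = RollingAverage(window_size=3)
for value in [4, 8, 15, 16]:
    print(roller.update(value))       # 4.0, 6.0, 9.0, 13.0
```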
10. What is Stream Processing?
• Imagine you are browsing the web:
• If you see an advert on a page, an AdViewEvent is generated:
{UserId, AdId, Timestamp}
• If you click the ad, a second event, an AdClickEvent, is generated:
{UserId, AdId, Timestamp}
Courtesy: Coursera, course on Cloud Computing Applications
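A minimal Python sketch of how these two events might be represented and correlated by a stream processor (the dict-based records follow the {UserId, AdId, Timestamp} shape above; the values and matching logic are illustrative assumptions):

```python
# Hypothetical events in the {UserId, AdId, Timestamp} shape from the slide.
events = [
    {"type": "AdViewEvent",  "UserId": 42, "AdId": 7, "Timestamp": 1499700000},
    {"type": "AdClickEvent", "UserId": 42, "AdId": 7, "Timestamp": 1499700004},
]

# A streaming join: remember recent views and match each click to its view,
# processing one event at a time, as it arrives.
pending_views = {}

def on_event(event):
    key = (event["UserId"], event["AdId"])
    if event["type"] == "AdViewEvent":
        pending_views[key] = event["Timestamp"]
    elif event["type"] == "AdClickEvent" and key in pending_views:
        delay = event["Timestamp"] - pending_views.pop(key)
        print(f"user {key[0]} clicked ad {key[1]} after {delay}s")

for e in events:
    on_event(e)   # -> user 42 clicked ad 7 after 4s
```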
12. Stream Processing Cont..
• Data streams: a continuous flow of data generated at high speed in
dynamic, time-changing environments.
• We need to maintain decision models in real time.
• Decision models must be capable of:
– incorporating new information at the speed the data arrives;
– detecting changes and adapting the decision model to the most recent
information;
– forgetting outdated information.
• In theory: unbounded training sets, dynamic models.
• In practice: finite training sets, static models.
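One simple way to meet these requirements is an estimator with exponential forgetting: each new example updates the model in O(1) time and memory, and old information decays away geometrically. A minimal illustrative sketch; the forgetting rate alpha is an assumed parameter, not something from the slides.

```python
class ForgettingMean:
    """Running estimate that weights recent data more heavily."""

    def __init__(self, alpha=0.1):   # alpha: assumed forgetting rate
        self.alpha = alpha
        self.estimate = None

    def update(self, x):
        if self.estimate is None:
            self.estimate = x
        else:
            # New data is incorporated immediately; old data fades
            # geometrically, so the model tracks a changing stream.
            self.estimate = (1 - self.alpha) * self.estimate + self.alpha * x
        return self.estimate

model = ForgettingMean(alpha=0.2)
for x in [10, 10, 10, 50, 50, 50]:    # concept drifts from 10 to 50
    print(round(model.update(x), 1))  # estimate drifts toward 50
```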
13. Stream Processing Cont..
Courtesy: ECML PKDD 2015 slides
1. One example at a time, used at most once
2. Limited memory
3. Limited time
4. Anytime prediction
How to evaluate decision models that evolve over time?
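A standard answer in the stream-mining literature is prequential (test-then-train) evaluation: each example is first used to test the current model and only then to train it, so every example is used at most once and accuracy is available at any time. A minimal sketch with a toy model; the predict/learn interface and the example stream are assumptions for illustration.

```python
from collections import Counter

class MajorityClass:
    """Toy incremental model: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(model, stream):
    """Test-then-train: score each example before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += (model.predict(x) == y)  # 1. test on the incoming example
        model.learn(x, y)                   # 2. then train; each example used once
        total += 1
        yield correct / total               # anytime estimate of accuracy so far

stream = [(None, "a"), (None, "a"), (None, "b"), (None, "a")]
print(list(prequential_accuracy(MajorityClass(), stream)))
# [0.0, 0.5, 0.333..., 0.5]
```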
14. Why Streaming Analytics?
Value Creation, Cost and the Challenge
• It is not cost-effective to store all data, especially if it is of low
value or its value is yet to be determined (noise).
• But it is highly valuable to inspect/analyze all the data, to separate
the signal from the noise and determine what needs to be persisted.
• There is value in identifying the signal in the past through offline
analysis (which is still required), but by then you have lost the chance
to act in the moment.
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
15. Top Client Challenges
• 80% of data is unstructured. Existing analytics cannot analyze streaming
data such as video, audio, text and sensor data.
• Too much noise, too much low-value data. How to pre-process all data on
the fly (megabytes to petabytes) and keep only what is required/valuable?
Remember: more data means more cost and compliance pain.
• Data volumes double every year. Too much to store and then analyze. How
to analyze now, before the data is gone forever?
• Dashboard overload. Too much history and not enough future prediction.
How to get ahead, plan and predict rather than react?
• Sometimes one minute is too late. How to quickly process, analyze and
act on perishable data to lower costs, not just batch/historical data?
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
16. Major Research challenges in
Streaming Data Analytics:
1. Concept Drift
2. Classification of stream data
3. Pre-processing of streams
4. Performance evaluation parameters for
stream data mining processes
5. Protecting data privacy
Courtesy: Krempl, Georg, et al. "Open challenges for data stream mining research." ACM SIGKDD Explorations Newsletter 16.1 (2014).
17. Performance Metrics for stream data
mining processes
[1] Bifet A., Read J., Žliobaitė I., Pfahringer B., Holmes G. (2013) Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. In: Blockeel H., Kersting K., Nijssen S., Železný F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg.
[2] Mingzhou Song, Lin Zhang. Comparison of Cluster Representations from Partial Second- to Full Fourth-Order Cross Moments for Data Stream Clustering. ICDM '08: Eighth IEEE International Conference on Data Mining, 2008.
Classification:
– Kappa statistic [1]: assesses performance on imbalanced data streams;
a higher value means better performance.
– Temporal Kappa statistic [1]: assesses performance on temporally
dependent data streams; a negative value means worse performance.

Clustering:
– Completeness [2]: measures whether instances of the same class fall in
the same cluster; a higher value means better clustering.
– Purity [2]: assesses the purity of the clusters in terms of containing
same-class instances; a higher value means better clustering.
– SSQ [2]: measures cluster cohesiveness; a lower value means better
performance.
– Silhouette coefficient [2]: assesses compactness as well as separation
of clusters; a higher value means better clustering.
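As a concrete example of the first row, the Kappa statistic compares the classifier's observed accuracy p0 against the accuracy pc a chance classifier would achieve: kappa = (p0 - pc) / (1 - pc). A minimal sketch computing it from a confusion matrix, using the standard definition; the helper and the example matrix are invented for illustration, not taken from [1] or [2].

```python
def kappa_statistic(confusion):
    """kappa = (p0 - pc) / (1 - pc): p0 is observed accuracy,
    pc the accuracy a chance classifier would achieve."""
    n = sum(sum(row) for row in confusion)
    p0 = sum(confusion[i][i] for i in range(len(confusion))) / n
    # pc: agreement expected by chance, from row/column marginals.
    pc = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(len(confusion))
    )
    return (p0 - pc) / (1 - pc)

# Rows = true class, columns = predicted class (illustrative counts).
print(kappa_statistic([[40, 10],
                       [5, 45]]))   # 0.7: much better than chance
```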
18. Performance Metrics for stream data mining processes cont..
• Loss: measures how well the current model matches the actual state of
the process generating the data.
• Memory used: learning algorithms run in fixed memory, so we need to
evaluate memory usage over time and the impact on accuracy of working
within the available memory.
• Speed of processing examples: algorithms must process examples at least
as fast as they arrive (see the sketch below).
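A minimal sketch of how the last two metrics might be tracked for an incremental learner, using only the Python standard library; the model's learn interface is an assumption carried over from the earlier sketches.

```python
import time
import tracemalloc

def profile_stream(model, stream):
    """Track processing speed and peak memory while a model consumes a stream."""
    tracemalloc.start()
    start, n = time.perf_counter(), 0
    for x, y in stream:
        model.learn(x, y)   # one pass, one example at a time
        n += 1
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()   # (current, peak) bytes
    tracemalloc.stop()
    print(f"{n / elapsed:.0f} examples/s, peak memory {peak / 1024:.1f} KiB")
```

Comparing examples/s against the stream's arrival rate directly tests the "at least as fast as they arrive" requirement.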
20. Apache Kafka
• A high-performance distributed publish-subscribe messaging system.
• Designed for processing real-time activity-stream data.
• Initially developed at LinkedIn, now part of Apache.
• Kafka works in combination with Apache Storm, Apache HBase and Apache
Spark for real-time analysis and rendering of streaming data.
• Fast, scalable, durable and fault-tolerant.
Courtesy: https://www.tutorialspoint.com/apache_kafka/
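A minimal publish-subscribe sketch assuming the third-party kafka-python client and a broker running at localhost:9092; the topic name is hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: publish activity-stream events to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ad-events", b'{"UserId": 42, "AdId": 7}')
producer.flush()   # block until the message is actually sent

# Consumer: subscribe to the topic and process records as they arrive.
consumer = KafkaConsumer("ad-events", bootstrap_servers="localhost:9092")
for record in consumer:
    print(record.value)   # each record's payload is delivered as raw bytes
```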
21. Apache Storm
• A distributed real-time computation system.
• Acquired by Twitter.
• Twitter claims, “Over a million tuples processed per second per node.”
• Fast, scalable, reliable and fault-tolerant.
• Stream: an unbounded sequence of tuples.
• Primitives:
– Spouts: pull messages from sources
– Bolts: perform the core functions of stream computing
Courtesy: http://www.tutorialspoint.com/apache_storm/
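To illustrate the spout/bolt division of labour, here is a plain-Python conceptual sketch of a word-count topology. This is not Storm's actual API (real topologies are typically written in Java or via multi-language adapters); generators simply stand in for the tuple-by-tuple flow.

```python
# Conceptual sketch of Storm's primitives; not Storm's real API.
def sentence_spout():
    """Spout: pulls messages from a source and emits tuples."""
    for sentence in ["to be or not to be", "streams of tuples"]:
        yield sentence

def split_bolt(sentences):
    """Bolt: core stream computation; transforms incoming tuples."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Bolt: stateful aggregation over the stream."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Topology: spout -> split bolt -> count bolt, one tuple at a time.
for word, count in count_bolt(split_bolt(sentence_spout())):
    print(word, count)
```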
22. Spark Streaming
• Spark Streaming uses micro-batching to support continuous stream
processing.
• It is an extension of Spark, which is a batch-processing system.
• Developed in the AMPLab at UC Berkeley.
• In-memory computing capabilities deliver speed: low latency and high
throughput.
• Fault tolerant.
• New programming model:
– Discretized streams (DStreams)
– Resilient Distributed Datasets (RDDs)
Courtesy: http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
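The canonical DStream example from the linked programming guide: counting words from a socket in one-second micro-batches. It assumes a text server on localhost:9999 (for example, started with `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)                  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                # print each micro-batch's counts

ssc.start()                                    # start the computation
ssc.awaitTermination()                         # wait for it to terminate
```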
23. Spring XD
• Spring XD is a unified, distributed, and extensible system for data
ingestion, real-time analytics, batch processing, and data export.
• The Spring XD framework supports streams for the ingestion of
event-driven data from a source to a sink, passing through any number of
processors.
Courtesy: https://github.com/spring-projects/spring-xd/wiki/About-Spring-XD
30. Lambda Architecture cont..
A. All data is sent to both the batch and the speed layer.
B. The master data set is an immutable, append-only set of data.
C. The batch layer pre-computes query functions from scratch; the results
are called batch views. The batch layer constantly re-computes the batch
views.
D. Batch views are indexed and stored in a scalable database so that
particular values can be fetched very quickly; new batch views are
swapped in as they become available.
E. The speed layer compensates for the high latency of updates to the
batch views.
F. It uses fast incremental algorithms and read/write databases to produce
real-time views.
G. Queries are resolved by combining results from both the batch and
real-time views (see the sketch below).
Courtesy: https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
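A conceptual Python sketch of step G: a query merges the pre-computed batch view with the speed layer's incremental real-time view. The key name and the dict-backed views are assumptions for illustration; in a real system each view lives in its own store.

```python
# Illustrative views; in practice these live in separate databases.
batch_view    = {"page_views:home": 10_000}  # pre-computed up to the last batch run
realtime_view = {"page_views:home": 42}      # increments since that batch run

def query(key):
    """Resolve a query by combining batch and real-time views (step G)."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_views:home"))   # 10042: a complete, low-latency answer
```

When the batch layer finishes its next run, the recomputed batch view absorbs the recent increments and the real-time view is reset, so the merged answer stays correct.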
35. Common Real-Time Analytics Use Cases
• Sales Enrichment - use of real-time events to predict what a consumer is
interested in right now.
– Data: current search keywords, transactions, web pages visited,
mobility/location, weather, etc.
– Deliver a relevant coupon before they pass the store.
– Display a relevant advert as they swipe a credit card at the gas pump.
– Deliver a promotion to incentivize a change in behaviour.
• Security/Fraud - use of real-time context to determine whether an action
is, or is likely to be, fraudulent.
– Data: store browsing patterns, location, machine/network activity, etc.
– Determine if an online session is fraudulent before a purchase
transaction is submitted.
– Identify and block a denial-of-service attack before it brings down any
system.
• Anomaly Prediction - use of real-time events and context to predict
anomalous behaviour before it occurs.
– Data: server logs, system metrics, sensors, etc.
– Predict a network switch crash, allowing full capture of all network
data prior to the crash for root-cause analysis.
– Predict a black-ice or brake-failure event in a connected car.
– Detect drilling dysfunction on an oil rig to prevent breakages and lost
productivity.
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
• Veracity. IBM coined veracity as the fourth V, which represents the unreliability inherent in some sources of data. For example, customer sentiments in social media are uncertain in nature, since they entail human judgment; yet they contain valuable information. The need to deal with imprecise and uncertain data is thus another facet of big data, addressed using tools and analytics developed for the management and mining of uncertain data.
• Variability (and complexity). SAS introduced variability and complexity as two additional dimensions of big data. Variability refers to variation in data flow rates: big data velocity is often inconsistent, with periodic peaks and troughs. Complexity refers to the fact that big data are generated through a myriad of sources, which imposes a critical challenge: the need to connect, match, cleanse and transform data received from different sources.
• Value. Oracle introduced value as a defining attribute of big data. By Oracle's definition, big data are often characterized by relatively “low value density”: the data received in its original form usually has low value relative to its volume, but high value can be obtained by analyzing large volumes of such data.