Introduction to streaming data, the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
1. Performance Metrics for Big Data Systems: Streaming Data Analytics
Faculty Development Program (FDP) on Performance Assessment of Computing Systems,
organized by the Department of CSE, JIIT-62, Noida, 10th - 15th July 2017
Dr. Shikha Mehta
JIIT, Sec 62, Noida
mehtshikha@gmail.com
2. Outline
• Introduction
• What is Data Streaming?
• Data at Rest vs Data in Motion
– Batch Processing vs Stream Processing
• Why Streaming Data Analytics?
– Streaming Data Challenges
• Performance Metrics for Streaming Data
• Technologies for Streaming Data Analytics
• Lambda and Kappa Architectures
• Hype Cycle
4. According to a new International Data Corporation (IDC) Spending Guide,
“worldwide spending on the Internet of Things (IoT) will grow at a 17.0%
compound annual growth rate (CAGR) from $698.6 billion in 2015 to nearly
$1.3 trillion in 2019.”
Courtesy: https://www.digitaldefense.com/a-look-towards-2016-and-dangers-of-the-internet-of-things-iot/
6. Harnessing Big Data: Analytics
Courtesy: http://www.slideshare.net/sajjanvsl/final-presentation-45456729
7. Data at Rest vs Data in Motion
Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
The difference lies in when you analyze your data.

At Rest:
– Data is fixed (a.k.a. bounded)
– Analyzed after the event occurs
– e.g. finding stats about a group in a closed room
– e.g. analyzing last month's sales data to make strategic decisions

In Motion:
– Continuously incoming data (a.k.a. unbounded)
– Analyzed as the event occurs
– e.g. finding stats about a group in a marathon
– e.g. e-commerce order processing
8. What kind of Processing?
Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
wada ⇒ batch (cooked in batches, ahead of demand)
pani puri ⇒ streaming (assembled one plate at a time, on demand)
9. Batch vs Stream Processing cont..
Courtesy: Streaming Analytics on AWS, Dmitri Tchikatilov, AdTech BD, AWS, dmitrit@amazon.com
Data scope:
– Batch: queries or processing over all or most of the data
– Stream: queries or processing over a rolling window or the most recent data record

Data size:
– Batch: large batches of data
– Stream: individual records or micro-batches of a few records

Performance:
– Batch: latencies of minutes to hours
– Stream: requires latency on the order of seconds or milliseconds

Analytics:
– Batch: complex analytics
– Stream: simple response functions, aggregates, and rolling metrics
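To make the contrast concrete, here is a minimal Python sketch (illustrative, not from the slides): the batch job queries all of the bounded data after the fact, while the stream processor maintains a rolling aggregate over only the most recent records of an unbounded stream.

```python
from collections import deque

# Batch: one query over all (bounded) data, run after the fact.
def batch_average(records):
    return sum(records) / len(records)

# Stream: a rolling aggregate over the most recent N records of an
# unbounded stream, updated as each event arrives.
class RollingAverage:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def update(self, value):
        self.window.append(value)  # oldest value falls out automatically
        return sum(self.window) / len(self.window)

print(batch_average([4, 8, 15, 16]))  # 10.75, over the full batch
roller = RollingAverage(window_size=3)
for value in [4, 8, 15, 16]:
    print(roller.update(value))       # 4.0, 6.0, 9.0, 13.0
```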
10. What is Stream Processing?
• Imagine you are browsing the web:
• If you see an advert on a page, an AdViewEvent is generated:
{UserId, AdId, Timestamp}
• If you click the ad, a second event, an AdClickEvent, is generated:
{UserId, AdId, Timestamp}
Courtesy: Coursera, course on Cloud Computing Applications
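A minimal Python sketch of how these two events might be represented and correlated by a stream processor (the dict-based records follow the {UserId, AdId, Timestamp} shape above; the values and matching logic are illustrative assumptions):

```python
# Hypothetical events in the {UserId, AdId, Timestamp} shape from the slide.
events = [
    {"type": "AdViewEvent",  "UserId": 42, "AdId": 7, "Timestamp": 1499700000},
    {"type": "AdClickEvent", "UserId": 42, "AdId": 7, "Timestamp": 1499700004},
]

# A streaming join: remember recent views and match each click to its view,
# processing one event at a time, as it arrives.
pending_views = {}

def on_event(event):
    key = (event["UserId"], event["AdId"])
    if event["type"] == "AdViewEvent":
        pending_views[key] = event["Timestamp"]
    elif event["type"] == "AdClickEvent" and key in pending_views:
        delay = event["Timestamp"] - pending_views.pop(key)
        print(f"user {key[0]} clicked ad {key[1]} after {delay}s")

for e in events:
    on_event(e)   # -> user 42 clicked ad 7 after 4s
```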
12. Stream Processing Cont..
• Data streams: a continuous flow of data generated at high speed in
dynamic, time-changing environments.
• We need to maintain decision models in real time.
• Decision models must be capable of:
– incorporating new information at the speed the data arrives;
– detecting changes and adapting the decision model to the most recent
information;
– forgetting outdated information.
• In theory: unbounded training sets, dynamic models.
• In practice: finite training sets, static models.
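One simple way to meet these requirements is an estimator with exponential forgetting: each new example updates the model in O(1) time and memory, and old information decays away geometrically. A minimal illustrative sketch; the forgetting rate alpha is an assumed parameter, not something from the slides.

```python
class ForgettingMean:
    """Running estimate that weights recent data more heavily."""

    def __init__(self, alpha=0.1):   # alpha: assumed forgetting rate
        self.alpha = alpha
        self.estimate = None

    def update(self, x):
        if self.estimate is None:
            self.estimate = x
        else:
            # New data is incorporated immediately; old data fades
            # geometrically, so the model tracks a changing stream.
            self.estimate = (1 - self.alpha) * self.estimate + self.alpha * x
        return self.estimate

model = ForgettingMean(alpha=0.2)
for x in [10, 10, 10, 50, 50, 50]:    # concept drifts from 10 to 50
    print(round(model.update(x), 1))  # estimate drifts toward 50
```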
13. Stream Processing Cont..
Courtesy: ECML PKDD 2015 slides
1. One example at a time, used at most once
2. Limited memory
3. Limited time
4. Anytime prediction
How to evaluate decision models that evolve over time?
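A standard answer in the stream-mining literature is prequential (test-then-train) evaluation: each example is first used to test the current model and only then to train it, so every example is used at most once and accuracy is available at any time. A minimal sketch with a toy model; the predict/learn interface and the example stream are assumptions for illustration.

```python
from collections import Counter

class MajorityClass:
    """Toy incremental model: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(model, stream):
    """Test-then-train: score each example before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += (model.predict(x) == y)  # 1. test on the incoming example
        model.learn(x, y)                   # 2. then train; each example used once
        total += 1
        yield correct / total               # anytime estimate of accuracy so far

stream = [(None, "a"), (None, "a"), (None, "b"), (None, "a")]
print(list(prequential_accuracy(MajorityClass(), stream)))
# [0.0, 0.5, 0.333..., 0.5]
```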
14. Why Streaming Analytics?
Value Creation, Cost and the Challenge
• It is not cost-effective to store all data, especially if it is of low
value or its value is yet to be determined (noise).
• But it is highly valuable to inspect/analyze all the data, to separate
the signal from the noise and determine what needs to be persisted.
• There is value in identifying the signal in the past through offline
analysis (which is still required), but by then you have lost the chance
to act in the moment.
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
15. Top Client Challenges
• 80% of data is unstructured. Existing analytics cannot analyze streaming
data such as video, audio, text and sensor data.
• Too much noise, too much low-value data. How to pre-process all data on
the fly (megabytes to petabytes) and keep only what is required/valuable?
Remember: more data means more cost and compliance pain.
• Data volumes double every year. Too much to store and then analyze. How
to analyze now, before the data is gone forever?
• Dashboard overload. Too much history and not enough future prediction.
How to get ahead, plan and predict rather than react?
• Sometimes one minute is too late. How to quickly process, analyze and
act on perishable data to lower costs, not just batch/historical data?
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
16. Major Research challenges in
Streaming Data Analytics:
1. Concept Drift
2. Classification of stream data
3. Pre-processing of streams
4. Performance evaluation parameters for
stream data mining processes
5. Protecting data privacy
Courtesy: Krempl, Georg, et al. "Open challenges for data stream mining research." ACM SIGKDD Explorations Newsletter 16.1 (2014).
17. Performance Metrics for stream data
mining processes
[1] Bifet A., Read J., Žliobaitė I., Pfahringer B., Holmes G. (2013) Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. In: Blockeel H., Kersting K., Nijssen S., Železný F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg.
[2] Mingzhou Song, Lin Zhang. Comparison of Cluster Representations from Partial Second- to Full Fourth-Order Cross Moments for Data Stream Clustering. ICDM '08: Eighth IEEE International Conference on Data Mining, 2008.
Classification:
– Kappa statistic [1]: assesses performance on imbalanced data streams;
a higher value means better performance.
– Temporal Kappa statistic [1]: assesses performance on temporally
dependent data streams; a negative value means worse performance.

Clustering:
– Completeness [2]: measures whether instances of the same class fall in
the same cluster; a higher value means better clustering.
– Purity [2]: assesses the purity of the clusters in terms of containing
same-class instances; a higher value means better clustering.
– SSQ [2]: measures cluster cohesiveness; a lower value means better
performance.
– Silhouette coefficient [2]: assesses compactness as well as separation
of clusters; a higher value means better clustering.
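As a concrete example of the first row, the Kappa statistic compares the classifier's observed accuracy p0 against the accuracy pc a chance classifier would achieve: kappa = (p0 - pc) / (1 - pc). A minimal sketch computing it from a confusion matrix, using the standard definition; the helper and the example matrix are invented for illustration, not taken from [1] or [2].

```python
def kappa_statistic(confusion):
    """kappa = (p0 - pc) / (1 - pc): p0 is observed accuracy,
    pc the accuracy a chance classifier would achieve."""
    n = sum(sum(row) for row in confusion)
    p0 = sum(confusion[i][i] for i in range(len(confusion))) / n
    # pc: agreement expected by chance, from row/column marginals.
    pc = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(len(confusion))
    )
    return (p0 - pc) / (1 - pc)

# Rows = true class, columns = predicted class (illustrative counts).
print(kappa_statistic([[40, 10],
                       [5, 45]]))   # 0.7: much better than chance
```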
18. Performance Metrics for stream data mining processes cont..
• Loss: measures how well the current model matches the actual state of
the process generating the data.
• Memory used: learning algorithms run in fixed memory, so we need to
evaluate memory usage over time and the impact on accuracy of working
within the available memory.
• Speed of processing examples: algorithms must process examples at least
as fast as they arrive (see the sketch below).
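A minimal sketch of how the last two metrics might be tracked for an incremental learner, using only the Python standard library; the model's learn interface is an assumption carried over from the earlier sketches.

```python
import time
import tracemalloc

def profile_stream(model, stream):
    """Track processing speed and peak memory while a model consumes a stream."""
    tracemalloc.start()
    start, n = time.perf_counter(), 0
    for x, y in stream:
        model.learn(x, y)   # one pass, one example at a time
        n += 1
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()   # (current, peak) bytes
    tracemalloc.stop()
    print(f"{n / elapsed:.0f} examples/s, peak memory {peak / 1024:.1f} KiB")
```

Comparing examples/s against the stream's arrival rate directly tests the "at least as fast as they arrive" requirement.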
20. Apache Kafka
• A high-performance distributed publish-subscribe messaging system.
• Designed for processing real-time activity-stream data.
• Initially developed at LinkedIn, now part of Apache.
• Kafka works in combination with Apache Storm, Apache HBase and Apache
Spark for real-time analysis and rendering of streaming data.
• Fast, scalable, durable and fault-tolerant.
Courtesy: https://www.tutorialspoint.com/apache_kafka/
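A minimal publish-subscribe sketch assuming the third-party kafka-python client and a broker running at localhost:9092; the topic name is hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: publish activity-stream events to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ad-events", b'{"UserId": 42, "AdId": 7}')
producer.flush()   # block until the message is actually sent

# Consumer: subscribe to the topic and process records as they arrive.
consumer = KafkaConsumer("ad-events", bootstrap_servers="localhost:9092")
for record in consumer:
    print(record.value)   # each record's payload is delivered as raw bytes
```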
21. Apache Storm
• A distributed real-time computation system.
• Acquired by Twitter.
• Twitter claims, “Over a million tuples processed per second per node.”
• Fast, scalable, reliable and fault-tolerant.
• Stream: an unbounded sequence of tuples.
• Primitives:
– Spouts: pull messages from sources
– Bolts: perform the core functions of stream computing
Courtesy: http://www.tutorialspoint.com/apache_storm/
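To illustrate the spout/bolt division of labour, here is a plain-Python conceptual sketch of a word-count topology. This is not Storm's actual API (real topologies are typically written in Java or via multi-language adapters); generators simply stand in for the tuple-by-tuple flow.

```python
# Conceptual sketch of Storm's primitives; not Storm's real API.
def sentence_spout():
    """Spout: pulls messages from a source and emits tuples."""
    for sentence in ["to be or not to be", "streams of tuples"]:
        yield sentence

def split_bolt(sentences):
    """Bolt: core stream computation; transforms incoming tuples."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Bolt: stateful aggregation over the stream."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Topology: spout -> split bolt -> count bolt, one tuple at a time.
for word, count in count_bolt(split_bolt(sentence_spout())):
    print(word, count)
```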
22. Spark Streaming
• Spark Streaming uses micro-batching to support continuous stream
processing.
• It is an extension of Spark, which is a batch-processing system.
• Developed in the AMPLab at UC Berkeley.
• In-memory computing capabilities deliver speed: low latency and high
throughput.
• Fault tolerant.
• New programming model:
– Discretized streams (DStreams)
– Resilient Distributed Datasets (RDDs)
Courtesy: http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
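The canonical DStream example from the linked programming guide: counting words from a socket in one-second micro-batches. It assumes a text server on localhost:9999 (for example, started with `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)                  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                # print each micro-batch's counts

ssc.start()                                    # start the computation
ssc.awaitTermination()                         # wait for it to terminate
```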
23. Spring XD
• Spring XD is a unified, distributed, and extensible system for data
ingestion, real-time analytics, batch processing, and data export.
• The Spring XD framework supports streams for the ingestion of
event-driven data from a source to a sink, passing through any number of
processors.
Courtesy: https://github.com/spring-projects/spring-xd/wiki/About-Spring-XD
30. Lambda Architecture cont..
A. All data is sent to both the batch and the speed layer.
B. The master data set is an immutable, append-only set of data.
C. The batch layer pre-computes query functions from scratch; the results
are called batch views. The batch layer constantly re-computes the batch
views.
D. Batch views are indexed and stored in a scalable database so that
particular values can be fetched very quickly; new batch views are
swapped in as they become available.
E. The speed layer compensates for the high latency of updates to the
batch views.
F. It uses fast incremental algorithms and read/write databases to produce
real-time views.
G. Queries are resolved by combining results from both the batch and
real-time views (see the sketch below).
Courtesy: https://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
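A conceptual Python sketch of step G: a query merges the pre-computed batch view with the speed layer's incremental real-time view. The key name and the dict-backed views are assumptions for illustration; in a real system each view lives in its own store.

```python
# Illustrative views; in practice these live in separate databases.
batch_view    = {"page_views:home": 10_000}  # pre-computed up to the last batch run
realtime_view = {"page_views:home": 42}      # increments since that batch run

def query(key):
    """Resolve a query by combining batch and real-time views (step G)."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_views:home"))   # 10042: a complete, low-latency answer
```

When the batch layer finishes its next run, the recomputed batch view absorbs the recent increments and the real-time view is reset, so the merged answer stays correct.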
35. Common Real-Time Analytics Use Cases
• Sales Enrichment - use of real-time events to predict what a consumer is
interested in right now.
– Data: current search keywords, transactions, web pages visited,
mobility/location, weather, etc.
– Deliver a relevant coupon before they pass the store.
– Display a relevant advert as they swipe a credit card at the gas pump.
– Deliver a promotion to incentivize a change in behaviour.
• Security/Fraud - use of real-time context to determine whether an action
is, or is likely to be, fraudulent.
– Data: store browsing patterns, location, machine/network activity, etc.
– Determine if an online session is fraudulent before a purchase
transaction is submitted.
– Identify and block a denial-of-service attack before it brings down any
system.
• Anomaly Prediction - use of real-time events and context to predict
anomalous behaviour before it occurs.
– Data: server logs, system metrics, sensors, etc.
– Predict a network switch crash, allowing full capture of all network
data prior to the crash for root-cause analysis.
– Predict a black-ice or brake-failure event in a connected car.
– Detect drilling dysfunction on an oil rig to prevent breakages and lost
productivity.
Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
• Veracity. IBM coined veracity as the fourth V, which represents the unreliability inherent in some sources of data. For example, customer sentiments in social media are uncertain in nature, since they entail human judgment; yet they contain valuable information. The need to deal with imprecise and uncertain data is thus another facet of big data, addressed using tools and analytics developed for the management and mining of uncertain data.
• Variability (and complexity). SAS introduced variability and complexity as two additional dimensions of big data. Variability refers to variation in data flow rates: big data velocity is often inconsistent, with periodic peaks and troughs. Complexity refers to the fact that big data are generated through a myriad of sources, which imposes a critical challenge: the need to connect, match, cleanse and transform data received from different sources.
• Value. Oracle introduced value as a defining attribute of big data. By Oracle's definition, big data are often characterized by relatively “low value density”: the data received in its original form usually has low value relative to its volume, but high value can be obtained by analyzing large volumes of such data.