Shikha fdp 62_14july2017


Introduction to streaming data, the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.

Published in: Data & Analytics

  1. 1. Performance Metrics for Big Data Systems: Streaming Data Analytics. Faculty Development Program (FDP) on Performance Assessment of Computing Systems, organized by the Department of CSE, JIIT-62, Noida, from 10th to 15th July 2017. Dr. Shikha Mehta, JIIT, Sec 62, Noida
  2. 2. Outline
     • Introduction
     • What is Data Streaming?
     • Data at Rest vs Data in Motion
       – Batch Processing vs Stream Processing
     • Why Streaming Data Analytics?
       – Streaming Data Challenges
     • Performance Metrics for Streaming Data
     • Technologies for Streaming Data Analytics
     • Lambda and Kappa Architecture
     • Hype Cycle
  4. 4. According to a new International Data Corporation (IDC) Spending Guide, “worldwide spending on the Internet of Things (IoT) will grow at a 17.0% compound annual growth rate (CAGR) from $698.6 billion in 2015 to nearly $1.3 trillion in 2019.”
  6. 6. Harnessing Big Data: Analytics
  7. 7. Data at Rest vs Data in Motion
     • Data: fixed, a.k.a. bounded (at rest); continuously incoming, a.k.a. unbounded (in motion).
     • The difference lies in when you analyze your data: after the event occurs (at rest) or as the event occurs (in motion).
     • Finding stats about a group in a closed room (at rest) vs finding stats about a group in a marathon (in motion).
     • Analyzing sales data for the last month to make strategic decisions (at rest) vs e-commerce order processing (in motion).
     Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
  8. 8. What Kind of Processing?
     • Wada ⇒ batch
     • Pani puri ⇒ streaming
     Courtesy: introduction-to-realtime-data-processing-3-160213152050.pdf
  9. 9. Batch vs Stream Processing (cont.)
     • Data scope: batch processing runs queries or processing over all or most of the data; stream processing runs them over a rolling window or over the most recent data record.
     • Data size: large batches of data (batch); individual records or micro-batches of a few records (stream).
     • Performance: latencies of minutes to hours (batch); requires latency on the order of seconds or milliseconds (stream).
     • Analytics: complex analytics (batch); simple response functions, aggregates, and rolling metrics (stream).
     Courtesy: Streaming Analytics on AWS, Dmitri Tchikatilov, AdTech BD, AWS
  10. 10. What is Stream Processing?
      • Imagine you are browsing:
        – If you see an advert on a page, there will be an AdViewEvent: {UserId, AdId, Timestamp}
        – If you click the ad, there will be another event, an AdClickEvent: {UserId, AdId, Timestamp}
      Courtesy: Coursera, course on Cloud Computing Applications
  11. 11. Stream Processing (cont.)
      • Which is the most effective ad during the last hour?
      Courtesy: Coursera, course on Cloud Computing Applications
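The question on this slide can be sketched as a sliding-window computation over the two event types. The following is only an illustration under stated assumptions: "effective" is taken to mean the highest click-through rate (clicks divided by views), events are assumed to arrive in timestamp order, timestamps are in seconds, and the class and method names are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Optional, Tuple

WINDOW_SECONDS = 3600  # "last hour"; an assumed window length


class AdEffectiveness:
    """Tracks views and clicks per ad over a sliding time window (hypothetical helper)."""

    def __init__(self, window: int = WINDOW_SECONDS) -> None:
        self.window = window
        # Each entry is (timestamp, kind, ad_id), oldest first.
        self.events: Deque[Tuple[int, str, str]] = deque()
        self.views: Counter = Counter()
        self.clicks: Counter = Counter()

    def _expire(self, now: int) -> None:
        # Drop events that fell out of the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window:
            _, kind, ad_id = self.events.popleft()
            counter = self.views if kind == "view" else self.clicks
            counter[ad_id] -= 1

    def observe(self, kind: str, user_id: str, ad_id: str, timestamp: int) -> None:
        # kind is "view" (AdViewEvent) or "click" (AdClickEvent).
        # user_id is part of the event schema but is not needed for this metric.
        self._expire(timestamp)
        self.events.append((timestamp, kind, ad_id))
        counter = self.views if kind == "view" else self.clicks
        counter[ad_id] += 1

    def most_effective(self, now: int) -> Optional[str]:
        # Click-through rate per ad, restricted to ads viewed inside the window.
        self._expire(now)
        rates = {ad: self.clicks[ad] / n for ad, n in self.views.items() if n > 0}
        return max(rates, key=rates.get) if rates else None
```

Because only the events inside the window are kept, memory grows with the event rate per hour rather than with the total length of the stream.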
  12. 12. Stream Processing (cont.)
      • Data streams: continuous flows of data generated at high speed in dynamic, time-changing environments.
      • We need to maintain decision models in real time.
      • Decision models must be capable of:
        – incorporating new information at the speed at which data arrives;
        – detecting changes and adapting the decision models to the most recent information;
        – forgetting outdated information.
      • Unbounded training sets, dynamic models.
      • In practice: finite training sets, static models.
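One simple way to picture "incorporating new information" and "forgetting outdated information" is an exponentially weighted (fading) mean, compared with a plain running mean that never forgets. This is a minimal sketch, not a method taken from the slides; the class names and the fading factor `alpha` are assumptions chosen for illustration.

```python
from typing import Optional


class RunningMean:
    """Plain cumulative mean: every past example keeps the same weight forever."""

    def __init__(self) -> None:
        self.n = 0
        self.value = 0.0

    def update(self, x: float) -> float:
        self.n += 1
        self.value += (x - self.value) / self.n
        return self.value


class FadingMean:
    """Exponentially weighted mean: old examples fade out at rate alpha."""

    def __init__(self, alpha: float = 0.1) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x  # first example initializes the estimate
        else:
            # Move a fraction alpha of the way toward the new example.
            self.value += self.alpha * (x - self.value)
        return self.value


if __name__ == "__main__":
    static, fading = RunningMean(), FadingMean(alpha=0.1)
    # The stream changes its level halfway through (a crude stand-in for drift).
    for x in [0.0] * 100 + [10.0] * 100:
        static.update(x)
        fading.update(x)
    print(static.value, fading.value)
```

After the change, the fading estimate moves close to the new level while the running mean stays near the average of both regimes, which is the trade-off between static and dynamic models that the slide points at.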
  13. 13. Stream Processing (cont.)
      1. One example at a time, used at most once
      2. Limited memory
      3. Limited time
      4. Anytime prediction
      • How to evaluate decision models that evolve over time?
      Courtesy: ECML PKDD 2015 slides
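One common answer in the stream-mining literature to the evaluation question above is prequential (interleaved test-then-train) evaluation: each example is first used to test the current model and only then used to train it, so every example is seen once. The sketch below assumes a toy majority-class learner; `MajorityClass` and `prequential_accuracy` are hypothetical names, not part of any library.

```python
from collections import Counter
from typing import Any, Hashable, Iterable, Optional, Tuple


class MajorityClass:
    """Toy incremental learner: always predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, x: Any) -> Optional[Hashable]:
        # Anytime prediction: an answer is available at any point in the stream.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x: Any, y: Hashable) -> None:
        self.counts[y] += 1


def prequential_accuracy(stream: Iterable[Tuple[Any, Hashable]], model: MajorityClass) -> float:
    """Test-then-train: score each example before the model is allowed to learn from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:  # 1) test on the unseen example
            correct += 1
        model.learn(x, y)          # 2) then train on it, exactly once
        total += 1
    return correct / total if total else 0.0
```

For instance, on the label sequence a, a, b, a the learner is wrong on the first example (no data yet) and on b, so the prequential accuracy is 0.5.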
  14. 14. Why Streaming Analytics? Value Creation, Cost and the Challenge
      • It's not cost-effective to store all data, especially if it is of low value or has yet to be deemed of value (noise).
      • But it's highly valuable to inspect and analyze all the data, to separate the signal from the noise or to determine what needs to be persisted.
      • There is value in identifying the signal in the past through offline analysis (which is actually required), but by then you've lost the chance to affect the present.
      Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
  15. 15. Top Client Challenges
      • 80% of data is unstructured. Existing analytics cannot analyze streaming data such as video, acoustic, text, and sensor data.
      • Too much noise. Too much low-value data. How to pre-process all data on the fly (megabytes or petabytes) and keep only what is required or valuable? Remember, more data means more cost and compliance pain.
      • Data volumes double every year. Too much to store and then analyze. How to analyze now, before the data is gone forever?
      • Dashboard overload. Too much history and not enough future prediction. How to get ahead, plan, and predict rather than react?
      • Sometimes 1 minute is too late. How to quickly process, analyze, and act on perishable data to lower costs? Not just batch/historical.
      Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna
  16. 16. Major Research Challenges in Streaming Data Analytics:
      1. Concept drift
      2. Classification of stream data
      3. Pre-processing of streams
      4. Performance evaluation parameters for stream data mining processes
      5. Protecting data privacy
      Courtesy: Krempl, Georg, et al. "Open challenges for data stream mining research." ACM SIGKDD Explorations Newsletter 16.1 (2014).
  17. 17. Performance Metrics for Stream Data Mining Processes
      Classification:
      • Kappa statistic [1]: assesses performance on imbalanced data streams. A higher value means better performance.
      • Temporal kappa statistic [1]: assesses performance on data streams with temporal dependence. A negative value means the classifier performs worse than a baseline that simply predicts the previous label.
      Clustering:
      • Completeness [2]: measures whether instances of the same class fall into the same cluster. A higher value means better clustering.
      • Purity [2]: assesses the purity of the clusters in terms of containing instances of the same class. A higher value means better clustering.
      • SSQ (sum of squared distances) [2]: measures cluster cohesiveness. A lower value means better performance.
      • Silhouette coefficient [2]: assesses compactness as well as separation of clusters. A higher value means better clustering.
      References:
      [1] Bifet A., Read J., Žliobaitė I., Pfahringer B., Holmes G. (2013) Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. In: Blockeel H., Kersting K., Nijssen S., Železný F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg.
      [2] Song M., Zhang L. (2008) Comparison of Cluster Representations from Partial Second- to Full Fourth-Order Cross Moments for Data Stream Clustering. In: ICDM '08, Eighth IEEE International Conference on Data Mining.
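Several of the metrics above are short enough to write out. The sketch below uses the usual textbook definitions, which are assumptions here rather than formulas quoted from the slide: kappa is (p0 - pc) / (1 - pc) with p0 the accuracy and pc the chance agreement; temporal kappa replaces pc with the accuracy of a "persistent" baseline that predicts the previous true label; purity is the fraction of instances in the majority class of their cluster; SSQ is the sum of squared distances to each cluster centroid. All function names are hypothetical, and the kappa functions do not guard against a denominator of zero.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence, Tuple


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Agreement corrected for chance; 0 means no better than a chance classifier."""
    n = len(y_true)
    p0 = accuracy(y_true, y_pred)
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p0 - pc) / (1 - pc)


def kappa_temporal(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Compares against a baseline that predicts the previous true label."""
    # The baseline has no prediction for the first example, so start at the second.
    p0 = accuracy(y_true[1:], y_pred[1:])
    pe = accuracy(y_true[1:], y_true[:-1])
    return (p0 - pe) / (1 - pe)


def purity(classes: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """Fraction of instances that belong to the majority class of their cluster."""
    by_cluster = defaultdict(Counter)
    for y, c in zip(classes, clusters):
        by_cluster[c][y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(classes)


def ssq(points: Sequence[Tuple[float, ...]], clusters: Sequence[Hashable]) -> float:
    """Sum of squared distances from each point to the centroid of its cluster."""
    groups = defaultdict(list)
    for p, c in zip(points, clusters):
        groups[c].append(p)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total
```

As an example of why kappa matters on imbalanced streams: a classifier that always predicts the majority class on nine 0s and one 1 reaches 90% accuracy but a kappa of 0.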
  18. 18. Performance Metrics for Stream Data Mining Processes (cont.)
      • Loss: measures how well the current model fits the actual state of nature.
      • Memory used: learning algorithms run in fixed memory. We need to evaluate memory usage over time, and the impact on accuracy of using the available memory.
      • Speed of processing examples: algorithms must process examples at least as fast as they arrive.
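Memory use and processing speed can be measured with the Python standard library alone. The following is a rough sketch: `measure_stream` and `keeps_up` are hypothetical helper names, `tracemalloc` only sees memory allocated by Python itself, and tracing slows the run down, so the throughput figure should be read as a lower bound.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, Iterable


def measure_stream(process: Callable[[Any], Any], stream: Iterable[Any]) -> Dict[str, float]:
    """Runs process() on every example and reports throughput and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    n = 0
    for example in stream:
        process(example)
        n += 1
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()  # bytes allocated while tracing
    tracemalloc.stop()
    return {
        "examples": n,
        "examples_per_second": n / elapsed if elapsed > 0 else float("inf"),
        "peak_bytes": peak,
    }


def keeps_up(stats: Dict[str, float], arrival_rate: float) -> bool:
    """True if examples are processed at least as fast as they arrive (per second)."""
    return stats["examples_per_second"] >= arrival_rate


if __name__ == "__main__":
    stats = measure_stream(lambda x: x * x, range(100_000))
    print(stats, keeps_up(stats, arrival_rate=1_000.0))
```

Running the same measurement at several points along a long stream shows whether memory stays bounded over time, which is the property the slide asks for.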
  20. 20. Apache Kafka
      • A high-performance, distributed publish-subscribe messaging system.
      • Designed for processing real-time activity stream data.
      • Initially developed at LinkedIn, now part of Apache.
      • Kafka works in combination with Apache Storm, Apache HBase, and Apache Spark for real-time analysis and rendering of streaming data.
      • Fast, scalable, durable, fault-tolerant.
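The publish-subscribe idea can be pictured with a tiny in-memory model: producers append to a per-topic log, and each consumer group keeps its own read position. This is a toy sketch for intuition only. It is not the Kafka API, it has no partitions, persistence, or replication, and `TinyBroker` with its methods is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class TinyBroker:
    """In-memory stand-in for a publish-subscribe log (not a real Kafka client)."""

    def __init__(self) -> None:
        # topic -> append-only list of messages
        self.logs: Dict[str, List[Any]] = defaultdict(list)
        # (topic, consumer group) -> index of the next message to read
        self.offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: Any) -> int:
        """Appends a message and returns its position in the topic log."""
        self.logs[topic].append(message)
        return len(self.logs[topic]) - 1

    def poll(self, topic: str, group: str, max_messages: int = 10) -> List[Any]:
        """Returns the next unread messages for this group and advances its offset."""
        start = self.offsets[(topic, group)]
        batch = self.logs[topic][start:start + max_messages]
        self.offsets[(topic, group)] = start + len(batch)
        return batch


if __name__ == "__main__":
    broker = TinyBroker()
    for event in ["view:adA", "click:adA", "view:adB"]:
        broker.publish("ad-events", event)
    print(broker.poll("ad-events", group="analytics", max_messages=2))
    print(broker.poll("ad-events", group="archiver"))
```

Because messages stay in the log after being read, two independent groups (here "analytics" and "archiver") each receive the full stream at their own pace.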
  21. 21. Apache Storm
      • A highly distributed real-time computation system.
      • Acquired by Twitter.
      • Twitter claims, “Over a million tuples processed per second per node.”
      • Fast, scalable, reliable, and fault-tolerant.
      • Stream: unbounded sequence of tuples.
      • Primitives:
        – Spouts: pull messages
        – Bolts: perform core functions of stream computing
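The spout and bolt roles can be imitated with plain Python generators: a spout emits tuples from a source, and each bolt consumes a stream of tuples and emits a new one. This is a single-process analogy, not the Storm API; the function names and the word-count task are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: pulls messages from a source and emits them as one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(tuples: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count per word and emits the updated count."""
    counts: Counter = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


if __name__ == "__main__":
    topology = count_bolt(split_bolt(sentence_spout(["the cat", "the dog"])))
    for word, count in topology:
        print(word, count)
```

Since generators are lazy, each tuple flows through the whole chain before the next one is pulled, which mirrors record-at-a-time processing of an unbounded stream.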
  22. 22. Spark Streaming
      • Spark Streaming uses micro-batching to support continuous stream processing.
      • It is an extension of Spark, which is a batch-processing system.
      • Was developed in the AMPLab at UC Berkeley.
      • In-memory computing capabilities deliver speed.
      • Low latency
      • High throughput
      • Fault tolerant
      • New programming model:
        – Discretized streams (DStreams)
        – Resilient Distributed Datasets (RDDs)
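Micro-batching means cutting the unbounded stream into small finite batches and running an ordinary batch computation on each one. The sketch below shows the idea in plain Python, not with the Spark API. One simplifying assumption: batches are cut by record count so the example is deterministic, whereas Spark Streaming cuts them by time interval. The function names are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Cuts a (possibly unbounded) stream into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(stream: Iterable[T], batch_size: int,
                   batch_job: Callable[[List[T]], R]) -> Iterator[R]:
    """Applies an ordinary batch function to every micro-batch in turn."""
    for batch in micro_batches(stream, batch_size):
        yield batch_job(batch)


if __name__ == "__main__":
    readings = [3, 5, 4, 8, 6]
    for average in process_stream(readings, 2, lambda b: sum(b) / len(b)):
        print(average)
```

The batch size (or interval) sets the trade-off: smaller batches lower latency, larger batches amortize per-batch overhead and raise throughput.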
  23. 23. Spring XD
      • Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
      • The Spring XD framework supports streams for the ingestion of event-driven data from a source to a sink, passing through any number of processors.
  24. 24. Comparison of Tools
  25. 25. Comparison of Tools (cont.)
  26. 26. Commercial Stream Processing Frameworks
      • Google Dataflow
  27. 27. Commercial Stream Processing Frameworks (cont.)
      • Azure Stream Analytics
  29. 29. Lambda Architecture
  30. 30. Lambda Architecture (cont.)
      A. All data is sent to both the batch layer and the speed layer.
      B. The master data set is an immutable, append-only set of data.
      C. The batch layer pre-computes query functions from scratch; the results are called batch views. The batch layer constantly re-computes the batch views.
      D. Batch views are indexed and stored in a scalable database so that particular values can be retrieved very quickly. New batch views are swapped in when they become available.
      E. The speed layer compensates for the high latency of updates to the batch views.
      F. It uses fast incremental algorithms and read/write databases to produce real-time views.
      G. Queries are resolved by combining results from both the batch views and the real-time views.
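Steps A to G can be traced in a single-process toy that counts events per key. This is a sketch under simplifying assumptions, not a production design: everything lives in memory, the batch job runs synchronously, and the class name `LambdaCounter` is hypothetical.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy Lambda Architecture for per-key event counts."""

    def __init__(self) -> None:
        self.master: List[str] = []             # B: immutable, append-only master data set
        self.batch_view: Counter = Counter()    # C/D: pre-computed batch view
        self.realtime_view: Counter = Counter() # F: incrementally updated real-time view

    def ingest(self, key: str) -> None:
        # A: every event goes to both the batch layer and the speed layer.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self) -> None:
        # C: recompute the batch view from scratch over the whole master data set.
        new_view = Counter(self.master)
        # D: swap in the new batch view once it is available.
        self.batch_view = new_view
        # E: the speed layer now only has to cover events that arrive after this run.
        self.realtime_view = Counter()

    def query(self, key: str) -> int:
        # G: merge the batch view with the real-time view.
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    system = LambdaCounter()
    for key in ["adA", "adA", "adB"]:
        system.ingest(key)
    system.run_batch()
    system.ingest("adA")  # arrives after the batch run; served by the speed layer
    print(system.query("adA"))
```

In a real deployment the batch run takes a long time and new events keep arriving while it executes, so the hand-over between the two views needs more care than the synchronous reset shown here.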
  31. 31. Lambda Architecture (cont.)
  32. 32. Lambda Architecture (cont.): Example
  33. 33. Lambda Architecture: Open Source Frameworks
  34. 34. Kappa Architecture
      Courtesy: Coursera, course on Cloud Computing Applications
  35. 35. Common Real-Time Analytics Use Cases
      • Sales enrichment: use of real-time events to predict what a consumer is interested in right now.
        – Data: current search keywords, transactions, web pages visited, mobility/location, weather, etc.
        – Deliver a relevant coupon before they pass the store.
        – Display a relevant advert as they swipe a credit card at the gas pump.
        – Deliver a promotion to incentivize a change in behaviour.
      • Security/fraud: use of real-time context to determine whether an action is, or is likely to be, fraudulent.
        – Data: store browsing patterns, location, machine/network activity, etc.
        – Determine whether an online session is fraudulent before a purchase transaction is submitted.
        – Identify and block a denial-of-service attack before it brings down any system.
      • Anomaly prediction: use of real-time events and context to predict anomalous behaviour before it occurs.
        – Data: server logs, system metrics, sensors, etc.
        – Predict a network switch crash, so that all network data prior to the crash can be captured for root cause analysis.
        – Predict a black ice or brake failure event in a connected car.
        – Detect drilling dysfunction on an oil rig to prevent breakages and lost productivity.
      Courtesy: IBM Big Data Streaming Analytics, Stewart Hanna