Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Founder | Engineer | Writer at Maolte Technical Solutions Limited
Dec. 15, 2015

  1. Real-time Analytics on High Velocity Streaming Data Guangyu Wu @CeADAR
  2. CeADAR ‣ Application development & proof of concept ‣ Business-value driven ‣ Market pull/need-driven ‣ Website: http://ceadar.ie/ ‣ Diagram labels: University, CeADAR, Enterprise
  3. CeADAR Visualisation & Analytic Interfaces: • ‘Beyond the desktop’ • Ease of interaction • Changing user behaviour • Passive analytics Data Management for Analytics: • Reduce data management effort for analytics • Data validation • Relevance of events to relationships • Data curation (determining useful data) • Adaptive ETL (Extract, Transform, Load) Advanced Analytics: • Causation challenge • Live topic monitoring • Social trending and contextualisation • Continuous analytics • Social identity fingerprinting
  4. Overview ‣ Introduce different frameworks: ‣ Spark, Storm, Trident ‣ Continuous Clustering project ‣ Continuous Metrics project ‣ Stream Converge project
  5. Spark ‣ Spark is a platform for distributed batch data processing. ‣ Spark includes a number of extensions: Spark Streaming, Spark SQL, MLlib, GraphX. ‣ Spark runs batch jobs predominantly in memory. ‣ Spark Streaming integrates stream processing with batch processing by treating a data stream as a sequence of small batches of data points, or micro-batches. ‣ Spark Streaming maintains computation state.
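The micro-batch idea on this slide can be illustrated without Spark itself. The following is a minimal pure-Python sketch, not Spark Streaming's API: `micro_batches` and `running_counts` are hypothetical names, events are assumed to arrive in timestamp order, and the batch interval is an arbitrary illustrative value.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  interval: float) -> Iterator[List[str]]:
    """Cut a stream of (timestamp, value) events into consecutive batches.

    Each batch covers `interval` seconds. Intervals with no events
    produce an empty batch. Assumes timestamps are non-decreasing.
    """
    batch: List[str] = []
    batch_end = None
    for ts, value in events:
        if batch_end is None:
            batch_end = ts + interval      # first batch starts at first event
        while ts >= batch_end:             # close every interval that has passed
            yield batch
            batch = []
            batch_end += interval
        batch.append(value)
    if batch:
        yield batch                        # emit the final partial batch


def running_counts(batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    """Carry per-key counts across batches (state kept between batches)."""
    state: Counter = Counter()
    for batch in batches:
        state.update(batch)                # batch-style update of the state
        yield dict(state)


events = [(0.0, "a"), (0.4, "b"), (1.1, "a"), (3.2, "c")]
for snapshot in running_counts(micro_batches(events, interval=1.0)):
    print(snapshot)
```

Each batch is processed as a small batch job, and the counter plays the role of state that survives from one micro-batch to the next.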
  6. Storm ‣ A Storm topology is comprised of spouts and bolts. ‣ Storm operates over individual data points. ‣ Storm is designed purely for stream processing.
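To make the spout and bolt roles concrete, here is a toy pipeline written with Python generators. It is only an analogy under stated assumptions: it runs in a single process, the function names (`sentence_spout`, `split_bolt`, `count_bolt`) are hypothetical, and none of this is Storm's actual API. What it does show is processing one data point at a time, with each result emitted as soon as its input arrives.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: the source that feeds individual tuples into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: transform each incoming tuple into zero or more output tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: update a running count and emit it for every single word."""
    counts: Dict[str, int] = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


# Wire spout -> bolt -> bolt, analogous to connecting components in a topology.
topology = count_bolt(split_bolt(sentence_spout(["spam spam", "ham spam"])))
for word, count in topology:
    print(word, count)
```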
  7. Trident ‣ Trident is a high level programming abstraction built on top of Storm. ‣ It provides a number of useful functions such as aggregations and filters. ‣ An application can be designed and implemented using these high level abstractions and Trident converts the logic into a standard Storm topology under the hood. ‣ Trident works over micro-batches of data. ‣ Trident also has built-in support for maintaining processing state and state query.
  8. Methodology ‣ Large static batches of messages: Hadoop and off-line batch processing in Spark ‣ Single messages: Storm ‣ Micro-batches of messages (discretised streams): Spark Streaming, Trident
  9. Continuous Clustering ‣ Use case: real-time SMS spam detection in mobile networks. ‣ Clustering SMS messages based on their content is a good way to identify spam. ‣ Many similar spam messages are sent out over a short period of time.
  10. Continuous Clustering ‣ Problem with traditional clustering algorithms… ‣ work off-line over historical data ‣ require multiple passes over the data ‣ not incrementally updatable ‣ are hard to scale to ‘big’ data ‣ CeADAR solution: we developed a novel single pass, scalable data stream clustering algorithm implemented on Storm.
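The slides do not describe the CeADAR algorithm itself, so the sketch below is not that algorithm. It is a generic single-pass "leader" clustering, shown only to illustrate what one pass with incremental updates looks like. The Jaccard similarity over word sets, the `LeaderClusterer` class and the 0.75 default threshold are all assumptions made for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two token sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class LeaderClusterer:
    """Assign each message to the most similar existing cluster, or open a new one.

    Every message is seen exactly once and the model is updated in place,
    so no second pass over historical data is needed.
    """

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold
        self.leaders: List[Set[str]] = []   # one representative token set per cluster
        self.sizes: List[int] = []          # number of messages per cluster

    def add(self, message: str) -> int:
        tokens = set(message.lower().split())
        best_id, best_sim = -1, 0.0
        for i, leader in enumerate(self.leaders):
            sim = jaccard(tokens, leader)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id >= 0 and best_sim >= self.threshold:
            self.sizes[best_id] += 1        # join the closest cluster
            return best_id
        self.leaders.append(tokens)         # otherwise start a new cluster
        self.sizes.append(1)
        return len(self.leaders) - 1


clusterer = LeaderClusterer(threshold=0.75)
for sms in ["Win a FREE prize now", "win a free prize now", "see you at lunch"]:
    print(clusterer.add(sms), sms)
```

The linear scan over all leaders is the obvious scaling bottleneck in this sketch. A distributed version would need some way to partition or index the clusters, which is outside what the slides specify.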
  11. Continuous Clustering
  12. Deployment ‣ Our compute cluster is composed of 4 machines. ‣ Each machine: ‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores ‣ 64 GB memory ‣ 1 TB disk ‣ Spark, Storm, Hadoop, Kafka, Redis
  13. Continuous Clustering ‣ US tier 1 mobile operator ‣ ~500 messages/second average ‣ ~1,300 messages/second peak ‣ Figure labels: near-exact matching: 35,913; matching threshold 75%: 8,160
  14. Continuous Metrics ‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on the task of computing a set of statistical metrics in real-time over a continuous stream of data. ‣ Evaluate and compare: ‣ Throughput: the volume and velocity of data that can be processed on different configurations and hardware. ‣ Latency: the time delay between a new data point being received and the updated metrics being computed.
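The two measures defined on this slide can be computed from per-record timestamps. The sketch below assumes each record carries a receive time and a time at which its updated metrics were available, both in seconds. The function name and record layout are hypothetical and do not come from the evaluation itself.

```python
from typing import Dict, List, Tuple


def throughput_and_latency(records: List[Tuple[float, float]]) -> Dict[str, float]:
    """Summarise a run from (received_at, processed_at) pairs.

    Throughput is records per second over the whole run; latency is the
    per-record delay between receipt and the updated result.
    Assumes `records` is non-empty.
    """
    latencies = [done - received for received, done in records]
    start = min(received for received, _ in records)
    end = max(done for _, done in records)
    span = end - start
    return {
        "throughput_per_s": len(records) / span if span > 0 else float("inf"),
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
    }


run = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.0)]
print(throughput_and_latency(run))
```

With micro-batching, a record that arrives at the start of a batch interval waits for the interval to close before it is processed, so its latency includes that wait on top of the processing time.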
  15. Sliding Windows ‣ By items ‣ By time
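The two window types can be sketched as follows, using a running mean as the example metric. This is a minimal illustration with hypothetical class names, and the time-based window assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Tuple


class CountWindow:
    """Sliding window by items: keeps only the last `size` values."""

    def __init__(self, size: int) -> None:
        self.items: Deque[float] = deque(maxlen=size)  # oldest item drops automatically

    def add(self, value: float) -> float:
        self.items.append(value)
        return sum(self.items) / len(self.items)       # mean over the window


class TimeWindow:
    """Sliding window by time: keeps values newer than `span` seconds."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.items: Deque[Tuple[float, float]] = deque()  # (timestamp, value)

    def add(self, ts: float, value: float) -> float:
        self.items.append((ts, value))
        # Evict everything that fell out of the interval (ts - span, ts].
        while self.items[0][0] <= ts - self.span:
            self.items.popleft()
        return sum(v for _, v in self.items) / len(self.items)


by_items = CountWindow(size=2)
print([by_items.add(v) for v in (1, 3, 5)])

by_time = TimeWindow(span=10)
print([by_time.add(ts, v) for ts, v in ((0, 2), (5, 4), (12, 6))])
```

A window by items holds a fixed number of points regardless of arrival rate, whereas a window by time holds a variable number of points depending on how fast data arrives.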
  16. Continuous Metrics ‣ High level results overview ‣ Spark Streaming achieves the highest throughput, with Storm at the other end with the lowest throughput. ‣ However, Storm achieves the best latency by a considerable margin. Spark and Trident both exhibit considerably higher latency which is due at least in part to their micro-batch data processing approach. ‣ The evaluation produced many other insights, learnings and recommendations relating to these real-time platforms.
  17. Stream Converge ‣ Current project: process and combine heterogeneous data streams from diverse sources using Spark Streaming.
  18. Stream Converge ‣ Challenges: ‣ managing data streams of different frequencies. ‣ linking together events across different streams via complex key relationships. ‣ handling out-of-order arrival of data. ‣ ……
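One common way to cope with out-of-order arrival is to buffer events briefly and release them in timestamp order once a watermark has passed. The slides do not say how Stream Converge handles this, so the following is only a generic sketch: `ReorderBuffer` and `max_delay` are hypothetical names, and this is not a Spark Streaming API.

```python
import heapq
from typing import Any, List, Tuple


class ReorderBuffer:
    """Hold events until they are at least `max_delay` behind the newest timestamp."""

    def __init__(self, max_delay: float) -> None:
        self.max_delay = max_delay
        self.heap: List[Tuple[float, int, Any]] = []  # min-heap ordered by timestamp
        self.max_seen = float("-inf")
        self.seq = 0                                  # tie-breaker for equal timestamps

    def push(self, ts: float, event: Any) -> List[Tuple[float, Any]]:
        """Add one event and return the events that are now safe to emit, in order."""
        self.max_seen = max(self.max_seen, ts)
        heapq.heappush(self.heap, (ts, self.seq, event))
        self.seq += 1
        watermark = self.max_seen - self.max_delay
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            t, _, e = heapq.heappop(self.heap)
            ready.append((t, e))
        return ready

    def flush(self) -> List[Tuple[float, Any]]:
        """Emit everything still buffered, in timestamp order (end of stream)."""
        ready = []
        while self.heap:
            t, _, e = heapq.heappop(self.heap)
            ready.append((t, e))
        return ready


buf = ReorderBuffer(max_delay=2)
for ts, name in [(1, "a"), (3, "c"), (2, "b"), (5, "d")]:
    print(ts, name, "->", buf.push(ts, name))
print("flush ->", buf.flush())
```

The trade-off is between added latency and ordering guarantees. An event that arrives more than `max_delay` behind the newest timestamp is already past the watermark, so this sketch emits it immediately and it may appear after later events.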