Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Stream Processing and Real-Time Data Pipelines


Published on

CS HUG Prague, 28th June 2018

Published in: Technology
  • Be the first to comment

Stream Processing and Real-Time Data Pipelines

  1. 1. © 2017 Hazelcast Inc. Stream processing and real-time data pipelines 1 Vladimir Schreiner | | CS HUG Prague | 29th June 2018 2
  2. 2. © 2017 Hazelcast Inc. What is Hazelcast? HAZELCAST IMDG is a Distributed In-Memory Store Cache, KV-Store, Messaging Implements standard collections API in distributed way Clients in Java, .NET, NodeJS, Go, Pythoon, C++ From 2008 HAZELCAST JET is a general purpose distributed data processing engine built on Hazelcast and based on directed acyclic graphs (DAGs) to model data flow for low latency batch and stream processing. From 2015
  3. 3. © 2017 Hazelcast Inc. Stream Processing Why should I bother?
  4. 4. © 2017 Hazelcast Inc. What is stream processing? Data Processing: Massage the data when moving from place to place. On-Line systems – request/response, small volumes, low-latency Batch Processing – data in / data out, big volumes, huge latency Stream processing – data in / data out, big volumes, low-latency
  5. 5. © 2017 Hazelcast Inc. From Batch to Stream
  6. 6. © 2017 Hazelcast Inc. From Batch to Stream
  7. 7. © 2017 Hazelcast Inc. When to Use Stream Processing • Real-time analytics • Monitoring, Fraud, Anomalies, Pattern detection, Prediction • Event-Driven Architectures • Real-Time ETL • Moving batch tasks to near real-time • Continuous data • Consistent Resource Consumption (1GB/sec -> 86TB/day)
  8. 8. © 2017 Hazelcast Inc. Streaming is nothing brand new Complex Event Processing – early 2000‘s Materialised Views – 1998 (Oracle 8i) UNIX Pipelines? Scale is what‘s new!
  9. 9. © 2017 Hazelcast Inc. What modern SPE can do for you? • Offers high-level API to implement the processing pipeline • map, filter, groupBy, aggregate, join … • Offers connectors to read and write the data • Kafka, HDFS, JDBC, JMS … • You implement the data pipeline and submit it to the SPE • Executes the pipeline in a parallel and distributed environment • Moves the data through the pipeline (partitioning, shuffling, backpressure) • Is fault-tolerant (survives failures) and elastic • Monitoring, diagnostics etc.
  10. 10. © 2017 Hazelcast Inc. England – Belgium England amazing! #ENGBEL ENG win +1, BEL lost +1 Belgium hopeless, it‘s so bad #ENGBEL ENG win +2, BEL lost + 2 My guess is draw for England - Belgium #ENGBEL Both draw + 1
  11. 11. © 2017 Hazelcast Inc. Streaming and Hadoop Major Evolutionary Steps
  12. 12. © 2017 Hazelcast Inc. 1st Gen: MapReduce • Google File System Paper, 2003 ve/gfs-sosp2003.pdf • MapReduce paper, 2004 ve/mapreduce-osdi04.pdf • Apache Hadoop founded by Doug Cutting and Mike Cafarella at Yahoo 2006 • Commercial Open Source Distros: Cloudera, Hortonworks, MapR • Lots of additions to the ecosystem
  13. 13. © 2017 Hazelcast Inc. • Apache Spark started as a research project at UC Berkeley in the AMPLab, which focuses on big data analytics, in 2010. • Goal was to design a programming model that supports a much wider class of applications than MapReduce. Introduced DAG 2nd Gen: Spark
  14. 14. © 2017 Hazelcast Inc. • Fault-Tolerance of Spark designed with for batch • Spark Streaming Paper, 2012. Stream as a sequence of micro-batches • Spark has now moved on to DataFrames, Tungsten and Spark Streaming and its architecture continues to evolve so it is also 3rd Gen (Continuous). 2nd Gen: Spark Streaming
  15. 15. © 2017 Hazelcast Inc. 3rd Gen: Continuous streaming • DAG based • Streaming based. Not micro-batch • Batch is a simply streaming with bounds • Learns from previous systems • Informed by academic papers in the last decade, such as the Google[1][2] but many more • Plethora of Choice: Apache Storm, Twitter Heron, Apache Flink, Kafka Streams, Google DataFlow, Hazelcast Jet, Spark Continuous Streaming [1] MillWheel: Fault-Tolerant Stream Processing at Internet Scale [2] FlumeJava: Easy, Efficient Data-Parallel Pipelines
  16. 16. © 2017 Hazelcast Inc. Big Data Trends
  17. 17. © 2017 Hazelcast Inc. Hadoop – Declining Interest
  18. 18. © 2017 Hazelcast Inc. Hadoop – Obsolete?
  19. 19. © 2017 Hazelcast Inc. Spark Dominant but Stable
  20. 20. © 2017 Hazelcast Inc. The Rest
  21. 21. © 2017 Hazelcast Inc. Challenges of Stream Processing
  22. 22. © 2017 Hazelcast Inc. 1 - Infinite input • Some operations need bounded data (aggregate, join) • Slowest response within last 10 secs x Slowest response since the app was started • Windows - a bounded view on top of an infinite stream. Scope of a computation defined in code. Not in the data (batch) or by operational settings (micro batch for streaming)!
  23. 23. © 2017 Hazelcast Inc. Tumbling windows • Continuous stream divided into discrete parts. • Defined by a time duration or record count.
  24. 24. © 2017 Hazelcast Inc. Sliding windows • Fixed-size view that slides with the time advancing. • Fixed-size - they can overlap. • Defined by a size (time duration, record count) and shift
  25. 25. © 2017 Hazelcast Inc. Session windows • Burst of activity followed by period of inactivity (timeout). • Collects the activity belonging to the same session. • It’s dynamic - no fixed start or duration.
  26. 26. © 2017 Hazelcast Inc. 2 – Late Events • Wall clock time x Event time • Event time – more natural, unordered data • Long or short waiting? Latency, memory X correctnes. • Heuristic helping you assessing window completenes
  27. 27. © 2017 Hazelcast Inc. 3 - Fault Tolerance • Streaming jobs can be running for a long time! What if something crashes? • System back-ups! • It’s not that simple: you still need to coordinate the states of different steps in the computation: Distributed snapshots • Chandy-Lamport distributed snapshotting algorithm[1] [1] -
  28. 28. © 2017 Hazelcast Inc. Distributed snapshots Reader Writer Reader Reader Reader Writer Snapshot Done!
  29. 29. © 2017 Hazelcast Inc. 4 - Complexity There is a lot of moving parts: • Replayable data source (2x Kafka + 3x Zookeeper) • Stream Processing Engine (2x) • Cluster manager (YARN, Mesos, Kubernetes) • Highly-Available Snapshot Storage (HDFS) (3x) • Data sink for results You Are Not Google, are you?
  30. 30. © 2017 Hazelcast Inc. Jet Favours Speed * * Spark had all performance options turn on including Tungsten
  31. 31. © 2017 Hazelcast Inc. Jet Favours Simplicity
  32. 32. © 2017 Hazelcast Inc. Jet Application Deployment Options IMDG IMDG • No separate process to manage • Great for microservices • Great for OEM • Simplest for Ops – nothing extra Embedded Application Java API Application Java API Application Java API Java Client Application Client-Server Jet Jet Jet Jet Jet Jet Java Client Application Java Client Application Java Client Application • Separate Jet Cluster • Scale Jet independent of applications • Isolate Jet from application server lifecycle • Managed by Ops
  33. 33. © 2017 Hazelcast Inc. Our game prediction!
  34. 34. © 2017 Hazelcast Inc. Flight Telemetry © 2017 Hazelcast Inc. Flight Telemetry
  35. 35. © 2017 Hazelcast Inc. Demo Applications Real-time Image Recognition Twitter Cryptocurrency Sentiment Analysis Recognizes images present in the webcam video input with a model trained with CIFAR-10 dataset. Twitter content is analyzed in real time to calculate cryptocurrency trend list with popularity index. Real-Time Road Traffic Analysis And Prediction Real-time Sports Betting Engine Continuously computes linear regression models from current traffic. Uses the trend from week ago to predict traffic now This is a simple example of a sports book and is a good introduction to the Pipeline API. It also uses Hazelcast IMDG as an in-memory data store. Flight Telemetry Market Data Ingest Reads a stream of telemetry data from ADB-S on all commercial aircraft flying anywhere in the world. There is typically 5,000 - 6,000 aircraft at any point in time. This is then filtered, aggregated and certain features are enriched and displayed in Grafana. Uploads a stream of stock market data (prices) from a Kafka topic into an IMDG map. Data is analyzed as part of the upload process, calculating the moving averages to detect buy/sell indicators. Input data here is manufactured to ensure such indicators exist, but this is easy to reconnect to real input. Markov Chain Generator Real-Time Trade Processing Generates a Markov Chain with probabilities based on supplied classical books. Processes immutable events from an event bus (Kafka) to update storage optimized for querying and reading (IMDG).
  36. 36. © 2017 Hazelcast Inc. Hazelcast Jet. Try it! 36 • High Performance | Industry Leading Performance • Works great with Hazelcast IMDG | Source, Sink, Enrichment • Very simple to program | Leverages existing standards • Very simple to deploy | embed 10MB jar or Client Server • Works in every Cloud | Same as Hazelcast IMDG • For Developers by Developers | Code it
  37. 37. © 2017 Hazelcast Inc. Questions? Version 0.6.1 is the current release with 0.7 coming in September aiming for 1.0 this year