
In-Memory Stream Processing with Hazelcast Jet @MorningAtLohika

Slides from my talk at Morning@Lohika meetup on Jun 23, 2018.

Published in: Technology

  1. In-Memory Stream Processing with Hazelcast Jet. Nazarii Cherkas | Hazelcast, nazarii@hazelcast.com, https://twitter.com/n_cherkas
  2. Brief Agenda
      • Why Stream Processing?
      • What's special about Streaming Data
      • Challenges when processing the Infinite Stream
      • Hazelcast Jet: The modern Stream Processing Engine
      • Overview and Key Concepts
      • Infinite Stream Processing
      • Fault Tolerance
      • Jet Performance
      • Summary
      © 2018 Hazelcast Inc.
  5. About me
      • 7+ years of experience in positions ranging from Java Engineer to Team Lead
      • Solutions Architect at Hazelcast: I solve our users' problems and interact with the community
  6. Why Stream Processing?
  7. Streaming Data is everywhere
  12. What's special about Streaming Data
      • Infinite data sets
      • Small size of data record
      • Near real-time insights
      • Variance in throughput and variance in disorder
  16. Definitions of Stream Processing
      • “...a type of data processing that is designed with infinite data sets in mind...”
      • “...processing of data in motion, or in other words, computing on data directly as it is produced or received…”
      • “...a technique to process the data on-the-fly, prior to its storage...”
      Sources: https://jet.hazelcast.org/use-cases/real-time-stream-processing/ https://data-artisans.com/what-is-stream-processing https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  21. Stream vs Batch Processing (source: https://aws.amazon.com/streaming-data/)
      • Data scope
        • Batch processing: queries or processing over all or most of the data in the dataset
        • Stream processing: queries or processing over data within a rolling time window, or on just the most recent data record
      • Data size
        • Batch processing: large batches of data
        • Stream processing: individual records or micro batches consisting of a few records
      • Responsiveness
        • Batch processing: latencies in minutes to hours
        • Stream processing: requires latency in the order of seconds or milliseconds
      • Analyses
        • Batch processing: complex analytics
        • Stream processing: aggregates, approximation algorithms and simple response functions
  22. Layers of Stream Processing
  27. Challenges of Stream Processing
      • Distributed system coordination
      • Notion of time
      • Memory management
      • Fault-tolerance
  28. Hazelcast Jet: In-Memory Streaming and Fast Batch Processing
  30. What is Hazelcast Jet [diagram labels: Source, Sink] https://github.com/hazelcast/hazelcast-jet/ (Apache License 2.0)
  36. Hazelcast Jet use cases
      • Low-latency Stream processing and analytics
      • Fast Batch processing and ETL
      • Distributed java.util.stream
      • Implementing event sourcing and CQRS
      • Data processing microservice architectures
  39. Hazelcast Jet: Architecture Overview
      [Diagram labels: Core API, java.util.stream, Batch Readers and Writers, Batch Processing, Pipeline API, Streaming Readers and Writers, Stream Processing, Networking, Deployment, Data Structures and Partition Management, Execution Engine, Cluster Management with Cloud Discovery SPI, Java Client, Fault-Tolerance, Connectors, High-Level APIs, Processing Core]
  40. Talk is cheap, show me the Word Count Demo
      The Word Count problem is the “Hello, World” of stream processing
      • Input
        • A book as a single text file
        • Stop-list of words to ignore, e.g. “this”, “that”, “of”
      • Output
        • Top N word occurrences in the book, saved as key -> value pairs
      https://es.wikiquote.org/wiki/Linus_Torvalds https://github.com/ncherkas/hazelcast-jet-demos
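The Word Count problem described on this slide can be sketched in plain Java with java.util.stream. This is only an illustration of the input/output contract (lines of text, a stop-list, top N counts as key -> value pairs); it is not the code from the demo repository and it does not use the Jet API. The class name `WordCountSketch`, the method `topWords` and the sample text are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical single-JVM sketch; a real Jet job would distribute these steps.
final class WordCountSketch {

    // Returns the n most frequent words that are not in the stop-list,
    // ordered by descending count, then alphabetically for ties.
    static Map<String, Long> topWords(List<String> lines, Set<String> stopWords, int n) {
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty() && !stopWords.contains(word))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> book = Arrays.asList(
                "This is the stream that feeds the jet",
                "The jet counts words of the stream",
                "Jet is fast");
        Set<String> stopWords = new HashSet<>(Arrays.asList("this", "that", "of", "the", "is"));
        System.out.println(topWords(book, stopWords, 2)); // prints {jet=3, stream=2}
    }
}
```

The flatMap, filter and grouping steps here correspond to the kinds of stages a distributed engine would run as separate vertices.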
  41. Key Concepts
  47. Key concepts: Directed Acyclic Graph (DAG)
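To make the DAG idea concrete, here is a minimal sketch of a computation graph: vertices are processing steps, edges carry data between them, and a topological order exists only if the graph has no cycle. The class `DagSketch` and the vertex names are hypothetical and chosen for illustration; this is not Jet's Core DAG API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a processing graph, for illustration only.
final class DagSketch {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    DagSketch vertex(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    DagSketch edge(String from, String to) {
        vertex(from);
        vertex(to);
        edges.get(from).add(to);
        return this;
    }

    // Kahn's algorithm: repeatedly emit a vertex that has no unprocessed inputs.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) {
            inDegree.put(v, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String t : edges.get(v)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("graph has a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch()
                .edge("source", "tokenize")
                .edge("tokenize", "accumulate")
                .edge("accumulate", "combine")
                .edge("combine", "sink");
        System.out.println(dag.topologicalOrder());
        // prints [source, tokenize, accumulate, combine, sink]
    }
}
```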
  49. Key concepts: Jet Cluster
  52. Key concepts: Job Execution [diagram labels: three Hazelcast Members]
  56. Jet APIs
      • Pipeline API
        • The first choice for using Jet. Build rich data pipelines on a variety of sources and sinks
      • Distributed java.util.stream
        • Entry-level usage, simple transform-aggregate operations on IMap, JCache and IList
      • Core DAG API
        • Low-level API for fine-grained tuning and integration
  60. Sources and Sinks
      Resource | Infinite? | Replayable? | Checkpointing? | Distributed? | Data Locality
      IList | ❌ | ✅ | ❌ | ❌ | ❌
      IMap, ICache | ❌ | ✅ | ❌ | ✅ | Src ✅ Sink ❌
      Remote IMap, ICache | ❌ | ✅ | ❌ | ✅ | ❌
      Event Journal | ✅ | ✅ | ✅ | ✅ | ❌
      Remote Event Journal | ✅ | ✅ | ✅ | ✅ | ❌
      HDFS | ❌ | ✅ | ❌ | ✅ | ✅
      Kafka | ✅ | ✅ | ✅ | ✅ | ❌
      Files | ❌ | ✅ | ❌ | ❌ | ✅
      File Watcher | ✅ | ❌ | ❌ | ❌ | ✅
      TCP Socket | ✅ | ❌ | ❌ | ❌ | ❌
      Application Log | N/A | N/A | ❌ | ❌ | ✅
  61. Infinite Stream Processing with Jet
  65. Jet Streaming Demo: Flight Telemetry
      Processing a near real-time Flight Telemetry Stream from ADS-B Exchange - https://www.adsbexchange.com/
      • Filter out planes outside of defined airports
      • Detect whether the plane is ascending, descending or staying at the same level
      • Based on the plane type and phase of the flight, calculate the maximum noise levels near an airport and estimate CO2 emissions for a region
      https://github.com/ncherkas/hazelcast-jet-demos
  74. Jet Streaming Demo: Dashboard Pipeline
  79. Pipeline transformations
      • Time-agnostic transformations
        • Filter
        • Map
        • Flatmap
      • Aggregation and Grouping
        • Built-in count, different kinds of averages, min/max, linear trends and many more
      • Co-Aggregation
      • Hash-Join
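Of the transformations listed above, the hash-join is perhaps the least self-explanatory. The general idea is to load the smaller, static side into a hash table and look up each streaming item against it. The sketch below shows that idea in plain Java under stated assumptions: the class `HashJoinSketch`, the flight IDs and the airport table are all made up for illustration, and this is not the Jet Pipeline API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical enrichment example: events are (flightId, airportCode) pairs.
final class HashJoinSketch {

    // Joins each event with the airport name from the in-memory lookup table.
    // Events without a matching key get a placeholder instead of being dropped.
    static List<String> enrich(List<String[]> events, Map<String, String> airportNames) {
        List<String> enriched = new ArrayList<>();
        for (String[] event : events) {
            String name = airportNames.getOrDefault(event[1], "unknown airport");
            enriched.add(event[0] + " near " + name);
        }
        return enriched;
    }

    public static void main(String[] args) {
        // The "build" side: small, static, held fully in memory.
        Map<String, String> airportNames = new HashMap<>();
        airportNames.put("LHR", "London Heathrow");
        airportNames.put("JFK", "New York JFK");

        // The "probe" side: in a streaming job this would be unbounded.
        List<String[]> events = Arrays.asList(
                new String[] {"BA117", "LHR"},
                new String[] {"DL1", "JFK"},
                new String[] {"XX000", "???"});

        enrich(events, airportNames).forEach(System.out::println);
    }
}
```

Because each lookup is a constant-time map access, this style of join suits enrichment against reference data that fits in memory.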
  83. Windowing
  87. Windowing Example: 30-second Window Sliding by 10 Seconds
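With a 30-second window sliding by 10 seconds, every event falls into three overlapping windows. The sketch below computes those windows for a given event timestamp and counts events per window. It assumes windows are half-open intervals [start, start + 30 s) aligned to multiples of the slide step; the class `SlidingWindowSketch` and its methods are hypothetical and do not reflect how Jet implements windowing internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of sliding-window assignment by event timestamp.
final class SlidingWindowSketch {
    static final long WINDOW_MS = 30_000;
    static final long SLIDE_MS = 10_000;

    // Start timestamps of every window [start, start + WINDOW_MS) that contains ts.
    static List<Long> windowStarts(long ts) {
        List<Long> starts = new ArrayList<>();
        // Latest window that starts at or before the event.
        long lastStart = ts - Math.floorMod(ts, SLIDE_MS);
        for (long start = lastStart; start > ts - WINDOW_MS; start -= SLIDE_MS) {
            starts.add(start);
        }
        return starts;
    }

    // Counts events per window, keyed by window start and sorted by time.
    static Map<Long, Integer> countPerWindow(long[] timestamps) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            for (long start : windowStarts(ts)) {
                counts.merge(start, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(windowStarts(25_000)); // prints [20000, 10000, 0]
        System.out.println(countPerWindow(new long[] {5_000, 12_000, 25_000, 41_000}));
    }
}
```

Early events map to windows with negative start times here; a real engine would typically clamp or ignore those.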
  90. Watermarks to handle Late Events: a watermark is an educated guess that “from this point on there will be no more items with timestamp less than this”
  91. Watermarks in Jet: Predefined Watermark Policies
      • With Fixed Lag
      • Limiting Lag and Delay
      • Limiting Lag and Lull
      • Limiting Timestamp and Wall-Clock Lag
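As a rough illustration of the simplest policy in this list, a fixed-lag watermark trails the highest event timestamp seen so far by a constant amount, and any event older than the current watermark counts as late. The class `FixedLagWatermark` below is a hypothetical sketch of that idea only; it is not Jet's implementation, and the other policies are not modeled.

```java
// Hypothetical fixed-lag watermark tracker, for illustration only.
final class FixedLagWatermark {
    private final long lagMs;
    private long watermark = Long.MIN_VALUE;

    FixedLagWatermark(long lagMs) {
        this.lagMs = lagMs;
    }

    // Observes an event timestamp. Returns true if the event is late,
    // meaning its timestamp is below the current watermark.
    boolean observe(long timestamp) {
        if (timestamp < watermark) {
            return true;
        }
        // The watermark never moves backwards.
        watermark = Math.max(watermark, timestamp - lagMs);
        return false;
    }

    long current() {
        return watermark;
    }

    public static void main(String[] args) {
        FixedLagWatermark wm = new FixedLagWatermark(2_000);
        System.out.println(wm.observe(10_000)); // false, watermark becomes 8000
        System.out.println(wm.observe(9_000));  // false, out of order but within the lag
        System.out.println(wm.observe(12_000)); // false, watermark becomes 10000
        System.out.println(wm.observe(9_500));  // true, older than the watermark
    }
}
```

A larger lag tolerates more disorder at the cost of emitting window results later.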
  92. Fault Tolerance
  93. Jet Processing Fault Tolerance: the Cluster elects a Coordinator Member that takes care of Job coordination among the Cluster Members
  94. Jet Processing Fault Tolerance: Jet achieves fault tolerance in streaming jobs by taking snapshots of the internal processing state
  95. Jet Processing Fault Tolerance: the Coordinator Member detects the failure of another Member and restarts the Job using the new topology
  96. Jet Processing Fault Tolerance: when the Coordinator Member crashes, the Cluster elects a new one
  101. Distributed Snapshots: technique first described in a paper by Chandy and Lamport in 1985
  105. Jet Processing Guarantees
      • At-Least Once
      • Exactly Once
      • At-Most Once (meaning that the Fault Tolerance is turned off)
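The difference between these guarantees can be illustrated with a deliberately simplified, single-processor sketch of snapshot-and-replay recovery. It assumes a replayable source addressed by offset and a processor that keeps a running sum. If the saved state and the saved offset describe the same point in the stream, recovery gives exactly-once results; if the state is slightly ahead of the offset, replay counts some items twice, which is at-least-once. The class `SnapshotRecoverySketch` is hypothetical, and this is not the distributed snapshot algorithm itself, which additionally coordinates many processors.

```java
// Hypothetical single-processor model of snapshot-and-replay recovery.
final class SnapshotRecoverySketch {

    // Processing state saved together with the source offset it corresponds to.
    static final class Snapshot {
        final int nextOffset;
        final long sum;

        Snapshot(int nextOffset, long sum) {
            this.nextOffset = nextOffset;
            this.sum = sum;
        }
    }

    // Resumes from a snapshot and sums source items up to (excluding) upTo.
    static Snapshot process(int[] source, Snapshot from, int upTo) {
        long sum = from.sum;
        for (int i = from.nextOffset; i < upTo; i++) {
            sum += source[i];
        }
        return new Snapshot(upTo, sum);
    }

    public static void main(String[] args) {
        int[] source = {1, 2, 3, 4, 5, 6};
        Snapshot start = new Snapshot(0, 0);

        // Snapshot taken after three items: offset 3, sum 6.
        Snapshot saved = process(source, start, 3);
        // More work happens, then the member "crashes" and this result is lost.
        process(source, saved, 5);

        // Exactly-once: state and offset were captured at the same point.
        System.out.println(process(source, saved, source.length).sum); // prints 21

        // At-least-once: the saved state already includes item 4,
        // but replay still starts at offset 3, so item 4 is counted twice.
        Snapshot misaligned = new Snapshot(3, 10);
        System.out.println(process(source, misaligned, source.length).sum); // prints 25
    }
}
```

At-most-once corresponds to not restoring anything: after a crash, the items processed since the start are simply lost.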
  106. Performance
  112. Hazelcast Jet Performance: Key Design Decisions
      • DAG to Model Computations
      • In-Memory Data Locality
      • Partition Mapping Affinity
      • SPSC (single-producer, single-consumer) Queues
      • Cooperative Multithreading (Green Threads)
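Cooperative multithreading means many small units of work share one thread and voluntarily return control after a short, non-blocking slice, instead of each owning an OS thread. The sketch below shows the scheduling loop in its simplest form. The `Tasklet` interface and the class `CooperativeWorkerSketch` are names assumed for this example; Jet's real execution engine is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-threaded cooperative scheduler, for illustration only.
final class CooperativeWorkerSketch {

    // Does a small, non-blocking slice of work per call; returns true when finished.
    interface Tasklet {
        boolean call();
    }

    // Round-robins over the tasklets on the calling thread until all are done.
    static void runAll(List<Tasklet> tasklets) {
        List<Tasklet> active = new ArrayList<>(tasklets);
        while (!active.isEmpty()) {
            // Each pass gives every active tasklet one turn and drops finished ones.
            active.removeIf(Tasklet::call);
        }
    }

    // A tasklet that logs one step per call and finishes after the given number of steps.
    static Tasklet counter(String name, int steps, List<String> log) {
        int[] done = {0};
        return () -> {
            log.add(name + done[0]);
            return ++done[0] >= steps;
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        runAll(Arrays.asList(counter("A", 2, log), counter("B", 3, log)));
        System.out.println(log); // prints [A0, B0, A1, B1, B2]
    }
}
```

The interleaved log shows both tasklets making progress on one thread without any context switch; the trade-off is that a tasklet that blocks would stall all the others.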
  113. Jet Streaming Performance https://jet.hazelcast.org/performance/
  114. Jet Throughput https://jet.hazelcast.org/performance/
  115. Running in Production
  118. Running Jet in Production
      • Docker images: https://github.com/hazelcast/hazelcast-jet-docker
      • Cluster Management: Mesos, Yarn
      • Cluster Discovery
        • Cloud Providers: AWS, Windows Azure, GCP, PCF, Heroku
        • Kubernetes
        • Consul, Eureka, Zookeeper
  125. Summary: why you should consider using Hazelcast Jet
      • High Performance | Industry Leading
      • Out-of-box integration with Hazelcast IMDG | Source, Sink, Enrichment
      • Easy to start with and integrate | Zero dependencies, developer friendly
      • Simple to deploy | Embedded 10MB jar or Client-Server
      • Works in every Cloud | Same as Hazelcast IMDG
      • For Developers by Developers | Code it
  126. Questions? Version 0.6 is the current release, with 0.7 coming in Q3 2018 and 1.0 targeted for this year. http://jet.hazelcast.org https://groups.google.com/forum/#!forum/hazelcast-jet https://gitter.im/hazelcast/hazelcast
