Jet is an open-source, distributed data processing engine that can process millions of events per second with low tail latency from streaming data sources like Apache Pulsar and Kafka. In this talk, we'll go through key stream processing concepts, learn about the internals of a distributed stream processing engine, and discuss the integration between Apache Pulsar and Jet.
4. Why Distributed Stream Processing?
● A data source that never stops (events)
● Use the data instantly and get results
● Do arbitrary complex processing of events
● Don’t want to store all the data permanently
● The volume of data may be larger than what downstream
systems can process in a timely fashion
9. Oh dear! Oh dear! I shall be too late!
[Figure: processing time compared with event time, with lagging events marked]
10. Down the Rabbit Hole, or Watermarks
[Figure: events with event times 08:11 to 08:14 shown against processing time and event time.
Labels: wm=08:11, wm=08:12, wm=08:12; Max Lag := 0:02; Lag = 0:03 at wm=08:12, marked "Late!"]
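The watermark values on this slide can be reproduced with a minimal sketch. It assumes the watermark is the highest event time seen so far minus the maximum lag, and that an event is late when its event time is strictly behind the current watermark. The class and method names are hypothetical, and the arrival order 08:13, 08:14, 08:12, 08:11 is an assumption chosen to match the slide's labels; this is not Jet's implementation.

```java
import java.time.LocalTime;

/** Hypothetical watermark tracker; all times are minutes since midnight. */
class WatermarkSketch {
    private final long maxLag;
    private long watermark = Long.MIN_VALUE;

    WatermarkSketch(long maxLagMinutes) {
        this.maxLag = maxLagMinutes;
    }

    /** Converts "HH:mm" to minutes since midnight. */
    static long minutes(String hhmm) {
        return LocalTime.parse(hhmm).toSecondOfDay() / 60;
    }

    /**
     * Observes one event. Returns true if the event is late, that is,
     * its event time is behind the current watermark. Otherwise the
     * watermark advances to (event time - max lag) if that is newer.
     */
    boolean observe(long eventTime) {
        if (eventTime < watermark) {
            return true;
        }
        watermark = Math.max(watermark, eventTime - maxLag);
        return false;
    }

    long watermark() {
        return watermark;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(2); // Max Lag := 0:02
        for (String t : new String[] {"08:13", "08:14", "08:12", "08:11"}) {
            boolean late = wm.observe(minutes(t));
            System.out.println(t + " late=" + late + " watermarkMinutes=" + wm.watermark());
        }
    }
}
```

With this arrival order, the 08:11 event shows up after the watermark has reached 08:12. Its lag behind the newest event (08:14) is 0:03, which exceeds the allowed 0:02, so it is reported as late.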
20. What is Jet?
● Distributed dataflow engine with distributed in-memory
storage
● Event-time based processing
● At-least-once and exactly-once processing guarantees
● Predictable latencies under load
● Single Java Binary
● https://github.com/hazelcast/hazelcast-jet
22. Word Count - Pipeline to DAG
Pipeline: Source -> Flat Map -> Filter -> Group + Aggregate -> Sink
DAG: Source -> FlatMap + Filter -> Accumulate -> Combine -> Sink
DAG edge labels: partitioned (into Accumulate); distributed, partitioned (into Combine)
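The split of "Group + Aggregate" into Accumulate and Combine can be illustrated in plain Java. This is a sketch under assumptions, not Jet's API: the class and method names are hypothetical, each input list stands in for the lines seen by one cluster member, and the network exchange between the two stages is replaced by an in-memory merge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical two-stage word count mirroring the Accumulate and Combine vertices. */
class TwoStageWordCount {

    /** FlatMap + Filter: split a line into lower-case words and drop empty tokens. */
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Accumulate: one member counts only the words in its local lines. */
    static Map<String, Long> accumulate(List<String> localLines) {
        Map<String, Long> partial = new HashMap<>();
        for (String line : localLines) {
            for (String word : tokenize(line)) {
                partial.merge(word, 1L, Long::sum);
            }
        }
        return partial;
    }

    /** Combine: merge the partial counts from all members into final totals. */
    static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> memberA = accumulate(List.of("to be or not", "to be"));
        Map<String, Long> memberB = accumulate(List.of("be quick"));
        System.out.println(combine(List.of(memberA, memberB)));
    }
}
```

Counting locally first means only one partial count per distinct word has to cross the network, rather than one item per word occurrence.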
23. Cooperative Multithreading
● All execution, including network I/O, processors and
snapshots, is done through tasklets
● Similar concept to green threads
● Tasklets run in a loop serviced by the same native thread.
○ Each tasklet does a small amount of work at a time
(<1ms)
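The loop described above can be sketched as follows. This is a minimal illustration, not Jet's execution engine: the `Tasklet` interface, the `run` method and the `counter` helper are hypothetical names, and the sketch simply assumes every tasklet returns quickly rather than enforcing a time limit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical cooperative scheduler: one thread services many tasklets in a loop. */
class CooperativeLoop {

    /** A unit of work that must return quickly instead of blocking. */
    interface Tasklet {
        /** Does one small slice of work; returns true once the tasklet is finished. */
        boolean call();
    }

    /** Runs all tasklets round-robin on the calling thread until every one is done. */
    static void run(List<Tasklet> tasklets) {
        List<Tasklet> active = new ArrayList<>(tasklets);
        while (!active.isEmpty()) {
            Iterator<Tasklet> it = active.iterator();
            while (it.hasNext()) {
                if (it.next().call()) {
                    it.remove(); // finished tasklets leave the loop
                }
            }
        }
    }

    /** Example tasklet: records one step per call and finishes after the given number of steps. */
    static Tasklet counter(String name, int steps, List<String> log) {
        int[] done = {0};
        return () -> {
            log.add(name + done[0]);
            return ++done[0] == steps;
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        run(List.of(counter("a", 2, log), counter("b", 3, log)));
        System.out.println(log);
    }
}
```

Because a tasklet gives up the thread by returning rather than by being preempted, a single long-running or blocking call would stall every other tasklet on that thread, which is why each slice of work has to stay small.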
25. Benefits
● Each native thread can handle thousands of cooperative
tasklets
● No context switching
● Almost guaranteed core affinity - better cache utilization
● High Performance for the win!
29. Benchmarks
● We tested the limits of Java performance with modern
garbage collectors: G1, Shenandoah, ZGC
● Self-contained benchmark with Sliding Window
Aggregation
● Events generated and consumed by Jet
● https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-rematch
32. Benchmarks
● 6,000,000 events per second
○ one event every 167 nanoseconds
● Kafka Topic with 24 partitions, replication factor = 3
● 20 second window with 20ms resolution
● Measuring end-to-end latency
○ from time of event to time of output
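The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "20ms resolution" means the window slides in 20 ms steps and that events arrive at a steady rate; the class and method names are hypothetical.

```java
/** Hypothetical helper that derives the benchmark's headline numbers. */
class BenchmarkArithmetic {

    /** Average gap between consecutive events, in nanoseconds. */
    static double nanosBetweenEvents(long eventsPerSecond) {
        return 1_000_000_000.0 / eventsPerSecond;
    }

    /** Number of slide steps that make up one full window. */
    static long stepsPerWindow(long windowMillis, long slideMillis) {
        return windowMillis / slideMillis;
    }

    /** Number of events that fall into one full window at a steady rate. */
    static long eventsPerWindow(long eventsPerSecond, long windowMillis) {
        return eventsPerSecond * windowMillis / 1000;
    }

    public static void main(String[] args) {
        System.out.println(Math.round(nanosBetweenEvents(6_000_000)) + " ns between events");
        System.out.println(stepsPerWindow(20_000, 20) + " slide steps per window");
        System.out.println(eventsPerWindow(6_000_000, 20_000) + " events per window");
    }
}
```

Under these assumptions, each 20-second window spans 1,000 slide steps and covers 120,000,000 events, and a new result is due every 20 ms.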
34. Latency includes:
● Time passing from the end of the window to the first event
beyond it
● Event simulator getting ready to send the event to Kafka
● Event traveling to Kafka (1st network hop)
● Event traveling from Kafka to Jet (2nd network hop)
● Event traveling through Jet's pipeline (3rd network hop due
to partitioning)
36. Pulsar Integration
● Use Pulsar as a data source or sink
● Uses either Consumer API or Reader API
● Consumer API -> no fault-tolerance
● Reader API -> supports fault-tolerance, but only partially
implemented, no partitioning support
● Looking for contributors!
38. Jet Roadmap
● Full SQL support
● Managed Service
● More connectors!
● Extended computational capabilities (deploying functions in
different languages)