General idea across the slide deck:
The value of Jet is in simplicity and integration with IMDG
Big data isn’t enough, fast big data is what matters (information available in a near real-time fashion)
Jet grabs the information from the data in near real-time and stores it in IMDG, staying in-memory and distributed
At a business level, Hazelcast Jet is a distributed computing platform for fast processing of big, streaming data sets. Jet is built around a parallel core engine that allows data-intensive applications to operate at near real-time speeds.
Jet vision = fast big data, real-time stream processor + fast batch processor
What the ecosystem looks like.
Jet is part of the data-processing pipeline
Two major deployment setups - stream and batch.
Batch
- A complete dataset is available before the processing starts; the data is submitted as one batch to be processed.
- Usually stored in a database, in IMDG, or on a filesystem.
- Data is loaded from (potentially multiple) sources, joined, transformed, and cleaned, then stored and/or left in IMDG for further in-memory use.
- Typically used for ETL (Extract-Transform-Load) tasks.
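The ETL shape described above can be sketched in plain Java (this is a toy illustration, not the actual Jet API; the `extract`/`transform`/`runEtl` names and the sample records are invented for the example, and a plain map stands in for IMDG):

```java
import java.util.*;
import java.util.stream.*;

public class BatchEtlSketch {
    // Extract: load records from a (mock) source, e.g. a file or database dump
    static List<String> extract() {
        return Arrays.asList("alice,30", " bob,25 ", "carol,35");
    }

    // Transform: clean (trim) and parse each record
    static Map<String, Integer> transform(List<String> rows) {
        return rows.stream()
                .map(String::trim)
                .map(r -> r.split(","))
                .collect(Collectors.toMap(f -> f[0].trim(),
                                          f -> Integer.parseInt(f[1].trim())));
    }

    // Load: store the result; here an in-memory map stands in for IMDG
    static Map<String, Integer> runEtl() {
        return transform(extract());
    }

    public static void main(String[] args) {
        System.out.println(runEtl());
    }
}
```

The key property of the batch case is that `extract()` returns a complete, finite dataset before any transformation begins.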
Stream
- Data is potentially unbounded and infinite.
- The goal is to process the data as soon as possible. Processing means extracting the important information from the data. The whole processing stays in memory, so other systems can be alerted in near real-time.
- Data is ingested into the system via a message broker (e.g. Kafka).
- The streaming deployment is useful for use cases where real-time behaviour matters.
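A minimal sketch of the per-event idea (not Jet code; the temperature events, threshold, and `process` method are invented for illustration): each event is examined as it arrives, and an alert is raised immediately rather than after the whole dataset is collected.

```java
import java.util.*;
import java.util.function.Consumer;

public class StreamAlertSketch {
    // Process each event as it arrives; alert as soon as a rule fires,
    // without waiting for the (potentially unbounded) stream to end.
    static List<String> process(List<Double> temperatures, double threshold) {
        List<String> alerts = new ArrayList<>();
        Consumer<Double> perEvent = t -> {
            if (t > threshold) alerts.add("ALERT: " + t); // near-real-time notification
        };
        temperatures.forEach(perEvent); // stands in for an unbounded event source
        return alerts;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList(20.0, 99.5, 21.0), 50.0));
    }
}
```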
Streaming features
Streaming core of Jet (low latency, per-event processing).
Windowing and late-events processing
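The windowing and late-event features above can be illustrated with a plain-Java tumbling-window counter (a conceptual sketch, not Jet's windowing API; the window size, allowed lateness, and event encoding are assumptions for the example):

```java
import java.util.*;

public class WindowingSketch {
    // Assign each event to a tumbling window by its event time; discard
    // events arriving later than the allowed lateness past the window's end.
    // Each event is encoded as {eventTimeMs, arrivalTimeMs}.
    static Map<Long, Integer> countPerWindow(List<long[]> events,
                                             long windowSizeMs, long allowedLatenessMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long[] e : events) {
            long eventTime = e[0], arrivalTime = e[1];
            long windowStart = (eventTime / windowSizeMs) * windowSizeMs;
            long windowEnd = windowStart + windowSizeMs;
            if (arrivalTime > windowEnd + allowedLatenessMs) continue; // too late: drop
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<long[]> events = Arrays.asList(
                new long[]{100, 150},   // on time, window [0, 1000)
                new long[]{900, 1500},  // late, but within allowed lateness
                new long[]{200, 5000}); // too late: dropped
        System.out.println(countPerWindow(events, 1000, 1000));
    }
}
```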
Hazelcast can be used in two different modes:
Embedded Mode is very easy to set up and run. It can only be used with peer Java applications. Developers really like this mode during prototyping phases.
Client-Server Mode is more work, but efficient once an application is required to scale to a larger volume of transactions or users, as it separates the application from Hazelcast IMDG for better manageability. It works with different types of applications (C++, C#/.NET, Python, Scala, Node.js) using the native clients provided by Hazelcast. The Hazelcast Open Binary Protocol enables clients for any other language to be developed as needed (see the Hazelcast IMDG docs).
With both client-server and embedded architecture, Hazelcast IMDG offers:
+ Elasticity and scalability
+ Transparent integration with backend databases using Map Store interfaces
+ Management of the cluster through your web browser with Management Center
Jet is built on top of Hazelcast IMDG, with the benefits of close integration. Hazelcast Jet and IMDG can run in these deployments:
Every Jet cluster works on top of an embedded Hazelcast in-memory data grid. The embedded deployment is well suited for:
Sharing processing state among Jet nodes
Caching intermediate processing results
Enriching processed events by caching remote data (e.g. fact tables from a database) on the Jet nodes
Running advanced data processing tasks on top of Hazelcast data structures
Development purposes, since starting a standalone Jet cluster is simple and fast
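The enrichment use case above can be sketched in plain Java (a toy illustration, not Jet/IMDG code; the ticker fact table and the `enrich` method are invented for the example, with a local map standing in for data an embedded IMDG member would hold on each node):

```java
import java.util.*;
import java.util.stream.*;

public class EnrichmentSketch {
    // A locally cached "fact table" (ticker -> company name), standing in for
    // remote data cached on each Jet node via the embedded IMDG.
    static final Map<String, String> FACTS = new HashMap<>();
    static {
        FACTS.put("AAPL", "Apple");
        FACTS.put("MSFT", "Microsoft");
    }

    // Enrich each trade event with the company name from the local cache,
    // avoiding a remote lookup per event.
    static List<String> enrich(List<String> tickers) {
        return tickers.stream()
                .map(t -> t + " -> " + FACTS.getOrDefault(t, "unknown"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(enrich(Arrays.asList("AAPL", "MSFT")));
    }
}
```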
Remote deployment
Share state or intermediate results among multiple Jet jobs
Cache intermediate results (e.g. to show real-time processing stats on dashboard)
Load the input of a Jet computation from IMDG, or store its results there, in an in-memory fashion
Isolation of the processing and storage layers
Jet x IMDG – product differentiation
Both Jet and IMDG are for parallel and distributed computing, so why is Jet more expensive? Is Hazelcast IMDG not good enough?
The streaming benchmark is based on a simple stock exchange aggregation. Each message representing a trade is published to Kafka, and then a simple windowed aggregation that calculates the number of trades per ticker per window is performed using various data processing frameworks.
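The benchmark's aggregation logic amounts to a keyed, windowed count; a plain-Java sketch of that computation (not the benchmark code itself; the trade encoding and window size are assumptions for illustration):

```java
import java.util.*;

public class TradeCountSketch {
    // Count trades per ticker per tumbling window, mirroring the benchmark's
    // aggregation. Each trade is encoded as {ticker, timestampMs}.
    static Map<String, Integer> countTrades(List<Object[]> trades, long windowMs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Object[] t : trades) {
            String ticker = (String) t[0];
            long ts = (Long) t[1];
            long windowStart = (ts / windowMs) * windowMs;
            counts.merge(ticker + "@" + windowStart, 1, Integer::sum); // key = ticker + window
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Object[]> trades = Arrays.asList(
                new Object[]{"AAPL", 100L}, new Object[]{"AAPL", 900L},
                new Object[]{"AAPL", 1200L}, new Object[]{"MSFT", 300L});
        // Entries: AAPL@0=2, AAPL@1000=1, MSFT@0=1
        System.out.println(countTrades(trades, 1000));
    }
}
```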
Compared to Spark and Flink doing a word count, with the data on HDFS/IMap, on a cluster of 10 nodes with 32 cores each…
Here one thread does the tokenizing and another does the accumulation. Each can operate at its own speed, using the concurrent queue to buffer work.
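A self-contained sketch of that two-stage setup (not Jet internals; a `LinkedBlockingQueue` and a sentinel value are assumptions for the example): one thread tokenizes and hands words to the queue, a second thread accumulates counts at its own pace.

```java
import java.util.*;
import java.util.concurrent.*;

public class TwoStageSketch {
    // One thread tokenizes lines, another accumulates counts; a blocking
    // queue buffers the work between them so each stage runs at its own speed.
    static Map<String, Integer> wordCount(List<String> lines) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        final String POISON = "\u0000END"; // sentinel telling the consumer to stop
        Map<String, Integer> counts = new ConcurrentHashMap<>();

        Thread tokenizer = new Thread(() -> {
            for (String line : lines)
                for (String w : line.toLowerCase().split("\\s+"))
                    if (!w.isEmpty()) queue.add(w);
            queue.add(POISON);
        });
        Thread accumulator = new Thread(() -> {
            try {
                for (String w = queue.take(); !w.equals(POISON); w = queue.take())
                    counts.merge(w, 1, Integer::sum);
            } catch (InterruptedException ignored) { }
        });
        tokenizer.start();
        accumulator.start();
        try {
            tokenizer.join();
            accumulator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be")));
    }
}
```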
This is not the same as Fork-Join in the JDK, which uses a single thread for all processing steps; parallelStream allocates a Fork-Join task per data partition, but that's all.
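For contrast, the java.util.stream version of the same word count (a standard idiom, shown to highlight the threading difference): each Fork-Join task runs all steps, tokenizing and counting, for its own data partition in a single thread.

```java
import java.util.*;
import java.util.stream.*;

public class ParallelStreamSketch {
    // parallelStream splits the input into partitions; each Fork-Join task
    // executes the WHOLE pipeline (tokenize + count) for its partition on
    // one thread, rather than dedicating a thread per processing stage.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.parallelStream()
                .flatMap(l -> Arrays.stream(l.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingByConcurrent(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be")));
    }
}
```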
Each reducer is a different instance. It can be across the network or local, depending on how you create your DAG model. In our java.util.stream implementation, reducers are always local and there is a network step for the combiner.
All the nodes need to agree on who owns which partition. The data is processed locally and then sent to the partition owner as part of the combine step. Jet uses the same partitioning scheme as Hazelcast, so if your combine step is a collect to a Hazelcast IMap, the final output will be local. So there is only one network hop in the entire DAG processing. Writing to HDFS is local, but it is then replicated out.
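The "everyone agrees on the owner" idea can be sketched with a toy partitioning function (Hazelcast's default partition count really is 271, but it hashes the serialized key with Murmur3; the `hashCode`-based hash and the partition-to-node mapping here are simplifications for illustration):

```java
public class PartitionSketch {
    static final int PARTITION_COUNT = 271; // Hazelcast's default partition count

    // Every node computes the same partition for a key without coordination,
    // so locally pre-aggregated results can be sent straight to the owner:
    // the single network hop of the combine step.
    static int partitionId(String key) {
        return Math.abs(key.hashCode() % PARTITION_COUNT); // stand-in for Murmur3
    }

    // Simplistic partition -> node mapping; the real cluster keeps a
    // partition table assigning partitions to members.
    static int ownerNode(String key, int nodeCount) {
        return partitionId(key) % nodeCount;
    }

    public static void main(String[] args) {
        System.out.println("AAPL -> partition " + partitionId("AAPL")
                + ", node " + ownerNode("AAPL", 10));
    }
}
```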
It's to be replaced by the high-level "what" API in Jet 0.5 (as opposed to the "how" core API).