The document provides information on Hazelcast Jet, a distributed computing platform for fast processing of big data sets. Hazelcast Jet is based on a parallel core engine that allows data-intensive applications to operate at near real-time speeds. It offers high performance, flexibility, and ease of use for both batch and stream processing.
General idea across the slide deck:
The value of Jet is in simplicity and integration with IMDG
Big data isn’t enough, fast big data is what matters (information available in a near real-time fashion)
Jet extracts information from the data in near real-time and stores it in IMDG, staying in-memory and distributed
Hazelcast Jet is a distributed computing platform for fast processing of big data sets. Jet is based on a parallel, streaming core engine allowing data-intensive applications to operate at near real-time speeds.
A general purpose distributed data processing engine built on Hazelcast and based on directed acyclic graphs to model data flow.
IMDG as storage, Jet as processing engine (querying, transforming) -> these tasks often go hand in hand. This is the motivation for Jet.
Define market demand for in-memory data processing (Motivation for Jet):
1/ Big and Fast Data
2/ Near real-time processing
3/ IMDG based
— Jet comes with IMDG
— Jet takes care of processing, IMDG of storage (in-memory speed, distributed).
—> operational simplicity, products integrated
Performance – For both stream (unbounded data) and batch (bounded data) processing, Jet delivers impressive throughput and very low latency, even under increasing load.
Integration with IMDG – IMDG provides scalable in-memory data storage to be used during processing (as source, sink, operational storage for cached/temporary data). Hazelcast IMDG and Jet are engineered to work together for high performance.
Simplicity – Jet is simple to deploy as it shares the lower stack with Hazelcast IMDG. Jet is application-embeddable for OEMs and microservices, making it easier for manufacturers to build and maintain next generation systems.
Jet vision = fast big data, real-time stream processor + fast batch processor
What the ecosystem looks like.
Jet is part of the data-processing pipeline
Two major deployment setups - stream and batch.
Batch
— A complete dataset is available before the processing starts. The data are submitted as one batch to be processed.
Usually in a database, IMDG, or on a filesystem
Data are loaded from (potentially multiple) sources, joined, transformed, cleaned. Then stored and/or left in IMDG for further in-memory usage.
- Typically used for ETL (extract, transform, load) tasks.
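A batch ETL flow like the one above can be sketched with Jet's Pipeline API. This is a minimal sketch, assuming a recent Jet release (4.x; the 0.x releases from this deck's timeframe used drawFrom/drainTo instead of readFrom/writeTo); the map names "raw" and "clean" are illustrative:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Util;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class BatchEtl {
    public static void main(String[] args) {
        // Read every entry of the "raw" IMap, drop empty values,
        // normalize the rest, and store the result in the "clean" IMap.
        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String, String>map("raw"))
         .filter(e -> e.getValue() != null && !e.getValue().isEmpty())
         .map(e -> Util.entry(e.getKey(), e.getValue().trim().toLowerCase()))
         .writeTo(Sinks.map("clean"));

        JetInstance jet = Jet.newJetInstance(); // embedded Jet node with embedded IMDG
        try {
            jet.newJob(p).join(); // a batch job completes once the source is exhausted
        } finally {
            Jet.shutdownAll();
        }
    }
}
```

The cleaned data stays in the "clean" IMap for further in-memory usage, as the notes describe.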
Stream
- data potentially unbounded and infinite
The goal is to process the data as soon as possible. Process = extract important information from the data. The whole processing stays in memory, so other systems can be alerted in near real-time.
Data is ingested into the system via a message broker (e.g. Kafka)
The streaming deployment is useful for use cases where real-time behaviour matters.
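Stream ingestion via Kafka can be sketched as below. This assumes the hazelcast-jet-kafka module and Jet 4.x API names; the broker address, topic name "events", and the ALERT filter are all illustrative:

```java
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import java.util.Properties;

public class StreamIngest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Consume the (hypothetical) "events" topic, keep only the
        // important records, and hold the results in an IMap where
        // other systems can read them in near real-time.
        Pipeline p = Pipeline.create();
        p.readFrom(KafkaSources.<String, String>kafka(props, "events"))
         .withoutTimestamps()
         .filter(e -> e.getValue().contains("ALERT"))
         .writeTo(Sinks.map("alerts"));
    }
}
```

The job runs until cancelled, since a Kafka source is unbounded.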
Use-cases enabled by Jet
Based on use-cases from web https://hazelcast.atlassian.net/wiki/display/JET/Reference+Use-Cases+and+Target+Audience
- Performs at the lowest latency.
Important for real-time use cases (streaming) – after item is ingested, how fast do I get the results?
Latency of Jet remains low at scale, unlike competitors, thanks to its windowing design.
Compared against Spark and Flink on a word count job, with data on HDFS/IMap, on a cluster of 10 nodes with 32 cores each…
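The word-count job used in such benchmarks can be sketched with the Pipeline API. A minimal sketch assuming Jet 4.x names; the "books" directory and "counts" map are illustrative:

```java
import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class WordCount {
    public static Pipeline build() {
        // Split each line into words, group by word, count occurrences,
        // and write the totals to the "counts" IMap.
        Pipeline p = Pipeline.create();
        p.readFrom(Sources.files("books"))   // emits one String per line
         .flatMap(line -> Traversers.traverseArray(line.toLowerCase().split("\\W+")))
         .filter(word -> !word.isEmpty())
         .groupingKey(word -> word)
         .aggregate(AggregateOperations.counting())
         .writeTo(Sinks.map("counts"));
        return p;
    }
}
```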
Hazelcast can be used in two different modes:
Embedded Mode is very easy to set up and run. It can only be done with peer Java applications. Developers really like this mode during prototyping phases.
Client-Server Mode is more work but pays off once an application needs to scale to a larger volume of transactions or users, as it separates the application from Hazelcast IMDG for better manageability. It can be used with different types of applications via the native clients provided by Hazelcast (C++, C#/.NET, Python, Scala, Node.js). The Hazelcast Open Binary Protocol enables clients for any other language to be developed as needed (see Hazelcast IMDG Docs).
With both client-server and embedded architecture, Hazelcast IMDG offers:
+ Elasticity and scalability
+ Transparent integration with backend databases using Map Store interfaces
+ Management of the cluster through your web browser with Management Center
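A client-server connection can be sketched as follows. This is a minimal sketch assuming IMDG 4.x package names (3.x has IMap in com.hazelcast.core); the member address and map name are illustrative:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class ClientExample {
    public static void main(String[] args) {
        ClientConfig config = new ClientConfig();
        config.getNetworkConfig().addAddress("10.0.0.1:5701"); // assumed member address

        // The client holds no data itself; reads and writes go to the cluster.
        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
        IMap<String, String> customers = client.getMap("customers");
        customers.put("id-1", "Alice");
        client.shutdown();
    }
}
```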
Jet is built on top of Hazelcast IMDG with the benefits of close integration. Hazelcast Jet and IMDG can run in these deployments:
Every Jet cluster works on top of an embedded Hazelcast in-memory data grid. The embedded deployment is well suited for:
Sharing processing state among Jet nodes
Caching intermediate processing results
Enriching processed events by caching remote data (e.g. fact tables from a database) on Jet nodes
Running advanced data processing tasks on top of Hazelcast data structures
Development purposes, since starting a standalone Jet cluster is simple and fast
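Starting an embedded Jet node and using its IMDG for shared state can be sketched as below (Jet 4.x API; the map name and entry are illustrative):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.map.IMap;

public class EmbeddedJet {
    public static void main(String[] args) {
        // Jet.newJetInstance() starts a Jet member together with its
        // embedded IMDG; IMaps on it can hold shared processing state,
        // cached intermediate results, or enrichment data.
        JetInstance jet = Jet.newJetInstance();
        IMap<String, Double> cache = jet.getMap("intermediate-results");
        cache.put("avg-torque", 42.0);
        Jet.shutdownAll();
    }
}
```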
Remote deployment
Share state or intermediate results among multiple Jet jobs
Cache intermediate results (e.g. to show real-time processing stats on dashboard)
Load input for, or store results of, a Jet computation in IMDG, keeping them in memory
Isolation of processing and storage layer
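Reading from and writing to a remote IMDG cluster can be sketched with the remote map connectors (Jet 4.x API; the address and map names are illustrative):

```java
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class RemoteImdg {
    public static Pipeline build() {
        // Connect to a separate IMDG cluster so the storage layer
        // stays isolated from the Jet processing layer.
        ClientConfig remote = new ClientConfig();
        remote.getNetworkConfig().addAddress("imdg-host:5701"); // assumed IMDG address

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String, String>remoteMap("input", remote))
         .writeTo(Sinks.remoteMap("output", remote));
        return p;
    }
}
```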
Jet x IMDG – product differentiation
Both Jet and IMDG are for parallel and distributed computing, so why is Jet more expensive? Is Hazelcast IMDG not good enough?
CUSTOMER SUCCESS
Leading oil & gas system integrator, specializing in the acquisition, persistence, secure transportation and dissemination of high frequency sensor data.
Tens of thousands of events per second are gathered from oil rigs and various points in this process.
CHALLENGE
Industries such as Oil & Gas use sensor data to automate decision making; this requires:
Keeping latency low and consistent at scale while the system turns insights into decisions executed in a closed-loop system.
Simplifying deployment and maintenance of system in remote locations with limited bandwidth.
Sticking to standards that extend system viability and ease maintenance, lowering TCO; there are many nascent protocols and standards in the emerging IoT industry.
Application high-availability and performance are required for early detection of issues to avoid costly production losses and optimize the productivity of wells.
As the link between the oil rig site and the data center is usually limited, some processing has to be done in the field. This involves operating a processing infrastructure in the limited confines of the field locality.
SOLUTION
Hazelcast Jet is used as the processing backbone. The application uses the Jet API to connect multiple sensor data streams of varying format and frequency. After the streams are joined and filtered, sliding and session windows are used to compute data insights that are turned into decisions.
Hazelcast IMDG, which is embedded in Jet, serves as the operational data store, making the deployment easy to scale. The same Jet-based application is used in bare-metal field installations as well as in the AWS cloud.
The application gathers data from various devices and sensors throughout the oil exploration, drilling, and lifting process. The events are correlated and analyzed in real-time allowing actions to be taken for mitigating risks or maximizing opportunities. Events reported from oil rig sensors are evaluated in the field by Jet so that the settings of the rig are adjusted in real time, i.e., sensor data on torque and depth of the drill are combined in Jet to indicate bottom of the oil reservoir.
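A sliding-window aggregation over sensor readings, as described above, can be sketched as below. This is not the customer's actual code: the simulated source, rig IDs, torque values, and window sizes are all illustrative assumptions (Jet 4.x API, using the TestSources helper to keep the sketch self-contained):

```java
import com.hazelcast.jet.Util;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;
import com.hazelcast.jet.pipeline.test.TestSources;
import java.util.Map;

public class TorqueWindow {
    public static Pipeline build() {
        // Simulated sensor feed: 100 readings/s of (rigId, torque).
        // A 10 s window sliding every 1 s keeps a fresh average torque
        // per rig, which downstream logic could turn into decisions.
        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.itemStream(100,
                (ts, seq) -> Util.entry("rig-" + seq % 8, 50.0 + seq % 10)))
         .withNativeTimestamps(0)
         .groupingKey(Map.Entry::getKey)
         .window(WindowDefinition.sliding(10_000, 1_000))
         .aggregate(AggregateOperations.averagingDouble(Map.Entry::getValue))
         .writeTo(Sinks.logger());
        return p;
    }
}
```

A session window would suit bursty per-rig activity instead of the fixed sliding window shown here.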
WHY HAZELCAST JET
Performance and scalability
In-memory data store with parallel processing enables scalable real-time analytics
Open source standards-based avoids vendor lock-in
CUSTOMER SUCCESS
A global information technology solutions company.
Processes tens of thousands of payments per second today and has built-in scalability with Hazelcast IMDG to support future business growth.
CHALLENGE
Payment processing systems check the details received from a merchant by forwarding them to the respective card’s issuing bank or card association for verification, and also carry out a series of anti-fraud measures before settling the transaction.
All of these payment process steps must happen before the transaction can be cleared. Each step involved in the payment processing pipeline requires the lowest possible latency in order to deliver a positive customer experience.
With 24/7 global operations and hard SLAs, resiliency and automatic recovery are a must have.
SOLUTION
Jet acts as the processing pipeline for each step of the payment process within the payment processing application.
The payment management application acts as an orchestrator which analyses XML payment instructions and forwards them to the respective card’s issuing bank or card association for verification, and also carries out a series of anti-fraud measures against the transaction before settling the payment transaction.
Multiple Jet processing jobs are involved as pipeline components. The distributed IMaps of Hazelcast IMDG are used for transaction ingestion and messaging.
WHY HAZELCAST JET
High-performance connectors between Jet and IMDG enable low-latency operations; consistent low latency of the Hazelcast platform allows the CGI payment management application to keep within strictest SLA requirements.
Automatic recovery of the Jet cluster is an important feature to achieve high-availability even during failures.
Open source standards-based avoids vendor lock-in.
In addition to getting regular open source patch releases, Hazelcast support subscriptions provide customers with Hot Fix patches. Hot Fix patches are a special release of Hazelcast software dedicated to addressing your specific reported issue. They are made to the exact version of Hazelcast that you are running with only the specific bug fix applied.
We then perform Hazelcast Production Assurance with a long production simulation using Hazelcast Simulator with a hardware, operating system and Hazelcast configuration as equivalent as possible to your own using your Hazelcast Simulator Profile. Hot Fix Patches can thus be applied to your production system with no or minimal testing, relying on our testing. We also go red before we go green – reproducing your issue on Simulator first before creating a fix. This way you and we have great confidence the issue is resolved by the Hot Fix.
Every Hazelcast customer can rest assured knowing all the license obligations, conflicts and risks of Hazelcast open source with Black Duck® Protex™ reporting. Poor open source compliance can expose you to costly, time-consuming risks, including litigation and loss of IP. Black Duck Protex is the industry’s leading solution for managing open source license compliance. Protex integrates with existing development tools to automatically scan, identify, and inventory open source software, allowing you to understand license obligations, conflicts and risks. This enables you to mitigate these risks by enforcing license compliance and corporate policy requirements.
Robust stream processing:
Configurable at-least-once or exactly-once processing guarantees
OOTB experience – snapshots are stored in memory (in the embedded IMDG, so backed up in the same way as Hazelcast IMDG data).
When a node fails -> the job is restarted from the last snapshot on the remaining nodes
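The processing guarantee and snapshot interval are set per job via JobConfig; a sketch assuming Jet's JobConfig API (the 10 s interval is an illustrative value):

```java
import com.hazelcast.jet.config.JobConfig;
import com.hazelcast.jet.config.ProcessingGuarantee;

public class FaultTolerantJob {
    public static JobConfig exactlyOnce() {
        // Snapshot the job state every 10 s; on member failure the job
        // restarts from the latest snapshot on the remaining members.
        return new JobConfig()
                .setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE)
                .setSnapshotIntervalMillis(10_000);
    }
    // usage: jet.newJob(pipeline, exactlyOnce());
}
```

A shorter interval lowers the amount of reprocessing after a failure at the cost of more frequent snapshot work.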
Outlook: elasticity (when a node is added, extend the computation to it)
Usability enhancements
- High-level API for batch (0.5) and stream (0.6) processing