SlideShare a Scribd company logo
INTRODUCING
1
Rahul Gupta
@wildnez
2
© 2017 Hazelcast Inc. Confidential & Proprietary
What is Jet?
A general purpose distributed data processing engine
built on Hazelcast and based on directed acyclic graphs
(DAGs) to model data flow.
DISTRIBUTED COMPUTING.SIMPLIFIED.
© 2017 Hazelcast Inc. Confidential & Proprietary
Why?
• Distributed java.util.stream support for Hazelcast
• Extend computational capabilities in Hazelcast
(IExecutorService, EntryProcessor, JobTracker, …)
• Hazelcast provides a nice foundation for distributed data processing so it is
relatively easy to do this.
• Unifying IMDG and advanced data processing capabilities
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Overview
4
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet – Fast Big Data
5
Stream and Fast In-Memory Batch Processing
Enrichment
Databases
IoT
Social
Networks
Enterprise
Applications
Databases/
Hazelcast IMDG
HDFS/
Spark
Stream
Stream
Stream
Batch
Batch
Ingest
Alerts
Enterprise
Applications
Interactive
Analytics
Databases/
Hazelcast IMDG
Output
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Architecture
6
Streaming Readers and Writers
(Kafka, File, Socket)
Stream Processing
(Tumbling, Sliding and Session Windows)
java.util.stream
Batch Readers and Writers
(Hazelcast IMDG IMap, ICache & IList, HDFS, File)
Networking
(IPv4, IPv6)
Deployment
(On Premise, Embedded, AWS, Azure, Docker, Pivotal Cloud Foundry)
Hazelcast IMDG Data Structures
(Distributed Map, Cache, List)
Partition Management
(Members, Lite Members, Master Partition, Replicas, Migrations, Partition Groups, Partition Aware)
Job Management
(Jet Job Lifecycle Management, Resource Distribution and Deployment)
Execution Engine
(Distributed DAG Execution, Processors, Back Pressure Handling, Data Distribution)
Cluster Management with Cloud Discovery SPI
(Apache jclouds, AWS, Azure, Consul, etcd, Heroku, IP List, Kubernetes, Multicast, Zookeeper)
Java Client
Jet Core API
Hazelcast Jet Core
Connectors
High-Level API
Processing
Batch Readers and Writers
(Hazelcast IMDG IMap, ICache & IList, HDFS, File)
Batch Processing
(Aggregations, Join)
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Key Competitive Differentiators
7
• Works great with Hazelcast IMDG | Source, Sink, Enrichment
• Very simple to program | Leverages existing standards
• Very simple to deploy | Same lower stack as Hazelcast
• Works in every Cloud | Same as Hazelcast IMDG
• High Performance | Fastest word count in town!
• For Developers by Developers | Code it
© 2017 Hazelcast Inc. Confidential & Proprietary
Jet Application Deployment Options
IMDG IMDG
• No separate process to manage
• Great for microservices
• Great for OEM
• Simplest for Ops – nothing extra
Embedded
Application
Java API
Application
Java API
Application
Java API
Java API
Application
Client-Server
Jet
Jet Jet
Jet
Jet Jet
Java API
Application
Java API
Application
Java API
Application
• Separate Jet Cluster
• Scale Jet independent of Applications
• Isolate Jet from Application Server Lifecycle
• Managed by Ops
© 2017 Hazelcast Inc. Confidential & Proprietary
Jet with Hazelcast Deployment Choices
Jet Compute
Cluster
Hazelcast
IMDG Cluster
Sink Enrichment
Message
Broker
(Kafka)
Data
Enrichment
HDFS
Jet Cluster
Sink
Source /
Enrichment
Good when:
• Where source and sink are primarily Hazelcast
• Jet and Hazelcast have equivalent sizing needs
Good when:
• Where source and sink are primarily Hazelcast
• Where you want isolation of the Jet cluster
© 2017 Hazelcast Inc. Confidential & Proprietary
Performance - Latency
10
© 2017 Hazelcast Inc. Confidential & Proprietary
Performance - Throughput
© 2017 Hazelcast Inc. Confidential & Proprietary
Performance - ForkJoin
• Jet is faster than JDK’s j.u.s implementation. The issue is in the JDK the
character stream reports "unknown size" and thus it cannot be parallelised.
• If you first load the data into a List in RAM then run the JDK’s j.u.s it comes out
a little faster
© 2017 Hazelcast Inc. Confidential & Proprietary
Jet is a DAG Processing Engine
13
© 2017 Hazelcast Inc. Confidential & Proprietary
DAG Processing
• Directed Acyclic Graphs are used to model computations
• Each vertex is a step in the computation
• It is a generalisation of the MapReduce paradigm
• Supports both batch and stream processing
• Other systems that use DAGs: Apache Tez, Flink, Spark…
© 2017 Hazelcast Inc. Confidential & Proprietary
Example: Word Count
If we lived in a single-threaded world:
1. Iterate through all the lines
2. Split the line into words
3. Update running total of counts with each word
final String text = "...";
final Pattern pattern = Pattern.compile("s+");
final Map<String, Long> counts = new HashMap<>();
for (String word : pattern.split(text)) {
counts.compute(word, (w, c) -> c == null ? 1L : c + 1);
}
© 2017 Hazelcast Inc. Confidential & Proprietary
Input Output
Still single-threaded execution:
each Vertex is executed in turn
Tokenizer Reducer
Split the text into words
For each word emit (word)
Collect running totals
Once everything is finished,
emit all pairs of (word, count)
(text) (word) (word, count)
We can represent the computation as a DAG
© 2017 Hazelcast Inc. Confidential & Proprietary
Input
(text) (word)
Output
(word, count)
Tokenizer Reducer
Split the text into words
For each word emit (word)
Collect running totals.
Once everything is finished,
emit all pairs of (word, count)
We can parallelise the execution of the vertices by
introducing concurrent queues between the vertices.
© 2017 Hazelcast Inc. Confidential & Proprietary
Output
(word, count)
Reducer
By dividing the text into lines, and having multiple threads,
each tokenizer can process a separate line thus
parallelizing the work.
Input
Tokenizer
Tokenizer
© 2017 Hazelcast Inc. Confidential & Proprietary
(word)
(word)
Input Output
Tokenizer
We only need to ensure the same words
go to the same Reducer.
The Reducer can also be executed in parallel by
partitioning by the individual words.
Tokenizer
Reducer
Reducer
© 2017 Hazelcast Inc. Confidential & Proprietary
Node
Node
The steps can also be distributed across multiple nodes.
To do this you need a distributed partitioning scheme.
Input Output
Tokenizer
Tokenizer
Reducer
Reducer
Combiner
Combiner
Input Output
Tokenizer
Tokenizer
Reducer
Reducer
Combiner
Combiner
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Overview
21
© 2017 Hazelcast Inc. Confidential & Proprietary
jet-core
• Provides low-latency and high-throughput distributed
DAG execution
• Hazelcast provides partitioning, discovery, networking
and serialization and elasticity
• Jobs are submitted to the cluster and run in an isolated class
loader with spans the lifetime of the job. Many jobs can be
run in the cluster at once. Each vertex in the graph is
represented by a processor which does not need to be
thread safe
• Vertices are connected by edges.
• Processors are executed by Tasklets, which are allocated to
threads.
© 2017 Hazelcast Inc. Confidential & Proprietary
Processors
• Main work horse of a Jet application – each vertex must have a Processor
• Just Java code
• It typically takes some input, and emits some output
• Also can act as a source or a sink
• Convenience processors for: flatMap, groupAndAccumulate, filter
Vertex tokenize = dag.newVertex("tokenize",
flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
.filter(word -> !word.isEmpty()))
);
Vertex reduce = dag.newVertex("reduce",
groupAndAccumulate(initialZero, (count, x) -> count + 1)
);
Vertex combine = dag.newVertex("combine",
groupAndAccumulate(Entry<String, Long>::getKey, initialZero,
(Long count, Entry<String, Long> wordAndCount) -> count + wordAndCount.getValue())
);
© 2017 Hazelcast Inc. Confidential & Proprietary
DAG Execution
• Each node in the cluster runs the whole graph
• Each vertex is executed by a number of tasklets
• Bounded number of execution threads (typically system
processor count)
• Work-stealing between execution threads
• Back pressure applies between vertices
© 2017 Hazelcast Inc. Confidential & Proprietary
Cooperative Multithreading
• Similar to green threads
• Tasklets run in a loop serviced by the same native thread.
- No context switching.
- Almost guaranteed core affinity
• Each tasklet does small amount of work at a time (<1ms)
• Cooperative tasklets must be non-blocking.
• Each thread can handle thousands of cooperative tasklets
• If there isn’t any work for a thread, it backs off
© 2017 Hazelcast Inc. Confidential & Proprietary
Cooperative Multithreading
• Edges are implemented by lock-free single producer, single
consumer queues
- It employs wait-free algorithms on both sides and avoids
volatile writes by using lazySet.
• Load balancing via back pressure
• Tasklets can also be non-cooperative, in which case they have a
dedicated thread and may perform blocking operations.
© 2017 Hazelcast Inc. Confidential & Proprietary
Producer
Tasklet 3
Producer
Tasklet 4
Producer
Tasklet 4
Producer
Tasklet 2
Consumer
Tasklet 2
Consumer
Tasklet 1
Producer
Tasklet 4
Producer
Tasklet 2
Producer
Tasklet 1
Consumer
Tasklet 2
Consumer
Tasklet 1
Producer
Tasklet 1
Consumer
Vertex
Producer
Vertex
Edge
Parallelism=4
Output of Producer Vertex is
connected to input of
Consumer Vertex by
configuration of the Edge.
Thread 1 Thread 2
Execution Service
Tasklets – Unit of Execution
Parallelism=2
© 2017 Hazelcast Inc. Confidential & Proprietary
Edges
• Different types of edges:
- Variable Unicast – pick any downstream processor
- Broadcast – emit to all downstream processors
- Partitioned – pick based on a key
• Vertices can have more than one input: allows joins and co-
group
• Vertices can have more than one output: splits
• Edges can be local or distributed
© 2017 Hazelcast Inc. Confidential & Proprietary
Data input and output
• Reader and writers for Hazelcast IMap, ILap and IList
• HDFS
• Kafka
• Socket (text encoding)
• File
• Easy to implement your own – just implement the Processor
interface. If a source, implement complete(). If a sink
implement process();
© 2017 Hazelcast Inc. Confidential & Proprietary
APIs
© 2017 Hazelcast Inc. Confidential & Proprietary
Processor API
• Unified API for sinks, sources and intermediate steps
• Each Processor has an Inbox and Outbox per inbound and outbound
edge.
• Two main methods to implement:
• boolean tryProcess(int ordinal, Object item)
• Process incoming item and emit new items by populating the outbox
• boolean complete()
• Called after all upstream processors are also completed. Typically used for sources and
batch operations such as group by and distinct.
• Slight differences in API for cooperative vs non-cooperative processors
• Non-cooperative processors may block indefinitely
• Cooperative processors must respect Outbox when emitting and yield if Outbox is
already full.
© 2017 Hazelcast Inc. Confidential & Proprietary
Core API – Powerful, Low Level API
DAG describes how vertices are connected to each other:
DAG dag = new DAG();
Vertex source = dag.newVertex("source", readMap(DOCID_NAME));
Vertex docLines = dag.newVertex("doc-lines", DocLinesP::new);
Vertex tokenize = dag.newVertex("tokenize",
flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
.filter(word -> !word.isEmpty())));
Vertex reduce = dag.newVertex("reduce",
groupAndAccumulate(initialZero, (count, x) -> count + 1));
Vertex combine = dag.newVertex("combine",
groupAndAccumulate(Entry<String, Long>::getKey, initialZero,
(Long count, Entry<String, Long> wordAndCount) ->
count + wordAndCount.getValue()));
Vertex sink = dag.newVertex("sink", writeMap("counts"));
dag.edge(between(source.localParallelism(1), docLines))
.edge(between(docLines.localParallelism(1), tokenize))
.edge(between(tokenize, reduce).partitioned(wholeItem(), HASH_CODE))
.edge(between(reduce, combine).distributed().partitioned(entryKey()))
.edge(between(combine, sink));
jet.newJob(dag).execute();
© 2017 Hazelcast Inc. Confidential & Proprietary
Distributed java.util.stream API
• “Classes to support functional-style operations on streams of
elements, such as map-reduce transformations on collections.”
• Jet adds distributed support for the java.util.stream API for
Hazelcast Map and List.
• Supports all j.u.s. operations such as:
-map(), flatMap(), filter()
-reduce(), collect()
-sorted(), distinct()
• Lambda serialization is a problem, but is worked around by creating
serializable versions of the interfaces
• j.u.s Streams are converted to Processor API (DAG) for execution
© 2017 Hazelcast Inc. Confidential & Proprietary
j.u.s API
JetInstance jet = Jet.newJetInstance();
Jet.newJetInstance();
IStreamMap<Long, String> lines = jet.getMap("lines");
Map<String, Long> counts = lines
.stream()
.flatMap(m -> Stream.of(PATTERN.split(m.getValue().toLowerCase())))
.filter(w -> !w.isEmpty())
.collect(DistributedCollectors.toIMap(w -> w, w -> 1L, (left, right) -> left + right));
© 2017 Hazelcast Inc. Confidential & Proprietary
API Comparison
• Processor API is low-level, powerful and can be used to
solve any problem that can be expressed as a DAG
• java.util.stream is familiar, but only limited functionality.
• All data processing frameworks have their own API
• Some attempts at unification are not major successes:
Cascading, Apache Beam …
• Cascading prototype for Jet
© 2017 Hazelcast Inc. Confidential & Proprietary
Demo
© 2017 Hazelcast Inc. Confidential & Proprietary
Roadmap
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet 0.4 (06/17)
38
Features Description
Distributed Computation
Distributed data processing framework. It’s based on directed acyclic graphs (DAGs) and in-memory
computation.
Distributed java.util.stream Java 8 stream API can be used to describe the Jet Job. The execution is distributed.
Windowing with Event-time
Improved streaming support including windowing support with event-time semantics. Out of the box
support for tumbling, sliding and session window aggregations.
Distributed Data Structures
Distributed implementations of Map, Cache and List data structures highly optimized to be used for the
processing.
Pivotal Cloud Foundry Tile Hazelcast Jet is available in Pivotal Cloud Foundry
Discovery and Networking Hazelcast discovery for finding members in the cluster or in the Cloud.
Hazelcast IMDG Connector
Allows Jet to integrate with both embedded and non-embedded Hazelcast instances to be used for
reading and writing data for distributed computation.
File and Socket Connector Ability to read and write unbounded data from text sockets and files.
Kafka Connector Reads and writes data streams to the Kafka message broker.
HDFS Connector Reads and writes data streams to Hadoop Distributed File System
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast 2017 Jet Roadmap
39
Features Description
Robust Stream Processing Guarantees for stateful stream processing
Usability Enhancements High-Level API | Tooling
High Performance
Hazelcast Integrations
Streaming Map and Cache events | Projection and Predicate support for
Map source
Management Center Management and monitoring features for Jet.
More Connectors JMS | JDBC
© 2017 Hazelcast Inc. Confidential & Proprietary
Questions?
Version 0.4 is the current release with 0.5 coming in September
http://jet.hazelcast.org
Minimum JDK 8

More Related Content

What's hot

Building scalable applications with hazelcast
Building scalable applications with hazelcastBuilding scalable applications with hazelcast
Building scalable applications with hazelcast
Fuad Malikov
 
Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)
Emrah Kocaman
 
Hazelcast 101
Hazelcast 101Hazelcast 101
Hazelcast 101
Emrah Kocaman
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Hazelcast
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
bigdatagurus_meetup
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
DataWorks Summit
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
Taras Matyashovsky
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila final
Wei Ting Chen
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
DataWorks Summit
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
DataWorks Summit
 
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringSpring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Emrah Kocaman
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
DataWorks Summit
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
Peter Clapham
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
DataWorks Summit
 
Red Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and Future
Red_Hat_Storage
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
InMobi Technology
 
Hazelcast Introduction
Hazelcast IntroductionHazelcast Introduction
Hazelcast Introduction
CodeOps Technologies LLP
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
DataWorks Summit
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
Yahoo Developer Network
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
DataStax
 

What's hot (20)

Building scalable applications with hazelcast
Building scalable applications with hazelcastBuilding scalable applications with hazelcast
Building scalable applications with hazelcast
 
Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)
 
Hazelcast 101
Hazelcast 101Hazelcast 101
Hazelcast 101
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila final
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringSpring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
 
Red Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and Future
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Hazelcast Introduction
Hazelcast IntroductionHazelcast Introduction
Hazelcast Introduction
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 

Similar to Hazelcast Jet v0.4 - August 9, 2017

Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
Data Con LA
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
Vladimír Schreiner
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
Hortonworks
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018
harvraja
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Amazon Web Services
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
Bikas Saha
 
Architecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptxArchitecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptx
shaikshazil1
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
hitesh1892
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Docker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. ServerlessDocker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. Serverless
LogicworksNY
 
Apache ignite v1.3
Apache ignite v1.3Apache ignite v1.3
Apache ignite v1.3
Klearchos Klearchou
 
Streaming solutions for real time problems
Streaming solutions for real time problems Streaming solutions for real time problems
Streaming solutions for real time problems
Aparna Gaonkar
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
Matthew Rocklin
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
Hortonworks
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
InfluxData
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
Yahoo Developer Network
 
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam DillardInfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxData
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
Satya Narayan
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
MLconf
 
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High PerformanceMysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
Bernd Ocklin
 

Similar to Hazelcast Jet v0.4 - August 9, 2017 (20)

Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Architecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptxArchitecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptx
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Docker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. ServerlessDocker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. Serverless
 
Apache ignite v1.3
Apache ignite v1.3Apache ignite v1.3
Apache ignite v1.3
 
Streaming solutions for real time problems
Streaming solutions for real time problems Streaming solutions for real time problems
Streaming solutions for real time problems
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam DillardInfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High PerformanceMysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
 

Recently uploaded

Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 

Recently uploaded (20)

Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 

Hazelcast Jet v0.4 - August 9, 2017

  • 2. © 2017 Hazelcast Inc. Confidential & Proprietary What is Jet? A general purpose distributed data processing engine built on Hazelcast and based on directed acyclic graphs (DAGs) to model data flow. DISTRIBUTED COMPUTING.SIMPLIFIED.
  • 3. © 2017 Hazelcast Inc. Confidential & Proprietary Why? • Distributed java.util.stream support for Hazelcast • Extend computational capabilities in Hazelcast (IExecutorService, EntryProcessor, JobTracker, …) • Hazelcast provides a nice foundation for distributed data processing so it is relatively easy to do this. • Unifying IMDG and advanced data processing capabilities
  • 4. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Overview 4
  • 5. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet – Fast Big Data 5 Stream and Fast In-Memory Batch Processing Enrichment Databases IoT Social Networks Enterprise Applications Databases/ Hazelcast IMDG HDFS/ Spark Stream Stream Stream Batch Batch Ingest Alerts Enterprise Applications Interactive Analytics Databases/ Hazelcast IMDG Output
  • 6. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Architecture 6 Streaming Readers and Writers (Kafka, File, Socket) Stream Processing (Tumbling, Sliding and Session Windows) java.util.stream Batch Readers and Writers (Hazelcast IMDG IMap, ICache & IList, HDFS, File) Networking (IPv4, IPv6) Deployment (On Premise, Embedded, AWS, Azure, Docker, Pivotal Cloud Foundry) Hazelcast IMDG Data Structures (Distributed Map, Cache, List) Partition Management (Members, Lite Members, Master Partition, Replicas, Migrations, Partition Groups, Partition Aware) Job Management (Jet Job Lifecycle Management, Resource Distribution and Deployment) Execution Engine (Distributed DAG Execution, Processors, Back Pressure Handling, Data Distribution) Cluster Management with Cloud Discovery SPI (Apache jclouds, AWS, Azure, Consul, etcd, Heroku, IP List, Kubernetes, Multicast, Zookeeper) Java Client Jet Core API Hazelcast Jet Core Connectors High-Level API Processing Batch Readers and Writers (Hazelcast IMDG IMap, ICache & IList, HDFS, File) Batch Processing (Aggregations, Join)
  • 7. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Key Competitive Differentiators 7 • Works great with Hazelcast IMDG | Source, Sink, Enrichment • Very simple to program | Leverages existing standards • Very simple to deploy | Same lower stack as Hazelcast • Works in every Cloud | Same as Hazelcast IMDG • High Performance | Fastest word count in town! • For Developers by Developers | Code it
  • 8. © 2017 Hazelcast Inc. Confidential & Proprietary Jet Application Deployment Options IMDG IMDG • No separate process to manage • Great for microservices • Great for OEM • Simplest for Ops – nothing extra Embedded Application Java API Application Java API Application Java API Java API Application Client-Server Jet Jet Jet Jet Jet Jet Java API Application Java API Application Java API Application • Separate Jet Cluster • Scale Jet independent of Applications • Isolate Jet from Application Server Lifecycle • Managed by Ops
  • 9. © 2017 Hazelcast Inc. Confidential & Proprietary Jet with Hazelcast Deployment Choices Jet Compute Cluster Hazelcast IMDG Cluster Sink Enrichment Message Broker (Kafka) Data Enrichment HDFS Jet Cluster Sink Source / Enrichment Good when: • Where source and sink are primarily Hazelcast • Jet and Hazelcast have equivalent sizing needs Good when: • Where source and sink are primarily Hazelcast • Where you want isolation of the Jet cluster
  • 10. © 2017 Hazelcast Inc. Confidential & Proprietary Performance - Latency 10
  • 11. © 2017 Hazelcast Inc. Confidential & Proprietary Performance - Throughput
  • 12. © 2017 Hazelcast Inc. Confidential & Proprietary Performance - ForkJoin • Jet is faster than JDK’s j.u.s implementation. The issue is in the JDK the character stream reports "unknown size" and thus it cannot be parallelised. • If you first load the data into a List in RAM then run the JDK’s j.u.s it comes out a little faster
  • 13. © 2017 Hazelcast Inc. Confidential & Proprietary Jet is a DAG Processing Engine 13
  • 14. © 2017 Hazelcast Inc. Confidential & Proprietary DAG Processing • Directed Acyclic Graphs are used to model computations • Each vertex is a step in the computation • It is a generalisation of the MapReduce paradigm • Supports both batch and stream processing • Other systems that use DAGs: Apache Tez, Flink, Spark…
  • 15. © 2017 Hazelcast Inc. Confidential & Proprietary Example: Word Count If we lived in a single-threaded world: 1. Iterate through all the lines 2. Split the line into words 3. Update running total of counts with each word final String text = "..."; final Pattern pattern = Pattern.compile("s+"); final Map<String, Long> counts = new HashMap<>(); for (String word : pattern.split(text)) { counts.compute(word, (w, c) -> c == null ? 1L : c + 1); }
  • 16. © 2017 Hazelcast Inc. Confidential & Proprietary Input Output Still single-threaded execution: each Vertex is executed in turn Tokenizer Reducer Split the text into words For each word emit (word) Collect running totals Once everything is finished, emit all pairs of (word, count) (text) (word) (word, count) We can represent the computation as a DAG
  • 17. © 2017 Hazelcast Inc. Confidential & Proprietary Input (text) (word) Output (word, count) Tokenizer Reducer Split the text into words For each word emit (word) Collect running totals. Once everything is finished, emit all pairs of (word, count) We can parallelise the execution of the vertices by introducing concurrent queues between the vertices.
  • 18. © 2017 Hazelcast Inc. Confidential & Proprietary Output (word, count) Reducer By dividing the text into lines, and having multiple threads, each tokenizer can process a separate line thus parallelizing the work. Input Tokenizer Tokenizer
  • 19. © 2017 Hazelcast Inc. Confidential & Proprietary (word) (word) Input Output Tokenizer We only need to ensure the same words go to the same Reducer. The Reducer can also be executed in parallel by partitioning by the individual words. Tokenizer Reducer Reducer
  • 20. © 2017 Hazelcast Inc. Confidential & Proprietary Node Node The steps can also be distributed across multiple nodes. To do this you need a distributed partitioning scheme. Input Output Tokenizer Tokenizer Reducer Reducer Combiner Combiner Input Output Tokenizer Tokenizer Reducer Reducer Combiner Combiner
  • 21. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Overview 21
  • 22. © 2017 Hazelcast Inc. Confidential & Proprietary jet-core • Provides low-latency and high-throughput distributed DAG execution • Hazelcast provides partitioning, discovery, networking and serialization and elasticity • Jobs are submitted to the cluster and run in an isolated class loader with spans the lifetime of the job. Many jobs can be run in the cluster at once. Each vertex in the graph is represented by a processor which does not need to be thread safe • Vertices are connected by edges. • Processors are executed by Tasklets, which are allocated to threads.
  • 23. © 2017 Hazelcast Inc. Confidential & Proprietary Processors • Main work horse of a Jet application – each vertex must have a Processor • Just Java code • It typically takes some input, and emits some output • Also can act as a source or a sink • Convenience processors for: flatMap, groupAndAccumulate, filter Vertex tokenize = dag.newVertex("tokenize", flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase())) .filter(word -> !word.isEmpty())) ); Vertex reduce = dag.newVertex("reduce", groupAndAccumulate(initialZero, (count, x) -> count + 1) ); Vertex combine = dag.newVertex("combine", groupAndAccumulate(Entry<String, Long>::getKey, initialZero, (Long count, Entry<String, Long> wordAndCount) -> count + wordAndCount.getValue()) );
  • 24. © 2017 Hazelcast Inc. Confidential & Proprietary DAG Execution • Each node in the cluster runs the whole graph • Each vertex is executed by a number of tasklets • Bounded number of execution threads (typically system processor count) • Work-stealing between execution threads • Back pressure applies between vertices
  • 25. © 2017 Hazelcast Inc. Confidential & Proprietary Cooperative Multithreading • Similar to green threads • Tasklets run in a loop serviced by the same native thread. - No context switching. - Almost guaranteed core affinity • Each tasklet does small amount of work at a time (<1ms) • Cooperative tasklets must be non-blocking. • Each thread can handle thousands of cooperative tasklets • If there isn’t any work for a thread, it backs off
  • 26. © 2017 Hazelcast Inc. Confidential & Proprietary Cooperative Multithreading • Edges are implemented by lock-free single producer, single consumer queues - It employs wait-free algorithms on both sides and avoids volatile writes by using lazySet. • Load balancing via back pressure • Tasklets can also be non-cooperative, in which case they have a dedicated thread and may perform blocking operations.
  • 27. © 2017 Hazelcast Inc. Confidential & Proprietary Producer Tasklet 3 Producer Tasklet 4 Producer Tasklet 4 Producer Tasklet 2 Consumer Tasklet 2 Consumer Tasklet 1 Producer Tasklet 4 Producer Tasklet 2 Producer Tasklet 1 Consumer Tasklet 2 Consumer Tasklet 1 Producer Tasklet 1 Consumer Vertex Producer Vertex Edge Parallelism=4 Output of Producer Vertex is connected to input of Consumer Vertex by configuration of the Edge. Thread 1 Thread 2 Execution Service Tasklets – Unit of Execution Parallelism=2
  • 28. © 2017 Hazelcast Inc. Confidential & Proprietary Edges • Different types of edges: - Variable Unicast – pick any downstream processor - Broadcast – emit to all downstream processors - Partitioned – pick based on a key • Vertices can have more than one input: allows joins and co- group • Vertices can have more than one output: splits • Edges can be local or distributed
  • 29. © 2017 Hazelcast Inc. Confidential & Proprietary Data input and output • Reader and writers for Hazelcast IMap, ILap and IList • HDFS • Kafka • Socket (text encoding) • File • Easy to implement your own – just implement the Processor interface. If a source, implement complete(). If a sink implement process();
  • 30. © 2017 Hazelcast Inc. Confidential & Proprietary APIs
  • 31. © 2017 Hazelcast Inc. Confidential & Proprietary Processor API • Unified API for sinks, sources and intermediate steps • Each Processor has an Inbox and Outbox per inbound and outbound edge. • Two main methods to implement: • boolean tryProcess(int ordinal, Object item) • Process incoming item and emit new items by populating the outbox • boolean complete() • Called after all upstream processors are also completed. Typically used for sources and batch operations such as group by and distinct. • Slight differences in API for cooperative vs non-cooperative processors • Non-cooperative processors may block indefinitely • Cooperative processors must respect Outbox when emitting and yield if Outbox is already full.
  • 32. © 2017 Hazelcast Inc. Confidential & Proprietary Core API – Powerful, Low Level API DAG describes how vertices are connected to each other: DAG dag = new DAG(); Vertex source = dag.newVertex("source", readMap(DOCID_NAME)); Vertex docLines = dag.newVertex("doc-lines", DocLinesP::new); Vertex tokenize = dag.newVertex("tokenize", flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase())) .filter(word -> !word.isEmpty()))); Vertex reduce = dag.newVertex("reduce", groupAndAccumulate(initialZero, (count, x) -> count + 1)); Vertex combine = dag.newVertex("combine", groupAndAccumulate(Entry<String, Long>::getKey, initialZero, (Long count, Entry<String, Long> wordAndCount) -> count + wordAndCount.getValue())); Vertex sink = dag.newVertex("sink", writeMap("counts")); dag.edge(between(source.localParallelism(1), docLines)) .edge(between(docLines.localParallelism(1), tokenize)) .edge(between(tokenize, reduce).partitioned(wholeItem(), HASH_CODE)) .edge(between(reduce, combine).distributed().partitioned(entryKey())) .edge(between(combine, sink)); jet.newJob(dag).execute();
  • 33. © 2017 Hazelcast Inc. Confidential & Proprietary Distributed java.util.stream API • “Classes to support functional-style operations on streams of elements, such as map-reduce transformations on collections.” • Jet adds distributed support for the java.util.stream API for Hazelcast Map and List. • Supports all j.u.s. operations such as: -map(), flatMap(), filter() -reduce(), collect() -sorted(), distinct() • Lambda serialization is a problem, but is worked around by creating serializable versions of the interfaces • j.u.s Streams are converted to Processor API (DAG) for execution
  • 34. © 2017 Hazelcast Inc. Confidential & Proprietary j.u.s API JetInstance jet = Jet.newJetInstance(); Jet.newJetInstance(); IStreamMap<Long, String> lines = jet.getMap("lines"); Map<String, Long> counts = lines .stream() .flatMap(m -> Stream.of(PATTERN.split(m.getValue().toLowerCase()))) .filter(w -> !w.isEmpty()) .collect(DistributedCollectors.toIMap(w -> w, w -> 1L, (left, right) -> left + right));
  • 35. © 2017 Hazelcast Inc. Confidential & Proprietary API Comparison • Processor API is low-level, powerful and can be used to solve any problem that can be expressed as a DAG • java.util.stream is familiar, but only limited functionality. • All data processing frameworks have their own API • Some attempts at unification are not major successes: Cascading, Apache Beam … • Cascading prototype for Jet
  • 36. © 2017 Hazelcast Inc. Confidential & Proprietary Demo
  • 37. © 2017 Hazelcast Inc. Confidential & Proprietary Roadmap
  • 38. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet 0.4 (06/17) 38 Features Description Distributed Computation Distributed data processing framework. It’s based on directed acyclic graphs (DAGs) and in-memory computation. Distributed java.util.stream Java 8 stream API can be used to describe the Jet Job. The execution is distributed. Windowing with Event-time Improved streaming support including windowing support with event-time semantics. Out of the box support for tumbling, sliding and session window aggregations. Distributed Data Structures Distributed implementations of Map, Cache and List data structures highly optimized to be used for the processing. Pivotal Cloud Foundry Tile Hazelcast Jet is available in Pivotal Cloud Foundry Discovery and Networking Hazelcast discovery for finding members in the cluster or in the Cloud. Hazelcast IMDG Connector Allows Jet to integrate with both embedded and non-embedded Hazelcast instances to be used for reading and writing data for distributed computation. File and Socket Connector Ability to read and write unbounded data from text sockets and files. Kafka Connector Reads and writes data streams to the Kafka message broker. HDFS Connector Reads and writes data streams to Hadoop Distributed File System
  • 39. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast 2017 Jet Roadmap 39 Features Description Robust Stream Processing Guarantees for stateful stream processing Usability Enhancements High-Level API | Tooling High Performance Hazelcast Integrations Streaming Map and Cache events | Projection and Predicate support for Map source Management Center Management and monitoring features for Jet. More Connectors JMS | JDBC
  • 40. © 2017 Hazelcast Inc. Confidential & Proprietary Questions? Version 0.4 is the current release with 0.5 coming in September http://jet.hazelcast.org Minimum JDK 8

Editor's Notes

  1. General idea across the slide deck: The value of Jet is in simplicity and integration with IMDG Big data isn’t enough, fast big data is what matters (information available in a near real-time fashion) Jet grabs the information from the data in near real-time and stores it to IMDG staying in-memory and distributed
  2. At a business level, Hazelcast Jet is a distributed computing platform for fast processing of big data streaming sets. Jet is based on a parallel, core engine allowing data-intensive applications to operate at near real-time speeds.
  3. Jet vision = fast big data, real-time stream processor + fast batch processor How the ecosystem and looks like. Jet is part of the data-processing pipeline Two major deployment setups - stream and batch. Batch — A complete dataset is available before the processing starts. The data are submitted as one batch to be processed. Usually in Database, IMDG, on filesystem Data are loaded from (potentially multiple) sources, joined, transformed, cleaned. Then stored and/or left in IMDG for further in-memory usage. - Typically used for ETL (Extract-transfer-load) tasks. Stream - data potentially unbounded and infinite the goal is to process the data as soon as possible. Process = extract important information from the data. The whole processing stays in memory, therefore another systems may be alerted in near real-time. Data ingested to the system via message broker (Kafka) The streaming deployment is useful for use-cases, where real-time behaviour matters. Streaming features Streaming core of Jet (low latency, per-event processing). Windowing and late-events processing
  4. Hazelcast can be used in two different modes: Embedded Mode is very easy to setup and run. It can only be done with peer Java applications. Developers really like this mode during prototyping phases. Client-Server Mode is more work but efficient once an application is required to scale to larger volume of transactions or users as it separates the application from Hazelcast IMDG for better manageability. It can be done with different types of applications (C+, C#/.NET, Python, Scala, Node.js) using native clients provided by Hazelcast (C+, C#/.NET, Python, Scala, Node.js). Hazelcast Open Binary Protocol enables clients for any other language to be developed as needed (see Hazelcast IMDG Docs). With both client-server and embedded architecture, Hazelcast IMDG offers: + Elasticity and scalability + Transparent integration with backend databases using Map Store interfaces + Management of the cluster through your web browser with Management Center
  5. Jet is build on top of Hazelcast IMDG with the benefits of close integration. Hazelcast Jet and IMDG can run in these deployments: Every Jet cluster works on top of embedded Hazelcast in-memory data grid. Embedded deployment suits well for: Sharing processing state among Jet nodes Caching intermediate processing results Enriching processed events. Cache remote data (e.g. fact tables from database) on Jet nodes. Running advanced data processing tasks on top of Hazelcast data structures Development purposes. Since starting standalone Jet cluster is simple and fast. Remote deployment Share state or intermediate results among more Jet jobs Cache intermediate results (e.g. to show real-time processing stats on dashboard) Load input or store results of Jet computation in in-memory fashion to IMDG Isolation of processing and storage layer Jet x IMDG – product differentiation Both Jet and IMDG are for parallel and distributed computing, why Jet is more expensive? Is Hazelcast IMDG not good enough?
  6. The streaming benchmark is based on a simple stock exchange aggregation. Each message representing a trade is published to Kafka and then a simple windowed aggregation which calculates the number of traders per ticker per window is done using various data processing frameworks.
  7. Compared to Spark, Flink doing word count, data on HDFS/IMap, cluster of 10 nodes, 32 cores each…
  8. Here one thread is doing tokenizing and one is doing accumulation. Each can operate at its own speed using the concurrent queue to buffer work.
  9. Not the same as Fork-Join in JDK as it uses a single thread for all processing steps. parallelStream allocates a Fork-Join task per data partition but that’s all.
  10. Each Reducer is a different instance. It can be across the network or local depending on how you create your DAG model. In our j.u.s implementation reducers are always local and there is a network step for the combiner.
  11. All the nodes need to agree on who owns what partition. The data is processed locally and then sent to the partition owner as part of the combiner step. Jet uses the same partition scheme as Hazelcast. So if your Combiner step is a collect to a Hazelcast iMap the final output will be local. So there is only one network hop in the entire DAG processing. Writing to HDFS is local but then is replicated out.
  12. It‘s to be replaced by high-level What API in Jet 0.5 (opposed to „How“ core API)