SlideShare a Scribd company logo
1 of 40
INTRODUCING
1
Rahul Gupta
@wildnez
2
© 2017 Hazelcast Inc. Confidential & Proprietary
What is Jet?
A general purpose distributed data processing engine
built on Hazelcast and based on directed acyclic graphs
(DAGs) to model data flow.
DISTRIBUTED COMPUTING.SIMPLIFIED.
© 2017 Hazelcast Inc. Confidential & Proprietary
Why?
• Distributed java.util.stream support for Hazelcast
• Extend computational capabilities in Hazelcast
(IExecutorService, EntryProcessor, JobTracker, …)
• Hazelcast provides a nice foundation for distributed data processing so it is
relatively easy to do this.
• Unifying IMDG and advanced data processing capabilities
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Overview
4
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet – Fast Big Data
5
Stream and Fast In-Memory Batch Processing
Enrichment
Databases
IoT
Social
Networks
Enterprise
Applications
Databases/
Hazelcast IMDG
HDFS/
Spark
Stream
Stream
Stream
Batch
Batch
Ingest
Alerts
Enterprise
Applications
Interactive
Analytics
Databases/
Hazelcast IMDG
Output
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Architecture
6
Streaming Readers and Writers
(Kafka, File, Socket)
Stream Processing
(Tumbling, Sliding and Session Windows)
java.util.stream
Batch Readers and Writers
(Hazelcast IMDG IMap, ICache & IList, HDFS, File)
Networking
(IPv4, IPv6)
Deployment
(On Premise, Embedded, AWS, Azure, Docker, Pivotal Cloud Foundry)
Hazelcast IMDG Data Structures
(Distributed Map, Cache, List)
Partition Management
(Members, Lite Members, Master Partition, Replicas, Migrations, Partition Groups, Partition Aware)
Job Management
(Jet Job Lifecycle Management, Resource Distribution and Deployment)
Execution Engine
(Distributed DAG Execution, Processors, Back Pressure Handling, Data Distribution)
Cluster Management with Cloud Discovery SPI
(Apache jclouds, AWS, Azure, Consul, etcd, Heroku, IP List, Kubernetes, Multicast, Zookeeper)
Java Client
Jet Core API
Hazelcast Jet Core
Connectors
High-Level API
Processing
Batch Readers and Writers
(Hazelcast IMDG IMap, ICache & IList, HDFS, File)
Batch Processing
(Aggregations, Join)
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Key Competitive Differentiators
7
• Works great with Hazelcast IMDG | Source, Sink, Enrichment
• Very simple to program | Leverages existing standards
• Very simple to deploy | Same lower stack as Hazelcast
• Works in every Cloud | Same as Hazelcast IMDG
• High Performance | Fastest word count in town!
• For Developers by Developers | Code it
© 2017 Hazelcast Inc. Confidential & Proprietary
Jet Application Deployment Options
IMDG IMDG
• No separate process to manage
• Great for microservices
• Great for OEM
• Simplest for Ops – nothing extra
Embedded
Application
Java API
Application
Java API
Application
Java API
Java API
Application
Client-Server
Jet
Jet Jet
Jet
Jet Jet
Java API
Application
Java API
Application
Java API
Application
• Separate Jet Cluster
• Scale Jet independent of Applications
• Isolate Jet from Application Server Lifecycle
• Managed by Ops
© 2017 Hazelcast Inc. Confidential & Proprietary
Jet with Hazelcast Deployment Choices
Jet Compute
Cluster
Hazelcast
IMDG Cluster
Sink Enrichment
Message
Broker
(Kafka)
Data
Enrichment
HDFS
Jet Cluster
Sink
Source /
Enrichment
Good when:
• Where source and sink are primarily Hazelcast
• Jet and Hazelcast have equivalent sizing needs
Good when:
• Where source and sink are primarily Hazelcast
• Where you want isolation of the Jet cluster
© 2017 Hazelcast Inc. Confidential & Proprietary
Performance - Latency
10
© 2017 Hazelcast Inc. Confidential & Proprietary
Performance - Throughput
© 2017 Hazelcast Inc. Confidential & Proprietary
Performance - ForkJoin
• Jet is faster than JDK’s j.u.s implementation. The issue is in the JDK the
character stream reports "unknown size" and thus it cannot be parallelised.
• If you first load the data into a List in RAM then run the JDK’s j.u.s it comes out
a little faster
© 2017 Hazelcast Inc. Confidential & Proprietary
Jet is a DAG Processing Engine
13
© 2017 Hazelcast Inc. Confidential & Proprietary
DAG Processing
• Directed Acyclic Graphs are used to model computations
• Each vertex is a step in the computation
• It is a generalisation of the MapReduce paradigm
• Supports both batch and stream processing
• Other systems that use DAGs: Apache Tez, Flink, Spark…
© 2017 Hazelcast Inc. Confidential & Proprietary
Example: Word Count
If we lived in a single-threaded world:
1. Iterate through all the lines
2. Split the line into words
3. Update running total of counts with each word
final String text = "...";
final Pattern pattern = Pattern.compile("s+");
final Map<String, Long> counts = new HashMap<>();
for (String word : pattern.split(text)) {
counts.compute(word, (w, c) -> c == null ? 1L : c + 1);
}
© 2017 Hazelcast Inc. Confidential & Proprietary
Input Output
Still single-threaded execution:
each Vertex is executed in turn
Tokenizer Reducer
Split the text into words
For each word emit (word)
Collect running totals
Once everything is finished,
emit all pairs of (word, count)
(text) (word) (word, count)
We can represent the computation as a DAG
© 2017 Hazelcast Inc. Confidential & Proprietary
Input
(text) (word)
Output
(word, count)
Tokenizer Reducer
Split the text into words
For each word emit (word)
Collect running totals.
Once everything is finished,
emit all pairs of (word, count)
We can parallelise the execution of the vertices by
introducing concurrent queues between the vertices.
© 2017 Hazelcast Inc. Confidential & Proprietary
Output
(word, count)
Reducer
By dividing the text into lines, and having multiple threads,
each tokenizer can process a separate line thus
parallelizing the work.
Input
Tokenizer
Tokenizer
© 2017 Hazelcast Inc. Confidential & Proprietary
(word)
(word)
Input Output
Tokenizer
We only need to ensure the same words
go to the same Reducer.
The Reducer can also be executed in parallel by
partitioning by the individual words.
Tokenizer
Reducer
Reducer
© 2017 Hazelcast Inc. Confidential & Proprietary
Node
Node
The steps can also be distributed across multiple nodes.
To do this you need a distributed partitioning scheme.
Input Output
Tokenizer
Tokenizer
Reducer
Reducer
Combiner
Combiner
Input Output
Tokenizer
Tokenizer
Reducer
Reducer
Combiner
Combiner
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet Overview
21
© 2017 Hazelcast Inc. Confidential & Proprietary
jet-core
• Provides low-latency and high-throughput distributed
DAG execution
• Hazelcast provides partitioning, discovery, networking
and serialization and elasticity
• Jobs are submitted to the cluster and run in an isolated class
loader with spans the lifetime of the job. Many jobs can be
run in the cluster at once. Each vertex in the graph is
represented by a processor which does not need to be
thread safe
• Vertices are connected by edges.
• Processors are executed by Tasklets, which are allocated to
threads.
© 2017 Hazelcast Inc. Confidential & Proprietary
Processors
• Main work horse of a Jet application – each vertex must have a Processor
• Just Java code
• It typically takes some input, and emits some output
• Also can act as a source or a sink
• Convenience processors for: flatMap, groupAndAccumulate, filter
Vertex tokenize = dag.newVertex("tokenize",
flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
.filter(word -> !word.isEmpty()))
);
Vertex reduce = dag.newVertex("reduce",
groupAndAccumulate(initialZero, (count, x) -> count + 1)
);
Vertex combine = dag.newVertex("combine",
groupAndAccumulate(Entry<String, Long>::getKey, initialZero,
(Long count, Entry<String, Long> wordAndCount) -> count + wordAndCount.getValue())
);
© 2017 Hazelcast Inc. Confidential & Proprietary
DAG Execution
• Each node in the cluster runs the whole graph
• Each vertex is executed by a number of tasklets
• Bounded number of execution threads (typically system
processor count)
• Work-stealing between execution threads
• Back pressure applies between vertices
© 2017 Hazelcast Inc. Confidential & Proprietary
Cooperative Multithreading
• Similar to green threads
• Tasklets run in a loop serviced by the same native thread.
- No context switching.
- Almost guaranteed core affinity
• Each tasklet does small amount of work at a time (<1ms)
• Cooperative tasklets must be non-blocking.
• Each thread can handle thousands of cooperative tasklets
• If there isn’t any work for a thread, it backs off
© 2017 Hazelcast Inc. Confidential & Proprietary
Cooperative Multithreading
• Edges are implemented by lock-free single producer, single
consumer queues
- It employs wait-free algorithms on both sides and avoids
volatile writes by using lazySet.
• Load balancing via back pressure
• Tasklets can also be non-cooperative, in which case they have a
dedicated thread and may perform blocking operations.
© 2017 Hazelcast Inc. Confidential & Proprietary
Producer
Tasklet 3
Producer
Tasklet 4
Producer
Tasklet 4
Producer
Tasklet 2
Consumer
Tasklet 2
Consumer
Tasklet 1
Producer
Tasklet 4
Producer
Tasklet 2
Producer
Tasklet 1
Consumer
Tasklet 2
Consumer
Tasklet 1
Producer
Tasklet 1
Consumer
Vertex
Producer
Vertex
Edge
Parallelism=4
Output of Producer Vertex is
connected to input of
Consumer Vertex by
configuration of the Edge.
Thread 1 Thread 2
Execution Service
Tasklets – Unit of Execution
Parallelism=2
© 2017 Hazelcast Inc. Confidential & Proprietary
Edges
• Different types of edges:
- Variable Unicast – pick any downstream processor
- Broadcast – emit to all downstream processors
- Partitioned – pick based on a key
• Vertices can have more than one input: allows joins and co-
group
• Vertices can have more than one output: splits
• Edges can be local or distributed
© 2017 Hazelcast Inc. Confidential & Proprietary
Data input and output
• Reader and writers for Hazelcast IMap, ILap and IList
• HDFS
• Kafka
• Socket (text encoding)
• File
• Easy to implement your own – just implement the Processor
interface. If a source, implement complete(). If a sink
implement process();
© 2017 Hazelcast Inc. Confidential & Proprietary
APIs
© 2017 Hazelcast Inc. Confidential & Proprietary
Processor API
• Unified API for sinks, sources and intermediate steps
• Each Processor has an Inbox and Outbox per inbound and outbound
edge.
• Two main methods to implement:
• boolean tryProcess(int ordinal, Object item)
• Process incoming item and emit new items by populating the outbox
• boolean complete()
• Called after all upstream processors are also completed. Typically used for sources and
batch operations such as group by and distinct.
• Slight differences in API for cooperative vs non-cooperative processors
• Non-cooperative processors may block indefinitely
• Cooperative processors must respect Outbox when emitting and yield if Outbox is
already full.
© 2017 Hazelcast Inc. Confidential & Proprietary
Core API – Powerful, Low Level API
DAG describes how vertices are connected to each other:
DAG dag = new DAG();
Vertex source = dag.newVertex("source", readMap(DOCID_NAME));
Vertex docLines = dag.newVertex("doc-lines", DocLinesP::new);
Vertex tokenize = dag.newVertex("tokenize",
flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
.filter(word -> !word.isEmpty())));
Vertex reduce = dag.newVertex("reduce",
groupAndAccumulate(initialZero, (count, x) -> count + 1));
Vertex combine = dag.newVertex("combine",
groupAndAccumulate(Entry<String, Long>::getKey, initialZero,
(Long count, Entry<String, Long> wordAndCount) ->
count + wordAndCount.getValue()));
Vertex sink = dag.newVertex("sink", writeMap("counts"));
dag.edge(between(source.localParallelism(1), docLines))
.edge(between(docLines.localParallelism(1), tokenize))
.edge(between(tokenize, reduce).partitioned(wholeItem(), HASH_CODE))
.edge(between(reduce, combine).distributed().partitioned(entryKey()))
.edge(between(combine, sink));
jet.newJob(dag).execute();
© 2017 Hazelcast Inc. Confidential & Proprietary
Distributed java.util.stream API
• “Classes to support functional-style operations on streams of
elements, such as map-reduce transformations on collections.”
• Jet adds distributed support for the java.util.stream API for
Hazelcast Map and List.
• Supports all j.u.s. operations such as:
-map(), flatMap(), filter()
-reduce(), collect()
-sorted(), distinct()
• Lambda serialization is a problem, but is worked around by creating
serializable versions of the interfaces
• j.u.s Streams are converted to Processor API (DAG) for execution
© 2017 Hazelcast Inc. Confidential & Proprietary
j.u.s API
JetInstance jet = Jet.newJetInstance();
Jet.newJetInstance();
IStreamMap<Long, String> lines = jet.getMap("lines");
Map<String, Long> counts = lines
.stream()
.flatMap(m -> Stream.of(PATTERN.split(m.getValue().toLowerCase())))
.filter(w -> !w.isEmpty())
.collect(DistributedCollectors.toIMap(w -> w, w -> 1L, (left, right) -> left + right));
© 2017 Hazelcast Inc. Confidential & Proprietary
API Comparison
• Processor API is low-level, powerful and can be used to
solve any problem that can be expressed as a DAG
• java.util.stream is familiar, but only limited functionality.
• All data processing frameworks have their own API
• Some attempts at unification are not major successes:
Cascading, Apache Beam …
• Cascading prototype for Jet
© 2017 Hazelcast Inc. Confidential & Proprietary
Demo
© 2017 Hazelcast Inc. Confidential & Proprietary
Roadmap
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast Jet 0.4 (06/17)
38
Features Description
Distributed Computation
Distributed data processing framework. It’s based on directed acyclic graphs (DAGs) and in-memory
computation.
Distributed java.util.stream Java 8 stream API can be used to describe the Jet Job. The execution is distributed.
Windowing with Event-time
Improved streaming support including windowing support with event-time semantics. Out of the box
support for tumbling, sliding and session window aggregations.
Distributed Data Structures
Distributed implementations of Map, Cache and List data structures highly optimized to be used for the
processing.
Pivotal Cloud Foundry Tile Hazelcast Jet is available in Pivotal Cloud Foundry
Discovery and Networking Hazelcast discovery for finding members in the cluster or in the Cloud.
Hazelcast IMDG Connector
Allows Jet to integrate with both embedded and non-embedded Hazelcast instances to be used for
reading and writing data for distributed computation.
File and Socket Connector Ability to read and write unbounded data from text sockets and files.
Kafka Connector Reads and writes data streams to the Kafka message broker.
HDFS Connector Reads and writes data streams to Hadoop Distributed File System
© 2017 Hazelcast Inc. Confidential & Proprietary
Hazelcast 2017 Jet Roadmap
39
Features Description
Robust Stream Processing Guarantees for stateful stream processing
Usability Enhancements High-Level API | Tooling
High Performance
Hazelcast Integrations
Streaming Map and Cache events | Projection and Predicate support for
Map source
Management Center Management and monitoring features for Jet.
More Connectors JMS | JDBC
© 2017 Hazelcast Inc. Confidential & Proprietary
Questions?
Version 0.4 is the current release with 0.5 coming in September
http://jet.hazelcast.org
Minimum JDK 8

More Related Content

What's hot

Building scalable applications with hazelcast
Building scalable applications with hazelcastBuilding scalable applications with hazelcast
Building scalable applications with hazelcastFuad Malikov
 
Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)Emrah Kocaman
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentHazelcast
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopDataWorks Summit
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using HazelcastTaras Matyashovsky
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila finalWei Ting Chen
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataDataWorks Summit
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudDataWorks Summit
 
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringSpring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringEmrah Kocaman
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...DataWorks Summit
 
Red Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed_Hat_Storage
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?DataWorks Summit
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...DataStax
 

What's hot (20)

Building scalable applications with hazelcast
Building scalable applications with hazelcastBuilding scalable applications with hazelcast
Building scalable applications with hazelcast
 
Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)Hazelcast For Beginners (Paris JUG-1)
Hazelcast For Beginners (Paris JUG-1)
 
Hazelcast 101
Hazelcast 101Hazelcast 101
Hazelcast 101
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila final
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and SpringSpring Meetup Paris - Getting Distributed with Hazelcast and Spring
Spring Meetup Paris - Getting Distributed with Hazelcast and Spring
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
 
Red Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and Future
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Hazelcast Introduction
Hazelcast IntroductionHazelcast Introduction
Hazelcast Introduction
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 

Similar to Hazelcast Jet v0.4 - August 9, 2017

Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesVladimír Schreiner
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018harvraja
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Amazon Web Services
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Architecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptxArchitecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptxshaikshazil1
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Docker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. ServerlessDocker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. ServerlessLogicworksNY
 
Streaming solutions for real time problems
Streaming solutions for real time problems Streaming solutions for real time problems
Streaming solutions for real time problems Aparna Gaonkar
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKInfluxData
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam DillardInfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam DillardInfluxData
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsSatya Narayan
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15MLconf
 
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High PerformanceMysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High PerformanceBernd Ocklin
 

Similar to Hazelcast Jet v0.4 - August 9, 2017 (20)

Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Architecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptxArchitecture_Masking_Delphix.pptx
Architecture_Masking_Delphix.pptx
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Docker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. ServerlessDocker vs. Kubernetes vs. Serverless
Docker vs. Kubernetes vs. Serverless
 
Apache ignite v1.3
Apache ignite v1.3Apache ignite v1.3
Apache ignite v1.3
 
Streaming solutions for real time problems
Streaming solutions for real time problems Streaming solutions for real time problems
Streaming solutions for real time problems
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam DillardInfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
InfluxDB 1.0 - Optimizing InfluxDB by Sam Dillard
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High PerformanceMysql NDB Cluster's Asynchronous Parallel Design for High Performance
Mysql NDB Cluster's Asynchronous Parallel Design for High Performance
 

Recently uploaded

WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2DianaGray10
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxEasyPrinterHelp
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 

Recently uploaded (20)

WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptx
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 

Hazelcast Jet v0.4 - August 9, 2017

  • 2. © 2017 Hazelcast Inc. Confidential & Proprietary What is Jet? A general purpose distributed data processing engine built on Hazelcast and based on directed acyclic graphs (DAGs) to model data flow. DISTRIBUTED COMPUTING.SIMPLIFIED.
  • 3. © 2017 Hazelcast Inc. Confidential & Proprietary Why? • Distributed java.util.stream support for Hazelcast • Extend computational capabilities in Hazelcast (IExecutorService, EntryProcessor, JobTracker, …) • Hazelcast provides a nice foundation for distributed data processing so it is relatively easy to do this. • Unifying IMDG and advanced data processing capabilities
  • 4. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Overview 4
  • 5. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet – Fast Big Data 5 Stream and Fast In-Memory Batch Processing Enrichment Databases IoT Social Networks Enterprise Applications Databases/ Hazelcast IMDG HDFS/ Spark Stream Stream Stream Batch Batch Ingest Alerts Enterprise Applications Interactive Analytics Databases/ Hazelcast IMDG Output
  • 6. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Architecture 6 Streaming Readers and Writers (Kafka, File, Socket) Stream Processing (Tumbling, Sliding and Session Windows) java.util.stream Batch Readers and Writers (Hazelcast IMDG IMap, ICache & IList, HDFS, File) Networking (IPv4, IPv6) Deployment (On Premise, Embedded, AWS, Azure, Docker, Pivotal Cloud Foundry) Hazelcast IMDG Data Structures (Distributed Map, Cache, List) Partition Management (Members, Lite Members, Master Partition, Replicas, Migrations, Partition Groups, Partition Aware) Job Management (Jet Job Lifecycle Management, Resource Distribution and Deployment) Execution Engine (Distributed DAG Execution, Processors, Back Pressure Handling, Data Distribution) Cluster Management with Cloud Discovery SPI (Apache jclouds, AWS, Azure, Consul, etcd, Heroku, IP List, Kubernetes, Multicast, Zookeeper) Java Client Jet Core API Hazelcast Jet Core Connectors High-Level API Processing Batch Readers and Writers (Hazelcast IMDG IMap, ICache & IList, HDFS, File) Batch Processing (Aggregations, Join)
  • 7. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Key Competitive Differentiators 7 • Works great with Hazelcast IMDG | Source, Sink, Enrichment • Very simple to program | Leverages existing standards • Very simple to deploy | Same lower stack as Hazelcast • Works in every Cloud | Same as Hazelcast IMDG • High Performance | Fastest word count in town! • For Developers by Developers | Code it
  • 8. © 2017 Hazelcast Inc. Confidential & Proprietary Jet Application Deployment Options IMDG IMDG • No separate process to manage • Great for microservices • Great for OEM • Simplest for Ops – nothing extra Embedded Application Java API Application Java API Application Java API Java API Application Client-Server Jet Jet Jet Jet Jet Jet Java API Application Java API Application Java API Application • Separate Jet Cluster • Scale Jet independent of Applications • Isolate Jet from Application Server Lifecycle • Managed by Ops
  • 9. © 2017 Hazelcast Inc. Confidential & Proprietary Jet with Hazelcast Deployment Choices Jet Compute Cluster Hazelcast IMDG Cluster Sink Enrichment Message Broker (Kafka) Data Enrichment HDFS Jet Cluster Sink Source / Enrichment Good when: • Where source and sink are primarily Hazelcast • Jet and Hazelcast have equivalent sizing needs Good when: • Where source and sink are primarily Hazelcast • Where you want isolation of the Jet cluster
  • 10. © 2017 Hazelcast Inc. Confidential & Proprietary Performance - Latency 10
  • 11. © 2017 Hazelcast Inc. Confidential & Proprietary Performance - Throughput
  • 12. © 2017 Hazelcast Inc. Confidential & Proprietary Performance - ForkJoin • Jet is faster than JDK’s j.u.s implementation. The issue is in the JDK the character stream reports "unknown size" and thus it cannot be parallelised. • If you first load the data into a List in RAM then run the JDK’s j.u.s it comes out a little faster
  • 13. © 2017 Hazelcast Inc. Confidential & Proprietary Jet is a DAG Processing Engine 13
  • 14. © 2017 Hazelcast Inc. Confidential & Proprietary DAG Processing • Directed Acyclic Graphs are used to model computations • Each vertex is a step in the computation • It is a generalisation of the MapReduce paradigm • Supports both batch and stream processing • Other systems that use DAGs: Apache Tez, Flink, Spark…
  • 15. © 2017 Hazelcast Inc. Confidential & Proprietary Example: Word Count If we lived in a single-threaded world: 1. Iterate through all the lines 2. Split the line into words 3. Update running total of counts with each word final String text = "..."; final Pattern pattern = Pattern.compile("s+"); final Map<String, Long> counts = new HashMap<>(); for (String word : pattern.split(text)) { counts.compute(word, (w, c) -> c == null ? 1L : c + 1); }
  • 16. © 2017 Hazelcast Inc. Confidential & Proprietary Input Output Still single-threaded execution: each Vertex is executed in turn Tokenizer Reducer Split the text into words For each word emit (word) Collect running totals Once everything is finished, emit all pairs of (word, count) (text) (word) (word, count) We can represent the computation as a DAG
  • 17. © 2017 Hazelcast Inc. Confidential & Proprietary Input (text) (word) Output (word, count) Tokenizer Reducer Split the text into words For each word emit (word) Collect running totals. Once everything is finished, emit all pairs of (word, count) We can parallelise the execution of the vertices by introducing concurrent queues between the vertices.
  • 18. © 2017 Hazelcast Inc. Confidential & Proprietary Output (word, count) Reducer By dividing the text into lines, and having multiple threads, each tokenizer can process a separate line thus parallelizing the work. Input Tokenizer Tokenizer
  • 19. © 2017 Hazelcast Inc. Confidential & Proprietary (word) (word) Input Output Tokenizer We only need to ensure the same words go to the same Reducer. The Reducer can also be executed in parallel by partitioning by the individual words. Tokenizer Reducer Reducer
  • 20. © 2017 Hazelcast Inc. Confidential & Proprietary Node Node The steps can also be distributed across multiple nodes. To do this you need a distributed partitioning scheme. Input Output Tokenizer Tokenizer Reducer Reducer Combiner Combiner Input Output Tokenizer Tokenizer Reducer Reducer Combiner Combiner
  • 21. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet Overview 21
  • 22. © 2017 Hazelcast Inc. Confidential & Proprietary jet-core • Provides low-latency and high-throughput distributed DAG execution • Hazelcast provides partitioning, discovery, networking and serialization and elasticity • Jobs are submitted to the cluster and run in an isolated class loader with spans the lifetime of the job. Many jobs can be run in the cluster at once. Each vertex in the graph is represented by a processor which does not need to be thread safe • Vertices are connected by edges. • Processors are executed by Tasklets, which are allocated to threads.
  • 23. © 2017 Hazelcast Inc. Confidential & Proprietary Processors • Main work horse of a Jet application – each vertex must have a Processor • Just Java code • It typically takes some input, and emits some output • Also can act as a source or a sink • Convenience processors for: flatMap, groupAndAccumulate, filter Vertex tokenize = dag.newVertex("tokenize", flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase())) .filter(word -> !word.isEmpty())) ); Vertex reduce = dag.newVertex("reduce", groupAndAccumulate(initialZero, (count, x) -> count + 1) ); Vertex combine = dag.newVertex("combine", groupAndAccumulate(Entry<String, Long>::getKey, initialZero, (Long count, Entry<String, Long> wordAndCount) -> count + wordAndCount.getValue()) );
  • 24. © 2017 Hazelcast Inc. Confidential & Proprietary DAG Execution • Each node in the cluster runs the whole graph • Each vertex is executed by a number of tasklets • Bounded number of execution threads (typically system processor count) • Work-stealing between execution threads • Back pressure applies between vertices
  • 25. © 2017 Hazelcast Inc. Confidential & Proprietary Cooperative Multithreading • Similar to green threads • Tasklets run in a loop serviced by the same native thread. - No context switching. - Almost guaranteed core affinity • Each tasklet does small amount of work at a time (<1ms) • Cooperative tasklets must be non-blocking. • Each thread can handle thousands of cooperative tasklets • If there isn’t any work for a thread, it backs off
  • 26. © 2017 Hazelcast Inc. Confidential & Proprietary Cooperative Multithreading • Edges are implemented by lock-free single producer, single consumer queues - It employs wait-free algorithms on both sides and avoids volatile writes by using lazySet. • Load balancing via back pressure • Tasklets can also be non-cooperative, in which case they have a dedicated thread and may perform blocking operations.
  • 27. © 2017 Hazelcast Inc. Confidential & Proprietary Producer Tasklet 3 Producer Tasklet 4 Producer Tasklet 4 Producer Tasklet 2 Consumer Tasklet 2 Consumer Tasklet 1 Producer Tasklet 4 Producer Tasklet 2 Producer Tasklet 1 Consumer Tasklet 2 Consumer Tasklet 1 Producer Tasklet 1 Consumer Vertex Producer Vertex Edge Parallelism=4 Output of Producer Vertex is connected to input of Consumer Vertex by configuration of the Edge. Thread 1 Thread 2 Execution Service Tasklets – Unit of Execution Parallelism=2
  • 28. © 2017 Hazelcast Inc. Confidential & Proprietary Edges • Different types of edges: - Variable Unicast – pick any downstream processor - Broadcast – emit to all downstream processors - Partitioned – pick based on a key • Vertices can have more than one input: allows joins and co- group • Vertices can have more than one output: splits • Edges can be local or distributed
  • 29. © 2017 Hazelcast Inc. Confidential & Proprietary Data input and output • Reader and writers for Hazelcast IMap, ILap and IList • HDFS • Kafka • Socket (text encoding) • File • Easy to implement your own – just implement the Processor interface. If a source, implement complete(). If a sink implement process();
  • 30. © 2017 Hazelcast Inc. Confidential & Proprietary APIs
  • 31. © 2017 Hazelcast Inc. Confidential & Proprietary Processor API • Unified API for sinks, sources and intermediate steps • Each Processor has an Inbox and Outbox per inbound and outbound edge. • Two main methods to implement: • boolean tryProcess(int ordinal, Object item) • Process incoming item and emit new items by populating the outbox • boolean complete() • Called after all upstream processors are also completed. Typically used for sources and batch operations such as group by and distinct. • Slight differences in API for cooperative vs non-cooperative processors • Non-cooperative processors may block indefinitely • Cooperative processors must respect Outbox when emitting and yield if Outbox is already full.
  • 32. © 2017 Hazelcast Inc. Confidential & Proprietary Core API – Powerful, Low Level API DAG describes how vertices are connected to each other: DAG dag = new DAG(); Vertex source = dag.newVertex("source", readMap(DOCID_NAME)); Vertex docLines = dag.newVertex("doc-lines", DocLinesP::new); Vertex tokenize = dag.newVertex("tokenize", flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase())) .filter(word -> !word.isEmpty()))); Vertex reduce = dag.newVertex("reduce", groupAndAccumulate(initialZero, (count, x) -> count + 1)); Vertex combine = dag.newVertex("combine", groupAndAccumulate(Entry<String, Long>::getKey, initialZero, (Long count, Entry<String, Long> wordAndCount) -> count + wordAndCount.getValue())); Vertex sink = dag.newVertex("sink", writeMap("counts")); dag.edge(between(source.localParallelism(1), docLines)) .edge(between(docLines.localParallelism(1), tokenize)) .edge(between(tokenize, reduce).partitioned(wholeItem(), HASH_CODE)) .edge(between(reduce, combine).distributed().partitioned(entryKey())) .edge(between(combine, sink)); jet.newJob(dag).execute();
  • 33. © 2017 Hazelcast Inc. Confidential & Proprietary Distributed java.util.stream API • “Classes to support functional-style operations on streams of elements, such as map-reduce transformations on collections.” • Jet adds distributed support for the java.util.stream API for Hazelcast Map and List. • Supports all j.u.s. operations such as: -map(), flatMap(), filter() -reduce(), collect() -sorted(), distinct() • Lambda serialization is a problem, but is worked around by creating serializable versions of the interfaces • j.u.s Streams are converted to Processor API (DAG) for execution
  • 34. © 2017 Hazelcast Inc. Confidential & Proprietary j.u.s API JetInstance jet = Jet.newJetInstance(); Jet.newJetInstance(); IStreamMap<Long, String> lines = jet.getMap("lines"); Map<String, Long> counts = lines .stream() .flatMap(m -> Stream.of(PATTERN.split(m.getValue().toLowerCase()))) .filter(w -> !w.isEmpty()) .collect(DistributedCollectors.toIMap(w -> w, w -> 1L, (left, right) -> left + right));
  • 35. © 2017 Hazelcast Inc. Confidential & Proprietary API Comparison • Processor API is low-level, powerful and can be used to solve any problem that can be expressed as a DAG • java.util.stream is familiar, but only limited functionality. • All data processing frameworks have their own API • Some attempts at unification are not major successes: Cascading, Apache Beam … • Cascading prototype for Jet
  • 36. © 2017 Hazelcast Inc. Confidential & Proprietary Demo
  • 37. © 2017 Hazelcast Inc. Confidential & Proprietary Roadmap
  • 38. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast Jet 0.4 (06/17) 38 Features Description Distributed Computation Distributed data processing framework. It’s based on directed acyclic graphs (DAGs) and in-memory computation. Distributed java.util.stream Java 8 stream API can be used to describe the Jet Job. The execution is distributed. Windowing with Event-time Improved streaming support including windowing support with event-time semantics. Out of the box support for tumbling, sliding and session window aggregations. Distributed Data Structures Distributed implementations of Map, Cache and List data structures highly optimized to be used for the processing. Pivotal Cloud Foundry Tile Hazelcast Jet is available in Pivotal Cloud Foundry Discovery and Networking Hazelcast discovery for finding members in the cluster or in the Cloud. Hazelcast IMDG Connector Allows Jet to integrate with both embedded and non-embedded Hazelcast instances to be used for reading and writing data for distributed computation. File and Socket Connector Ability to read and write unbounded data from text sockets and files. Kafka Connector Reads and writes data streams to the Kafka message broker. HDFS Connector Reads and writes data streams to Hadoop Distributed File System
  • 39. © 2017 Hazelcast Inc. Confidential & Proprietary Hazelcast 2017 Jet Roadmap 39 Features Description Robust Stream Processing Guarantees for stateful stream processing Usability Enhancements High-Level API | Tooling High Performance Hazelcast Integrations Streaming Map and Cache events | Projection and Predicate support for Map source Management Center Management and monitoring features for Jet. More Connectors JMS | JDBC
  • 40. © 2017 Hazelcast Inc. Confidential & Proprietary Questions? Version 0.4 is the current release with 0.5 coming in September http://jet.hazelcast.org Minimum JDK 8

Editor's Notes

  1. General idea across the slide deck: The value of Jet is in simplicity and integration with IMDG Big data isn’t enough, fast big data is what matters (information available in a near real-time fashion) Jet grabs the information from the data in near real-time and stores it to IMDG staying in-memory and distributed
  2. At a business level, Hazelcast Jet is a distributed computing platform for fast processing of big data streaming sets. Jet is based on a parallel, core engine allowing data-intensive applications to operate at near real-time speeds.
  3. Jet vision = fast big data, real-time stream processor + fast batch processor How the ecosystem and looks like. Jet is part of the data-processing pipeline Two major deployment setups - stream and batch. Batch — A complete dataset is available before the processing starts. The data are submitted as one batch to be processed. Usually in Database, IMDG, on filesystem Data are loaded from (potentially multiple) sources, joined, transformed, cleaned. Then stored and/or left in IMDG for further in-memory usage. - Typically used for ETL (Extract-transfer-load) tasks. Stream - data potentially unbounded and infinite the goal is to process the data as soon as possible. Process = extract important information from the data. The whole processing stays in memory, therefore another systems may be alerted in near real-time. Data ingested to the system via message broker (Kafka) The streaming deployment is useful for use-cases, where real-time behaviour matters. Streaming features Streaming core of Jet (low latency, per-event processing). Windowing and late-events processing
  4. Hazelcast can be used in two different modes: Embedded Mode is very easy to setup and run. It can only be done with peer Java applications. Developers really like this mode during prototyping phases. Client-Server Mode is more work but efficient once an application is required to scale to larger volume of transactions or users as it separates the application from Hazelcast IMDG for better manageability. It can be done with different types of applications (C+, C#/.NET, Python, Scala, Node.js) using native clients provided by Hazelcast (C+, C#/.NET, Python, Scala, Node.js). Hazelcast Open Binary Protocol enables clients for any other language to be developed as needed (see Hazelcast IMDG Docs). With both client-server and embedded architecture, Hazelcast IMDG offers: + Elasticity and scalability + Transparent integration with backend databases using Map Store interfaces + Management of the cluster through your web browser with Management Center
  5. Jet is build on top of Hazelcast IMDG with the benefits of close integration. Hazelcast Jet and IMDG can run in these deployments: Every Jet cluster works on top of embedded Hazelcast in-memory data grid. Embedded deployment suits well for: Sharing processing state among Jet nodes Caching intermediate processing results Enriching processed events. Cache remote data (e.g. fact tables from database) on Jet nodes. Running advanced data processing tasks on top of Hazelcast data structures Development purposes. Since starting standalone Jet cluster is simple and fast. Remote deployment Share state or intermediate results among more Jet jobs Cache intermediate results (e.g. to show real-time processing stats on dashboard) Load input or store results of Jet computation in in-memory fashion to IMDG Isolation of processing and storage layer Jet x IMDG – product differentiation Both Jet and IMDG are for parallel and distributed computing, why Jet is more expensive? Is Hazelcast IMDG not good enough?
  6. The streaming benchmark is based on a simple stock exchange aggregation. Each message representing a trade is published to Kafka and then a simple windowed aggregation which calculates the number of traders per ticker per window is done using various data processing frameworks.
  7. Compared to Spark, Flink doing word count, data on HDFS/IMap, cluster of 10 nodes, 32 cores each…
  8. Here one thread is doing tokenizing and one is doing accumulation. Each can operate at its own speed using the concurrent queue to buffer work.
  9. Not the same as Fork-Join in JDK as it uses a single thread for all processing steps. parallelStream allocates a Fork-Join task per data partition but that’s all.
  10. Each Reducer is a different instance. It can be across the network or local depending on how you create your DAG model. In our j.u.s implementation reducers are always local and there is a network step for the combiner.
  11. All the nodes need to agree on who owns what partition. The data is processed locally and then sent to the partition owner as part of the combiner step. Jet uses the same partition scheme as Hazelcast. So if your Combiner step is a collect to a Hazelcast iMap the final output will be local. So there is only one network hop in the entire DAG processing. Writing to HDFS is local but then is replicated out.
  12. It‘s to be replaced by high-level What API in Jet 0.5 (opposed to „How“ core API)