In-Memory Data Streams
With Hazelcast Jet
NEIL STEVENSON
neil@hazelcast.com
27th May 2017
13:25-14:10
© 2017 Hazelcast Inc. Confidential & Proprietary
Outline
• Hazelcast
• → The company, the software, and my role
• Background
• → Why stream at all ?
• Java 8 streams
• → What did Java 8 add to Java 7 ?
• → Why isn’t this good enough ?
• Hazelcast Jet, part #1
• → Introduction and outline architecture
• → Low level abstractions : directed acyclic graphs
• A sample application, available to download : not Word Count
• Hazelcast Jet, part #2
• → Higher level abstractions → distributed java.util.stream
Hazelcast : The company, the software and my role
The Company
Founded in 2008, based out of Palo Alto, California with offices worldwide
Provides commercial support and value-add subscription features for open source Hazelcast software
The Software
Apache 2 licensed, available to download from GitHub, from https://hazelcast.org or https://hazelcast.com
My Role
Solutions Architect – help customers, give talks, drink coffee, write code, drink coffee
Part 1 – Fast Big Data
DAG = Directed Acyclic Graph
Model the flow of data from processing stage to processing stage
→ a stream of data, potentially infinite
→ process as it comes in, don’t save first, maybe never save
→ enrich, deplete, filter, split, etc as data passes through
→ at memory speeds, no waiting for disks
Part 1 – Fast Big Data
Stream and Fast In-Memory Batch Processing
[Diagram: Ingest (IoT, social networks, enterprise applications, databases/Hazelcast IMDG, HDFS/Spark; as stream or batch), with enrichment from databases, leading to Output (alerts, enterprise applications, interactive analytics, databases/Hazelcast IMDG)]
Jet : Directed Acyclic Graph
VERTEX
The vertex is just the processing node in a pipeline.
→ Input comes in from somewhere, the first stage or the previous stage
→ Output goes out somewhere, the last stage or the next stage
→ Stateless or stateful
→ Split, filter, enrich, deplete, fan-out, fan-in the data, many possibilities
Jet : Directed Acyclic Graph
EDGE
The edge is just the data transmission in the pipeline.
→ Out of one processor into the next one
→ Out of one processor into the next ones
→ The next processor can be on any JVM, local or distributed routing
→ Back-pressure system throttles producer when consumer cannot keep up
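The vertex and edge model can be sketched in plain Java. This is only an illustration, not Jet's DAG class: DagSketch and its methods are hypothetical names. It records vertices and directed edges, then uses Kahn's algorithm to find an order in which the stages could run, and fails if the edges form a cycle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a directed acyclic graph; not the Jet DAG class.
final class DagSketch {
    // Each vertex name maps to the names of the vertices it sends data to.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    DagSketch vertex(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    DagSketch edge(String from, String to) {
        vertex(from);
        vertex(to);
        edges.get(from).add(to);
        return this;
    }

    // Kahn's algorithm: repeatedly take a vertex with no unprocessed inputs.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) {
            inDegree.put(v, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String t : edges.get(v)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("Graph has a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch()
                .edge("input", "tokenizer")
                .edge("tokenizer", "reducer")
                .edge("reducer", "output");
        System.out.println(dag.topologicalOrder());
    }
}
```

The "acyclic" part matters: with a cycle there is no first stage to start from, which is what the exception reports.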
Part 1 – Jet Engine
Stream Processing
Traditional processing is based on calculations on stored data
Stream processing is about calculations prior to storage
Streams are immutable
Streams may be infinite
The “pipeline” paradigm (input → process → output)
Pipeline stages are lambdas : (x, y) -> {return x * y;}
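A small sketch of these points using plain java.util.stream from Java 8 (the class and method names are made up for illustration). The source is conceptually infinite, each stage is a lambda, and nothing is stored before it is processed.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class InfiniteStreamSketch {
    // Input -> process -> output over a source that never ends by itself.
    static List<Integer> firstSquaresOfEvens(int howMany) {
        return Stream.iterate(1, x -> x + 1)   // 1, 2, 3, ... potentially infinite
                .filter(x -> x % 2 == 0)       // keep the even numbers
                .map(x -> x * x)               // a pipeline stage is just a lambda
                .limit(howMany)                // stop once enough output exists
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstSquaresOfEvens(3));
    }
}
```

Because the stream is evaluated lazily, only as many input items are generated as the output needs.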
Part 1 – Jet Engine
What does it do ?
Stream Processing
In-memory
Distributed
Example 1 : Word Count
Word Count is the “hello world” of stream processing:
The Problem
• Count how many times each word occurs in some text
• Trivial, but shows some major concepts
Input
• Hamlet’s Soliloquy
1: To be, or not to be, that is the Question:
2: Whether ’tis Nobler in the mind to suffer
3: The Slings and Arrows of outragious Fortune,
4: Or to take Armes against a Sea of troubles,
Output
the=23
to=14
and=13
be=4
…
Example 1 : Word Count
Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet();
Map<String, Integer> wordCounts = entrySet.stream()
    .flatMap(m -> Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toMap(
        key -> key,
        value -> 1,
        Integer::sum));
In Java we would basically iterate and tally
How can the JVM optimise?
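For anyone wanting to run the Java 8 version on its own, here is a self-contained sketch. Several things are assumptions: WORDS_PATTERN stands in for Constants.WORDS_PATTERN (its real definition is not shown on the slide), the input is just the four lines from the Input slide, and the minimum word length is a parameter rather than the hard-coded 5.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordCountSketch {
    // Assumed stand-in for Constants.WORDS_PATTERN: split on non-word characters.
    static final Pattern WORDS_PATTERN = Pattern.compile("\\W+");

    // The four lines from the Input slide, keyed by line number.
    static Map<Integer, String> sampleLines() {
        Map<Integer, String> lines = new LinkedHashMap<>();
        lines.put(1, "To be, or not to be, that is the Question:");
        lines.put(2, "Whether 'tis Nobler in the mind to suffer");
        lines.put(3, "The Slings and Arrows of outragious Fortune,");
        lines.put(4, "Or to take Armes against a Sea of troubles,");
        return lines;
    }

    // Tokenize every line, lower-case the words, drop short ones, then tally.
    static Map<String, Integer> count(Map<Integer, String> sourceMap, int minLength) {
        return sourceMap.entrySet().stream()
                .flatMap(e -> Arrays.stream(WORDS_PATTERN.split(e.getValue())))
                .map(String::toLowerCase)
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(count(sampleLines(), 5));
    }
}
```

With a minimum length of 5, as on the slide, short words such as "the" and "to" are dropped; pass 1 to count every word.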
Example 1 : Word Count
Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output
Tokenizer: split the text into words; for each word emit (word)
Reducer: collect running totals; once everything is finished, emit all pairs of (word, count)
But really this is just a pipeline, so a DAG
Example 1 : Word Count
[Diagram: Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output]
Using queues between vertices allows each to run in parallel, at their own speed
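A rough sketch of that idea with two plain Java threads joined by a bounded queue. It is not how Jet implements its edges; the class name, the queue size of 4 and the empty-string end marker are all choices made for this illustration. The bounded queue also shows back-pressure in miniature: put() blocks when the consumer falls behind.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueuePipelineSketch {
    // End marker: the tokenizer never emits an empty word, so "" is free to use.
    private static final String END = "";

    static Map<String, Integer> run(List<String> lines) {
        // Small bounded queue: put() blocks when it is full, so a slow
        // reducer automatically slows the tokenizer down.
        BlockingQueue<String> words = new ArrayBlockingQueue<>(4);
        Map<String, Integer> totals = new TreeMap<>();

        Thread tokenizer = new Thread(() -> {
            try {
                for (String line : lines) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            words.put(word);
                        }
                    }
                }
                words.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread reducer = new Thread(() -> {
            try {
                for (String word = words.take(); !word.isEmpty(); word = words.take()) {
                    totals.merge(word, 1, Integer::sum);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        tokenizer.start();
        reducer.start();
        try {
            tokenizer.join();
            reducer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the pipeline", e);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("To be, or not to be", "that is the Question")));
    }
}
```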
Example 1 : Word Count
[Diagram: Input → two Tokenizers in parallel → Reducer → Output (word, count)]
We can exploit multiple CPUs because lines can be processed in parallel
Example 1 : Word Count
[Diagram: Input → two Tokenizers → (word) → two Reducers → Output]
Use routing algorithms to select the next vertex or vertices
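One possible routing rule, sketched in plain Java: hash the word and take the remainder, so the same word always reaches the same reducer. This is an assumed scheme for illustration only; the slide does not say which algorithm Jet uses, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionRoutingSketch {
    // Pick a reducer for a word; the same word always maps to the same index.
    static int route(String word, int reducerCount) {
        return Math.floorMod(word.hashCode(), reducerCount);
    }

    // Each reducer keeps its own totals for the words routed to it.
    static List<Map<String, Integer>> countPerReducer(List<String> words, int reducerCount) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < reducerCount; i++) {
            reducers.add(new HashMap<>());
        }
        for (String word : words) {
            reducers.get(route(word, reducerCount)).merge(word, 1, Integer::sum);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(countPerReducer(words, 2));
    }
}
```

Since no word is split across reducers, each reducer's total for a word is already the final count for that word.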
Example 1 : Word Count
[Diagram: two nodes, each running Input → two Tokenizers → two Reducers → two Combiners → Output]
Distribute!!
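The combiner step can be pictured as merging partial totals. In this sketch (hypothetical names, plain Java, no Jet involved) each map stands for the running totals a reducer built from its share of the input, and the combiner adds them together.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {
    // Merge partial (word, count) totals into final totals.
    static Map<String, Integer> combine(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> nodeA = new HashMap<>();
        nodeA.put("to", 2);
        nodeA.put("be", 1);
        Map<String, Integer> nodeB = new HashMap<>();
        nodeB.put("to", 1);
        nodeB.put("or", 1);

        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(nodeA);
        partials.add(nodeB);
        System.out.println(combine(partials));
    }
}
```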
Example 2 : Foreign Currency
The Problem :
Time-series foreign exchange prices.
We want to compute moving averages in various ways
→ last n measurements, last 15, last 50, etc
Why ?
→ rapidly changing data
→ time-to-market benefits from fast processing
Why ?
→ gives a clearer view of the trend
Why ?
→ to demonstrate a different architecture pattern
→ processing a stream of data, don’t save first then analyse
→ partitioning a stream of data, for scaling
Example 2 : Foreign Currency
The Data
For convenience, we’re using end of day prices rather than live prices, so the frequency is one sample per day (24 × 60 × 60 × 1000 = 86,400,000 milliseconds), and only for the Euro.
<gesmes:Sender>
<gesmes:name>European Central Bank</gesmes:name>
</gesmes:Sender>
<Cube>
<Cube time="2017-04-20">
<Cube currency="USD" rate="1.0745"/>
<Cube currency="JPY" rate="117.16"/>
<Cube currency="BGN" rate="1.9558"/>
<Cube currency="CZK" rate="26.907"/>
<Cube currency="DKK" rate="7.4381"/>
<Cube currency="GBP" rate="0.8392"/>
<Cube currency="HUF" rate="313.5"/>
<Cube currency="PLN" rate="4.2588"/>
<Cube currency="RON" rate="4.5405"/>
Example 2 : Foreign Currency
[Diagram: Input: FX feed → (from,to,price) → Last n Window]
One Solution
Input arrives as a stream of individual prices, e.g. “EUR,GBP,0.8392”
Collate these into a batch of n per currency pair
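A minimal sketch of such a "last n" stage in plain Java. It is not the LastNProcessor from the demo: the class name, the string key built from the pair, and the choice of a sliding window (a new batch on every price once n have arrived) are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LastNWindowSketch {
    private final int n;
    // One window per currency pair, keyed as "from,to".
    private final Map<String, Deque<Double>> windows = new HashMap<>();

    LastNWindowSketch(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    // Accept one price; return the latest n prices (oldest first) for the pair
    // once n have arrived, or an empty list while the window is still filling.
    List<Double> accept(String from, String to, double price) {
        Deque<Double> window = windows.computeIfAbsent(from + "," + to, k -> new ArrayDeque<>());
        window.addLast(price);
        if (window.size() > n) {
            window.removeFirst();
        }
        return window.size() == n ? new ArrayList<>(window) : new ArrayList<>();
    }

    public static void main(String[] args) {
        LastNWindowSketch lastThree = new LastNWindowSketch(3);
        double[] prices = {0.8389, 0.8390, 0.8391, 0.8392};
        for (double price : prices) {
            System.out.println(lastThree.accept("EUR", "GBP", price));
        }
    }
}
```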
Example 2 : Foreign Currency
[Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average]
One Solution
Send a self-contained parcel of work to each calculator
A batch of n prices for a pair, e.g. “EUR,GBP,0.8392, 0.8391, 0.8390, 0.8389, …”
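What the two calculators might do with one such parcel, as a sketch only. The slides do not give the weighting scheme, so the linear weights below (oldest price weight 1, newest weight n) are an assumption, as is the oldest-first ordering of the batch and every name in the code.

```java
import java.util.Arrays;
import java.util.List;

final class AverageSketch {
    // Simple average: every price in the batch counts equally.
    static double simpleAverage(List<Double> prices) {
        requireNonEmpty(prices);
        double sum = 0;
        for (double price : prices) {
            sum += price;
        }
        return sum / prices.size();
    }

    // Assumed weighting: linear, oldest price has weight 1, newest has weight n.
    static double weightedAverage(List<Double> pricesOldestFirst) {
        requireNonEmpty(pricesOldestFirst);
        double weightedSum = 0;
        double weightTotal = 0;
        for (int i = 0; i < pricesOldestFirst.size(); i++) {
            int weight = i + 1;
            weightedSum += weight * pricesOldestFirst.get(i);
            weightTotal += weight;
        }
        return weightedSum / weightTotal;
    }

    private static void requireNonEmpty(List<Double> prices) {
        if (prices.isEmpty()) {
            throw new IllegalArgumentException("Need at least one price");
        }
    }

    public static void main(String[] args) {
        List<Double> batch = Arrays.asList(0.8389, 0.8390, 0.8391, 0.8392);
        System.out.println("simple   = " + simpleAverage(batch));
        System.out.println("weighted = " + weightedAverage(batch));
    }
}
```

Each calculator needs nothing beyond the batch it is handed, which is what makes the parcel self-contained.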
Example 2 : Foreign Currency
[Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average → (from,to,average) → Output: Store A and Output: Store B]
One Solution
Stream out the averages…
Your output is someone else’s input
Example 2 : Foreign Currency
[Diagram: Input: FX feed, split by currency. (from,USD,price) and (from,CAD,price) each go to their own Last n Window → n * (from,currency,price) → Simple Average and Weighted Average → (from,currency,average) → Output: Store A and Output: Store B]
One Solution
Partitioning provides performance. Send US Dollars and Canadian Dollars to different processor clones
Example 2 : Foreign Currency
One Solution
DEMO
https://github.com/neilstevenson/jeeconf2017
Jet Engine
Jet capability is easy to add to IMDG
Two steps and you’re ready to submit jobs!
<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>0.3.1</version>
</dependency>

@Bean
public JetInstance jetInstance(Config config) {
    JetConfig jetConfig = new JetConfig();
    jetConfig.setHazelcastConfig(config);
    return Jet.newJetInstance(jetConfig);
}
Jet Engine
Jet capability is the processing, but what about the start and end of the pipelines ?
A source creates output without input.
A sink consumes input without output.
Where it goes is just a matter of plumbing
→ Hazelcast IMDG, IMap and IList
→ Kafka
→ HDFS
→ flat files
→ sockets
→ easy to write your own, they’re just vertices
implement process() to consume input
implement complete() to generate output
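To make the process() and complete() split concrete, here is a sketch in plain Java. The Stage interface is hypothetical and only mirrors the wording above; Jet's real Processor interface has different signatures. The counting stage consumes words in process() and emits nothing until complete() is called.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

final class ProcessorSketch {
    // Hypothetical two-method contract; not the Jet Processor interface.
    interface Stage<I, O> {
        void process(I item, Consumer<O> emit);   // called for each input item
        void complete(Consumer<O> emit);          // called once, when input is exhausted
    }

    // Tallies words as they arrive and emits (word, count) pairs at the end.
    static final class CountingStage implements Stage<String, Map.Entry<String, Integer>> {
        private final Map<String, Integer> totals = new TreeMap<>();

        @Override
        public void process(String word, Consumer<Map.Entry<String, Integer>> emit) {
            totals.merge(word, 1, Integer::sum);  // consume only, nothing emitted yet
        }

        @Override
        public void complete(Consumer<Map.Entry<String, Integer>> emit) {
            totals.forEach((word, count) ->
                    emit.accept(new AbstractMap.SimpleImmutableEntry<>(word, count)));
        }
    }

    public static void main(String[] args) {
        CountingStage stage = new CountingStage();
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (String word : new String[] {"to", "be", "or", "not", "to", "be"}) {
            stage.process(word, output::add);
        }
        stage.complete(output::add);
        System.out.println(output);
    }
}
```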
Jet Engine
DAG construction is easy(?)
Create vertices, and edges to link them
public MaDAG(final int last) {
    // Source vertex: read from the historic currency map
    Vertex mapSource = this.newVertex("mapSource",
        Processors.readMap(Constants.MAP_HISTORIC_CURRENCY));

    // Last n window, fed by a partitioned edge
    Vertex lastN = this.newVertex("lastN", new LastNProcessorSupplier(last));
    this.edge(Edge.between(mapSource, lastN).partitioned(new MaKeyExtractor()));

    // Average calculator, fed from output 0 of the window
    Vertex sma = this.newVertex("sma", SmaProcessor::new);
    this.edge(Edge.from(lastN, 0).to(sma));

    // Sink vertex: write the results to another map
    Vertex smaMapSink = this.newVertex("smaMapSink",
        Processors.writeMap(Constants.MAP_SMA));
    this.edge(Edge.between(sma, smaMapSink));
}
But is there any easier way ?
Jet Engine
java.util.stream
An easier(?) way to construct a pipeline
Change from Java 8
Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet();
Map<String, Integer> wordCounts = entrySet.stream()
    .flatMap(m -> Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toMap(
        key -> key,
        value -> 1,
        Integer::sum));
Jet Engine
com.hazelcast.jet.stream
An easier(?) way to construct a pipeline
Change to Jet
IStreamMap<Integer, String> streamMap = IStreamMap.streamMap(sourceMap);
IMap<String, Integer> wordCounts = streamMap.stream()
    .flatMap(m -> Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toIMap(
        key -> key,
        value -> 1,
        Integer::sum));
More thinking than typing
Jet Engine
DAG v java.util.stream
Jet provides a java.util.stream interface – high-level constructs
like Java 8’s collect(), distinct(), filter(), reduce(), sorted() etc.
but run distributed
Or use the DAG approach, for low-level, fine-grained control
Or mix & match
Vertex tokenize = dag.newVertex("tokenize",
    flatMap((String line) ->
        traverseArray(delimiter.split(line.toLowerCase()))
            .filter(word -> !word.isEmpty())));
Jet Engine
DAG v java.util.stream
Vertex tokenize = dag.newVertex("tokenize",
    flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
        .filter(word -> !word.isEmpty())));
In Java 8, filter is declared as
java.util.stream.Stream<T>
java.util.stream.Stream.filter(Predicate<? super T> predicate)
But the Jet version is
com.hazelcast.jet.stream.DistributedStream<T>
com.hazelcast.jet.stream.DistributedStream.filter(
    com.hazelcast.jet.Distributed.Predicate<? super T> predicate)
The Distributed.Predicate is also Serializable, so you can send copies to the grid to execute, remotely and in parallel
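Why a serializable functional interface makes this possible can be shown without Jet at all. In this sketch SerializablePredicate is a hypothetical stand-in for a distributed predicate type, and the byte array stands in for the network: the lambda is written out, read back as a copy, and the copy still works.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Predicate;

final class SerializableLambdaSketch {
    // Hypothetical stand-in: a Predicate that is also Serializable.
    interface SerializablePredicate<T> extends Predicate<T>, Serializable { }

    // Serialize to bytes and back, standing in for a trip across the network.
    @SuppressWarnings("unchecked")
    static <T> SerializablePredicate<T> copyViaBytes(SerializablePredicate<T> predicate) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(predicate);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializablePredicate<T>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not copy predicate", e);
        }
    }

    // Because the target type is Serializable, the compiler makes this lambda serializable.
    static SerializablePredicate<String> notEmpty() {
        return word -> !word.isEmpty();
    }

    public static void main(String[] args) {
        SerializablePredicate<String> copy = copyViaBytes(notEmpty());
        System.out.println(copy.test("jet"));
        System.out.println(copy.test(""));
    }
}
```

A lambda assigned to a plain java.util.function.Predicate would fail at writeObject with a NotSerializableException, which is why the plain Java 8 type is not enough for remote execution.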
Jet Engine
Architecture
Jet Roadmap
Features and descriptions
• Robust Stream Processing: Processing guarantees for stream processing | Streaming specific features (windowing, triggering)
• High Performance
• Hazelcast Integrations: JCache | Map and Cache events using partition ring buffer | CQ Cache | Projection and Predicate for Map source
• Management Center: Management and monitoring features for Jet.
• More Connectors: JMS | JDBC
• Cloud Deployment: Pivotal Cloud Foundry | OpenShift
Part 1 – Jet Engine
Performance
Fastest in town!
Part 1 – Jet Engine
Performance
Run the graph on as many machines as necessary or available
→ Fan-out the input
→ Send from node to node, local or distributed
→ Fan-in the output
Conclusions
Stream Processing
• Suitable when data arrives too fast to process after storing, or where you don’t care to store
• Needs a much more functional programming style than traditional Java
• → lambdas feature heavily
• Java streams is ok, might be all you need
• → makes good use of a single machine
• Jet streams is better, for bigger volumes
• → makes use of multiple machines
• Jet is from Hazelcast
• → easy to get going, deploy to bare metal or any cloud
• Alternatives exist, such as Spark and Flink
• → Jet is open-source, Java, faster, no Zookeeper
The End
https://github.com/neilstevenson/jeeconf2017
neil@hazelcast.com
https://jet.hazelcast.org/
https://github.com/hazelcast/hazelcast-jet
Stack Overflow “hazelcast-jet” or Google Group
https://gitter.im/hazelcast/home

JEEConf 2017 - In-Memory Data Streams With Hazelcast Jet

  • 1. In-Memory Data Streams With NEIL STEVENSON neil@hazelcast.com 27th May 2017 13:25-14:10
  • 2. © 2017 Hazelcast Inc. Confidential & Proprietary Outline • Hazelcast • → The company, the software, and my role • Background • → Why stream at all ? • Java 8 streams • → What did Java 8 add to Java 7 • → Why isn’t this good enough ? • Hazelcast Jet, part #1 • → Introduction and outline architecture • → Low level abstractions : directed acyclic graphs • A sample application, available to download : not Word Count • Hazelcast Jet, part #2 • → Higher level abstractions → distributed java.util.stream
  • 3. Hazelcast : The company, the software and my role. The Company: Founded in 2008, based out of Palo Alto, California with offices worldwide. Provides commercial support and value-add subscription features for open source Hazelcast software. The Software: Apache 2 licensed, available to download from GitHub, from https://hazelcast.org or https://hazelcast.com My Role: Solutions Architect – help customers, give talks, drink coffee, write code, drink coffee
  • 4. Part 1 – Fast Big Data. DAG = Directed Acyclic Graph. Model the flow of data from processing stage to processing stage → a stream of data, potentially infinite → process as it comes in, don’t save first, maybe never save → enrich, deplete, filter, split, etc. as data passes through → at memory speeds, no waiting for disks
  • 6. Part 1 – Fast Big Data. Stream and Fast In-Memory Batch Processing. [Diagram: Enrichment Databases; Ingest from IoT, Social Networks, Enterprise Applications, Databases/Hazelcast IMDG and HDFS/Spark (labelled Stream, Stream, Stream, Batch, Batch); Output to Alerts, Enterprise Applications, Interactive Analytics and Databases/Hazelcast IMDG]
  • 7. Jet : Directed Acyclic Graph. VERTEX: The vertex is just the processing node in a pipeline. → Input comes in from somewhere, the first stage or the previous stage → Output goes out somewhere, the last stage or the next stage → Stateless or stateful → Split, filter, enrich, deplete, fan-out, fan-in the data, many possibilities
  • 8. Jet : Directed Acyclic Graph. EDGE: The edge is just the data transmission in the pipeline. → Out of one processor into the next one → Out of one processor into the next ones → The next processor can be on any JVM, local or distributed routing → Back-pressure system throttles producer when consumer cannot keep up
  • 9. Part 1 – Jet Engine. Stream Processing: Traditional processing is based on calculations on stored data. Stream processing is about calculations prior to storage. Streams are immutable. Streams may be infinite. The “pipeline” paradigm (input → process → output). Pipeline stages are lambdas : (x, y) -> {return x * y;}
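The pipeline idea on this slide can be sketched with plain java.util.stream, with no Jet involved. The class and method names below are illustrative assumptions, not code from the talk; only the multiplying lambda comes from the slide.

```java
import java.util.function.BinaryOperator;
import java.util.stream.Stream;

final class LambdaPipelineSketch {
    // The lambda from the slide, typed as a BinaryOperator so a stream can apply it.
    static final BinaryOperator<Integer> MULTIPLY = (x, y) -> { return x * y; };

    // input -> process -> output over a potentially infinite source:
    // Stream.iterate never ends on its own, so limit() makes the demo finite.
    static int productOfFirst(int n) {
        return Stream.iterate(1, x -> x + 1)   // input: 1, 2, 3, ...
                .limit(n)                      // process: keep the first n items
                .reduce(1, MULTIPLY);          // output: running product
    }

    public static void main(String[] args) {
        System.out.println(productOfFirst(5)); // prints 120
    }
}
```

Nothing is stored along the way: each value is produced, multiplied into the running product and discarded.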
  • 10. Part 1 – Jet Engine. What does it do ? Stream Processing, In-memory, Distributed
  • 11. Example 1 : Word Count. Word Count is the “hello world” of stream processing. The Problem • Count how many times each word occurs in some text • Trivial, but shows some major concepts. Input • Hamlet’s Soliloquy 1: To be, or not to be, that is the Question: 2: Whether ’tis Nobler in the mind to suffer 3: The Slings and Arrows of outragious Fortune, 4: Or to take Armes against a Sea of troubles, Output the=23 to=14 and=13 be=4 …
  • 12. Example 1 : Word Count. Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet(); Map<String, Integer> wordCounts = entrySet.stream() .flatMap(m -> Stream.of(Constants.WORDS_PATTERN.split(m.getValue()))) .map(String::toLowerCase) .filter(m -> m.length() >= 5) .collect(toMap( key -> key, value -> 1, Integer::sum)); In Java we would basically iterate and tally. How can the JVM optimise?
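To make "iterate and tally" concrete, here is a self-contained sketch of the same count written both as a Java 7 style loop and as a Java 8 stream. It is not the talk's code: the WORDS pattern is an assumed stand-in for Constants.WORDS_PATTERN, the input is a plain list of lines rather than a map, and the slide's length filter is left out so short words are counted too.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordCountSketch {
    // Assumption: words are separated by runs of non-word characters.
    static final Pattern WORDS = Pattern.compile("\\W+");

    // Java 7 style: iterate and tally by hand.
    static Map<String, Integer> countWithLoop(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : WORDS.split(line.toLowerCase())) {
                if (word.isEmpty()) {
                    continue; // split() yields "" when a line starts with punctuation
                }
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Java 8 style: describe the stages and let the library run them.
    static Map<String, Integer> countWithStream(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(WORDS.split(line.toLowerCase())))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be, that is the Question:");
        System.out.println(countWithLoop(lines).equals(countWithStream(lines)));
    }
}
```

The stream version states what should happen at each stage rather than how to loop, which is the property a runtime needs before it can parallelise or distribute the stages.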
  • 13. Example 1 : Word Count. [Diagram: Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output. Tokenizer: split the text into words; for each word emit (word). Reducer: collect running totals; once everything is finished, emit all pairs of (word, count).] But really this is just a pipeline, so a DAG
  • 14. Example 1 : Word Count. [Diagram: Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output. Tokenizer: split the text into words; for each word emit (word). Reducer: collect running totals; once everything is finished, emit all pairs of (word, count).] Using queues between vertices allows each to run in parallel, at their own speed
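The queue-between-vertices idea can be tried out with two threads and a bounded java.util.concurrent queue. This is only a single-JVM sketch under assumed names (QueuedStagesSketch, the DONE sentinel, the capacity of 16), not how Jet implements its edges. The bounded queue also gives a crude form of back-pressure: put() blocks the tokenizer whenever the reducer falls behind.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueuedStagesSketch {
    // Hypothetical end-of-stream marker; it cannot collide with a real word.
    private static final String DONE = "\u0000DONE";

    static Map<String, Integer> wordCount(List<String> lines) {
        BlockingQueue<String> words = new ArrayBlockingQueue<>(16);
        Map<String, Integer> counts = new HashMap<>();

        // Tokenizer vertex: split the text into words, emit each word.
        Thread tokenizer = new Thread(() -> {
            try {
                for (String line : lines) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            words.put(word); // blocks while the queue is full
                        }
                    }
                }
                words.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Reducer vertex: collect running totals until the input is finished.
        Thread reducer = new Thread(() -> {
            try {
                for (String w = words.take(); !w.equals(DONE); w = words.take()) {
                    counts.merge(w, 1, Integer::sum);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        tokenizer.start();
        reducer.start();
        try {
            tokenizer.join();
            reducer.join(); // join() makes the reducer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for stages", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(java.util.Arrays.asList("To be, or not to be")));
    }
}
```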
  • 15. Example 1 : Word Count. [Diagram: Input → two Tokenizers → Reducer → (word, count) → Output] We can exploit multiple CPUs because lines can be processed in parallel
  • 16. Example 1 : Word Count. [Diagram: Input → two Tokenizers → (word) → two Reducers → Output] Use routing algorithms to select the next vertex or vertices
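The slide does not say which routing algorithm is used. One common choice, assumed here purely for illustration, is hash partitioning: every occurrence of a word goes to the reducer selected by the word's hash, so each reducer holds complete totals for its own words. All names in the sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionRoutingSketch {
    // Hash partitioning: the same key always maps to the same reducer index.
    // floorMod keeps the index non-negative even for negative hash codes.
    static int route(String key, int reducers) {
        return Math.floorMod(key.hashCode(), reducers);
    }

    // Each reducer keeps its own map; no word is split across two reducers.
    static List<Map<String, Integer>> count(List<String> words, int reducers) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partitions.add(new HashMap<>());
        }
        for (String word : words) {
            partitions.get(route(word, reducers)).merge(word, 1, Integer::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(count(words, 2));
    }
}
```

Because no word appears in more than one partition, the reducers' outputs can simply be concatenated, with no second merge step.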
  • 17. Example 1 : Word Count. [Diagram: two nodes, each with Input, two Tokenizers, two Reducers, two Combiners and Output] Distribute!!
  • 18. Example 2 : Foreign Currency. The Problem : Time-series foreign exchange prices. We want to compute moving averages in various ways → last n measurements, last 15, last 50, etc. Why ? → rapidly changing data → time-to-market benefits from fast processing. Why ? → gives a clearer view of the trend. Why ? → to demonstrate a different architecture pattern → processing a stream of data, don’t save first then analyse → partitioning a stream of data, for scaling
  • 19. Example 2 : Foreign Currency. The Data: For convenience, we’re using end of day prices rather than live prices, so frequency is one sample per 24 × 60 × 60 × 1000 milliseconds. And only for the Euro. <gesmes:Sender> <gesmes:name>European Central Bank</gesmes:name> </gesmes:Sender> <Cube> <Cube time="2017-04-20"> <Cube currency="USD" rate="1.0745"/> <Cube currency="JPY" rate="117.16"/> <Cube currency="BGN" rate="1.9558"/> <Cube currency="CZK" rate="26.907"/> <Cube currency="DKK" rate="7.4381"/> <Cube currency="GBP" rate="0.8392"/> <Cube currency="HUF" rate="313.5"/> <Cube currency="PLN" rate="4.2588"/> <Cube currency="RON" rate="4.5405"/>
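As a rough illustration of turning this feed into (currency, rate) pairs, the sketch below reads the nested Cube elements with the JDK's built-in DOM parser. The class name and the trimmed sample document are assumptions; the demo project may load the data quite differently.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EcbRatesSketch {
    // Returns currency -> rate against the Euro, in document order.
    static Map<String, Double> parseRates(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList cubes = doc.getElementsByTagName("Cube");
            Map<String, Double> rates = new LinkedHashMap<>();
            for (int i = 0; i < cubes.getLength(); i++) {
                Element cube = (Element) cubes.item(i);
                // Only the innermost Cube elements carry currency and rate.
                if (cube.hasAttribute("currency")) {
                    rates.put(cube.getAttribute("currency"),
                            Double.parseDouble(cube.getAttribute("rate")));
                }
            }
            return rates;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse rates XML", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed, well-formed sample using two of the rates shown on the slide.
        String xml = "<Cube><Cube time=\"2017-04-20\">"
                + "<Cube currency=\"USD\" rate=\"1.0745\"/>"
                + "<Cube currency=\"GBP\" rate=\"0.8392\"/>"
                + "</Cube></Cube>";
        System.out.println(parseRates(xml));
    }
}
```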
  • 20. Example 2 : Foreign Currency. [Diagram: Input: FX feed → (from,to,price) → Last n Window] One Solution: Input arrives as a stream of individual prices, e.g. “EUR,GBP,0.8392”. Collate these into a batch of n per pair
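A "last n" window per currency pair can be sketched with one bounded deque per pair. The sketch assumes a sliding window that emits the latest n prices every time a new price arrives once the window is full; the talk's LastNProcessorSupplier is not shown in the slides and may batch differently. All names below are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LastNWindowSketch {
    private final int n;
    // One window per currency pair, keyed as "from,to".
    private final Map<String, Deque<Double>> windows = new HashMap<>();

    LastNWindowSketch(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    // Accepts one price; returns the latest n prices for that pair (oldest
    // first) once n have been seen, otherwise an empty list.
    List<Double> accept(String from, String to, double price) {
        Deque<Double> window = windows.computeIfAbsent(from + "," + to, k -> new ArrayDeque<>());
        window.addLast(price);
        if (window.size() > n) {
            window.removeFirst(); // drop the oldest price
        }
        return window.size() == n ? new ArrayList<>(window) : Collections.<Double>emptyList();
    }

    public static void main(String[] args) {
        LastNWindowSketch lastTwo = new LastNWindowSketch(2);
        System.out.println(lastTwo.accept("EUR", "GBP", 0.8390));
        System.out.println(lastTwo.accept("EUR", "GBP", 0.8391));
        System.out.println(lastTwo.accept("EUR", "GBP", 0.8392));
    }
}
```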
  • 21. Example 2 : Foreign Currency. [Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average] One Solution: Send a self-contained parcel of work to each calculator. A batch of n prices for a pair, e.g. “EUR,GBP,0.8392, 0.8391, 0.8390, 0.8389, …”
  • 22. Example 2 : Foreign Currency. [Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average → (from,to,average) → Output: Store A and Output: Store B] One Solution: Stream out the averages…. Your output is someone else’s input
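The two calculators can be sketched as pure functions over one batch of n prices. The slides do not state the weighting scheme, so the linear weights below (oldest price weight 1, newest weight n) are an assumption, as are the class and method names.

```java
import java.util.Arrays;
import java.util.List;

final class MovingAveragesSketch {
    // Simple average: every price in the batch counts equally.
    static double simple(List<Double> prices) {
        requireNonEmpty(prices);
        double sum = 0;
        for (double price : prices) {
            sum += price;
        }
        return sum / prices.size();
    }

    // Weighted average with assumed linear weights: the i-th oldest price
    // gets weight i, so recent prices influence the result more.
    static double weighted(List<Double> pricesOldestFirst) {
        requireNonEmpty(pricesOldestFirst);
        double weightedSum = 0;
        double weightTotal = 0;
        for (int i = 0; i < pricesOldestFirst.size(); i++) {
            int weight = i + 1;
            weightedSum += weight * pricesOldestFirst.get(i);
            weightTotal += weight;
        }
        return weightedSum / weightTotal;
    }

    private static void requireNonEmpty(List<Double> prices) {
        if (prices.isEmpty()) {
            throw new IllegalArgumentException("Need at least one price");
        }
    }

    public static void main(String[] args) {
        List<Double> batch = Arrays.asList(0.8389, 0.8390, 0.8391, 0.8392);
        System.out.println("simple=" + simple(batch) + " weighted=" + weighted(batch));
    }
}
```

Each call needs nothing beyond the batch it is given, which is what lets the calculators run as independent vertices.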
  • 23. Example 2 : Foreign Currency. [Diagram: Input: FX feed is split into (from,USD,price) and (from,CAD,price); each goes to its own Last n Window, which sends n * (from,USD,price) or n * (from,CAD,price) to its own Simple Average and Weighted Average; these emit (from,USD,average) and (from,CAD,average) to Output: Store A and Output: Store B] One Solution: Partitioning provides performance. Send US Dollars and Canadian Dollars to different processor clones
  • 24. Example 2 : Foreign Currency. One Solution: DEMO https://github.com/neilstevenson/jeeconf2017
  • 25. Example 2 : Foreign Currency. One Solution
  • 26. Jet Engine. Jet capability is easy to add to IMDG. Two steps and you’re ready to submit jobs! <dependency> <groupId>com.hazelcast.jet</groupId> <artifactId>hazelcast-jet</artifactId> <version>0.3.1</version> </dependency> @Bean public JetInstance jetInstance(Config config) { JetConfig jetConfig = new JetConfig(); jetConfig.setHazelcastConfig(config); return Jet.newJetInstance(jetConfig); }
  • 27. Jet Engine. Jet capability is the processing, but what about the start and end of the pipelines ? A source creates output without input. A sink consumes input without output. Where it goes is just a matter of plumbing → Hazelcast IMDG, IMap and IList → Kafka → HDFS → flat files → sockets → easy to write your own, they’re just vertices: implement process() to consume input, implement complete() to generate output
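To show the shape of "a source creates output without input, a sink consumes input without output", here is a deliberately simplified sketch. SketchVertex is a hypothetical interface invented for this example; it is not the real Jet Processor contract, which has different method signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified contract; NOT the real Jet Processor interface.
interface SketchVertex<I, O> {
    void process(I item);   // consume one input item
    List<O> complete();     // called once input is exhausted; returns final output
}

// A source: produces output without consuming input.
final class CountingSource implements SketchVertex<Void, Integer> {
    private final int limit;

    CountingSource(int limit) {
        this.limit = limit;
    }

    @Override
    public void process(Void item) {
        // a source has no input, so there is nothing to do here
    }

    @Override
    public List<Integer> complete() {
        List<Integer> out = new ArrayList<>();
        for (int i = 1; i <= limit; i++) {
            out.add(i);
        }
        return out;
    }
}

// A sink: consumes input and emits nothing downstream.
final class CollectingSink implements SketchVertex<Integer, Void> {
    final List<Integer> stored = new ArrayList<>();

    @Override
    public void process(Integer item) {
        stored.add(item);
    }

    @Override
    public List<Void> complete() {
        return new ArrayList<>();
    }
}

final class SourceSinkSketch {
    // The "plumbing": hand everything the source emits to the sink.
    static List<Integer> run(int limit) {
        CountingSource source = new CountingSource(limit);
        CollectingSink sink = new CollectingSink();
        for (Integer item : source.complete()) {
            sink.process(item);
        }
        sink.complete();
        return sink.stored;
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [1, 2, 3]
    }
}
```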
  • 28. Jet Engine. DAG construction is easy(?) Create vertices, and edges to link them. public MaDAG (final int last) { Vertex mapSource = this.newVertex("mapSource", Processors.readMap(Constants.MAP_HISTORIC_CURRENCY)); Vertex lastN = this.newVertex("lastN", new LastNProcessorSupplier(last)); this.edge(Edge.between(mapSource, lastN).partitioned(new MaKeyExtractor())); Vertex sma = this.newVertex("sma", SmaProcessor::new); this.edge(Edge.from(lastN, 0).to(sma)); Vertex smaMapSink = this.newVertex("smaMapSink", Processors.writeMap(Constants.MAP_SMA)); this.edge(Edge.between(sma, smaMapSink)); But is there any easier way ?
  • 29. Jet Engine java.util.stream. An easier(?) way to construct a pipeline. Change from Java 8: Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet(); Map<String, Integer> wordCounts = entrySet.stream() .flatMap(m -> Stream.of(Constants.WORDS_PATTERN.split(m.getValue()))) .map(String::toLowerCase) .filter(m -> m.length() >= 5) .collect(toMap( key -> key, value -> 1, Integer::sum));
  • 30. Jet Engine com.hazelcast.jet.stream. An easier(?) way to construct a pipeline. Change to Jet: IStreamMap<Integer, String> streamMap = IStreamMap.streamMap(sourceMap); IMap<String, Integer> wordCounts = streamMap.stream() .flatMap(m -> Stream.of(Constants.WORDS_PATTERN.split(m.getValue()))) .map(String::toLowerCase) .filter(m -> m.length() >= 5) .collect(toIMap( key -> key, value -> 1, Integer::sum)); More thinking than typing
  • 31. Jet Engine DAG v java.util.stream. Jet provides a java.util.stream interface: high-level constructs like Java 8’s collect(), distinct(), filter(), reduce(), sorted() etc., but run distributed. Or use the DAG approach, for low-level, fine-grained control. Or mix & match: Vertex tokenize = dag.newVertex("tokenize", flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase())) .filter(word -> !word.isEmpty())));
  • 32. Jet Engine DAG v java.util.stream. Vertex tokenize = dag.newVertex("tokenize", flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase())) .filter(word -> !word.isEmpty()))); Here filter implements java.util.stream.Stream<T> java.util.stream.Stream.filter(Predicate<? super T> predicate) But the Jet version is com.hazelcast.jet.stream.DistributedStream<T> com.hazelcast.jet.stream.DistributedStream.filter(com.hazelcast.jet.Distributed.Predicate<? super T> predicate) So you can send copies to the grid to execute, remotely and in parallel
  • 33. Jet Engine Architecture
  • 34. Jet Roadmap (Features: Description). Robust Stream Processing: processing guarantees for stream processing; streaming-specific features (windowing, triggering). High Performance Hazelcast Integrations: JCache; Map and Cache events using partition ring buffer; CQ Cache; Projection and Predicate for Map source. Management Center: management and monitoring features for Jet. More Connectors: JMS; JDBC. Cloud Deployment: Pivotal Cloud Foundry; OpenShift
  • 35. Part 1 – Jet Engine Performance. Fastest in town!
  • 36. Part 1 – Jet Engine Performance. Run the graph on as many machines as necessary or available → Fan-out the input → Send from node to node, local or distributed → Fan-in the output
  • 37. Conclusions. Stream Processing • Suitable when data arrives too fast to process after storing, or where you don’t care to store • Needs a much more functional programming style than traditional Java → lambdas feature heavily • Java streams are OK, and might be all you need → they make good use of a single machine • Jet streams are better for bigger volumes → they make use of multiple machines • Jet is from Hazelcast → easy to get going, deploy to bare metal or any cloud • Alternatives exist, such as Spark and Flink → Jet is open-source, Java, faster, no ZooKeeper
  • 38. The End. https://github.com/neilstevenson/jeeconf2017 neil@hazelcast.com https://jet.hazelcast.org/ https://github.com/hazelcast/hazelcast-jet Stack Overflow “hazelcast-jet” or Google Group https://gitter.im/hazelcast/home