The capabilities of machine learning are now pretty well understood and there are great tools to do data science and construct models that answer nontrivial questions about your data. These tools are often used from Python.
The key new challenge is making the trained prediction model usable in real time, while the user is interacting with your software. Getting answers from an ML model takes a lot of CPU/GPU time and must be done at serious scale. ML tools are optimized mainly for batch-processing a lot of data at once, and their implementations often aren't parallelized.
In this talk I will show one approach which allows you to write a low-latency, auto-parallelized and distributed stream processing pipeline in Java that seamlessly integrates with a data scientist's work taken in almost unchanged form from their Python development environment.
The talk includes a live demo using the command line and going through some Python and Java code snippets.
2. About Us
Hazelcast started as an In-Memory Data Grid project
(distributed caching with data-local computation)
Java codebase, polyglot clients
we focus on Simplicity
in 2017 we added Hazelcast Jet: In-Memory Distributed Stream Processing
My role: core engineer at Hazelcast Jet
13. Jet Node 1
Jet Job's Execution Plan
[Diagram: Kafka Topic A -> Source -> mapUsingPython (two parallel instances, each exchanging batches with its own Python process) -> Sink -> Kafka Topic B]
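The Python side of a mapUsingPython stage can be sketched as a single handler function that receives one batch and returns one result per item. This is a minimal sketch under assumptions: the function name `transform_list` and the list-of-strings contract reflect how I understand Jet's Python handler convention, so check them against the Jet version you use. The `predict` function is a hypothetical stand-in for a real trained model.

```python
import math
from typing import List


def predict(x: float) -> float:
    """Hypothetical stand-in for a trained model: a fixed logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def transform_list(input_list: List[str]) -> List[str]:
    """Score one batch of items.

    Assumed contract: items arrive as strings, and the result must hold
    exactly one output string per input string, in the same order.
    """
    return ["%.4f" % predict(float(item)) for item in input_list]
```

Handling a whole batch per call lets the model amortize its per-call overhead, which is why the diagram shows batches rather than single items moving between Jet and each Python process.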
14. Cluster Elasticity and Resilience
● Jet jobs are fault-tolerant
● nodes can join and leave the cluster, jobs go on
● automatically rescale to available hardware
16. Cluster Self-Formation
Hazelcast Jet natively supports:
● Amazon AWS
● Google GCP
● Kubernetes
With simple configuration, the nodes self-discover in these environments
17. Source and Sink Connectors
● Kafka
● Hadoop HDFS
● S3 bucket
● JDBC
● JMS queue and topic
● custom connector
● test connectors: convenient to test the code before submitting
18. Stream Operators
● windowed aggregation using Event Time
○ sliding, session window
○ count, sum, average, linear regression, ...
○ custom aggregate function
● rolling aggregation
● streaming join (co-grouping)
● hash join (enrichment)
● contact arbitrary external services
○ mapUsingPython uses this
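To make the first operator in the list concrete, here is a minimal sketch of sliding-window counting by event time, written in plain Python. It is not Jet's implementation: it works on a finite list rather than an unbounded stream and ignores late events and watermarks. The function name and parameters are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sliding_window_counts(
    events: Iterable[Tuple[int, str]], window_size: int, slide_by: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in each sliding window.

    Each event is (event_time, key). A window is identified by its
    exclusive end time, which is a multiple of slide_by, and it covers
    the interval [end - window_size, end).
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Smallest window end strictly greater than the event time.
        first_end = (timestamp // slide_by + 1) * slide_by
        # The event falls into every window whose end lies in
        # (timestamp, timestamp + window_size].
        for end in range(first_end, timestamp + window_size + 1, slide_by):
            counts[(end, key)] += 1
    return dict(counts)
```

With `window_size=4` and `slide_by=2`, an event at time 3 is counted in the windows ending at 4 and 6. The window an event belongs to depends only on its own timestamp, not on when it arrives, which is the point of using event time.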