SlideShare a Scribd company logo
Real-Time Streaming with
Python ML Inference
Marko Topolnik
1
About Us
Hazelcast started as an In-Memory Data Grid project
(distributed caching with data-local computation)
Java codebase, polyglot clients
we focus on Simplicity
in 2017 we added Hazelcast Jet: In-Memory Distributed Stream
Processing
My role: core engineer at Hazelcast Jet
2
Example: Salary Prediction
Python, SciKit Learn
3
Sample Input and Output
Input: {
"age": 25,
"workclass": "Self-emp",
"fnlwgt": 176756,
"education": "HS-grad",
"education-num": 9,
"marital-status": "Never-married",
"occupation": "Farming-fishing",
"relationship": "Own-child",
"capital-gain": 0,
"capital-loss": 0,
"hours-per-week": 35,
"native-country": "United-States"
}
Output: {
"probability": 0.85
"income": "<=50K"
}
4
(Showing Project Directory)
5
Data Scientist's Perspective
Do Data Science Profit!
6
Data Engineer's Perspective
Do Data Science
Monitor
Profit
(fingers crossed)
7
Package
Load-Balance
Scale Up
Scale Out
Deploy
Re-Train
Productionizing the REST service
8
Request #1
Request #2
Request #3
R
E
S
T
Batching?
Productionizing the REST service
9
Request #1
Request #2
Request #3
R
E
S
T
Request
Queue
Batch
Response
Queue
Parallelism?
Productionizing the REST service
10
Request #1
Request #2
Request #3
R
E
S
T
R
E
S
T
Load
Balancer
Request
Queue
Response
Queue
Batch
Batch
Replace REST with Distributed Streaming
11
Request #1
Request #2
Request #3
Kafka
Hazelcast
Jet Cluster
$ jet submit ML
Hazelcast Jet Code
Pipeline p = Pipeline.create();
p.readFrom(Kafka.source())
.apply(mapUsingPython(new PythonServiceConfig()
.setBaseDir("/Users/mtopol/dev/python/sklearn")
.setHandlerModule("example_1_inference_jet")))
.writeTo(Kafka.sink());
jet.newJob(p);
$ mvn package
$ jet submit target/my-job.jar
12
Jet Node 1
Jet Job's Execution Plan
Source
mapUsing
Python
mapUsing
Python
Sink
13
Kafka Topic A
Python
process
Kafka Topic B
Batch
Python
process
Batch
Cluster Elasticity and Resilience
● Jet jobs are fault-tolerant
● nodes can join and leave the cluster, jobs go on
● automatically rescale to available hardware
14
Let's Start a Jet Cluster!
15
Cluster Self-Formation
16
Hazelcast Jet natively supports:
● Amazon AWS
● Google GCP
● Kubernetes
With simple configuration, the nodes self-discover in these
environments
Source and Sink Connectors
● Kafka
● Hadoop HDFS
● S3 bucket
● JDBC
● JMS queue and topic
● custom connector
● test connectors: convenient to test the code before submitting
17
Stream Operators
● windowed aggregation using Event Time
○ sliding, session window
○ count, sum, average, linear regression, ...
○ custom aggregate function
● rolling aggregation
● streaming join (co-grouping)
● hash join (enrichment)
● contact arbitrary external services
○ mapUsingPython uses this
18
Thanks for attending!
Q&A
marko@hazelcast.com
@mtopolnik
19

More Related Content

Similar to Real-Time Streaming with Python ML Inference

Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
DataWorks Summit/Hadoop Summit
 
Webinar elastic stack {on telecom} english webinar part (1)
Webinar elastic stack {on telecom} english webinar part (1)Webinar elastic stack {on telecom} english webinar part (1)
Webinar elastic stack {on telecom} english webinar part (1)
Yassine, LASRI
 
Transforming your application with Elasticsearch
Transforming your application with ElasticsearchTransforming your application with Elasticsearch
Transforming your application with Elasticsearch
Brian Ritchie
 
Test Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely testsTest Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely tests
Hugh McCamphill
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
Dataiku
 
Contract-driven development with OpenAPI 3 and Vert.x | DevNation Tech Talk
Contract-driven development with OpenAPI 3 and Vert.x | DevNation Tech TalkContract-driven development with OpenAPI 3 and Vert.x | DevNation Tech Talk
Contract-driven development with OpenAPI 3 and Vert.x | DevNation Tech Talk
Red Hat Developers
 
Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)
ibwhite
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
Russell Jurney
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
DECK36
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
Russell Jurney
 
Open Source World June '21 -- JSON Within a Relational Database
Open Source World June '21 -- JSON Within a Relational DatabaseOpen Source World June '21 -- JSON Within a Relational Database
Open Source World June '21 -- JSON Within a Relational Database
Dave Stokes
 
Learning analytics inside government
Learning analytics inside governmentLearning analytics inside government
Learning analytics inside government
mwebbjisc
 
Montreal Elasticsearch Meetup
Montreal Elasticsearch MeetupMontreal Elasticsearch Meetup
Montreal Elasticsearch Meetup
Loïc Bertron
 
Docker Summit MongoDB - Data Democratization
Docker Summit MongoDB - Data Democratization Docker Summit MongoDB - Data Democratization
Docker Summit MongoDB - Data Democratization
Chris Grabosky
 
DevSecCon London 2018: Open DevSecOps
DevSecCon London 2018: Open DevSecOpsDevSecCon London 2018: Open DevSecOps
DevSecCon London 2018: Open DevSecOps
DevSecCon
 
Xain.io exhibiting at Berlin Tech Job Fair Spring 2020
Xain.io exhibiting at Berlin Tech Job Fair Spring 2020Xain.io exhibiting at Berlin Tech Job Fair Spring 2020
Xain.io exhibiting at Berlin Tech Job Fair Spring 2020
TechMeetups
 
Tapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and Flink
Michael Häusler
 
Telling the LivePerson Technology Story at Couchbase [SF] 2013
Telling the LivePerson Technology Story at Couchbase [SF] 2013Telling the LivePerson Technology Story at Couchbase [SF] 2013
Telling the LivePerson Technology Story at Couchbase [SF] 2013
LivePerson
 
Human-in-the-loop: a design pattern for managing teams which leverage ML by P...
Human-in-the-loop: a design pattern for managing teams which leverage ML by P...Human-in-the-loop: a design pattern for managing teams which leverage ML by P...
Human-in-the-loop: a design pattern for managing teams which leverage ML by P...
Big Data Spain
 
ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!
ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!
ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!
Serdar Basegmez
 

Similar to Real-Time Streaming with Python ML Inference (20)

Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Webinar elastic stack {on telecom} english webinar part (1)
Webinar elastic stack {on telecom} english webinar part (1)Webinar elastic stack {on telecom} english webinar part (1)
Webinar elastic stack {on telecom} english webinar part (1)
 
Transforming your application with Elasticsearch
Transforming your application with ElasticsearchTransforming your application with Elasticsearch
Transforming your application with Elasticsearch
 
Test Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely testsTest Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely tests
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Contract-driven development with OpenAPI 3 and Vert.x | DevNation Tech Talk
Contract-driven development with OpenAPI 3 and Vert.x | DevNation Tech TalkContract-driven development with OpenAPI 3 and Vert.x | DevNation Tech Talk
Contract-driven development with OpenAPI 3 and Vert.x | DevNation Tech Talk
 
Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Open Source World June '21 -- JSON Within a Relational Database
Open Source World June '21 -- JSON Within a Relational DatabaseOpen Source World June '21 -- JSON Within a Relational Database
Open Source World June '21 -- JSON Within a Relational Database
 
Learning analytics inside government
Learning analytics inside governmentLearning analytics inside government
Learning analytics inside government
 
Montreal Elasticsearch Meetup
Montreal Elasticsearch MeetupMontreal Elasticsearch Meetup
Montreal Elasticsearch Meetup
 
Docker Summit MongoDB - Data Democratization
Docker Summit MongoDB - Data Democratization Docker Summit MongoDB - Data Democratization
Docker Summit MongoDB - Data Democratization
 
DevSecCon London 2018: Open DevSecOps
DevSecCon London 2018: Open DevSecOpsDevSecCon London 2018: Open DevSecOps
DevSecCon London 2018: Open DevSecOps
 
Xain.io exhibiting at Berlin Tech Job Fair Spring 2020
Xain.io exhibiting at Berlin Tech Job Fair Spring 2020Xain.io exhibiting at Berlin Tech Job Fair Spring 2020
Xain.io exhibiting at Berlin Tech Job Fair Spring 2020
 
Tapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and Flink
 
Telling the LivePerson Technology Story at Couchbase [SF] 2013
Telling the LivePerson Technology Story at Couchbase [SF] 2013Telling the LivePerson Technology Story at Couchbase [SF] 2013
Telling the LivePerson Technology Story at Couchbase [SF] 2013
 
Human-in-the-loop: a design pattern for managing teams which leverage ML by P...
Human-in-the-loop: a design pattern for managing teams which leverage ML by P...Human-in-the-loop: a design pattern for managing teams which leverage ML by P...
Human-in-the-loop: a design pattern for managing teams which leverage ML by P...
 
ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!
ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!
ICONUK 2016: REST Assured, Freeing Your Domino Data Has Never Been That Easy!
 

Recently uploaded

Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
Ayan Halder
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
aymanquadri279
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 

Recently uploaded (20)

Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 

Real-Time Streaming with Python ML Inference

  • 1. Real-Time Streaming with Python ML Inference Marko Topolnik 1
  • 2. About Us Hazelcast started as an In-Memory Data Grid project (distributed caching with data-local computation) Java codebase, polyglot clients we focus on Simplicity in 2017 we added Hazelcast Jet: In-Memory Distributed Stream Processing My role: core engineer at Hazelcast Jet 2
  • 4. Sample Input and Output Input: { "age": 25, "workclass": "Self-emp", "fnlwgt": 176756, "education": "HS-grad", "education-num": 9, "marital-status": "Never-married", "occupation": "Farming-fishing", "relationship": "Own-child", "capital-gain": 0, "capital-loss": 0, "hours-per-week": 35, "native-country": "United-States" } Output: { "probability": 0.85 "income": "<=50K" } 4
  • 6. Data Scientist's Perspective Do Data Science Profit! 6
  • 7. Data Engineer's Perspective Do Data Science Monitor Profit (fingers crossed) 7 Package Load-Balance Scale Up Scale Out Deploy Re-Train
  • 8. Productionizing the REST service 8 Request #1 Request #2 Request #3 R E S T Batching?
  • 9. Productionizing the REST service 9 Request #1 Request #2 Request #3 R E S T Request Queue Batch Response Queue Parallelism?
  • 10. Productionizing the REST service 10 Request #1 Request #2 Request #3 R E S T R E S T Load Balancer Request Queue Response Queue Batch Batch
  • 11. Replace REST with Distributed Streaming 11 Request #1 Request #2 Request #3 Kafka Hazelcast Jet Cluster $ jet submit ML
  • 12. Hazelcast Jet Code Pipeline p = Pipeline.create(); p.readFrom(Kafka.source()) .apply(mapUsingPython(new PythonServiceConfig() .setBaseDir("/Users/mtopol/dev/python/sklearn") .setHandlerModule("example_1_inference_jet"))) .writeTo(Kafka.sink()); jet.newJob(p); $ mvn package $ jet submit target/my-job.jar 12
  • 13. Jet Node 1 Jet Job's Execution Plan Source mapUsing Python mapUsing Python Sink 13 Kafka Topic A Python process Kafka Topic B Batch Python process Batch
  • 14. Cluster Elasticity and Resilience ● Jet jobs are fault-tolerant ● nodes can join and leave the cluster, jobs go on ● automatically rescale to available hardware 14
  • 15. Let's Start a Jet Cluster! 15
  • 16. Cluster Self-Formation 16 Hazelcast Jet natively supports: ● Amazon AWS ● Google GCP ● Kubernetes With simple configuration, the nodes self-discover in these environments
  • 17. Source and Sink Connectors ● Kafka ● Hadoop HDFS ● S3 bucket ● JDBC ● JMS queue and topic ● custom connector ● test connectors: convenient to test the code before submitting 17
  • 18. Stream Operators ● windowed aggregation using Event Time ○ sliding, session window ○ count, sum, average, linear regression, ... ○ custom aggregate function ● rolling aggregation ● streaming join (co-grouping) ● hash join (enrichment) ● contact arbitrary external services ○ mapUsingPython uses this 18

Editor's Notes

  1. Ask "Are you using Python in your data science?" What languages are you using?
  2. Step one: Do Data Science Step two: Profit!
  3. Then along comes a Data Engineer and asks "Wait, aren't we forgetting something?"