Event-Driven Machine Learning at Scale
Timothy Spann
Developer Advocate, StreamNative
@PaaSDev
/in/timothyspann
ABSTRACT SUMMARY
Event-Driven Machine Learning at Scale
Using the FLiP stack, we can drive NLP and other machine learning classifications in the cloud, on-premises, in hybrid environments and
at the edge with the open source Apache Pulsar project. We often use the full FLiPN stack, which includes Apache Flink, Apache NiFi and
Apache Pulsar.
Apache Pulsar is a unified messaging platform that can host and execute classifications as events arrive asynchronously on topics.
We deploy our logic in lightweight functions that can run as processes, threads or in Kubernetes (K8s). We often use the serverless
Function Mesh to run these. This allows us to run machine learning code built in Go, Python and Java. We will show how to use
VADER Sentiment, PyTorch, TensorFlow, DJL.AI, MXNet and other machine learning options with ease.
AGENDA:
▪ Welcome
▪ Introduction to Apache Pulsar
▪ Basics of Pulsar
▪ Pulsar Functions
▪ Let’s Build an ML App!
▪ Demo
▪ Resources
▪ Q&A
streamnative.io
Passionate and dedicated team.
Founded by the original developers of
Apache Pulsar.
StreamNative helps teams to capture,
manage, and leverage data using Pulsar’s
unified messaging and streaming
platform.
WHY APACHE PULSAR?
● Unified messaging platform
● Guaranteed message delivery
● Resiliency
● Infinite scalability
Pulsar Cluster
● “Bookies” (store messages)
  ● Store messages and cursors
  ● Messages are grouped in segments/ledgers
  ● A group of bookies forms an “ensemble” to store a ledger
● “Brokers”
  ● Handle message routing and connections
  ● Stateless, but with caches
  ● Automatic load-balancing
  ● Topics are composed of multiple segments
● Metadata store (ZK, RocksDB, etcd, …)
  ● Stores metadata for both Pulsar and BookKeeper
  ● Service discovery
[Diagram: a Pulsar instance contains Pulsar clusters; topics are organized under tenants and namespaces.]
Tenants: Compliance, Data Services, Marketing
Namespaces: Microservices, ETL, Campaigns, Risk Assessment
Topics: Cust Auth, Location Resolution, Demographics, Budgeted Spend, Acct History, Risk Detection
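The tenant / namespace / topic hierarchy shows up directly in Pulsar topic names, which take the form persistent://tenant/namespace/topic. A minimal sketch of that naming scheme follows; the helper topic_name is hypothetical, and the pairing of tenant, namespace and topic in the example is illustrative, since the flattened diagram does not show which topic sits under which namespace.

```python
def topic_name(tenant: str, namespace: str, topic: str, persistent: bool = True) -> str:
    """Build a fully qualified Pulsar topic name (hypothetical helper)."""
    for part in (tenant, namespace, topic):
        # Each level is a single path component, so it cannot be empty or contain "/".
        if not part or "/" in part:
            raise ValueError(f"invalid topic name component: {part!r}")
    domain = "persistent" if persistent else "non-persistent"
    return f"{domain}://{tenant}/{namespace}/{topic}"


# Illustrative pairing only; the slide does not confirm this mapping.
name = topic_name("marketing", "campaigns", "budgeted-spend")
print(name)
```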
Producer / Consumer
Producer: The publisher sends data and doesn't know about the subscribers or their status.
All interactions go through Pulsar, which handles all communication.
Consumer: The subscriber receives data from the publisher and never directly interacts with it.
[Diagram: producers and consumers connected only through topics]
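The decoupling described above can be sketched with a small in-memory stand-in. ToyBroker and its methods are hypothetical and are not the Pulsar client API; the point is only that the producer names a topic, never a consumer, and each subscription gets its own copy of the message.

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for a broker. NOT the Pulsar client API."""

    def __init__(self):
        # topic -> subscription name -> queue of pending payloads
        self._subs = defaultdict(dict)

    def subscribe(self, topic: str, subscription: str) -> None:
        self._subs[topic].setdefault(subscription, deque())

    def publish(self, topic: str, payload: bytes) -> None:
        # The producer only knows the topic; the broker fans out to subscriptions.
        for queue in self._subs[topic].values():
            queue.append(payload)

    def receive(self, topic: str, subscription: str):
        queue = self._subs[topic][subscription]
        return queue.popleft() if queue else None


broker = ToyBroker()
broker.subscribe("chat", "sentiment-scorer")
broker.subscribe("chat", "archiver")
broker.publish("chat", b"hello pulsar")

first = broker.receive("chat", "sentiment-scorer")
second = broker.receive("chat", "archiver")
empty = broker.receive("chat", "archiver")
```

Adding a third subscription would not require any change on the producer side, which is the property the slide describes.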
Messages - the Basic Unit of Pulsar
Component: Description
Value / data payload: The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
Key: Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
Properties: An optional key/value map of user-defined properties.
Producer name: The name of the producer that produces the message. If you do not specify a producer name, the default name is used.
Sequence ID: Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence.
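One way to picture these components is as fields on a record. The sketch below is illustrative only: ToyMessage is a hypothetical type, not the message class from the Pulsar client, and the default producer name shown is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyMessage:
    """Illustrative model of a message's components; not the Pulsar client type."""
    value: bytes                               # raw payload bytes
    sequence_id: int                           # position in the ordered sequence
    producer_name: str = "default-producer"    # hypothetical default name
    key: Optional[str] = None                  # optional; used for partitioning, compaction
    properties: Dict[str, str] = field(default_factory=dict)  # user-defined metadata


msg = ToyMessage(
    value=b'{"text": "great talk"}',
    sequence_id=0,
    key="user-42",
    properties={"lang": "en"},
)
print(msg.key, msg.sequence_id, msg.properties)
```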
Connectivity
• Functions - Lightweight Stream
Processing (Java, Python, Go)
• Connectors - Sources & Sinks
(Cassandra, Kafka, …)
• Protocol Handlers - AoP (AMQP), KoP
(Kafka), MoP (MQTT)
• Processing Engines - Flink, Spark,
Presto/Trino via Pulsar SQL
• Data Offloaders - Tiered Storage - (S3)
hub.streamnative.io
● Buffer
● Batch
● Route
● Filter
● Aggregate
● Enrich
● Replicate
● Dedupe
● Decouple
● Distribute
Pulsar Functions
A serverless event streaming framework
● Lightweight computation similar to AWS Lambda.
● Specifically designed to use Apache Pulsar as a message bus.
● Function runtime can be located within the Pulsar broker.
Pulsar Functions
● Consume messages from one or more Pulsar topics.
● Apply user-supplied processing logic to each message.
● Publish the results of the computation to another topic.
● Support multiple programming languages (Java, Python, Go).
● Can leverage third-party libraries to support the execution of ML models on the edge.
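The consume, process, publish pattern above can be sketched as a plain Python function. Pulsar's language-native Python interface accepts a function named process that takes the message and returns the value to publish to the output topic. The scoring logic here is an assumption made for illustration: a tiny word list standing in for a real model such as VADER, so the sketch runs without third-party libraries.

```python
import json

# Tiny illustrative lexicon; a stand-in for a real sentiment model such as VADER.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def score(text: str) -> str:
    """Classify text by counting positive and negative words (toy logic)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def process(input):
    """Called once per incoming message; the return value is the function's output."""
    return json.dumps({"text": input, "sentiment": score(input)})


result = json.loads(process("I love this great demo!"))
print(result)
```

Swapping the toy score function for a call into a real library would leave process unchanged, which is the appeal of keeping the function body this small.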
Function Mesh
Pulsar Functions, along with Pulsar
IO/Connectors, provide a powerful API for
ingesting, transforming, and outputting
data.
Function Mesh, another StreamNative
project, makes it easier for developers to
create entire applications built from
sources, functions, and sinks all through a
declarative API.
ML
• Visual Question and Answer
• NLP (Natural Language Processing)
• Sentiment Analysis
• Text Classification
• Named Entity Recognition
• Content-based Recommendations
• Predictive Maintenance
• Fault Detection
• Fraud Detection
• Time-Series Predictions
• Naive Bayes
Using Pulsar For ML Models
● High performance
● High security
● Multiple data consumers
● Large data volumes, high scalability
● Multi-tenancy and geo-replication
Deploying AI With an
Event-Driven
Platform
https://dzone.com/trendreports/enterprise-ai-1
Streaming FLiP-ML Apps
[Architecture diagram. Labels: StreamNative Hub; StreamNative Cloud; Unified Batch and Stream COMPUTING (Batch + Stream); Unified Batch and Stream STORAGE (Queuing + Streaming); Tiered Storage (Offload); Protocols: Pulsar, KoP, MoP, WebSocket; Pulsar Sink; Streaming Edge Gateway; CDC; Apps]
Web UI
Chat with
Sentiment
Connect - Resources
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
Apache Pulsar Training
• Instructor-led courses
– Pulsar Fundamentals
– Pulsar Developers
– Pulsar Operations
• On-demand learning with labs
• 300+ engineers, admins and architects trained!
StreamNative Academy
Now Available
On-Demand
Pulsar Training
QUESTIONS?
THANK YOU!
@PaaSDev
/in/timothyspann

MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
