Blog Post:
https://www.kai-waehner.de/apache-kafka-event-streaming-pharmaceuticals-pharma-life-sciences-use-cases-architecture
Video Recording:
https://youtu.be/t2IH0brwGTg
AI/Machine Learning and the Apache Kafka ecosystem are a great combination for training, deploying and monitoring analytic models at scale in real time. They show up in more and more projects, but still feel like buzzwords and hype for science projects.
See how to connect the dots!
--How are Kafka and Machine Learning related?
--How can they be combined to productionize analytic models in mission-critical and scalable real-time applications?
--We will discuss a step-by-step approach to build a scalable and reliable real-time infrastructure for drug discovery doing data integration, feature engineering, image processing, model scoring and processing orchestration.
Use Cases:
R&D Engineering
Sales & Marketing
Manufacturing & Quality Assurance
Supply Chain
Product Monitoring & After Sales Support
VoC (Voice of Customer)
Single View Customer
Yield/Quality Optimization
Improved Drug Yield
Proactive Service Scheduling
Testing & Simulation
Drug Diversion
Process/Quality Monitoring
Inventory & Supply Chain Optimization
Proactive Service Offers
Patent Research and Analytics
Personalized Offers / Ads
EDW Offload
Supply Chain Network Design/Risk Management
Product Predictive Maintenance
Clinical Trials
Customer Segmentation
Smart Products
Serialization & e-Pedigree
Product Usage Tracking
GTM
Global Facilities
Inventory and Logistics Visibility
Warranty & Recall Management
Machine Learning with Apache Kafka in Pharma and Life Sciences
1. 1
Kai Waehner | Technology Evangelist, Confluent
contact@kai-waehner.de | LinkedIn | @KaiWaehner | www.confluent.io | www.kai-waehner.de
Streaming Machine Learning with Apache Kafka and Confluent in Pharma and Life Sciences
2. 3
Use Cases in Pharma and Life Sciences for Event Streaming
R&D / Engineering
Sales & Marketing
Manufacturing & Quality Assurance
Supply Chain
Product Monitoring & After Sales Support
VoC (Voice of Customer)
Single View Customer
Yield/Quality Optimization
Improved Drug Yield
Proactive Service Scheduling
Testing & Simulation
Drug Diversion
Process/Quality Monitoring
Inventory & Supply Chain Optimization
Proactive Service Offers
Patent Research and Analytics
Personalized Offers / Ads
EDW Offload
Supply Chain Network Design/Risk Management
Product Predictive Maintenance
Clinical Trials
Customer Segmentation
Smart Products
Serialization & e-Pedigree
Product Usage Tracking
GTM
Global Facilities
Inventory and Logistics Visibility
Warranty & Recall Management
www.kai-waehner.de | @KaiWaehner
3. 4
Event Streaming in Pharma and Life Sciences
Use Cases Supporting Business Value
Business Value:
• Increase Revenue (Make Money)
• Decrease Costs (Save Money)
• Mitigate Risk (Protect Money)
Strategic Driver:
• Develop & Market New Drugs
• Connected Health / Remote Monitoring
• Global Shortage in Health Care Workers
• Rise in New & Chronic Health Issues
• Cybersecurity Threats
• T2M: Generic Competition
Business Use Case (layers 1 and 2):
• IoT Sensor Ingestion
• Digital Replatforming / Mainframe Offload
• Customer 360
• Faster Transactional Processing / Analysis Incl. Machine Learning / AI
• Microservices Architecture
• Online Fraud Detection
• Online Security (Syslog, Log Aggregation, Splunk Replacement)
• Middleware Replacement
• Website / Core Operations (Central Nervous System)
• Real-time App Updates
• New Cloud App / Services + T2M
Data Eng. / Infrastructure Use Case:
• Web Click Streams
• Data Pipelines
• Messaging
• Microservice / Event Sourcing
• Stream Processing
• Data Ingestion
• Streaming ETL
• Log Aggregation
6. 7
Bayer AG
On Premise and Cloud + Hybrid Real Time Replication at Scale
https://www.confluent.io/kafka-summit-sf18/bringing-streaming-data-to-the-masses
Bayer Crop Science (formerly Monsanto) adopted a cloud-first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases.
7. 8
Celmatix
Real Time Aggregation of Heterogeneous Data + Governance / Security
https://www.confluent.io/customers/celmatix/
Through the development of digital tools and genetic insights focused on fertility, Celmatix is disrupting how women approach their lifelong reproductive health journey by empowering them and their physicians with more personalized information.
8. 9
Machine Learning and Event Streaming to Improve Traditional and Build New Use Cases in Pharma and Life Sciences
[Diagram] Streams Processing / AI / ML across: Clinical Trials; Patents, Text etc; Structured & unstructured Data; IoT & Business Applications; Multi-Hybrid-Cloud
9. 10
Use Case: Drug Discovery
“On average, it takes at least ten years for a new medicine to complete the journey from initial discovery to the marketplace”
PhRMA
http://phrma-docs.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf
10. 12
Recursion Pharmaceuticals
Discovering Drugs in Real Time + Machine Learning
https://www.confluent.io/customers/recursion
https://www.confluent.io/kafka-summit-san-francisco-2019/discovering-drugs-with-kafka-streams
Massively parallel system that combines experimental biology, artificial intelligence, automation and real-time event streaming to accelerate drug discovery.
11. 13
Image and Video Processing
… (at a high level) is “just” pixels (arrays of 0s and 1s) and matrix multiplication
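As a minimal sketch of that idea (not taken from the talk), the snippet below treats a tiny black-and-white image as a nested list of 0s and 1s and slides a 3x3 kernel over it, multiplying and summing pixel windows. This is the cross-correlation step that CNN layers usually call convolution. The image, the kernel and the function name `apply_kernel` are illustrative assumptions.

```python
def apply_kernel(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 black-and-white image: 1 = bright pixel, 0 = dark pixel.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Vertical-edge kernel: responds where brightness increases from left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
edges = apply_kernel(image, kernel)
print(edges)
```

Real pipelines would use a numeric library and learned kernels; the nested loops are only meant to make the arithmetic visible.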
12. 14
Drug Discovery in manual and slow, bursty batch mode, not scalable
13. 15
Drug Discovery in automated, scalable, reliable real-time mode
14. 16
Streaming Analytics for Drug Discovery in Real Time at Scale
[Architecture diagram]
Components: Laboratory | Real Time Integration Layer (inbound) | Event Streaming Platform | Digital Image Processing (e.g. noise reduction) | Data Processing (e.g. filtering) | Stateful Workflow Orchestration | Automated Drug Analysis | Real Time Integration Layer (outbound) | Batch Reporting Platform | BI Dashboard
Data flows: Ingest Images | Processed Images | All Data | Human Intelligence
Legend: Streaming Platform | Other Components
15. 18
Digital Image Processing
for Drug Discovery
Find drug treatments:
• ML models can be trained to distinguish between healthy cells and diseased cells with problematic genes
• Grow healthy cells and diseased cells in labs
• Apply different drugs → Make diseased cells look healthy again
16. 20
The First Analytic Models
How to deploy the models in production?
…real-time processing?
…at scale?
…24/7 with zero downtime?
17. 21
Hidden Technical Debt in Machine Learning Systems
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
20. 26
A Streaming Platform is the Underpinning of an Event-driven Architecture
[Diagram]
Producers: Microservices | DBs | SaaS apps | Mobile
Streams of real time events: Database change | Microservices events | SaaS data | Customer experiences
Consumers: Customer 360 | Real-time fraud detection | Data warehouse
Connectors | Stream processing apps (both shown on the producer side and on the consumer side)
21. 27
Apache Kafka at Scale at Tech Giants
> 7 trillion messages / day
> 6 Petabytes / day
“You name it”
* Kafka is not just used by tech giants
** Kafka is not just used for big data
23. 29
Apache Kafka’s Open Ecosystem as Infrastructure for ML
[Diagram] Kafka Streams / ksqlDB | Kafka Connect | REST Proxy | Schema Registry | Go / .NET Kafka Producer | ksqlDB | Python Consumer
24. 30
Kafka, ksqlDB and TensorFlow for Drug Discovery in Real Time at Scale
[Architecture diagram]
Components:
• Laboratory (Windows Machines) with Kafka Client (.NET, C++)
• Confluent Server
• Kafka Connect
• Digital Image Processing (External SaaS Service + REST)
• Streaming ETL (ksqlDB)
• Stateful Workflow Orchestration (Kafka Streams)
• Model Training and Scoring (Confluent Python Client + TensorFlow)
• Database (MySQL) with Historical Drugs Data, connected via Kafka Connect (Debezium CDC)
• Batch Reporting Platform
• BI Dashboard
Data flows: Images | Processed Images | All Data | Human Intelligence
Legend: Confluent Platform | Other Components
25. 31
How do you implement a Kafka ML infrastructure?
28. 34
SELECT image_id, i.experiment_id, image_details
FROM image_channel i
LEFT JOIN experiment_database e ON i.experiment_id = e.experiment_id
WHERE e.image_type = 'black_and_white'
EMIT CHANGES;
Data Processing with ksqlDB
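To make the semantics of the query above concrete, here is a rough plain-Python sketch of a stream-table join followed by a filter. It is not ksqlDB code: the sample records, the dictionary standing in for `experiment_database` and the function name `join_and_filter` are assumptions made for illustration.

```python
def join_and_filter(image_events, experiments, image_type):
    """Look up each image event's experiment row, then keep only matching image types."""
    for event in image_events:
        experiment = experiments.get(event["experiment_id"])  # None if no match
        # Events without a matching experiment cannot satisfy the filter.
        if experiment is not None and experiment["image_type"] == image_type:
            yield {
                "image_id": event["image_id"],
                "experiment_id": event["experiment_id"],
                "image_details": event["image_details"],
            }

# Hypothetical table contents, keyed by experiment_id.
experiments = {
    "exp-1": {"image_type": "black_and_white"},
    "exp-2": {"image_type": "color"},
}
# Hypothetical stream of image events.
image_events = [
    {"image_id": 1, "experiment_id": "exp-1", "image_details": "plate A1"},
    {"image_id": 2, "experiment_id": "exp-2", "image_details": "plate B4"},
    {"image_id": 3, "experiment_id": "exp-9", "image_details": "plate C2"},
]
selected = list(join_and_filter(image_events, experiments, "black_and_white"))
print(selected)
```

Because the filter tests a column of the right-hand table, events without a matching experiment are dropped even though the join itself is a left join.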
29. 37
TensorFlow Model —
Convolutional Neural Network (CNN)
for Image Recognition (as part of the ML Pipeline)
30. 38
Direct streaming ingestion for model training and / or scoring with TensorFlow I/O + Kafka Plugin (no additional data storage like S3 or HDFS required!)
[Diagram] Producer → Distributed Commit Log (along a Time axis), consumed independently by Model A and Model B
Streaming Ingestion and Model Training with TensorFlow IO
https://github.com/tensorflow/io
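The idea of several models training directly from the same log can be sketched without any Kafka or TensorFlow dependency. In the assumed toy setup below, a Python list stands in for the commit log, and two hypothetical models (`model_a`, `model_b`) keep their own read position and learning rate while consuming the same events in mini-batches. The class name, the data and the hyperparameters are illustrative, not the API of TensorFlow I/O.

```python
class OnlineLinearModel:
    """One-feature linear model y = w*x + b trained by stochastic gradient descent."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        self.w = 0.0
        self.b = 0.0
        self.next_offset = 0  # read position in the log, tracked per model

    def predict(self, x):
        return self.w * x + self.b

    def train_on(self, log, batch_size):
        """Consume up to batch_size unseen events and update the weights."""
        batch = log[self.next_offset:self.next_offset + batch_size]
        for x, y in batch:
            error = self.predict(x) - y
            self.w -= self.learning_rate * error * x
            self.b -= self.learning_rate * error
        self.next_offset += len(batch)
        return len(batch)

# Hypothetical event log: (feature, label) pairs that follow y = 2x + 1.
commit_log = [(x / 10.0, 2 * (x / 10.0) + 1) for x in range(10)] * 200

model_a = OnlineLinearModel(learning_rate=0.1)
model_b = OnlineLinearModel(learning_rate=0.5)
for model in (model_a, model_b):
    while model.train_on(commit_log, batch_size=50):
        pass  # keep polling until this model has caught up with the log

print(round(model_a.predict(0.5), 2), round(model_b.predict(0.5), 2))
```

Each model reads the log at its own pace, which is what makes it possible to train a second model later without copying the data into another store.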
32. 40
Today, Kafka works well for recent events, short horizon storage, and manual data balancing
Kafka’s present-day design offers extraordinarily low messaging latency by storing topic data on fast disks that are collocated with brokers. This is usually good. But sometimes, you need to store a huge amount of data for a long time.
[Diagram] App | Kafka: Processing (Transactions, auth, quota enforcement, compaction, ...) and Storage
33. 41
Tiered Storage for Kafka
[Diagram] Apps | Kafka: Processing (Transactions, auth, quota enforcement, compaction, ...) and Storage (Local) | Object Store (Remote)
Store Forever
Older data is offloaded to inexpensive object storage, permitting it to be consumed at any time.
Save $$$
Storage limitations, like capacity and duration, are effectively uncapped.
Instantaneously scale up and down
Your Kafka clusters will be able to automatically self-balance load and hence elastically scale
(Only available in Confluent Platform)
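The local/remote split can be illustrated with a toy model. The sketch below is not how Kafka or Confluent Platform implements tiering: it only shows, under simplified assumptions, that old records can move to a cheaper store while readers keep using one read path. The class `TieredLog` and its `local_capacity` parameter are hypothetical.

```python
class TieredLog:
    """Toy append-only log: recent records stay local, older ones move to remote storage."""

    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = {}    # offset -> record (stand-in for broker disks)
        self.remote = {}   # offset -> record (stand-in for an object store)
        self.next_offset = 0

    def append(self, record):
        offset = self.next_offset
        self.local[offset] = record
        self.next_offset += 1
        # Offload the oldest local records once the hot tier is full.
        while len(self.local) > self.local_capacity:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)
        return offset

    def read(self, offset):
        """Readers use one API; the tier that holds the record is an internal detail."""
        if offset in self.local:
            return self.local[offset]
        return self.remote[offset]

log = TieredLog(local_capacity=3)
for i in range(10):
    log.append(f"event-{i}")

print(sorted(log.local), sorted(log.remote))
print(log.read(0), log.read(9))
```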
34. 43
Reprocessing of Events
● New Consumer
○ e.g. a completely new microservice or a replacement of an existing application
● Error-Handling
○ Re-processing of data in case of error: Fix error and process events again
● Compliance / Regulatory Processing
○ Reprocessing of already processed data for legal reasons
○ Could be very old data (e.g. pharma: 10 years old)
● Query and Analysis of Existing Events
○ No need for another data store / data lake
○ Kafka Client Consumer for offset- or timestamp-based consumption of old events
○ ksqlDB (for simple pull queries)
○ Kafka-native analytics tool (e.g. Rockset with Kafka connector and ANSI SQL support for Tableau et al)
● Model Training
○ Consume events for model training with a) one ML framework and different hyperparameters or b) different ML frameworks
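Offset- and timestamp-based consumption of old events can be sketched in plain Python. Kafka consumer clients offer a lookup from timestamp to offset (for example `offsetsForTimes` in the Java consumer). The code below only imitates that behaviour on an in-memory list, and the helper names `offset_for_time` and `replay` are made up for this illustration.

```python
import bisect

def offset_for_time(log, timestamp):
    """Return the first offset whose record timestamp is >= `timestamp`."""
    timestamps = [ts for ts, _ in log]
    return bisect.bisect_left(timestamps, timestamp)

def replay(log, start_offset):
    """Yield (offset, value) pairs from start_offset to the end of the log."""
    for offset in range(start_offset, len(log)):
        yield offset, log[offset][1]

# Hypothetical retained log of (timestamp, event) pairs, ordered by timestamp.
log = [(100, "a"), (200, "b"), (300, "c"), (400, "d"), (500, "e")]

from_time = list(replay(log, offset_for_time(log, 250)))  # e.g. re-run after a bug fix
from_start = list(replay(log, 0))                         # e.g. a completely new consumer
print(from_time)
```

The same log serves both cases: a new consumer starts at offset 0, while error handling or a compliance re-run seeks to the point in time it cares about.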
35. 44
Streaming Machine Learning with Apache Kafka and Tiered Storage
https://www.confluent.io/blog/streaming-machine-learning-with-tiered-storage/
36. 45
How to deploy the analytic models?
37. 46
Separation of Model Training and Model Inference
[Diagram] Model Training in Cloud → Analytic Model → Model Deployment at the Edge → Local Predictions
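A small sketch of that separation, under the assumption that the analytic model can be serialized and shipped as a file or message: the training function runs in one place, and the scoring side needs only the exported model. The nearest-centroid "model", the JSON format and all function names are hypothetical stand-ins for a real framework's export format.

```python
import json

def train_centroids(samples):
    """'Cloud' side: compute the mean feature value per label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def export_model(centroids):
    """Serialize the trained model so it can be shipped to another runtime."""
    return json.dumps({"type": "nearest_centroid", "centroids": centroids})

def load_and_predict(model_json, value):
    """'Edge' side: needs only the serialized model, not the training code or data."""
    centroids = json.loads(model_json)["centroids"]
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Hypothetical training data: (cell brightness score, label).
training = [(0.9, "healthy"), (0.8, "healthy"), (0.2, "diseased"), (0.3, "diseased")]
model_json = export_model(train_centroids(training))

print(load_and_predict(model_json, 0.75), load_and_predict(model_json, 0.1))
```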
41. 52
Model Deployment with Apache Kafka, ksqlDB and TensorFlow
User Defined Function (UDF):
CREATE STREAM ImageAnalysis AS
SELECT image_id, analyzeImage(image_details)
FROM image_channel;
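The data flow behind such a statement can be sketched in a few lines: a scoring function is applied to every event of the input stream, and the results form a new stream. Actual ksqlDB UDFs are written in Java. The Python below is only an analogy, and `analyze_image` with its "fraction of bright pixels" score is a made-up stand-in for a call into a trained model.

```python
def analyze_image(image_details):
    """Stand-in for the model call inside the UDF: fraction of bright pixels."""
    pixels = image_details["pixels"]
    return sum(pixels) / len(pixels)

def image_analysis(image_channel):
    """Derived stream: one scored output record per input record."""
    for event in image_channel:
        yield {
            "image_id": event["image_id"],
            "score": analyze_image(event["image_details"]),
        }

# Hypothetical input events.
image_channel = [
    {"image_id": 1, "image_details": {"pixels": [1, 1, 0, 0]}},
    {"image_id": 2, "image_details": {"pixels": [1, 1, 1, 1]}},
]
results = list(image_analysis(image_channel))
print(results)
```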
42. 54
Model Training and Scoring
with the same ML Pipeline (or even in the same Application)
• Data Science team responsible for the whole model lifecycle
• Beloved Python tool stack (Pandas, scikit-learn, TensorFlow, Jupyter, …)
• 24/7 production scale with Confluent Python Client (e.g. deployed in Docker containers on Kubernetes)
43. 55
Kafka, ksqlDB and TensorFlow for Drug Discovery in Real Time at Scale
[Architecture diagram repeated from slide 24]
46. 58
Image / Video Processing with Kafka
Kafka-native Image / Video Processing
vs.
Chunk + re-assemble
vs.
Metadata-only + Object Store
→ All approaches are fine! :-)
https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297
https://www.confluent.io/blog/bust-the-burglars-machine-learning-with-tensorflow-and-apache-kafka/
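Of the three options, "chunk + re-assemble" is the easiest to sketch without any infrastructure. The code below is an assumed, simplified scheme rather than the approach from the linked material: each chunk carries a message id, its index and the total chunk count, and the receiving side buffers chunks until a message is complete. All names and the chunk size are illustrative.

```python
def split_into_chunks(message_id, payload, chunk_size):
    """Split a large payload into ordered chunk records that fit a size limit."""
    pieces = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    return [
        {"message_id": message_id, "index": i, "total": len(pieces), "data": piece}
        for i, piece in enumerate(pieces)
    ]

class Reassembler:
    """Buffers chunks per message and returns the payload once all have arrived."""

    def __init__(self):
        self.pending = {}

    def add(self, chunk):
        parts = self.pending.setdefault(chunk["message_id"], {})
        parts[chunk["index"]] = chunk["data"]
        if len(parts) < chunk["total"]:
            return None  # still waiting for more chunks
        del self.pending[chunk["message_id"]]
        return b"".join(parts[i] for i in range(chunk["total"]))

image_bytes = bytes(range(256)) * 10   # hypothetical 2,560-byte image
chunks = split_into_chunks("img-1", image_bytes, chunk_size=1000)

reassembler = Reassembler()
result = None
for chunk in reversed(chunks):         # deliver out of order on purpose
    result = reassembler.add(chunk) or result
print(len(chunks), result == image_bytes)
```

The metadata-only option would instead put the image into an object store and send just its location through Kafka.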
47. 59
Workflow Orchestration
Kafka-native vs. External Tool
→ Both approaches are fine! :-)
https://github.com/nsaje/dagger
https://eventil.com/events/using-apache-kafka-in-a-closed-environment-with-centralized-orchestration
https://zeebe.io/blog/2019/08/official-kafka-connector-for-zeebe/
48. 60
Legacy Integration
[Diagram]
Components: Mainframe | Traditional Middleware | Application | Kafka | External Solution | Microservices (agile, lightweight, but scalable and robust) | Kafka microservice | Big Data project (Elastic, Spark, AWS Services, …)
Mainframe Transaction Data (sample):
Date        Amount
1/27/2017   $4.56
1/22/2017   $32.14
Migration steps:
1) Direct Legacy MQ Communication with App
2) Kafka for decoupling between MQ and App
3) Direct communication via Kafka (no MQ anymore)
4) New projects and applications (independent or related to the existing migration projects)
50. 63
Generate added value from data
The pharmaceutical industry today has an unprecedented wealth of opportunities to generate added value from data.
These possibilities cover all relevant areas such as:
• R&D / Engineering
• Sales & Marketing
• Manufacturing & QA
• Supply Chain
• Product Monitoring / After Sales Support
Novel data use:
• Better therapies
• Faster and more accurate diagnoses
• Faster drug development
• Improvement of clinical studies
• Real-World Data Generation
• Real-World Evidence
• Precision Medicine
• Support Remote Health etc
Challenges:
• Data silos
• Data integration
• Data growth/explosion
• Cloud & on-prem
• Use of new technologies like AI/ML
• Time2Market
• Regulatory Affairs
• Security
• Performance
• API
[Diagram] Streams Processing / AI / ML across: Clinical Trials; Patents, Text etc; Structured & unstructured Data; IoT & Business Applications; Multi-Hybrid-Cloud