
Spark Summit Europe 2017 - Applying multiple ML pipelines to heterogenous data streams


Spark Summit Europe 2017
Applying Multiple ML Pipelines to heterogenous data streams
This talk explains how we adapted Spark MLlib to deploy hundreds of ML pipelines in a single streaming job that makes real-time predictions on heterogeneous data streams.
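
The idea of serving many pipelines from one job can be pictured with a small plain-Python sketch. It is not Spark code and not Altocloud's implementation: the `REGISTRY` table, the `type` field used as a routing key, and both toy scoring functions are assumptions made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Event = Dict[str, Any]
Pipeline = Callable[[Event], float]  # hypothetical: event in, score out


def web_pipeline(event: Event) -> float:
    # Stand-in for a trained model: score grows with pages viewed.
    return min(1.0, event["pages_viewed"] / 10.0)


def call_pipeline(event: Event) -> float:
    # Stand-in for a different model over a different schema.
    return 0.9 if event["call_seconds"] > 120 else 0.2


# One registry holds every pipeline; the key says which events it handles.
REGISTRY: Dict[str, Pipeline] = {
    "web": web_pipeline,
    "call": call_pipeline,
}


def score_stream(events: Iterable[Event]) -> Iterator[Event]:
    """Single job: look up the pipeline for each event and apply it."""
    for event in events:
        pipeline = REGISTRY.get(event["type"])
        if pipeline is None:
            continue  # no model registered for this kind of event
        yield {**event, "prediction": pipeline(event)}


events = [
    {"type": "web", "pages_viewed": 4},
    {"type": "call", "call_seconds": 300},
    {"type": "ticket", "priority": "high"},
]
scored = list(score_stream(events))
```

Events of different shapes pass through the same loop, and only the lookup decides which model sees which event. Adding a model in this sketch means adding a registry entry rather than starting another job.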

Published in: Engineering


  1. Applying multiple ML Pipelines to heterogenous data streams
     Gevorg Soghomonian, AI Research Engineer
     Maciej Dabrowski, Chief Data Scientist
  2. About Altocloud
  3. Customer journey analytics
     • Understand customers better
     • Improve customer experience
     • Predict customer behaviour and engage
     • Accelerate revenue conversion
     • Reduce shopping cart abandonment
  4. Altocloud platform (architecture diagram). Diagram labels: event streams; event processors; batch model learning; enrichment; predictions; storage; queues; web events; call, IVR, message events; ticket, lead, generic events; actions; segmentation; web hook; aggregation; action creation; outcome probabilities; real-time customer journey; context.
  5. Heterogenous data
  6. Background
  7. Spark Machine Learning framework
  8. Spark Pipeline
  9. Spark Transformers
     • High-level interface for all operations on Datasets
     • Act as stages in Pipelines
     • Require various configuration: inputColumn, outputColumn, data, …
     • Dataset is the only input/output format
  10. The challenge
  11. Spark Pipeline deployment flow
  12. Altocloud deployment flow
  13. Altocloud deployment flow
      How to operationalise hundreds of models in one Spark Streaming job?
  14. Possible (deployment) solutions
      Serialisation
      ✓ Ability to use any streaming technology
      ✗ Increased complexity
      ✗ Potentially increased deployment latency
      Spark ML Pipelines
      ✓ Simplicity: one technology for training and evaluation
      ✗ Support only for the “one model scenario”
  15. Heterogeneous datasets in Spark
      • Spark Pipelines can be applied only to Datasets
      • Spark Pipelines cannot be combined into a composite pipeline that is applied on a Pipeline-per-row basis
      Questions:
      • How to apply different Pipeline(s) to each row in a Dataset?
      • How to identify which Pipelines should be applied to a row?
  16. Applying hundreds of models to heterogenous streams in Spark Streaming
  17. The foundation
      • Redefine Transformer API to allow compositional flexibility.
      • Adapt Spark Pipelines to redefined Transformers
  18. Transformer API definition
  19. One-to-One correspondence / adaptation
  20. Adapt DataFrame Pipeline to Stream Pipeline
  21. Adapt & combine
  22. Composition example
  23. Composition example
  24. Apache Spark ML Pipeline flow revisited
  25. Challenges
      • Re-implementation of transformer logic
      • Train-serving skew
      • Scaling beyond 1000s of models
  26. Summary
  27. Summary
      • Applying Pipelines to heterogeneous data streams is hard.
      • Spark Pipelines cannot be combined and applied in a composite pipeline on a Pipeline-per-row basis.
      Our solution:
      • Extend the Transformer API to apply different Pipeline(s) for each row
      • Extend Streams to include Pipeline composition
      macdab@altocloud.com & gevorg@altocloud.com
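
A minimal plain-Python sketch of the two ideas in the summary: a per-row transformer interface, and pipeline composition that works on both batches and streams. It is an illustration under assumed names (`RowTransformer`, `compose`, `on_batch`, `on_stream` and the two toy stages are hypothetical), not the API shown in the talk and not Spark's `Transformer`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
RowTransformer = Callable[[Row], Row]  # hypothetical per-row interface


def compose(*stages: RowTransformer) -> RowTransformer:
    """Chain row transformers left to right into a single transformer."""
    def pipeline(row: Row) -> Row:
        for stage in stages:
            row = stage(row)
        return row
    return pipeline


def on_batch(transformer: RowTransformer) -> Callable[[List[Row]], List[Row]]:
    """Adapter for training-time use on an in-memory batch of rows."""
    return lambda rows: [transformer(r) for r in rows]


def on_stream(
    transformer: RowTransformer,
) -> Callable[[Iterable[Row]], Iterator[Row]]:
    """Adapter for serving-time use on a lazily consumed stream of rows."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        for r in rows:
            yield transformer(r)
    return run


# Two toy stages; each returns a new row rather than mutating its input.
def add_duration(row: Row) -> Row:
    return {**row, "duration": row["end"] - row["start"]}


def flag_long(row: Row) -> Row:
    return {**row, "is_long": row["duration"] > 60}


session_pipeline = compose(add_duration, flag_long)

rows = [{"start": 0, "end": 30}, {"start": 10, "end": 100}]
batch_out = on_batch(session_pipeline)(rows)
stream_out = list(on_stream(session_pipeline)(iter(rows)))
```

Because both adapters wrap the same composed function, the batch and stream paths in this sketch produce identical rows. Sharing one definition in this way is one possible approach to limiting train-serving skew.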
