Robust and declarative machine learning pipelines for predictive buying
GIANMARIO SPACAGNA
ADVANCED DATA ANALYTICS, BARCLAYS
2016/04/07
Latest technology advancements made data
processing accessible to everyone, cheaply and
quickly.
We believe better value can be produced at the
intersection of engineering practices and the
correct application of the scientific method.
DATA SCIENCE MANIFESTO*
Principle 2: “All validation of data, hypotheses and performance should be
tracked, reviewed and automated.”
Principle 3: “Prior to building a model, construct an evaluation framework with
end-to-end business focused acceptance criteria.”
* The current manifesto is still in beta version, check the full list of principles at datasciencemanifesto.org
MLlib
u Machine Learning library for Apache Spark
u Scalable, production-oriented
(theoretically can handle any size of data given enough cluster resources)
u Built on top of RDD collections of vector objects
u Provides:
u Linear algebra
u Basic Statistics
u Classification and regression
u Collaborative filtering
u Clustering
u Dimensionality Reduction
u Feature extraction and transformation
u Pattern mining
u Evaluation
ML pipelines
u ML is the Mllib evolution on top of DataFrame
u DataFrame: SQL-like schema of representing data, similar to the original
concept in R and Python Pandas.
u Offers optimised execution plans without engineering efforts
u High-Level APIs for machine learning pipelines
u Transformer: DataFrame => DataFrame
u Features transformer
u Fitted model usable for inferences
u Estimator: def fit(data: DataFrame): Transformer
u The algorithm that trains your model
Great but…
u No type safety at compile time (only dynamic types)
u A features transformer and a model have the same API?
u Non fully functional
u functions should be registered as UDFs
u explicit casting between SQL Row objects and JVM classes
u complex logic harder to be tested and expressed in isolation
u Can’t explicitly distinguishing between metadata and feature fields,
though you can have as many columns you want (and also remove
them at run-time!)
u Harder to “safely” implement bespoke algorithms
u Underlying technology is very robust but the higher API, although is
declarative, could be error-prone when refactoring and
maintaining in a real production system
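A small sketch of the type-safety point, in plain Scala rather than Spark: `Row` is a hypothetical stand-in for an untyped SQL row and `Customer` is a made-up case class. A misspelt column name compiles and only fails when the row is read, whereas the same mistake on a case class field is rejected by the compiler.

```scala
import scala.util.Try

object TypeSafetySketch {
  // Hypothetical stand-in for an untyped SQL row.
  type Row = Map[String, Any]
  final case class Customer(id: Long, age: Int)

  val row: Row = Map("id" -> 1L, "age" -> 35)

  // The typo "aeg" compiles fine; the error only surfaces at run time.
  val untyped: Try[Int] = Try(row("aeg").asInstanceOf[Int])

  // The typed equivalent: writing `customer.aeg` would not compile at all.
  val customer: Customer = Customer(1L, 35)
  val typed: Int = customer.age
}
```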
Barclays use case
u Binary classifier for predicting users willing to buy a financial product
in the following 3 months
u Each user has different characteristics at different points in time
(multiple samples of the same user)
u A sample is uniquely defined by the customer id and date
u Features consists of a mix of sequence of events, categorical
attributes and numerical values
u Each sample is labeled as true or false whether the user bought or
not in the following horizon window
u It must be production-ready but we want to estimate performances
before deployment based on retrospective analysis
Our Requirements
u Nested data structures (mix of sequences and dictionaries) with
some meta information attached
u Abstraction between the implementations of a binary classifier in
vector form and its domain-specific features extraction/selection
u Fitted model should simply be a function Features => Double
u Robust cross-fold validation, no Data Leakage:
u No correlation between training and test set
u Temporal order, only predict future based on the past
u Don’t predict same users multiple times
u Scientifically correct: every procedure, every stage in the workflow
should be transparent, tested and reproducible.
Sparkz
u Extension of Spark for better functional programming
u Type-safe, everything checked at compile time
u Re-implementing some components for working around limitations
due to non functional nature of the development
u Currently just a proof-of-concept of designs and patterns of data
science applications, examples:
u Data Validation using monads and applicative functors
u Binary classifier evaluation using sceval
u A bunch of utils (lazy loggers, pimps, monoids…)
u A typed API for machine learning pipelines
https://github.com/gm-spacagna/sparkz
Our domain-specific data type is UserFeaturesWithBooleanLabel, and the information it contains can be separated into:
• Features: events, attributes and numericals
• Metadata: user id and date
• Target label: did the user buy the product in the next 3 months?
By expressing our API in terms of functions over the raw data type, we don’t drop any information from the source and we can always explain the results at any stage of the pipeline.
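As a rough sketch of what such a data type could look like: only the name UserFeaturesWithBooleanLabel comes from the slides, while `UserInfo`, `UserFeatures` and all field names and types below are assumptions made for illustration.

```scala
import java.time.LocalDate

object DomainTypes {
  // Metadata: identifies a sample, never used as a model input (hypothetical names).
  final case class UserInfo(customerId: Long, date: LocalDate)

  // Features: event sequence, categorical attributes and numerical values.
  final case class UserFeatures(
      events: Seq[String],
      attributes: Map[String, String],
      numericals: Map[String, Double])

  // A full sample: metadata, features and the target label.
  final case class UserFeaturesWithBooleanLabel(
      metaData: UserInfo,
      features: UserFeatures,
      isTrue: Boolean)
}
```

Keeping the three parts as separate fields makes it impossible to feed the user id or the date to a model by accident.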
A Binary Classifier is generic in the Features type and offers two ways of implementing it:
• Generic typed API (for domain-specific algorithms)
• Vector API (for standard machine learning algorithms)
• LabeledPoint is a pair of vector and label (no metadata though)
Separating the vector-form implementation from the transformation of the source data into a vector gives us more control over the process and makes it more robust. The generic typed API can be used for bespoke implementations that are not based on standard linear algebra (and thus don’t require vectors).
The simplest implementation of a generic model is a random classifier: no matter what the features look like, it returns a random number.
No vector transformation is required.
We can optionally specify a seed in case we want the results to be repeatable.
In this example the train method does nothing, and the anonymous implementation of the trained model simply returns a random number between 0 and 1.
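A minimal sketch of the random classifier described above, using plain Scala collections in place of RDDs. The names `Labelled`, `BinaryClassifierTrainer` and `RandomBinaryClassifierTrainer` and their signatures are assumptions for illustration, not necessarily the Sparkz API.

```scala
import scala.util.Random

object RandomClassifierSketch {
  final case class Labelled[Features](features: Features, label: Boolean)

  // Generic typed API: training yields a plain function Features => Double.
  trait BinaryClassifierTrainer[Features] {
    def train(trainingData: Seq[Labelled[Features]]): Features => Double
  }

  // Ignores both the training data and the features; an optional seed makes it repeatable.
  final case class RandomBinaryClassifierTrainer[Features](seed: Option[Long] = None)
      extends BinaryClassifierTrainer[Features] {
    def train(trainingData: Seq[Labelled[Features]]): Features => Double = {
      val random = seed.map(s => new Random(s)).getOrElse(new Random())
      _ => random.nextDouble() // score in [0, 1)
    }
  }
}
```

Two models trained with the same seed produce the same sequence of scores. The generator is stateful, so in a distributed setting it would need more care than this sketch takes.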
This vector classifier wraps the MLlib implementation, but the score function is re-implemented in DecisionTreeInference in such a way that it traverses the tree and computes the proportion of true and false samples. Thus the fitted model only needs to store the top node of the tree.
We had to implement it ourselves because the MLlib API is limited to returning only the output class (true/false), without the associated score.
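The traversal idea can be sketched on a hand-rolled tree. The `Node`, `Leaf` and `Split` types below are hypothetical stand-ins, not MLlib's classes, and storing raw true/false counts in the leaves is an assumption; the point is only that a score can be derived from the top node alone.

```scala
import scala.annotation.tailrec

object TreeScoring {
  type Vec = Vector[Double]

  // Hypothetical minimal tree, not MLlib's Node.
  sealed trait Node
  final case class Leaf(trueCount: Long, falseCount: Long) extends Node
  final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Walks from the top node down to a leaf and returns the proportion of true samples there.
  @tailrec
  def score(node: Node, features: Vec): Double = node match {
    case Leaf(t, f) => if (t + f == 0) 0.0 else t.toDouble / (t + f)
    case Split(i, threshold, left, right) =>
      score(if (features(i) <= threshold) left else right, features)
  }
}
```

For example, with `Split(0, 5.0, Leaf(1, 3), Leaf(9, 1))` a sample whose first feature is 2.0 lands in the left leaf and scores 0.25.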
A model implemented via the vector API requires the features type to be transformed into a vector.
The FeaturesTransformer API is generic in the Features type and has a kind of train method that, given the full training dataset, returns the mapping function Features => Vector.
This API allows us to learn the transformation rules directly from the underlying data, and it can be used to turn vector-based models into our domain-specific models.
The original structure of the source data is always preserved; the transformation function is applied lazily, on the fly, when needed.
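A sketch of how a FeaturesTransformer could lift a vector-based trainer into a typed one. Everything except the FeaturesTransformer name and the Features => Vector shape is an assumption: `Vec` stands in for an MLlib vector, and `VectorTrainer`, `typedTrainer`, `MinMaxTransformer` and `FirstComponentTrainer` are hypothetical names used only to make the example self-contained.

```scala
object FeaturesTransformerSketch {
  type Vec = Vector[Double]
  final case class Labelled[A](features: A, label: Boolean)

  // "Training" a transformer yields the mapping function Features => Vec.
  trait FeaturesTransformer[Features] {
    def featuresToVector(trainingData: Seq[Features]): Features => Vec
  }

  trait VectorTrainer {
    def train(data: Seq[Labelled[Vec]]): Vec => Double
  }

  // Turns a vector-based trainer into a trainer over the domain type.
  def typedTrainer[Features](trainer: VectorTrainer, transformer: FeaturesTransformer[Features])
      : Seq[Labelled[Features]] => (Features => Double) = { data =>
    val toVector = transformer.featuresToVector(data.map(_.features))
    val model = trainer.train(data.map(s => Labelled(toVector(s.features), s.label)))
    toVector andThen model
  }

  // Learns min and max from the training data, then rescales into [0, 1].
  object MinMaxTransformer extends FeaturesTransformer[Double] {
    def featuresToVector(trainingData: Seq[Double]): Double => Vec = {
      val (lo, hi) = (trainingData.min, trainingData.max)
      val range = if (hi > lo) hi - lo else 1.0
      x => Vector((x - lo) / range)
    }
  }

  // Toy vector trainer for illustration only: the score is the first component.
  object FirstComponentTrainer extends VectorTrainer {
    def train(data: Seq[Labelled[Vec]]): Vec => Double = _.head
  }
}
```

The fitted model still takes the original Features type as input, so the vector only exists inside the composed function.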
If our data type is hybrid, we want to apply a different transformation to each subset of features and then combine the results by concatenating the resulting sub-vectors.
The abstract class SubFeaturesTransformer specifies the SubFeatures type and extends the base transformer on type Features. This means it can be treated as a transformer of the base type while only considering a subset of the features.
The EnsembleTransformer concatenates multiple sub-vectors together.
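One possible shape for these two classes, sketched over plain Scala collections. The class names and `subfeaturesToVector` appear in the slides; the `subFeatures` getter, the exact signatures and the use of `Vector[Double]` as the vector type are assumptions.

```scala
object EnsembleSketch {
  type Vec = Vector[Double]

  trait FeaturesTransformer[Features] {
    def featuresToVector(trainingData: Seq[Features]): Features => Vec
  }

  // Works on a subset of the features but is still a transformer of the base type.
  abstract class SubFeaturesTransformer[SubFeatures, Features] extends FeaturesTransformer[Features] {
    def subFeatures(features: Features): SubFeatures
    def subfeaturesToVector(trainingData: Seq[SubFeatures]): SubFeatures => Vec

    final def featuresToVector(trainingData: Seq[Features]): Features => Vec = {
      val toVector = subfeaturesToVector(trainingData.map(subFeatures))
      features => toVector(subFeatures(features))
    }
  }

  // Fits every transformer on the same training data and concatenates their sub-vectors.
  final case class EnsembleTransformer[Features](transformers: FeaturesTransformer[Features]*)
      extends FeaturesTransformer[Features] {
    def featuresToVector(trainingData: Seq[Features]): Features => Vec = {
      val fitted = transformers.map(_.featuresToVector(trainingData))
      features => fitted.flatMap(f => f(features)).toVector
    }
  }
}
```

The order of the transformers passed to the ensemble fixes the layout of the final vector.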
The simplest numerical transformer takes a getter function that extracts a Map[Key, Double] from the generic Features type and returns a SubFeaturesTransformer in which the key-value map is flattened into a vector with the original values (no normalization).
If we want to apply, for example, standard scaling, we could simply train an OriginalNumericalsTransformer and pipe the function returned by subfeaturesToVector into a function Vector => Vector that represents our scaler.
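The flatten-then-scale idea can be sketched as two plain functions. To stay short the sketch works directly on the key-value maps rather than through a getter. The names `originalNumericals` and `standardScaler`, the sorted key layout and the choice of 0.0 for missing keys are all assumptions.

```scala
object NumericalsSketch {
  type Vec = Vector[Double]

  // Learns the sorted key set from the training data, then flattens each map
  // into a dense vector of the original values (missing keys become 0.0 here).
  def originalNumericals[Key: Ordering](trainingData: Seq[Map[Key, Double]]): Map[Key, Double] => Vec = {
    val keys = trainingData.flatMap(_.keys).distinct.sorted
    numericals => keys.map(k => numericals.getOrElse(k, 0.0)).toVector
  }

  // A standard scaler as a function Vec => Vec, fitted on already-flattened vectors.
  def standardScaler(vectors: Seq[Vec]): Vec => Vec = {
    val n = vectors.size.toDouble
    val means = vectors.transpose.map(_.sum / n)
    val stds = vectors.transpose.zip(means).map { case (column, mean) =>
      math.sqrt(column.map(x => (x - mean) * (x - mean)).sum / n)
    }
    v => v.zip(means).zip(stds).map { case ((x, mean), std) =>
      if (std == 0.0) 0.0 else (x - mean) / std
    }
  }
}
```

Piping is then plain function composition: `toVector andThen standardScaler(train.map(toVector))`.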
The One Hot Transformer flattens each distinct (key, value) combination into a vector of 1s and 0s.
The values represent categories. We encoded them as String, but you could easily extend this to a generic type as long as it has an equality function or a singleton property (e.g. case objects).
Hierarchies and granularities can also be handled in this logic.
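A minimal sketch of one-hot encoding over (key, value) pairs, again on plain maps. The function name `oneHot` and the sorted layout of the index are assumptions; categories not seen during training simply produce no 1.

```scala
object OneHotSketch {
  type Vec = Vector[Double]

  // Learns every distinct (key, value) pair from the training data; each pair
  // becomes one position of the output vector.
  def oneHot[Key: Ordering](trainingData: Seq[Map[Key, String]]): Map[Key, String] => Vec = {
    val index = trainingData.flatMap(_.toSeq).distinct.sorted
    attributes =>
      index.map { case (key, value) =>
        if (attributes.get(key).contains(value)) 1.0 else 0.0
      }.toVector
  }
}
```

With training values F and M for a gender key and north for a region key, the learned layout is (gender, F), (gender, M), (region, north).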
The Term Frequency Transformer takes a sequence of terms and turns them into frequency counts.
The index of each term is learnt during training and shared via a broadcast variable, so we don’t have to use the hashing trick, which may lead to incorrect results.
The Term type must have an Ordering defined so that the indexing is preserved across runs.
This transformer can easily be piped with an IDF function and turned into a TF-IDF transformer.
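A sketch of term-frequency encoding with a learnt, ordered vocabulary instead of hashing. The name `termFrequencies` is an assumption, and the vocabulary is captured in a plain closure here, whereas the slides describe sharing it through a Spark broadcast variable.

```scala
object TermFrequencySketch {
  type Vec = Vector[Double]

  // The vocabulary is learnt from the training data and sorted, so each term
  // keeps the same position across runs; unseen terms are ignored.
  def termFrequencies[Term: Ordering](trainingData: Seq[Seq[Term]]): Seq[Term] => Vec = {
    val vocabulary = trainingData.flatten.distinct.sorted.toVector
    terms => {
      val counts = terms.groupBy(identity).map { case (term, occurrences) =>
        term -> occurrences.size.toDouble
      }
      vocabulary.map(term => counts.getOrElse(term, 0.0))
    }
  }
}
```

Because every term has its own position, two different terms can never be merged into one count, which is the risk the hashing trick carries.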
We can easily create multiple typed-API classifiers by combining the vector trainer algorithm with one sub-features transformer or any combination of them.
The cross-fold validation uses metadata to prevent data leakage and to simulate the training/inference stages that we would run in the live production environment.
It requires the following parameters:
• data: full dataset with features, label and meta information
• k: number of folds
• classifiers: list of classifiers to be evaluated
• uniqueId: function extracting the unique id from the metadata
• orderingField: function extracting the temporal order of the samples for the before/after splitting
• singleInference: if enabled, the test set contains at most one sample for each unique id
• seed: seed of the random generator (for repeatability)
It returns a map from each classifier to the collection of test samples with their associated scores (no metrics aggregation).
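The splitting logic behind such a validation could look roughly like the sketch below. It is one possible interpretation, not the Sparkz implementation: the name `split`, the use of a single cutoff at the median of the ordering field, and the assignment of ids to folds by shuffled position are all assumptions. It only builds the training and test sets and leaves out the classifiers.

```scala
import scala.util.Random

object LeakFreeFolds {
  // Returns k (training, test) pairs such that, within each pair, no unique id
  // appears on both sides and every training sample precedes every test sample.
  def split[A, Id, T: Ordering](
      data: Seq[A],
      k: Int,
      uniqueId: A => Id,
      orderingField: A => T,
      singleInference: Boolean,
      seed: Long): Seq[(Seq[A], Seq[A])] = {
    require(data.nonEmpty && k > 1)
    val ord = implicitly[Ordering[T]]
    val random = new Random(seed)
    // Each unique id belongs to exactly one fold.
    val ids = random.shuffle(data.map(uniqueId).distinct)
    val foldOf: Map[Id, Int] = ids.zipWithIndex.map { case (id, i) => id -> i % k }.toMap
    // Assumed temporal cutoff: the median value of the ordering field.
    val ordered = data.map(orderingField).sorted
    val cutoff = ordered(ordered.size / 2)

    (0 until k).map { fold =>
      val training = data.filter(a => foldOf(uniqueId(a)) != fold && ord.lt(orderingField(a), cutoff))
      val candidates = data.filter(a => foldOf(uniqueId(a)) == fold && ord.gteq(orderingField(a), cutoff))
      val test =
        if (!singleInference) candidates
        else ids.filter(id => foldOf(id) == fold).flatMap { id =>
          // Keep a single sample per id, picked at a random point in time.
          val own = candidates.filter(a => uniqueId(a) == id)
          if (own.isEmpty) Seq.empty[A] else Seq(own(random.nextInt(own.size)))
        }
      (training, test)
    }
  }
}
```

Each classifier would then be trained on `training` and scored on `test` for every fold, with the metadata kept alongside the scores.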
We used the customer id as the unique identifier for our samples and the date as the temporal ordering field.
We enable singleInference so that we only predict each user once during the tests (at a random point in time).
The returned evaluation outputs are then mapped into collections of scores and boolean labels so that we can use sceval to compute the confusion matrices and finally the area under the ROC curve.
As general advice, we discourage using AUC as the evaluation metric in a real business context. A pool of measures should be created that maps the true business needs, e.g. precision @ 1000, uplift w.r.t. the prior, % of new buyers predicted when running the inference after a couple of weeks, variance of performance across multiple runs, seasonality...
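As an example of such a business-focused measure, precision @ k can be computed directly from the collection of scores and labels. This is a small sketch with a hypothetical function name; it does not use sceval.

```scala
object BusinessMetrics {
  // Fraction of true buyers among the k highest-scored samples.
  // If fewer than k samples are available, all of them are used.
  def precisionAtK(scoresAndLabels: Seq[(Double, Boolean)], k: Int): Double = {
    val top = scoresAndLabels.sortBy { case (score, _) => -score }.take(k)
    if (top.isEmpty) 0.0 else top.count(_._2).toDouble / top.size
  }
}
```

With k = 1000 this answers a concrete question: if we contact the 1000 highest-scored users, what share of them actually buys?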
[Diagram: Source Data (Meta: User Id, Date; Features: Events sequence, Demographic attributes, Financial statements) → TF, One Hot and Numerical value transformers → Vector → Decision Tree Trainer → Binary Classifier Trainer → Binary Classifier Fitted Model (Decision Tree Model) → Cross Fold Validation (Decision Tree Model 1 … Decision Tree Model k) → Scores and Labels → Performance metrics]
Conclusions
1. In this PoC we wanted to show a specific use case for which we could not find a suitable out-of-the-box machine learning framework.
2. We focused on the production quality aspects rather than quick and dirty prototypes.
3. We wanted the whole workflow to be transparent, tested and reproducible.
4. Nevertheless, we discourage implementing any premature abstractions until you have released your first MVP and possibly gone through a couple of extra iterations.
5. Our final requirements of what to implement were defined after an iterative and incremental process of data exploration and knowledge discovery.
6. We used the same production stack together with notebooks during the MVP development/investigation, but in a simpler and flat code structure.
7. Regardless of the programming language/technology, a thoughtful methodology really makes the difference.
Further References
u Sparkz:
https://github.com/gm-spacagna/sparkz
u Other examples and tutorials available at
Data Science Vademecum:
https://datasciencevademecum.wordpress.com/
u Data Science Manifesto:
http://www.datasciencemanifesto.org

 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 

Robust and declarative machine learning pipelines for predictive buying at Barclays

  • 1. Robust and declarative machine learning pipelines for predictive buying. Gianmario Spacagna, Advanced Data Analytics, Barclays, 2016/04/07
  • 2.
  • 3. “Latest technology advancements made data processing accessible to everyone, cheaply and quickly. We believe better value can be produced at the intersection of engineering practices and the correct application of the scientific method.” DATA SCIENCE MANIFESTO*
      • Principle 2: “All validation of data, hypotheses and performance should be tracked, reviewed and automated.”
      • Principle 3: “Prior to building a model, construct an evaluation framework with end-to-end business focused acceptance criteria.”
      * The current manifesto is still in beta version; check the full list of principles at datasciencemanifesto.org
  • 4. MLlib u Machine Learning library for Apache Spark u Scalable, production-oriented (theoretically can handle any size of data given enough cluster resources) u Built on top of RDD collections of vector objects u Provides: u Linear algebra u Basic Statistics u Classification and regression u Collaborative filtering u Clustering u Dimensionality Reduction u Feature extraction and transformation u Pattern mining u Evaluation
  • 5. ML pipelines u ML is the Mllib evolution on top of DataFrame u DataFrame: SQL-like schema of representing data, similar to the original concept in R and Python Pandas. u Offers optimised execution plans without engineering efforts u High-Level APIs for machine learning pipelines u Transformer: DataFrame => DataFrame u Features transformer u Fitted model usable for inferences u Estimator: def fit(data: DataFrame): Transformer u The algorithm that trains your model
  • 6. Great but… u No type safety at compile time (only dynamic types) u A features transformer and a model have the same API? u Non fully functional u functions should be registered as UDFs u explicit casting between SQL Row objects and JVM classes u complex logic harder to be tested and expressed in isolation u Can’t explicitly distinguishing between metadata and feature fields, though you can have as many columns you want (and also remove them at run-time!) u Harder to “safely” implement bespoke algorithms u Underlying technology is very robust but the higher API, although is declarative, could be error-prone when refactoring and maintaining in a real production system
  • 7. Barclays use case u Binary classifier for predicting users willing to buy a financial product in the following 3 months u Each user has different characteristics at different points in time (multiple samples of the same user) u A sample is uniquely defined by the customer id and date u Features consists of a mix of sequence of events, categorical attributes and numerical values u Each sample is labeled as true or false whether the user bought or not in the following horizon window u It must be production-ready but we want to estimate performances before deployment based on retrospective analysis
  • 8. Our Requirements u Nested data structures (mix of sequences and dictionaries) with some meta information attached u Abstraction between the implementations of a binary classifier in vector form and its domain-specific features extraction/selection u Fitted model should simply be a function Features => Double u Robust cross-fold validation, no Data Leakage: u No correlation between training and test set u Temporal order, only predict future based on the past u Don’t predict same users multiple times u Scientifically correct: every procedure, every stage in the workflow should be transparent, tested and reproducible.
  • 9. Sparkz u Extension of Spark for better functional programming u Type-safe, everything checked at compile time u Re-implementing some components for working around limitations due to non functional nature of the development u Currently just a proof-of-concept of designs and patterns of data science applications, examples: u Data Validation using monads and applicative functors u Binary classifier evaluation using sceval u A bunch of utils (lazy loggers, pimps, monoids…) u A typed API for machine learning pipelines https://github.com/gm-spacagna/sparkz
  • 10. Our domain-specific data type is UserFeaturesWithBooleanLabel, and the information contained in it can be separated into:
      • features: events, attributes and numericals
      • metadata: user id and date
      • target label: did the user buy the product in the next 3 months?
    By expressing our API in terms of the raw data type we don’t drop any information from the source, and we can always explain the results at any stage of the pipeline.
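
The separation described on this slide could be sketched as a set of case classes. Only the name UserFeaturesWithBooleanLabel comes from the slides; the field names, the field types and the helper classes UserFeatures and MetaData are assumptions made for illustration, not the actual sparkz definitions.

```scala
import java.time.LocalDate

// Hypothetical layout: field names and types are assumed, not taken from sparkz.
final case class UserFeatures(
  events: Seq[String],              // sequence of events
  attributes: Map[String, String],  // categorical attributes
  numericals: Map[String, Double]   // numerical values
)

// Meta information that identifies a sample but is never used as a feature.
final case class MetaData(userId: Long, date: LocalDate)

final case class UserFeaturesWithBooleanLabel(
  features: UserFeatures,
  meta: MetaData,
  isTrue: Boolean                   // bought the product in the next 3 months?
)

val sample = UserFeaturesWithBooleanLabel(
  UserFeatures(Seq("login", "view_offer"), Map("segment" -> "retail"), Map("balance" -> 1200.0)),
  MetaData(42L, LocalDate.of(2016, 1, 31)),
  isTrue = false
)
```

Keeping the meta data next to the features, rather than dropping it before training, is what later allows the validation step to split by user and by date.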
  • 11. A BinaryClassifier is generic in the type Features and offers two ways of implementing it:
      • Generic typed API (for domain-specific algorithms)
      • Vector API (for standard machine learning algorithms)
      • LabeledPoint is a pair of vector and label (no meta data though)
    Splitting the vector-form implementation from the transformation of the source data into a vector gives us more control over the process and makes it more robust. The generic typed API can be used for bespoke implementations that are not based on standard linear algebra (and thus don’t require vectors).
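
A minimal sketch of the two APIs, assuming plain Scala collections in place of Spark RDDs. All trait and method names here (FittedModel, BinaryClassifierTrainer, VectorTrainer, typed, PriorTrainer) are illustrative guesses at the shape of the design, not the sparkz signatures.

```scala
// Plain Scala collections stand in for Spark RDDs; every name is illustrative.
final case class LabeledPoint(label: Boolean, vector: Vector[Double])
final case class Labeled[F](features: F, isTrue: Boolean)

// A fitted model is essentially a function Features => Double.
trait FittedModel[F] { def score(features: F): Double }

// Generic typed API: train directly on the domain type F.
trait BinaryClassifierTrainer[F] {
  def train(data: Seq[Labeled[F]]): FittedModel[F]
}

// Vector API: train on labeled points only (no meta data available).
trait VectorTrainer {
  def train(data: Seq[LabeledPoint]): Vector[Double] => Double
}

// Lift a vector trainer into the typed API, given a Features => Vector mapping.
def typed[F](trainer: VectorTrainer, toVector: F => Vector[Double]): BinaryClassifierTrainer[F] =
  new BinaryClassifierTrainer[F] {
    def train(data: Seq[Labeled[F]]): FittedModel[F] = {
      val points = data.map(s => LabeledPoint(s.isTrue, toVector(s.features)))
      val scoreVector = trainer.train(points)
      new FittedModel[F] {
        def score(features: F): Double = scoreVector(toVector(features))
      }
    }
  }

// Toy vector trainer: always answers with the prior of the positive class.
object PriorTrainer extends VectorTrainer {
  def train(data: Seq[LabeledPoint]): Vector[Double] => Double = {
    val prior = if (data.isEmpty) 0.0 else data.count(_.label).toDouble / data.size
    (_: Vector[Double]) => prior
  }
}

val model = typed[Double](PriorTrainer, d => Vector(d))
  .train(Seq(Labeled(1.0, true), Labeled(2.0, false)))
```

The point of the split is that the vector trainer never sees the domain type, while callers of the typed trainer never see vectors.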
  • 12. The simplest implementation of a generic model is a random classifier: no matter what the features look like, it returns a random number. No vector transformation is required. We can optionally specify a seed in case we want the results to be repeatable. In this example the train method does nothing, and the anonymous implementation of the trained model simply returns a random number between 0 and 1.
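
The random classifier described above might look roughly like this. The names RandomBinaryClassifierTrainer and FittedModel, and the use of a plain Seq for the training data, are assumptions for the sake of a self-contained sketch.

```scala
import scala.util.Random

trait FittedModel[F] { def score(features: F): Double }

// Hypothetical trainer: the optional seed makes the sequence of scores repeatable.
final case class RandomBinaryClassifierTrainer[F](seed: Option[Long] = None) {
  // Training ignores the data entirely; it only creates the random generator.
  def train(data: Seq[(F, Boolean)]): FittedModel[F] = {
    val random = seed.map(s => new Random(s)).getOrElse(new Random())
    new FittedModel[F] {
      // Uniform score in [0, 1), independent of the features.
      def score(features: F): Double = random.nextDouble()
    }
  }
}

val first  = RandomBinaryClassifierTrainer[String](Some(7L)).train(Seq.empty)
val second = RandomBinaryClassifierTrainer[String](Some(7L)).train(Seq.empty)
```

Such a model is mostly useful as a baseline: any real classifier evaluated in the same validation framework should beat it.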
  • 13. This vector classifier wraps the MLlib implementation, but the score function is re-implemented in DecisionTreeInference so that it traverses the tree and computes the proportion of true and false samples. Thus the fitted model only needs to store the top node of the tree. We had to implement it ourselves because the MLlib API is limited to returning only the output class (true/false) without the associated score.
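
To illustrate the idea of scoring by traversal, here is a sketch on a simplified, hypothetical tree type whose leaves store the counts of positive and negative training samples. It is not the MLlib Node class nor the DecisionTreeInference code; it only shows how a score can be derived from the root node alone.

```scala
// Simplified stand-in for a fitted decision tree (not the MLlib classes).
sealed trait Node
final case class Leaf(positives: Long, negatives: Long) extends Node
final case class Split(featureIndex: Int, threshold: Double, left: Node, right: Node) extends Node

// Walk from the top node down to a leaf and return the proportion of
// positive training samples that ended up in that leaf.
def score(node: Node, vector: Vector[Double]): Double = node match {
  case Leaf(pos, neg) =>
    if (pos + neg == 0) 0.0 else pos.toDouble / (pos + neg)
  case Split(index, threshold, left, right) =>
    score(if (vector(index) <= threshold) left else right, vector)
}

val tree = Split(0, 0.5, Leaf(positives = 1, negatives = 3), Leaf(positives = 3, negatives = 1))
```

With this tree, a vector whose first component is at most 0.5 lands in the left leaf and receives the score 1 / (1 + 3).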
  • 14. A model implemented via the vector API requires the features type to be transformed into a vector. The FeaturesTransformer API is generic in the Features type and has a sort of train method that, given the full training dataset, returns the mapping function Features => Vector. This API allows us to learn the transformation rules directly from the underlying data and can be used to turn vector-based models into our domain-specific models. The original structure of the source data is always preserved; the transformation function is applied lazily, on the fly, when needed.
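
A sketch of what a transformer with a train-like method could look like, again with plain collections instead of RDDs. The method name featuresToVector and the MaxScaler example are assumptions; the example is chosen only because its mapping genuinely depends on the training data.

```scala
// Hypothetical shape of the API: "training" returns the mapping Features => Vector.
trait FeaturesTransformer[F] {
  def featuresToVector(data: Seq[F]): F => Vector[Double]
}

// Example transformer that learns the maximum of one numeric field from the
// training data and divides by it. Assumes the training data is non-empty.
final case class MaxScaler[F](getter: F => Double) extends FeaturesTransformer[F] {
  def featuresToVector(data: Seq[F]): F => Vector[Double] = {
    val max = data.map(getter).max
    (features: F) => Vector(if (max == 0.0) 0.0 else getter(features) / max)
  }
}

val toVector = MaxScaler[Double]((x: Double) => x).featuresToVector(Seq(1.0, 4.0))
```

Because the result is an ordinary function, it can be applied to each sample only when a vector is actually needed, leaving the source records untouched.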
  • 15. If our data type is hybrid, we want to apply a different transformation to each subset of features and then combine them by concatenating the resulting sub-vectors. The abstract class SubFeaturesTransformer specifies the SubFeatures type and extends the base transformer on type Features. That means it can be treated as a transformer of the base type that only considers a subset of the features. The EnsembleTransformer concatenates multiple sub-vectors together.
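
The relationship between the base transformer, the sub-features transformer and the ensemble could be sketched as follows. The class names follow the slide, but the signatures, the User type and the RawValue helper are assumptions.

```scala
trait FeaturesTransformer[F] {
  def featuresToVector(data: Seq[F]): F => Vector[Double]
}

// Works on a subset S of the features, extracted from F by a getter,
// yet can be used wherever a transformer of the full type F is expected.
abstract class SubFeaturesTransformer[S, F](getter: F => S) extends FeaturesTransformer[F] {
  def subFeaturesToVector(data: Seq[S]): S => Vector[Double]

  def featuresToVector(data: Seq[F]): F => Vector[Double] = {
    val toVector = subFeaturesToVector(data.map(getter))
    (features: F) => toVector(getter(features))
  }
}

// Fits every transformer on the same data and concatenates the sub-vectors in order.
final case class EnsembleTransformer[F](transformers: FeaturesTransformer[F]*)
    extends FeaturesTransformer[F] {
  def featuresToVector(data: Seq[F]): F => Vector[Double] = {
    val fitted = transformers.map(_.featuresToVector(data))
    (features: F) => fitted.flatMap(f => f(features)).toVector
  }
}

// Hypothetical domain type and a trivial sub-transformer for the example.
final case class User(age: Double, balance: Double)

class RawValue(getter: User => Double) extends SubFeaturesTransformer[Double, User](getter) {
  def subFeaturesToVector(data: Seq[Double]): Double => Vector[Double] =
    (value: Double) => Vector(value)
}

val users = Seq(User(30.0, 100.0), User(40.0, 250.0))
val toVector =
  EnsembleTransformer(new RawValue(_.age), new RawValue(_.balance)).featuresToVector(users)
```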
  • 16. The simplest numerical transformer takes a getter function that extracts a Map[Key, Double] from the generic Features type and returns a SubFeaturesTransformer where the key-value map is flattened into a vector with the original values (no normalization). If we want to apply, for example, a standard scaling, we can simply train an OriginalNumericalsTransformer and pipe the function returned by subfeaturesToVector into a function Vector => Vector that represents our scaler.
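
A rough illustration of flattening a key-value map into a vector and piping the result into a scaler. The function name originalNumericals, the sorted key order and the choice of 0.0 for missing keys are assumptions of this sketch.

```scala
// Learn the set of keys from the training data, then flatten each map into a
// vector of raw values in a fixed (sorted) key order.
// Assumption of this sketch: a key missing from a sample is encoded as 0.0.
def originalNumericals[K: Ordering](data: Seq[Map[K, Double]]): Map[K, Double] => Vector[Double] = {
  val keys = data.flatMap(_.keys).distinct.sorted.toVector
  (numericals: Map[K, Double]) => keys.map(k => numericals.getOrElse(k, 0.0))
}

val training = Seq(Map("b" -> 2.0), Map("a" -> 1.0, "b" -> 3.0))
val toVector = originalNumericals(training)

// Piping into a Vector => Vector function; a fixed division stands in for a real scaler.
val scaled = toVector.andThen(_.map(_ / 100.0))
```

Since the transformer returns a plain function, composing it with any Vector => Vector step only requires andThen.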
  • 17. The One Hot Transformer flattens each distinct combination of (key, value) pairs into a vector of 1s and 0s. The values represent categories. We encoded them as String, but you could easily extend it to a generic type as long as it has an equality function or a singleton property (e.g. case objects). Hierarchies and granularities can also be pushed into this logic.
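
One-hot encoding of (key, value) pairs can be sketched like this. The function name oneHot and the treatment of unseen pairs (they simply activate nothing) are assumptions of the example.

```scala
// Learn one vector position per distinct (key, value) pair seen in training.
def oneHot(data: Seq[Map[String, String]]): Map[String, String] => Vector[Double] = {
  val index: Map[(String, String), Int] =
    data.flatMap(_.toSeq).distinct.sorted.zipWithIndex.toMap
  (attributes: Map[String, String]) => {
    // Pairs never seen during training have no position and are ignored.
    val active = attributes.toSeq.flatMap(pair => index.get(pair)).toSet
    Vector.tabulate(index.size)(i => if (active(i)) 1.0 else 0.0)
  }
}

val encode = oneHot(Seq(
  Map("segment" -> "retail"),
  Map("segment" -> "premium", "channel" -> "web")
))
```

Here the learnt positions, in sorted order, are (channel, web), (segment, premium) and (segment, retail).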
  • 18. The TermFrequencyTransformer takes a sequence of terms and turns them into frequency counts. The index of each term is learnt during training and shared through a broadcast variable, so that we don’t have to use the hashing trick, which may lead to incorrect results. The Term type must have an Ordering defined so that we can preserve the indexing across different runs. This transformer can easily be piped into an IDF function and turned into a TF-IDF transformer.
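
A sketch of the term-frequency idea with an explicitly learnt index. In Spark the index would be shared with the executors through a broadcast variable; here a local Map plays that role, and the function name termFrequency is an assumption.

```scala
// Learn a stable term -> position index from the training sequences.
// The Ordering keeps the positions identical across runs.
def termFrequency[T: Ordering](data: Seq[Seq[T]]): Seq[T] => Vector[Double] = {
  val index: Map[T, Int] = data.flatten.distinct.sorted.zipWithIndex.toMap
  (terms: Seq[T]) => {
    val counts = Array.fill(index.size)(0.0)
    // Terms that were not seen during training have no position and are skipped.
    terms.foreach(term => index.get(term).foreach(i => counts(i) += 1.0))
    counts.toVector
  }
}

val tf = termFrequency(Seq(Seq("login", "pay"), Seq("view")))
```

Because every term owns exactly one position, two different terms can never be counted in the same slot, which is the failure mode the hashing trick can introduce.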
  • 19. We can easily create multiple typed-API classifiers by combining the vector trainer algorithm with one or any combination of sub-features transformers.
  • 20. The cross-fold validation uses meta data to stop data leakage and to simulate the training/inference stages that we would run in the live production environment. It requires the following parameters:
      • data: full dataset with features, label and meta information
      • k: number of folds
      • classifiers: list of classifiers to be evaluated
      • uniqueId: function extracting the unique id from the meta data
      • orderingField: function extracting the temporal order of the samples for the before/after splitting
      • singleInference: if enabled, the test set contains at most one sample for each unique id
      • seed: seed of the random generator (for repeatability)
    It returns a map of Classifier -> collection of test samples and associated scores (no metrics aggregation).
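
One possible way to build leakage-free folds from these parameters is sketched below. This is an assumed interpretation, not the sparkz algorithm: users are randomly assigned to folds, a single fixed splitDay separates past from future, and the names Sample, Fold and folds are hypothetical.

```scala
import scala.util.Random

// Hypothetical sample: userId plays the role of uniqueId, day of orderingField.
final case class Sample(userId: Long, day: Int, label: Boolean)
final case class Fold(training: Seq[Sample], test: Seq[Sample])

def folds(data: Seq[Sample], k: Int, splitDay: Int,
          singleInference: Boolean, seed: Long): Seq[Fold] = {
  val random = new Random(seed)
  // Every user belongs to exactly one fold, so no user is both trained on and tested.
  val foldOf: Map[Long, Int] =
    data.map(_.userId).distinct.sorted.map(id => id -> random.nextInt(k)).toMap
  (0 until k).map { fold =>
    // Train only on the past of the other users, test only on the future of this fold's users.
    val training = data.filter(s => foldOf(s.userId) != fold && s.day < splitDay)
    val candidates = data.filter(s => foldOf(s.userId) == fold && s.day >= splitDay)
    val test =
      if (singleInference)
        // Keep one randomly chosen sample per user.
        candidates.groupBy(_.userId).toSeq.sortBy(_._1)
          .map { case (_, samples) => samples(random.nextInt(samples.size)) }
      else candidates
    Fold(training, test)
  }
}

val data = for (user <- 1L to 10L; day <- 1 to 6) yield Sample(user, day, label = day % 2 == 0)
val result = folds(data, k = 3, splitDay = 4, singleInference = true, seed = 1L)
```

Each classifier would then be trained on fold.training and scored on fold.test, returning the raw scores and labels without aggregating them.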
  • 21. We used the customer id as the unique identifier for our samples and the date as the temporal ordering field. We enabled singleInference so that we only predict each user once during the tests (at a random point in time). The returned evaluation outputs are then mapped into collections of score and boolean label so that we can use sceval for computing the confusion matrices and finally the area under the ROC. As general advice, we discourage using AUC as the evaluation metric in a real business context. A pool of measures should be created that maps the true business needs, e.g. precision @ 1000, uplift w.r.t. prior, % of new buyers predicted when running the inference after a couple of weeks’ time, variance of performance across multiple runs, seasonality...
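
For illustration, two of the measures mentioned above can be computed directly from a collection of (score, label) pairs. This sketch does not use the sceval API; the function names are hypothetical, the AUC is computed with the pairwise ranking formulation (quadratic cost, fine for small samples), and both classes are assumed to be present.

```scala
// Area under the ROC curve as the probability that a random positive sample
// is scored higher than a random negative one (ties count as one half).
// Assumes at least one positive and one negative sample.
def areaUnderRoc(scoresAndLabels: Seq[(Double, Boolean)]): Double = {
  val positives = scoresAndLabels.collect { case (score, true) => score }
  val negatives = scoresAndLabels.collect { case (score, false) => score }
  val wins = positives.flatMap { p =>
    negatives.map(n => if (p > n) 1.0 else if (p == n) 0.5 else 0.0)
  }
  wins.sum / (positives.size.toLong * negatives.size)
}

// Fraction of true buyers among the k highest-scored samples.
def precisionAt(k: Int, scoresAndLabels: Seq[(Double, Boolean)]): Double = {
  val top = scoresAndLabels.sortBy(-_._1).take(k)
  top.count(_._2).toDouble / top.size
}

val evaluated = Seq((0.9, true), (0.8, false), (0.7, true), (0.1, false))
```

Precision at a fixed k is often closer to the business question, since only a limited number of top-scored customers can actually be contacted.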
  • 22. [Diagram of the pipeline. Source data: features (events sequence, demographic attributes, financial statements) and meta (user id, date). Transformers: TF, One Hot, numerical value, combined into a vector. Decision tree trainer (binary classifier trainer) producing a decision tree model (binary classifier fitted model). Cross-fold validation: decision tree models 1…k, scores and labels, performance metrics.]
  • 23. Conclusions
      1. In this PoC we wanted to show a specific use case for which we could not find suitable out-of-the-box machine learning frameworks.
      2. We focused on the production quality aspects rather than quick and dirty prototypes.
      3. We wanted the whole workflow to be transparent, tested and reproducible.
      4. Nevertheless, we discourage implementing any premature abstractions until you get your first MVP released and possibly a couple of extra iterations.
      5. Our final requirements of what to implement were defined after an iterative and incremental data exploration and knowledge discovery.
      6. We used the same production stack together with notebooks during the MVP development/investigation, but in a simpler and flat code structure.
      7. Regardless of the programming language/technology, the thoughtful methodology really makes the difference.
  • 24. Further References u Sparkz: https://github.com/gm-spacagna/sparkz u Other examples and tutorials available at Data Science Vademecum: https://datasciencevademecum.wordpress.com/ u Data Science Manifesto: http://www.datasciencemanifesto.org