Robust and declarative machine learning pipelines for predictive buying
GIANMARIO SPACAGNA
ADVANCED DATA ANALYTICS, BARCLAYS
2016/04/07
“Latest technology advancements made data processing accessible to everyone, cheaply and quickly. We believe better value can be produced at the intersection of engineering practices and the correct application of the scientific method.”
DATA SCIENCE MANIFESTO*
Principle 2: “All validation of data, hypotheses and performance should be
tracked, reviewed and automated.”
Principle 3: “Prior to building a model, construct an evaluation framework with
end-to-end business focused acceptance criteria.”
* The current manifesto is still in beta; check the full list of principles at datasciencemanifesto.org
MLlib
• Machine Learning library for Apache Spark
• Scalable, production-oriented (theoretically can handle any size of data given enough cluster resources)
• Built on top of RDD collections of vector objects
• Provides:
  • Linear algebra
  • Basic statistics
  • Classification and regression
  • Collaborative filtering
  • Clustering
  • Dimensionality reduction
  • Feature extraction and transformation
  • Pattern mining
  • Evaluation
ML pipelines
• ML is the MLlib evolution on top of DataFrame
• DataFrame: SQL-like schema for representing data, similar to the original concept in R and Python Pandas
• Offers optimised execution plans without engineering effort
• High-level APIs for machine learning pipelines
  • Transformer: DataFrame => DataFrame
    • Features transformer
    • Fitted model usable for inference
  • Estimator: def fit(data: DataFrame): Transformer
    • The algorithm that trains your model
Great but…
• No type safety at compile time (only dynamic types)
  • A features transformer and a model have the same API?
• Not fully functional
  • functions must be registered as UDFs
  • explicit casting between SQL Row objects and JVM classes
  • complex logic is harder to test and express in isolation
• Can’t explicitly distinguish between metadata and feature fields, though you can have as many columns as you want (and also remove them at run-time!)
• Harder to “safely” implement bespoke algorithms
• The underlying technology is very robust, but the higher-level API, although declarative, can be error-prone when refactoring and maintaining a real production system
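A small illustration of the friction described above (df is an assumed DataFrame with a hypothetical balance column):

import org.apache.spark.sql.functions.udf

// Business logic has to be wrapped in a UDF before it can be used on a DataFrame.
val isHighValue = udf((balance: Double) => balance > 10000.0)
val enriched = df.withColumn("highValue", isHighValue(df("balance")))

// Column names and types are only checked at run-time: a misspelt name or a
// wrong cast compiles fine and only fails when the job runs.
val flags: Array[Boolean] =
  enriched.select("highValue").collect().map(_.getAs[Boolean]("highValue"))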
Barclays use case
• Binary classifier for predicting users willing to buy a financial product in the following 3 months
• Each user has different characteristics at different points in time (multiple samples of the same user)
• A sample is uniquely defined by the customer id and date
• Features consist of a mix of event sequences, categorical attributes and numerical values
• Each sample is labeled as true or false depending on whether the user bought or not in the following horizon window
• It must be production-ready, but we want to estimate performance before deployment based on retrospective analysis
Our Requirements
• Nested data structures (a mix of sequences and dictionaries) with some meta information attached
• Abstraction between the implementation of a binary classifier in vector form and its domain-specific feature extraction/selection
• A fitted model should simply be a function Features => Double
• Robust cross-fold validation, no data leakage:
  • No correlation between training and test set
  • Temporal order: only predict the future based on the past
  • Don’t predict the same users multiple times
• Scientifically correct: every procedure, every stage in the workflow should be transparent, tested and reproducible
Sparkz
• Extension of Spark for better functional programming
• Type-safe, everything checked at compile time
• Re-implements some components to work around limitations due to the non-functional nature of the original development
• Currently just a proof-of-concept of designs and patterns for data science applications, for example:
  • Data validation using monads and applicative functors
  • Binary classifier evaluation using sceval
  • A bunch of utils (lazy loggers, pimps, monoids…)
  • A typed API for machine learning pipelines
https://github.com/gm-spacagna/sparkz
Our domain-specific data type is UserFeaturesWithBooleanLabel and the information contained in it can be separated into:
• features: events, attributes and numericals
• metadata: user id and date
• target label: did the user buy the product in the next 3 months?
By expressing our API as functions of the raw data type we don’t drop any information from the source and we can always explain the results at any stage of the pipeline.
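A sketch of what such a data type could look like (the field names below are our illustration, not the actual Sparkz/Barclays definitions):

// Hypothetical shape of the domain-specific sample described above.
case class UserMeta(customerId: Long, date: java.sql.Date)

case class UserFeatures(
  events: Seq[String],              // sequence of events
  attributes: Map[String, String],  // categorical attributes
  numericals: Map[String, Double])  // numerical values

case class UserFeaturesWithBooleanLabel(
  features: UserFeatures,
  meta: UserMeta,
  label: Boolean)                   // bought the product in the next 3 months?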
A BinaryClassifier is generic in the type Features and offers two ways of implementing it:
• Generic typed API (for domain-specific algorithms)
• Vector API (for standard machine learning algorithms)
• LabeledPoint is a pair of vector and label (no meta data though)
Splitting the vector-form implementation from the transformation of the source data into a vector gives us more control over, and robustness of, the process. The generic typed API can be used for bespoke implementations that are not based on standard linear algebra (and thus don’t require vectors).
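A possible shape for this API (a sketch under our own naming, not the exact Sparkz signatures):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// A fitted model is just a scoring function Features => Double.
trait BinaryClassifier[Features] extends (Features => Double)

// Generic typed API: train directly on the domain type.
trait BinaryClassifierTrainer[Features] {
  def train(data: RDD[(Features, Boolean)]): BinaryClassifier[Features]
}

// Vector API: standard algorithms only ever see LabeledPoint (vector + label, no meta data).
trait VectorClassifierTrainer {
  def train(data: RDD[LabeledPoint]): org.apache.spark.mllib.linalg.Vector => Double
}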
The simplest implementation of a generic model is a random classifier: no matter what the features look like, it returns a random number. No vector transformation is required. We can optionally specify a seed in case we want it to be systematically repeatable. In this example the train method does nothing and the anonymous implementation of the trained model simply returns a random number between 0 and 1.
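A sketch of such a random baseline on the generic typed API (the trait shape follows the sketch above and is an assumption):

import org.apache.spark.rdd.RDD
import scala.util.Random

trait BinaryClassifier[Features] extends (Features => Double)

// Training ignores the data; scoring returns a uniform random value in [0, 1).
case class RandomClassifierTrainer[Features](seed: Option[Long] = None) {
  def train(data: RDD[(Features, Boolean)]): BinaryClassifier[Features] = {
    val rng = seed.map(new Random(_)).getOrElse(new Random())
    new BinaryClassifier[Features] {
      def apply(features: Features): Double = rng.nextDouble()
    }
  }
}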
This vector classifier wraps the MLlib implementation, but the score function is re-implemented in DecisionTreeInference in such a way that it traverses the tree and computes the proportion of true and false samples. Thus the fitted model only needs to store the top node of the tree. We had to implement it ourselves because the MLlib API only returns the output class (true/false), without the associated score.
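A sketch of the kind of traversal involved, using the public MLlib tree model types (the actual DecisionTreeInference code in Sparkz may differ):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.configuration.FeatureType
import org.apache.spark.mllib.tree.model.Node

// Walk from the top node down to a leaf and return the score of the positive class.
def score(node: Node, features: Vector): Double =
  if (node.isLeaf) {
    // Predict holds the majority label and its estimated probability at the leaf.
    if (node.predict.predict == 1.0) node.predict.prob else 1.0 - node.predict.prob
  } else {
    val split = node.split.get
    val goLeft =
      if (split.featureType == FeatureType.Continuous) features(split.feature) <= split.threshold
      else split.categories.contains(features(split.feature))
    score(if (goLeft) node.leftNode.get else node.rightNode.get, features)
  }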
A model implemented via the vector API then requires the features type to be transformed into a vector. The FeaturesTransformer API is generic in the Features type and has a sort of train method that, given the full training dataset, returns the mapping function Features => Vector. This API allows us to learn the transformation rules directly from the underlying data and can be used to turn vector-based models into our domain-specific models. The original structure of the source data is always preserved; the transformation function is lazily applied on-the-fly when needed.
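In code, the API could look roughly like this (illustrative names, not the exact Sparkz trait):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait FeaturesTransformer[Features] {
  // Learn the transformation rules (e.g. distinct categories, term indices) from the
  // training data and return a pure mapping that is applied lazily to any sample.
  def featuresToVector(trainingData: RDD[Features]): Features => Vector
}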
If our data type is hybrid we want to apply a different transformation to each subset of features and then combine them by concatenating the resulting sub-vectors. The abstract class SubFeaturesTransformer specifies both the SubFeatures type and extends the base transformer on the type Features. That means it can be treated as a transformer of the base type while only considering a subset of the features. The EnsembleTransformer then concatenates multiple sub-vectors together.
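A sketch of these two pieces (illustrative, not the exact Sparkz classes):

import scala.reflect.ClassTag
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// A transformer defined on a projection of the full Features type.
abstract class SubFeaturesTransformer[SubFeatures: ClassTag, Features](getter: Features => SubFeatures) {
  def subFeaturesToVector(trainingData: RDD[SubFeatures]): SubFeatures => Vector

  // Seen from the outside it is still a transformer of the base Features type.
  def featuresToVector(trainingData: RDD[Features]): Features => Vector = {
    val toVector = subFeaturesToVector(trainingData.map(getter))
    features => toVector(getter(features))
  }
}

object EnsembleTransformer {
  // Concatenate the sub-vectors produced by several learnt mappings.
  def combine[Features](mappings: Seq[Features => Vector]): Features => Vector =
    features => Vectors.dense(mappings.flatMap(m => m(features).toArray).toArray)
}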
The simplest numerical transformer takes a getter function that extracts a Map[Key, Double] from the generic Features type and returns a SubFeaturesTransformer where the key-value map is flattened into a vector with the original values (no normalization). If we want to apply, for example, standard scaling, we can simply train an OriginalNumericalsTransformer and pipe the function returned by subfeaturesToVector into a function Vector => Vector representing our scaler.
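A rough sketch of the flattening and of piping a scaler after it (illustrative; the fixed key ordering is our assumption to keep the indexing stable):

import scala.reflect.ClassTag
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Flatten a Map[Key, Double] into a vector, using a key order learnt at training time.
def originalNumericals[Features, Key: Ordering: ClassTag](getter: Features => Map[Key, Double])(
    trainingData: RDD[Features]): Features => Vector = {
  val keys = trainingData.flatMap(f => getter(f).keys).distinct().collect().sorted
  features => Vectors.dense(keys.map(k => getter(features).getOrElse(k, 0.0)))
}

// A scaler is just a Vector => Vector function piped after the learnt mapping.
def scaled[Features](toVector: Features => Vector, scaler: Vector => Vector): Features => Vector =
  toVector andThen scaler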
The One Hot Transformer flattens each distinct combination of (key, value) pairs into a vector of 1s and 0s. The values represent categories. We encoded them as String, but you could easily extend it to a generic type as long as it has an equality function or a singleton property (e.g. case objects). Hierarchies and granularities can also be pushed into this logic.
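A sketch of that encoding for String-valued attributes (illustrative, not the Sparkz implementation):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// One slot per distinct (key, value) pair observed at training time.
def oneHot[Features](getter: Features => Map[String, String])(
    trainingData: RDD[Features]): Features => Vector = {
  val pairs = trainingData.flatMap(f => getter(f).toSeq).distinct().collect().sorted
  features => {
    val present = getter(features).toSet
    Vectors.dense(pairs.map(p => if (present.contains(p)) 1.0 else 0.0))
  }
}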
The Term Frequency Transformer takes a sequence of terms and turns them into frequency counts. The index of each term is learnt during training and shared in a broadcast variable, so we don’t have to use the hashing trick, which may lead to incorrect results. The Term type must have an Ordering defined so that we preserve the same indexing across different runs. This transformer can easily be piped with an IDF function and turned into a TF-IDF transformer.
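A sketch of this transformer (illustrative; the explicit SparkContext parameter for broadcasting is our assumption):

import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Learn an explicit term -> index map at training time and broadcast it,
// instead of relying on the hashing trick.
def termFrequency[Features, Term: Ordering: ClassTag](getter: Features => Seq[Term])(
    sc: SparkContext, trainingData: RDD[Features]): Features => Vector = {
  val terms = trainingData.flatMap(getter).distinct().collect().sorted
  val index = sc.broadcast(terms.zipWithIndex.toMap)
  features => {
    val counts = getter(features).groupBy(identity).mapValues(_.size.toDouble)
    Vectors.sparse(
      index.value.size,
      counts.toSeq.flatMap { case (term, count) => index.value.get(term).map(i => (i, count)) })
  }
}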
We can easily create multiple typed-API classifiers by combining the vector trainer algorithm with one or any
combination of sub-features transformers.
The cross-fold validation uses meta data to stop data leakage and to simulate the training/inference stages that we would run in the live production environment.
It requires the following parameters:
• data: full dataset with features, label and meta information
• k: number of folds
• classifiers: list of classifiers to be evaluated
• uniqueId: function extracting the unique id from the meta data
• orderingField: function extracting the temporal order of the samples for the before/after splitting
• singleInference: if enabled, the test set contains at most one sample for each unique id
• seed: seed of the random generator (for repeatability)
It returns a map of Classifier -> collection of test samples and associated scores (no metrics aggregation).
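The entry point implied by this parameter list could look roughly as follows (a signature-only sketch; the names and types are assumptions rather than the real Sparkz API):

import org.apache.spark.rdd.RDD

// Minimal trainer abstraction so the sketch is self-contained.
trait BinaryClassifierTrainer[Features] {
  def train(data: RDD[(Features, Boolean)]): Features => Double
}

// Splits the data k times into past/future folds by orderingField, de-duplicates the
// test set by uniqueId when singleInference is on, and scores each held-out sample.
def crossFoldValidation[Features, Meta](
    data: RDD[(Features, Meta, Boolean)],
    k: Int,
    classifiers: Seq[BinaryClassifierTrainer[Features]],
    uniqueId: Meta => Long,
    orderingField: Meta => Long,
    singleInference: Boolean,
    seed: Long): Map[BinaryClassifierTrainer[Features], RDD[((Features, Meta, Boolean), Double)]] = ???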
We used the customer id as the unique identifier for our samples and the date as the temporal ordering field. We enabled singleInference so that we only predict each user once during the tests (at a random point in time).
The resulting evaluation outputs are then mapped into collections of scores and boolean labels so that we can use sceval to compute the confusion matrices and finally the area under the ROC curve.
As general advice, we discourage using AUC as the evaluation metric in a real business context. A pool of measures should be created mapping the true business needs, e.g. precision @ 1000, uplift w.r.t. the prior, % of new buyers predicted when running the inference after a couple of weeks’ time, variance of performance across multiple runs, seasonality...
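For instance, precision at the top N scored customers can be computed directly from the (score, label) pairs (a plain Scala sketch, independent of sceval):

// Precision @ N: fraction of actual buyers among the N highest-scored customers.
def precisionAtN(scoresAndLabels: Seq[(Double, Boolean)], n: Int): Double = {
  val topN = scoresAndLabels.sortBy { case (score, _) => -score }.take(n)
  topN.count { case (_, bought) => bought }.toDouble / topN.size
}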
[Pipeline diagram: the source data is split into features (events sequence, demographic attributes, financial statements) and meta data (user id, date); the events sequence goes through a TF transformer, the demographic attributes through a One Hot transformer and the financial statements contribute numerical values, all concatenated into a single vector; a Decision Tree trainer wrapped as a Binary Classifier trainer produces the fitted Decision Tree model; cross-fold validation trains Decision Tree models 1…k, collects scores and labels, and computes the performance metrics.]
Conclusions
1. In this PoC we wanted to show a specific use case for which we could not find a suitable out-of-the-box machine learning framework.
2. We focused on production quality rather than quick-and-dirty prototypes.
3. We wanted the whole workflow to be transparent, tested and reproducible.
4. Nevertheless, we discourage implementing any premature abstractions until you get your first MVP released, and possibly a couple of extra iterations.
5. Our final requirements of what to implement were defined after an iterative and incremental data exploration and knowledge discovery.
6. We used the same production stack together with notebooks during the MVP development/investigation, but with a simpler, flat code structure.
7. Regardless of the programming language/technology, a thoughtful methodology really makes the difference.
Further References
• Sparkz:
https://github.com/gm-spacagna/sparkz
• Other examples and tutorials available at Data Science Vademecum:
https://datasciencevademecum.wordpress.com/
• Data Science Manifesto:
http://www.datasciencemanifesto.org
