A demonstration of using the featuretools package to generate features and aggregates from raw relational data, and of using MLflow to track the entire model-building and hyperparameter-optimization process.
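A minimal, stdlib-only sketch of the kind of parent-child aggregates that featuretools automates with Deep Feature Synthesis; the table and column names here are illustrative, not taken from the demo:

```python
from statistics import mean

# Toy relational data: a parent table (customers) and a child table (orders).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 75.0},
    {"order_id": 12, "customer_id": 2, "amount": 40.0},
]

def aggregate_features(parents, children, key, value, aggs):
    """For each parent row, aggregate the child rows that reference it."""
    features = []
    for p in parents:
        vals = [c[value] for c in children if c[key] == p[key]]
        row = dict(p)
        for name, fn in aggs.items():
            row[f"{name}(orders.{value})"] = fn(vals) if vals else 0
        features.append(row)
    return features

feature_matrix = aggregate_features(
    customers, orders, key="customer_id", value="amount",
    aggs={"COUNT": len, "SUM": sum, "MEAN": mean},
)
```

featuretools generalizes this pattern across an entire entity set, stacking aggregation and transformation primitives to arbitrary depth.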
Productionizing ML Models Using MLflow Model Serving (Databricks)
Productionizing ML models requires ensuring model integrity: efficiently replicating runtime environments across servers and keeping track of how each of our models was created. This helps us trace the root cause of changes and issues over time as we acquire new data and update our models, and gives us greater accountability over our models and the results they generate.
MLflow Model Serving delivers cost-effective, one-click deployment of models for real-time inference. Model versions deployed to Model Serving can also be conveniently managed with the MLflow Model Registry. We will cover three topics: deployment, consumption, and monitoring. For deployment, we will demo deploying different model versions and validating the deployment. For consumption, we will demo connecting Power BI and generating a prediction report using an ML model deployed in MLflow Serving. Lastly, we will wrap up with managing MLflow Serving, including access rights and monitoring capabilities.
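For the consumption step, a scoring request to a served model can be sketched as follows. This assumes the JSON `dataframe_split` input format accepted by recent MLflow scoring servers; the endpoint URL, token, and column names are placeholders, not details from the talk:

```python
import json
from urllib import request

# Example feature row; column names are illustrative.
payload = {
    "dataframe_split": {
        "columns": ["age", "income"],
        "data": [[42, 55000.0]],
    }
}
body = json.dumps(payload).encode("utf-8")

req = request.Request(
    url="https://<databricks-instance>/model/my-model/1/invocations",  # placeholder
    data=body,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <personal-access-token>",  # placeholder
    },
    method="POST",
)
# Actually sending the request needs a live serving endpoint:
# with request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())
```

Power BI can consume the same endpoint through a web-request data source or a Python script step, which is how the prediction report in the demo would be fed.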
Fifth Elephant 2017 Data Pipeline Workshop (Ketan Khairnar)
This document outlines the phases and topics to be covered in a workshop on data pipelines and instrumentation for an Airbnb clone application called yourbnb. The workshop will cover basic instrumentation of host metrics and app services, implementing audit trails and deployment history, using different data stores for telemetry, events, logs and master data, designing data pipelines, and setting up dashboards to monitor key performance indicators. Hands-on exercises will demonstrate adding metrics, implementing an event sourcing architecture, and setting up data visualization with Grafana.
Version Control in AI/Machine Learning by Datmo (Nicholas Walsh)
The talk starts by outlining the history of conventional version control before explaining QoDs (Quantitative Oriented Developers) and the unique problems their ML systems pose from an operations perspective (MLOps). The only status quo solutions are proprietary in-house pipelines (exclusive to companies like Uber, Google, and Facebook) and manual tracking with fragile "glue" code for everyone else.
Datmo works to solve this issue by empowering QoDs in two ways: making MLOps manageable and simple (rather than completely abstracted away), and reducing the amount of glue code to ensure more robust pipelines.
How to Empower a Platform With a Data Pipeline at Scale (Deepak Sood)
StashFin provides personal loans to individuals in India through a web and mobile platform. They have originated over 620,000 loans since being founded in 2016. To scale their platform, StashFin moved from a monolithic architecture to a microservices architecture using AWS services. This included using S3 for storage, EKS for Kubernetes, and AWS Glue and Athena for analytics. They also designed a data pipeline on AWS to handle a large increase in loan applications. The pipeline uses Redis for caching, S3 as the data lake, and Athena for querying large amounts of data stored in S3. This has allowed for faster decisioning, higher reliability, and cost and performance benefits compared to managing their own infrastructure.
Apache Liminal (Incubating): Orchestrate the Machine Learning Pipeline (Databricks)
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way. The platform provides the abstractions and declarative capabilities for data extraction & feature engineering followed by model training and serving; using standard tools and libraries (e.g. Airflow, K8S, Spark, scikit-learn, etc.).
The document discusses moving from data science to MLOps. It defines MLOps as extending DevOps methodology to include machine learning, data science, and data engineering assets. Key concepts of MLOps include iterative development, automation, continuous integration and delivery, versioning, testing, reproducibility, monitoring, source control, and model/feature stores. MLOps helps address challenges of moving models to production like the deployment gap by establishing best practices and tools for testing, deploying, managing, and monitoring models.
MLOps and Data Quality: Deploying Reliable ML Models in Production (Provectus)
Looking to build a robust machine learning infrastructure to streamline MLOps? Learn from Provectus experts how to ensure the success of your MLOps initiative by implementing Data QA components in your ML infrastructure.
For most organizations, the development of multiple machine learning models, their deployment and maintenance in production are relatively new tasks. Join Provectus as we explain how to build an end-to-end infrastructure for machine learning, with a focus on data quality and metadata management, to standardize and streamline machine learning life cycle management (MLOps).
Agenda
- Data Quality and why it matters
- Challenges and solutions of Data Testing
- Challenges and solutions of Model Testing
- MLOps pipelines and why they matter
- How to expand validation pipelines for Data Quality
This document discusses MLOps at OLX, including:
- The main areas of data science work at OLX like search, recommendations, fraud detection, and content moderation.
- How OLX uses teams structured by both feature areas and roles to collaborate on projects.
- A maturity model for MLOps with levels from no MLOps to fully automated processes.
- How OLX has improved from siloed work to cross-functional teams and adding more automation to model creation, release, and application integration over time.
As the commercial world accelerates investment into AI and machine learning, one theme continually appears. Models are being built, but they are not being used. Teams of data scientists around the world are training versatile models, but due to managerial, logistical and infrastructural problems, these models are not making it to production.
To watch the full presentation with visual and audio click here: https://info.cnvrg.io/ml-models-to-production
In this webinar, Solutions Architect Aaron Schneider will diagnose the problem and identify the symptoms. He’ll explain how problems with reproducibility, scalability and collaboration can increase the gap between research and production. The webinar will examine best practices for building a machine learning pipeline that enables quick iteration, deployment and CI/CD to ensure that your company is deploying and maintaining the best services for your customers and clients.
Key takeaways:
- The common issues that block deployment and increase time to production
- How different stakeholders can resolve key issues
- How to accelerate from research to production
- Tools that can make productionizing models easy
- Leveraging Kubernetes and container-based architecture for faster deployment
Watch the full presentation here: https://info.cnvrg.io/ml-models-to-production
“Houston, we have a model...” Introduction to MLOps (Rui Quintino)
The document introduces MLOps (Machine Learning Operations) and the need to operationalize machine learning models beyond just model deployment. It discusses challenges like data and model drift, retraining models, software dependencies, monitoring models in production, and the need for automation, testing, and reproducibility across the full machine learning lifecycle from data to deployment. An example MLOps workflow is shown using GitHub and Azure ML to enable experiment tracking, automation, and continuous integration and delivery of models.
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus (Manasi Vartak)
These are slides from Manasi Vartak's Strata Talk in March 2020 on Robust MLOps with Open-Source.
* Introduction to talk
* What is MLOps?
* Building an MLOps Pipeline
* Real-world Simulations
* Let’s fix the pipeline
* Wrap-up
Tech Leaders' Guide to Effective Building of Machine Learning Products (Gianmario Spacagna)
This document provides guidance for machine learning product managers and technical leaders on building effective ML products. It discusses introducing ML in enterprises, defining product specifications, planning under uncertainty, and building balanced ML teams. It also covers the ML product lifecycle, including tracking experiments, centralized data storage, automated testing, continuous integration, and serverless architectures. Serverless computing can help simplify deployments, improve scalability, and reduce costs.
How to choose the correct framework and define your manifesto for technology practices around the machine learning journey.
With Kubernetes the clear successor in this space, Seldon Core and Kubeflow are the winners in this segment.
Once a model is deployed, you have a responsibility to ensure its reliability and performance in production. That means that in addition to system monitoring, you should be checking and monitoring its ML health and vitals such as accuracy, bias, and variance as new data comes in. In this online workshop we’ll discuss how to build a system to monitor your machine learning model in production on Kubernetes. You’ll learn to keep track of different models and their performance over time, and how to set up custom alerts for your models. We’ll discuss which vitals to monitor and how to measure performance. Join CTO of cnvrg.io, Leah Kolben, in this hands-on workshop on critical practices for monitoring your machine learning models in production. Using the power of Kubernetes, we’ll build a complete system for model tracking that ensures high-performing models in production.
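As a framework-agnostic illustration of the vitals-tracking idea (not the workshop's actual code), a rolling-accuracy monitor with a threshold alert might look like this; in practice the alert value would be exported to a system like Prometheus or Grafana:

```python
from collections import deque

class ModelMonitor:
    """Track a rolling window of prediction outcomes and flag accuracy drops."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.threshold = threshold

    def record(self, prediction, label):
        self.outcomes.append(1 if prediction == label else 0)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def check_alert(self):
        """Return True when rolling accuracy falls below the threshold."""
        acc = self.accuracy
        return acc is not None and acc < self.threshold

monitor = ModelMonitor(window=4, threshold=0.75)
for pred, label in [(1, 1), (0, 0), (1, 0), (1, 0)]:
    monitor.record(pred, label)
# Rolling accuracy is now 2/4 = 0.5, below the 0.75 threshold.
```

The same shape extends to other vitals (bias, variance, latency): keep a window per metric and alert on the windowed statistic rather than on single predictions.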
Watch the full presentation with video and audio here: https://info.cnvrg.io/monitor-machine-learning-model-workshop
What you’ll learn:
- Why we monitor models in production
- The critical vitals to track and monitor performance
- How to set up automated alerts
- How to set up Kubernetes for monitoring
- Use tools like Grafana and Kibana to monitor and visualize your system and ML health
To watch the full live presentation click here: https://info.cnvrg.io/monitor-machine-learning-model-workshop
Using MLOps to Bring ML to Production/The Promise of MLOps (Weaveworks)
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and with how to orchestrate between them? While DevOps and GitOps have gained huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Streamlining your machine learning pipeline is critical for enterprise data science to deliver better business results. Accelerating the process from data to processing to training to deployment and back again will help you get better-performing models, faster. Watch the full presentation with audio and video here: https://info.cnvrg.io/build-machine-learning-pipelines
This presentation will offer solutions to the common challenges data scientists and data engineers face when building a machine learning pipeline.
We will dissect each part of the pipeline and offer strategies on how to design your machine learning pipelines for a more efficient, integrated and automated process. We’ll tackle ways to connect all your data sourcing in one unified location, how to create modular ML components for easy reproducibility, and how to automate MLOps for quick training of models and hyperparameter optimization. We'll streamline frequent deployment of models leveraging the power of Kubernetes. And lastly, you’ll learn to design a monitoring toolkit with Grafana and Kibana for easy CI/CD. Join Solutions Architect Aaron Schneider as he builds an end-to-end machine learning pipeline and explains how to optimize each part for a more efficient workflow.
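The idea of modular, reusable ML components can be sketched as a chain of independent steps that share a context; the step names and the trivial stand-in "model" below are illustrative, not the webinar's code:

```python
class Pipeline:
    """Chain reusable steps; each step takes the shared context dict and returns it."""

    def __init__(self, *steps):
        self.steps = steps

    def run(self, context=None):
        context = dict(context or {})
        for step in self.steps:
            context = step(context)
        return context

# Each step is an independent, swappable component.
def load_data(ctx):
    ctx["rows"] = [(x, 2 * x) for x in range(10)]  # stand-in for unified data sourcing
    return ctx

def train(ctx):
    xs, ys = zip(*ctx["rows"])
    ctx["model"] = {"slope": sum(ys) / sum(xs)}  # trivial stand-in for training
    return ctx

def evaluate(ctx):
    m = ctx["model"]
    ctx["mae"] = sum(abs(m["slope"] * x - y) for x, y in ctx["rows"]) / len(ctx["rows"])
    return ctx

result = Pipeline(load_data, train, evaluate).run()
```

Because each step only depends on the context dict, any component can be rerun, replaced, or tested in isolation, which is what makes the pipeline reproducible and easy to automate.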
Key webinar takeaways:
- Set up an efficient machine learning pipeline
- Learn key MLOps solutions streamlining science and engineering
- Create reusable ML components
- Build a suite of monitoring and visualization tools
- Instantly train and deploy ML models with Kubernetes
- Use CI/CD to design an auto-adaptive machine learning pipeline
Watch the full presentation here: https://info.cnvrg.io/build-machine-learning-pipelines
Reproducible AI Using PyTorch and MLflow (Databricks)
Model reproducibility is becoming the next frontier for successful AI model building and deployment in both research and production scenarios. In this talk we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability, and speed up collaboration for AI projects.
This document discusses MLOps, which is applying DevOps practices and principles to machine learning to enable continuous delivery of ML models. It explains that ML models need continuous improvement through retraining but data scientists currently lack tools for quick iteration, versioning, and deployment. MLOps addresses this by providing ML pipelines, model management, monitoring, and retraining in a reusable workflow similar to how software is developed. Implementing even a basic CI/CD pipeline for ML can help iterate models more quickly than having no pipeline at all. The document encourages building responsible AI through practices like ensuring model performance and addressing bias.
Dependency Inversion Using Ports and Adapters (Mahfuzul Haque)
This document discusses the Ports and Adapters architecture for decoupling an application's core business logic from external dependencies and allowing different services to be plugged in. The aims are to decouple the core logic, allow different services to be plugged in and removed easily, and make the application framework agnostic. An example order processing application is used to show how it evolves from being tightly coupled to external dependencies to following the Ports and Adapters pattern using interfaces, ports, and adapters to isolate the core logic. Code examples are provided in a GitHub repository linked in the document.
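The evolution the document describes can be sketched in Python; the `PaymentPort` interface and adapter names are hypothetical stand-ins for the example order-processing application, not the code from the linked repository:

```python
from abc import ABC, abstractmethod

# Port: the interface the core business logic depends on.
class PaymentPort(ABC):
    @abstractmethod
    def charge(self, amount: float) -> bool: ...

# Core logic depends only on the port, never on a concrete external service.
class OrderProcessor:
    def __init__(self, payments: PaymentPort):
        self.payments = payments

    def place_order(self, amount: float) -> str:
        return "confirmed" if self.payments.charge(amount) else "declined"

# Adapter: plugs a concrete (here, in-memory fake) payment service into the port.
class FakePaymentAdapter(PaymentPort):
    def __init__(self, balance: float):
        self.balance = balance

    def charge(self, amount: float) -> bool:
        if amount <= self.balance:
            self.balance -= amount
            return True
        return False

processor = OrderProcessor(FakePaymentAdapter(balance=100.0))
```

Swapping `FakePaymentAdapter` for a real gateway adapter requires no change to `OrderProcessor`, which is the framework-agnostic decoupling the pattern aims for.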
CI/CD (Continuous Integration/Continuous Deployment) has long been a successful process for most software applications. The same can be done with machine learning applications, offering automated continuous training and continuous deployment of machine learning models. Using CI/CD for machine learning applications creates a truly end-to-end pipeline that closes the feedback loop at every step of the way and maintains high-performing ML models. It can also bridge science and engineering tasks, causing less friction from data, to modeling, to production and back again. Join CEO of cnvrg.io Yochay Ettun as he walks you through how to create a CI/CD pipeline for machine learning and set up continuous deployment in just one click. With a depth of knowledge in all the latest research, Yochay will share with you today's top methods for applying CI/CD to machine learning.
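One building block of such a loop is a retraining trigger. A drift- and data-volume-based trigger might be sketched as follows; the thresholds and parameter names are illustrative assumptions, not the webinar's implementation:

```python
def should_retrain(baseline_accuracy, recent_accuracy, new_rows, *,
                   max_drop=0.05, min_new_rows=1000):
    """Trigger retraining when accuracy drifts or enough new data accumulates."""
    drifted = (baseline_accuracy - recent_accuracy) > max_drop
    enough_data = new_rows >= min_new_rows
    return drifted or enough_data

# Example trigger decisions:
should_retrain(0.90, 0.82, new_rows=200)    # accuracy dropped by 0.08 -> True
should_retrain(0.90, 0.89, new_rows=5000)   # enough new data -> True
should_retrain(0.90, 0.89, new_rows=200)    # neither condition -> False
```

In a real pipeline this predicate would run on a schedule or on data-arrival events, and a `True` result would kick off the training stage of the CI/CD pipeline.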
Webinar takeaways:
Configure and execute continuous training and continuous deployment for ML
Define dependencies and triggers
Automatically connect data pipeline, machine learning pipeline and deployment pipelines
Integrate model bias detection or fairness and accuracy validations
Build monitoring infrastructure to close the data feedback loop
Collect live data for improved model performance
Watch all our webinars at https://cnvrg.io/webinars-and-workshops/
Given at the MLOps Summit 2020 - I cover the origins of MLOps in 2018, how MLOps has evolved from 2018 to 2020, and what I expect for the future of MLOps.
NLP Text Recommendation System Journey to Automated Training (Databricks)
The document discusses the goal of building an NLP text recommender system that provides customer service agents with relevant answers to customer questions; the approach taken, including features for an ML ranking model and an architecture for serving recommendations and training models; and the system's evolution over multiple versions to support multi-tenancy, dynamic training, and rollbacks.
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers, such as tracking experiments, reproducibility, deployment tooling, and model versioning. Get ready to get your hands dirty with a quick ML project using MLflow, released to production, to understand the MLOps lifecycle.
Richard Coffey (x18140785) - Research in Computing CA2
The document discusses applying DevOps practices to machine learning algorithms through MLOps. It defines MLOps as combining DevOps practices with machine learning to improve the reliability and deployment of ML models. The document outlines using Microsoft's Azure ML tools to develop a custom ML application and deploy it using MLOps pipelines, then surveying ML professionals on the value of MLOps. It proposes tracking project progression through a Gantt chart.
Managing the Complete Machine Learning Lifecycle with MLflow (Databricks)
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.
To solve these challenges, Databricks last year unveiled MLflow, an open source project that aims at simplifying the entire ML lifecycle. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
In the past year, the MLflow community has grown quickly: over 120 contributors from over 40 companies have contributed code to the project, and over 200 companies are using MLflow.
In this tutorial, we will show you how using MLflow can help you:
Keep track of experiment runs and results across frameworks.
Execute projects remotely onto a Databricks cluster, and quickly reproduce your runs.
Quickly productionize models using Databricks production jobs, Docker containers, Azure ML, or Amazon SageMaker.
We will demo the building blocks of MLflow as well as the most recent additions since the 1.0 release.
What you will learn:
Understand the three main components of open source MLflow (MLflow Tracking, MLflow Projects, MLflow Models) and how each help address challenges of the ML lifecycle.
How to use MLflow Tracking to record and query experiments: code, data, config, and results.
How to use MLflow Projects packaging format to reproduce runs on any platform.
How to use MLflow Models general format to send models to diverse deployment tools.
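As a toy illustration of what MLflow Tracking records per run (a stdlib-only stand-in, not MLflow's actual API), a minimal in-memory tracker could look like this:

```python
import time
import uuid

class RunTracker:
    """Toy stand-in for MLflow Tracking: record params and metrics per run."""

    def __init__(self):
        self.runs = []

    def start_run(self):
        run = {"run_id": uuid.uuid4().hex, "start_time": time.time(),
               "params": {}, "metrics": {}}
        self.runs.append(run)
        return run

    def log_param(self, run, key, value):
        run["params"][key] = value

    def log_metric(self, run, key, value):
        run["metrics"][key] = value

    def search_runs(self, metric, minimum):
        """Query runs, e.g. all runs whose metric meets a threshold."""
        return [r for r in self.runs if r["metrics"].get(metric, 0) >= minimum]

# Record two hyperparameter trials, then query for the best.
tracker = RunTracker()
for lr, acc in [(0.1, 0.82), (0.01, 0.91)]:
    run = tracker.start_run()
    tracker.log_param(run, "learning_rate", lr)
    tracker.log_metric(run, "accuracy", acc)

best = tracker.search_runs("accuracy", minimum=0.9)
```

MLflow Tracking provides the same record-and-query shape (runs, params, metrics, plus artifacts and code version) behind a persistent store and a comparison UI.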
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
Pre-Register for a Databricks Standard Trial
Basic knowledge of Python programming language
Basic understanding of Machine Learning Concepts
MLflow: Platform for the Complete Machine Learning Lifecycle (Databricks)
Description
Data Science and ML development bring many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work.
MLflow addresses some of these challenges during an ML model development cycle.
Abstract
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
With a short demo of a complete ML model life-cycle example, you will walk away with:
- MLflow concepts and abstractions for models, experiments, and projects
- How to get started with MLflow
- Using the tracking Python APIs during model training
- Using the MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics
As the commercial world accelerates investment into AI and Machine Learning one theme continually appears. Models are being built, but they are not being used. Teams of Data Scientists around the world are training versatile models but due to managerial, logistical and infrastructural problems, these models are not making it to production.
To watch the full presentation with visual and audio click here: https://info.cnvrg.io/ml-models-to-production
In this webinar, Solutions Architect Aaron Schneider will diagnose the problem and identify the symptoms. He’ll explain how reproducibility, scalability and collaboration can increase the gap between research and production. The webinar will examine best practices for building a machine learning pipeline that enables quick iteration, deployment and CI/CD to ensure that your company is deploying and maintaining the best services for you customers and clients.
Key takeaways:
- The common issues that block deployment and increase time to production
- How different stakeholders can resolve key issues
- How to accelerate from research to production
Tools that can make productionizing models easy
- Leveraging Kubernetes and container-based architecture for faster deployment
Watch the full presentation here: https://info.cnvrg.io/ml-models-to-production
“Houston, we have a model...” Introduction to MLOpsRui Quintino
The document introduces MLOps (Machine Learning Operations) and the need to operationalize machine learning models beyond just model deployment. It discusses challenges like data and model drift, retraining models, software dependencies, monitoring models in production, and the need for automation, testing, and reproducibility across the full machine learning lifecycle from data to deployment. An example MLOps workflow is shown using GitHub and Azure ML to enable experiment tracking, automation, and continuous integration and delivery of models.
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusManasi Vartak
These are slides from Manasi Vartak's Strata Talk in March 2020 on Robust MLOps with Open-Source.
* Introduction to talk
* What is MLOps?
* Building an MLOps Pipeline
* Real-world Simulations
* Let’s fix the pipeline
* Wrap-up
Tech leaders guide to effective building of machine learning productsGianmario Spacagna
This document provides guidance for machine learning product managers and technical leaders on building effective ML products. It discusses introducing ML in enterprises, defining product specifications, planning under uncertainty, and building balanced ML teams. It also covers the ML product lifecycle, including tracking experiments, centralized data storage, automated testing, continuous integration, and serverless architectures. Serverless computing can help simplify deployments, improve scalability, and reduce costs.
How to choose correct framework and define your manifesto for technology practices around Machine Learning Journey.
Kubernetes being successor in this space, Seldom Core and Kubeflow is truly winner in this Segment.
Once a model is deployed, you have a responsibility to ensure its reliability and performance in production. That means that in addition to system monitoring, you should be checking and monitoring its ML health and vitals such as accuracy, bias, and variance as new data comes in. In this online workshop we’ll discuss how to build a system to monitor your machine learning model in production on Kubernetes. You’ll learn to keep track of different models and their model performance over time, and how to set up custom alerts for your models. We’ll discuss what types of variants to monitor, and how to measure its performance. Join CTO of cnvrg.io, Leah Kolben in this hands-on workshop on critical practices for monitoring your machine learning models in production. Using the power of Kubernetes, we’ll build a complete system for model tracking that ensures high performing models in production.
Watch the full presentation with video and audio here: https://info.cnvrg.io/monitor-machine-learning-model-workshop
What you’ll learn:
- Why we monitor models in production
- The critical vitals to track and monitor performance
- How to set up automated alerts
- How to set up Kubernetes for monitoring
- Use tools like Grafana and Kibana to monitor and visualize your system and ML health
Using MLOps to Bring ML to Production/The Promise of MLOpsWeaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and how to orchestrate between them? While DevOps and GitOps have made huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Streamlining your machine learning pipeline is critical for enterprise data science to deliver better business results. Accelerating the process from data, to processing to training to deployment and back again will help you get better performing models, faster. Watch the full presentation with audio and video here: https://info.cnvrg.io/build-machine-learning-pipelines
This presentation will offer solutions to the common challenges data scientists and data engineers face when building a machine learning pipeline.
We will dissect each part of the pipeline and offer strategies on how to design your machine learning pipelines for a more efficient, integrated and automated process. We’ll tackle ways to connect all your data sourcing in one unified location, how to create modular ML components for easy reproducibility, and how to automate MLOps for quick training of models and hyperparameter optimization. You’ll streamline frequent deployment of models leveraging the power of Kubernetes, and lastly, you’ll learn to design a monitoring toolkit with Grafana and Kibana for easy CI/CD. Join Solutions Architect Aaron Schneider as he builds an end-to-end machine learning pipeline and explains how to optimize each part for a more efficient workflow.
Key webinar takeaways:
- Set up an efficient machine learning pipeline
- Learn key MLOps solutions streamlining science and engineering
- Create reusable ML components
- Build a suite of monitoring and visualization tools
- Instantly train and deploy ML models with Kubernetes
- Use CI/CD to design an auto-adaptive machine learning pipeline
Reproducible AI Using PyTorch and MLflowDatabricks
Model reproducibility is becoming the next frontier for successfully building and deploying AI models, in both research and production scenarios. In this talk we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability, and speed up collaboration on AI projects.
This document discusses MLOps, which is applying DevOps practices and principles to machine learning to enable continuous delivery of ML models. It explains that ML models need continuous improvement through retraining but data scientists currently lack tools for quick iteration, versioning, and deployment. MLOps addresses this by providing ML pipelines, model management, monitoring, and retraining in a reusable workflow similar to how software is developed. Implementing even a basic CI/CD pipeline for ML can help iterate models more quickly than having no pipeline at all. The document encourages building responsible AI through practices like ensuring model performance and addressing bias.
Dependency inversion using ports and adaptersMahfuzul Haque
This document discusses the Ports and Adapters architecture for decoupling an application's core business logic from external dependencies and allowing different services to be plugged in. The aims are to decouple the core logic, allow different services to be plugged in and removed easily, and make the application framework agnostic. An example order processing application is used to show how it evolves from being tightly coupled to external dependencies to following the Ports and Adapters pattern using interfaces, ports, and adapters to isolate the core logic. Code examples are provided in a GitHub repository linked in the document.
CI/CD (Continuous Integration/ Continuous Deployment) has long been a successful process for most software applications. The same can be done with Machine Learning applications, offering an automated and continuous training and continuous deployment of machine learning models. Using CI/CD for machine learning applications creates a truly end-to-end pipeline that closes the feedback loop at every step of the way, and maintains high performing ML models. It can also bridge science and engineering tasks, causing less friction from data, to modeling, to production and back again. Join CEO of cnvrg.io Yochay Ettun as he brings you through how to create a CI/CD pipeline for machine learning, and set up continuous deployment in just one click. With a depth of knowledge in all the latest research, Yochay will share with you today's top methods for applying CI/CD to machine learning.
Webinar takeaways:
Configure and execute continuous training and continuous deployment for ML
Define dependencies and triggers
Automatically connect data pipeline, machine learning pipeline and deployment pipelines
Integrate model bias detection or fairness and accuracy validations
Build monitoring infrastructure to close the data feedback loop
Collect live data for improved model performance
Watch all our webinars at https://cnvrg.io/webinars-and-workshops/
Given at the MLOps Summit 2020 - I cover the origins of MLOps in 2018, how MLOps has evolved from 2018 to 2020, and what I expect for the future of MLOps
NLP Text Recommendation System Journey to Automated TrainingDatabricks
The document discusses the goal of building an NLP text recommender system to provide customer service agents with relevant answers to customer questions, the approach taken including developing features for an ML ranking model and architecture for recommendations serving, model training, and system evolution over multiple versions to support multi-tenancy, dynamic training, and rollbacks.
MLflow is an MLOps tool that enables data scientists to quickly productionize their Machine Learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers such as tracking experiments, reproducibility, deployment tooling and model versioning. Get ready to get your hands dirty with a quick ML project using MLflow, releasing it to production to understand the MLOps lifecycle.
Richard Coffey (x18140785) - Research in Computing CA2Richard Coffey
The document discusses applying DevOps practices to machine learning algorithms through MLOps. It defines MLOps as combining DevOps practices with machine learning to improve the reliability and deployment of ML models. The document outlines using Microsoft's Azure ML tools to develop a custom ML application and deploy it using MLOps pipelines, then surveying ML professionals on the value of MLOps. It proposes tracking project progression through a Gantt chart.
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.
To solve for these challenges, Databricks unveiled last year MLflow, an open source project that aims at simplifying the entire ML lifecycle. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
In the past year, the MLflow community has grown quickly: over 120 contributors from over 40 companies have contributed code to the project, and over 200 companies are using MLflow.
In this tutorial, we will show you how using MLflow can help you:
Keep track of experiment runs and results across frameworks.
Execute projects remotely on a Databricks cluster, and quickly reproduce your runs.
Quickly productionize models using Databricks production jobs, Docker containers, Azure ML, or Amazon SageMaker.
We will demo the building blocks of MLflow as well as the most recent additions since the 1.0 release.
What you will learn:
Understand the three main components of open source MLflow (MLflow Tracking, MLflow Projects, MLflow Models) and how each helps address challenges of the ML lifecycle.
How to use MLflow Tracking to record and query experiments: code, data, config, and results.
How to use MLflow Projects packaging format to reproduce runs on any platform.
How to use MLflow Models general format to send models to diverse deployment tools.
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
Pre-Register for a Databricks Standard Trial
Basic knowledge of Python programming language
Basic understanding of Machine Learning Concepts
MLFlow: Platform for Complete Machine Learning Lifecycle Databricks
Description
Data Science and ML development bring many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work.
MLflow addresses some of these challenges during an ML model development cycle.
Abstract
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
Through a short demo of a complete ML model life-cycle example, you will walk away with: MLflow concepts and abstractions for models, experiments, and projects; how to get started with MLflow; using the tracking Python APIs during model training; and using the MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics.
mlflow: Accelerating the End-to-End ML lifecycleDatabricks
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
DevBCN Vertex AI - Pipelines for your MLOps workflowsMárton Kodok
In recent years, one of the biggest trends in applications development has been the rise of Machine Learning solutions, tools, and managed platforms. Vertex AI is a managed unified ML platform for all your AI workloads. On the MLOps side, Vertex AI Pipelines solutions let you adopt experiment pipelining beyond the classic build, train, eval, and deploy a model. It is engineered for data scientists and data engineers, and it’s a tremendous help for those teams who don’t have DevOps or sysadmin engineers, as infrastructure management overhead has been almost completely eliminated. Based on practical examples we will demonstrate how Vertex AI Pipelines scores high in terms of developer experience, how it fits custom ML needs, and how to analyze results. It’s a toolset for a fully-fledged machine learning workflow, a sequence of steps in the model development and deployment cycle, such as data preparation/validation, model training, hyperparameter tuning, model validation, and model deployment. Vertex AI comes with all classic resources plus an ML metadata store, a fully managed feature store, and a fully managed pipelines runner. Vertex AI Pipelines is a managed serverless toolkit, which means you don't have to fiddle with infrastructure or back-end resources to run workflows.
"Managing the Complete Machine Learning Lifecycle with MLflow"Databricks
Machine Learning development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open-source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...Databricks
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size. In this deep-dive session, through a complete ML model life-cycle example, you will walk away with:
MLflow concepts and abstractions for models, experiments, and projects
How to get started with MLFlow
Understand aspects of MLflow APIs
Using tracking APIs during model training
Using MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics
Package, save, and deploy an MLflow model
Serve it using MLflow REST API
What’s next and how to contribute
Modern machine learning systems may be very complex and may fall into many pitfalls. It's very easy to unintentionally introduce technical debt into such a complex structure. One approach that solves some of these anti-patterns is a feature store. A feature store is the missing piece filling the gap between raw data and machine learning models. Not only will it help you handle technical debt, but even more importantly it speeds up the time to develop new models.
This document discusses challenges and solutions for machine learning at scale. It begins by describing how machine learning is used in enterprises for business monitoring, optimization, and data monetization. It then covers the machine learning lifecycle from identifying business questions to model deployment. Key topics discussed include modeling approaches, model evolution, standardization, governance, serving models at scale using systems like TensorFlow Serving and Flink, working with data lakes, using notebooks for development, and machine learning with Apache Spark/MLlib.
Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
What MLflow is; what problem it solves in the machine learning lifecycle and how it solves it; how it is used with Databricks; and CI/CD pipelines with Databricks.
MLOps pipelines using MLFlow - From training to productionFabian Hadiji
This talk was given at the Cologne AI and Machine Learning Meetup on April 13, 2023 (https://www.meetup.com/de-DE/cologne-ai-and-machine-learning-meetup/events/291513393/) by Dr. Andreas Weiden, Co-Lead Cloud / Data Engineering at skillbyte: MLOps pipelines using MLFlow - From training to production
In this talk we explore the world of MLOps pipelines and how MLFlow can be used to facilitate workflows for getting your machine learning models from training to production. We will briefly delve into the tracking aspects of MLFlow and how to store experiments and runs. Next, we will move on to an actual use case that involves managing artefacts generated by multiple training pipelines running on a daily schedule. These artefacts are used in prediction services but also in managed vector search engines such as ElasticSearch and Google VertexAI. A simple microservice that polls the MLFlow registry is used to update both REST-APIs running in Kubernetes and to ingest the models into the vector search services. Finally, we will compare different alternatives that were considered.
The ODAHU project is focused on creating services, extensions for third party systems and tools which help to accelerate building enterprise level systems with automated AI/ML models life cycle.
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
Do you know The Cloud Girl? She makes the cloud come alive with pictures and storytelling.
The Cloud Girl, Priyanka Vergadia, Chief Content Officer @Google, joins us to tell us about Scaleable Data Analytics in Google Cloud.
Maybe, with her explanation, we'll finally understand it!
Priyanka is a technical storyteller and content creator who has created over 300 videos, articles, podcasts, courses and tutorials which help developers learn Google Cloud fundamentals, solve their business challenges and pass certifications! Checkout her content on Google Cloud Tech Youtube channel.
Priyanka enjoys drawing and painting which she tries to bring to her advocacy.
Check out her website The Cloud Girl: https://thecloudgirl.dev/ and her new book: https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudMárton Kodok
The document discusses Vertex AI, Google Cloud's unified machine learning platform. It provides an overview of Vertex AI's key capabilities including gathering and labeling datasets at scale, building and training models using AutoML or custom training, deploying models with endpoints, managing models with confidence through explainability and monitoring tools, using pipelines to orchestrate the entire ML workflow, and adapting to changes in data. The conclusion emphasizes that Vertex AI offers an end-to-end platform for all stages of ML development and productionization with tools to make ML more approachable and pipelines that can solve complex tasks.
Vertex AI: Pipelines for your MLOps workflowsMárton Kodok
The document discusses Vertex AI pipelines for MLOps workflows. It begins with an introduction of the speaker and their background. It then discusses what MLOps is, defining three levels of automation maturity. Vertex AI is introduced as Google Cloud's managed ML platform. Pipelines are described as orchestrating the entire ML workflow through components. Custom components and conditionals allow flexibility. Pipelines improve reproducibility and sharing. Changes can trigger pipelines through services like Cloud Build, Eventarc, and Cloud Scheduler to continuously adapt models to new data.
Using MLflow for the Machine Learning project lifecycleParis Data Engineers !
MLflow is an open source project for managing the lifecycle of machine learning projects (from experimentation to deployment), so they integrate better into the ecosystem that surrounds them.
During this presentation we will show the different components of MLflow and give a demonstration of its use both in the context of a Databricks platform and in a local IDE.
databricks ml flow demonstration using automatic features engineering
1.
2. Overview of a typical machine learning model workflow
Fact #1: Doing machine learning IS complex
3. Fact #2: The hardest part of AI is actually not the AI code...
4.
5. Machine learning projects: main concerns
1- Open source ML ecosystem is crowded: for each phase of the ML process, there is a myriad of tools to choose from;
2- Tracking: it is difficult to track by hand which parameters, code, and data went into each experiment to produce a model, especially when working in teams;
3- Reproducibility: without detailed tracking, teams often have trouble getting the same code to work / achieving the same results.
6. Introducing MLflow
- First release in June 2018
- Latest version v1.5, released 19 Dec 2019
7. MLflow addresses machine learning challenges through its 3 main components
What is MLflow? “MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. It currently offers three components:
1. Tracking  2. Projects  3. Models
9. ML tracking API
Single API + UI to track, for each experiment:
▸ Parameters
▸ Metrics
▸ Artefacts (training datasets, …)
Can be used in a standalone script / from a notebook
11. ML projects
- ML projects define a standard packaging format to manage data science code.
- A project can be a simple directory / git repo with code to run.
- The running environment requirements are defined in a simple YAML file.
(MLflow projects: sample YAML project)
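For illustration, a minimal MLproject file for such a directory could look like this (the project name, entry-point script and parameter are hypothetical):

```yaml
name: fraud_detection          # hypothetical project name

conda_env: conda.yaml          # running environment requirements

entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 100}
    command: "python train.py --n-estimators {n_estimators}"
```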
12. ML Models
- MLflow Models is a convention for packaging machine learning models in multiple formats called “flavors”. MLflow offers a variety of tools to help you deploy different flavors of models.
- Each MLflow Model is saved as a directory containing arbitrary files and an MLmodel descriptor file that lists the flavors it can be used in.
(Example of a scikit-learn model)
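For reference, the MLmodel descriptor inside such a directory lists the flavors roughly as follows (the paths and library versions shown are illustrative):

```yaml
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.7.6
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.22.1
```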
13. Model serving commands
mlflow models serve: deploys the model as a local REST API server.
mlflow models build-docker: packages a REST API endpoint serving the model as a Docker image.
mlflow models predict: uses the model to generate a prediction for a local CSV or JSON file.
15. E-commerce fraud detection
We have some JSON profiles representing fictional customers from an e-commerce company (courtesy of RAVELIN: https://github.com/unravelin/code-test-data-science).
The profiles contain information about the customers, their orders, their transactions, what payment methods they used and whether the customer is fraudulent or not.
Our task:
● Transform the JSON profiles into feature vectors:
a. automated feature engineering using the featuretools package
● Construct a model to predict if a customer is fraudulent based on their profile:
a. modeling phase using Python + scikit-learn
b. track experiment results using Databricks MLflow
16. Model building & tracking process
Transform input data (1/ transactions / orders, 2/ labels: fraudulent true/false): a Python script decodes each user profile JSON array into relational pandas dataframes.
Extract features (count orders, min/max/avg transaction amount...): data aggregation can be done using SQL / Spark SQL / pandas dataframes; in our case we will use the featuretools package to automate this phase.
Store the analytical dataset: a Parquet file with customerID | features X… | label.
Baseline classifier: train a random forest model with default parameters (base AUC).
Tuned classifier: use GridSearchCV to tune the best parameters based on cross-validation results (optimized AUC).
MLflow tracking goes here.
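The baseline-vs-tuned step can be sketched with scikit-learn; the synthetic dataset and the small parameter grid below are stand-ins for the real analytical dataset and search space (the MLflow logging calls are indicated as comments):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for the customerID | features | label analytical dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: random forest with default parameters.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
base_auc = roc_auc_score(y_te, base.predict_proba(X_te)[:, 1])

# Tuned: GridSearchCV picks parameters via cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc", cv=3,
).fit(X_tr, y_tr)
opt_auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
# mlflow.log_metric("base_auc", base_auc)
# mlflow.log_metric("opt_auc", opt_auc)

print(round(base_auc, 3), round(opt_auc, 3))
```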
17. Appendix #1:
Deep feature synthesis, used in the featuretools Python package to generate aggregates / apply transformations on relational data
(source: featurelabs.com)
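As a hand-rolled illustration of what deep feature synthesis automates, the count / min / max / avg aggregates mentioned earlier can be computed with a pandas groupby (the tiny table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical transactions table (one row per transaction).
transactions = pd.DataFrame({
    "customerID": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 40.0, 5.0, 5.0, 20.0],
})

# Per-customer aggregate features, one row per customer.
features = transactions.groupby("customerID")["amount"].agg(
    ["count", "min", "max", "mean"]
)
print(features.loc["b", "mean"])  # mean of 5, 5, 20 -> 10.0
```

Deep feature synthesis generates this kind of aggregate automatically across all related tables, stacking them to arbitrary depth.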
18. Appendix #2: How do random forests work?
Main parameters to tune:
- Max_depth: maximum depth of each tree
- Nb_estimators: # of trees
- Min_rows: minimum number of observations for a leaf
- Col_sample: column sample per tree
- Sample_rate: row sampling rate, default 0.63333
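The parameter names above resemble H2O's random forest; since this demo uses scikit-learn, a rough equivalent parameterization could look like the following (the mapping is approximate, not exact):

```python
from sklearn.ensemble import RandomForestClassifier

# Approximate scikit-learn counterparts of the parameters listed above.
clf = RandomForestClassifier(
    n_estimators=100,     # Nb_estimators: number of trees
    max_depth=10,         # Max_depth: maximum depth of each tree
    min_samples_leaf=5,   # Min_rows: min observations in a leaf
    max_features="sqrt",  # Col_sample: columns sampled (per split here)
    max_samples=0.63333,  # Sample_rate: fraction of rows per tree
    bootstrap=True,       # max_samples requires bootstrap sampling
    random_state=0,
)
print(clf.get_params()["n_estimators"])
```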
19. Complete code on GitHub:
https://github.com/mmejdoubi/mlflow_fraud_ecom/blob/master/ravelin_fraud_RF_mlflow_v1.ipynb