SlideShare a Scribd company logo
1 of 21
Download to read offline
2 - MLBox
2 - MLBox
a – Features
MLBox : a fully automated pipeline
MLBox: features
Initialisation:
From a raw dataset to a cleaned dataset with numerical features.
• Files reading:
 Reading of several files (csv, xls, json and hdf5)
 Task detection (binary/multiclass classification or regression)
 Creation of the training data set and the test data set
• Preprocessing/cleaning:
 Dropping duplicates and constant features
 Dropping drifting features  see « drift »
• Encoding:
 Converting features to a unique format (float if possible or str)
 Converting lists
 Converting dates into timestamp
 Target encoding for classification task only
 Categorical features encoding (several strategies available !)
 Missing values encoding (several strategies available !)
MLBox: features
Validation:
Several models are tested and cross-validated and the best one is fitted.
• On features:
 Feature engineering : neural network features engineering  see « entity embedding »
 Feature selection : filter methods, wrapper methods and L1 regularization
• On estimator:
 TestIng of a wide range of accurate estimators: Linear model, Random Forest, XGBoost, LightGBM…
 Model blending: stacking, boosting, bagging
 Hyper-parameters tuning (using TPE algorithm)
• On validation :
 Choice of several metrics: accuracy, log-loss, AUC, f1-score, MSE, MAE, … or customized
 Validation parameters: number of folds, random state, …
MLBox: features
Application:
We fit the whole pipeline and predict the target on the test set.
• Prediction:
 Target prediction (class probabilities for classification)
 Dumping fitted models and final predictions (.csv file)
 CPU time display
• Models interpretation:
 Features importance
 Leak detection
2 - MLBox
b – focus #1 on MLBox
« Drift »: a brand new algorithm !
Focus 1 on MLBox: « drift detection »
Training data set Model
Learning
hyper parameters
tuning
Test data set
Scoring on test
data
Prediction
The model predicts the
target
Hypothesis : data are independant
and identically distributed (iid)
 The issue
Focus 1 on MLBox: « drift detection »
Training data set Model
Test data set
In practice: data are biased
Solution
Take into account this drift
when fitting the model
Scoring on test
data
Goal
Control the model’s
performance
 The issue
Lower score !
Focus 1 on MLBox: « drift detection »
 Definition
P(x,y) = P(x) . P(y|x)
Training data set Concept shiftCovariate shift
We deal with this
case
Focus 1 on MLBox: « drift detection »
 The solution
Focus 1 on MLBox: « drift detection »
 The algorithm
Univariate drift estimation
- Labelling data whether it belongs to the training or the test set
- Fitting a classifier to separate both classes
- Validation: roc AUC score gives us a drift measure
1
(Greedy) Recursive Drift Elimination
For each feature in decreasing drift order:
 We drop the feature
 We cross-validate the new model
 If delta score is > p :
- We keep the feature
- If not we drop it permanently
NB :
 Greedy algorithm: we can set a limited number of features to try or a drift
threshold under which features are kept
 We can set different p
2
Focus 1 on MLBox: « drift detection »
Maximum
threshold for
delta score
Delta score
(%)
Drift
measure
 Best case scenario
• 100% drifting features (we drop it)
• 0% drifting features (we keep it)
• Drifting features that are not important (we drop it)
 Worst case scenario
• Important drifting features (2 sub-cases) :
c – Focus #2 on MLBox
« Entity embeddings »
2 - MLBox
Focus 2 on MLBox: « entity embeddings »
 How to encode a categorical feature?
entity embedding How to do ???
Accurate + scalable +
understandable + quick
none ? 
Methods Explanation Advantages Drawbacks
label encoding
For each categorical feature, we encode each
value using the lexicographic order (A -> 1, B ->
2, …)
Quick + scalable
Naive encoding + non
understandable
dummification
We « binarise » all categorical features (ex :
var1_A, var1_B, …)
Quick + less naive encoding +
understandable
Not scalable
Random projection Random label encoding in k dimension
Quick + scalable + less naive
encoding
Non understandable
discretization We gather values scalable + understandable Loss of information
Focus 2 on MLBox: « entity embeddings »
 The idea of entity embedding
We would like to learn the best vectorial representation in an euclidian space:
• Label encoding/random projection : distance between 2 values = random
• Dummification : distance between 2 values = √2
• Entity embedding : distance between 2 « similar » values low and vice versa
 Solution : use a neural network to learn this representation
Focus 2 on MLBox: « entity embeddings »
 The principle of entity embedding
Choice of the number of units for the « embedded layer » :
- Constant (2 or 3)
- Proportionnal to the number of values (threshold = 5)
- Using a discretization method (Edge ML)
Choice of the ouput: the target or
other features for an unsupervised
problem
Encoding : we get the weighs
between the « one-hot-encoding
layer » and the « embedded layer
»
Choice of the number of units in the layer 1 :
take into account interactions between
embeddings
Choice of the number of
layers / dropout: to be tested !
Focus 2 on MLBox: « entity embeddings »
 Test on a real data set
• DataScience.net challenge: https://www.datascience.net/fr/challenge/26/details
• We would like to predict the price of an automobile insurance policy
• Training data set: 300.000 rows and 43 features like: brand, age, postcode, energy consumption, …
Top 10:
155 brands 6 energy
consumptions
target
Focus 2 on MLBox: « entity embeddings »
 Test on a real data set: accuracy + quickness
Focus 2 on MLBox: « entity embeddings »
 Test on a real data set: understanding + scalability
Threshold = 5 units for the embedded layer
Focus 2 on MLBox: « entity embeddings »
Energy
consumptions
(colors)
Increasing prices (size)

More Related Content

What's hot

Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013MLconf
 
Object- Relational Persistence in Smalltalk
Object- Relational Persistence in SmalltalkObject- Relational Persistence in Smalltalk
Object- Relational Persistence in SmalltalkESUG
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflowDatabricks
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
HyperGraphDb
HyperGraphDbHyperGraphDb
HyperGraphDbborislav
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 
Pipeline oriented data analytics
Pipeline oriented data analyticsPipeline oriented data analytics
Pipeline oriented data analyticsBorys Biletskyy
 
Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)Nikhil Garg
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013MLconf
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataBenjamin Bengfort
 
Introduction to NHibernate
Introduction to NHibernateIntroduction to NHibernate
Introduction to NHibernateDublin Alt,Net
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflowDatabricks
 
Parallel Computing in .NET
Parallel Computing in .NETParallel Computing in .NET
Parallel Computing in .NETmeghantaylor
 
IRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entityIRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entitykartik179
 
Object Oriented PHP Overview
Object Oriented PHP OverviewObject Oriented PHP Overview
Object Oriented PHP OverviewLarry Ball
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 

What's hot (20)

Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013
 
Object- Relational Persistence in Smalltalk
Object- Relational Persistence in SmalltalkObject- Relational Persistence in Smalltalk
Object- Relational Persistence in Smalltalk
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
HyperGraphDb
HyperGraphDbHyperGraphDb
HyperGraphDb
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
Pipeline oriented data analytics
Pipeline oriented data analyticsPipeline oriented data analytics
Pipeline oriented data analytics
 
Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Open Platform for AI & ML modeling
Open Platform for AI & ML modelingOpen Platform for AI & ML modeling
Open Platform for AI & ML modeling
 
Story story ppt
Story story pptStory story ppt
Story story ppt
 
Introduction to NHibernate
Introduction to NHibernateIntroduction to NHibernate
Introduction to NHibernate
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
 
Parallel Computing in .NET
Parallel Computing in .NETParallel Computing in .NET
Parallel Computing in .NET
 
IRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entityIRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entity
 
Distil
DistilDistil
Distil
 
Object Oriented PHP Overview
Object Oriented PHP OverviewObject Oriented PHP Overview
Object Oriented PHP Overview
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 

Similar to MLBox

Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material Bryan Yang
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorizationAndreas Loupasakis
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization Warply
 
Exploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in RExploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in RSatoshi Kato
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson ChallengeRaouf KESKES
 
Oracle Query Optimizer - An Introduction
Oracle Query Optimizer - An IntroductionOracle Query Optimizer - An Introduction
Oracle Query Optimizer - An Introductionadryanbub
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksSpark Summit
 
MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series BigML, Inc
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
ProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering ToolkiProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering ToolkiDan Ofer
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEFeng Zhu
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsRajendran
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 

Similar to MLBox (20)

Ember
EmberEmber
Ember
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
R user group meeting 25th jan 2017
R user group meeting 25th jan 2017R user group meeting 25th jan 2017
R user group meeting 25th jan 2017
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Exploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in RExploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in R
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
ADMET.pptx
ADMET.pptxADMET.pptx
ADMET.pptx
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
Oracle Query Optimizer - An Introduction
Oracle Query Optimizer - An IntroductionOracle Query Optimizer - An Introduction
Oracle Query Optimizer - An Introduction
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - Databricks
 
MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
ProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering ToolkiProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering Toolki
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 

Recently uploaded

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 

Recently uploaded (20)

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

MLBox

  • 2. 2 - MLBox a – Features
  • 3. MLBox : a fully automated pipeline
  • 4. MLBox: features Initialisation: From a raw dataset to a cleaned dataset with numerical features. • Files reading:  Reading of several files (csv, xls, json and hdf5)  Task detection (binary/multiclass classification or regression)  Creation of the training data set and the test data set • Preprocessing/cleaning:  Dropping duplicates and constant features  Dropping drifting features  see « drift » • Encoding:  Converting features to a unique format (float if possible or str)  Converting lists  Converting dates into timestamp  Target encoding for classification task only  Categorical features encoding (several strategies available !)  Missing values encoding (several strategies available !)
  • 5. MLBox: features Validation: Several models are tested and cross-validated and the best one is fitted. • On features:  Feature engineering : neural network features engineering  see « entity embedding »  Feature selection : filter methods, wrapper methods and L1 regularization • On estimator:  TestIng of a wide range of accurate estimators: Linear model, Random Forest, XGBoost, LightGBM…  Model blending: stacking, boosting, bagging  Hyper-parameters tuning (using TPE algorithm) • On validation :  Choice of several metrics: accuracy, log-loss, AUC, f1-score, MSE, MAE, … or customized  Validation parameters: number of folds, random state, …
  • 6. MLBox: features Application: We fit the whole pipeline and predict the target on the test set. • Prediction:  Target prediction (class probabilities for classification)  Dumping fitted models and final predictions (.csv file)  CPU time display • Models interpretation:  Features importance  Leak detection
  • 7. 2 - MLBox b – focus #1 on MLBox « Drift »: a brand new algorithm !
  • 8. Focus 1 on MLBox: « drift detection » Training data set Model Learning hyper parameters tuning Test data set Scoring on test data Prediction The model predicts the target Hypothesis : data are independant and identically distributed (iid)  The issue
  • 9. Focus 1 on MLBox: « drift detection » Training data set Model Test data set In practice: data are biased Solution Take into account this drift when fitting the model Scoring on test data Goal Control the model’s performance  The issue Lower score !
  • 10. Focus 1 on MLBox: « drift detection »  Definition P(x,y) = P(x) . P(y|x) Training data set Concept shiftCovariate shift We deal with this case
  • 11. Focus 1 on MLBox: « drift detection »  The solution
  • 12. Focus 1 on MLBox: « drift detection »  The algorithm Univariate drift estimation - Labelling data whether it belongs to the training or the test set - Fitting a classifier to separate both classes - Validation: roc AUC score gives us a drift measure 1 (Greedy) Recursive Drift Elimination For each feature in decreasing drift order:  We drop the feature  We cross-validate the new model  If delta score is > p : - We keep the feature - If not we drop it permanently NB :  Greedy algorithm: we can set a limited number of features to try or a drift threshold under which features are kept  We can set different p 2
  • 13. Focus 1 on MLBox: « drift detection » Maximum threshold for delta score Delta score (%) Drift measure  Best case scenario • 100% drifting features (we drop it) • 0% drifting features (we keep it) • Drifting features that are not important (we drop it)  Worst case scenario • Important drifting features (2 sub-cases) :
  • 14. c – Focus #2 on MLBox « Entity embeddings » 2 - MLBox
  • 15. Focus 2 on MLBox: « entity embeddings »  How to encode a categorical feature? entity embedding How to do ??? Accurate + scalable + understandable + quick none ?  Methods Explanation Advantages Drawbacks label encoding For each categorical feature, we encode each value using the lexicographic order (A -> 1, B -> 2, …) Quick + scalable Naive encoding + non understandable dummification We « binarise » all categorical features (ex : var1_A, var1_B, …) Quick + less naive encoding + understandable Not scalable Random projection Random label encoding in k dimension Quick + scalable + less naive encoding Non understandable discretization We gather values scalable + understandable Loss of information
  • 16. Focus 2 on MLBox: « entity embeddings »  The idea of entity embedding We would like to learn the best vectorial representation in an euclidian space: • Label encoding/random projection : distance between 2 values = random • Dummification : distance between 2 values = √2 • Entity embedding : distance between 2 « similar » values low and vice versa  Solution : use a neural network to learn this representation
  • 17. Focus 2 on MLBox: « entity embeddings »  The principle of entity embedding Choice of the number of units for the « embedded layer » : - Constant (2 or 3) - Proportionnal to the number of values (threshold = 5) - Using a discretization method (Edge ML) Choice of the ouput: the target or other features for an unsupervised problem Encoding : we get the weighs between the « one-hot-encoding layer » and the « embedded layer » Choice of the number of units in the layer 1 : take into account interactions between embeddings Choice of the number of layers / dropout: to be tested !
  • 18. Focus 2 on MLBox: « entity embeddings »  Test on a real data set • DataScience.net challenge: https://www.datascience.net/fr/challenge/26/details • We would like to predict the price of an automobile insurance policy • Training data set: 300.000 rows and 43 features like: brand, age, postcode, energy consumption, … Top 10: 155 brands 6 energy consumptions target
  • 19. Focus 2 on MLBox: « entity embeddings »  Test on a real data set: accuracy + quickness
  • 20. Focus 2 on MLBox: « entity embeddings »  Test on a real data set: understanding + scalability Threshold = 5 units for the embedded layer
  • 21. Focus 2 on MLBox: « entity embeddings » Energy consumptions (colors) Increasing prices (size)