SlideShare a Scribd company logo
2 - MLBox
2 - MLBox
a – Features
MLBox : a fully automated pipeline
MLBox: features
Initialisation:
From a raw dataset to a cleaned dataset with numerical features.
• Files reading:
 Reading of several files (csv, xls, json and hdf5)
 Task detection (binary/multiclass classification or regression)
 Creation of the training data set and the test data set
• Preprocessing/cleaning:
 Dropping duplicates and constant features
 Dropping drifting features  see « drift »
• Encoding:
 Converting features to a unique format (float if possible or str)
 Converting lists
 Converting dates into timestamp
 Target encoding for classification task only
 Categorical features encoding (several strategies available !)
 Missing values encoding (several strategies available !)
MLBox: features
Validation:
Several models are tested and cross-validated and the best one is fitted.
• On features:
 Feature engineering : neural network features engineering  see « entity embedding »
 Feature selection : filter methods, wrapper methods and L1 regularization
• On estimator:
 TestIng of a wide range of accurate estimators: Linear model, Random Forest, XGBoost, LightGBM…
 Model blending: stacking, boosting, bagging
 Hyper-parameters tuning (using TPE algorithm)
• On validation :
 Choice of several metrics: accuracy, log-loss, AUC, f1-score, MSE, MAE, … or customized
 Validation parameters: number of folds, random state, …
MLBox: features
Application:
We fit the whole pipeline and predict the target on the test set.
• Prediction:
 Target prediction (class probabilities for classification)
 Dumping fitted models and final predictions (.csv file)
 CPU time display
• Models interpretation:
 Features importance
 Leak detection
2 - MLBox
b – focus #1 on MLBox
« Drift »: a brand new algorithm !
Focus 1 on MLBox: « drift detection »
Training data set Model
Learning
hyper parameters
tuning
Test data set
Scoring on test
data
Prediction
The model predicts the
target
Hypothesis : data are independant
and identically distributed (iid)
 The issue
Focus 1 on MLBox: « drift detection »
Training data set Model
Test data set
In practice: data are biased
Solution
Take into account this drift
when fitting the model
Scoring on test
data
Goal
Control the model’s
performance
 The issue
Lower score !
Focus 1 on MLBox: « drift detection »
 Definition
P(x,y) = P(x) . P(y|x)
Training data set Concept shiftCovariate shift
We deal with this
case
Focus 1 on MLBox: « drift detection »
 The solution
Focus 1 on MLBox: « drift detection »
 The algorithm
Univariate drift estimation
- Labelling data whether it belongs to the training or the test set
- Fitting a classifier to separate both classes
- Validation: roc AUC score gives us a drift measure
1
(Greedy) Recursive Drift Elimination
For each feature in decreasing drift order:
 We drop the feature
 We cross-validate the new model
 If delta score is > p :
- We keep the feature
- If not we drop it permanently
NB :
 Greedy algorithm: we can set a limited number of features to try or a drift
threshold under which features are kept
 We can set different p
2
Focus 1 on MLBox: « drift detection »
Maximum
threshold for
delta score
Delta score
(%)
Drift
measure
 Best case scenario
• 100% drifting features (we drop it)
• 0% drifting features (we keep it)
• Drifting features that are not important (we drop it)
 Worst case scenario
• Important drifting features (2 sub-cases) :
c – Focus #2 on MLBox
« Entity embeddings »
2 - MLBox
Focus 2 on MLBox: « entity embeddings »
 How to encode a categorical feature?
entity embedding How to do ???
Accurate + scalable +
understandable + quick
none ? 
Methods Explanation Advantages Drawbacks
label encoding
For each categorical feature, we encode each
value using the lexicographic order (A -> 1, B ->
2, …)
Quick + scalable
Naive encoding + non
understandable
dummification
We « binarise » all categorical features (ex :
var1_A, var1_B, …)
Quick + less naive encoding +
understandable
Not scalable
Random projection Random label encoding in k dimension
Quick + scalable + less naive
encoding
Non understandable
discretization We gather values scalable + understandable Loss of information
Focus 2 on MLBox: « entity embeddings »
 The idea of entity embedding
We would like to learn the best vectorial representation in an euclidian space:
• Label encoding/random projection : distance between 2 values = random
• Dummification : distance between 2 values = √2
• Entity embedding : distance between 2 « similar » values low and vice versa
 Solution : use a neural network to learn this representation
Focus 2 on MLBox: « entity embeddings »
 The principle of entity embedding
Choice of the number of units for the « embedded layer » :
- Constant (2 or 3)
- Proportionnal to the number of values (threshold = 5)
- Using a discretization method (Edge ML)
Choice of the ouput: the target or
other features for an unsupervised
problem
Encoding : we get the weighs
between the « one-hot-encoding
layer » and the « embedded layer
»
Choice of the number of units in the layer 1 :
take into account interactions between
embeddings
Choice of the number of
layers / dropout: to be tested !
Focus 2 on MLBox: « entity embeddings »
 Test on a real data set
• DataScience.net challenge: https://www.datascience.net/fr/challenge/26/details
• We would like to predict the price of an automobile insurance policy
• Training data set: 300.000 rows and 43 features like: brand, age, postcode, energy consumption, …
Top 10:
155 brands 6 energy
consumptions
target
Focus 2 on MLBox: « entity embeddings »
 Test on a real data set: accuracy + quickness
Focus 2 on MLBox: « entity embeddings »
 Test on a real data set: understanding + scalability
Threshold = 5 units for the embedded layer
Focus 2 on MLBox: « entity embeddings »
Energy
consumptions
(colors)
Increasing prices (size)

More Related Content

What's hot

Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013
MLconf
 
Object- Relational Persistence in Smalltalk
Object- Relational Persistence in SmalltalkObject- Relational Persistence in Smalltalk
Object- Relational Persistence in Smalltalk
ESUG
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
Databricks
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
HyperGraphDb
HyperGraphDbHyperGraphDb
HyperGraphDb
borislav
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
Korea Sdec
 
Pipeline oriented data analytics
Pipeline oriented data analyticsPipeline oriented data analytics
Pipeline oriented data analytics
Borys Biletskyy
 
Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)
Nikhil Garg
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
Grant Ingersoll
 
Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013
MLconf
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
Benjamin Bengfort
 
Open Platform for AI & ML modeling
Open Platform for AI & ML modelingOpen Platform for AI & ML modeling
Open Platform for AI & ML modeling
Institute of Contemporary Sciences
 
Story story ppt
Story story pptStory story ppt
Story story ppt
Pooja Patil
 
Introduction to NHibernate
Introduction to NHibernateIntroduction to NHibernate
Introduction to NHibernate
Dublin Alt,Net
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
Databricks
 
Parallel Computing in .NET
Parallel Computing in .NETParallel Computing in .NET
Parallel Computing in .NET
meghantaylor
 
IRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entityIRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entity
kartik179
 
Distil
DistilDistil
Distil
miso_uam
 
Object Oriented PHP Overview
Object Oriented PHP OverviewObject Oriented PHP Overview
Object Oriented PHP Overview
Larry Ball
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
Ted Dunning
 

What's hot (20)

Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013Jake Mannix, MLconf 2013
Jake Mannix, MLconf 2013
 
Object- Relational Persistence in Smalltalk
Object- Relational Persistence in SmalltalkObject- Relational Persistence in Smalltalk
Object- Relational Persistence in Smalltalk
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
HyperGraphDb
HyperGraphDbHyperGraphDb
HyperGraphDb
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
Pipeline oriented data analytics
Pipeline oriented data analyticsPipeline oriented data analytics
Pipeline oriented data analytics
 
Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)Building A Machine Learning Platform At Quora (1)
Building A Machine Learning Platform At Quora (1)
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013Xavier amatriain, dir algorithms netflix m lconf 2013
Xavier amatriain, dir algorithms netflix m lconf 2013
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Open Platform for AI & ML modeling
Open Platform for AI & ML modelingOpen Platform for AI & ML modeling
Open Platform for AI & ML modeling
 
Story story ppt
Story story pptStory story ppt
Story story ppt
 
Introduction to NHibernate
Introduction to NHibernateIntroduction to NHibernate
Introduction to NHibernate
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
 
Parallel Computing in .NET
Parallel Computing in .NETParallel Computing in .NET
Parallel Computing in .NET
 
IRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entityIRE2014 Filtering Tweets Related to an entity
IRE2014 Filtering Tweets Related to an entity
 
Distil
DistilDistil
Distil
 
Object Oriented PHP Overview
Object Oriented PHP OverviewObject Oriented PHP Overview
Object Oriented PHP Overview
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 

Similar to MLBox

Ember
EmberEmber
Ember
mrphilroth
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
Databricks
 
R user group meeting 25th jan 2017
R user group meeting 25th jan 2017R user group meeting 25th jan 2017
R user group meeting 25th jan 2017
Garrett Teoh Hor Keong
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
Bryan Yang
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
Andreas Loupasakis
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
Warply
 
Exploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in RExploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in R
Satoshi Kato
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
Abdalmassih Yakeen
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
ADMET.pptx
ADMET.pptxADMET.pptx
ADMET.pptx
Santu Chall
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
Raouf KESKES
 
Oracle Query Optimizer - An Introduction
Oracle Query Optimizer - An IntroductionOracle Query Optimizer - An Introduction
Oracle Query Optimizer - An Introduction
adryanbub
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - Databricks
Spark Summit
 
MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series
BigML, Inc
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
ProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering ToolkiProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering Toolki
Dan Ofer
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Feng Zhu
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
Rajendran
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
Li Shen
 

Similar to MLBox (20)

Ember
EmberEmber
Ember
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
R user group meeting 25th jan 2017
R user group meeting 25th jan 2017R user group meeting 25th jan 2017
R user group meeting 25th jan 2017
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Exploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in RExploratory data analysis using xgboost package in R
Exploratory data analysis using xgboost package in R
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
ADMET.pptx
ADMET.pptxADMET.pptx
ADMET.pptx
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
Oracle Query Optimizer - An Introduction
Oracle Query Optimizer - An IntroductionOracle Query Optimizer - An Introduction
Oracle Query Optimizer - An Introduction
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - Databricks
 
MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series MLSEV. Logistic Regression, Deepnets, and Time Series
MLSEV. Logistic Regression, Deepnets, and Time Series
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
ProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering ToolkiProFET - Protein Feature Engineering Toolki
ProFET - Protein Feature Engineering Toolki
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 

Recently uploaded

Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
European Sustainable Phosphorus Platform
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 

Recently uploaded (20)

Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 

MLBox

  • 2. 2 - MLBox a – Features
  • 3. MLBox : a fully automated pipeline
  • 4. MLBox: features Initialisation: From a raw dataset to a cleaned dataset with numerical features. • Files reading:  Reading of several files (csv, xls, json and hdf5)  Task detection (binary/multiclass classification or regression)  Creation of the training data set and the test data set • Preprocessing/cleaning:  Dropping duplicates and constant features  Dropping drifting features  see « drift » • Encoding:  Converting features to a unique format (float if possible or str)  Converting lists  Converting dates into timestamp  Target encoding for classification task only  Categorical features encoding (several strategies available !)  Missing values encoding (several strategies available !)
  • 5. MLBox: features Validation: Several models are tested and cross-validated and the best one is fitted. • On features:  Feature engineering : neural network features engineering  see « entity embedding »  Feature selection : filter methods, wrapper methods and L1 regularization • On estimator:  TestIng of a wide range of accurate estimators: Linear model, Random Forest, XGBoost, LightGBM…  Model blending: stacking, boosting, bagging  Hyper-parameters tuning (using TPE algorithm) • On validation :  Choice of several metrics: accuracy, log-loss, AUC, f1-score, MSE, MAE, … or customized  Validation parameters: number of folds, random state, …
  • 6. MLBox: features Application: We fit the whole pipeline and predict the target on the test set. • Prediction:  Target prediction (class probabilities for classification)  Dumping fitted models and final predictions (.csv file)  CPU time display • Models interpretation:  Features importance  Leak detection
  • 7. 2 - MLBox b – focus #1 on MLBox « Drift »: a brand new algorithm !
  • 8. Focus 1 on MLBox: « drift detection » Training data set Model Learning hyper parameters tuning Test data set Scoring on test data Prediction The model predicts the target Hypothesis : data are independant and identically distributed (iid)  The issue
  • 9. Focus 1 on MLBox: « drift detection » Training data set Model Test data set In practice: data are biased Solution Take into account this drift when fitting the model Scoring on test data Goal Control the model’s performance  The issue Lower score !
  • 10. Focus 1 on MLBox: « drift detection »  Definition P(x,y) = P(x) . P(y|x) Training data set Concept shiftCovariate shift We deal with this case
  • 11. Focus 1 on MLBox: « drift detection »  The solution
  • 12. Focus 1 on MLBox: « drift detection »  The algorithm Univariate drift estimation - Labelling data whether it belongs to the training or the test set - Fitting a classifier to separate both classes - Validation: roc AUC score gives us a drift measure 1 (Greedy) Recursive Drift Elimination For each feature in decreasing drift order:  We drop the feature  We cross-validate the new model  If delta score is > p : - We keep the feature - If not we drop it permanently NB :  Greedy algorithm: we can set a limited number of features to try or a drift threshold under which features are kept  We can set different p 2
  • 13. Focus 1 on MLBox: « drift detection » Maximum threshold for delta score Delta score (%) Drift measure  Best case scenario • 100% drifting features (we drop it) • 0% drifting features (we keep it) • Drifting features that are not important (we drop it)  Worst case scenario • Important drifting features (2 sub-cases) :
  • 14. c – Focus #2 on MLBox « Entity embeddings » 2 - MLBox
  • 15. Focus 2 on MLBox: « entity embeddings »  How to encode a categorical feature? entity embedding How to do ??? Accurate + scalable + understandable + quick none ?  Methods Explanation Advantages Drawbacks label encoding For each categorical feature, we encode each value using the lexicographic order (A -> 1, B -> 2, …) Quick + scalable Naive encoding + non understandable dummification We « binarise » all categorical features (ex : var1_A, var1_B, …) Quick + less naive encoding + understandable Not scalable Random projection Random label encoding in k dimension Quick + scalable + less naive encoding Non understandable discretization We gather values scalable + understandable Loss of information
  • 16. Focus 2 on MLBox: « entity embeddings »  The idea of entity embedding We would like to learn the best vectorial representation in an euclidian space: • Label encoding/random projection : distance between 2 values = random • Dummification : distance between 2 values = √2 • Entity embedding : distance between 2 « similar » values low and vice versa  Solution : use a neural network to learn this representation
  • 17. Focus 2 on MLBox: « entity embeddings »  The principle of entity embedding Choice of the number of units for the « embedded layer » : - Constant (2 or 3) - Proportionnal to the number of values (threshold = 5) - Using a discretization method (Edge ML) Choice of the ouput: the target or other features for an unsupervised problem Encoding : we get the weighs between the « one-hot-encoding layer » and the « embedded layer » Choice of the number of units in the layer 1 : take into account interactions between embeddings Choice of the number of layers / dropout: to be tested !
  • 18. Focus 2 on MLBox: « entity embeddings »  Test on a real data set • DataScience.net challenge: https://www.datascience.net/fr/challenge/26/details • We would like to predict the price of an automobile insurance policy • Training data set: 300.000 rows and 43 features like: brand, age, postcode, energy consumption, … Top 10: 155 brands 6 energy consumptions target
  • 19. Focus 2 on MLBox: « entity embeddings »  Test on a real data set: accuracy + quickness
  • 20. Focus 2 on MLBox: « entity embeddings »  Test on a real data set: understanding + scalability Threshold = 5 units for the embedded layer
  • 21. Focus 2 on MLBox: « entity embeddings » Energy consumptions (colors) Increasing prices (size)