Effective data pre-processing for AutoML
Joseph Giovanelli Besim Bilalli Alberto Abelló
j.giovanelli@unibo.it bbilalli@essi.upc.edu aabello@essi.upc.edu
The data analytics process
Based on the problem to tackle:
• Data Pipeline Selection and Optimization (DPSO) [1] consists of finding the best data pipeline;
• Combined Algorithm Selection and Hyperparameter Optimization (CASH) [2] consists of finding the best algorithm and its configuration.
[1] Alexandre Quemy. 2020. Two-stage optimization for machine learning workflow. Information Systems 92 (2020).
[2] Chris Thornton, Frank Hutter, Holger H. Hoos, et al. 2013. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In KDD.
AutoML and data analytics process
Automated Machine Learning (AutoML) aims to automate the whole data analytics process.
Highlight:
• usage of state-of-the-art optimization techniques, e.g., Sequential Model-Based Optimization (SMBO) [3]
Drawback:
• neglect of data pre-processing
[3] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
Data pre-processing complexity
List of transformations:
• E - Encoding;
• N - Normalization;
• D - Discretization;
• I - Imputation;
• R - Rebalancing;
• F - Feature Engineering.
Framework: scikit-learn [4]
Framework-related rules:
1 - a precedence edge exists between the row and the column,
0 - a precedence edge does not exist between the row and the column,
x - the combination is meaningless.
[4] https://scikit-learn.org/stable/
Data pre-processing complexity
Framework-related rules:
1 - a precedence edge exists between the row and the column,
0 - a precedence edge does not exist between the row and the column,
x - the combination is meaningless.
Exhaustive set of data pipeline prototypes:
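A minimal sketch of how an exhaustive set of prototypes could be enumerated from precedence rules. The edges below are hypothetical, since the slide's actual matrix is not reproduced in the text, and a prototype is assumed here to be any ordering of the six transformations that respects every precedence edge.

```python
from itertools import permutations

TRANSFORMATIONS = ["E", "N", "D", "I", "R", "F"]

# Hypothetical precedence edges (row, column): the row transformation must be
# applied before the column transformation. Illustrative only; these are NOT
# the rules shown on the slide.
HYPOTHETICAL_EDGES = {("I", "E"), ("I", "N"), ("E", "N")}


def respects(order, edges):
    """Return True if every (before, after) edge is satisfied by the ordering."""
    position = {t: i for i, t in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)


def enumerate_prototypes(transformations, edges):
    """List every ordering of the transformations that respects the edges."""
    return [p for p in permutations(transformations) if respects(p, edges)]


prototypes = enumerate_prototypes(TRANSFORMATIONS, HYPOTHETICAL_EDGES)
print(len(prototypes))  # 120 of the 720 unconstrained orderings remain
```

Even with these three illustrative edges, 120 orderings survive, which suggests why optimizing every prototype exhaustively is costly.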
Data pre-processing complexity - Universal pipeline prototype
List of algorithms:
• NB - Naive Bayes;
• KNN - K-Nearest Neighbor;
• RF - Random Forest.
Datasets:
• OpenML CC-18 suite [5]
[5] https://www.openml.org/s/99
Our approach
1. Build effective pipeline prototypes
2. Optimize them
3. Select the best one
Our approach - Building effective pipeline prototypes
Rule sets: framework-related rules, heuristic rules, learnt rules.
Transformations: E - Encoding, N - Normalization, D - Discretization, I - Imputation, R - Rebalancing, F - Feature Engineering.
1 - a precedence edge exists between the row and the column,
0 - a precedence edge does not exist between the row and the column,
x - the combination is meaningless.
Our approach - Learnt rules
List of algorithms:
• NB - Naive Bayes;
• KNN - K-Nearest Neighbor;
• RF - Random Forest.
Datasets:
• OpenML CC-18 suite
Our approach - Building effective pipeline prototypes
Rule sets: framework-related rules, heuristic rules, learnt rules.
Transformations: E - Encoding, N - Normalization, D - Discretization, I - Imputation, R - Rebalancing, F - Feature Engineering.
1 - a precedence edge exists between the row and the column,
0 - a precedence edge does not exist between the row and the column,
x - the combination is meaningless.
Union of rules:
Our approach - Building effective pipeline prototypes
Union of rules:
1 - a precedence edge exists between the row and the column,
0 - a precedence edge does not exist between the row and the column,
x - the combination is meaningless.
Effective set of data pipeline prototypes:
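A minimal sketch of combining rule sets, assuming that "union of rules" means the set union of precedence edges, so that an edge holds if any rule set asserts it. The three rule sets and the helper names are hypothetical and do not reproduce the matrices on the slides.

```python
def union_of_rules(*rule_sets):
    """Merge precedence edges: an edge is kept if any rule set contains it."""
    merged = set()
    for rules in rule_sets:
        merged |= set(rules)
    return merged


def direct_conflicts(edges):
    """Return pairs (a, b) for which both a-before-b and b-before-a are required."""
    return sorted((a, b) for a, b in edges if a < b and (b, a) in edges)


# Hypothetical rule sets; each edge (x, y) means "x must precede y".
framework_rules = {("I", "E")}
heuristic_rules = {("E", "N")}
learnt_rules = {("I", "N"), ("D", "R")}

combined = union_of_rules(framework_rules, heuristic_rules, learnt_rules)
print(sorted(combined))
print(direct_conflicts(combined))
```

The conflict check is a design choice of this sketch: a union of independently derived rules could, in principle, demand two contradictory orderings, and it is cheaper to detect that before enumerating prototypes.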
Our approach - Optimization
Evaluation
Evaluation - Effective versus exhaustive prototypes optimization
Effective prototypes optimization (our approach):
Exhaustive prototypes optimization: for each of the possible data pipeline prototypes:
Evaluation - Effective versus exhaustive prototypes optimization
$$\text{normalized distance} = \frac{Acc(d_{effective}, a^{*}) - Acc(d, a)}{Acc(d_{exhaustive}, a^{*}) - Acc(d, a)}$$
• $Acc(d, a)$: baseline accuracy (i.e., predictive accuracy of the algorithm $a$ with default hyper-parameters over the original dataset $d$);
• $Acc(d_{effective}, a^{*})$: accuracy of the optimized algorithm $a^{*}$ over the dataset $d_{effective}$ transformed using the optimized instantiation of the effective set of prototypes obtained with our approach;
• $Acc(d_{exhaustive}, a^{*})$: accuracy of the optimized algorithm $a^{*}$ over the dataset $d_{exhaustive}$ transformed using the optimized pipeline instantiation of the exhaustive set of prototypes.
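The normalized distance can be computed directly from the three accuracies. The sketch below is illustrative: the function name, the argument names, and the sample accuracies are assumptions, not values from the evaluation.

```python
def normalized_distance(acc_effective, acc_exhaustive, acc_baseline):
    """Gain of the effective prototypes relative to the gain of the exhaustive set.

    Each gain is measured over the baseline Acc(d, a).
    """
    denominator = acc_exhaustive - acc_baseline
    if denominator == 0:
        # The ratio is undefined when the exhaustive search does not move
        # the accuracy away from the baseline.
        raise ValueError("exhaustive accuracy equals the baseline accuracy")
    return (acc_effective - acc_baseline) / denominator


# Made-up accuracies, chosen only to illustrate the arithmetic.
print(normalized_distance(acc_effective=0.6875, acc_exhaustive=0.75, acc_baseline=0.5))  # 0.75
```

A value of 1 means the effective prototypes match the exhaustive search; values below 1 mean that part of the achievable gain was lost.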
Evaluation - Effective versus exhaustive prototypes optimization
$$\text{normalized distance} = \frac{Acc(d_{effective}, a^{*}) - Acc(d, a)}{Acc(d_{exhaustive}, a^{*}) - Acc(d, a)}$$
[Figure: normalized distances between the scores obtained by optimizing our effective prototypes and those obtained by optimizing the exhaustive set.]
Evaluation - Complementing hyper-parameter optimization with pre-processing
Complementing hyper-parameter optimization with pre-processing:
Just hyper-parameter optimization:
Evaluation - Complementing hyper-parameter optimization with pre-processing
$$\text{pp impact} = \frac{Acc(d_{effective}, a^{*}) - Acc(d, a)}{\max(Acc(d_{effective}, a^{*}), Acc(d, a^{*})) - Acc(d, a)}$$
$$\text{hp impact} = \frac{Acc(d, a^{*}) - Acc(d, a)}{\max(Acc(d_{effective}, a^{*}), Acc(d, a^{*})) - Acc(d, a)}$$
• $Acc(d, a)$: baseline accuracy (i.e., predictive accuracy of the algorithm $a$ with default hyper-parameters over the original dataset $d$);
• $Acc(d_{effective}, a^{*})$: accuracy of the optimized algorithm $a^{*}$ over the dataset $d_{effective}$ transformed using the optimized instantiation of the effective set of prototypes obtained with our approach;
• $Acc(d, a^{*})$: accuracy of the optimized algorithm $a^{*}$ (i.e., using the entire budget for hyper-parameter optimization) over the original dataset $d$.
Evaluation - Complementing hyper-parameter optimization with pre-processing
$$\text{normalized pp impact} = \frac{\text{pp impact}}{\text{pp impact} + \text{hp impact}}$$
$$\text{normalized hp impact} = \frac{\text{hp impact}}{\text{pp impact} + \text{hp impact}}$$
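The pp and hp impacts and their normalized versions can be sketched as one small function. The function name, the argument names, and the sample accuracies are assumptions made for illustration; only the formulas come from the slides.

```python
def impacts(acc_baseline, acc_hp_only, acc_effective):
    """Return (pp impact, hp impact, normalized pp impact, normalized hp impact).

    acc_baseline  : Acc(d, a), default hyper-parameters on the original data
    acc_hp_only   : Acc(d, a*), hyper-parameter optimization only
    acc_effective : Acc(d_effective, a*), pre-processing plus optimization
    """
    denominator = max(acc_effective, acc_hp_only) - acc_baseline
    if denominator == 0:
        raise ValueError("best accuracy equals the baseline accuracy")
    pp = (acc_effective - acc_baseline) / denominator
    hp = (acc_hp_only - acc_baseline) / denominator
    total = pp + hp
    if total == 0:
        raise ValueError("pp impact and hp impact sum to zero")
    return pp, hp, pp / total, hp / total


# Made-up accuracies, chosen only to illustrate the arithmetic.
pp, hp, norm_pp, norm_hp = impacts(acc_baseline=0.5, acc_hp_only=0.625, acc_effective=0.75)
print(pp, hp)            # 1.0 0.5
print(norm_pp, norm_hp)  # the two normalized impacts sum to 1
```

Because both impacts share the same denominator, the better of the two strategies always scores 1 before normalization, and the normalized values express each strategy's share of the combined impact.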
Conclusions and future work
Contributions:
• there is no universal pre-processing pipeline prototype;
• pre-processing optimization may boost the final result;
• in the median, our approach reaches 90% of the optimal predictive accuracy at a cost that is 24 times smaller.
Future work:
• study the datasets that do not react well to data pre-processing;
• extend our approach with a meta-learning module.
Thanks for your attention
