Effective data pre-processing for AutoML
Joseph Giovanelli, Besim Bilalli, Alberto Abelló
j.giovanelli@unibo.it, bbilalli@essi.upc.edu, aabello@essi.upc.edu
The data analytics process
Based on the problem to tackle:
• Data Pipeline Selection and Optimization (DPSO) ¹ consists of finding the best data pipeline;
• Combined Algorithm Selection and Hyperparameter Optimization (CASH) ² consists of finding the best algorithm and its configuration.
¹ Alexandre Quemy. 2020. Two-stage optimization for machine learning workflow. Information Systems 92 (2020).
² Chris Thornton, Frank Hutter, Holger H. Hoos, et al. 2013. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In KDD.
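As a rough illustration of CASH, the sketch below treats it as a joint search over (algorithm, configuration) pairs. The search space, the `toy_score` function, and all names are hypothetical, and plain enumeration is swapped in for a real optimizer. DPSO could be pictured the same way, with data pipelines in place of algorithms.

```python
from itertools import product

# HYPOTHETICAL search space: algorithm name -> grid of hyper-parameter values.
SPACE = {
    "knn": {"n_neighbors": [1, 5, 15]},
    "rf": {"n_estimators": [10, 100], "max_depth": [3, None]},
}


def configurations(space):
    """Yield every (algorithm, hyper-parameter dict) pair in the space."""
    for algorithm, grid in space.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            yield algorithm, dict(zip(names, values))


def cash(space, evaluate):
    """Return the (algorithm, configuration) pair with the highest score."""
    return max(configurations(space), key=lambda pair: evaluate(*pair))


def toy_score(algorithm, config):
    """Placeholder for a measured accuracy; the values are made up."""
    return 0.9 if (algorithm, config.get("n_neighbors")) == ("knn", 5) else 0.5


print(cash(SPACE, toy_score))
```

Enumeration only works for tiny grids; it is used here to keep the joint nature of the search visible.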
AutoML and data analytics process
Automated Machine Learning (AutoML) aims to automate the whole data analytics process.
Highlight:
• usage of state-of-the-art optimization techniques, e.g., Sequential Model-Based Optimization (SMBO) ³
Drawback:
• neglect of data pre-processing
³ Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
Data pre-processing complexity
List of transformations:
• E - Encoding;
• N - Normalization;
• D - Discretization;
• I - Imputation;
• R - Rebalancing;
• F - Feature Engineering.
Framework: scikit-learn ⁴
Framework-related rules (matrix figure not reproduced), with cell values:
• 1 - a precedence edge exists between the row and the column;
• 0 - a precedence edge does not exist between the row and the column;
• x - the combination is meaningless.
⁴ https://scikit-learn.org/stable/
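A minimal sketch of how such a precedence matrix could be encoded and checked. The edges below are hypothetical, not the ones from the slide, and the sketch assumes that a 1 in (row, column) means the row transformation must run before the column transformation.

```python
# Transformations from the slide: Encoding, Normalization, Discretization,
# Imputation, Rebalancing, Feature Engineering.
STEPS = ["E", "N", "D", "I", "R", "F"]

# HYPOTHETICAL precedence edges (row -> column); not the matrix from the slides.
# Assumed reading: a 1 in (row, column) means "row must run before column".
EDGES = {("I", "E"), ("I", "N"), ("E", "N")}


def cell(row: str, col: str) -> str:
    """Return the matrix entry for (row, col): '1', '0', or 'x'."""
    if row == col:
        return "x"  # a step paired with itself is treated as meaningless here
    return "1" if (row, col) in EDGES else "0"


def respects_rules(order: list[str]) -> bool:
    """True if every precedence edge among the steps in `order` is honoured."""
    position = {step: i for i, step in enumerate(order)}
    return all(
        position[a] < position[b]
        for a, b in EDGES
        if a in position and b in position
    )


# Print the matrix row by row, then check two candidate orderings.
for row in STEPS:
    print(row, " ".join(cell(row, col) for col in STEPS))

print(respects_rules(["I", "E", "N", "D", "R", "F"]))  # True
print(respects_rules(["N", "I", "E", "D", "R", "F"]))  # False
```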
Data pre-processing complexity
Framework-related rules (matrix figure not reproduced).
Exhaustive set of data pipeline prototypes (figure not reproduced).
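One way to picture an exhaustive set of prototypes is to enumerate every ordering of the transformations that violates no precedence edge. The sketch assumes that a prototype uses each transformation exactly once, and the edges are again hypothetical.

```python
from itertools import permutations

STEPS = ["E", "N", "D", "I", "R", "F"]
# HYPOTHETICAL framework-related edges, read as "first must precede second".
EDGES = {("I", "E"), ("I", "N"), ("E", "N")}


def prototypes(steps, edges):
    """Yield every ordering of `steps` that violates no precedence edge."""
    for order in permutations(steps):
        position = {step: i for i, step in enumerate(order)}
        if all(position[a] < position[b] for a, b in edges):
            yield order


exhaustive = list(prototypes(STEPS, EDGES))
print(len(exhaustive), "of 720 orderings remain")
```

With these three toy edges, 120 of the 720 orderings remain, which suggests why optimizing every prototype quickly becomes expensive.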
Data pre-processing complexity - Universal pipeline prototype
List of algorithms:
• NB - Naive Bayes;
• KNN - K-Nearest Neighbor;
• RF - Random Forest.
Datasets:
• OpenML CC-18 suite ⁵
⁵ https://www.openml.org/s/99
Our approach
1. Build effective pipeline prototypes
2. Optimize them
3. Select the best one
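The three steps can be sketched as a small loop. Everything here is a stand-in: the prototypes are hypothetical, the even budget split is an assumption, and `optimize` performs random search over a toy objective rather than real pipeline tuning.

```python
import random

# Step 1 (assumed already done): HYPOTHETICAL prototypes, each an ordered
# tuple of transformation codes.
PROTOTYPES = [("I", "E", "N"), ("I", "N", "E"), ("E", "I", "N")]


def optimize(prototype, budget, rng):
    """Stand-in for step 2: random search returning the best toy score found.

    A real optimizer would instantiate operators and hyper-parameters for
    `prototype` and measure predictive accuracy; here each trial just draws
    a pseudo-accuracy in [0, 1).
    """
    return max(rng.random() for _ in range(budget))


def select_best(prototypes, total_budget, seed=0):
    """Optimize every prototype with an equal budget share, then keep the best."""
    rng = random.Random(seed)
    share = max(1, total_budget // len(prototypes))
    scores = {p: optimize(p, share, rng) for p in prototypes}  # step 2
    best = max(scores, key=scores.get)                         # step 3
    return best, scores[best]


best, score = select_best(PROTOTYPES, total_budget=30)
print(best, round(score, 3))
```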
Our approach - Building effective pipeline prototypes
Three rule matrices (figures not reproduced): framework-related rules, heuristic rules, and learnt rules.
Rows and columns: E - Encoding, N - Normalization, D - Discretization, I - Imputation, R - Rebalancing, F - Feature Eng.
Cell values:
• 1 - a precedence edge exists between the row and the column;
• 0 - a precedence edge does not exist between the row and the column;
• x - the combination is meaningless.
Our approach - Learnt rules
List of algorithms:
• NB - Naive Bayes;
• KNN - K-Nearest Neighbor;
• RF - Random Forest.
Datasets:
• OpenML CC-18 suite.
Our approach - Building effective pipeline prototypes
Union of rules (matrix figure not reproduced), with cell values:
• 1 - a precedence edge exists between the row and the column;
• 0 - a precedence edge does not exist between the row and the column;
• x - the combination is meaningless.
Effective set of data pipeline prototypes (figure not reproduced).
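A hedged sketch of what taking the union of rules could look like when each rule source is stored as a set of precedence edges. The edges are hypothetical, and rejecting contradictory pairs is an added assumption, not something the slides state.

```python
# Each rule source is a set of HYPOTHETICAL precedence edges (first -> second).
framework_rules = {("I", "E"), ("I", "N")}
heuristic_rules = {("E", "N")}
learnt_rules = {("D", "R"), ("F", "R")}


def union_of_rules(*sources):
    """Merge edge sets and reject direct contradictions such as A->B plus B->A."""
    merged = set().union(*sources)
    clashes = {(a, b) for a, b in merged if (b, a) in merged}
    if clashes:
        raise ValueError(f"contradictory edges: {sorted(clashes)}")
    return merged


rules = union_of_rules(framework_rules, heuristic_rules, learnt_rules)
print(sorted(rules))
```

Each added edge removes orderings from the candidate set, so a union of several rule sources admits fewer prototypes than any single source alone.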
Our approach - Optimization
Evaluation
Evaluation - Effective versus exhaustive prototypes optimization
Two optimization settings (diagrams not reproduced):
• effective prototypes optimization (our approach);
• exhaustive prototypes optimization, performed for each of the possible data pipeline prototypes.
Evaluation - Effective versus exhaustive prototypes optimization
$$\text{normalized distance} = \frac{\mathrm{Acc}(d_{effective}, a^{*}) - \mathrm{Acc}(d, a)}{\mathrm{Acc}(d_{exhaustive}, a^{*}) - \mathrm{Acc}(d, a)}$$
• $\mathrm{Acc}(d, a)$: baseline accuracy (i.e., predictive accuracy of the algorithm $a$ with default hyper-parameters over the original dataset $d$);
• $\mathrm{Acc}(d_{effective}, a^{*})$: accuracy of the optimized algorithm $a^{*}$ over the dataset $d_{effective}$, transformed using the optimized instantiation of the effective set of prototypes obtained with our approach;
• $\mathrm{Acc}(d_{exhaustive}, a^{*})$: accuracy of the optimized algorithm $a^{*}$ over the dataset $d_{exhaustive}$, transformed using the optimized pipeline instantiation of the exhaustive set of prototypes.
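The normalized distance can be computed directly from the three accuracies. The function and argument names are illustrative, the example values are made up, and raising an error when the exhaustive search yields no gain is an assumption about how to handle a zero denominator.

```python
def normalized_distance(acc_effective, acc_exhaustive, acc_baseline):
    """Gain of the effective prototypes relative to the exhaustive search.

    Both gains are measured against the same baseline Acc(d, a).
    """
    gain_effective = acc_effective - acc_baseline
    gain_exhaustive = acc_exhaustive - acc_baseline
    if gain_exhaustive == 0:
        raise ValueError("exhaustive search gives no gain over the baseline")
    return gain_effective / gain_exhaustive


# Made-up accuracies: baseline 0.80, effective 0.88, exhaustive 0.90.
print(round(normalized_distance(0.88, 0.90, 0.80), 3))
```

A value close to 1 means the effective prototypes recover almost all of the improvement that the exhaustive search achieves.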
Evaluation - Effective versus exhaustive prototypes optimization
Figure (not reproduced): normalized distances between the scores obtained by optimizing our effective prototypes and the ones obtained by optimizing the exhaustive set.
Evaluation - Complementing hyper-parameter optimization with pre-processing
Two settings are compared (diagrams not reproduced):
• complementing hyper-parameter optimization with pre-processing;
• just hyper-parameter optimization.
Evaluation - Complementing hyper-parameter optimization with pre-processing
$$\text{pp impact} = \frac{\mathrm{Acc}(d_{effective}, a^{*}) - \mathrm{Acc}(d, a)}{\max\big(\mathrm{Acc}(d_{effective}, a^{*}),\ \mathrm{Acc}(d, a^{*})\big) - \mathrm{Acc}(d, a)}$$
$$\text{hp impact} = \frac{\mathrm{Acc}(d, a^{*}) - \mathrm{Acc}(d, a)}{\max\big(\mathrm{Acc}(d_{effective}, a^{*}),\ \mathrm{Acc}(d, a^{*})\big) - \mathrm{Acc}(d, a)}$$
• $\mathrm{Acc}(d, a)$: baseline accuracy (i.e., predictive accuracy of the algorithm $a$ with default hyper-parameters over the original dataset $d$);
• $\mathrm{Acc}(d_{effective}, a^{*})$: accuracy of the optimized algorithm $a^{*}$ over the dataset $d_{effective}$, transformed using the optimized instantiation of the effective set of prototypes obtained with our approach;
• $\mathrm{Acc}(d, a^{*})$: accuracy of the optimized algorithm $a^{*}$ (i.e., using the entire budget) over the original dataset $d$.
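Both impacts share the same denominator, so they can be computed together. This is a minimal sketch with illustrative names and made-up accuracies; the error raised for a zero denominator is an assumption.

```python
def impacts(acc_baseline, acc_hp_only, acc_effective):
    """Return (pp_impact, hp_impact) as defined above.

    acc_baseline  -> Acc(d, a)
    acc_hp_only   -> Acc(d, a*)
    acc_effective -> Acc(d_effective, a*)
    """
    best_gain = max(acc_effective, acc_hp_only) - acc_baseline
    if best_gain == 0:
        raise ValueError("neither setting improves on the baseline")
    pp_impact = (acc_effective - acc_baseline) / best_gain
    hp_impact = (acc_hp_only - acc_baseline) / best_gain
    return pp_impact, hp_impact


# Made-up accuracies: baseline 0.70, hyper-parameters only 0.75,
# pre-processing plus hyper-parameters 0.80.
pp, hp = impacts(0.70, 0.75, 0.80)
print(round(pp, 3), round(hp, 3))
```

Whichever setting reaches the higher accuracy gets an impact of 1, and the other is expressed as a fraction of that gain.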
Evaluation - Complementing hyper-parameter optimization with pre-processing
$$\text{normalized pp impact} = \frac{\text{pp impact}}{\text{pp impact} + \text{hp impact}}$$
$$\text{normalized hp impact} = \frac{\text{hp impact}}{\text{pp impact} + \text{hp impact}}$$
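The normalization rescales the two impacts so that they sum to 1. A short sketch with a hypothetical function name and made-up inputs:

```python
def normalized_impacts(pp_impact, hp_impact):
    """Rescale the two impacts so that they sum to 1."""
    total = pp_impact + hp_impact
    if total == 0:
        raise ValueError("both impacts are zero; nothing to normalize")
    return pp_impact / total, hp_impact / total


# Made-up impacts: pre-processing 1.0, hyper-parameter optimization 0.5.
norm_pp, norm_hp = normalized_impacts(1.0, 0.5)
print(round(norm_pp, 3), round(norm_hp, 3))
```

With these inputs, pre-processing accounts for about two thirds of the combined impact and hyper-parameter optimization for about one third.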
Conclusions and future work
Contributions:
• there is no universal pre-processing pipeline prototype;
• pre-processing optimization may boost the final result;
• our approach reaches, in the median, 90% of the optimal predictive accuracy at a cost that is 24 times smaller.
Future work:
• study the datasets that do not react well to data pre-processing;
• extend our approach with a meta-learning module.
Thanks for the attention
