Effective data pre-processing for AutoML
Joseph Giovanelli Besim Bilalli Alberto Abelló
j.giovanelli@unibo.it bbilalli@essi.upc.edu aabello@essi.upc.edu
The data analytics process
Depending on the problem to tackle:
• Data Pipeline Selection and Optimization (DPSO) [1] consists of finding the best data pipeline;
• Combined Algorithm Selection and Hyperparameter Optimization (CASH) [2] consists of finding the best algorithm and its configuration.
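As given in the Auto-WEKA paper [2], CASH can be stated as choosing the algorithm $A^{(j)}$ from a portfolio $\mathcal{A}$ and hyper-parameters $\lambda$ from its space $\Lambda^{(j)}$ that minimize the average loss $\mathcal{L}$ over $k$ cross-validation folds:

\[
A^*_{\lambda^*} \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \ \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\bigl(A^{(j)}_{\lambda}, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{valid}}\bigr)
\]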
[1] Alexandre Quemy. 2020. Two-stage optimization for machine learning workflow. Information Systems 92 (2020).
[2] Chris Thornton, Frank Hutter, Holger H. Hoos, et al. 2013. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In KDD.
AutoML and data analytics process
Automated Machine Learning (AutoML) aims to automate the whole data analytics process.
Highlight:
• usage of state-of-the-art optimization techniques, e.g., Sequential Model-Based Optimization (SMBO) [3]
Drawback:
• neglect of data pre-processing
[3] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
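To make SMBO concrete, here is a minimal sketch (not the implementation of any specific AutoML tool): a random-forest surrogate is refit after each evaluation of a toy 1-D objective, and the next configuration is the candidate the surrogate predicts to be best. The objective, search space, and budget are illustrative placeholders.

```python
# Minimal SMBO sketch: surrogate model + acquisition over random candidates.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def objective(x):
    # Placeholder black-box loss over a 1-D hyper-parameter in [0, 1].
    return (x - 0.3) ** 2 + 0.05 * rng.normal()

# Bootstrap the surrogate with a few random evaluations.
X, y = [], []
for x in rng.uniform(0, 1, size=5):
    X.append([x]); y.append(objective(x))

for _ in range(20):
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(np.array(X), np.array(y))
    # Acquisition: among random candidates, pick the lowest predicted loss
    # (pure exploitation; real SMBO trades this off against uncertainty).
    candidates = rng.uniform(0, 1, size=(100, 1))
    best = candidates[np.argmin(surrogate.predict(candidates))]
    X.append(list(best)); y.append(objective(best[0]))

print("best configuration:", X[int(np.argmin(y))], "loss:", min(y))
```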
Data pre-processing complexity
List of transformations:
• E - Encoding;
• N - Normalization;
• D - Discretization;
• I - Imputation;
• R - Rebalancing;
• F - Feature Engineering.
Framework: scikit-learn [4]
Framework-related rules:
• 1 - a precedence edge exists between the row and the column;
• 0 - no precedence edge exists between the row and the column;
• x - the combination is meaningless.
[4] https://scikit-learn.org/stable/
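To ground the transformation list in the scikit-learn API, the sketch below instantiates one hypothetical prototype (I → E → N → F) followed by a classifier; the operator choices and their order are assumptions for illustration, not a prototype prescribed by our approach. Rebalancing is omitted because plain scikit-learn offers no rebalancing transformer (extensions such as imbalanced-learn provide one).

```python
# One hypothetical prototype instantiation (I -> E -> N -> F) in
# scikit-learn; operators and their order are illustrative assumptions.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="most_frequent")),  # I
    ("encoding", OrdinalEncoder()),                           # E
    ("normalization", StandardScaler()),                      # N
    ("feature_eng", PCA(n_components=2)),                     # F
    ("classifier", KNeighborsClassifier()),
])
# pipeline.fit(X_train, y_train) would then train the full chain.
```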
Data pre-processing complexity
Framework-related rules: same 1/0/x precedence legend as on the previous slide.
Exhaustive set of data pipeline prototypes:
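One way to materialize such an exhaustive set (a simplified sketch with a hypothetical precedence matrix, not our exact procedure): read each 1-cell as "the row may appear before the column" and keep only the orderings in which every pair respects the rules.

```python
# Enumerate pipeline prototypes consistent with a precedence matrix.
# ALLOWED is a hypothetical rule set for illustration only: ALLOWED[a]
# lists the transformations permitted to appear after a.
from itertools import permutations

TRANSFORMS = ["E", "N", "D", "I", "R", "F"]

ALLOWED = {
    "I": {"E", "N", "D", "R", "F"},
    "E": {"N", "D", "R", "F"},
    "N": {"D", "R", "F"},
    "D": {"N", "R", "F"},
    "R": {"F"},
    "F": {"R"},
}

def valid(order):
    # Every transformation must be allowed to precede all that follow it.
    return all(b in ALLOWED[a]
               for i, a in enumerate(order)
               for b in order[i + 1:])

prototypes = [p for p in permutations(TRANSFORMS) if valid(p)]
print(len(prototypes), "valid prototypes, e.g.", prototypes[0])
```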
Data pre-processing complexity - Universal pipeline prototype
List of algorithms:
• NB - Naive Bayes;
• KNN - K-Nearest Neighbor;
• RF - Random Forest.
Datasets:
• OpenML CC-18 suite [5]
[5] https://www.openml.org/s/99
Our approach
1. Build effective pipeline prototypes;
2. Optimize them;
3. Select the best one (steps 2 and 3 are sketched below).
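A minimal sketch of steps 2 and 3 under stated assumptions: `smbo_optimize` is a hypothetical stand-in for the SMBO tuning of a single prototype (see the SMBO sketch above), and the budget is split equally across prototypes for simplicity.

```python
# Sketch of steps 2-3: tune each prototype, keep the best-scoring one.
import random

def smbo_optimize(prototype, dataset, algorithm, budget):
    # Hypothetical stand-in for SMBO tuning of one prototype;
    # returns (best_accuracy, fitted_pipeline).
    return random.random(), None

def select_best(prototypes, dataset, algorithm, budget):
    per_prototype = budget // len(prototypes)  # naive equal split
    results = []
    for prototype in prototypes:
        accuracy, pipeline = smbo_optimize(prototype, dataset, algorithm,
                                           per_prototype)
        results.append((accuracy, prototype, pipeline))
    return max(results, key=lambda r: r[0])  # step 3: highest accuracy wins
```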
Our approach - Building effective pipeline prototypes
Three rule matrices over the transformations (E - Encoding, N - Normalization, D - Discretization, I - Imputation, R - Rebalancing, F - Feature Eng.): framework-related rules, heuristic rules, and learnt rules, each with the same 1/0/x precedence legend as above.
Our approach - Learnt rules
List of algorithms:
• NB - Naive Bayes;
• KNN - K-Nearest Neighbor;
• RF - Random Forest.
Datasets:
• OpenML CC-18 suite
Our approach - Building effective pipeline prototypes
Framework-related, heuristic, and learnt rule matrices over the same transformations, with the same 1/0/x precedence legend as above.
Union of rules:
Our approach - Building effective pipeline prototypes
Union of rules:
Same 1/0/x precedence legend as above.
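As a small sketch of one plausible reading of this union (an assumption on our part, not necessarily the exact operator we use): combine the three rule matrices cell-wise, letting 'x' (meaningless) dominate and otherwise OR-ing the 1-entries.

```python
# Hypothetical cell-wise union of rule matrices: 'x' dominates;
# otherwise a precedence edge exists if any rule set asserts it.
def union_cell(*cells):
    if "x" in cells:
        return "x"
    return 1 if 1 in cells else 0

def union_rules(*matrices):
    keys = list(matrices[0])
    return {a: {b: union_cell(*(m[a][b] for m in matrices)) for b in keys}
            for a in keys}
```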
Effective set of data pipeline prototypes:
Our approach - Optimization
Evaluation
Evaluation - Effective versus exhaustive prototypes optimization
Effective prototypes optimization (our approach):
Exhaustive prototypes optimization: for each of the possible data pipeline prototypes:
Evaluation - Effective versus exhaustive prototypes optimization
\[
\text{normalized distance} = \frac{Acc(d_{\mathit{effective}}, a^*) - Acc(d, a)}{Acc(d_{\mathit{exhaustive}}, a^*) - Acc(d, a)}
\]

• $Acc(d, a)$: baseline accuracy, i.e., the predictive accuracy of the algorithm $a$ with default hyper-parameters over the original dataset $d$;
• $Acc(d_{\mathit{effective}}, a^*)$: accuracy of the optimized algorithm $a^*$ over the dataset $d_{\mathit{effective}}$, transformed using the optimized instantiation of the effective set of prototypes obtained with our approach;
• $Acc(d_{\mathit{exhaustive}}, a^*)$: accuracy of the optimized algorithm $a^*$ over the dataset $d_{\mathit{exhaustive}}$, transformed using the optimized pipeline instantiation of the exhaustive set of prototypes.
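As a worked example with hypothetical accuracies (not measurements from our experiments): if $Acc(d, a) = 0.80$, $Acc(d_{\mathit{effective}}, a^*) = 0.89$, and $Acc(d_{\mathit{exhaustive}}, a^*) = 0.90$, then

\[
\text{normalized distance} = \frac{0.89 - 0.80}{0.90 - 0.80} = 0.9,
\]

i.e., the effective prototypes recover 90% of the accuracy gain obtained by the exhaustive search.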
Evaluation - Effective versus exhaustive prototypes optimization
\[
\text{normalized distance} = \frac{Acc(d_{\mathit{effective}}, a^*) - Acc(d, a)}{Acc(d_{\mathit{exhaustive}}, a^*) - Acc(d, a)}
\]
Normalized distances between the scores obtained by optimizing our effective prototypes and those obtained by optimizing the exhaustive set.
Evaluation - Complementing hyper-parameter optimization with pre-processing
Complementing hyper-parameter optimization with pre-processing:
Just hyper-parameter optimization:
Evaluation - Complementing hyper-parameter optimization with pre-processing
\[
\text{pp impact} = \frac{Acc(d_{\mathit{effective}}, a^*) - Acc(d, a)}{\max\bigl(Acc(d_{\mathit{effective}}, a^*), Acc(d, a^*)\bigr) - Acc(d, a)}
\]
\[
\text{hp impact} = \frac{Acc(d, a^*) - Acc(d, a)}{\max\bigl(Acc(d_{\mathit{effective}}, a^*), Acc(d, a^*)\bigr) - Acc(d, a)}
\]
• $Acc(d, a)$: baseline accuracy, i.e., the predictive accuracy of the algorithm $a$ with default hyper-parameters over the original dataset $d$;
• $Acc(d_{\mathit{effective}}, a^*)$: accuracy of the optimized algorithm $a^*$ over the dataset $d_{\mathit{effective}}$, transformed using the optimized instantiation of the effective set of prototypes obtained with our approach;
• $Acc(d, a^*)$: accuracy of the optimized algorithm $a^*$ (i.e., using the entire budget for hyper-parameter optimization) over the original dataset $d$.
Evaluation - Complementing hyper-parameter optimization with pre-processing
\[
\text{normalized pp impact} = \frac{\text{pp impact}}{\text{pp impact} + \text{hp impact}}, \qquad
\text{normalized hp impact} = \frac{\text{hp impact}}{\text{pp impact} + \text{hp impact}}
\]
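Continuing the hypothetical numbers from before, with $Acc(d, a) = 0.80$, $Acc(d_{\mathit{effective}}, a^*) = 0.89$, and $Acc(d, a^*) = 0.84$, the maximum in both denominators is $0.89$, so

\[
\text{pp impact} = \frac{0.09}{0.09} = 1.0, \qquad \text{hp impact} = \frac{0.04}{0.09} \approx 0.44,
\]
\[
\text{normalized pp impact} \approx 0.69, \qquad \text{normalized hp impact} \approx 0.31.
\]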
Conclusions and future work
Contributions:
• there is no universal pre-processing pipeline prototype;
• pre-processing optimization may boost the final result;
• our approach reaches 90% of the optimal predictive accuracy in the median, at a cost 24 times smaller.
Future work:
• study the datasets that do not react well to data pre-processing;
• extend our approach with a meta-learning module.
Thanks for your attention