Summary of: "Automating Data Preparation: Can We? Should We? Must We?"
1. UNIVERSITÀ DEGLI STUDI DI
TRIESTE
Dipartimento di Ingegneria e Architettura
Laurea Triennale in Ingegneria Elettronica e
Informatica
Summary of: "Automating Data
Preparation: Can We? Should We?
Must We?"
April 28, 2021
Laureando Relatore
Samuele Bertollo Chiar.mo Prof. Eric Medvet
Anno Accademico 2020/2021
Introduction
Data preparation is also known as data wrangling or ETL (Extract, Transform, Load). The definition given by N. Paton is: "Data preparation
covers the discovery, selection, integration and cleaning of existing data sets
into a form that is suitable for analysis."
Data preparation is fundamental to the work of data scientists and consumes,
on average, 80% of their working time. It can be divided into single
steps: activities that the data scientist usually performs manually. Doing this
work requires programming skills, which come at a significant cost in terms of
time and money. By automating data preparation we can reduce these costs.
The goal of automating data preparation is to build an application whose
users specify what they want to obtain from data preparation
instead of describing the steps required to obtain it. With this approach,
programming knowledge is not required and the process is less
time-consuming.
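To make this contrast concrete, here is a minimal, hypothetical Python sketch (ours, not from the paper): the user only declares what the prepared data should look like, as a toy target schema listing likely column synonyms, while a stand-in prepare() engine works out the steps.

    def prepare(records, target_schema):
        """Toy engine: keep only target columns, resolving names via synonyms."""
        prepared = []
        for row in records:
            out = {}
            for column, synonyms in target_schema.items():
                # Pick the first source field matching one of the declared synonyms.
                out[column] = next((row[s] for s in synonyms if s in row), None)
            prepared.append(out)
        return prepared

    raw = [
        {"Name": "Ada", "tel": "555-0100", "junk": "x"},
        {"full_name": "Alan", "phone": "555-0199"},
    ]

    # The user declares only the desired result: target columns and likely synonyms.
    target_schema = {"name": ["Name", "full_name"], "phone": ["tel", "phone"]}

    print(prepare(raw, target_schema))
    # [{'name': 'Ada', 'phone': '555-0100'}, {'name': 'Alan', 'phone': '555-0199'}]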
In N. Paton's paper, "Automating Data Preparation: Can We? Should
We? Must We?" (2019), three main questions are discussed:
1. What techniques do we have to automate data preparation?
2. When are the results better than human-made data preparation?
3. When is an automated approach mandatory because the manual one
is not viable?
What techniques do we have to automate data
preparation?
There are two main strategies for automating data preparation: focusing on
the single steps or solving the problem end-to-end.
In the single-step strategies, we need to provide some additional information
to enable automation. This additional information is called evidence. In some
cases, the evidence requires just the source data or little additional data. However,
in general, the more data we provide, the better the results. For
example, to automate the learning of data transformations, we need to provide
some samples of the problem already solved, which are called training data. This
means that someone is still required to discover these samples manually, but
some researchers are finding ways to automate this discovery as well.
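As a hedged illustration of this kind of evidence, the toy learner below (an assumption of ours, far simpler than real program-synthesis systems) searches a small library of candidate transformations for one that reproduces every solved sample the user supplied.

    CANDIDATES = {
        "upper": str.upper,
        "strip": str.strip,
        "last_token": lambda s: s.split()[-1],
    }

    def learn(examples):
        """Return the first candidate transformation consistent with all samples."""
        for name, fn in CANDIDATES.items():
            if all(fn(x) == y for x, y in examples):
                return name, fn
        return None, None

    # The user supplies a few solved samples (the training data) instead of code.
    training_data = [("Ada Lovelace", "Lovelace"), ("Alan Turing", "Turing")]
    name, fn = learn(training_data)
    print(name, fn("Grace Hopper"))  # -> last_token Hopper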
Good examples of end-to-end data preparation software are Data
TAMER and VADA.
Data TAMER is semi-automatic and uses training data. An important
characteristic of this software is that the user must provide feedback for each
step of the data preparation.
In contrast, VADA uses only evidence in the form of data context, that is,
instance values associated with portions of a target schema. With this software
the user can review the single steps by giving feedback, but this is not
mandatory.
Therefore, both end-to-end solutions require only a user with knowledge of the
application domain, and they do not require the user to specify how the data
should be prepared.
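The toy sketch below (our illustration, not VADA's actual algorithm) shows how data context can serve as evidence: example instance values are attached to target-schema attributes, and each attribute is matched to the source column whose values overlap its examples most.

    def match_columns(source_columns, data_context):
        """Map each target attribute to the source column sharing the most values."""
        mapping = {}
        for attribute, examples in data_context.items():
            best = max(
                source_columns,
                key=lambda col: len(set(source_columns[col]) & set(examples)),
            )
            mapping[attribute] = best
        return mapping

    source_columns = {
        "c1": ["Trieste", "Udine", "Gorizia"],
        "c2": ["34127", "33100", "34170"],
    }

    # Data context: instance values associated with portions of the target schema.
    data_context = {"city": ["Trieste", "Milano"], "postcode": ["34127", "20121"]}

    print(match_columns(source_columns, data_context))  # {'city': 'c1', 'postcode': 'c2'}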
How does the quality of the results differ between manual
and automated approaches?
Manual and automated approaches are not comparable in absolute terms
of result quality, because the quality of the results changes
with the data being prepared.
The manual approach will probably remain relevant when the data comes from
a few well-understood transactional databases used to populate a data
warehouse. A data warehouse is a database used by an organization as the
single place where the latest, most accurate data resides. In this scenario
a manual approach gives high-quality results, and the analyses thus appear
trustworthy.
Instead, when we consider data coming from data lakes, only a few studies
compare the automatic and manual approaches. A data lake is a collection
of data stored in its natural/raw format, usually object blobs or files. Data
lakes are challenging because they contain many different and fast-changing
data sets that vary in quality and relevance. For this task, we
have few results on the automation of the single steps, but they are
promising.
In the automatic, single-step scenario, some studies empirically
evaluate the effect of taking user feedback as input and of
how much feedback is given. However, these studies often report
mixed findings.
Furthermore, considering the automatic approaches, a fundamental question
remains: is it better to automatically solve individual steps of data preparation
or to address the problem as a whole? While focusing on the individual
steps gives more control, end-to-end solutions have lower costs, enable
positive synergy between the steps, and avoid programming altogether.
When is an automated approach mandatory
because the manual one is not viable?
In some cases, a manual approach is not possible. Automating data
preparation then makes it possible to obtain information that would otherwise be lost.
The main examples are:
• Big data deals with data sets characterized by the so-called three V's.
The first V refers to a large Volume of data. The second is Veracity,
which refers to the fact that the quality of data is often variable and
may include false data. The third is Velocity, which refers to
the speed of generation and of the required analysis. These features make big
data unsuitable for manual preparation.
• There are no economic or human resources: the vast majority of
ICT businesses employ a small number of people. They may not be
able to afford manual data preparation or to staff sufficiently large
teams for it.
Furthermore, for some data preparation steps a manual approach hardly
produces good results. The automatic approach also makes it easier to set
parameters with the end-to-end problem in mind, leading to a better outcome.
Conclusion
Automating data preparation is important for all businesses that work with
large amounts of data because it lowers costs and saves time.
As N. Paton explained, this automation can be pursued with different
approaches: focusing on the single steps or on the entire process. Additional
input data is still needed to inform the decisions, together with feedback from a
user with knowledge of the data domain. It would be important to have
more research on automating all the different data preparation steps and
on varying the evidence they use. End-to-end data preparation is also
very promising but still needs more work.
Analysing big data will likely become more and more common. At
least in the United Kingdom, ICT companies are still mostly
small businesses that employ few people. Both situations make the
traditional manual approach unreasonable. In conclusion, research on this
topic will grow in importance.