UNIVERSITÀ DEGLI STUDI DI TRIESTE
Dipartimento di Ingegneria e Architettura
Laurea Triennale in Ingegneria Elettronica e Informatica

Summary of: "Automating Data Preparation: Can We? Should We? Must We?"

April 28, 2021

Candidate: Samuele Bertollo
Supervisor: Prof. Eric Medvet

Academic Year 2020/2021
Introduction
Data preparation is also known as data wrangling or ETL (Extract,
Transform, Load). N. Paton defines it as follows: "Data preparation
covers the discovery, selection, integration and cleaning of existing data sets
into a form that is suitable for analysis."
Data preparation is fundamental to the work of data scientists and consumes,
on average, about 80% of their working time. Data preparation can be divided
into individual steps: activities that data scientists usually perform manually.
This work requires programming skills, which carry a significant cost in terms
of time and money. By automating data preparation we can reduce these costs.
The goal of automating data preparation is to build applications whose
users specify what they want to obtain from data preparation
instead of describing the steps required to obtain it. With this approach,
programming knowledge is not required and the process is less
time-consuming.
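As a toy illustration of the contrast, consider the sketch below. The data and the imagined declarative target are invented for illustration; no real tool's API is shown.

```python
# Toy contrast between manual (imperative) and automated (declarative)
# data preparation. All names and data here are illustrative.

raw_rows = [
    {"name": " Alice ", "age": "34"},
    {"name": "BOB", "age": ""},
    {"name": "carol", "age": "29"},
]

# Manual approach: the data scientist spells out every step explicitly.
prepared = []
for row in raw_rows:
    name = row["name"].strip().title()              # step 1: normalise names
    age = int(row["age"]) if row["age"] else None   # step 2: parse ages
    if age is not None:                             # step 3: drop incomplete rows
        prepared.append({"name": name, "age": age})

# An automated tool would instead take only a description of the target
# (e.g. "a table with columns name: str, age: int, no missing values")
# and derive the equivalent steps itself.
print(prepared)
# [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 29}]
```

The three commented steps are exactly the kind of per-dataset, hand-written logic that automation aims to replace with a single statement of the desired output.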
In his paper "Automating Data Preparation: Can We? Should
We? Must We?" (2019), N. Paton discusses three main questions:
1. What techniques do we have to automate data preparation?
2. When are the results better than human-made data preparation?
3. When is an automated approach mandatory because a manual one
is not viable?
What techniques do we have to automate data
preparation?
There are two main strategies for automating data preparation: focusing on
the individual steps or solving the problem end-to-end.
In the single-step strategies, we need to provide some additional information
to enable automation. This additional information is called evidence. In some
cases, the evidence consists of just the source data or a few additional items.
In general, however, the more data we provide, the better the results. For
example, to automate the learning of a data transformation, we need to
provide some solved instances of the problem, which are called training data. This
means that someone is still required to discover these samples manually, but
some researchers are finding ways to automate the discovery as well.
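The idea of learning a transformation from solved instances can be sketched as follows. The candidate space here is invented and tiny; real transformation-learning systems search far richer program spaces.

```python
# Toy sketch of learning a data transformation from "training data":
# the system searches a small space of candidate transformations and
# keeps the one consistent with every example pair.

CANDIDATES = {
    "upper": str.upper,
    "lower": str.lower,
    "swap name": lambda s: ", ".join(reversed(s.split(" "))),
}

def learn(examples):
    """Return the first candidate reproducing all example outputs."""
    for name, fn in CANDIDATES.items():
        if all(fn(x) == y for x, y in examples):
            return name, fn
    return None

# Two solved instances supplied by the user suffice to pick a rule,
# which then generalises to unseen inputs:
name, fn = learn([("John Smith", "Smith, John"),
                  ("Ada Lovelace", "Lovelace, Ada")])
print(name, "->", fn("Grace Hopper"))  # swap name -> Hopper, Grace
```

The manual effort shifts from writing the transformation to supplying a handful of solved examples, which is exactly the cost the discovery-automation research mentioned above tries to remove.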
Good examples of end-to-end data preparation software are Data
TAMER and VADA.
Data TAMER is semi-automatic and uses training data. An important
characteristic of this system is that the user must provide feedback at each
step of the data preparation.
In contrast, VADA uses only evidence in the form of data context, that is,
instance values associated with portions of a target schema. With this software
the user can review the individual steps by giving feedback, but this is not
mandatory.
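As an invented illustration of what data context might look like (the structure below is purely illustrative, not VADA's actual input format):

```python
# Data context: instance values attached to portions of a target schema.
# Both the schema names and the sample values are made up for this sketch.
data_context = {
    "restaurant.name": ["Gino's", "La Piazza"],
    "restaurant.city": ["Trieste", "Udine"],
}

# From such hints a system can guess which source columns feed which
# target columns: a source column containing "Trieste" likely maps to
# restaurant.city, without the user writing any mapping code.
```
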
Both end-to-end solutions therefore require only a user with knowledge of the
application domain, and they do not require the user to specify how the data
should be prepared.
How does the quality of results differ between manual
and automated approaches?
Manual and automated approaches cannot be compared in absolute terms
of result quality, because the quality of results changes
with the data to be prepared.
The manual approach will probably remain relevant when we have data
from a small number of well-understood transactional
databases to populate a Data Warehouse. A Data Warehouse is a database
used by an organization as the single place where the latest, most accurate
data resides. In this scenario a manual approach gives high-quality results,
and thus the analyses appear trustworthy.
Instead, when we consider data coming from data lakes, only a few studies
compare the automatic and manual approaches. A data lake is a collection
of data stored in its natural/raw format, usually object blobs or files. Data
lakes are challenging because they contain many diverse and fast-changing
data sets that vary in quality and relevance. For this task we have only a
few results on automating the individual steps, but they are promising.
In the automatic, single-step scenario, some studies empirically
evaluate the effect of user feedback as input and of how much
feedback is given. However, these studies often have mixed findings.
Furthermore, considering the automatic approaches, a fundamental question
remains: is it better to automatically solve individual steps of data
preparation or to address the problem as a whole? While focusing on the individual
steps can give more control, end-to-end solutions have lower costs, enable
positive synergy between the steps, and avoid programming altogether.
When is an automated approach mandatory because
a manual one is not viable?
In some cases, a manual approach is not possible. In these cases, automating
data preparation makes it possible to obtain information that would otherwise
be lost. The main examples are:
• Big data deals with data sets characterized by the so-called three V's.
The first V refers to a large Volume of data. The second is Veracity,
which refers to the fact that the quality of data is often variable, and
some data may be false. The third is Velocity, which refers to
the speed of generation and analysis required. These features make big
data unsuitable for manual preparation.
• Lack of economic or human resources: the vast majority of
ICT businesses employ few people. They may not be able to
afford manual data preparation or to field sufficiently large teams
for it.
Furthermore, for some data preparation steps, manual preparation
hardly produces good results. Also, the automatic approach
makes it easier to set parameters with the end-to-end problem in mind,
leading to a better outcome.
Conclusion
Automating data preparation is important for all businesses that work with
large amounts of data because it lowers costs and saves time.
As N. Paton explained, this automation can be pursued with different
approaches: focusing on the individual steps or on the entire process.
Additional input data is still needed to inform the decisions, together with
feedback from a user with domain knowledge. More research would be valuable
on automating all the different data preparation steps and on varying the
evidence they use. End-to-end data preparation is also
very promising but still needs more work.
Analysing big data will likely become more and more common. At the
same time, at least in the United Kingdom, ICT companies are still commonly
small businesses that employ few people. Both situations make the
traditional manual approach unreasonable. In conclusion, research on this
topic will grow in importance.