10 STEPS FOR HIGH-QUALITY DATASETS
BY PIER GIUSEPPE DE MEO
#1
Keep your Datasets separate.
#2
Prepare a toolbox with a set of transformation processes (procedures, functions,
scripts, etc.) that can be reused.
#3
Group the transformations into logical categories (e.g. missing values, decodes,
normalization).
#4
For every category identified, select the subset of data in a Dataset that needs
this type of transformation and apply it; repeat this process on each Dataset separately.
#5
For every Dataset, if needed, enrich its data with derived information
(e.g. calculated fields, extracted sub-information).
#6
Define the minimum level of detail (granularity) shared across all Datasets (e.g. a
single transaction per day, groups of transactions per month).
#7
For every Dataset, group the data at the same level of granularity.
#8
Join all formatted Datasets into a single Master Dataset, based on the granularity defined.
#9
In the Master Dataset produced, check whether there exists a subset of data on
which to apply any of the transformations in the toolbox.
#10
In the Master Dataset produced, if needed, enrich the data with some extra
information (e.g. metrics from various Datasets combined to form a KPI,
decryption based on a combination of fields, etc.).
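
Steps #2 to #4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the field names (`country`, `status`), the two transformation functions, the `TOOLBOX` dictionary and the `apply_category` helper are all hypothetical. The sketch shows reusable transformations grouped by category and applied only to the subset of rows that needs them, one Dataset at a time.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Transform = Callable[[Row], Row]


def fill_missing_country(row: Row) -> Row:
    """Missing-values category: replace an absent country with a placeholder."""
    return {**row, "country": row.get("country") or "UNKNOWN"}


def decode_status(row: Row) -> Row:
    """Decodes category: map a short status code to a readable label."""
    labels = {"A": "active", "C": "closed"}  # hypothetical code table
    status = row.get("status")
    return {**row, "status": labels.get(status, status)}


# Step #3: the toolbox groups reusable transformations by category.
TOOLBOX: Dict[str, List[Transform]] = {
    "missing_values": [fill_missing_country],
    "decodes": [decode_status],
}


def apply_category(
    rows: List[Row], category: str, needs_fix: Callable[[Row], bool]
) -> List[Row]:
    """Step #4: apply one category only to the rows selected by needs_fix."""
    result = []
    for row in rows:
        if needs_fix(row):
            for transform in TOOLBOX[category]:
                row = transform(row)  # each transform returns a new dict
        result.append(row)
    return result


# One Dataset, processed on its own (step #1: Datasets stay separate).
customers: List[Row] = [
    {"id": "1", "country": "IT", "status": "A"},
    {"id": "2", "country": None, "status": "C"},
]
cleaned = apply_category(customers, "missing_values", lambda r: not r.get("country"))
cleaned = apply_category(cleaned, "decodes", lambda r: r.get("status") in {"A", "C"})
```

Because every transformation returns a new row instead of modifying its input, the source Dataset stays untouched and the same toolbox can be reapplied later to the Master Dataset (step #9).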
Knowledge Share Series 1: DATASETS
A "Divide et impera" approach to producing high-quality Datasets for data analysts.
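
The "divide" and "combine" halves of the approach (steps #6 to #8, plus a step #10 style KPI) might look like the following Python sketch. Everything specific here is an assumption made for illustration: the two sample Datasets, the `timestamp`, `amount` and `count` fields, the choice of one row per day as the shared granularity, and the helper names `aggregate_by_day` and `join_on_day`.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Row = Dict[str, str]


def aggregate_by_day(rows: List[Row], value_field: str) -> Dict[str, float]:
    """Step #7: group one Dataset at the shared granularity (one total per day)."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        day = row["timestamp"][:10]  # assumes ISO 8601 timestamps (YYYY-MM-DD...)
        totals[day] += float(row[value_field])
    return dict(totals)


def join_on_day(**datasets: Dict[str, float]) -> List[Dict[str, Optional[object]]]:
    """Step #8: join the aggregated Datasets on the day key.

    Days missing from a Dataset get None for that Dataset's column.
    """
    days = sorted(set().union(*(data.keys() for data in datasets.values())))
    return [
        {"day": day, **{name: data.get(day) for name, data in datasets.items()}}
        for day in days
    ]


# Two hypothetical Datasets, kept separate until they share a granularity.
sales = [
    {"timestamp": "2024-01-01T09:30:00", "amount": "10.0"},
    {"timestamp": "2024-01-01T15:00:00", "amount": "5.0"},
    {"timestamp": "2024-01-02T11:00:00", "amount": "7.5"},
]
visits = [
    {"timestamp": "2024-01-01T08:00:00", "count": "3"},
    {"timestamp": "2024-01-02T08:00:00", "count": "5"},
]

master = join_on_day(
    sales=aggregate_by_day(sales, "amount"),
    visits=aggregate_by_day(visits, "count"),
)

# Step #10: enrich the Master Dataset with a KPI built from both sources.
for row in master:
    if row["sales"] is not None and row["visits"]:
        row["sales_per_visit"] = row["sales"] / row["visits"]
    else:
        row["sales_per_visit"] = None
```

Aggregating each Dataset before the join keeps the join key unique per Dataset, so the Master Dataset has exactly one row per day and no totals are duplicated by the join.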
