Data Preparation and Cleaning
February 22, 2016
Matteo Manca matteo.manca@eurecat.org
Matteo Manca
Researcher @ Eurecat (Social Media group), Barcelona
PhD @ Cagliari, Italy
Research interests:
• social media mining
• social network analysis
• computational social science
• data science
Contacts:
matteo.manca@eurecat.org
https://mattemanca.wordpress.com
Chapter index
• Topic 1: Big Data Economy
• Topic 2: Environment
• Topic 3: Data Exploration
• Topic 4: Data Ingestion & Storage
• Topic 5: Data Preparation — Cleaning
• Topic 6: Distributed Systems (Hadoop)
• Topic 7: Distributed Analytics (PIG)
Topics
Big data
• Why are we interested in data preparation and cleaning?
• Introduction to data pre-processing and cleaning (main concepts and main steps)
• Best practices
• Data pre-processing and cleaning in R: step-by-step tutorial
Data Preparation — Cleaning
Why are we interested in data pre-processing and cleaning? Let’s analyse our data!
1. Average test score?
2. Most common year?
3. % of male and female?
[Figure: raw data]
Why are we interested in data pre-processing and cleaning?
Raw data
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
• Data analysts spend much, if not most, of their time preparing the data before doing the analysis
• A common rule of thumb is that about 80% of data mining and analysis work is really data preparation
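The three kinds of problems listed above can be made concrete with a small check. The following is only a minimal sketch in Python (the deck's own tutorial uses R); the records, the column names, the gender coding scheme and the 0 to 100 score range are all assumptions invented for illustration.

```python
# Hypothetical raw records; names, columns and values are invented.
raw = [
    {"name": "Ana",   "gender": "F",      "year": "2014", "score": "78"},
    {"name": "Luis",  "gender": "male",   "year": "2014", "score": ""},
    {"name": "Marta", "gender": "Female", "year": "2015", "score": "870"},
    {"name": "Pau",   "gender": "M",      "year": "2015", "score": "65"},
]

VALID_GENDER_CODES = {"F", "M"}   # assumed coding scheme
SCORE_RANGE = (0, 100)            # assumed valid range for a test score


def diagnose(record):
    """Return the list of problems found in one record."""
    problems = []
    # Incomplete: at least one attribute value is missing.
    if any(value == "" for value in record.values()):
        problems.append("incomplete")
    # Noisy: a value that is present but outside the plausible range.
    score = record["score"]
    if score != "" and not SCORE_RANGE[0] <= float(score) <= SCORE_RANGE[1]:
        problems.append("noisy")
    # Inconsistent: the same category written with different codes.
    if record["gender"] not in VALID_GENDER_CODES:
        problems.append("inconsistent")
    return problems


report = {record["name"]: diagnose(record) for record in raw}
for name, problems in report.items():
    print(name, problems or "ok")
```

A report like this is only a diagnosis; deciding how to repair each record is a separate step.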
Data Pre-processing and Cleaning
The process of transforming raw data into consistent data that can be analyzed.
Consistent data is data that is ready for analysis.
Main steps:
• Handle missing values (ignore the tuple, fill the missing value with the mean/mode value, predict it, etc.)
• Identify or remove outliers
• Resolve inconsistencies
• Data transformation: normalization and aggregation
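The main steps can be sketched on a toy column of test scores. This is a hedged illustration in Python (the deck's own tutorial uses R), not the author's procedure: the data are invented, the 1.5 × IQR rule and min-max scaling are just common conventions picked here, and handling outliers before imputing (so that they do not distort the mean) is a choice made for this example.

```python
from statistics import mean, quantiles

# Hypothetical data, invented for illustration. None marks a missing value.
scores = [72.0, None, 65.0, 80.0, 870.0, 77.0, None, 69.0, 74.0, 71.0]
genders = ["F", "male", "Female", "M", "f", "M", "F", "m", "F", "M"]

# Identify outliers with the 1.5 * IQR rule (one common convention).
observed = [s for s in scores if s is not None]
q1, _, q3 = quantiles(observed, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = [s for s in observed if not low <= s <= high]

# Handle missing values: treat outliers as missing too, then fill every
# gap with the mean of the remaining observed values.
kept = [s for s in observed if low <= s <= high]
fill = mean(kept)
filled = [s if s is not None and low <= s <= high else fill for s in scores]

# Resolve inconsistencies: map every spelling to one agreed code.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}
clean_genders = [GENDER_CODES[g.lower()] for g in genders]

# Transformation 1, normalization: min-max scaling to the range [0, 1].
smallest, largest = min(filled), max(filled)
normalized = [(s - smallest) / (largest - smallest) for s in filled]

# Transformation 2, aggregation: mean score per gender.
by_gender = {}
for gender, score in zip(clean_genders, filled):
    by_gender.setdefault(gender, []).append(score)
mean_by_gender = {gender: mean(values) for gender, values in by_gender.items()}

print("outliers:", outliers)
print("mean by gender:", mean_by_gender)
```

Mean imputation is the simplest of the options the slide lists; ignoring the tuple or predicting the value would replace only the `filled` line.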
[Diagram: raw data → data pre-processing and cleaning → consistent data]
Data Pre-processing and Cleaning
Consistent Data
• Each variable you measure should be in one column
• Each different observation (record) should be in a different row
• If we are working with different kinds of variables, each kind should have its own data frame, and the data frames should be linked to each other (for example, through a shared key column)
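As a rough illustration of these three rules, the sketch below reshapes a "wide" table into one with one variable per column and one observation per row, then links it to a second table. It is written in Python (the deck's own tutorial uses R), and the tables, the column names and the `student` key are hypothetical.

```python
# Hypothetical "wide" table: the variable "year" is hidden in column names.
wide = [
    {"student": "Ana", "score_2014": 78, "score_2015": 82},
    {"student": "Pau", "score_2014": 65, "score_2015": 70},
]

# One variable per column (student, year, score), one observation per row.
tidy_scores = []
for row in wide:
    for column, value in row.items():
        if column.startswith("score_"):
            tidy_scores.append({
                "student": row["student"],
                "year": int(column.split("_", 1)[1]),
                "score": value,
            })

# A different kind of variable lives in its own table, linked to the first
# one through the shared "student" key.
students = [
    {"student": "Ana", "gender": "F"},
    {"student": "Pau", "gender": "M"},
]
gender_of = {s["student"]: s["gender"] for s in students}
joined = [{**row, "gender": gender_of[row["student"]]} for row in tidy_scores]

for row in joined:
    print(row)
```

Keeping `students` separate avoids repeating each student's attributes on every score row; the join is done only when an analysis needs both.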
Data Pre-processing and Cleaning
Best practices
• Pipeline: an explicit “recipe” used to go from step i to step i+1 (all steps should be recorded)
• A code book that describes each variable and its values in the tidy dataset
• Make variable names human-readable
• Save your clean / consistent data to files, to avoid repeating the pre-processing and data cleaning each time (one file per data frame / table)
• Markdown (.md) files are usually used (https://en.wikipedia.org/wiki/Markdow
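These practices can be combined in a few lines. The sketch below is a minimal Python illustration (the deck's own tutorial uses R) under stated assumptions: the two cleaning steps, the column names, the file names `scores_clean.csv` and `CODEBOOK.md`, and the code book wording are all hypothetical, and pairing the code book with a Markdown file is one reading of the slide.

```python
import csv
import tempfile
from pathlib import Path


def drop_incomplete(rows):
    """Step 1: ignore tuples that have an empty value."""
    return [r for r in rows if all(v != "" for v in r.values())]


def rename_columns(rows):
    """Step 2: replace cryptic column names with human-readable ones."""
    names = {"sc": "test_score", "yr": "year"}
    return [{names.get(k, k): v for k, v in r.items()} for r in rows]


# The explicit, ordered recipe: step i feeds step i+1.
PIPELINE = [drop_incomplete, rename_columns]

# Hypothetical raw rows with cryptic column names.
raw = [
    {"sc": "78", "yr": "2014"},
    {"sc": "", "yr": "2015"},
    {"sc": "65", "yr": "2015"},
]

# Run the pipeline and record what every step did.
log = []
data = raw
for step in PIPELINE:
    data = step(data)
    log.append(f"{step.__name__}: {len(data)} rows")

# Save the clean table once (one file per table) so that later analyses
# can start from it instead of repeating the cleaning.
out_dir = Path(tempfile.mkdtemp())
with open(out_dir / "scores_clean.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(data[0]))
    writer.writeheader()
    writer.writerows(data)

# A small code book in Markdown describing each variable (invented text).
codebook = (
    "# Code book\n\n"
    "- `test_score`: test score\n"
    "- `year`: year of the test\n"
)
(out_dir / "CODEBOOK.md").write_text(codebook)

print("\n".join(log))
```

Keeping the steps in a list makes the recipe itself the record: adding, removing or reordering a step is visible in one place.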
Data Pre-processing and Cleaning in R
R is a free software environment for statistical computing and graphics (https://www.r-project.org).
RStudio is a user interface for R (https://www.rstudio.com).
Questions?
© 2015, Barcelona Technology School ed X.X DD/MM/2015
www.barcelonatechnologyschoo.com
References
1. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
2. https://www.coursera.org/learn/data-cleaning
3. https://www.coursera.org/learn/r-programming
4. http://www.r-bloggers.com

Introduction to data pre-processing and cleaning

  • 1. Data Preparation and Cleaning February 22, 2016 Matteo Manca matteo.manca@eurecat.org
  • 2. Matteo Manca, Researcher @ Eurecat (Social Media group), BCN. PhD @ Cagliari, Italy. Research interests: • social media mining • social network analysis • computational social science • data science. Contacts: matteo.manca@eurecat.org, https://mattemanca.wordpress.com
  • 3. Chapter index. Big data topics: • Topic 1: Big Data Economy • Topic 2: Environment • Topic 3: Data Exploration • Topic 4: Data Ingestion & Storage • Topic 5: Data Preparation and Cleaning • Topic 6: Distributed Systems (Hadoop) • Topic 7: Distributed Analytics (PIG)
  • 4. Data Preparation and Cleaning: • Why are we interested in data preparation and cleaning? • Introduction to data pre-processing and cleaning (main concepts and main steps) • Best practices • Data pre-processing and cleaning in R: step-by-step tutorial
  • 5. Why are we interested in data pre-processing and cleaning? Let's analyse our data! 1. Average test score? 2. Most common year? 3. % of male and female? (Figure: raw data.)
  • 6. Why are we interested in data pre-processing and cleaning? Raw data is often: • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • Noisy: containing errors or outliers • Inconsistent: containing discrepancies in codes or names. • A data analyst spends much, if not most, of their time preparing the data before doing the analysis • 80% of data mining and analysis is really data preparation.
  • 7. Data Pre-processing and Cleaning: the process of transforming raw data into consistent data that can be analyzed. Consistent data is the stage at which the data is ready for analysis. Main steps: • Handle missing values (ignore the tuple, fill the missing value with the mean/mode value, predict it, etc.) • Identify or remove outliers • Resolve inconsistencies • Data transformation: normalization and aggregation. (Diagram: raw data → data pre-processing and cleaning → consistent data.)
  • 8. Data Pre-processing and Cleaning: Consistent Data • Each variable you measure should be in one column • Each different observation (record) should be in a different row • If we are working with different kinds of variables, they should be in different data frames that are linked to each other. (Diagram: raw data → data pre-processing and cleaning → consistent data.)
  • 9. Data Pre-processing and Cleaning: Best practices • Pipeline: an explicit “recipe” used to go from step i to step i+1 (all steps should be recorded) • A code book that describes each variable and its values in the tidy dataset • Make variable names human readable • Save your clean / consistent data to files to avoid repeating the pre-processing and data cleaning each time (one file per data frame / table) • Markdown (.md) files are usually used (https://en.wikipedia.org/wiki/Markdow
  • 10. Data Pre-processing and Cleaning in R: R is a free software environment for statistical computing and graphics (https://www.r-project.org). RStudio is a user interface for R (https://www.rstudio.com).
  • 11. Questions?
  • 13. Data Pre-processing and Cleaning © 2015, Barcelona Technology School ed X.X DD/MM/2015 www.barcelonatechnologyschoo.com
  • 14. References: 1. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf 2. https://www.coursera.org/learn/data-cleaning 3. https://www.coursera.org/learn/r-programming 4. http://www.r-bloggers.com
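
The main steps listed on slide 7 (resolve inconsistencies, identify outliers, handle missing values, normalize) can be sketched on a small example. This is only an illustrative sketch, not the tutorial's own code: the course works in R, while the block below uses the Python standard library so that it runs without extra packages. The records, the field names (`name`, `gender`, `score`), the `GENDER_CODES` mapping and the helper names are all hypothetical. The 1.5 × IQR rule for outliers and the choice to treat an outlier as a missing value are assumptions made for the example; the slides do not prescribe either.

```python
from statistics import mean, quantiles

# Hypothetical raw records, invented for illustration only.
raw = [
    {"name": "Ann", "gender": "F", "score": 82.0},
    {"name": "Bob", "gender": "male", "score": None},    # missing value
    {"name": "Cid", "gender": "M", "score": 78.0},
    {"name": "Dee", "gender": "female", "score": 90.0},
    {"name": "Eli", "gender": "m", "score": 850.0},      # suspicious value
    {"name": "Fay", "gender": "f", "score": 74.0},
    {"name": "Gus", "gender": "M", "score": 86.0},
]

# Assumed mapping from the inconsistent codes above to one canonical code.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def iqr_bounds(values):
    """Return (low, high) fences using the 1.5 * IQR rule of thumb."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread


def clean_records(records):
    # Step 1: resolve inconsistencies in codes (work on copies of the rows).
    rows = [
        {**r, "gender": GENDER_CODES[r["gender"].strip().lower()]}
        for r in records
    ]

    # Step 2: identify outliers among the observed scores and
    # treat them as missing (an assumption of this sketch).
    observed = [r["score"] for r in rows if r["score"] is not None]
    low, high = iqr_bounds(observed)
    for r in rows:
        r["outlier"] = r["score"] is not None and not (low <= r["score"] <= high)
        if r["outlier"]:
            r["score"] = None

    # Step 3: fill missing values with the mean of the remaining scores.
    fill = mean([r["score"] for r in rows if r["score"] is not None])
    for r in rows:
        r["imputed"] = r["score"] is None
        if r["imputed"]:
            r["score"] = fill

    # Step 4: data transformation, here min-max normalization to [0, 1].
    lo = min(r["score"] for r in rows)
    hi = max(r["score"] for r in rows)
    for r in rows:
        r["score_scaled"] = (r["score"] - lo) / (hi - lo) if hi > lo else 0.0
    return rows


clean = clean_records(raw)

# The questions from slide 5 can now be answered on consistent data.
average_score = mean(r["score"] for r in clean)
share_female = sum(r["gender"] == "F" for r in clean) / len(clean)
print(f"average score: {average_score:.1f}, female: {share_female:.0%}")
```

Keeping the `outlier` and `imputed` flags next to the cleaned value records what was changed in each row, in the spirit of the "all steps should be recorded" best practice on slide 9.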

Editor's Notes

  1. Markdown is a lightweight markup language with plain text formatting syntax designed so that it can be converted to HTML and many other formats using a tool by the same name.