Introduction to data pre-processing and cleaning

Data Preparation and Cleaning
February 22, 2016
Matteo Manca matteo.manca@eurecat.org

Matteo Manca, Researcher @ Eurecat (Social Media group), BCN
PhD @ Cagliari, Italy
Research interests:
• social media mining
• social network analysis
• computational social science
• data science
Contacts:
matteo.manca@eurecat.org
https://mattemanca.wordpress.com

Chapter index
• Topic 1: Big Data Economy
• Topic 2: Environment
• Topic 3: Data Exploration
• Topic 4: Data Ingestion & Storage
• Topic 5: Data Preparation — Cleaning
• Topic 6: Distributed Systems (Hadoop)
• Topic 7: Distributed Analytics (PIG)

• Why are we interested in data preparation and cleaning?
• Introduction to data pre-processing and cleaning (main concepts and main steps)
• Best practices
• Data pre-processing and cleaning in R: step-by-step tutorial
Data Preparation — Cleaning

Why are we interested in data pre-processing and cleaning? Let's analyse our data!
1. Average test score?
2. Most common year?
3. % of male and female?
[Figure: raw data]

Why are we interested in data pre-processing and cleaning?
Raw data is often:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
• Data analysts spend much, if not most, of their time preparing the data before doing the analysis
• A commonly cited estimate is that about 80% of data mining and analysis work is really data preparation.
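As a rough illustration of the three problems listed above, the sketch below counts incomplete, noisy and inconsistent records in a tiny table. It is written in plain Python (standard library only) rather than R. The records, the `profile` function, the accepted gender codes and the 0 to 10 score range are all hypothetical, chosen only for this example.

```python
# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"student": "A", "gender": "F", "year": "2015", "score": "7.5"},
    {"student": "B", "gender": "female", "year": "2015", "score": None},
    {"student": "C", "gender": "M", "year": "2014", "score": "75"},
    {"student": "D", "gender": "m", "year": None, "score": "6.0"},
]


def profile(rows, valid_codes, score_range):
    """Count records that are incomplete, noisy or inconsistent.

    A record can be counted in more than one category.
    """
    low, high = score_range
    report = {"incomplete": 0, "noisy": 0, "inconsistent": 0}
    for row in rows:
        # Incomplete: at least one attribute value is missing.
        if any(value is None for value in row.values()):
            report["incomplete"] += 1
        # Noisy: the score falls outside the plausible range (assumed 0-10).
        score = row["score"]
        if score is not None and not (low <= float(score) <= high):
            report["noisy"] += 1
        # Inconsistent: the gender code is not one of the agreed codes.
        if row["gender"] is not None and row["gender"] not in valid_codes:
            report["inconsistent"] += 1
    return report


report = profile(raw_rows, valid_codes={"F", "M"}, score_range=(0.0, 10.0))
print(report)
```

With records like these, the average test score, the most common year and the male/female percentages cannot be computed reliably until the flagged values are handled.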

Data Pre-processing and Cleaning
The process of transforming raw data into consistent data that can be analyzed.
Data is consistent when it is ready for analysis.
Main steps:
• Handle missing values (ignore the tuple, fill the missing value with the mean/mode value, predict it, etc.)
• Identify or remove outliers
• Resolve inconsistencies
• Data transformation: normalization and aggregation
[Figure: raw data -> data pre-processing and cleaning -> consistent data]
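The main steps can be sketched with a few small functions. This is a minimal illustration in plain Python (standard library only), not the R code of the tutorial. All function names and sample values are hypothetical. The 1.5 x IQR rule used for outliers is one common rule of thumb, picked here as an assumption, since the slides do not name a specific method.

```python
from collections import defaultdict
from statistics import mean, quantiles


def fill_missing_with_mean(values):
    """Missing values: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def drop_outliers_iqr(values, k=1.5):
    """Outliers: keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if q1 - k * spread <= v <= q3 + k * spread]


def recode(values, mapping):
    """Inconsistencies: map the different spellings of a code to one label."""
    return [mapping.get(v.strip().lower(), v) for v in values]


def min_max_normalize(values):
    """Transformation: rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def mean_by_group(rows, key, value):
    """Transformation: aggregate one column as the mean per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(items) for group, items in groups.items()}


filled = fill_missing_with_mean([6.0, None, 8.0, 7.0])
kept = drop_outliers_iqr([5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 75.0])
genders = recode(
    ["F", "female", "M", "m "],
    {"f": "F", "female": "F", "m": "M", "male": "M"},
)
scaled = min_max_normalize([5.0, 7.5, 10.0])
per_year = mean_by_group(
    [
        {"year": 2014, "score": 6.0},
        {"year": 2015, "score": 7.0},
        {"year": 2015, "score": 9.0},
    ],
    key="year",
    value="score",
)
```

The order of the steps matters in practice: filling with the mean before removing an extreme value such as 75.0 would pull the filled value towards that outlier.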

Consistent Data
• Each variable you measure should be in one column
• Each different observation (record) should be in a different row
• If we are working with different kinds of variables, there should be a separate data frame for each kind, and the data frames should be linked to each other (e.g. through a shared key column)
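To make these rules concrete, here is a small sketch in plain Python (standard library only), not R. It reshapes a table that spreads one variable over several columns into one row per observation, then links it to a second table through a shared key. The column names (`student_id`, `score_2014`, `score_2015`, `gender`) and the `to_long` helper are hypothetical.

```python
# Hypothetical "wide" table: the score variable is spread over two columns.
wide_rows = [
    {"student_id": 1, "score_2014": 6.0, "score_2015": 7.5},
    {"student_id": 2, "score_2014": 8.0, "score_2015": 9.0},
]


def to_long(rows, id_column, prefix):
    """Turn columns named <prefix><year> into one row per (id, year)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                long_rows.append(
                    {
                        id_column: row[id_column],
                        "year": int(column[len(prefix):]),
                        "score": value,
                    }
                )
    return long_rows


# One variable per column, one observation per row.
scores = to_long(wide_rows, "student_id", "score_")

# A second table about the students, linked to `scores` by student_id.
students = {1: {"gender": "F"}, 2: {"gender": "M"}}

# The shared key lets us combine the two tables whenever we need to.
joined = [{**row, **students[row["student_id"]]} for row in scores]
```

Keeping student attributes in their own table avoids repeating them on every score row, and the key column is enough to bring the tables back together.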

Best practices
• Pipeline: an explicit "recipe" used to go from step i to step i+1 (all steps should be recorded)
• A code book that describes each variable and its values in the tidy dataset
• Make variable names human-readable
• Save your clean / consistent data to files to avoid repeating the pre-processing and data cleaning each time (one file per data frame / table)
• Markdown (.md) files are usually used
(https://en.wikipedia.org/wiki/Markdow
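One possible way to follow these practices is sketched below in plain Python (standard library only), not R. It shows a pipeline whose steps are named and logged, a code book rendered as Markdown lines, and the clean table written as CSV. Every name here (`PIPELINE`, `run_pipeline`, `write_csv`, `CODE_BOOK` and the two cleaning steps) is hypothetical.

```python
import csv
import io


def strip_names(rows):
    """Remove surrounding whitespace from student names."""
    return [{**row, "student_name": row["student_name"].strip()} for row in rows]


def drop_incomplete(rows):
    """Drop rows that contain a missing (None) value."""
    return [row for row in rows if None not in row.values()]


# The explicit recipe: every step has a description and a function.
PIPELINE = [
    ("strip whitespace from names", strip_names),
    ("drop rows with missing values", drop_incomplete),
]

# The code book: one human-readable description per variable.
CODE_BOOK = {
    "student_name": "Name of the student",
    "test_score": "Test score (assumed scale: 0 to 10)",
}


def run_pipeline(rows, steps):
    """Apply each step in order and record what it did."""
    log = []
    for description, step in steps:
        before = len(rows)
        rows = step(rows)
        log.append(f"{description}: {before} -> {len(rows)} rows")
    return rows, log


def write_csv(rows, fieldnames, handle):
    """Write one table to an open text file (one file per table)."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


raw = [
    {"student_name": " Ana ", "test_score": 7.5},
    {"student_name": "Ben", "test_score": None},
]
clean_rows, log = run_pipeline(raw, PIPELINE)

# An in-memory buffer stands in for a real file here; on disk this would be
# something like open("clean_scores.csv", "w", newline="").
buffer = io.StringIO()
write_csv(clean_rows, list(CODE_BOOK), buffer)
csv_text = buffer.getvalue()

# The code book as Markdown list items, ready to be saved in a .md file.
code_book_md = [f"- `{name}`: {text}" for name, text in CODE_BOOK.items()]
```

Because the steps live in a single list, the log doubles as the record of how the raw data became the clean file.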

Data Pre-processing and Cleaning in R
RStudio is a user interface for R (https://www.rstudio.com)
R is a free software environment for statistical computing and graphics (https://www.r-project.org)

Questions?

References
1. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
2. https://www.coursera.org/learn/data-cleaning
3. https://www.coursera.org/learn/r-programming
4. http://www.r-bloggers.com
