Data Preparation and Cleaning
February 22, 2016
Matteo Manca matteo.manca@eurecat.org
Matteo Manca Researcher @ Eurecat
(Social Media group)- BCN
PhD @ Cagliari – Italy
Research interests:
• social media mining
• social network analysis
• computational social science
• data science
Contacts:
matteo.manca@eurecat.org
https://mattemanca.wordpress.com
Chapter index
• Topic 1: Big Data Economy
• Topic 2: Environment
• Topic 3: Data Exploration
• Topic 4: Data Ingestion & Storage
• Topic 5: Data Preparation — Cleaning
• Topic 6: Distributed Systems (Hadoop)
• Topic 7: Distributed Analytics (PIG)
Topics
Big data
• Why are we interested in data preparation and cleaning?
• Introduction to data pre-processing and cleaning (main concepts and main steps)
• Best practices
• Data pre-processing and cleaning in R: step-by-step tutorial
Data Preparation — Cleaning
Why are we interested in data pre-processing
and cleaning? Let’s analyse our data!
1. Average test score?
2. Most common year?
3. % of male and
female?
Raw data
Why are we interested in data pre-processing and cleaning?
Raw data
• Incomplete: lacking attribute
values, lacking certain attributes
of interest, or containing only
aggregate data
• Noisy: containing errors or
outliers
• Inconsistent: containing
discrepancies in codes or names
• Data analysts spend much, if not most, of their time preparing the data before doing the analysis
• A commonly quoted rule of thumb is that 80% of data mining and analysis is really data preparation
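The three problems listed above can be made concrete with a small audit. The sketch below is written in Python with the standard library only (not the R used in the tutorial), and it runs on hypothetical records, not on the raw data shown on the slide. The field names and the 0 to 100 score range are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical raw records (NOT the data shown on the slide).
raw = [
    {"score": "78",  "year": "2014", "gender": "M"},
    {"score": "",    "year": "2014", "gender": "male"},
    {"score": "85",  "year": "14",   "gender": "F"},
    {"score": "850", "year": "2015", "gender": "female"},
    {"score": "91",  "year": "2015", "gender": "f"},
]

def audit(records):
    """Count incomplete, noisy, and inconsistent values in the records."""
    # Incomplete: the score is an empty string.
    missing_scores = sum(1 for r in records if r["score"].strip() == "")
    # Noisy: assumption for this sketch, valid scores lie between 0 and 100.
    out_of_range = sum(
        1 for r in records
        if r["score"].strip() and not 0 <= float(r["score"]) <= 100
    )
    # Inconsistent: the same category written with different codes.
    gender_codes = Counter(r["gender"] for r in records)
    return missing_scores, out_of_range, gender_codes

missing, noisy, codes = audit(raw)
print("missing scores:", missing)
print("out-of-range scores:", noisy)
print("distinct gender codes:", sorted(codes))
```

With records like these, an average score, a most common year, or a male/female percentage computed directly on the raw values would be misleading, which is the motivation for the cleaning steps that follow.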
Data Pre-processing and Cleaning
The process of transforming raw data into consistent data that can be analyzed.
Consistent data is data that is ready for analysis.
Main steps:
• Handle missing values (ignore the tuple, fill the missing value with the mean or mode, predict it, etc.)
• Identify or remove outliers
• Resolve inconsistencies
• Data transformation: normalization and aggregation
[Diagram: raw data -> data pre-processing and cleaning -> consistent data]
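As a minimal sketch of the main steps, the Python code below (standard library only, not the R used in the tutorial) removes outliers, fills a missing value with the mean, and normalizes the result. The scores are hypothetical, and the specific choices (Tukey fences with k = 1.5 for outliers, min-max scaling for normalization) are assumptions for illustration; the slide does not prescribe a particular method.

```python
from statistics import mean, quantiles

# Hypothetical test scores; None marks a missing value.
scores = [78.0, None, 85.0, 850.0, 91.0, 66.0, 72.0, 88.0]

def flag_outliers(values, k=1.5):
    """Return the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def min_max(values):
    """Rescale the values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Step 1: identify outliers among the observed values and drop them.
observed = [s for s in scores if s is not None]
outliers = flag_outliers(observed)
kept = [s for s in scores if s not in outliers]

# Step 2: fill the remaining missing values with the mean of the rest.
fill = mean([s for s in kept if s is not None])
filled = [fill if s is None else s for s in kept]

# Step 3: data transformation (normalization).
normalized = min_max(filled)

print("outliers:", outliers)
print("fill value:", fill)
print("normalized:", [round(v, 2) for v in normalized])
```

The order matters in this sketch: the outlier is removed before the mean is computed, so that an extreme value does not distort the value used to fill the gap.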
Data Pre-processing and Cleaning
Consistent Data
• Each variable you measure
should be in one column
• Each different observation
(record) should be in a different
row
• If we are working with different kinds of variables, there should be separate data frames linked to each other (for example, by a shared identifier column)
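The rules above can be illustrated with a small reshaping sketch in Python (standard library only, not the R used in the tutorial). The table, the column names, and the `tidy` helper are hypothetical. A "wide" table with one column per exam is split into two linked tables: one row per student, and one row per observation (a student's score in one exam).

```python
# Hypothetical "wide" table: one row per student, one column per exam.
wide = [
    {"student_id": 1, "name": "Ana", "exam1": 78, "exam2": 85},
    {"student_id": 2, "name": "Pau", "exam1": 91, "exam2": 66},
]

def tidy(rows, exam_columns=("exam1", "exam2")):
    """Split a wide table into a students table and a scores table."""
    # One table describing students, one row per student.
    students = [{"student_id": r["student_id"], "name": r["name"]} for r in rows]
    # One table of observations: each row is one score in one exam.
    exam_scores = [
        {"student_id": r["student_id"], "exam": exam, "score": r[exam]}
        for r in rows
        for exam in exam_columns
    ]
    return students, exam_scores

students, exam_scores = tidy(wide)

# The shared student_id column links the two tables back together.
names = {s["student_id"]: s["name"] for s in students}
for row in exam_scores:
    print(names[row["student_id"]], row["exam"], row["score"])
```

In the tidy form every variable (student, exam, score) has its own column, so questions such as "average score per exam" need no special handling of column names.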
Data Pre-processing and Cleaning
Best practices
• Pipeline: an explicit “recipe” used to go from step i to step i+1 (all steps should be recorded)
• A code book that describes each variable and its values in the tidy dataset
• Make variable names human-readable
• Save your clean / consistent data to files to avoid repeating the pre-processing and data cleaning (DC) each time (one file per data frame / table)
• Markdown (.md) files are usually used
(https://en.wikipedia.org/wiki/Markdow
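One possible way to follow these practices is sketched below in Python (standard library only, not the R used in the tutorial). Everything in it is hypothetical: the records, the step functions, and the file names `scores_clean.csv` and `CODEBOOK.md`. The pipeline is an ordered list of functions, the names of the steps are recorded as they run, and the clean table and the code book are saved to files.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical raw records with cryptic column names.
raw = [
    {"ts": " 78", "g": "M "},
    {"ts": "91 ", "g": "female"},
]

# Code book: one entry per variable in the clean dataset.
CODEBOOK = {
    "test_score": "Test score (0 to 100)",
    "gender": "Gender, one of: male, female",
}

def rename_columns(rows):
    """Step 1: replace cryptic names with human-readable ones."""
    return [{"test_score": r["ts"], "gender": r["g"]} for r in rows]

def strip_whitespace(rows):
    """Step 2: remove stray spaces around every value."""
    return [{k: v.strip() for k, v in r.items()} for r in rows]

def standardise_gender(rows):
    """Step 3: map the different gender codes to a single spelling."""
    codes = {"m": "male", "male": "male", "f": "female", "female": "female"}
    return [{**r, "gender": codes[r["gender"].lower()]} for r in rows]

PIPELINE = [rename_columns, strip_whitespace, standardise_gender]

def run_pipeline(rows, steps):
    """Apply the steps in order and record the name of each one."""
    log = []
    for step in steps:
        rows = step(rows)
        log.append(step.__name__)
    return rows, log

def save(rows, codebook, log, out_dir):
    """Write the clean table to CSV and the code book to Markdown."""
    out_dir = Path(out_dir)
    with open(out_dir / "scores_clean.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(codebook))
        writer.writeheader()
        writer.writerows(rows)
    lines = ["# Code book", ""]
    lines += [f"- `{name}`: {desc}" for name, desc in codebook.items()]
    lines += ["", "# Pipeline", ""]
    lines += [f"{i}. {name}" for i, name in enumerate(log, start=1)]
    (out_dir / "CODEBOOK.md").write_text("\n".join(lines) + "\n", encoding="utf-8")

clean, steps_run = run_pipeline(raw, PIPELINE)
out_dir = tempfile.mkdtemp()  # temporary folder, used only for this sketch
save(clean, CODEBOOK, steps_run, out_dir)
print(clean)
print(steps_run)
```

Later analyses can then start from the saved CSV file instead of repeating the cleaning, and the Markdown file documents both the variables and the recipe that produced them.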
Data Pre-processing and Cleaning in R
R is a free software environment for statistical computing and graphics (https://www.r-project.org)
RStudio is a user interface for R (https://www.rstudio.com)
Questions?
Data Pre-processing and Cleaning
© 2015, Barcelona Technology School
www.barcelonatechnologyschoo.com
References
1. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
2. https://www.coursera.org/learn/data-cleaning
3. https://www.coursera.org/learn/r-programming
4. http://www.r-bloggers.com


Editor's Notes

  • #10 Markdown is a lightweight markup language with plain text formatting syntax designed so that it can be converted to HTML and many other formats using a tool by the same name.