SlideShare a Scribd company logo
1 of 45
Download to read offline
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A systematic approach to data cleaning with R
Mark van der Loo
markvanderloo.eu | @markvdloo
Budapest | September 3 2016
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Demos and other materials
https://github.com/markvanderloo/satRday
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Contents
The statistical value chain
From raw data to technically correct data
Strings and encoding
Regexp and approximate matching
Type coercion
From technically correct data to consistent data
Data validation
Error localization
Correction, imputation, adjustment
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
The statistical value chain
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Statistical value chain
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Concepts
Technically correct data
Well-defined format (data structure)
Well-defined types (numbers, date/time,string, categorical... )
Statistical units can be identified (persons, transactions, phone
calls...)
Variables can be identified as properties of statistical units.
Note: tidy data ⊂ technically correct data
Consistent data
Data satisfies demands from domain knowledge
(more on this when we talk about validation)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
From raw to technically correct data
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Dirty tabular data
Demo
Coercing while reading: /table
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Tabular data: long story short
read.table: R’s swiss army knife
fairly strict (no sniffing)
Very flexible
Interface could be cleaner (see this talk)
readr::read_csv
Easy to switch between strict/lenient parsing
Compact control over column types
Fast
Clear reports of parsing failure
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Really dirty data
Demo
Output file parsing: /parsing
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A few lessons from the demo
(base) R has great text processing tools.
Need to work with regular expressions1
Write many small functions extracting single data elements.
Don’t overgeneralize: adapt functions as you meet new input.
Smart use of existing tools (read.table(text=))
1
Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Packages for standard format parsing
jsonlite: parse JSON files
yaml: parse yaml files
xml2: parse XML files
rvest: scrape and parse HTML files
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Some tips on regular expressions with R
stringr has many useful shorthands for common tasks.
Generate regular expressions with rex
library(rex)
# recognize a number in scientific notation
rex(one_or_more(digit)
, maybe(".",one_or_more(digit))
, "E" %or% "e"
, one_or_more(digit))
## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Regular expressions
Express a pattern of text, e.g.
"(a|b)c*" = {"a", "ac", "acc", . . . , "b", "bc", "bcc", . . .}
Task stringr function:
string detection str_detect(string, pattern)
string extraction str_extract(string, pattern)
string replacement str_extract(string, pattern, replacement)
string splitting str_split(string, pattern)
Base R: grep grepl | regexpr regmatches | sub gsub | strsplit
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
String normalization
Bring a text string in a standard format, e.g.
Standardize upper/lower case (casefolding)
stringr: str_to_lower, str_to_upper, str_to_title
base R: tolower, toupper
Remove accents (transliteration)
stringi: stri_trans_general
base R: iconv
Re-encoding
stringi: stri_encode
base R: iconv
Uniformize encoding (unicode normalization)
stringi: stri_trans_nfkc (and more)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding in R
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding in R
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding in R
Demo
Normalization, re-encoding, transliteration: /strings
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A few tips
Detect encoding stringi::stri_enc_detect
Conversion options iconvlist() stringi::stri_enc_list()
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Approximate text matching
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Approximate text matching
Demo
Approximate matching and normalization: /matching
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Approximate text matching: edit-based distances
Allowed operation
Distance substitution deletion insertion transposition
Hamming    
LCS    
Levenshtein    
OSA    ∗
Damerau-
Levenshtein
   
∗Substrings may be edited only once.
leela → leea → leia
stringdist::stringdist(leela,leia,method=dl)
## [1] 2
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Some pointers for approximate matching
Normalisation and approximate matching are complementary
See my useR2014 talk or paper on stringdist for more distances
The fuzzyjoin package allows fuzzy joining of datasets
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Other good stuff
lubridate: extract dates from strings
lubridate::dmy(17 December 2015)
## [1] 2015-12-17
tidyr: many data cleaning operations to make your life easier
readr: Parse numbers from text strings
readr::parse_number(c(2%,6%,0.3%))
## [1] 2.0 6.0 0.3
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
From technically correct to consistent data
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
The mantra of data cleaning
Detection (data conflicts with domain knowledge)
Selection (find the value(s) that cause the violation)
Correction (replace them with better values)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Detection, AKA data validation
Informally:
Data Validation is checking data against (multivariate) expectations
about a data set.
Validation rules
Often these expectations can be expressed as a set of simple
validation rules.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Data validation
Demo
The validate package /validate
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
The validate package, in summary
Make data validation rules explicit
Treat them as objects of computation
store to / read from file
manipulate
annotate
Confront data with rules
Analyze/visualize the results
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Tracking changes when altering data
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Tracking changes in rule violations
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Use rules to correct data
Main idea
Rules restrict the data. Sometimes this is enough to derive a correct
value uniquely.
Examples
Correct typos in values under linear restrictions
123 + 45 = 177, but 123 + 54 = 177.
Derive imputations from values under linear restrictions
123 + NA = 177, compute 177 − 123 = 54.
Both can be generalized to systems Ax ≤ b.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Deductive correction and imputation
Demo
The deductive package: /deductive.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Selection, or: error localization
Fellegi and Holt (1976)
Find the least (weighted) number of fields that can be imputed such
that all rules can be satisfied.
Note
Solutions need not be unique.
Random one chosen in case of degeneracy.
Lowest weight need not guarantee smallest number of altered
variables.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Error localization
Demo
The errorlocate package: /errorlocate
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Notes on errorlocate
For in-record rules
Support for
linear (in)equality rules
Conditionals on categorical variables (if male then not pregnant)
Mixed conditionals (has job then age = 15)
Conditionals w/linear predicates (staff  0 then staff cost  0)
Optimization is mapped to MIP problem.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Missing values
Mechanisms (Rubin):
MCAR: missing completely at random
MAR: P(Y = NA) depends on value of X
MNAR: P(Y = NA) depends on value of Y
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Imputation
Purpose of imputation vs prediction
Prediction: estimate a single value (often for a single use)
Imputation: estimate values such that the completed data set
allows for valid inferencea
a
This is very difficult!
Imputation methods
Deductive imputation
Imputation based on predictive models
Donor imputation (knn, pmm, sequential/random hot deck )
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Predictive model-based imputation
ˆy = ˆf (x) +
e.g.Linear regression
ˆy = α + xT ˆβ +
Residual:
= 0 Impute expected value
drawn from observed residuals e
∼ N(0, σ) parametric residual, ˆσ2
= var(e)
Multiple imputation (Bayesian bootstrap)
Draw β from parametric distribution, impute multiple times.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Donor imputation (hot deck)
Method variants:
Random hot deck: copy value from random record.
Sequential hot deck: copy value from previous record.
k-nearest neighbours: draw donor from k neares neigbours
Predictive mean matching: copy value closest to prediction
Donor pool variants:
per variable
per missing data pattern
per record
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Note on multivariate donor imputation
Many multivariate methods seem relatively ad hoc, and more
theoretical and empirical comparisons with alternative approaches
would be of interest.
Andridge and Little (2010) A Review of Hot Deck Imputation for Survey
Non-response. Int. Stat. Rev. 78(1) 40–64
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Demo time
Demo
Imputation /imputation
VIM: visualisation, GUI, extensive methodology
simputation: simple, scriptable interface to common methods
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Methods supported by simputation
Model based (optionally add [non-]parametric random residual)
linear regression
robust linear regression
CART models
Random forest
Donor imputation (including various donor pool specifications)
k-nearest neigbour (based on gower’s distance)
sequential hotdeck (LOCF, NOCB)
random hotdeck
Predictive mean matching
Other
(groupwise) median imputation (optional random residual)
Proxy imputation (copy from other variable)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Credits
deductive Mark van der Loo, Edwin de Jonge
errorlocate Edwin de Jonge, Mark van der Loo
gower Mark van der Loo
jsonlite Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel
magrittr Stefan Milton Bache, Hadley Wickham
rex Kevin Ushey Jim Hester, Robert Krzyzanowski
simputation Mark van der Loo
stringdist Mark van der Loo, Jan van der Laan, R Core, Nick Logan
stringi Marek Gagolewski, Bartek Tartanus
stringr Hadley Wickham, RStudio
tidyr Hadley Wickham, RStudio
validate Mark van der Loo, Edwin de Jonge
VIM Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner
xml2 Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation
Mark van der Loo A systematic approach to data cleaning with R

More Related Content

Viewers also liked

Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Neha Ahuja
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศTepasoon Songnaa
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)Kittiya Youngjarean
 
Luyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcLuyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcltgiang87
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)Kittiya Youngjarean
 
Meaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. SinielMeaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. Sinielairazygy
 
Bridging the gap between theory and practise
Bridging the gap between theory and practiseBridging the gap between theory and practise
Bridging the gap between theory and practiseAnshul Punetha
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการKittiya Youngjarean
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลTepasoon Songnaa
 
Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Abertay University
 
Graphene light bulb set for shops
Graphene light bulb set for shopsGraphene light bulb set for shops
Graphene light bulb set for shopsskinnyengineer898
 

Viewers also liked (16)

Storyworld Jam '14
Storyworld Jam '14Storyworld Jam '14
Storyworld Jam '14
 
Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศ
 
Tugas 1
Tugas 1Tugas 1
Tugas 1
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)
 
Luyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcLuyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin học
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)
 
Meaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. SinielMeaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. Siniel
 
˹· 1 ǡѻ
˹· 1  ǡѻ˹· 1  ǡѻ
˹· 1 ǡѻ
 
Bridging the gap between theory and practise
Bridging the gap between theory and practiseBridging the gap between theory and practise
Bridging the gap between theory and practise
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการ
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอล
 
IoT Protocols
IoT ProtocolsIoT Protocols
IoT Protocols
 
TV3.0 New TV frontiers.
TV3.0 New TV frontiers.TV3.0 New TV frontiers.
TV3.0 New TV frontiers.
 
Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)
 
Graphene light bulb set for shops
Graphene light bulb set for shopsGraphene light bulb set for shops
Graphene light bulb set for shops
 

Similar to Sat rday

Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsData Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsHPCC Systems
 
HospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformHospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformAngela Razzell
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdfRohanBorgalli
 
Lec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignLec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignVajira Thambawita
 
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...AI Frontiers
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbaiUnmesh Baile
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners Jen Stirrup
 
Understanding R for Epidemiologists
Understanding R for EpidemiologistsUnderstanding R for Epidemiologists
Understanding R for EpidemiologistsTomas J. Aragon
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaSubhasish Guha
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentationnibraspk
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by CervedData Science Milan
 
computer architecture
computer architecture computer architecture
computer architecture Dr.Umadevi V
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesEdwin de Jonge
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsJen Stirrup
 
Software Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsSoftware Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsProgrameter
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxmyworld93
 

Similar to Sat rday (20)

Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsData Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
 
HospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformHospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics Platform
 
Unit 3-2.ppt
Unit 3-2.pptUnit 3-2.ppt
Unit 3-2.ppt
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 
Lec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignLec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic Design
 
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbai
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Understanding R for Epidemiologists
Understanding R for EpidemiologistsUnderstanding R for Epidemiologists
Understanding R for Epidemiologists
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentation
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by Cerved
 
computer architecture
computer architecture computer architecture
computer architecture
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rules
 
Cerved Datascience Milan
Cerved Datascience MilanCerved Datascience Milan
Cerved Datascience Milan
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
 
Software Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsSoftware Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and Metrics
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
 
R programming by ganesh kavhar
R programming by ganesh kavharR programming by ganesh kavhar
R programming by ganesh kavhar
 

Recently uploaded

Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证nhjeo1gg
 

Recently uploaded (20)

Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
 

Sat rday

  • 1. The statistical value chain From raw to technically correct data From technically correct to consistent data A systematic approach to data cleaning with R Mark van der Loo markvanderloo.eu | @markvdloo Budapest | September 3 2016 Mark van der Loo A systematic approach to data cleaning with R
  • 2. The statistical value chain From raw to technically correct data From technically correct to consistent data Demos and other materials https://github.com/markvanderloo/satRday Mark van der Loo A systematic approach to data cleaning with R
  • 3. The statistical value chain From raw to technically correct data From technically correct to consistent data Contents The statistical value chain From raw data to technically correct data Strings and encoding Regexp and approximate matching Type coercion From technically correct data to consistent data Data validation Error localization Correction, imputation, adjustment Mark van der Loo A systematic approach to data cleaning with R
  • 4. The statistical value chain From raw to technically correct data From technically correct to consistent data The statistical value chain Mark van der Loo A systematic approach to data cleaning with R
  • 5. The statistical value chain From raw to technically correct data From technically correct to consistent data Statistical value chain Mark van der Loo A systematic approach to data cleaning with R
  • 6. The statistical value chain From raw to technically correct data From technically correct to consistent data Concepts Technically correct data Well-defined format (data structure) Well-defined types (numbers, date/time,string, categorical... ) Statistical units can be identified (persons, transactions, phone calls...) Variables can be identified as properties of statistical units. Note: tidy data ⊂ technically correct data Consistent data Data satisfies demands from domain knowledge (more on this when we talk about validation) Mark van der Loo A systematic approach to data cleaning with R
  • 7. The statistical value chain From raw to technically correct data From technically correct to consistent data From raw to technically correct data Mark van der Loo A systematic approach to data cleaning with R
  • 8. The statistical value chain From raw to technically correct data From technically correct to consistent data Dirty tabular data Demo Coercing while reading: /table Mark van der Loo A systematic approach to data cleaning with R
  • 9. The statistical value chain From raw to technically correct data From technically correct to consistent data Tabular data: long story short read.table: R’s swiss army knife fairly strict (no sniffing) Very flexible Interface could be cleaner (see this talk) readr::read_csv Easy to switch between strict/lenient parsing Compact control over column types Fast Clear reports of parsing failure Mark van der Loo A systematic approach to data cleaning with R
  • 10. The statistical value chain From raw to technically correct data From technically correct to consistent data Really dirty data Demo Output file parsing: /parsing Mark van der Loo A systematic approach to data cleaning with R
  • 11. The statistical value chain From raw to technically correct data From technically correct to consistent data A few lessons from the demo (base) R has great text processing tools. Need to work with regular expressions1 Write many small functions extracting single data elements. Don’t overgeneralize: adapt functions as you meet new input. Smart use of existing tools (read.table(text=)) 1 Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource Mark van der Loo A systematic approach to data cleaning with R
  • 12. The statistical value chain From raw to technically correct data From technically correct to consistent data Packages for standard format parsing jsonlite: parse JSON files yaml: parse yaml files xml2: parse XML files rvest: scrape and parse HTML files Mark van der Loo A systematic approach to data cleaning with R
  • 13. The statistical value chain From raw to technically correct data From technically correct to consistent data Some tips on regular expressions with R stringr has many useful shorthands for common tasks. Generate regular expressions with rex library(rex) # recognize a number in scientific notation rex(one_or_more(digit) , maybe(".",one_or_more(digit)) , "E" %or% "e" , one_or_more(digit)) ## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi Mark van der Loo A systematic approach to data cleaning with R
  • 14. The statistical value chain From raw to technically correct data From technically correct to consistent data Regular expressions Express a pattern of text, e.g. "(a|b)c*" = {"a", "ac", "acc", . . . , "b", "bc", "bcc", . . .} Task stringr function: string detection str_detect(string, pattern) string extraction str_extract(string, pattern) string replacement str_extract(string, pattern, replacement) string splitting str_split(string, pattern) Base R: grep grepl | regexpr regmatches | sub gsub | strsplit Mark van der Loo A systematic approach to data cleaning with R
  • 15. The statistical value chain From raw to technically correct data From technically correct to consistent data String normalization Bring a text string in a standard format, e.g. Standardize upper/lower case (casefolding) stringr: str_to_lower, str_to_upper, str_to_title base R: tolower, toupper Remove accents (transliteration) stringi: stri_trans_general base R: iconv Re-encoding stringi: stri_encode base R: iconv Uniformize encoding (unicode normalization) stringi: stri_trans_nfkc (and more) Mark van der Loo A systematic approach to data cleaning with R
  • 16. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding Mark van der Loo A systematic approach to data cleaning with R
  • 17. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Mark van der Loo A systematic approach to data cleaning with R
  • 18. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Mark van der Loo A systematic approach to data cleaning with R
  • 19. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Demo Normalization, re-encoding, transliteration: /strings Mark van der Loo A systematic approach to data cleaning with R
  • 20. The statistical value chain From raw to technically correct data From technically correct to consistent data A few tips Detect encoding stringi::stri_enc_detect Conversion options iconvlist() stringi::stri_enc_list() Mark van der Loo A systematic approach to data cleaning with R
  • 21. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching Mark van der Loo A systematic approach to data cleaning with R
  • 22. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching Demo Approximate matching and normalization: /matching Mark van der Loo A systematic approach to data cleaning with R
  • 23. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching: edit-based distances Allowed operation Distance substitution deletion insertion transposition Hamming LCS Levenshtein OSA ∗ Damerau- Levenshtein ∗Substrings may be edited only once. leela → leea → leia stringdist::stringdist(leela,leia,method=dl) ## [1] 2 Mark van der Loo A systematic approach to data cleaning with R
  • 24. The statistical value chain From raw to technically correct data From technically correct to consistent data Some pointers for approximate matching Normalisation and approximate matching are complementary See my useR2014 talk or paper on stringdist for more distances The fuzzyjoin package allows fuzzy joining of datasets Mark van der Loo A systematic approach to data cleaning with R
  • 25. The statistical value chain From raw to technically correct data From technically correct to consistent data Other good stuff lubridate: extract dates from strings lubridate::dmy(17 December 2015) ## [1] 2015-12-17 tidyr: many data cleaning operations to make your life easier readr: Parse numbers from text strings readr::parse_number(c(2%,6%,0.3%)) ## [1] 2.0 6.0 0.3 Mark van der Loo A systematic approach to data cleaning with R
  • 26. The statistical value chain From raw to technically correct data From technically correct to consistent data From technically correct to consistent data Mark van der Loo A systematic approach to data cleaning with R
  • 27. The statistical value chain From raw to technically correct data From technically correct to consistent data The mantra of data cleaning Detection (data conflicts with domain knowledge) Selection (find the value(s) that cause the violation) Correction (replace them with better values) Mark van der Loo A systematic approach to data cleaning with R
  • 28. The statistical value chain From raw to technically correct data From technically correct to consistent data Detection, AKA data validation Informally: Data Validation is checking data against (multivariate) expectations about a data set. Validation rules Often these expectations can be expressed as a set of simple validation rules. Mark van der Loo A systematic approach to data cleaning with R
  • 29. The statistical value chain From raw to technically correct data From technically correct to consistent data Data validation Demo The validate package /validate Mark van der Loo A systematic approach to data cleaning with R
  • 30. The statistical value chain From raw to technically correct data From technically correct to consistent data The validate package, in summary Make data validation rules explicit Treat them as objects of computation store to / read from file manipulate annotate Confront data with rules Analyze/visualize the results Mark van der Loo A systematic approach to data cleaning with R
  • 31. The statistical value chain From raw to technically correct data From technically correct to consistent data Tracking changes when altering data Mark van der Loo A systematic approach to data cleaning with R
  • 32. The statistical value chain From raw to technically correct data From technically correct to consistent data Tracking changes in rule violations Mark van der Loo A systematic approach to data cleaning with R
  • 33. The statistical value chain From raw to technically correct data From technically correct to consistent data Use rules to correct data Main idea Rules restrict the data. Sometimes this is enough to derive a correct value uniquely. Examples Correct typos in values under linear restrictions 123 + 45 = 177, but 123 + 54 = 177. Derive imputations from values under linear restrictions 123 + NA = 177, compute 177 − 123 = 54. Both can be generalized to systems Ax ≤ b. Mark van der Loo A systematic approach to data cleaning with R
  • 34. The statistical value chain From raw to technically correct data From technically correct to consistent data Deductive correction and imputation Demo The deductive package: /deductive. Mark van der Loo A systematic approach to data cleaning with R
  • 35. The statistical value chain From raw to technically correct data From technically correct to consistent data Selection, or: error localization Fellegi and Holt (1976) Find the least (weighted) number of fields that can be imputed such that all rules can be satisfied. Note Solutions need not be unique. Random one chosen in case of degeneracy. Lowest weight need not guarantee smallest number of altered variables. Mark van der Loo A systematic approach to data cleaning with R
  • 36. The statistical value chain From raw to technically correct data From technically correct to consistent data Error localization Demo The errorlocate package: /errorlocate Mark van der Loo A systematic approach to data cleaning with R
  • 37. The statistical value chain From raw to technically correct data From technically correct to consistent data Notes on errorlocate For in-record rules Support for linear (in)equality rules Conditionals on categorical variables (if male then not pregnant) Mixed conditionals (has job then age = 15) Conditionals w/linear predicates (staff 0 then staff cost 0) Optimization is mapped to MIP problem. Mark van der Loo A systematic approach to data cleaning with R
  • 38. The statistical value chain From raw to technically correct data From technically correct to consistent data Missing values Mechanisms (Rubin): MCAR: missing completely at random MAR: P(Y = NA) depends on value of X MNAR: P(Y = NA) depends on value of Y Mark van der Loo A systematic approach to data cleaning with R
  • 39. The statistical value chain From raw to technically correct data From technically correct to consistent data Imputation Purpose of imputation vs prediction Prediction: estimate a single value (often for a single use) Imputation: estimate values such that the completed data set allows for valid inferencea a This is very difficult! Imputation methods Deductive imputation Imputation based on predictive models Donor imputation (knn, pmm, sequential/random hot deck ) Mark van der Loo A systematic approach to data cleaning with R
  • 40. The statistical value chain From raw to technically correct data From technically correct to consistent data Predictive model-based imputation ˆy = ˆf (x) + e.g.Linear regression ˆy = α + xT ˆβ + Residual: = 0 Impute expected value drawn from observed residuals e ∼ N(0, σ) parametric residual, ˆσ2 = var(e) Multiple imputation (Bayesian bootstrap) Draw β from parametric distribution, impute multiple times. Mark van der Loo A systematic approach to data cleaning with R
  • 41. The statistical value chain From raw to technically correct data From technically correct to consistent data Donor imputation (hot deck) Method variants: Random hot deck: copy value from random record. Sequential hot deck: copy value from previous record. k-nearest neighbours: draw donor from k neares neigbours Predictive mean matching: copy value closest to prediction Donor pool variants: per variable per missing data pattern per record Mark van der Loo A systematic approach to data cleaning with R
  • 42. The statistical value chain From raw to technically correct data From technically correct to consistent data Note on multivariate donor imputation Many multivariate methods seem relatively ad hoc, and more theoretical and empirical comparisons with alternative approaches would be of interest. Andridge and Little (2010) A Review of Hot Deck Imputation for Survey Non-response. Int. Stat. Rev. 78(1) 40–64 Mark van der Loo A systematic approach to data cleaning with R
  • 43. The statistical value chain From raw to technically correct data From technically correct to consistent data Demo time Demo Imputation /imputation VIM: visualisation, GUI, extensive methodology simputation: simple, scriptable interface to common methods Mark van der Loo A systematic approach to data cleaning with R
  • 44. The statistical value chain From raw to technically correct data From technically correct to consistent data Methods supported by simputation Model based (optionally add [non-]parametric random residual) linear regression robust linear regression CART models Random forest Donor imputation (including various donor pool specifications) k-nearest neigbour (based on gower’s distance) sequential hotdeck (LOCF, NOCB) random hotdeck Predictive mean matching Other (groupwise) median imputation (optional random residual) Proxy imputation (copy from other variable) Mark van der Loo A systematic approach to data cleaning with R
  • 45. The statistical value chain From raw to technically correct data From technically correct to consistent data Credits deductive Mark van der Loo, Edwin de Jonge errorlocate Edwin de Jonge, Mark van der Loo gower Mark van der Loo jsonlite Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel magrittr Stefan Milton Bache, Hadley Wickham rex Kevin Ushey Jim Hester, Robert Krzyzanowski simputation Mark van der Loo stringdist Mark van der Loo, Jan van der Laan, R Core, Nick Logan stringi Marek Gagolewski, Bartek Tartanus stringr Hadley Wickham, RStudio tidyr Hadley Wickham, RStudio validate Mark van der Loo, Edwin de Jonge VIM Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner xml2 Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation Mark van der Loo A systematic approach to data cleaning with R