Data error! But where?

•

0 likes•352 views

This document discusses using validation rules to check data and locate errors. It introduces the errorlocate package, which can formulate validation rules using validate and then locate errors in data that violate those rules. Errorlocate uses an algorithm based on Feligi Holt formalism to identify the minimal number of variable changes needed to satisfy all validation rules. It provides functions to both locate errors and replace errors in data. The package aims to automate the error localization process for large datasets with complex validation rule sets.

Data & Analytics

Data error!, but where?
Errorlocate: ﬁnd and replace erroneous ﬁelds in data using
validation rules
Edwin de Jonge
Statistics Netherlands / UseR! 2017

Data cleaning. . .
A large part of your job is spent in data-cleaning:
getting your data in the right shape (e.g. tidyverse)
assessing missing data (e.g. VIM)
checking validity (e.g. validate)
locating and removing errors: errorlocate!
impute values for missing or erroneous data (e.g.
simputation)

Validation rules?
Package validate allows to:
formulate explicit data rule that data must conform to:
library(validate)
check_that( data.frame(age=160, driver_license=TRUE),
age >= 0,
age < 150,
if (driver_license == TRUE) age >= 16
)

Explicit validation rules:
Give a clear overview what the data must conform to.
Can be used to reason about.
Can be used to ﬁx/correct data!
Find error, and when found correct it.
Note:
Manual ﬁx is error prone, not reproducible and not feasible for
large data sets.
Large rule set have (very) complex behavior, e.g. entangled
rules: adjusting one value may invalidate other rules.

Error localization
Error localization is a procedure that points out ﬁelds in a
data set that can be altered or imputed in such a way that
all validation rules can be satisﬁed.

Find the error:
library(validate)
check_that( data.frame(age=160, driver_license=TRUE),
age >= 0,
age < 150,
if (driver_license == TRUE) age >= 16
)
It is clear that age has an erroneous value, but for more complex
rule sets it is less clear.

Multivariate example:
check_that( data.frame( age = 3
, married = TRUE
, attends = "kindergarten"
)
, if (married == TRUE) age >= 16
, if (attends == "kindergarten") age <= 6
)
Ok, clear that this is a faulty record, but what is the error?

Feligi Holt formalism:
Find the minimal (weighted) number of variables that
cause the invalidation of the data rules.
Makes sense! (But there are exceptions. . . )
Implemented in errorlocate (second generation of editrules).

errorlocate::locate_errors
locate_errors( data.frame( age = 3
, married = TRUE
, attends = "kindergarten"
)
, validator( if (married == TRUE) age >= 16
, if (attends == "kindergarten") age <= 6
)
)$errors
## age married attends
## [1,] FALSE TRUE FALSE

errorlocate::replace_errors
replace_errors(
data.frame( age = 3
, married = TRUE
, attends = "kindergarten"
)
, validator( if (married == TRUE) age >= 16
, if (attends == "kindergarten") age <= 6
)
)
## age married attends
## 1 3 NA kindergarten

Internal workings:
errorlocate:
translates error localization problem into a mixed integer
problem, which is solved with lpsolveAPI.
contains a small framework for implementing your own error
localization algorithms.

Pipe friendly
The replace_errors function is pipe friendly:
rules <- validator(age < 150)
data_noerrors <-
data.frame(age=160, driver_license = TRUE) %>%
replace_errors(rules)
errors_removed(data_noerrors) # contains errors removed

Thank you!
Interested?
install.packages("errorlocate")
Or visit:
http://github.com/data-cleaning/errorlocate

Similar to Data error! But where?

Lecture 19Shani729

Bayesian Inference (UC Berkeley School of Information; July 25, 2019)Ivan Corneillet

Upstate CSCI 525 Data Mining Chapter 3DanWooster1

Data preprocessingsuganmca14

03Preprocessing_plp.pptxProfPPavanKumar

03Preprocessing.pptProfPPavanKumar

03Preprocessing_plp.pptxProfPPavanKumar

03Preprocessing for student computer sciecne.pptMuhammadHanifSyabani

03Preprocessing.pptAnkitaAnki16

Preprocessing.pptNBACriteria2SICET

03Predddddddddddddddddddddddprocessling.ppta99150433

Data Preprocessing and Visualizsdjvnovrnververdfvdfationwokati2689

03 preprocessingJoonyoungJayGwak

Why are data transformations a bad choice in statisticsAdrian Olszewski

Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...Daniel Katz

Data preprocessingTony Nguyen

Data preprocessingJames Wong

Data preprocessingHarry Potter

Data preprocessingFraboni Ec

Data preprocessingHoang Nguyen

Similar to Data error! But where? (20)

Lecture 19

Bayesian Inference (UC Berkeley School of Information; July 25, 2019)

Upstate CSCI 525 Data Mining Chapter 3

Data preprocessing

03Preprocessing_plp.pptx

03Preprocessing.ppt

03Preprocessing_plp.pptx

03Preprocessing for student computer sciecne.ppt

03Preprocessing.ppt

Preprocessing.ppt

03Predddddddddddddddddddddddprocessling.ppt

Data Preprocessing and Visualizsdjvnovrnververdfvdfation

03 preprocessing

Why are data transformations a bad choice in statistics

Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validat...

Data preprocessing

Recently uploaded

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor

Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083

100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375

From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck

PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava

RadioAdProWritingCinderellabyButleri.pdfgstagge

04242024_CCC TUG_Joins and Relationshipsccctableauusergroup

Data Science Jobs and Salaries Analysis.pptxFurkanTasci3

Ukraine War presentation: KNOW THE BASICSAishani27

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna

Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth

Brighton SEO | April 2024 | Data StorytellingNeil Barnes

E-Commerce Order PredictionShraddha Kamble.pptxBoston Institute of Analytics

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh

B2 Creative Industry Response Evaluation.docxStephen266013

Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda

Recently uploaded (20)

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...

Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call

100-Concepts-of-AI by Anupama Kate .pptx

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...

From idea to production in a day – Leveraging Azure ML and Streamlit to build...

PKS-TGC-1084-630 - Stage 1 Proposal.pptx

RadioAdProWritingCinderellabyButleri.pdf

04242024_CCC TUG_Joins and Relationships

Data Science Jobs and Salaries Analysis.pptx

Ukraine War presentation: KNOW THE BASICS

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...

Unveiling Insights: The Role of a Data Analyst

Brighton SEO | April 2024 | Data Storytelling

E-Commerce Order PredictionShraddha Kamble.pptx

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...

B2 Creative Industry Response Evaluation.docx

Customer Service Analytics - Make Sense of All Your Data.pptx

Data error! But where?

1. Data error!, but where? Errorlocate: ﬁnd and replace erroneous ﬁelds in data using validation rules Edwin de Jonge Statistics Netherlands / UseR! 2017

3. Data cleaning. . . A large part of your job is spent in data-cleaning: getting your data in the right shape (e.g. tidyverse) assessing missing data (e.g. VIM) checking validity (e.g. validate) locating and removing errors: errorlocate! impute values for missing or erroneous data (e.g. simputation)

5. Validation rules? Package validate allows to: formulate explicit data rule that data must conform to: library(validate) check_that( data.frame(age=160, driver_license=TRUE), age >= 0, age < 150, if (driver_license == TRUE) age >= 16 )

6. Explicit validation rules: Give a clear overview what the data must conform to. Can be used to reason about. Can be used to ﬁx/correct data! Find error, and when found correct it. Note: Manual ﬁx is error prone, not reproducible and not feasible for large data sets. Large rule set have (very) complex behavior, e.g. entangled rules: adjusting one value may invalidate other rules.

7. Error localization Error localization is a procedure that points out ﬁelds in a data set that can be altered or imputed in such a way that all validation rules can be satisﬁed.

8. Find the error: library(validate) check_that( data.frame(age=160, driver_license=TRUE), age >= 0, age < 150, if (driver_license == TRUE) age >= 16 ) It is clear that age has an erroneous value, but for more complex rule sets it is less clear.

9. Multivariate example: check_that( data.frame( age = 3 , married = TRUE , attends = "kindergarten" ) , if (married == TRUE) age >= 16 , if (attends == "kindergarten") age <= 6 ) Ok, clear that this is a faulty record, but what is the error?

10. Feligi Holt formalism: Find the minimal (weighted) number of variables that cause the invalidation of the data rules. Makes sense! (But there are exceptions. . . ) Implemented in errorlocate (second generation of editrules).

11. errorlocate::locate_errors locate_errors( data.frame( age = 3 , married = TRUE , attends = "kindergarten" ) , validator( if (married == TRUE) age >= 16 , if (attends == "kindergarten") age <= 6 ) )$errors ## age married attends ## [1,] FALSE TRUE FALSE

12. errorlocate::replace_errors replace_errors( data.frame( age = 3 , married = TRUE , attends = "kindergarten" ) , validator( if (married == TRUE) age >= 16 , if (attends == "kindergarten") age <= 6 ) ) ## age married attends ## 1 3 NA kindergarten

13. Internal workings: errorlocate: translates error localization problem into a mixed integer problem, which is solved with lpsolveAPI. contains a small framework for implementing your own error localization algorithms.

14. Pipe friendly The replace_errors function is pipe friendly: rules <- validator(age < 150) data_noerrors <- data.frame(age=160, driver_license = TRUE) %>% replace_errors(rules) errors_removed(data_noerrors) # contains errors removed

15. Thank you! Interested? install.packages("errorlocate") Or visit: http://github.com/data-cleaning/errorlocate

Data error! But where?

Recommended

Recommended

More Related Content

Similar to Data error! But where?

Similar to Data error! But where? (20)

More from Edwin de Jonge

More from Edwin de Jonge (15)

Recently uploaded

Recently uploaded (20)

Data error! But where?