Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Sat rday

485 views

Published on

Presentation for the tutorial on data cleaning with R given at the satRday conference, Budapest September 3 2016

Published in: Data & Analytics
  • DOWNLOAD FULL BOOKS, INTO AVAILABLE FORMAT ......................................................................................................................... ......................................................................................................................... ,DOWNLOAD FULL. PDF EBOOK here { https://tinyurl.com/yyxo9sk7 } ......................................................................................................................... ,DOWNLOAD FULL. EPUB Ebook here { https://tinyurl.com/yyxo9sk7 } ......................................................................................................................... ,DOWNLOAD FULL. doc Ebook here { https://tinyurl.com/yyxo9sk7 } ......................................................................................................................... ,DOWNLOAD FULL. PDF EBOOK here { https://tinyurl.com/yyxo9sk7 } ......................................................................................................................... ,DOWNLOAD FULL. EPUB Ebook here { https://tinyurl.com/yyxo9sk7 } ......................................................................................................................... ,DOWNLOAD FULL. doc Ebook here { https://tinyurl.com/yyxo9sk7 } ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... .............. Browse by Genre Available eBooks ......................................................................................................................... Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult,
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Be the first to like this

Sat rday

  1. 1. The statistical value chain From raw to technically correct data From technically correct to consistent data A systematic approach to data cleaning with R Mark van der Loo markvanderloo.eu | @markvdloo Budapest | September 3 2016 Mark van der Loo A systematic approach to data cleaning with R
  2. 2. The statistical value chain From raw to technically correct data From technically correct to consistent data Demos and other materials https://github.com/markvanderloo/satRday Mark van der Loo A systematic approach to data cleaning with R
  3. 3. The statistical value chain From raw to technically correct data From technically correct to consistent data Contents The statistical value chain From raw data to technically correct data Strings and encoding Regexp and approximate matching Type coercion From technically correct data to consistent data Data validation Error localization Correction, imputation, adjustment Mark van der Loo A systematic approach to data cleaning with R
  4. 4. The statistical value chain From raw to technically correct data From technically correct to consistent data The statistical value chain Mark van der Loo A systematic approach to data cleaning with R
  5. 5. The statistical value chain From raw to technically correct data From technically correct to consistent data Statistical value chain Mark van der Loo A systematic approach to data cleaning with R
  6. 6. The statistical value chain From raw to technically correct data From technically correct to consistent data Concepts Technically correct data Well-defined format (data structure) Well-defined types (numbers, date/time,string, categorical... ) Statistical units can be identified (persons, transactions, phone calls...) Variables can be identified as properties of statistical units. Note: tidy data ⊂ technically correct data Consistent data Data satisfies demands from domain knowledge (more on this when we talk about validation) Mark van der Loo A systematic approach to data cleaning with R
  7. 7. The statistical value chain From raw to technically correct data From technically correct to consistent data From raw to technically correct data Mark van der Loo A systematic approach to data cleaning with R
  8. 8. The statistical value chain From raw to technically correct data From technically correct to consistent data Dirty tabular data Demo Coercing while reading: /table Mark van der Loo A systematic approach to data cleaning with R
  9. 9. The statistical value chain From raw to technically correct data From technically correct to consistent data Tabular data: long story short read.table: R’s swiss army knife fairly strict (no sniffing) Very flexible Interface could be cleaner (see this talk) readr::read_csv Easy to switch between strict/lenient parsing Compact control over column types Fast Clear reports of parsing failure Mark van der Loo A systematic approach to data cleaning with R
  10. 10. The statistical value chain From raw to technically correct data From technically correct to consistent data Really dirty data Demo Output file parsing: /parsing Mark van der Loo A systematic approach to data cleaning with R
  11. 11. The statistical value chain From raw to technically correct data From technically correct to consistent data A few lessons from the demo (base) R has great text processing tools. Need to work with regular expressions1 Write many small functions extracting single data elements. Don’t overgeneralize: adapt functions as you meet new input. Smart use of existing tools (read.table(text=)) 1 Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource Mark van der Loo A systematic approach to data cleaning with R
  12. 12. The statistical value chain From raw to technically correct data From technically correct to consistent data Packages for standard format parsing jsonlite: parse JSON files yaml: parse yaml files xml2: parse XML files rvest: scrape and parse HTML files Mark van der Loo A systematic approach to data cleaning with R
  13. 13. The statistical value chain From raw to technically correct data From technically correct to consistent data Some tips on regular expressions with R stringr has many useful shorthands for common tasks. Generate regular expressions with rex library(rex) # recognize a number in scientific notation rex(one_or_more(digit) , maybe(".",one_or_more(digit)) , "E" %or% "e" , one_or_more(digit)) ## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi Mark van der Loo A systematic approach to data cleaning with R
  14. 14. The statistical value chain From raw to technically correct data From technically correct to consistent data Regular expressions Express a pattern of text, e.g. "(a|b)c*" = {"a", "ac", "acc", . . . , "b", "bc", "bcc", . . .} Task stringr function: string detection str_detect(string, pattern) string extraction str_extract(string, pattern) string replacement str_extract(string, pattern, replacement) string splitting str_split(string, pattern) Base R: grep grepl | regexpr regmatches | sub gsub | strsplit Mark van der Loo A systematic approach to data cleaning with R
  15. 15. The statistical value chain From raw to technically correct data From technically correct to consistent data String normalization Bring a text string in a standard format, e.g. Standardize upper/lower case (casefolding) stringr: str_to_lower, str_to_upper, str_to_title base R: tolower, toupper Remove accents (transliteration) stringi: stri_trans_general base R: iconv Re-encoding stringi: stri_encode base R: iconv Uniformize encoding (unicode normalization) stringi: stri_trans_nfkc (and more) Mark van der Loo A systematic approach to data cleaning with R
  16. 16. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding Mark van der Loo A systematic approach to data cleaning with R
  17. 17. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Mark van der Loo A systematic approach to data cleaning with R
  18. 18. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Mark van der Loo A systematic approach to data cleaning with R
  19. 19. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Demo Normalization, re-encoding, transliteration: /strings Mark van der Loo A systematic approach to data cleaning with R
  20. 20. The statistical value chain From raw to technically correct data From technically correct to consistent data A few tips Detect encoding stringi::stri_enc_detect Conversion options iconvlist() stringi::stri_enc_list() Mark van der Loo A systematic approach to data cleaning with R
  21. 21. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching Mark van der Loo A systematic approach to data cleaning with R
  22. 22. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching Demo Approximate matching and normalization: /matching Mark van der Loo A systematic approach to data cleaning with R
  23. 23. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching: edit-based distances Allowed operation Distance substitution deletion insertion transposition Hamming LCS Levenshtein OSA ∗ Damerau- Levenshtein ∗Substrings may be edited only once. leela → leea → leia stringdist::stringdist(leela,leia,method=dl) ## [1] 2 Mark van der Loo A systematic approach to data cleaning with R
  24. 24. The statistical value chain From raw to technically correct data From technically correct to consistent data Some pointers for approximate matching Normalisation and approximate matching are complementary See my useR2014 talk or paper on stringdist for more distances The fuzzyjoin package allows fuzzy joining of datasets Mark van der Loo A systematic approach to data cleaning with R
  25. 25. The statistical value chain From raw to technically correct data From technically correct to consistent data Other good stuff lubridate: extract dates from strings lubridate::dmy(17 December 2015) ## [1] 2015-12-17 tidyr: many data cleaning operations to make your life easier readr: Parse numbers from text strings readr::parse_number(c(2%,6%,0.3%)) ## [1] 2.0 6.0 0.3 Mark van der Loo A systematic approach to data cleaning with R
  26. 26. The statistical value chain From raw to technically correct data From technically correct to consistent data From technically correct to consistent data Mark van der Loo A systematic approach to data cleaning with R
  27. 27. The statistical value chain From raw to technically correct data From technically correct to consistent data The mantra of data cleaning Detection (data conflicts with domain knowledge) Selection (find the value(s) that cause the violation) Correction (replace them with better values) Mark van der Loo A systematic approach to data cleaning with R
  28. 28. The statistical value chain From raw to technically correct data From technically correct to consistent data Detection, AKA data validation Informally: Data Validation is checking data against (multivariate) expectations about a data set. Validation rules Often these expectations can be expressed as a set of simple validation rules. Mark van der Loo A systematic approach to data cleaning with R
  29. 29. The statistical value chain From raw to technically correct data From technically correct to consistent data Data validation Demo The validate package /validate Mark van der Loo A systematic approach to data cleaning with R
  30. 30. The statistical value chain From raw to technically correct data From technically correct to consistent data The validate package, in summary Make data validation rules explicit Treat them as objects of computation store to / read from file manipulate annotate Confront data with rules Analyze/visualize the results Mark van der Loo A systematic approach to data cleaning with R
  31. 31. The statistical value chain From raw to technically correct data From technically correct to consistent data Tracking changes when altering data Mark van der Loo A systematic approach to data cleaning with R
  32. 32. The statistical value chain From raw to technically correct data From technically correct to consistent data Tracking changes in rule violations Mark van der Loo A systematic approach to data cleaning with R
  33. 33. The statistical value chain From raw to technically correct data From technically correct to consistent data Use rules to correct data Main idea Rules restrict the data. Sometimes this is enough to derive a correct value uniquely. Examples Correct typos in values under linear restrictions 123 + 45 = 177, but 123 + 54 = 177. Derive imputations from values under linear restrictions 123 + NA = 177, compute 177 − 123 = 54. Both can be generalized to systems Ax ≤ b. Mark van der Loo A systematic approach to data cleaning with R
  34. 34. The statistical value chain From raw to technically correct data From technically correct to consistent data Deductive correction and imputation Demo The deductive package: /deductive. Mark van der Loo A systematic approach to data cleaning with R
  35. 35. The statistical value chain From raw to technically correct data From technically correct to consistent data Selection, or: error localization Fellegi and Holt (1976) Find the least (weighted) number of fields that can be imputed such that all rules can be satisfied. Note Solutions need not be unique. Random one chosen in case of degeneracy. Lowest weight need not guarantee smallest number of altered variables. Mark van der Loo A systematic approach to data cleaning with R
  36. 36. The statistical value chain From raw to technically correct data From technically correct to consistent data Error localization Demo The errorlocate package: /errorlocate Mark van der Loo A systematic approach to data cleaning with R
  37. 37. The statistical value chain From raw to technically correct data From technically correct to consistent data Notes on errorlocate For in-record rules Support for linear (in)equality rules Conditionals on categorical variables (if male then not pregnant) Mixed conditionals (has job then age = 15) Conditionals w/linear predicates (staff 0 then staff cost 0) Optimization is mapped to MIP problem. Mark van der Loo A systematic approach to data cleaning with R
  38. 38. The statistical value chain From raw to technically correct data From technically correct to consistent data Missing values Mechanisms (Rubin): MCAR: missing completely at random MAR: P(Y = NA) depends on value of X MNAR: P(Y = NA) depends on value of Y Mark van der Loo A systematic approach to data cleaning with R
  39. 39. The statistical value chain From raw to technically correct data From technically correct to consistent data Imputation Purpose of imputation vs prediction Prediction: estimate a single value (often for a single use) Imputation: estimate values such that the completed data set allows for valid inferencea a This is very difficult! Imputation methods Deductive imputation Imputation based on predictive models Donor imputation (knn, pmm, sequential/random hot deck ) Mark van der Loo A systematic approach to data cleaning with R
  40. 40. The statistical value chain From raw to technically correct data From technically correct to consistent data Predictive model-based imputation ˆy = ˆf (x) + e.g.Linear regression ˆy = α + xT ˆβ + Residual: = 0 Impute expected value drawn from observed residuals e ∼ N(0, σ) parametric residual, ˆσ2 = var(e) Multiple imputation (Bayesian bootstrap) Draw β from parametric distribution, impute multiple times. Mark van der Loo A systematic approach to data cleaning with R
  41. 41. The statistical value chain From raw to technically correct data From technically correct to consistent data Donor imputation (hot deck) Method variants: Random hot deck: copy value from random record. Sequential hot deck: copy value from previous record. k-nearest neighbours: draw donor from k neares neigbours Predictive mean matching: copy value closest to prediction Donor pool variants: per variable per missing data pattern per record Mark van der Loo A systematic approach to data cleaning with R
  42. 42. The statistical value chain From raw to technically correct data From technically correct to consistent data Note on multivariate donor imputation Many multivariate methods seem relatively ad hoc, and more theoretical and empirical comparisons with alternative approaches would be of interest. Andridge and Little (2010) A Review of Hot Deck Imputation for Survey Non-response. Int. Stat. Rev. 78(1) 40–64 Mark van der Loo A systematic approach to data cleaning with R
  43. 43. The statistical value chain From raw to technically correct data From technically correct to consistent data Demo time Demo Imputation /imputation VIM: visualisation, GUI, extensive methodology simputation: simple, scriptable interface to common methods Mark van der Loo A systematic approach to data cleaning with R
  44. 44. The statistical value chain From raw to technically correct data From technically correct to consistent data Methods supported by simputation Model based (optionally add [non-]parametric random residual) linear regression robust linear regression CART models Random forest Donor imputation (including various donor pool specifications) k-nearest neigbour (based on gower’s distance) sequential hotdeck (LOCF, NOCB) random hotdeck Predictive mean matching Other (groupwise) median imputation (optional random residual) Proxy imputation (copy from other variable) Mark van der Loo A systematic approach to data cleaning with R
  45. 45. The statistical value chain From raw to technically correct data From technically correct to consistent data Credits deductive Mark van der Loo, Edwin de Jonge errorlocate Edwin de Jonge, Mark van der Loo gower Mark van der Loo jsonlite Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel magrittr Stefan Milton Bache, Hadley Wickham rex Kevin Ushey Jim Hester, Robert Krzyzanowski simputation Mark van der Loo stringdist Mark van der Loo, Jan van der Laan, R Core, Nick Logan stringi Marek Gagolewski, Bartek Tartanus stringr Hadley Wickham, RStudio tidyr Hadley Wickham, RStudio validate Mark van der Loo, Edwin de Jonge VIM Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner xml2 Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation Mark van der Loo A systematic approach to data cleaning with R

×