testdat: An R package for unit testing of tabular data

Poster presentation at useR! 2014 for testdat: An R package for unit testing of tabular data

Published in Data & Analytics

Transcript

  • 1. testdat: An R package for unit testing of tabular data

    Karthik Ram (1), Hilary Parker (2), Alyssa Frazee (3)
    (1) The rOpenSci project, University of California, Berkeley. Berkeley, CA 94720 USA, karthik.ram@berkeley.edu
    (2) Etsy Inc., Brooklyn, NY. USA, hilary@etsy.com
    (3) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD. USA, afrazee@jhsph.edu

    Motivation

    Improve data preprocessing:
    Data preprocessing is an important and under-discussed step in data analysis. By providing functions to easily test for and correct common pitfalls, we aim to help researchers overcome these stumbling blocks.

    Encourage reproducibility:
    By providing a suite of functions that easily test and correct data for common errors, we hope to encourage researchers to perform data preprocessing as part of a reproducible workflow, rather than in tools such as Excel.

    Communicate analytical steps:
    By providing readable functions for preprocessing, we aim for researchers to include the data preprocessing code in their analyses or papers, to communicate that they took exhaustive steps to remove artifacts from data.

    Example Functions

    test_utf8.R, clean_utf8.R
    Test and correct utf8 characters, which cannot be read into R.

    test_NA.R, fix_NA.R
    Test and correct for common missing-value indicators that are not converted to an NA character in R.

    test_continuous_date.R, fix_continuous_date.R
    Test and correct for unexpected gaps in date ranges.

    test_white_spaces.R, fix_white_spaces.R
    Test and correct for white-spaces in character vectors.

    test_outliers.R
    Test for outliers in your numeric data. A corresponding fix function is not supplied, as correcting outliers has statistical implications.

    Workflow: Obtain, Test, Fix

    Obtain

        > dat
                 date num name
        1  2014-01-01   1 NULL
        2  2014-01-01   2  naa
        3  2014-01-01   3  foo
        4  2014-01-01   4  foo
        5  2014-01-01   5  foo
        6  2014-01-01   6  foo
        7  2014-01-01   7  foo
        8  2014-01-01   8  foo
        9  2014-01-01 999  foo
        10 2014-01-01 n/a  foo
        > class(dat$num)
        [1] "factor"
        > class(dat$name)
        [1] "factor"

    Test

        > test_NA(dat)
        Now checking 3 columns...
        999 was identified as a possible NA alias -- please verify this is not a data value!
          row column value
        1   9      2   999
        2  10      2   n/a
        3   1      3  NULL

    Fix

        > clean_dat <- fix_NA(dat, custom_NAs="naa")
        Now fixing 3 columns...
        > clean_dat
                 date num name
        1  2014-01-01   1 <NA>
        2  2014-01-01   2 <NA>
        3  2014-01-01   3  foo
        4  2014-01-01   4  foo
        5  2014-01-01   5  foo
        6  2014-01-01   6  foo
        7  2014-01-01   7  foo
        8  2014-01-01   8  foo
        9  2014-01-01  NA  foo
        10 2014-01-01  NA  foo
        > class(clean_dat$num)
        [1] "numeric"
        > class(clean_dat$name)
        [1] "character"

    Contribute

    The testdat package, like rOpenSci, is an open-source, community-supported project!
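
The test-then-fix idea behind test_NA and fix_NA can be illustrated with a small base R sketch. This is not the testdat implementation: the helper names `find_na_aliases` and `replace_na_aliases`, and the `default_aliases` list, are hypothetical and were chosen only to mirror the poster's example data.

```r
# Hypothetical helpers in base R; NOT the testdat implementation.
# The alias list is an assumption picked to match the poster's example.
default_aliases <- c("NULL", "n/a", "N/A", "999", "")

# Test step: report (row, column, value) for every cell that matches an alias.
find_na_aliases <- function(df, aliases = default_aliases) {
  hits <- lapply(seq_along(df), function(j) {
    values <- as.character(df[[j]])
    rows <- which(!is.na(values) & values %in% aliases)
    data.frame(row = rows,
               column = rep(j, length(rows)),
               value = values[rows],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Fix step: turn aliases into NA in factor/character columns,
# then let R re-infer each column's type.
replace_na_aliases <- function(df, aliases = default_aliases) {
  df[] <- lapply(df, function(col) {
    if (!is.factor(col) && !is.character(col)) return(col)
    values <- as.character(col)
    values[values %in% aliases] <- NA
    type.convert(values, as.is = TRUE)
  })
  df
}

# Example data shaped like the poster's `dat` (all columns read as factors).
dat <- data.frame(
  date = rep("2014-01-01", 10),
  num  = c("1", "2", "3", "4", "5", "6", "7", "8", "999", "n/a"),
  name = c("NULL", "naa", rep("foo", 8)),
  stringsAsFactors = TRUE
)

hits <- find_na_aliases(dat)
clean_dat <- replace_na_aliases(dat, aliases = c(default_aliases, "naa"))
print(hits)
print(clean_dat)
```

Keeping detection separate from replacement lets a value such as 999 be reviewed before it is overwritten, since it may be a genuine data value rather than a missing-value code. Note that in this sketch the cleaned `num` column is inferred as integer, which is still a numeric type in R.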