Data Cleansing introduction (for BigClean Prague 2011)

  1. Data Cleansing What about quality? Stefan Urbanek stefan.urbanek@gmail.com @Stiivi March 2011
  2. Content ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  3. http://vestnik.transparency.sk
  4. Brewery (analytical data streams) & Cubes (online analytical processing); github/bitbucket: Stiivi
  5. Quality
  6. What is data quality ?
  7. Dimensions ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  8. completeness
  9. [Chart: per-month completeness of a field, 2005-03 through 2010-09 (how many % of the field is filled and successfully processed). Quality measure: completeness: 55%]
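The completeness measure charted on this slide (share of filled values in a field) can be sketched in a few lines. This is an illustrative helper of my own, not code from the talk, and the talk tracks it per month rather than over a flat list:

```python
def completeness(values):
    """Return the fraction of values that are filled (not None or empty)."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values)

# 4 of 6 values are filled -> roughly two thirds complete
sample = ["a", None, "b", "", "c", "d"]
print(f"completeness: {completeness(sample):.0%}")  # -> completeness: 67%
```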
  10. [Diagram: combining data from source type 1 and source type 2]
  11. [Chart: per-month completeness after combining both source types, 2005-03 through 2010-09. Quality measure: completeness: 88%]
  12. accuracy (e.g. reconstruction: 5€, temperature: 32˚C)
  13. timeliness
  14. Auto-measurable ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  15. What does that mean: “high quality data”?
  16. 85%
  17. appropriate for given purpose
  18. attach quality report
  19. Quality Measurement for accuracy and transparency
  20. ■ why to measure? ■ when to measure? ■ where to measure?
  21. [Diagram: two ETL pipelines. Since 2009: Download → Parse → Load source → Cleanse → Create cube; raw sources (HTML files) → YAML files → contracts table (staging) → clean data → analytical model (fact table + dimension tables), using REGIS (SK organisations) and an "unknown" suppliers map. 2005-2008: Download → Parse → Load source; large HTML files (one per year) → YAML files; Pre-process → Create search index for the Procurement and Document dimensions. Keep intermediate results for auditability.]
  22. [The same ETL pipeline diagram, annotated: insert probes at appropriate places]
  23. like unit testing: 1. write probes 2. set data quality indicators 3. pass data through
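The three steps above can be sketched like a unit test: a probe measures an indicator, and a threshold turns the measurement into pass/fail. The function names here are hypothetical, not Brewery's API:

```python
def null_ratio(rows, field):
    """Probe: fraction of rows where `field` is missing or empty."""
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)

def quality_check(rows, field, max_nulls=0.05):
    """Indicator: fail when the null ratio exceeds the allowed threshold."""
    ratio = null_ratio(rows, field)
    return (field, ratio, "ok" if ratio <= max_nulls else "fail")

rows = [
    {"receiver_name": "ACME"},
    {"receiver_name": None},
    {"receiver_name": "Foo, s.r.o."},
]
print(quality_check(rows, "receiver_name"))  # null ratio is 1/3 -> fail
```

Passing every staged table through a battery of such checks yields a report like the one on the next slide.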
  24. [Diagram: an example stream: sources (PostgreSQL database table, YAML directory) → coalesce values → database table output, plus data audit → value threshold → formatted printer ({x:.2%} → 15.00%)]
  25. field             nulls    status  distinct
      ------------------------------------------
      file               0.00%   ok         100
      source_code        0.00%   ok           6
      year               0.00%   ok           6
      donor_code         0.00%   ok           2
      receiver_name      1.25%   fail     10363
      receiver_address  13.29%   fail      9979
      receiver_ico      13.53%   fail      5813
      project            0.01%   ok       28370
      program            0.00%   ok          29
      subprogram        11.60%   fail       177
      project_budget    14.48%   fail      9487
      requested_amount  88.73%   fail      1356
      received_amount    9.32%   fail      2179
      contract_number   13.29%   fail     28627
      contract_date     57.88%   fail      1425
      source_comment    99.93%   fail         9
      source_id         89.52%   fail       814
  26. E and T from ETL: E as Extraction
  27. HTML Documents
  28. Ceci ne sont pas des données (“This is not data”)
  29. [Diagram: the DOM path to a single value: html → body → div#page → div#container → div#main → div#innerMain → anonymous divs → tables nested within tables (table → tbody → tr → td, several levels deep) → value ✓]
  30. Now: you parse! (3 seconds) *non-technical explanation follows
  31. <SPAN class=podnazov>More information </SPAN>
  32. ?
  33. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  34. ?
  35. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ... Literally: here is a subtitle and it should be in upper-case: “o”. And here is another subtitle: “dkaz na” (non-breaking space) “projekt”. Much better: here is a label: “Odkaz na projekt”.
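The styled-span trap above can be reproduced with the standard library alone: the visible label is split across two elements, so the extractor must join the fragments, fix the non-breaking space, and apply the CSS uppercase transform by hand. A minimal sketch (not the talk's actual parser), simplifying by uppercasing just the first character since the uppercase span holds only "o":

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collect the text fragments of all elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

snippet = ('<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN>'
           '<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>')
parser = SpanText()
parser.feed(snippet)
parser.close()

label = "".join(parser.parts).replace("\xa0", " ")  # join fragments, fix the nbsp
label = label[0].upper() + label[1:]                # apply the CSS uppercase by hand
print(label)  # -> Odkaz na projekt
```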
  37. “Structured” spreadsheets: error-prone, more work needed
  38. ✓ structured file format
  39. [Annotated spreadsheet: (1) image & title, (2) repeating groups of columns, (3) padding rows/columns, (4) redundancy removed for readability, (5) colored cells]
  40. [Annotated spreadsheet: (1) header with row padding, (2) multi-row logical cell, (3) broken pattern]
  41. [Annotated spreadsheet: (1) multi-row cell, (2) more values in a row]
  42. [Diagram: why a structured file format? A raw data file with fields (id, item, class, amount) passes cleanly through a parser into data extraction. Why not a “structured” spreadsheet? The same fields are buried in the layout.]
  43. E and T from ETL: T as Transformation
  44. Basic pattern (slightly more technical)
  45. [Diagram: basic pattern: source (plus lists and maps) → diff against target → target]
  46. SELECT ... EXCEPT SELECT ... *in PostgreSQL, not in MySQL
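The source-vs-target diff on this slide uses SQL's EXCEPT. A runnable sketch, using SQLite purely so the example is self-contained (the talk uses PostgreSQL, and notes MySQL lacks EXCEPT); table and column names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (name TEXT)")
con.execute("CREATE TABLE target (name TEXT)")
con.executemany("INSERT INTO source VALUES (?)", [("a",), ("b",), ("c",)])
con.executemany("INSERT INTO target VALUES (?)", [("a",), ("b",)])

# rows present in source but missing from target -- candidates to load
new_rows = con.execute(
    "SELECT name FROM source EXCEPT SELECT name FROM target"
).fetchall()
print(new_rows)  # -> [('c',)]
```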
  47. [Diagram: (1) sta_vvo_vysledky minus sta_regis and map_suppliers → unknown suppliers; (2) plus Slovensko → tmp_coalesced_suppliers_sk; (3) minus sta_suppliers → new suppliers]
  48. Script or manual? script
  49. Script or manual? Script: ■ recurrent processing (weekly, monthly, ...) ■ huge amounts of data. Manual: ■ one-time processing ■ small amounts of data
  50. appropriate tool for given task
  51. balance
  52. [The ETL pipeline diagram from slide 21, repeated]
  53. Brewery data streams
  54. [Diagram: processing streams: data sources (CSV file, relational database, Google Spreadsheet, Excel spreadsheet, remote URL) → data stream processing → data targets (e.g. reports)]
  55. [Diagram: a data source passes data to a data target either as data rows (ordered lists of values) or as data records (field → value mappings, e.g. id, item, class, amount)]
  56. Sources: ■ CSV file ■ XLS file ■ SQL query ■ MongoDB ■ Google spreadsheet ■ YAML directory ■ row list ■ record list
  57. Targets: ■ CSV file ■ SQL table ■ MongoDB ■ YAML directory ■ HTML table ■ formatted printer (e.g. {x:.2%} → 15.00%) ■ row list ■ record list
  58. Record operations: ■ append ■ distinct ■ aggregate ■ merge (join) ■ sample ■ select ■ set select ■ data audit ■ numerical statistics*
  59. Field operations: ■ field map ■ text substitute ■ value threshold* ■ derive* ■ string strip ■ coalesce value to type ■ value histogram/bin* ■ set to flag*
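Two of the field operations listed above, string strip and value coalescing, can be sketched as plain functions. The names and the placeholder markers are my own assumptions for illustration, not Brewery's API:

```python
def string_strip(row, fields):
    """Strip surrounding whitespace from the given string fields."""
    return {k: (v.strip() if k in fields and isinstance(v, str) else v)
            for k, v in row.items()}

def coalesce_value(value, empty_markers=("", "N/A", "-")):
    """Normalize placeholder values to None so audits count them as nulls."""
    return None if value in empty_markers else value

row = {"name": "  ACME  ", "ico": "N/A"}
row = string_strip(row, {"name"})
row["ico"] = coalesce_value(row["ico"])
print(row)  # -> {'name': 'ACME', 'ico': None}
```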
  60. [Diagram: an example stream: two SQL sources merged (+), filtered (?), written to an HTML table]
  61. nodes = {
          "source": CSVSourceNode(...),
          "clean": CoalesceValueToTypeNode(),
          "output": DatabaseTableTargetNode(...),
          "audit": AuditNode(...),
          "threshold": ValueThresholdNode(),
          "print": FormattedPrinterNode()
      }
      connections = [
          ("source", "clean"),
          ("clean", "output"),
          ("clean", "audit"),
          ("audit", "threshold"),
          ("threshold", "print")
      ]
      ... # configure nodes here
      stream = Stream(nodes, connections)
      stream.initialize()
      stream.run()
