
Data Cleansing introduction (for BigClean Prague 2011)

Presentation from the BigClean event in spring 2011 in Prague. Briefly introduces data quality and cleansing, and shows some examples from existing open data / open government projects.

Published in: Technology, Business


  1. 1. Data Cleansing: What about quality? (Stefan Urbanek, stefan.urbanek@gmail.com, @Stiivi, March 2011)
  2. 2. Content: ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  3. 3. http://vestnik.transparency.sk
  4. 4. Brewery (analytical data streams) & Cubes (online analytical processing)
  5. 5. Brewery (analytical data streams) & Cubes (online analytical processing); github/bitbucket: Stiivi
  6. 6. Quality
  7. 7. What is data quality ?
  8. 8. Dimensions: ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  9. 9. completeness
  10. 10. [Chart: monthly completeness, 2005-03 to 2010-09; y-axis 0–100%, none → all, higher is better; measures how many % of the field is filled and successfully processed.] Quality measure completeness: 55%
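A completeness measure like the one charted on the slide above (the share of records where the field is filled and usable) can be sketched in a few lines of Python. The helper name and the set of "empty" markers are illustrative assumptions, not code from the presentation:

```python
def completeness(values):
    """Share of values that are filled (and thus usable)."""
    filled = sum(1 for v in values if v not in (None, "", "N/A"))
    return filled / len(values) if values else 0.0

# e.g. 11 of 20 monthly records have the field filled:
completeness([None] * 9 + ["x"] * 11)  # → 0.55
```

In practice the same function would be run per field and per month to produce the time series shown in the chart.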
  11. 11. type 1 type 2 +
  12. 12. [Chart: the same monthly completeness measure after improvements; how many % of the field is filled and successfully processed.] Quality measure completeness: 88%
  13. 13. reconstruction: 5€ temperature: 32˚C accuracy
  14. 14. timeliness
  15. 15. Auto-measurable: ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  16. 16. What does that mean: “high quality data”?
  17. 17. 85%
  18. 18. appropriate for given purpose
  19. 19. attach quality report
  20. 20. Quality Measurement for accuracy and transparency
  21. 21. ■ why to measure?■ when to measure?■ where to measure?
  22. 22. [Diagram: procurement ETL pipeline. Since 2009: Download → Parse → Load source → Cleanse → Create cube (raw sources → HTML files → YAML files → contracts table (staging) → clean data → analytical model). 2005–2008: large HTML files (one per year) → Pre-process → one HTML per document → Parse → YAML files → Load source → staging; REGIS (SK organisations) and an “unknown” suppliers map feed the fact and dimension tables; a search index is built per Procurement dimension.] keep intermediate results for auditability
  23. 23. [Diagram: the same procurement ETL pipeline as on the previous slide.] insert probes at appropriate places
  24. 24. like unit testing: 1. write probes, 2. set data quality indicators, 3. pass data through
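The three steps above can be sketched in plain Python. The probe here (null share per field) and the threshold-based indicators are hypothetical examples, not the presentation's actual code:

```python
def null_share(rows, field):
    """Probe: fraction of rows where the field is missing or empty."""
    nulls = sum(1 for row in rows if row.get(field) in (None, ""))
    return nulls / len(rows)

def audit(rows, thresholds):
    """3. Pass the data through the probes and compare against indicators."""
    report = {}
    for field, max_null_share in thresholds.items():  # 2. the indicators
        share = null_share(rows, field)               # 1. the probe
        report[field] = (share, "ok" if share <= max_null_share else "fail")
    return report

rows = [{"year": 2009, "project": "A"},
        {"year": 2010, "project": None}]
audit(rows, {"year": 0.0, "project": 0.10})
# "year" passes (no nulls); "project" fails (50% nulls > 10% threshold)
```

Like a unit test suite, the audit runs unattended on every load and flags fields that drift below the agreed quality level.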
  25. 25. [Diagram: example stream: YAML directory source → coalesce values → PostgreSQL database table; branch: data audit → value threshold → formatted printer ({x:.2%} → 15.00%).]
  26. 26. Audit report:

      field              nulls   status  distinct
      -----------------  ------  ------  --------
      file                0.00%  ok           100
      source_code         0.00%  ok             6
      year                0.00%  ok             6
      donor_code          0.00%  ok             2
      receiver_name       1.25%  fail       10363
      receiver_address   13.29%  fail        9979
      receiver_ico       13.53%  fail        5813
      project             0.01%  ok         28370
      program             0.00%  ok            29
      subprogram         11.60%  fail         177
      project_budget     14.48%  fail        9487
      requested_amount   88.73%  fail        1356
      received_amount     9.32%  fail        2179
      contract_number    13.29%  fail       28627
      contract_date      57.88%  fail        1425
      source_comment     99.93%  fail           9
      source_id          89.52%  fail         814
  27. 27. E and T from ETL E as Extraction
  28. 28. HTML Documents
  29. 29. Ceci ne sont pas des données (“This is not data”)
  30. 30. [Diagram: the deeply nested HTML DOM hiding the value: html > body > div#page > div#container > div#main > div#innerMain > anonymous divs > tables within tables within tables > td with the value ✓]
  31. 31. Now: you parse! (3 seconds) *non-technical explanation follows
  32. 32. <SPAN class=podnazov>More information</SPAN>
  33. 33. ?
  34. 34. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
  35. 35. ?
  36. 36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
  37. 37. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt... — i.e. “here is a subtitle, and it should be in upper-case: ‘o’; and here is another subtitle: ‘dkaz na (non-breaking space) projekt’”. Much better: “here is a label: Odkaz na projekt”
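Reassembling a label that the page split across styled span fragments, as above, can be done with Python's standard html.parser. This is a hedged sketch (the class and helper names are made up, and it only honours the one style the slide shows), not the parser used in the project:

```python
from html.parser import HTMLParser

class SpanJoiner(HTMLParser):
    """Collect text from <span> fragments, applying an inline
    'text-transform: uppercase' style the way a browser renders it."""
    def __init__(self):
        super().__init__()
        self._upper = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            self._upper = "uppercase" in style.lower()

    def handle_endtag(self, tag):
        if tag == "span":
            self._upper = False

    def handle_data(self, data):
        text = data.replace("\xa0", " ")  # &nbsp; -> plain space
        self.parts.append(text.upper() if self._upper else text)

parser = SpanJoiner()
parser.feed('<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN>'
            '<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>')
print("".join(parser.parts))  # Odkaz na projekt
```

The point of the slide stands: what looks like one word on screen is, in the markup, an uppercase-styled "o" glued to "dkaz na&nbsp;projekt", and the extractor has to undo that presentation trick.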
  38. 38. “Structured” spreadsheets: error prone, more work needed
  39. 39. ✓ structured file format
  40. 40. [Annotated spreadsheet:] (1) image & title, (2) repeating groups of columns, (3) padding rows/columns, (4) removed redundancy for readability, (5) colored cells
  41. 41. [Annotated spreadsheet:] (1) header with row padding, (2) multi-row logical cell, (3) broken pattern
  42. 42. [Annotated spreadsheet:] (1) multi-row cell, (2) more values in a row
  43. 43. [Diagram: why? / why not? With a “structured” file you have to track source id, item id, file format, parser and data-extraction logic just to obtain class / item / amount; with structured raw data you read class, item, amount directly.]
  44. 44. E and T from ETL T as Transformation
  45. 45. Basic pattern slightly more technical
  46. 46. [Diagram: source (plus lists and maps) compared against target → diff → applied to target]
  47. 47. SELECT ... EXCEPT SELECT ... (*in PostgreSQL, not in MySQL)
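The set-difference diff can be tried out with the stdlib sqlite3 module, since SQLite supports EXCEPT as well (the slide names PostgreSQL; the table and column names here are invented for the sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source(supplier TEXT);
    CREATE TABLE target(supplier TEXT);
    INSERT INTO source VALUES ('Alpha'), ('Beta'), ('Gamma');
    INSERT INTO target VALUES ('Alpha');
""")

# rows present in the source but not yet in the target
new_rows = con.execute(
    "SELECT supplier FROM source"
    " EXCEPT SELECT supplier FROM target"
    " ORDER BY supplier"
).fetchall()
# new_rows → [('Beta',), ('Gamma',)]
```

The same EXCEPT query against the staging and target tables yields exactly the delta that has to be inserted, which is the diff step in the diagram above.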
  48. 48. [Diagram: supplier matching across sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk and sta_suppliers: (1) unknown suppliers, (2) coalesced Slovak suppliers, (3) new suppliers.]
  49. 49. Script or manual? script
  50. 50. Script or manual? script ■ recurrent processing (weekly, monthly, ...) ■ huge amount of data ■ one-time processing ■ small amount of data
  51. 51. appropriate tool for given task
  52. 52. balance
  53. 53. [Diagram: the full procurement ETL pipeline from slides 22–23, shown again as a summary.]
  54. 54. [Diagram: the same pipeline, repeated.]
  55. 55. Brewery data streams
  56. 56. [Diagram: data stream processing: Data Sources (CSV file, Google Spreadsheet, Excel spreadsheet, remote URL, relational database) → processing streams → Data Targets (relational database, report, ...).]
  57. 57. [Diagram: between data source and data target the stream carries either data rows (ordered values for fields id, item, class, amount) or data records (field: value pairs).]
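The two representations on the slide interconvert trivially; these helpers are illustrative (not Brewery's API), with the field list taken from the diagram:

```python
# the same stream row, either as an ordered list of values
# or as a record of field: value pairs
FIELDS = ["id", "item", "class", "amount"]

def row_to_record(row, fields=FIELDS):
    return dict(zip(fields, row))

def record_to_row(record, fields=FIELDS):
    return [record[f] for f in fields]

row_to_record([1, "pipe", "material", 25.0])
# → {'id': 1, 'item': 'pipe', 'class': 'material', 'amount': 25.0}
```

Rows are compact for bulk transport; records are self-describing and convenient inside transformation nodes, which is why a stream framework typically offers both.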
  58. 58. Sources: SQL query, CSV file, XLS file, mongo DB, Google spreadsheet, YAML directory → produce a row list or record list
  59. 59. Targets: SQL table, CSV file, mongo DB, YAML directory, HTML table, formatted printer ({x:.2%} → 15.00%) ← consume a row list or record list
  60. 60. Record operations: append, distinct, aggregate, merge (join), sample, select / set select, data audit, numerical statistics*
  61. 61. Field operations: field map (A→B), text substitute (re), value threshold*, derive*, string strip, consolidate, value histogram/bin*, set to flag*, to type
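As an example of one field operation, value consolidation (roughly what a coalesce node in the code slide does, simplified here to just replacing assorted "empty" markers; the function name and the marker set are assumptions for this sketch):

```python
def coalesce_value(value, default=0, empty=(None, "", "N/A", "-")):
    """Map the various markers for 'no value' onto a single default."""
    return default if value in empty else value

[coalesce_value(v) for v in ["12", "", "N/A", "7"]]
# → ['12', 0, 0, '7']
```

Run field-by-field over a stream, an operation like this gives downstream nodes one consistent representation of missing data instead of a zoo of placeholder strings.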
  62. 62. [Diagram: an example stream wired from SQL and HTML nodes.]
  63. 63. nodes = {
              "source": CSVSourceNode(...),
              "clean": CoalesceValueToTypeNode(),
              "output": DatabaseTableTargetNode(...),
              "audit": AuditNode(...),
              "threshold": ValueThresholdNode(),
              "print": FormattedPrinterNode()
          }

          connections = [
              ("source", "clean"),
              ("clean", "output"),
              ("clean", "audit"),
              ("audit", "threshold"),
              ("threshold", "print")
          ]

          ... # configure nodes here

          stream = Stream(nodes, connections)
          stream.initialize()
          stream.run()
