Data Cleansing introduction (for BigClean Prague 2011)

Presentation from the BigClean event held in Prague in spring 2011. It briefly introduces data quality and data cleansing, and shows some examples from existing open data and open government projects.

Published in: Technology, Business

    1. Data Cleansing: What about quality? Stefan Urbanek, stefan.urbanek@gmail.com, @Stiivi, March 2011
    2. Content
        ■ Introduction
        ■ What is data quality?
        ■ E and T from ETL
        ■ Summary
    3. http://vestnik.transparency.sk
    4. Brewery: analytical data streams & Cubes: online analytical processing
    5. Brewery: analytical data streams & Cubes: online analytical processing (github/bitbucket: Stiivi)
    6. Quality
    7. What is data quality?
    8. Dimensions
        ■ completeness – data provided
        ■ accuracy – reflecting real world
        ■ credibility – regarded as true
        ■ timeliness – up-to-date
        ■ consistency – matching facts across datasets
        ■ integrity – valid references between datasets
    9. completeness
    10. [Chart: how many % of the field is filled and successfully processed? Monthly values, 2005-3 to 2010-9, on a scale from 0% (none) to 100% (all); higher is better.] Quality measure completeness: 55%
    11. type 1 + type 2
    12. [Chart: how many % of the field is filled and successfully processed? Monthly values, 2005-3 to 2010-9, on a scale from 0% (none) to 100% (all); higher is better.] Quality measure completeness: 88%
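
The completeness measure on the two charts above can be sketched as follows. This is a minimal illustration, not the code used for the slides: the record layout, the `month` key and the `supplier_id` field are assumptions made up for the example, and "filled" is taken to mean "not None and not an empty string".

```python
from collections import defaultdict


def completeness_by_month(records, field):
    """Return {month: share of records in that month whose `field` is filled}."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        month = rec["month"]  # hypothetical key holding e.g. "2009-01"
        total[month] += 1
        value = rec.get(field)
        # Assumption: a value counts as filled if it is neither None nor blank.
        if value is not None and str(value).strip() != "":
            filled[month] += 1
    return {m: filled[m] / total[m] for m in sorted(total)}


# Hypothetical sample data, only to show the calculation.
records = [
    {"month": "2009-01", "supplier_id": "123"},
    {"month": "2009-01", "supplier_id": None},
    {"month": "2009-02", "supplier_id": ""},
    {"month": "2009-02", "supplier_id": "456"},
    {"month": "2009-02", "supplier_id": "789"},
    {"month": "2009-02", "supplier_id": "012"},
]

shares = completeness_by_month(records, "supplier_id")
overall = sum(1 for r in records if r["supplier_id"]) / len(records)

for month, share in shares.items():
    print(f"{month}: {share:.0%}")
print(f"Quality measure completeness: {overall:.0%}")
```

The per-month values correspond to the bars of the chart, and the overall share corresponds to the single "quality measure" figure.
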
    13. accuracy (reconstruction: 5€; temperature: 32˚C)
    14. timeliness
    15. Auto-measurable
        ■ completeness – easily
        ■ accuracy – somehow
        ■ credibility – not-so
        ■ timeliness – easily
        ■ consistency – yes
        ■ integrity – yes
    16. What does that mean: “high quality data”?
    17. 85%
    18. appropriate for given purpose
    19. attach quality report
    20. Quality Measurement for accuracy and transparency
    21. ■ why to measure?
        ■ when to measure?
        ■ where to measure?
    22. [Diagram: processing pipeline with its intermediate results. From source to staging data, since 2009: Download, Parse, Load source (raw sources as HTML files, YAML files, contracts table in staging). From source to staging data, 2005-2008: Download, Pre-process, Parse, Load source (large HTML files, one per year; one HTML per procurement document; YAML files 2005-2008). From staging to analytical data: Cleanse, Create cube, Create search index (REGIS SK organisations, suppliers map, "unknown" entries, clean data, fact table, dimension tables, analytical model description, search index, dimension index).] keep intermediate results for auditability
    23. [Same pipeline diagram as slide 22.] insert probes at appropriate places
    24. like unit testing:
        1. write probes
        2. set data quality indicators
        3. pass data through
    25. [Diagram: stream of nodes: YAML directory (yml), coalesce values, PostgreSQL database table (SQL), data audit, threshold, formatted printer ({x:.2%}, 15.00%)]
    26. field               nulls   status  distinct
        ------------------------------------------------
        file                0.00%   ok      100
        source_code         0.00%   ok      6
        year                0.00%   ok      6
        donor_code          0.00%   ok      2
        receiver_name       1.25%   fail    10363
        receiver_address    13.29%  fail    9979
        receiver_ico        13.53%  fail    5813
        project             0.01%   ok      28370
        program             0.00%   ok      29
        subprogram          11.60%  fail    177
        project_budget      14.48%  fail    9487
        requested_amount    88.73%  fail    1356
        received_amount     9.32%   fail    2179
        contract_number     13.29%  fail    28627
        contract_date       57.88%  fail    1425
        source_comment      99.93%  fail    9
        source_id           89.52%  fail    814
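
A probe that produces a report shaped like the one above might look roughly like this. It is a plain-Python sketch, not the Brewery audit node: the function names, the 1% threshold and the sample records are assumptions for illustration, and distinct values are counted without nulls, which may differ from how the slide's numbers were produced.

```python
def audit(records, fields, threshold=0.01):
    """Return one (field, null ratio, status, distinct count) tuple per field.

    `threshold` is a hypothetical data quality indicator: a field fails
    when its share of nulls is above it.
    """
    report = []
    n = len(records)
    for name in fields:
        values = [rec.get(name) for rec in records]
        nulls = sum(1 for v in values if v is None)
        distinct = len({v for v in values if v is not None})
        ratio = nulls / n if n else 0.0
        status = "ok" if ratio <= threshold else "fail"
        report.append((name, ratio, status, distinct))
    return report


def format_report(report):
    """Render the audit result as a fixed-width text table."""
    lines = [f"{'field':<20}{'nulls':>8} {'status':<6} {'distinct':>8}"]
    for name, ratio, status, distinct in report:
        lines.append(f"{name:<20}{ratio:>8.2%} {status:<6} {distinct:>8}")
    return "\n".join(lines)


# Hypothetical records passed through the probe.
records = [
    {"year": 2009, "receiver_ico": "00123456"},
    {"year": 2009, "receiver_ico": None},
    {"year": 2010, "receiver_ico": "00999999"},
    {"year": 2010, "receiver_ico": "00123456"},
]

report = audit(records, ["year", "receiver_ico"])
print(format_report(report))
```

As with unit tests, the probe and its threshold are written once and the data is passed through them on every run.
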
    27. E and T from ETL: E as Extraction
    28. HTML Documents
    29. Ceci ne sont pas des données (“These are not data”)
    30. [Diagram: element nesting on the way to a single value: html > body > div id=#page > div id=#page > div id=#container > div id=#main > div id=#innerMain > div (anonymous) > div (anonymous) > table > tbody > tr > td > (further nested tables) > value √]
    31. Now: you parse! 3 seconds (* non-technical explanation follows)
    32. <SPAN class=podnazov>More information</SPAN>
    33. ?
    34. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
    35. ?
    36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
    37. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
        ■ here is a subtitle and it should be in upper-case: o
        ■ And here is another subtitle: dkaz na (non-breaking space) projekt
        ■ much better: here is a label: Odkaz na projekt (Slovak for “Link to the project”)
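
What a parser has to do to recover the label from the markup above can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: the class and function names are made up for the example, the truncated fragment from the slide is completed here with a closing `</SPAN>` so that it can be parsed, and only an inline `text-transform: uppercase` style is honoured.

```python
from html.parser import HTMLParser


class LabelParser(HTMLParser):
    """Glue the text of adjacent <span> elements into one label."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._stack = []  # one flag per open span: True if uppercase applies

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower().replace(" ", "")
            self._stack.append("text-transform:uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            text = data.replace("\xa0", " ")  # non-breaking space to plain space
            self.parts.append(text.upper() if self._stack[-1] else text)


def extract_label(html):
    parser = LabelParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# The slide's fragment, closed with </SPAN> here as an assumption.
fragment = (
    '<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN>'
    "<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>"
)
label = extract_label(fragment)
print(label)
```

The point of the slide stands: presentation markup splits one logical value into pieces, and the extraction step has to undo that before the value is usable as data.
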
    38. “Structured” spreadsheets: error prone, more work needed
    39. ✓ structured file format
    40. [Spreadsheet screenshot with numbered callouts]
        (1) image & title
        (2) repeating groups of columns
        (3) padding rows/columns
        (4) removed redundancy for readability
        (5) colored cells
    41. [Spreadsheet screenshot with numbered callouts]
        (1) header with row padding
        (2) multi-row logical cell
        (3) broken pattern
    42. [Spreadsheet screenshot with numbered callouts]
        (1) multi-row cell
        (2) more values in a row
    43. [Diagram contrasting two paths. “why?”: source, file format parser, data extraction, table (id, item, class, amount). “why not?”: “structured” file, raw data table (id, item, class, amount).]
    44. E and T from ETL: T as Transformation
    45. Basic pattern (slightly more technical)
    46. [Diagram: source, lists and maps, diff, target]
    47. SELECT ... EXCEPT SELECT ... (* in PostgreSQL, not in MySQL)
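
The `SELECT ... EXCEPT SELECT ...` pattern computes a set difference: rows present in the first query but missing from the second. A minimal sketch follows, run against an in-memory SQLite database (which also supports `EXCEPT`) so that it is self-contained. The table names, column name and values are hypothetical and are not the tables used in the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical staging table and list of already known suppliers.
    CREATE TABLE staging_contracts (supplier_name TEXT);
    CREATE TABLE known_suppliers (supplier_name TEXT);
    INSERT INTO staging_contracts VALUES ('Alfa'), ('Beta'), ('Gama'), ('Beta');
    INSERT INTO known_suppliers VALUES ('Alfa'), ('Gama');
    """
)

# Suppliers that appear in the source but not in the known list.
# EXCEPT also removes duplicate rows from the result.
unknown = [
    row[0]
    for row in conn.execute(
        """
        SELECT supplier_name FROM staging_contracts
        EXCEPT
        SELECT supplier_name FROM known_suppliers
        ORDER BY supplier_name
        """
    )
]
print(unknown)
conn.close()
```

The result of such a diff is the list of values that still need a mapping or a manual decision before they can be loaded into the target.
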
    48. [Diagram: tables sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk and sta_suppliers combined with − and + in three steps: (1) unknown suppliers, (2) Slovensko, (3) new suppliers]
    49. Script or manual? script
    50. Script or manual?
        script:
        ■ recurrent processing (weekly, monthly, ...)
        ■ huge amount of data
        manual:
        ■ one-time processing
        ■ small amount of data
    51. appropriate tool for given task
    52. balance
    53. [Pipeline diagram from slide 22, shown again without a caption]
    54. [Pipeline diagram from slide 22, shown again without a caption]
    55. Brewery data streams
    56. [Diagram: Data Sources (CSV file, Google Spreadsheet, Excel Spreadsheet, remote URL), data stream processing (processing streams), Data Targets (relational database, report)]
    57. [Diagram: a data source passes data to a data target either as data rows (lists of values for the fields id, item, class, amount) or as data records (field/value pairs: id: value, item: value, class: value, amount: value)]
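
The difference between rows and records can be shown in a few lines of plain Python. This is a sketch of the idea only, not Brewery's API: the helper names and sample values are made up, and the field names are the ones from the diagram.

```python
# Field names from the diagram; the values are hypothetical.
fields = ["id", "item", "class", "amount"]

# Rows: just the values, in field order. The field list travels separately.
rows = [
    [1, "paper", "office", 10.0],
    [2, "toner", "office", 55.5],
]


def rows_to_records(fields, rows):
    """Turn value lists into self-describing field/value mappings."""
    return [dict(zip(fields, row)) for row in rows]


def records_to_rows(fields, records):
    """Turn field/value mappings back into value lists in field order."""
    return [[rec.get(name) for name in fields] for rec in records]


records = rows_to_records(fields, rows)
print(records[0])
```

Rows are compact and suit fixed, table-like sources; records carry their field names with them, which suits sources where fields may be missing or vary between items.
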
    58. Sources: CSV file, XLS file, SQL query, mongo DB, Google spreadsheet, YAML directory, row list, record list
    59. Targets: CSV file, SQL table, mongo DB, YAML directory, HTML table, formatted printer ({x:.2%}, 15.00%), row list, record list
    60. Record Operations: append, distinct, aggregate, merge (join), sample, select, set select, data audit, numerical statistics*
    61. Field Operations: field map, text substitute, value threshold*, derive*, string strip, consolidate value, histogram/bin*, set to flag*, to type
    62. [Diagram: icons only (SQL, <html>)]
    63. [Code listing shown next to the stream diagram]

        nodes = {
            "source": CSVSourceNode(...),
            "clean": CoalesceValueToTypeNode(),
            "output": DatabaseTableTargetNode(...),
            "audit": AuditNode(...),
            "threshold": ValueThresholdNode(),
            "print": FormattedPrinterNode()
        }

        connections = [
            ("source", "clean"),
            ("clean", "output"),
            ("clean", "audit"),
            ("audit", "threshold"),
            ("threshold", "print")
        ]

        ... # configure nodes here

        stream = Stream(nodes, connections)
        stream.initialize()
        stream.run()
