Data Cleansing introduction (for BigClean Prague 2011)

Presentation from the BigClean event in spring 2011 in Prague. It briefly introduces data quality and cleansing, and shows some examples from existing open data / open government projects.

Transcript

  • 1. Data Cleansing: What about quality? Stefan Urbanek, stefan.urbanek@gmail.com, @Stiivi, March 2011
  • 2. Content: ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  • 3. http://vestnik.transparency.sk
  • 4. Brewery (analytical data streams) & Cubes (online analytical processing)
  • 5. Brewery (analytical data streams) & Cubes (online analytical processing); github/bitbucket: Stiivi
  • 6. Quality
  • 7. What is data quality?
  • 8. Dimensions: ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  • 9. completeness
  • 10. [chart: how many % of the field is filled and successfully processed, per month from 2005-3 to 2010-9, on a scale from none (0%) to all (100%)] Quality measure – completeness: 55%
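
A minimal sketch (not from the slides) of how such a completeness figure could be computed, assuming the records are plain Python dictionaries; the field and sample values below are illustrative:

    # Share of records in which a given field is filled in (non-empty):
    # the "completeness" measure shown in the chart above.
    def completeness(records, field):
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        return filled / len(records) if records else 0.0

    records = [
        {"receiver_name": "ACME"},
        {"receiver_name": ""},
        {"receiver_name": "Example Ltd."},
        {"receiver_name": None},
    ]
    print("completeness: {0:.0%}".format(completeness(records, "receiver_name")))
    # -> completeness: 50%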
  • 11. type 1 type 2 +
  • 12. [chart: how many % of the field is filled and successfully processed, per month from 2005-3 to 2010-9, on a scale from none (0%) to all (100%)] Quality measure – completeness: 88%
  • 13. reconstruction: 5€ temperature: 32˚C accuracy
  • 14. timeliness
  • 15. Auto-measurable: ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  • 16. What does that mean: “high quality data”?
  • 17. 85%
  • 18. appropriate for given purpose
  • 19. attach quality report
  • 20. Quality Measurement for accuracy and transparency
  • 21. ■ why to measure? ■ when to measure? ■ where to measure?
  • 22. [pipeline diagram, from source to staging data and from staging to the analytical model: Download → Parse → Load source → Cleanse → Create cube (raw sources → HTML files → YAML files → contracts staging table → clean data); the 2005–2008 sources (one large HTML file per year) additionally go through Pre-process and Create search index; other inputs: REGIS (SK organisations), "unknown" suppliers map; outputs: fact table, dimension tables, Procurement Document dimension, search index] – keep intermediate results for auditability
  • 23. [the same pipeline diagram as on slide 22] – insert probes at appropriate places
  • 24. like unit testing: 1. write probes, 2. set data quality indicators, 3. pass data through
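
A rough sketch of the "probes as unit tests" idea, independent of any particular library; the indicator name, sample records and threshold are made up for illustration:

    # Each probe measures one quality indicator; each indicator has a
    # required threshold, and the data "passes" only if every probe does.
    def null_ratio(records, field):
        return sum(1 for r in records if r.get(field) in (None, "")) / len(records)

    def run_probes(records, probes):
        results = []
        for name, measure, threshold in probes:
            value = measure(records)
            results.append((name, value, "ok" if value <= threshold else "fail"))
        return results

    records = [{"contract_date": "2011-03-01"}, {"contract_date": None}]
    probes = [
        ("contract_date nulls", lambda rs: null_ratio(rs, "contract_date"), 0.05),
    ]
    for name, value, status in run_probes(records, probes):
        print("{0}: {1:.2%} {2}".format(name, value, status))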
  • 25. [example stream diagram: YAML directory source → coalesce values → database table (PostgreSQL); side branch: data audit → value threshold → formatted printer ({x:.2%} → 15.00%)]
  • 26. Example audit output:

    field              nulls    status  distinct
    ------------------------------------------------
    file               0.00%    ok      100
    source_code        0.00%    ok      6
    year               0.00%    ok      6
    donor_code         0.00%    ok      2
    receiver_name      1.25%    fail    10363
    receiver_address   13.29%   fail    9979
    receiver_ico       13.53%   fail    5813
    project            0.01%    ok      28370
    program            0.00%    ok      29
    subprogram         11.60%   fail    177
    project_budget     14.48%   fail    9487
    requested_amount   88.73%   fail    1356
    received_amount    9.32%    fail    2179
    contract_number    13.29%   fail    28627
    contract_date      57.88%   fail    1425
    source_comment     99.93%   fail    9
    source_id          89.52%   fail    814
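
A minimal sketch of how a report like this could be produced without any framework, assuming records are dictionaries; the "fail when more than 1% nulls" rule below is invented for the example:

    def audit(records, fields, max_nulls=0.01):
        # For each field: share of empty values, distinct non-empty values,
        # and an ok/fail status against the null threshold.
        print("{0:<20} {1:>8} {2:>7} {3:>9}".format("field", "nulls", "status", "distinct"))
        for field in fields:
            values = [r.get(field) for r in records]
            nulls = sum(1 for v in values if v in (None, "")) / len(values)
            distinct = len(set(v for v in values if v not in (None, "")))
            status = "ok" if nulls <= max_nulls else "fail"
            print("{0:<20} {1:>8.2%} {2:>7} {3:>9}".format(field, nulls, status, distinct))

    audit(records=[{"year": 2005, "project": "A"}, {"year": 2006, "project": None}],
          fields=["year", "project"])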
  • 27. E and T from ETL E as Extraction
  • 28. HTML Documents
  • 29. Ceci ne sont pas des données (this is not data)
  • 30. [DOM path to the value: html → body → div#page → div#page → div#container → div#main → div#innerMain → div (anonymous) → div (anonymous) → table → tbody → tr → td → table → tbody → tr → td → table → tbody → tr → td → value ✓]
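
A hedged sketch of digging a value out of such deeply nested markup with BeautifulSoup; the markup and the class name "hodnota" are invented stand-ins, not the real page from the slide:

    from bs4 import BeautifulSoup

    html = """
    <html><body><div id="page"><div id="container"><div id="main">
      <table><tr><td>
        <table><tr><td>
          <table><tr><td class="hodnota">value</td></tr></table>
        </td></tr></table>
      </td></tr></table>
    </div></div></div></body></html>
    """

    soup = BeautifulSoup(html, "html.parser")
    # Relying on the full nesting path is fragile; a class-based selector
    # survives layout changes better.
    cell = soup.select_one("td.hodnota")
    print(cell.get_text(strip=True))   # -> value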
  • 31. Now: you parse! (3 seconds) *a non-technical explanation follows
  • 32. <SPAN class=podnazov>More information</SPAN>
  • 33. ?
  • 34. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
  • 35. ?
  • 36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt...
  • 37. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN><SPAN class=podnazov>dkaz na&nbsp;projekt... – here is a subtitle, and it should be in upper-case: "o"; and here is another subtitle: "dkaz na (non-breaking space) projekt"; much better: here is a label: "Odkaz na projekt"
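
A small sketch of cleaning up that kind of markup, again with BeautifulSoup; the approach (join adjacent spans of the same class, re-apply the CSS text-transform, replace non-breaking spaces) is an illustration, not the parser actually used in the project:

    from bs4 import BeautifulSoup

    html = ('<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN>'
            '<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>')

    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for span in soup.find_all("span", class_="podnazov"):
        text = span.get_text()
        # the upper-casing is done visually by CSS, so re-apply it to the data
        if "uppercase" in (span.get("style") or "").lower():
            text = text.upper()
        parts.append(text)

    label = "".join(parts).replace("\xa0", " ")
    print(label)   # -> Odkaz na projekt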
  • 38. “Structured” spreadsheets: error prone, more work needed
  • 39. ✓ structured file format
  • 40. [annotated spreadsheet screenshot] (1) image & title (2) repeating groups of columns (3) padding rows/columns (4) removed redundancy for readability (5) colored cells
  • 41. [annotated spreadsheet screenshot] (1) header with row padding (2) multi-row logical cell (3) broken pattern
  • 42. [annotated spreadsheet screenshot] (1) multi-row cell (2) more values in a row
  • 43. [diagram: why? – a structured file format goes through a parser and data extraction straight into fields (id, item, amount, class); why not? – a “structured” spreadsheet is still just raw data]
  • 44. E and T from ETL T as Transformation
  • 45. Basic pattern (slightly more technical)
  • 46. [diagram of the basic pattern: source (plus lists and maps) → diff against the target → target]
  • 47. SELECT ... EXCEPT SELECT ... (* in PostgreSQL, not in MySQL)
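
A sketch of the diff pattern from the slide, driven from Python with psycopg2; the connection string, table and column names are placeholders, not the project's real schema:

    import psycopg2

    # Rows present in the new staging table but missing from the target,
    # found with EXCEPT (works in PostgreSQL; MySQL of that era had no EXCEPT).
    query = """
        SELECT supplier_id, name FROM sta_suppliers_new
        EXCEPT
        SELECT supplier_id, name FROM sta_suppliers
    """

    conn = psycopg2.connect("dbname=procurement")
    cur = conn.cursor()
    cur.execute(query)
    for supplier_id, name in cur.fetchall():
        print(supplier_id, name)
    conn.close()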
  • 48. [diagram of supplier consolidation: sta_vvo_vysledky and sta_regis combined through map_suppliers; (1) unknown suppliers (Slovensko), (2) tmp_coalesced_suppliers_sk, (3) new suppliers diffed against sta_suppliers]
  • 49. Script or manual? script
  • 50. Script or manual? script ■ recurrent processing (weekly, monthly, ...) ■ huge amount of data ■ one-time processing ■ small amount of data
  • 51. appropriate tool for given task
  • 52. balance
  • 53. [the pipeline diagram from slides 22–23: from source to staging data and from staging to the analytical model]
  • 54. [the same pipeline diagram]
  • 55. Brewery data streams
  • 56. [diagram: data sources (CSV file, Google Spreadsheet, remote Excel Spreadsheet, URL, ...) → data stream processing (processing streams) → data targets (relational database, report, ...)]
  • 57. [diagram: between data source and data target, data flows either as data rows (ordered values for the fields id, item, class, amount) or as data records (field → value mappings)]
  • 58. Sources: CSV file, XLS file, SQL query, mongo DB, Google spreadsheet, YAML directory, row list, record list
  • 59. Targets: CSV file, SQL table, mongo DB, YAML directory, HTML table, formatted printer ({x:.2%} → 15.00%), row list, record list
  • 60. Record operations: append, distinct, aggregate, merge (join), sample, select, set select, data audit, numerical statistics*
  • 61. Field operations: field map (A→B), text substitute (re), value threshold*, derive*, string strip, consolidate value to type, histogram/bin*, set to flag*
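
A rough illustration of two of these field operations (string strip and coalescing a value to a type) written as plain functions rather than the actual Brewery nodes; all names here are illustrative:

    # "string strip" and "coalesce value to type" as plain record transforms
    def strip_strings(record):
        return {k: v.strip() if isinstance(v, str) else v
                for k, v in record.items()}

    def coalesce_to_type(record, types):
        out = dict(record)
        for field, typ in types.items():
            try:
                out[field] = typ(out[field])
            except (TypeError, ValueError):
                out[field] = None    # unparseable value becomes a null
        return out

    record = {"amount": " 15.00 ", "year": "2011"}
    record = coalesce_to_type(strip_strings(record), {"amount": float, "year": int})
    print(record)   # -> {'amount': 15.0, 'year': 2011}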
  • 62. [stream diagram built from the symbols on the previous slides: append (+), SQL, data audit (?), HTML table output]
  • 63.
    nodes = {
        "source": CSVSourceNode(...),
        "clean": CoalesceValueToTypeNode(),
        "output": DatabaseTableTargetNode(...),
        "audit": AuditNode(...),
        "threshold": ValueThresholdNode(),
        "print": FormattedPrinterNode()
    }

    connections = [
        ("source", "clean"),
        ("clean", "output"),
        ("clean", "audit"),
        ("audit", "threshold"),
        ("threshold", "print")
    ]

    ...  # configure nodes here

    stream = Stream(nodes, connections)
    stream.initialize()
    stream.run()
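
The stream is simply a directed graph: the nodes dictionary names the processing steps and connections lists the edges between them, mirroring the box-and-arrow diagrams on the preceding slides. The "clean" → "audit" → "threshold" → "print" branch is presumably what produces a formatted quality report of the kind shown on slide 26, while the main branch writes the coalesced data into the database table.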