Data Cleansing
What about quality?

Stefan Urbanek
stefan.urbanek@gmail.com
@Stiivi
March 2011
Content

■   Introduction
■   What is data quality?
■   E and T from ETL
■   Summary
http://vestnik.transparency.sk
Brewery: analytical data streams
&
Cubes: online analytical processing

github/bitbucket: Stiivi
Quality
What is data quality?
Dimensions
■   completeness – data provided
■   accuracy – reflecting real world
■   credibility – regarded as true
■   timeliness – up-to-date
■   consistency – matching facts across datasets
■   integrity – valid references between datasets
completeness
[Chart: completeness by period, 2005-3 to 2010-9; vertical scale from none (0%) to all (100%), higher is better]
how many % of the field is filled and successfully processed?
Quality measure: completeness 55%
type 1 + type 2
[Chart: completeness by period, 2005 to 2010; vertical scale from none (0%) to all (100%), higher is better]
how many % of the field is filled and successfully processed?
Quality measure: completeness 88%
accuracy
reconstruction: 5€
temperature: 32˚C
timeliness
Auto-measurable
■   completeness – easily
■   accuracy – somehow
■   credibility – not-so
■   timeliness – easily
■   consistency – yes
■   integrity – yes
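
Two of the automatically measurable dimensions can be illustrated with a minimal sketch. It assumes records are plain Python dictionaries; the field names (`supplier_id`, `published`), the sample data, and both helper functions are hypothetical and not part of Brewery.

```python
from datetime import date


def dangling_references(records, field, valid_keys):
    """Integrity: values of `field` that point to no known key."""
    valid = set(valid_keys)
    return sorted({r[field] for r in records
                   if r.get(field) is not None and r[field] not in valid})


def staleness_days(records, date_field, today):
    """Timeliness: days elapsed since the most recent record."""
    latest = max(r[date_field] for r in records)
    return (today - latest).days


# Hypothetical sample: contracts referencing suppliers by id
supplier_ids = [1, 2]
contracts = [
    {"supplier_id": 1, "published": date(2011, 1, 10)},
    {"supplier_id": 3, "published": date(2011, 2, 1)},
    {"supplier_id": 2, "published": date(2010, 12, 5)},
]

dangling = dangling_references(contracts, "supplier_id", supplier_ids)
age = staleness_days(contracts, "published", today=date(2011, 3, 1))
print(dangling, age)
```

Passing `today` explicitly keeps the measure reproducible, which matters when the number ends up in a quality report.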
What does that mean: “high quality data”?
85%
appropriate for given purpose
attach quality report
Quality Measurement
for accuracy and transparency
■ why to measure?
■ when to measure?
■ where to measure?
[Pipeline diagram]
from source to staging data (since 2009):
  raw sources → Download → HTML files → Parse → YAML files → Load source → contracts table (staging)
from source to staging data (2005-2008):
  raw sources 2005-2008 → Download → Large HTML files (one per year) → Pre-process → One HTML per Procurement Document → Parse → YAML files → Load source
from staging to analytical data:
  contracts table (staging) → Cleanse → staging clean data → Create cube → fact table, dimension tables
  other elements: analytical model description, REGIS (SK organisations), "unknown" suppliers map, Create search index → search index, dimension index

keep intermediate results for auditability
[Pipeline diagram repeated: from source to staging data, from staging to analytical data]

insert probes at appropriate places
like unit testing:

1. write probes
2. set data quality indicators
3. pass data through
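
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not Brewery's API: the `NullProbe` class, the `contract_date` sample values, and the 25% limit are all assumptions made for the example.

```python
class NullProbe:
    """Counts empty values of one field while records pass through."""

    def __init__(self, field, max_null_ratio):
        self.field = field
        self.max_null_ratio = max_null_ratio  # quality indicator (assumed value)
        self.total = 0
        self.nulls = 0

    def watch(self, records):
        # Generator: yields every record unchanged, so the probe can be
        # inserted between any two processing steps.
        for record in records:
            self.total += 1
            if record.get(self.field) in (None, ""):
                self.nulls += 1
            yield record

    @property
    def null_ratio(self):
        return self.nulls / self.total if self.total else 0.0

    @property
    def passed(self):
        return self.null_ratio <= self.max_null_ratio


# 1. write probes, 2. set data quality indicators
probe = NullProbe("contract_date", max_null_ratio=0.25)

# 3. pass data through
source = [
    {"contract_date": "2010-01-15"},
    {"contract_date": None},
    {"contract_date": "2010-03-02"},
    {"contract_date": ""},
]
processed = list(probe.watch(source))

print("{:.2%}".format(probe.null_ratio), "ok" if probe.passed else "fail")
```

As with a unit test, the probe does not change the data; it only reports whether the stream met the indicator that was set beforehand.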
[Stream diagram]
YAML directory → coalesce values → PostgreSQL database table
coalesce values → data audit → threshold → formatted printer ({x:.2%} → 15.00%)
field                            nulls     status   distinct
------------------------------------------------------------
file                             0.00%         ok        100
source_code                      0.00%         ok          6
year                             0.00%         ok          6
donor_code                       0.00%         ok          2
receiver_name                    1.25%       fail      10363
receiver_address                13.29%       fail       9979
receiver_ico                    13.53%       fail       5813
project                          0.01%         ok      28370
program                          0.00%         ok         29
subprogram                      11.60%       fail        177
project_budget                  14.48%       fail       9487
requested_amount                88.73%       fail       1356
received_amount                  9.32%       fail       2179
contract_number                 13.29%       fail      28627
contract_date                   57.88%       fail       1425
source_comment                  99.93%       fail          9
source_id                       89.52%       fail        814
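
A report of this shape can be approximated in a few lines of plain Python. The sketch below is not Brewery's audit node. It assumes that "nulls" means `None` values, that nulls are excluded from the distinct count, and that a field fails when its null ratio reaches a threshold. The 1% threshold is a guess that merely fits the table (0.01% is ok, 1.25% fails).

```python
def audit(records, fields, threshold):
    """Return (field, null ratio, status, distinct count) for each field."""
    report = []
    total = len(records)
    for field in fields:
        values = [r.get(field) for r in records]
        nulls = sum(1 for v in values if v is None)
        distinct = len({v for v in values if v is not None})
        ratio = nulls / total if total else 0.0
        status = "ok" if ratio < threshold else "fail"
        report.append((field, ratio, status, distinct))
    return report


def print_report(report):
    print("{:<24}{:>10}{:>8}{:>10}".format("field", "nulls", "status", "distinct"))
    print("-" * 52)
    for field, ratio, status, distinct in report:
        print("{:<24}{:>10.2%}{:>8}{:>10}".format(field, ratio, status, distinct))


# Hypothetical sample records
records = [
    {"year": 2009, "receiver_name": "A", "contract_date": None},
    {"year": 2009, "receiver_name": None, "contract_date": None},
    {"year": 2010, "receiver_name": "B", "contract_date": "2010-05-01"},
    {"year": 2010, "receiver_name": "A", "contract_date": None},
]
report = audit(records, ["year", "receiver_name", "contract_date"], threshold=0.01)
print_report(report)
```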
E and T from ETL
     E as Extraction
HTML Documents
Ceci ne sont pas des données ("These are not data")
html
 body
  div id=#page
   div id=#page
    div id=#container
     div id=#main
      div id=#innerMain
       div (anonymous)
        div (anonymous)
         table > tbody > tr > td
          table > tbody > tr > td
           table > tbody > tr > td
            table > tbody > tr > td: value √
Now: you parse!
3 seconds

*non-technical explanation follows
<SPAN class=podnazov>More information
</SPAN>
?
<SPAN class=podnazov
  style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na&nbsp;projekt
...

here is a subtitle
and it should be in upper-case:
o
And here is another subtitle:
dkaz na (non-breaking space) projekt

much better:
here is a label: Odkaz na projekt (Slovak: "Link to project")
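
Recovering the label from such markup can be sketched with the standard library's `html.parser`. The sketch assumes that adjacent `podnazov` spans are fragments of one label and can simply be concatenated. The `LabelParser` class and `extract_label` function are hypothetical, and the sample closes the second span, which is elided on the slide.

```python
from html.parser import HTMLParser


class LabelParser(HTMLParser):
    """Collects text of <span class=podnazov> elements, honouring uppercase style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []
        self._in_label = False
        self._upper = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "podnazov":
            self._in_label = True
            style = (attrs.get("style") or "").lower().replace(" ", "")
            self._upper = "text-transform:uppercase" in style

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_label = False
            self._upper = False

    def handle_data(self, data):
        if self._in_label:
            # Replace non-breaking spaces and drop layout whitespace
            text = data.replace("\xa0", " ").strip()
            if text:
                self.fragments.append(text.upper() if self._upper else text)


def extract_label(html):
    parser = LabelParser()
    parser.feed(html)
    parser.close()
    # Assumption: the fragments are pieces of a single split label
    return "".join(parser.fragments)


sample = (
    '<SPAN class=podnazov\n  style="TEXT-TRANSFORM: uppercase">o\n</SPAN>\n'
    '<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>'
)
label = extract_label(sample)
print(label)
```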
“Structured” spreadsheets
error prone
more work needed
✓ structured file format
[Spreadsheet screenshot with numbered problem areas]
(1) image & title
(2) repeating groups of columns
(3) padding rows/columns
(4) removed redundancy for readability
(5) colored cells
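
Problems (3) and (4) are the ones most easily handled by a script. The following is a minimal sketch, assuming the sheet has already been read into a list of rows and that a blank cell in the chosen columns means "same as the row above". The `clean_rows` function and the sample data are hypothetical.

```python
def clean_rows(rows, fill_columns):
    """Drop padding rows and restore values blanked out for readability."""
    cleaned = []
    last_seen = {}
    for row in rows:
        # (3) padding rows: skip rows where every cell is empty
        if all(cell in (None, "") for cell in row):
            continue
        row = list(row)
        # (4) removed redundancy: carry the last value down the column
        for index in fill_columns:
            if row[index] in (None, ""):
                row[index] = last_seen.get(index)
            else:
                last_seen[index] = row[index]
        cleaned.append(row)
    return cleaned


raw = [
    ["", "", ""],
    ["Ministry A", "roads", 100],
    ["", "bridges", 250],
    ["", "", ""],
    ["Ministry B", "schools", 80],
    ["", "", 40],
]
cleaned = clean_rows(raw, fill_columns=[0, 1])
for row in cleaned:
    print(row)
```

Filling down is only safe when the blank really stands for a repeated value. Where a blank can also mean "unknown", the column should be left out of `fill_columns`.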
[Spreadsheet screenshot with numbered problem areas]
(1) header with row padding
(2) multi-row logical cell
(3) broken pattern
[Spreadsheet screenshot with numbered problem areas]
(1) multi-row cell
(2) more values in a row
why?
[Diagram: source (“structured” file) → file format parser → data extraction → raw data (id, item, class, amount)]
why not?
E and T from ETL
    T as Transformation
Basic pattern
slightly more technical

[Diagram labels: source + lists and maps, target, diff, target]
SELECT ...
EXCEPT
SELECT ...

      *in PostgreSQL, not in MySQL (as of this talk, 2011)
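
The diff pattern can be tried out with a small self-contained sketch. It uses SQLite only because it ships with Python and also supports EXCEPT; the slides themselves refer to PostgreSQL. The table names `source_suppliers` and `known_suppliers` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_suppliers (name TEXT);
    CREATE TABLE known_suppliers (name TEXT);
    INSERT INTO source_suppliers VALUES ('Alfa'), ('Beta'), ('Gama'), ('Beta');
    INSERT INTO known_suppliers VALUES ('Alfa'), ('Gama');
""")

# Rows present in the source but missing from the target:
# EXCEPT returns the set difference and removes duplicates.
new_suppliers = [row[0] for row in conn.execute("""
    SELECT name FROM source_suppliers
    EXCEPT
    SELECT name FROM known_suppliers
    ORDER BY name
""")]
print(new_suppliers)
conn.close()
```

Where EXCEPT is unavailable, the same difference is commonly written with `NOT EXISTS` or a `LEFT JOIN ... WHERE ... IS NULL`.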
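[Diagram: set operations (+ union, - difference) on staging tables]
tables: sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk, sta_suppliers
(1) unknown suppliers
(2) tmp_coalesced_suppliers_sk (Slovensko = Slovakia)
(3) new suppliers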
Script or manual?

script
■ recurrent processing (weekly, monthly,...)
■ huge amount of data

manual
■ one-time processing
■ small amount of data
appropriate tool
 for given task
balance
[Pipeline diagram repeated: from source to staging data, from staging to analytical data]
Brewery
 data streams
[Diagram: processing streams]
Data Sources: CSV file, Google Spreadsheet, remote Excel Spreadsheet (URL)
→ data stream processing →
Data Targets: relational database, report
data source → data row, data row, data row → data target
data row: value, value, value, value (fields: id, item, class, amount)

data source → data record, data record, data record → data target
data record: id: value, item: value, class: value, amount: value
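
In Python terms, a row is naturally a list of values whose meaning comes from a separate field list, while a record is a dictionary that carries its field names. A minimal sketch with hypothetical sample values and helper functions:

```python
fields = ["id", "item", "class", "amount"]

# Rows: positional values, field names kept separately
rows = [
    [1, "pencil", "office", 0.5],
    [2, "desk", "furniture", 120.0],
]


def rows_to_records(rows, fields):
    """Attach field names to each row."""
    return [dict(zip(fields, row)) for row in rows]


def records_to_rows(records, fields):
    """Flatten records back into rows, in the given field order."""
    return [[record.get(name) for name in fields] for record in records]


records = rows_to_records(rows, fields)
print(records[0])
```

Rows are compact and fast to pass along; records are self-describing, which suits sources whose items do not all share the same fields.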
Sources
■ CSV file
■ XLS file
■ SQL query
■ mongo DB
■ Google spreadsheet
■ YAML directory
■ row list
■ record list
Targets
■ CSV file
■ SQL table
■ mongo DB
■ YAML directory
■ HTML table
■ formatted printer
■ row list
■ record list
Record Operations
■ append
■ distinct
■ aggregate
■ merge (join)
■ sample
■ select
■ set select
■ data audit
■ numerical statistics*
Field Operations
■ field map (A→B)
■ text substitute
■ value threshold*
■ derive*
■ string strip
■ consolidate value to type
■ histogram/bin*
■ set to flag*
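
Three of these field operations can be sketched as plain functions on a record. This illustrates what the operations mean and is not Brewery's implementation. The function names, the sample record, and the choice to fall back to a default on unconvertible values are all assumptions.

```python
def field_map(record, mapping):
    """Rename fields: keys of `mapping` are old names, values are new names."""
    return {mapping.get(name, name): value for name, value in record.items()}


def string_strip(record):
    """Remove leading and trailing whitespace from all string values."""
    return {name: value.strip() if isinstance(value, str) else value
            for name, value in record.items()}


def consolidate_to_type(value, target_type, default=None):
    """Convert a value to the target type; empty or unconvertible gives default."""
    if value is None or value == "":
        return default
    try:
        return target_type(value)
    except (TypeError, ValueError):
        return default


# Hypothetical raw record with Slovak field names and untidy values
record = {"ico": " 00151742 ", "suma": "1250.50", "rok": "n/a"}
record = string_strip(field_map(record, {"suma": "amount", "rok": "year"}))
record["amount"] = consolidate_to_type(record["amount"], float, 0.0)
record["year"] = consolidate_to_type(record["year"], int)
print(record)
```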
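[Stream diagram: source → coalesce values → SQL table; coalesce values → data audit → threshold → formatted printer]

    # Node classes and Stream come from the Brewery package (imports not shown on the slide).
    # Nodes of the stream, keyed by name; "..." marks arguments elided on the slide.
    nodes = {
        "source": CSVSourceNode(...),
        "clean": CoalesceValueToTypeNode(),
        "output": DatabaseTableTargetNode(...),
        "audit": AuditNode(...),
        "threshold": ValueThresholdNode(),
        "print": FormattedPrinterNode()
    }

    # Connections as (from, to) pairs: cleaned data goes both to the
    # database table and to the audit branch.
    connections = [
        ("source", "clean"),
        ("clean", "output"),
        ("clean", "audit"),
        ("audit", "threshold"),
        ("threshold", "print")
    ]

    ... # configure nodes here

    stream = Stream(nodes, connections)
    stream.initialize()
    stream.run()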

Data Cleansing introduction (for BigClean Prague 2011)

  • 1.
    Data Cleansing What about quality? Stefan Urbanek stefan.urbanek@gmail.com @Stiivi March 2011
  • 2.
    Content ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  • 3.
  • 4.
    Brewery analyticaldata streams & Cubes online analytical processing github/bitbucket: Stiivi
  • 5.
  • 6.
    What is dataquality ?
  • 7.
    Dimensions ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  • 8.
  • 9.
    all none better 0% 25% 50% 75% 100% 20 05 -3 20 05 -5 20 05 -7 20 05 -9 20 05 -1 1 20 06 -1 20 06 -3 20 06 -5 20 06 -7 20 06 -9 20 06 -1 1 20 07 -1 20 07 -3 20 07 -5 20 07 -7 20 07 -9 20 07 -1 1 20 08 -1 20 08 -3 20 08 -5 20 08 -7 20 08 -9 20 08 -1 1 20 09 -1 successfully processed? 20 20 09 09 -3 -5 20 09 -7 20 09 -9 20 how many % of the field is filled and 09 -1 1 20 10 -1 20 10 -3 20 10 -5 20 10 -7 20 10 -9 Quality measure completeness: 55%
  • 10.
    type 1 type 2 +
  • 11.
    all none better 0% 25% 50% 75% 100% 20 05 -3 20 05 -5 20 05 -7 20 05 -9 20 05 -1 0 20 05 -1 2 20 06 -3 20 06 -5 20 06 -7 20 06 -9 20 06 -1 1 20 07 -1 20 07 -3 20 07 -5 20 07 -7 20 07 -9 20 07 -1 0 20 07 -1 2 20 08 -3 20 08 -5 20 08 -7 20 08 -9 20 08 -1 1 20 09 -1 successfully processed? 20 20 09 09 -3 -5 20 09 -7 20 09 -9 20 09 how many % of the field is filled and -1 1 20 10 -1 20 10 -3 20 10 -5 20 10 -7 20 10 -9 Quality measure completeness: 88%
  • 12.
    reconstruction: 5€ temperature: 32˚C accuracy
  • 13.
  • 14.
    Auto-measurable ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  • 15.
    What does thatmean: “high quality data?” ?
  • 16.
  • 17.
  • 18.
  • 19.
    Quality Measurement for accuracy and transparency
  • 20.
    ■ why tomeasure? ■ when to measure? ■ where to measure?
  • 21.
    from staging toanalytical data from source to staging data analytical model since 2009 description Download Parse Load source Cleanse Create cube staging clean data raw sources HTML files YAML files contracts table (staging) from source to staging data 2005-2008 REGIS (SK "unknown" fact table organisations) suppliers map dimension tables Load source Download Parse 08 08 YAML files raw sources 2005-2008 search index Pre-process Create search index One HTML per Large HTML files Procurement dimension tables search index (one per year) Document dimension index keep intermediate results for auditability
  • 22.
    from staging toanalytical data from source to staging data analytical model since 2009 description Download Parse Load source Cleanse Create cube staging clean data raw sources HTML files YAML files contracts table (staging) from source to staging data 2005-2008 REGIS (SK "unknown" fact table organisations) suppliers map dimension tables Load source Download Parse 08 08 YAML files raw sources 2005-2008 search index Pre-process Create search index One HTML per Large HTML files Procurement dimension tables search index (one per year) Document dimension index insert probes at appropriate places
  • 23.
    like unit testing: 1.write probes 2. set data quality indicators 3. pass data through
  • 24.
    SQL PostgreSQL yml database table YAML directory coalesce values {x:.2%} + 15.00% data audit threshold formatted printer
  • 25.
    field               nulls    status  distinct
    ------------------------------------------------------------
    file                0.00%    ok      100
    source_code         0.00%    ok      6
    year                0.00%    ok      6
    donor_code          0.00%    ok      2
    receiver_name       1.25%    fail    10363
    receiver_address    13.29%   fail    9979
    receiver_ico        13.53%   fail    5813
    project             0.01%    ok      28370
    program             0.00%    ok      29
    subprogram          11.60%   fail    177
    project_budget      14.48%   fail    9487
    requested_amount    88.73%   fail    1356
    received_amount     9.32%    fail    2179
    contract_number     13.29%   fail    28627
    contract_date       57.88%   fail    1425
    source_comment      99.93%   fail    9
    source_id           89.52%   fail    814
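A report of this shape could be produced roughly as follows. This is a sketch, not the code behind the slide: the function names are hypothetical, the 1% threshold is an assumption (the slide only shows that 0.01% passes and 1.25% fails), and whether the original distinct count includes empty values is not known.

```python
def audit(records, fields, max_null_ratio=0.01):
    """Return (field, null ratio, status, distinct count) for each field."""
    total = len(records)
    report = []
    for field in fields:
        values = [record.get(field) for record in records]
        nulls = sum(1 for value in values if value in (None, ""))
        ratio = nulls / total if total else 0.0
        # Assumption: empty values are not counted as a distinct value.
        distinct = len({value for value in values if value not in (None, "")})
        status = "ok" if ratio <= max_null_ratio else "fail"
        report.append((field, ratio, status, distinct))
    return report


def format_report(report):
    """Format the audit result as fixed-width text lines."""
    lines = ["{:<20}{:>8}  {:<6}{:>8}".format("field", "nulls", "status", "distinct")]
    for field, ratio, status, distinct in report:
        lines.append("{:<20}{:>8.2%}  {:<6}{:>8}".format(field, ratio, status, distinct))
    return lines


records = [
    {"year": 2009, "receiver_ico": "123"},
    {"year": 2010, "receiver_ico": None},
    {"year": 2009, "receiver_ico": ""},
    {"year": 2009, "receiver_ico": "123"},
]
report = audit(records, ["year", "receiver_ico"])
print("\n".join(format_report(report)))
```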
  • 26.
    E and T from ETL
    E as Extraction
  • 27.
  • 28.
    Ceci ne sont pas des données (“These are not data”)
  • 31.
    html > body > div id=#page > div id=#container > div id=#main > div id=#innerMain > div (anonymous) > div (anonymous) > table > tbody > tr > td > table > tbody > tr > td > table > tbody > tr > td > value √
  • 33.
    Now: you parse! 3 seconds *non-technical explanation follows
  • 34.
  • 35.
  • 36.
    <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 37.
  • 38.
  • 39.
    <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
    here is a subtitle and it should be in upper-case: o
    And here is another subtitle: dkaz na (non-breaking space) projekt
    much better:
    here is a label: Odkaz na projekt
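One way to recover the label from such markup is sketched below with Python's standard `html.parser`. It rests on assumptions that would need checking against the real pages: consecutive SPANs are treated as pieces of one label, whitespace at SPAN boundaries is treated as layout noise, and the closing `</SPAN>` of the second element (elided on the slide) is supplied for the example. `LabelParser` and `extract_label` are hypothetical names.

```python
from html.parser import HTMLParser


class LabelParser(HTMLParser):
    """Collect text inside SPAN elements, honouring an inline upper-case style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_span = False
        self.upper = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":  # the parser lower-cases tag names
            style = (dict(attrs).get("style") or "").lower()
            self.in_span = True
            self.upper = "text-transform: uppercase" in style

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
            self.upper = False

    def handle_data(self, data):
        if not self.in_span:
            return  # assumption: text between SPANs is layout whitespace
        text = data.replace("\xa0", " ").strip()
        self.parts.append(text.upper() if self.upper else text)


def extract_label(fragment):
    parser = LabelParser()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


fragment = (
    '<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> '
    "<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>"
)
label = extract_label(fragment)
```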
  • 40.
    “Structured” spreadsheets: error prone, more work needed
  • 41.
  • 42.
    (1) image & title
    (2) repeating groups of columns
    (3) padding rows/columns
    (4) removed redundancy for readability
    (5) colored cells
  • 43.
    (1) header with row padding
    (2) multi-row logical cell
    (3) broken pattern
  • 44.
    (1) multi-row cell
    (2) more values in a row
  • 45.
    why? “structured” file, then file format parser, then data extraction, to get raw data with fields id, item, class, amount
    why not? source that already provides raw data with fields id, item, class, amount
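Two of the problems listed above, padding rows and values removed for readability, can be undone with a small fill-down pass. The sketch below is an assumption about how such a step might look, with a hypothetical `normalize_rows` function and made-up sample values; it does not handle repeating column groups or colored cells.

```python
def normalize_rows(rows, fill_columns):
    """Drop padding rows and fill blank cells from the last value seen above."""
    last_seen = {}
    for row in rows:
        if all(cell in (None, "") for cell in row):
            continue  # padding row: nothing but empty cells
        row = list(row)
        for index in fill_columns:
            if row[index] in (None, ""):
                # redundancy was removed for readability: repeat the value above
                row[index] = last_seen.get(index)
            else:
                last_seen[index] = row[index]
        yield tuple(row)


sheet = [
    ("office", "pens", 10),
    ("", "", ""),
    ("", "paper", 5),
    ("travel", "tickets", 7),
]
raw = list(normalize_rows(sheet, fill_columns=[0]))
```

Here `raw` holds one complete row per item, with the class repeated on every row.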
  • 46.
    E and T from ETL
    T as Transformation
  • 47.
    Basic pattern (slightly more technical)
  • 48.
    [Diagram: source, lists and maps, target; the diff of source and target goes into the target]
  • 49.
    SELECT ... EXCEPT SELECT ... *in PostgreSQL, not in MySQL
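The diff step can be tried out with Python's built-in `sqlite3` module, since SQLite also supports `EXCEPT`. The table and column names below (`sta_source`, `map_known`, `name`) and the sample values are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sta_source (name TEXT);
    CREATE TABLE map_known (name TEXT);
    INSERT INTO sta_source VALUES ('Alfa'), ('Beta'), ('Gama'), ('Beta');
    INSERT INTO map_known VALUES ('Alfa'), ('Gama');
    """
)

# Rows present in the source but missing from the map; EXCEPT also removes duplicates.
query = """
    SELECT name FROM sta_source
    EXCEPT
    SELECT name FROM map_known
    ORDER BY name
"""
unknown = [row[0] for row in conn.execute(query)]
conn.close()
```

Where `EXCEPT` is not available, a `LEFT JOIN ... WHERE ... IS NULL` query is a common substitute, though it does not remove duplicates by itself.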
  • 50.
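    [Diagram: supplier matching in three numbered steps over the tables sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk (filter: Slovensko) and sta_suppliers: (1) unknown suppliers, (2) tmp_coalesced_suppliers_sk, (3) new suppliers]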
  • 51.
  • 52.
    Script or manual?
    script
    ■ recurrent processing (weekly, monthly, ...)
    ■ huge amount of data
    manual
    ■ one-time processing
    ■ small amount of data
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
    processing streams
    Data Sources: CSV file, Google Spreadsheet, Excel Spreadsheet, remote URL
    data stream processing
    Data Targets: relational database, report
  • 58.
    data rows, passed from data source to data target: (value, value, value, value) for the fields id, item, class, amount
    data records, passed from data source to data target: id: value, item: value, class: value, amount: value
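In plain Python terms, the difference between the two shapes could look like this: a row is a bare sequence of values whose meaning depends on a separate field list, while a record carries the field names with it. The sample values are made up.

```python
fields = ["id", "item", "class", "amount"]

# A data row: values only, in field order.
row = (1, "pens", "office", 12.5)

# A data record: the same values keyed by field name.
record = dict(zip(fields, row))

# Converting back needs the field list again to restore the order.
row_again = tuple(record[name] for name in fields)
```

Rows are compact and map directly to CSV lines or SQL result tuples; records are easier to handle when sources do not share a field order.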
  • 59.
    Sources: SQL query, CSV file, XLS file, MongoDB, Google spreadsheet, YAML directory, row list, record list
  • 60.
    Targets: SQL table, CSV file, MongoDB, YAML directory, HTML table, formatted printer ({x:.2%} gives 15.00%), row list, record list
  • 61.
    Record Operations: append, distinct, aggregate, merge (join), sample, select, set select, data audit, numerical statistics*
  • 62.
    Field Operations: field map, text substitute, value threshold*, derive*, string strip, consolidate value to type, histogram/bin*, set to flag*
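Three of these field operations are sketched below as plain functions over dictionary records, to show what each one does to a single record. They are hypothetical stand-ins for illustration, not the Brewery node implementations, and the sample field names and values are made up.

```python
def field_map(record, rename):
    """Rename fields according to the `rename` mapping; other fields are kept."""
    return {rename.get(key, key): value for key, value in record.items()}


def string_strip(record):
    """Strip leading and trailing whitespace from all string values."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def coalesce_to_type(record, field, to_type, default=None):
    """Convert a field to `to_type`, falling back to `default` when that fails."""
    result = dict(record)
    try:
        result[field] = to_type(record[field])
    except (KeyError, TypeError, ValueError):
        result[field] = default
    return result


raw = {"ICO": " 00151742 ", "suma": " 12.50 "}
clean = field_map(raw, {"ICO": "ico", "suma": "amount"})
clean = string_strip(clean)
clean = coalesce_to_type(clean, "amount", float, default=0.0)
```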
  • 63.
  • 64.
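    # Stream and the node classes come from Brewery; the import lines are not shown on the slide.
    # Nodes of the stream, keyed by name (constructor arguments are elided on the slide).
    nodes = {
        "source": CSVSourceNode(...),
        "clean": CoalesceValueToTypeNode(),
        "output": DatabaseTableTargetNode(...),
        "audit": AuditNode(...),
        "threshold": ValueThresholdNode(),
        "print": FormattedPrinterNode()
    }

    # Connections as (from, to) pairs: "clean" feeds both the database table
    # and the audit branch that ends in the formatted printer.
    connections = [
        ("source", "clean"),
        ("clean", "output"),
        ("clean", "audit"),
        ("audit", "threshold"),
        ("threshold", "print")
    ]

    ... # configure nodes here

    stream = Stream(nodes, connections)
    stream.initialize()
    stream.run()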