Data Cleansing
What about quality?

Stefan Urbanek
stefan.urbanek@gmail.com
@Stiivi
March 2011
Content

■   Introduction
■   What is data quality?
■   E and T from ETL
■   Summary
http://vestnik.transparency.sk
Brewery: analytical data streams
&
Cubes: online analytical processing

github/bitbucket: Stiivi
Quality
What is data quality?
Dimensions
■   completeness – data provided
■   accuracy – reflecting real world
■   credibility – regarded as true
■   timeliness – up-to-date
■   consistency – matching facts across datasets
■   integrity – valid references between datasets
completeness
how many % of the field is filled and successfully processed?
[chart: completeness over time, 2005-3 to 2010-9; scale from none (0%) to all (100%), higher is better]
Quality measure
completeness: 55%
type 1       type 2


         +
how many % of the field is filled and successfully processed?
[chart: completeness over time, 2005-3 to 2010-9; scale from none (0%) to all (100%), higher is better]
Quality measure
completeness: 88%
reconstruction: 5€

                     temperature: 32˚C

             accuracy
timeliness
Auto-measurable
■   completeness – easily
■   accuracy – somehow
■   credibility – not-so
■   timeliness – easily
■   consistency – yes
■   integrity – yes
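Two of the automatically measurable dimensions can be sketched in a few lines of Python. This is only an illustration: the function names, the sample keys, and the dates are assumptions made for the example, not part of the original talk.

```python
from datetime import date

def integrity(references, valid_keys):
    """Share of non-empty references that point to an existing key."""
    present = [ref for ref in references if ref is not None]
    if not present:
        return 1.0
    valid = set(valid_keys)
    return sum(ref in valid for ref in present) / len(present)

def age_in_days(latest_record_date, today):
    """Timeliness proxy: number of days since the newest record."""
    return (today - latest_record_date).days

# Hypothetical data: supplier references checked against a register.
supplier_refs = ["001", "002", None, "999"]
known_organisations = ["001", "002", "003"]

integrity_ratio = integrity(supplier_refs, known_organisations)
data_age = age_in_days(date(2011, 2, 1), date(2011, 3, 1))
```

Here two of the three non-empty references resolve, so the integrity ratio is 2/3. Accuracy and credibility have no comparably simple check, which matches the "somehow" and "not-so" ratings above.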
What does that mean:
“high quality data”?
85%
appropriate for given
     purpose
attach quality report
Quality Measurement
   for accuracy and transparency
■ why measure?
■ when to measure?
■ where to measure?
[diagram: data processing pipeline]
from source to staging data (since 2009): raw sources → Download → HTML files → Parse → YAML files → Load source → contracts table (staging)
from source to staging data (2005-2008): raw sources 2005-2008 → Download → Large HTML files (one per year) → Pre-process → One HTML per Procurement Document → Parse → YAML files → Load source
from staging to analytical data: Cleanse (with REGIS (SK organisations) and "unknown" suppliers map) → staging clean data → Create cube (with analytical model description) → fact table, dimension tables → Create search index → search index, dimension index

keep intermediate results for auditability
[diagram: the pipeline from raw sources through staging data to analytical data, shown again]

insert probes at appropriate places
like unit testing:

1. write probes
2. set data quality indicators
3. pass data through
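The three steps above can be sketched as a pass-through probe in Python. This is a minimal illustration, not Brewery's API: the class name `CompletenessProbe`, the field `ico`, and the 0.85 minimum are assumptions chosen for the example.

```python
class CompletenessProbe:
    """Counts filled values of one field while records flow through."""

    def __init__(self, field, minimum):
        self.field = field
        self.minimum = minimum  # data quality indicator: required share
        self.total = 0
        self.filled = 0

    def watch(self, records):
        """Yield every record unchanged, counting as a side effect."""
        for record in records:
            self.total += 1
            if record.get(self.field) not in (None, ""):
                self.filled += 1
            yield record

    @property
    def completeness(self):
        return self.filled / self.total if self.total else 0.0

    def passed(self):
        return self.completeness >= self.minimum

# 1. write the probe, 2. set the indicator, 3. pass data through
records = [{"ico": "123"}, {"ico": ""}, {"ico": "456"}, {"ico": None}]
probe = CompletenessProbe("ico", minimum=0.85)
loaded = list(probe.watch(records))
```

Because the probe only observes the records, it can be placed between any two pipeline steps without changing what the next step receives.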
[diagram: YAML directory → coalesce values → PostgreSQL database table; coalesce values → data audit → threshold → formatted printer]
field                            nulls     status   distinct
------------------------------------------------------------
file                             0.00%         ok        100
source_code                      0.00%         ok          6
year                             0.00%         ok          6
donor_code                       0.00%         ok          2
receiver_name                    1.25%       fail      10363
receiver_address                13.29%       fail       9979
receiver_ico                    13.53%       fail       5813
project                          0.01%         ok      28370
program                          0.00%         ok         29
subprogram                      11.60%       fail        177
project_budget                  14.48%       fail       9487
requested_amount                88.73%       fail       1356
received_amount                  9.32%       fail       2179
contract_number                 13.29%       fail      28627
contract_date                   57.88%       fail       1425
source_comment                  99.93%       fail          9
source_id                       89.52%       fail        814
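A report of this shape can be produced with a short audit function. The sketch below is an assumption-laden illustration: the 1% threshold, the choice to exclude nulls from the distinct count, and the sample records are all hypothetical and not taken from the original tool.

```python
def audit(records, fields, threshold=0.01):
    """Return (field, null ratio, status, distinct count) per field."""
    report = []
    total = len(records)
    for field in fields:
        values = [record.get(field) for record in records]
        nulls = sum(value is None for value in values)
        ratio = nulls / total if total else 0.0
        # Assumption: nulls are not counted as a distinct value.
        distinct = len({value for value in values if value is not None})
        status = "ok" if ratio <= threshold else "fail"
        report.append((field, ratio, status, distinct))
    return report

def format_report(report):
    """Render the audit result as a fixed-width text table."""
    lines = ["{:<24}{:>10}{:>10}{:>10}".format(
        "field", "nulls", "status", "distinct")]
    lines.append("-" * 54)
    for field, ratio, status, distinct in report:
        lines.append("{:<24}{:>10.2%}{:>10}{:>10}".format(
            field, ratio, status, distinct))
    return "\n".join(lines)

# Hypothetical sample records.
records = [
    {"year": 2009, "contract_date": None},
    {"year": 2009, "contract_date": "2009-05-01"},
    {"year": 2010, "contract_date": None},
    {"year": 2010, "contract_date": None},
]
report = audit(records, ["year", "contract_date"])
text = format_report(report)
```

With this sample, `year` has no nulls and passes, while `contract_date` is 75% null and fails.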
E and T from ETL
     E as Extraction
HTML Documents
Ceci ne sont pas des données
html
 body
  div id=#page
   div id=#page
    div id=#container
     div id=#main
      div id=#innerMain
       div (anonymous)
        div (anonymous)
         table tbody tr td
          table tbody tr td
           table tbody tr td
            table tr td: value √
Now: you parse!
       3 seconds




   *non-technical explanation follows
<SPAN class=podnazov>More information
</SPAN>
?
<SPAN class=podnazov
  style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na&nbsp;projekt
...

here is a subtitle
and it should be in upper-case:
o
And here is another subtitle:
dkaz na (non-breaking space) projekt

                                              much better

               here is a label: Odkaz na projekt
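Recovering the label from fragments like these can be sketched with Python's standard `html.parser`. This is only an illustration: the class name `LabelExtractor` and the rule of dropping whitespace between fragments are assumptions made for this snippet, not the parser used in the project.

```python
from html.parser import HTMLParser

class LabelExtractor(HTMLParser):
    """Joins span fragments, honouring an inline uppercase style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.upper_stack = []  # one flag per currently open span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower()
            self.upper_stack.append("text-transform: uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self.upper_stack:
            self.upper_stack.pop()

    def handle_data(self, data):
        # &nbsp; arrives as U+00A0; treat it as a plain space.
        text = data.replace("\xa0", " ").strip()
        if not text:
            return
        if self.upper_stack and self.upper_stack[-1]:
            text = text.upper()
        self.parts.append(text)

    def label(self):
        return "".join(self.parts)

snippet = (
    '<SPAN class=podnazov\n'
    '  style="TEXT-TRANSFORM: uppercase">o\n'
    '</SPAN>\n'
    '<SPAN class=podnazov>dkaz na&nbsp;projekt\n'
    '</SPAN>'
)
parser = LabelExtractor()
parser.feed(snippet)
parser.close()
label = parser.label()
```

The point of the slide still holds: the meaning ("Odkaz na projekt") has to be reconstructed from presentation markup, so every such rule is a guess about the page author's intent.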
“Structured”
spreadsheets


          error prone
          more work needed
✓ structured file format
[spreadsheet screenshot with numbered annotations]
    (1) image & title
    (2) repeating groups of columns
    (3) padding rows/columns
    (4) removed redundancy for readability
    (5) colored cells
[spreadsheet screenshot with numbered annotations]
(1) header with row padding
(2) multi-row logical cell
(3) broken pattern
[spreadsheet screenshot with numbered annotations]
(1) multi-row cell
(2) more values in a row
why?
[diagram: source (“structured” file) → file format parser → data extraction → raw data with fields id, item, class, amount]
why not?
E and T from ETL
    T as Transformation
Basic pattern
 slightly more technical
[diagram labels: source, lists and maps, +, target, diff, target]
SELECT ...
EXCEPT
SELECT ...

      *in PostgreSQL, not in MySQL
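The diff pattern with `EXCEPT` can be tried out without a database server, since SQLite also supports it. The sketch below uses Python's standard `sqlite3` module; the table names `staging_suppliers` and `known_suppliers` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_suppliers (name TEXT);
    CREATE TABLE known_suppliers (name TEXT);
    INSERT INTO staging_suppliers VALUES ('Alfa'), ('Beta'), ('Gama');
    INSERT INTO known_suppliers VALUES ('Alfa'), ('Gama');
""")

# Rows present in staging but missing from the known list.
rows = conn.execute("""
    SELECT name FROM staging_suppliers
    EXCEPT
    SELECT name FROM known_suppliers
    ORDER BY name
""").fetchall()

new_suppliers = [row[0] for row in rows]
conn.close()
```

`EXCEPT` compares whole rows and removes duplicates, so both sides should select the same columns in the same order.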
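[diagram: supplier matching in three numbered steps, using difference (-) and append (+) operations]
tables: sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk, sta_suppliers
(1) unknown suppliers
(2) tmp_coalesced_suppliers_sk (label: ? Slovensko)
(3) new suppliers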
Script or manual?

script
■ recurrent processing (weekly, monthly, ...)
■ huge amount of data

manual
■ one-time processing
■ small amount of data
appropriate tool
 for given task
balance
[diagram: the pipeline from raw sources through staging data to analytical data, shown again]
Brewery
 data streams
[diagram: Data Sources (CSV file, Google Spreadsheet, remote Excel Spreadsheet via URL) → data stream processing → Data Targets (relational database, report)]
processing streams
[diagram: a data source passes data rows to a data target; each data row is a sequence of values]
[diagram: a data source passes data records to a data target; each data record maps the fields id, item, class, amount to values]
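The difference between a row and a record can be illustrated with plain Python structures. This is a sketch under the assumption that a row is an ordered list of values and a record is a field-to-value mapping; the helper names and sample values are hypothetical.

```python
fields = ["id", "item", "class", "amount"]

def row_to_record(row, fields):
    """Attach field names to the positional values of a row."""
    return dict(zip(fields, row))

def record_to_row(record, fields):
    """Flatten a record back into positional values, in field order."""
    return [record.get(name) for name in fields]

row = [1, "paper", "office", 120.5]
record = row_to_record(row, fields)
round_trip = record_to_row(record, fields)
```

Rows are compact and fast to pass along, while records are self-describing, which helps when a step needs only some of the fields.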
Sources
■ CSV file
■ XLS file
■ SQL query
■ mongo DB
■ Google spreadsheet
■ YAML directory
■ row list
■ record list
Targets
■ CSV file
■ SQL table
■ mongo DB
■ YAML directory
■ HTML table
■ formatted printer
■ row list
■ record list
Record Operations
■ append
■ distinct
■ aggregate
■ merge (join)
■ sample
■ select
■ set select
■ data audit
■ numerical statistics*
Field Operations
■ field map
■ text substitute
■ value threshold*
■ derive*
■ string strip
■ consolidate value to type
■ histogram/bin*
■ set to flag*
nodes = {
    "source": CSVSourceNode(...),
    "clean": CoalesceValueToTypeNode(),
    "output": DatabaseTableTargetNode(...),
    "audit": AuditNode(...),
    "threshold": ValueThresholdNode(),
    "print": FormattedPrinterNode()
}

connections = [
    ("source", "clean"),
    ("clean", "output"),
    ("clean", "audit"),
    ("audit", "threshold"),
    ("threshold", "print")
]

... # configure nodes here

stream = Stream(nodes, connections)
stream.initialize()
stream.run()
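To show what the nodes-and-connections idea amounts to, here is a toy runner in plain Python. It is not Brewery's implementation: `run_stream`, the lambda nodes, and the 25% threshold are assumptions for illustration, and the sketch handles only streams where every node has a single input.

```python
from collections import defaultdict

def run_stream(nodes, connections, source_name):
    """Push data from the source through all connected nodes."""
    outputs = defaultdict(list)
    for src, dst in connections:
        outputs[src].append(dst)

    results = {}

    def push(name, data):
        produced = nodes[name](data)
        results[name] = produced
        for target in outputs[name]:
            push(target, produced)

    push(source_name, None)
    return results

# Hypothetical nodes: each is a function from input data to output data.
nodes = {
    "source": lambda _: [{"amount": " 10 "}, {"amount": ""}, {"amount": "7"}],
    "clean": lambda recs: [
        {"amount": int(r["amount"]) if r["amount"].strip() else None}
        for r in recs
    ],
    "audit": lambda recs: {
        "amount": sum(r["amount"] is None for r in recs) / len(recs)
    },
    "threshold": lambda stats: {
        field: ("fail" if ratio > 0.25 else "ok")
        for field, ratio in stats.items()
    },
}

connections = [
    ("source", "clean"),
    ("clean", "audit"),
    ("audit", "threshold"),
]

results = run_stream(nodes, connections, "source")
```

Keeping the wiring in a separate `connections` list means a probe such as the audit branch can be added or removed without touching the nodes themselves.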

Data Cleansing introduction (for BigClean Prague 2011)

Wellness & Consumer Driven Health Care
 
Wellness &amp; Consumer Driven Health Care
Wellness &amp; Consumer Driven Health CareWellness &amp; Consumer Driven Health Care
Wellness &amp; Consumer Driven Health Care
 
AEFI Dhamija
AEFI DhamijaAEFI Dhamija
AEFI Dhamija
 
Dr. Elwynn Taylor - Weather Outlook
Dr. Elwynn Taylor - Weather OutlookDr. Elwynn Taylor - Weather Outlook
Dr. Elwynn Taylor - Weather Outlook
 
22.02, Group 5 — Concept of sustainable development in built environment
22.02, Group 5 — Concept of sustainable development in built environment22.02, Group 5 — Concept of sustainable development in built environment
22.02, Group 5 — Concept of sustainable development in built environment
 
Estonia Overview - Andrus Viirg - Stanford - Jan 25 2010
Estonia Overview - Andrus Viirg - Stanford - Jan 25 2010Estonia Overview - Andrus Viirg - Stanford - Jan 25 2010
Estonia Overview - Andrus Viirg - Stanford - Jan 25 2010
 
The BER's business tendency surveys in South Africa: method and results
The BER's business tendency surveys in South Africa: method and resultsThe BER's business tendency surveys in South Africa: method and results
The BER's business tendency surveys in South Africa: method and results
 

Data Cleansing introduction (for BigClean Prague 2011)

  • 1. Data Cleansing What about quality? Stefan Urbanek stefan.urbanek@gmail.com @Stiivi March 2011
  • 2. Content ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  • 4. Brewery analytical data streams & Cubes online analytical processing github/bitbucket: Stiivi
  • 6. What is data quality ?
  • 7. Dimensions ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  • 9. [Chart: completeness per month, 2005-3 to 2010-9, on a 0% to 100% scale running from "none" to "all"; higher is better.] How many % of the field is filled and successfully processed? Quality measure completeness: 55%
  • 10. type 1 type 2 +
  • 11. [Chart: completeness per month, 2005-3 to 2010-9, on a 0% to 100% scale running from "none" to "all"; higher is better.] How many % of the field is filled and successfully processed? Quality measure completeness: 88%
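The completeness measure on slides 9 and 11 ("how many % of the field is filled and successfully processed") can be sketched as below. This is a minimal illustration, not the code behind the charts: the function name `completeness`, the `is_valid` check and the sample dates are all assumptions made for the example.

```python
def completeness(values, is_valid=lambda v: True):
    """Share of values that are filled and pass the validity check (0.0 to 1.0)."""
    values = list(values)
    if not values:
        return 0.0
    filled = sum(
        1 for v in values
        if v is not None and str(v).strip() != "" and is_valid(v)
    )
    return filled / len(values)


def looks_like_iso_date(value):
    """Hypothetical 'successfully processed' check: YYYY-MM-DD made of digits."""
    parts = value.split("-")
    return len(parts) == 3 and all(p.isdigit() for p in parts)


# Hypothetical field: contract dates as strings, some missing or unparseable.
dates = ["2009-01-15", "", None, "2009-03-02", "n/a", "2009-05-30", "  ", "2009-07-11"]

ratio = completeness(dates, looks_like_iso_date)
print(f"completeness: {ratio:.0%}")
```

Computing the same ratio per month would give one point per tick of the charts above.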
  • 12. reconstruction: 5€ temperature: 32˚C accuracy
  • 14. Auto-measurable ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  • 15. What does that mean: “high quality data?” ?
  • 16. 85%
  • 19. Quality Measurement for accuracy and transparency
  • 20. ■ why to measure? ■ when to measure? ■ where to measure?
  • 21. [Diagram: processing pipeline in three parts. (a) From source to staging data, since 2009: Download, Parse, Load source, over raw sources, HTML files, YAML files and the contracts table (staging). (b) From source to staging data, 2005-2008: Download, Pre-process, Parse, Load source, over raw sources, large HTML files (one per year), one HTML per procurement document, and YAML files. (c) From staging to analytical data: Cleanse, Create cube, Create search index, over staging clean data, REGIS (SK organisations), the suppliers map, "unknown" entries, and the analytical model with its description, fact table, dimension tables, search index and dimension index.] keep intermediate results for auditability
  • 22. [Diagram: the same processing pipeline as on slide 21.] insert probes at appropriate places
  • 23. like unit testing: 1. write probes 2. set data quality indicators 3. pass data through
  • 24. [Diagram: a stream built from a YAML directory source, a "coalesce values" node, a database table target (PostgreSQL), a data audit node, a threshold node, and a formatted printer ({x:.2%} rendering as 15.00%).]
  • 25. Data audit output:
        field               nulls   status  distinct
        --------------------------------------------
        file                0.00%   ok           100
        source_code         0.00%   ok             6
        year                0.00%   ok             6
        donor_code          0.00%   ok             2
        receiver_name       1.25%   fail       10363
        receiver_address   13.29%   fail        9979
        receiver_ico       13.53%   fail        5813
        project             0.01%   ok         28370
        program             0.00%   ok            29
        subprogram         11.60%   fail         177
        project_budget     14.48%   fail        9487
        requested_amount   88.73%   fail        1356
        received_amount     9.32%   fail        2179
        contract_number    13.29%   fail       28627
        contract_date      57.88%   fail        1425
        source_comment     99.93%   fail           9
        source_id          89.52%   fail         814
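A probe that produces a report of this shape could look roughly like the sketch below. It is not the Brewery audit node: the function names, the sample records and the 1% threshold are assumptions (the slide does not state the threshold, only that 0.01% passes and 1.25% fails), and whether the original counts nulls among the distinct values is unknown.

```python
def audit(records, fields, threshold):
    """Return one (field, null_ratio, status, distinct_count) tuple per field."""
    records = list(records)
    total = len(records)
    result = []
    for field in fields:
        values = [rec.get(field) for rec in records]
        nulls = sum(1 for v in values if v is None or v == "")
        # Assumption: empty values are not counted as a distinct value.
        distinct = len({v for v in values if v is not None and v != ""})
        ratio = nulls / total if total else 0.0
        status = "ok" if ratio <= threshold else "fail"
        result.append((field, ratio, status, distinct))
    return result


def print_audit(rows):
    """Print the audit in a fixed-width layout similar to the slide."""
    print(f"{'field':<20}{'nulls':>8}  {'status':<6}{'distinct':>9}")
    print("-" * 45)
    for field, ratio, status, distinct in rows:
        print(f"{field:<20}{ratio:>8.2%}  {status:<6}{distinct:>9}")


# Hypothetical sample records; field names borrowed from the audit above.
records = [
    {"year": 2009, "receiver_name": "A s.r.o.", "contract_date": "2009-02-01"},
    {"year": 2009, "receiver_name": "B a.s.", "contract_date": None},
    {"year": 2010, "receiver_name": "", "contract_date": None},
    {"year": 2010, "receiver_name": "A s.r.o.", "contract_date": "2010-06-30"},
]
report = audit(records, ["year", "receiver_name", "contract_date"], threshold=0.01)
print_audit(report)
```

As with unit tests, the threshold plays the role of the assertion: the data passes through either way, and the status column records whether the indicator held.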
  • 26. E and T from ETL E as Extraction
  • 28. Ceci ne sont pas des données (French: "These are not data")
  • 29.
  • 30.
  • 31. html body div id=#page div id=#page div id=#container div id=#main div id=#innerMain div (anonymous) div (anonymous) table tbody tr td table tbody tr td table tbody tr td table tr td value √
  • 32.
  • 33. Now: you parse! 3 seconds *non-technical explanation follows
  • 35. ?
  • 36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 37. ?
  • 38. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 39. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
        Read literally: "here is a subtitle and it should be in upper-case: o. And here is another subtitle: dkaz na (non-breaking space) projekt".
        Much better: "here is a label: Odkaz na projekt" (Slovak for "Link to project").
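Recovering the intended label from markup like this can be sketched with Python's standard `html.parser`. This is an illustration under assumptions: the snippet is cut off in the slide, so the closing `</SPAN>` is added here, the class name `LabelExtractor` is made up, and whitespace around the tags is treated as layout noise so that the pieces join into one word.

```python
from html.parser import HTMLParser


class LabelExtractor(HTMLParser):
    """Collect text from <span> tags, applying an inline uppercase text-transform."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._upper_stack = []  # one flag per open <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower()
            self._upper_stack.append("text-transform: uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self._upper_stack:
            self._upper_stack.pop()

    def handle_data(self, data):
        # Non-breaking spaces become ordinary spaces; surrounding whitespace
        # is dropped (an assumption about this particular source).
        text = data.replace("\xa0", " ").strip()
        if not text:
            return
        if self._upper_stack and self._upper_stack[-1]:
            text = text.upper()
        self.parts.append(text)


# Snippet from the slide, with a closing tag added for the example.
snippet = (
    '<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> '
    "<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>"
)
parser = LabelExtractor()
parser.feed(snippet)
parser.close()
label = "".join(parser.parts)
print(label)  # prints: Odkaz na projekt
```

The point of the slide stands: presentation markup has to be interpreted before the value it carries can be extracted.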
  • 40. “Structured” spreadsheets error prone more work needed
  • 42. [Spreadsheet screenshot with numbered callouts] (1) image & title (2) repeating groups of columns (3) padding rows/columns (4) removed redundancy for readability (5) colored cells
  • 43. [Spreadsheet screenshot with numbered callouts] (1) header with row padding (2) multi-row logical cell (3) broken pattern
  • 44. [Spreadsheet screenshot with numbered callouts] (1) multi-row cell (2) more values in a row
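Two of the spreadsheet problems listed above, padding rows and values left blank "for readability", can be undone mechanically. The sketch below is a hypothetical helper (`clean_sheet` and the sample sheet are not from the talk) that drops empty rows and fills blank cells in chosen columns from the row above.

```python
def clean_sheet(rows, fill_down_columns):
    """Drop empty padding rows and fill blank cells from the row above."""
    cleaned = []
    previous = {}  # last non-blank value seen per column index
    for row in rows:
        cells = [(c.strip() if isinstance(c, str) else c) for c in row]
        if all(c in ("", None) for c in cells):
            continue  # padding row
        for index in fill_down_columns:
            if cells[index] in ("", None):
                cells[index] = previous.get(index)
            else:
                previous[index] = cells[index]
        cleaned.append(cells)
    return cleaned


# Hypothetical sheet: the class is written once per group, then left blank.
sheet = [
    ["class", "item", "amount"],
    ["", "", ""],
    ["A", "paper", 10],
    ["", "pens", 4],
    ["", "", ""],
    ["B", "chairs", 250],
    ["", "desks", 900],
]
header, *body = clean_sheet(sheet, fill_down_columns=[0])
records = [dict(zip(header, row)) for row in body]
print(records)
```

Repeating column groups, multi-row cells and colored cells need sheet-specific handling, which is the "more work needed" part.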
  • 45. why? why not? [Diagram labels: source, file format parser, data extraction, "structured" file, raw data; columns id, item, class, amount]
  • 46. E and T from ETL T as Transformation
  • 47. Basic pattern slightly more technical
  • 48. source lists and maps ? + target ? diff ? target
  • 49. SELECT ... EXCEPT SELECT ... *in PostgreSQL, not in MySQL
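The diff pattern behind `SELECT ... EXCEPT SELECT ...` can be tried out with SQLite, which ships with Python and also supports `EXCEPT`. The slide refers to PostgreSQL; the table and column names below are made up for the example. The slide's remark about MySQL dates from 2011; MySQL gained `EXCEPT` later, in release 8.0.31.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_suppliers (name TEXT);
    CREATE TABLE target_suppliers (name TEXT);
    INSERT INTO source_suppliers VALUES ('Alfa'), ('Beta'), ('Gama');
    INSERT INTO target_suppliers VALUES ('Alfa'), ('Gama');
    """
)

# Rows present in the source but missing from the target: the "diff".
new_suppliers = [
    name
    for (name,) in conn.execute(
        """
        SELECT name FROM source_suppliers
        EXCEPT
        SELECT name FROM target_suppliers
        ORDER BY name
        """
    )
]
conn.close()
print(new_suppliers)  # prints: ['Beta']
```

Where `EXCEPT` is not available, a `LEFT JOIN ... WHERE target.name IS NULL` gives a similar result, apart from `EXCEPT` also removing duplicate rows.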
  • 50. sta_vvo_vysledky sta_regis - - map_suppliers 1 unknown suppliers ? Slovensko + 2 + tmp_coalesced_suppliers_sk - sta_suppliers + 3 new suppliers
  • 52. Script or manual? script ■ recurrent processing (weekly, monthly, ...) ■ huge amount of data manual ■ one-time processing ■ small amount of data
  • 53. appropriate tool for given task
  • 55. [Diagram: the same processing pipeline as on slide 21.]
  • 57. [Diagram: data stream processing.] Data Sources: CSV file, Google Spreadsheet, Excel Spreadsheet, remote URL. Data Targets: relational database, report. In between: processing streams
  • 58. [Diagram: two ways of passing data from a data source to a data target.] Data rows: each row is a list of values (value, value, value, value) whose positions correspond to the fields id, item, class, amount. Data records: each record pairs every field with its value (id: value, item: value, class: value, amount: value)
  • 59. Sources: CSV file, XLS file, SQL query, mongo DB, Google spreadsheet, YAML directory, row list, record list
  • 60. Targets: CSV file, SQL table, mongo DB, YAML directory, HTML table, formatted printer ({x:.2%} rendering as 15.00%), row list, record list
  • 61. Record Operations: append, distinct, aggregate, merge (join), sample, select, set select, data audit, numerical statistics*
  • 62. Field Operations: field map, text substitute, value threshold*, derive*, string strip, consolidate value to type, histogram/bin*, set to flag*
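The difference between rows and records on slide 58 comes down to where the field names live. A minimal sketch, using plain Python lists and dicts rather than any Brewery class; the helper names and the sample values are assumptions:

```python
fields = ["id", "item", "class", "amount"]


def rows_to_records(field_names, rows):
    """Turn positional rows into field-name-to-value records."""
    for row in rows:
        yield dict(zip(field_names, row))


def records_to_rows(field_names, records):
    """Turn records back into rows, in the given field order."""
    for record in records:
        yield [record.get(name) for name in field_names]


rows = [[1, "paper", "A", 10], [2, "chairs", "B", 250]]
records = list(rows_to_records(fields, rows))
round_trip = list(records_to_rows(fields, records))
print(records[0])
```

Rows are compact and depend on a shared field list; records are self-describing and easier to handle when fields are added, dropped or reordered along the way.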
  • 63. + SQL ? <html> SQL
  • 64.
        nodes = {
            "source": CSVSourceNode(...),
            "clean": CoalesceValueToTypeNode(),
            "output": DatabaseTableTargetNode(...),
            "audit": AuditNode(...),
            "threshold": ValueThresholdNode(),
            "print": FormattedPrinterNode()
        }

        connections = [
            ("source", "clean"),
            ("clean", "output"),
            ("clean", "audit"),
            ("audit", "threshold"),
            ("threshold", "print")
        ]

        ... # configure nodes here

        stream = Stream(nodes, connections)
        stream.initialize()
        stream.run()
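To show what wiring nodes with connections amounts to, here is a self-contained toy version of the same graph. It is not Brewery's `Stream` and uses none of its classes: `run_stream`, the node functions, the sample data and the 15% threshold are all assumptions for the sketch. Each node is a function from a list of records to a list of records, and nodes run once all of their upstream nodes have finished.

```python
from collections import defaultdict, deque


def run_stream(nodes, connections):
    """Run each node once, feeding it the combined output of its upstream nodes."""
    downstream = defaultdict(list)
    pending = {name: 0 for name in nodes}  # unfinished upstream nodes per node
    for src, dst in connections:
        downstream[src].append(dst)
        pending[dst] += 1

    inputs = defaultdict(list)
    outputs = {}
    ready = deque(name for name, count in pending.items() if count == 0)
    while ready:
        name = ready.popleft()
        outputs[name] = list(nodes[name](inputs[name]))
        for dst in downstream[name]:
            inputs[dst].extend(outputs[name])
            pending[dst] -= 1
            if pending[dst] == 0:
                ready.append(dst)
    return outputs


stored = []  # stands in for a database table


def source(_):
    return [{"amount": "10"}, {"amount": ""}, {"amount": "7"}]


def clean(records):
    # Coalesce empty strings to None and cast the rest to int.
    return [{"amount": int(r["amount"]) if r["amount"] else None} for r in records]


def output(records):
    stored.extend(records)
    return records


def audit(records):
    nulls = sum(1 for r in records if r["amount"] is None)
    return [{"field": "amount", "nulls": nulls / len(records) if records else 0.0}]


def threshold(records):
    return [dict(r, status="ok" if r["nulls"] <= 0.15 else "fail") for r in records]


def printer(records):
    for r in records:
        print(f"{r['field']}  {r['nulls']:.2%}  {r['status']}")
    return records


nodes = {"source": source, "clean": clean, "output": output,
         "audit": audit, "threshold": threshold, "print": printer}
connections = [("source", "clean"), ("clean", "output"), ("clean", "audit"),
               ("audit", "threshold"), ("threshold", "print")]
results = run_stream(nodes, connections)
```

The toy version materializes every intermediate list, which is wasteful for large data but matches the earlier advice to keep intermediate results for auditability.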
