Data Cleansing
What about quality?

Stefan Urbanek
stefan.urbanek@gmail.com
@Stiivi
March 2011
Content

■   Introduction
■   What is data quality?
■   E and T from ETL
■   Summary
http://vestnik.transparency.sk
Brewery: analytical data streams
&
Cubes: online analytical processing

github/bitbucket: Stiivi
Quality
What is data quality?
Dimensions
■   completeness – data provided
■   accuracy – reflecting real world
■   credibility – regarded as true
■   timeliness – up-to-date
■   consistency – matching facts across datasets
■   integrity – valid references between datasets
completeness
[Chart: completeness over time, 2005-3 to 2010-9; vertical axis from none (0%) to all (100%), higher is better]
how many % of the field is filled and successfully processed?
Quality measure: completeness 55%
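The completeness measure on this slide (the share of a field that is filled and successfully processed) can be sketched in a few lines of Python. This is only an illustration: the function names, the sample values and the "is it a number" check are assumptions, not part of the talk.

```python
def completeness(values, is_valid=lambda value: True):
    """Share of values that are filled and pass the validity check."""
    values = list(values)
    if not values:
        return 0.0
    good = sum(1 for v in values if v not in (None, "") and is_valid(v))
    return good / len(values)


def is_number(text):
    """Hypothetical 'successfully processed' check: the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Made-up sample field: two usable amounts out of five values.
amounts = ["120.50", "", None, "n/a", "99"]
ratio = completeness(amounts, is_number)
print(f"completeness: {ratio:.0%}")
```

Computing the same ratio per month of the source data would give one point of the chart per period.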
type 1 + type 2
[Chart: completeness over time, 2005 to 2010; vertical axis from none (0%) to all (100%), higher is better]
how many % of the field is filled and successfully processed?
Quality measure: completeness 88%
accuracy
reconstruction: 5€
temperature: 32°C
timeliness
Auto-measurable
■   completeness – easily
■   accuracy – to some extent
■   credibility – not so easily
■   timeliness – easily
■   consistency – yes
■   integrity – yes
What does that mean: “high quality data”?
85%
appropriate for the given purpose
attach a quality report
Quality Measurement
for accuracy and transparency
■ why measure?
■ when to measure?
■ where to measure?
[Diagram: data processing pipeline]
from source to staging data, since 2009: raw sources → Download → HTML files → Parse → YAML files → Load source → contracts table (staging)
from source to staging data, 2005-2008: raw sources 2005-2008 → Download → Large HTML files (one per year) → Pre-process → One HTML per Procurement Document → Parse → YAML files → Load source
from staging to analytical data: contracts table (staging) → Cleanse, with REGIS (SK organisations) and "unknown" suppliers map → staging clean data → Create cube, with analytical model description → fact table, dimension tables → Create search index → search index, dimension index
keep intermediate results for auditability
                   insert probes at appropriate places
like unit testing:

1. write probes
2. set data quality indicators
3. pass data through
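The three steps above can be sketched as a small pass-through probe. This is a minimal illustration under assumptions: the `NullProbe` class, the `supplier` field and the 25% limit are hypothetical and not taken from the talk or from Brewery.

```python
class NullProbe:
    """Counts empty values of one field while records pass through unchanged."""

    def __init__(self, field, max_null_ratio):
        self.field = field
        self.max_null_ratio = max_null_ratio  # the data quality indicator
        self.total = 0
        self.nulls = 0

    def watch(self, records):
        """Yield every record untouched, updating the counters on the way."""
        for record in records:
            self.total += 1
            if record.get(self.field) in (None, ""):
                self.nulls += 1
            yield record

    @property
    def null_ratio(self):
        return self.nulls / self.total if self.total else 0.0

    def passed(self):
        return self.null_ratio <= self.max_null_ratio


# 1. write the probe and 2. set its indicator (hypothetical limit: 25% nulls)
probe = NullProbe("supplier", max_null_ratio=0.25)

# 3. pass data through (made-up records)
source = [
    {"supplier": "ACME", "amount": 10},
    {"supplier": None, "amount": 20},
    {"supplier": "Initech", "amount": 30},
    {"supplier": "", "amount": 40},
]
loaded = list(probe.watch(source))
print(f"{probe.field}: {probe.null_ratio:.2%} nulls, passed={probe.passed()}")
```

Because the probe only observes, it can be placed between any two pipeline steps without changing what the next step receives.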
[Diagram: YAML directory → coalesce values → PostgreSQL database table; coalesce values → data audit → threshold → formatted printer]
field                            nulls     status   distinct
------------------------------------------------------------
file                             0.00%         ok        100
source_code                      0.00%         ok          6
year                             0.00%         ok          6
donor_code                       0.00%         ok          2
receiver_name                    1.25%       fail      10363
receiver_address                13.29%       fail       9979
receiver_ico                    13.53%       fail       5813
project                          0.01%         ok      28370
program                          0.00%         ok         29
subprogram                      11.60%       fail        177
project_budget                  14.48%       fail       9487
requested_amount                88.73%       fail       1356
received_amount                  9.32%       fail       2179
contract_number                 13.29%       fail      28627
contract_date                   57.88%       fail       1425
source_comment                  99.93%       fail          9
source_id                       89.52%       fail        814
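A report with the same columns can be sketched in plain Python. The function names, the sample records and the 1% threshold are assumptions for illustration. The threshold used for the table above is not stated; 1% is merely consistent with 0.01% being "ok" and 1.25% being "fail".

```python
def audit(records, fields, threshold=0.01):
    """Return (field, null ratio, status, distinct count) for each field."""
    total = len(records)
    rows = []
    for field in fields:
        values = [record.get(field) for record in records]
        nulls = sum(1 for value in values if value is None)
        distinct = len({value for value in values if value is not None})
        ratio = nulls / total if total else 0.0
        status = "ok" if ratio <= threshold else "fail"
        rows.append((field, ratio, status, distinct))
    return rows


def format_report(rows):
    """Format audit rows as a fixed-width text table."""
    header = f"{'field':<28}{'nulls':>10}{'status':>10}{'distinct':>10}"
    lines = [header, "-" * len(header)]
    for field, ratio, status, distinct in rows:
        lines.append(f"{field:<28}{ratio:>10.2%}{status:>10}{distinct:>10}")
    return "\n".join(lines)


# Made-up records; field names borrowed from the table above.
records = [
    {"year": 2009, "receiver_ico": "123"},
    {"year": 2009, "receiver_ico": None},
    {"year": 2010, "receiver_ico": "456"},
    {"year": 2010, "receiver_ico": "123"},
]
rows = audit(records, ["year", "receiver_ico"])
print(format_report(rows))
```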
E and T from ETL
     E as Extraction
HTML Documents
Ceci ne sont pas des données
Data Cleansing introduction (for BigClean Prague 2011)
html
 body
  div id=#page
   div id=#page
    div id=#container
     div id=#main
      div id=#innerMain
       div (anonymous)
        div (anonymous)
         table > tbody > tr > td
          table > tbody > tr > td
           table > tbody > tr > td
            table > tr > td: value ✓
Now: you parse!
       3 seconds




   *non-technical explanation follows
<SPAN class=podnazov>More information
</SPAN>
?
<SPAN class=podnazov
  style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na&nbsp;projekt
...

here is a subtitle
and it should be in upper-case:
o
And here is another subtitle:
dkaz na (non-breaking space) projekt

                                              much better

               here is a label: Odkaz na projekt
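Recovering the label from markup like this can be sketched with Python's standard `html.parser`. The sketch makes several assumptions: the second `<SPAN>` is closed here although the slide elides the rest with "...", fragments are glued together without spaces (a heuristic for this particular page, not a general rule), and the names `LabelParser` and `extract_label` are hypothetical.

```python
from html.parser import HTMLParser

# Fragment from the slide; the final </SPAN> is an assumption.
SAMPLE = """
<SPAN class=podnazov
  style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na&nbsp;projekt
</SPAN>
"""


class LabelParser(HTMLParser):
    """Collects text from <span class=podnazov> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_label = False
        self.uppercase = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "podnazov":
            self.in_label = True
            style = (attributes.get("style") or "").lower()
            self.uppercase = "uppercase" in style

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_label = False

    def handle_data(self, data):
        if not self.in_label:
            return
        # Turn non-breaking spaces into plain spaces, drop layout whitespace.
        text = data.replace("\xa0", " ").strip()
        if text:
            self.parts.append(text.upper() if self.uppercase else text)


def extract_label(html):
    parser = LabelParser()
    parser.feed(html)
    parser.close()
    # Page-specific heuristic: adjacent fragments form one word.
    return "".join(parser.parts)


print(extract_label(SAMPLE))
```

The amount of code needed for a single label illustrates the point of the slide: presentation markup carries the data only by accident.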
“Structured”
spreadsheets


          error prone
          more work needed
✓ structured file format
[Annotated spreadsheet screenshot, marked 1 to 5]
    (1) image & title
    (2) repeating groups of columns
    (3) padding rows/columns
    (4) removed redundancy for readability
    (5) colored cells
[Annotated spreadsheet screenshot, marked 1 to 3]
(1) header with row padding
(2) multi-row logical cell
(3) broken pattern
[Annotated spreadsheet screenshot, marked 1 and 2]
(1) multi-row cell
(2) more values in a row
why?
[Diagram: source (“structured” file) → file format parser → data extraction → raw data (id, item, class, amount)]
why not?
E and T from ETL
    T as Transformation
Basic pattern
 slightly more technical
[Diagram: source + lists and maps → target; diff → target]
SELECT ...
EXCEPT
SELECT ...

      *available in PostgreSQL, not in MySQL at the time of this talk
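The `SELECT ... EXCEPT SELECT ...` diff can be tried out with Python's built-in `sqlite3` module, since SQLite also supports `EXCEPT`. The table and column names below are hypothetical stand-ins, not the tables used in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sta_contracts (supplier_ico TEXT);  -- hypothetical staging table
    CREATE TABLE sta_known (ico TEXT);               -- hypothetical reference list
    INSERT INTO sta_contracts VALUES ('111'), ('222'), ('333'), ('222');
    INSERT INTO sta_known VALUES ('111'), ('333');
""")

# Everything in the first result set that is missing from the second one.
unknown = [row[0] for row in conn.execute("""
    SELECT supplier_ico FROM sta_contracts
    EXCEPT
    SELECT ico FROM sta_known
    ORDER BY 1
""")]
print(unknown)
conn.close()
```

`EXCEPT` also removes duplicates, so each missing value is reported once even if it occurs in many source rows.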
[Diagram: set operations (+, −) over the tables sta_vvo_vysledky, sta_regis, map_suppliers and sta_suppliers, with a filter on Slovensko, in three steps]
(1) unknown suppliers
(2) tmp_coalesced_suppliers_sk
(3) new suppliers
Script or manual?

script
■ recurrent processing (weekly, monthly, ...)
■ huge amount of data

manual
■ one-time processing
■ small amount of data
appropriate tool for the given task
balance
Brewery
 data streams
Data Sources
■ CSV file
■ Google Spreadsheet
■ remote Excel Spreadsheet (URL)
Data Targets
■ relational database
■ report
[Diagram: data sources → data stream processing → data targets]
processing streams
[Diagram: data source → data row, data row, data row → data target; each data row is a list of values: value, value, value, value]
[Diagram: data source → data record, data record, data record → data target; each data record maps field names to values: id → value, item → value, class → value, amount → value]
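The difference between the two stream shapes can be sketched in Python: a row is a plain list of values whose meaning depends on position, while a record carries the field names with it. The helper names and the sample data are assumptions for illustration; only the field names come from the slide.

```python
fields = ["id", "item", "class", "amount"]


def rows_to_records(rows, fields):
    """Turn positional rows into records keyed by field name."""
    for row in rows:
        yield dict(zip(fields, row))


def records_to_rows(records, fields):
    """Turn records back into rows, in the given field order."""
    for record in records:
        yield [record.get(name) for name in fields]


# Made-up sample rows.
rows = [
    [1, "paper", "office", 12.5],
    [2, "laptop", "hardware", 890.0],
]
records = list(rows_to_records(rows, fields))
print(records[0])
```

Rows are compact and cheap to pass along; records are easier to work with when a step only needs a few named fields.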
Sources
■ CSV file
■ XLS file
■ SQL query
■ mongo DB
■ Google spreadsheet
■ YAML directory
■ row list
■ record list
Targets
■ CSV file
■ SQL table
■ mongo DB
■ YAML directory
■ HTML table
■ formatted printer
■ row list
■ record list
Record Operations
■ append
■ distinct
■ aggregate
■ merge (join)
■ sample
■ select
■ set select
■ data audit
■ numerical statistics*
Field Operations
■ field map
■ text substitute
■ value threshold*
■ derive*
■ string strip
■ consolidate value to type
■ histogram/bin*
■ set to flag*
    nodes = {
        "source": CSVSourceNode(...),
        "clean": CoalesceValueToTypeNode(),
        "output": DatabaseTableTargetNode(...),
        "audit": AuditNode(...),
        "threshold": ValueThresholdNode(),
        "print": FormattedPrinterNode()
    }

    connections = [
        ("source", "clean"),
        ("clean", "output"),
        ("clean", "audit"),
        ("audit", "threshold"),
        ("threshold", "print")
    ]

    ... # configure nodes here

    stream = Stream(nodes, connections)
    stream.initialize()
    stream.run()
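The connections list describes a directed graph of nodes. As an illustration of what such a list implies, the sketch below derives one valid processing order with the standard library's `graphlib` (Python 3.9+). It is not Brewery's implementation and says nothing about how Brewery actually schedules nodes; `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Same node names and connections as in the stream definition above.
connections = [
    ("source", "clean"),
    ("clean", "output"),
    ("clean", "audit"),
    ("audit", "threshold"),
    ("threshold", "print"),
]


def execution_order(connections):
    """Return the node names so that every node follows its upstream nodes."""
    sorter = TopologicalSorter()
    for upstream, downstream in connections:
        # graphlib expects: node first, then the nodes it depends on.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


order = execution_order(connections)
print(order)
```

The branch after "clean" is what lets the same cleaned data feed both the database table and the audit report.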

More Related Content

Viewers also liked

Data Quality Best Practices Nbk Auto May 06 2010
Data Quality Best Practices  Nbk Auto May 06 2010Data Quality Best Practices  Nbk Auto May 06 2010
Data Quality Best Practices Nbk Auto May 06 2010
Rami Mansour
 
Data-Ed: Best Practices with the Data Management Maturity Model
Data-Ed: Best Practices with the Data Management Maturity ModelData-Ed: Best Practices with the Data Management Maturity Model
Data-Ed: Best Practices with the Data Management Maturity Model
Data Blueprint
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas
 
Scaling Big Data Cleansing
Scaling Big Data CleansingScaling Big Data Cleansing
Scaling Big Data Cleansing
Zuhair khayyat
 
Best Practices: Data Admin & Data Management
Best Practices: Data Admin & Data ManagementBest Practices: Data Admin & Data Management
Best Practices: Data Admin & Data Management
Empowered Holdings, LLC
 
Data cleansing
Data cleansingData cleansing
Data cleansing
kunaljain1701
 

Viewers also liked (6)

Data Quality Best Practices Nbk Auto May 06 2010
Data Quality Best Practices  Nbk Auto May 06 2010Data Quality Best Practices  Nbk Auto May 06 2010
Data Quality Best Practices Nbk Auto May 06 2010
 
Data-Ed: Best Practices with the Data Management Maturity Model
Data-Ed: Best Practices with the Data Management Maturity ModelData-Ed: Best Practices with the Data Management Maturity Model
Data-Ed: Best Practices with the Data Management Maturity Model
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Scaling Big Data Cleansing
Scaling Big Data CleansingScaling Big Data Cleansing
Scaling Big Data Cleansing
 
Best Practices: Data Admin & Data Management
Best Practices: Data Admin & Data ManagementBest Practices: Data Admin & Data Management
Best Practices: Data Admin & Data Management
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 


Data Cleansing introduction (for BigClean Prague 2011)

  • 1. Data Cleansing What about quality? Stefan Urbanek stefan.urbanek@gmail.com @Stiivi March 2011
  • 2. Content ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  • 4. Brewery analytical data streams & Cubes online analytical processing github/bitbucket: Stiivi
  • 6. What is data quality ?
  • 7. Dimensions ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  • 9. Quality measure: completeness. How many % of the field is filled and successfully processed? 55%. [Chart: completeness by month, 2005-3 to 2010-9; vertical axis from 0% (none) to 100% (all), higher is better]
  • 10. type 1 type 2 +
  • 11. Quality measure: completeness. How many % of the field is filled and successfully processed? 88%. [Chart: completeness by month, 2005-3 to 2010-9; vertical axis from 0% (none) to 100% (all), higher is better]
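
The completeness measure on slides 9 and 11 can be sketched in a few lines of Python. This is only an illustration under assumptions: the `completeness` helper, the sample records and the rule that `None` and empty strings count as "not filled" are hypothetical and not taken from the talk.

```python
def completeness(records, field):
    """Return the share of records (0.0 to 1.0) in which `field` is filled."""
    if not records:
        return 0.0
    # Assumption: None and the empty string both mean "not filled".
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


# Hypothetical sample data, for illustration only.
sample = [
    {"supplier": "ACME s.r.o.", "amount": 1200},
    {"supplier": "", "amount": 300},
    {"supplier": None, "amount": 50},
    {"supplier": "Beta a.s.", "amount": None},
]

for name in ("supplier", "amount"):
    print(f"{name}: {completeness(sample, name):.0%}")
```

Grouping the records by month before calling the helper would give one value per month, which is roughly what the charts plot.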
  • 12. reconstruction: 5€ temperature: 32˚C accuracy
  • 14. Auto-measurable ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  • 15. What does that mean: “high quality data”?
  • 16. 85%
  • 19. Quality Measurement for accuracy and transparency
  • 20. ■ why measure? ■ when to measure? ■ where to measure?
  • 21. [Pipeline diagram] From source to staging data, since 2009: Download → Parse → Load source (raw sources: HTML files, YAML files, contracts table in staging). From source to staging data, 2005-2008: Download → Pre-process → Parse → Load source (raw sources: large HTML files, one per year; one HTML per procurement document; YAML files). From staging to analytical data: Cleanse → Create cube, Create search index (inputs: staging data, REGIS (SK organisations), suppliers map, "unknown" entries; outputs: clean data, analytical model description, fact table, dimension tables, search index). Keep intermediate results for auditability.
  • 22. [Same pipeline diagram as slide 21] Insert probes at appropriate places.
  • 23. like unit testing: 1. write probes 2. set data quality indicators 3. pass data through
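
The three steps on slide 23 could look roughly like this in plain Python. The `NullProbe` class, its parameter names and the 15% limit used here are assumptions made for illustration; they are not the Brewery API.

```python
class NullProbe:
    """Pass records through unchanged while counting empty values of one field."""

    def __init__(self, field, max_null_ratio):
        self.field = field
        self.max_null_ratio = max_null_ratio  # the data quality indicator
        self.total = 0
        self.nulls = 0

    def watch(self, records):
        # Generator: the pipeline keeps streaming, the probe only observes.
        for record in records:
            self.total += 1
            if record.get(self.field) in (None, ""):
                self.nulls += 1
            yield record

    @property
    def null_ratio(self):
        return self.nulls / self.total if self.total else 0.0

    def passed(self):
        return self.null_ratio <= self.max_null_ratio


# 1. write the probe, 2. set the indicator, 3. pass data through.
records = [{"ico": "123"}, {"ico": None}, {"ico": "456"}, {"ico": "789"}]
probe = NullProbe("ico", max_null_ratio=0.15)
loaded = list(probe.watch(records))
print(f"{probe.null_ratio:.2%}", "ok" if probe.passed() else "fail")
```

Because the probe is a pass-through generator, it can be placed between any two pipeline steps without changing what the next step receives.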
  • 24. [Stream diagram] YAML directory → coalesce values → database table (SQL, PostgreSQL); coalesce values → data audit → threshold (15.00%) → formatted printer ({x:.2%})
  • 25. Data audit output:

        field               nulls    status  distinct
        ------------------------------------------------
        file                0.00%    ok      100
        source_code         0.00%    ok      6
        year                0.00%    ok      6
        donor_code          0.00%    ok      2
        receiver_name       1.25%    fail    10363
        receiver_address    13.29%   fail    9979
        receiver_ico        13.53%   fail    5813
        project             0.01%    ok      28370
        program             0.00%    ok      29
        subprogram          11.60%   fail    177
        project_budget      14.48%   fail    9487
        requested_amount    88.73%   fail    1356
        received_amount     9.32%    fail    2179
        contract_number     13.29%   fail    28627
        contract_date       57.88%   fail    1425
        source_comment      99.93%   fail    9
        source_id           89.52%   fail    814
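
A report with the same columns as slide 25 could be produced along these lines. This is a sketch under assumptions: the `audit` function, the sample records and the threshold value are hypothetical, and whether the original audit counts nulls among the distinct values is not stated in the slides (here they are excluded).

```python
def audit(records, fields, threshold):
    """Return one (field, null_ratio, status, distinct_count) tuple per field."""
    report = []
    total = len(records)
    for field in fields:
        values = [r.get(field) for r in records]
        nulls = sum(1 for v in values if v is None)
        # Assumption: None is not counted as a distinct value.
        distinct = len({v for v in values if v is not None})
        ratio = nulls / total if total else 0.0
        status = "ok" if ratio <= threshold else "fail"
        report.append((field, ratio, status, distinct))
    return report


# Hypothetical records, for illustration only.
records = [
    {"year": 2009, "receiver_ico": "111"},
    {"year": 2009, "receiver_ico": None},
    {"year": 2010, "receiver_ico": "111"},
    {"year": 2010, "receiver_ico": "222"},
]

report = audit(records, ["year", "receiver_ico"], threshold=0.10)
print(f"{'field':<18}{'nulls':>8}  {'status':<7}{'distinct':>8}")
for field, ratio, status, distinct in report:
    print(f"{field:<18}{ratio:>8.2%}  {status:<7}{distinct:>8}")
```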
  • 26. E and T from ETL E as Extraction
  • 28. Ceci ne sont pas des données
  • 31. [Diagram: path through the document tree to a single value] html → body → div id=#page → div id=#container → div id=#main → div id=#innerMain → div (anonymous) → div (anonymous) → several nested levels of table → tbody → tr → td → value √
  • 33. Now: you parse! 3 seconds *non-technical explanation follows
  • 35. ?
  • 36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 37. ?
  • 38. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 39. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ... here is a subtitle and it should be in upper-case: o And here is another subtitle: dkaz na (non-breaking space) projekt much better here is a label: Odkaz na projekt
  • 40. “Structured” spreadsheets error prone more work needed
  • 42. 1 2 4 3 5 (1) image & title (2) repeating groups of columns (3) padding rows/columns (4) removed redundancy for readability (5) colored cells
  • 43. 1 2 3 (1) header with row padding (2) multi-row logical cell (3) broken pattern
  • 44. 1 2 (1) multi-row cell (2) more values in a row
  • 45. why? source id itemid file format parser data extraction class id item amount class item amount class why not? amount “structured” file raw data
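
Two of the spreadsheet problems listed on slides 42 to 44, padding rows and values left blank "for readability", can be undone with a small normalisation step. The sketch below is an assumption-laden illustration: `normalize_rows` and the sample rows (class, item, amount) are hypothetical.

```python
def normalize_rows(rows, fill_columns):
    """Drop empty padding rows and forward-fill blank cells in `fill_columns`."""
    last_seen = {}
    result = []
    for row in rows:
        if all(cell in (None, "") for cell in row):
            continue  # padding row, carries no data
        row = list(row)
        for index in fill_columns:
            if row[index] in (None, ""):
                # Blank cell: repeat the last value seen in this column.
                row[index] = last_seen.get(index)
            else:
                last_seen[index] = row[index]
        result.append(row)
    return result


# Hypothetical sheet: the class is written once per group, then left blank.
raw = [
    ["Education", "books", 100],
    ["", "desks", 250],
    ["", "", ""],
    ["Health", "beds", 400],
    ["", "masks", 30],
]

for row in normalize_rows(raw, fill_columns=[0]):
    print(row)
```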
  • 46. E and T from ETL T as Transformation
  • 47. Basic pattern slightly more technical
  • 48. source lists and maps ? + target ? diff ? target
  • 49. SELECT ... EXCEPT SELECT ... *in PostgreSQL, not in MySQL
  • 50. sta_vvo_vysledky sta_regis - - map_suppliers 1 unknown suppliers ? Slovensko + 2 + tmp_coalesced_suppliers_sk - sta_suppliers + 3 new suppliers
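
The `SELECT ... EXCEPT SELECT ...` diff from slide 49 can be tried out from Python. Note the substitutions: the sketch uses the standard `sqlite3` module (SQLite also supports `EXCEPT`) instead of PostgreSQL, the table names are borrowed from slide 50, and the `supplier_name` column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sta_vvo_vysledky (supplier_name TEXT);  -- staging: results
    CREATE TABLE map_suppliers (supplier_name TEXT);     -- known suppliers map
    INSERT INTO sta_vvo_vysledky VALUES
        ('Alfa s.r.o.'), ('Beta a.s.'), ('Gama s.r.o.');
    INSERT INTO map_suppliers VALUES
        ('Alfa s.r.o.'), ('Gama s.r.o.');
    """
)

# Diff: everything in the source that is not yet in the map.
unknown = conn.execute(
    """
    SELECT supplier_name FROM sta_vvo_vysledky
    EXCEPT
    SELECT supplier_name FROM map_suppliers
    ORDER BY supplier_name
    """
).fetchall()
conn.close()

print(unknown)
```

The rows returned are the "unknown suppliers" that still need to be mapped or added to the target.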
  • 52. Script or manual? Script: ■ recurrent processing (weekly, monthly, ...) ■ huge amount of data. Manual: ■ one-time processing ■ small amount of data
  • 53. appropriate tool for given task
  • 55. [Same pipeline diagram as slide 21]
  • 57. [Diagram: processing streams] Data sources (CSV file, Google Spreadsheet, Excel spreadsheet, remote URL) → data stream processing → data targets (relational database, report)
  • 58. [Diagram] Between data source and data target, data flows either as data rows (a list of values in the order of the fields id, item, class, amount) or as data records (field and value pairs: id: value, item: value, class: value, amount: value)
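
The two shapes on slide 58 map naturally onto Python lists and dictionaries. The helper names below are hypothetical; the field names come from the slide.

```python
fields = ["id", "item", "class", "amount"]


def row_to_record(row, fields):
    """Turn a positional row (list of values) into a record (field -> value)."""
    return dict(zip(fields, row))


def record_to_row(record, fields):
    """Turn a record back into a row, using `fields` to fix the value order."""
    return [record.get(name) for name in fields]


row = [1, "books", "Education", 100]
record = row_to_record(row, fields)
print(record)
print(record_to_row(record, fields))
```

Rows are compact and need the field list kept alongside them; records carry their field names with every value.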
  • 59. Sources: CSV file, XLS file, SQL query, MongoDB, Google spreadsheet, YAML directory, row list, record list
  • 60. Targets: CSV file, SQL table, MongoDB, YAML directory, HTML table, formatted printer, row list, record list
  • 61. Record Operations: append, distinct, aggregate, merge (join), sample, select, set select, data audit, numerical statistics*
  • 62. Field Operations: field map, text substitute, value threshold*, derive*, string strip, consolidate value to type, histogram/bin*, set to flag*
  • 63. + SQL ? <html> SQL
  • 64. Stream definition:

        nodes = {
            "source": CSVSourceNode(...),
            "clean": CoalesceValueToTypeNode(),
            "output": DatabaseTableTargetNode(...),
            "audit": AuditNode(...),
            "threshold": ValueThresholdNode(),
            "print": FormattedPrinterNode()
        }

        connections = [
            ("source", "clean"),
            ("clean", "output"),
            ("clean", "audit"),
            ("audit", "threshold"),
            ("threshold", "print")
        ]

        ...  # configure nodes here

        stream = Stream(nodes, connections)
        stream.initialize()
        stream.run()
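
The nodes-and-connections idea from slide 64 can be imitated without Brewery to see how the pieces fit. What follows is a stand-in, not the Brewery API: `run_stream`, the plain functions used as nodes, the sample records and the 15% limit are all assumptions, and the database output node is left out.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_stream(nodes, connections):
    """Run every node once, in dependency order, and return all outputs."""
    upstream = {target: source for source, target in connections}
    sorter = TopologicalSorter({name: set() for name in nodes})
    for source, target in connections:
        sorter.add(target, source)
    outputs = {}
    for name in sorter.static_order():
        data = outputs[upstream[name]] if name in upstream else None
        outputs[name] = nodes[name](data)
    return outputs


def source(_):
    # Hypothetical raw records, as they might come from a CSV file.
    return [
        {"name": "Alfa", "amount": "10"},
        {"name": "Beta", "amount": "n/a"},
        {"name": "Gama", "amount": "5.5"},
        {"name": "Delta", "amount": "7"},
    ]


def clean(records):
    # Coalesce amounts to float; unparseable values become None.
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            amount = None
        cleaned.append({"name": record["name"], "amount": amount})
    return cleaned


def audit(records):
    # Share of None values per field.
    return {f: sum(r[f] is None for r in records) / len(records)
            for f in records[0]}


def threshold(ratios, limit=0.15):
    return {f: ("ok" if ratio <= limit else "fail")
            for f, ratio in ratios.items()}


def printer(statuses):
    lines = [f"{field:<10}{status}" for field, status in statuses.items()]
    print("\n".join(lines))
    return lines


nodes = {"source": source, "clean": clean, "audit": audit,
         "threshold": threshold, "print": printer}
connections = [("source", "clean"), ("clean", "audit"),
               ("audit", "threshold"), ("threshold", "print")]
outputs = run_stream(nodes, connections)
```

Keeping the graph as data (a dictionary of nodes plus a list of connections) is what makes it easy to insert an audit branch next to the main flow.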
