Data Cleansing introduction (for BigClean Prague 2011)

Data Cleansing
What about quality?

Stefan Urbanek
stefan.urbanek@gmail.com
@Stiivi March 2011

Content

■ Introduction
■ What is data quality?
■ E and T from ETL
■ Summary

http://vestnik.transparency.sk

Brewery
analytical data streams

&
Cubes
online analytical processing

github/bitbucket: Stiivi

What is data quality?

Dimensions
■ completeness – data provided
■ accuracy – reflecting real world
■ credibility – regarded as true
■ timeliness – up-to-date
■ consistency – matching facts across datasets
■ integrity – valid references between datasets

Quality measure: completeness
how many % of the field is filled and successfully processed?

[Chart: completeness over time, 2005 to 2010, on a scale from 0% (none) to 100% (all); higher is better]

completeness: 55%

Quality measure: completeness
how many % of the field is filled and successfully processed?

[Chart: completeness over time, 2005 to 2010, on a scale from 0% (none) to 100% (all); higher is better]

completeness: 88%

reconstruction: 5€

temperature: 32˚C

accuracy

Auto-measurable
■ completeness – easily
■ accuracy – somehow
■ credibility – not-so
■ timeliness – easily
■ consistency – yes
■ integrity – yes
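
The dimensions marked as easy to measure need very little code. A minimal sketch, assuming records are plain Python dictionaries; the function names and the sample data are made up for illustration and do not come from any library mentioned in these slides:

```python
def completeness(records, field):
    """Share of records in which the field is filled (not None, not empty)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def integrity(records, field, valid_keys):
    """Share of filled references that point to an existing key."""
    refs = [r[field] for r in records if r.get(field) not in (None, "")]
    if not refs:
        return 1.0
    return sum(1 for ref in refs if ref in valid_keys) / len(refs)


# Hypothetical sample: three contracts referencing a supplier table.
contracts = [
    {"id": 1, "supplier_id": 10},
    {"id": 2, "supplier_id": None},
    {"id": 3, "supplier_id": 99},
]
suppliers = {10, 20}

print(f"completeness: {completeness(contracts, 'supplier_id'):.0%}")
print(f"integrity: {integrity(contracts, 'supplier_id', suppliers):.0%}")
```

Accuracy and credibility have no such shortcut: they need a trusted reference or a human judgement to compare against.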

What does "high quality data" mean?

appropriate for the given purpose

Quality Measurement
for accuracy and transparency

■ why to measure?
■ when to measure?
■ where to measure?

from source to staging data, then from staging to analytical data

[Diagram: processing pipeline]

since 2009
  steps: Download → Parse → Load source → Cleanse → Create cube
  data: raw sources (HTML files), YAML files, contracts table (staging), clean data (staging), analytical model description

2005-2008
  steps: Download, Pre-process, Parse, Load source
  data: raw sources (large HTML files, one per year), one HTML per procurement document, YAML files

other elements: REGIS (SK organisations), "unknown" suppliers map, fact table, dimension tables, search index (Create search index), dimension index

keep intermediate results for auditability




[Diagram repeated: source-to-staging pipeline, 2005-2008 (Download, Pre-process, Parse, Load source)]

insert probes at appropriate places

like unit testing:

1. write probes
2. set data quality indicators
3. pass data through
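
The three steps can be sketched in plain Python. This is an illustration only, not the Brewery implementation: `NullProbe`, the indicator values and the sample records are all hypothetical.

```python
from collections import Counter


class NullProbe:
    """1. the probe: passes records through unchanged and counts empty values."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.total = 0
        self.nulls = Counter()

    def __call__(self, records):
        for record in records:
            self.total += 1
            for field in self.fields:
                if record.get(field) in (None, ""):
                    self.nulls[field] += 1
            yield record

    def ratios(self):
        if not self.total:
            return {field: 0.0 for field in self.fields}
        return {field: self.nulls[field] / self.total for field in self.fields}


# 2. data quality indicators: maximum tolerated share of empty values
indicators = {"name": 0.0, "amount": 0.5}

source = [
    {"name": "A", "amount": 10},
    {"name": "B", "amount": None},
    {"name": "", "amount": 5},
    {"name": "D", "amount": 7},
]

# 3. pass data through: the probe sits between two processing steps
probe = NullProbe(indicators)
loaded = list(probe(source))

failures = {f: r for f, r in probe.ratios().items() if r > indicators[f]}
print(failures)
```

Because the probe yields every record untouched, it can be inserted or removed without changing what the pipeline produces.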

[Diagram: YAML directory → coalesce values → PostgreSQL database table; coalesce values → data audit → threshold → formatted printer]

field               nulls   status  distinct
--------------------------------------------
file                0.00%   ok           100
source_code         0.00%   ok             6
year                0.00%   ok             6
donor_code          0.00%   ok             2
receiver_name       1.25%   fail       10363
receiver_address   13.29%   fail        9979
receiver_ico       13.53%   fail        5813
project             0.01%   ok         28370
program             0.00%   ok            29
subprogram         11.60%   fail         177
project_budget     14.48%   fail        9487
requested_amount   88.73%   fail        1356
received_amount     9.32%   fail        2179
contract_number    13.29%   fail       28627
contract_date      57.88%   fail        1425
source_comment     99.93%   fail           9
source_id          89.52%   fail         814
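
A report like the one above is a threshold check plus percentage formatting. A minimal sketch: the function name is hypothetical, and the 1% threshold is an assumption, since the slides only show that 0.01% passes and 1.25% fails. The three sample rows reuse figures from the table.

```python
def audit_report(stats, threshold):
    """Format audit results; stats maps field -> (null ratio, distinct count)."""
    header = f"{'field':<18}{'nulls':>8}  {'status':<6}{'distinct':>9}"
    lines = [header, "-" * len(header)]
    for field, (nulls, distinct) in stats.items():
        # A field fails when its share of nulls exceeds the threshold.
        status = "ok" if nulls <= threshold else "fail"
        lines.append(f"{field:<18}{nulls:>8.2%}  {status:<6}{distinct:>9}")
    return "\n".join(lines)


sample = {
    "file": (0.0, 100),
    "project": (0.0001, 28370),
    "receiver_name": (0.0125, 10363),
}

# Assumed threshold: at most 1% of values may be null.
print(audit_report(sample, threshold=0.01))
```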

E and T from ETL
E as Extraction

html
body
div id=#page
div id=#page
div id=#container
div id=#main
div id=#innerMain
div (anonymous)
div (anonymous)
table > tbody > tr > td
table > tbody > tr > td
table > tbody > tr > td
table > tr > td > value ✓

Now: you parse!
3 seconds

*non-technical explanation follows

More information

o

dkaz na projekt
...

o

dkaz na projekt
...

here is a subtitle
and it should be in upper-case:
o
And here is another subtitle:
dkaz na (non-breaking space) projekt

much better

here is a label: Odkaz na projekt
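
Getting from the split pieces back to the label "Odkaz na projekt" ("Link to project") means joining the text nodes and normalising whitespace. A minimal sketch using only the Python standard library; the markup in `fragment` is a hypothetical reconstruction, not the real page source:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect all text nodes of a fragment, ignoring the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_label(fragment):
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    text = "".join(collector.parts)
    # Replace non-breaking spaces, then collapse runs of whitespace.
    return " ".join(text.replace("\u00a0", " ").split())


# Hypothetical markup: a decorated first letter and a non-breaking space.
fragment = '<span class="cap">O</span>dkaz na&nbsp;projekt'
print(extract_label(fragment))
```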

“Structured”
spreadsheets

error prone
more work needed

[Screenshot: spreadsheet with numbered callouts 1-5]

(1) image & title
(2) repeating groups of columns
(3) padding rows/columns
(4) removed redundancy for readability
(5) colored cells
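
Problems (2) and (3) can be handled mechanically once the layout is known. A minimal sketch, assuming each row starts with an id followed by repeating (item, class, amount) groups; the function name and sample rows are hypothetical:

```python
def is_empty(cells):
    return all(cell in (None, "") for cell in cells)


def unpivot(rows, group_fields):
    """Turn repeating groups of columns into one record per group."""
    size = len(group_fields)
    records = []
    for row in rows:
        if is_empty(row):
            continue  # padding row
        row_id, rest = row[0], row[1:]
        for start in range(0, len(rest), size):
            group = rest[start:start + size]
            if is_empty(group):
                continue  # unused group at the end of the row
            record = {"id": row_id}
            record.update(zip(group_fields, group))
            records.append(record)
    return records


rows = [
    [1, "paper", "office", 20, "ink", "office", 35],
    ["", "", "", "", "", "", ""],
    [2, "fuel", "transport", 80, "", "", ""],
]

for record in unpivot(rows, ["item", "class", "amount"]):
    print(record)
```

Titles, colored cells and values removed "for readability" still need a human decision about what they mean.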

[Screenshot: spreadsheet with numbered callouts 1-3]

(1) header with row padding
(2) multi-row logical cell
(3) broken pattern

[Screenshot: spreadsheet with numbered callouts 1-2]

(1) multi-row cell
(2) more values in a row

why?

[Diagram: source → file format parser → data extraction → records with id, item, class, amount]

why not?

[Diagram: "structured" file → raw data]

E and T from ETL
T as Transformation

Basic pattern
slightly more technical

[Diagram: source combined (+) with lists and maps gives target; the diff against target shows what is still unmatched]

SELECT ...
EXCEPT
SELECT ...

*in PostgreSQL, not in MySQL
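
The diff step can be tried without a database server. A minimal sketch using SQLite through Python's standard `sqlite3` module (SQLite also supports `EXCEPT`); the table and column names are made up for illustration. The second query is an equivalent form for engines without `EXCEPT`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (name TEXT);
    CREATE TABLE target (name TEXT);
    INSERT INTO source VALUES ('Alfa'), ('Beta'), ('Gama');
    INSERT INTO target VALUES ('Alfa'), ('Gama');
""")

# Rows present in source but missing from target.
diff = conn.execute("""
    SELECT name FROM source
    EXCEPT
    SELECT name FROM target
    ORDER BY name
""").fetchall()

# Same result where EXCEPT is not available: anti-join.
diff_join = conn.execute("""
    SELECT DISTINCT s.name
    FROM source s
    LEFT JOIN target t ON t.name = s.name
    WHERE t.name IS NULL
    ORDER BY s.name
""").fetchall()

print(diff, diff_join)
conn.close()
```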

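[Diagram: supplier matching in three numbered steps, built from set differences (−) and unions (+) over the tables sta_vvo_vysledky, sta_regis, map_suppliers, tmp_coalesced_suppliers_sk and sta_suppliers: (1) unknown suppliers, (2) coalesced suppliers for Slovensko (Slovakia), (3) new suppliers]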

Script or manual?

script
■ recurrent processing (weekly, monthly, ...)
■ huge amount of data

manual
■ one-time processing
■ small amount of data

appropriate tool
for given task




[Diagram repeated: source-to-staging pipeline, 2005-2008 (Download, Pre-process, Parse, Load source)]

Data Sources, Data Targets

[Diagram: data sources and data targets connected by data stream processing; items shown: CSV file, relational database, Google Spreadsheet, report, remote Excel spreadsheet (URL)]

processing streams

as rows:
data source → data row, data row, data row → data target
each row: value, value, value, value

as records:
data source → data record, data record, data record → data target
each record: id: value, item: value, class: value, amount: value
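
The two stream shapes carry the same data; only the field names travel differently. A minimal sketch in plain Python, with hypothetical helper names and sample values: a row is a list of values in field order, a record is a dictionary keyed by field name.

```python
fields = ["id", "item", "class", "amount"]


def rows_to_records(rows, fields):
    """Attach field names to positional values."""
    return [dict(zip(fields, row)) for row in rows]


def records_to_rows(records, fields):
    """Drop field names, keeping values in the given field order."""
    return [[record.get(field) for field in fields] for record in records]


rows = [
    [1, "paper", "office", 20],
    [2, "fuel", "transport", 80],
]

records = rows_to_records(rows, fields)
print(records[0])
print(records_to_rows(records, fields) == rows)
```

Rows are compact and suit sources with a fixed layout; records tolerate missing or reordered fields.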

Sources
■ CSV file
■ XLS file
■ SQL query
■ mongo DB
■ Google spreadsheet
■ YAML directory
■ row list
■ record list

Targets
■ CSV file
■ SQL table
■ mongo DB
■ YAML directory
■ HTML table
■ formatted printer
■ row list
■ record list

Record Operations
■ append
■ distinct
■ aggregate
■ merge (join)
■ sample
■ select
■ set select
■ data audit
■ numerical statistics*

Field Operations
■ field map
■ text substitute
■ value threshold*
■ derive*
■ string strip
■ consolidate value to type
■ histogram/bin*
■ set to flag*

    # Node classes and Stream come from Brewery.
    nodes = {
        "source": CSVSourceNode(...),
        "clean": CoalesceValueToTypeNode(),
        "output": DatabaseTableTargetNode(...),
        "audit": AuditNode(...),
        "threshold": ValueThresholdNode(),
        "print": FormattedPrinterNode()
    }

    # Each pair is (from node, to node): cleaned data goes to the database
    # table and, in parallel, through audit and threshold to the printer.
    connections = [
        ("source", "clean"),
        ("clean", "output"),
        ("clean", "audit"),
        ("audit", "threshold"),
        ("threshold", "print")
    ]

    ... # configure nodes here

    stream = Stream(nodes, connections)
    stream.initialize()
    stream.run()
