Data Cleansing introduction (for BigClean Prague 2011)
Apr. 3, 2011
Presentation from the BigClean event in spring 2011 in Prague. It briefly introduces data quality and data cleansing, and shows some examples from existing open data and open government projects.
■ why measure?
■ when to measure?
■ where to measure?
[Diagram: procurement data pipeline]
■ from source to staging data (since 2009): Download → Parse → Load source; raw sources → HTML files → YAML files → contracts table (staging)
■ from source to staging data (2005-2008): Download, Pre-process, Parse, Load source; raw sources are large HTML files (one per year), split into one HTML per document, then YAML files
■ from staging to analytical data: Cleanse → Create cube; staging → clean data; uses REGIS (SK organisations), the "unknown" suppliers map and the analytical model description; produces the fact table and dimension tables
■ Create search index: search index, dimension index
keep intermediate results for auditability
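One way to keep intermediate results is to write the output of every stage to its own file before handing it to the next stage. The following is a minimal sketch of that idea, not the pipeline used in the project: the stage names, the `run_stage` helper, the `supplier;amount` line format and the sample data are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, func, data, workdir):
    """Run one stage and store its output on disk before passing it on."""
    result = func(data)
    # One JSON file per stage, so each intermediate result can be audited later.
    path = Path(workdir) / f"{name}.json"
    path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


def parse(raw_lines):
    # Hypothetical parser: split "supplier;amount" lines into records.
    records = []
    for line in raw_lines:
        supplier, amount = line.split(";")
        records.append({"supplier": supplier, "amount": amount})
    return records


def cleanse(records):
    # Hypothetical cleansing: trim whitespace and convert amounts to numbers.
    return [
        {"supplier": r["supplier"].strip(), "amount": float(r["amount"])}
        for r in records
    ]


def run_pipeline(raw_lines, workdir):
    data = run_stage("01_raw", list, raw_lines, workdir)
    data = run_stage("02_parsed", parse, data, workdir)
    data = run_stage("03_clean", cleanse, data, workdir)
    return data


workdir = tempfile.mkdtemp()
clean = run_pipeline([" ACME ;100.50", "Foo s.r.o.;20"], workdir)
```

After the run, `workdir` holds `01_raw.json`, `02_parsed.json` and `03_clean.json`, so a suspicious value in the clean data can be traced back stage by stage.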
insert probes at appropriate places
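A probe can be as simple as a pass-through step that counts things while records flow by. The sketch below is one possible shape for such a probe, under the assumption that records are dictionaries; the `Probe` class, the `supplier` field and the sample records are hypothetical.

```python
class Probe:
    """Pass records through unchanged while collecting simple quality counts."""

    def __init__(self, name, field):
        self.name = name
        self.field = field
        self.total = 0
        self.empty = 0
        self.distinct = set()

    def watch(self, records):
        for record in records:
            value = record.get(self.field)
            self.total += 1
            if value is None or str(value).strip() == "":
                self.empty += 1
            else:
                self.distinct.add(value)
            yield record  # the probe never changes the data

    def report(self):
        return {
            "probe": self.name,
            "total": self.total,
            "empty": self.empty,
            "distinct": len(self.distinct),
        }


def cleanse(records):
    # Hypothetical cleansing step: strip whitespace, turn empty strings into None.
    for record in records:
        supplier = (record["supplier"] or "").strip()
        yield {**record, "supplier": supplier or None}


parsed = [{"supplier": "ACME "}, {"supplier": "ACME"}, {"supplier": ""}, {"supplier": None}]

before = Probe("after parse", "supplier")
after = Probe("after cleanse", "supplier")
result = list(after.watch(cleanse(before.watch(parsed))))
```

Comparing `before.report()` with `after.report()` shows what the cleansing step did: here the number of distinct supplier names drops from 2 to 1 while the number of empty values stays at 2.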
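[Diagram: path to one value in the page's element tree]
html > body > div id=#page > div id=#page > div id=#container > div id=#main > div id=#innerMain > div (anonymous) > div (anonymous)
> table > tbody > tr > td > table > tbody > tr > td > table > tbody > tr > td > table > tr > td > value √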
<SPAN class=podnazov
style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na projekt
...
here is a subtitle
and it should be in upper-case:
o
And here is another subtitle:
dkaz na (non-breaking space) projekt
much better
here is a label: Odkaz na projekt
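The label above can be recovered by joining the text of the adjacent spans, applying the upper-case style where it is set, and replacing the non-breaking space. The sketch below uses Python's standard `html.parser`; the `LabelExtractor` class is hypothetical, and the input assumes that the two spans follow each other without whitespace in the real page.

```python
from html.parser import HTMLParser


class LabelExtractor(HTMLParser):
    """Collect text from spans, honouring 'text-transform: uppercase'."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.uppercase_stack = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower().replace(" ", "")
            self.uppercase_stack.append("text-transform:uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self.uppercase_stack:
            self.uppercase_stack.pop()

    def handle_data(self, data):
        # Apply the styling that the browser would apply when rendering.
        if any(self.uppercase_stack):
            data = data.upper()
        self.parts.append(data)

    def label(self):
        # Replace non-breaking spaces and collapse runs of whitespace.
        text = "".join(self.parts).replace("\u00a0", " ")
        return " ".join(text.split())


source = (
    '<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o</SPAN>'
    "<SPAN class=podnazov>dkaz na\u00a0projekt</SPAN>"
)

parser = LabelExtractor()
parser.feed(source)
parser.close()
label = parser.label()  # "Odkaz na projekt"
```

The point of the sketch is that the label only exists after rendering rules are applied, so a scraper that reads the raw text nodes one by one would see "o" and "dkaz na projekt" as two separate subtitles.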
[Diagram: Data Sources → data stream processing → Data Targets. Labels shown: CSV file, relational database, Google Spreadsheet, remote Excel Spreadsheet URL, report]
processing streams
■ as rows: data source → data row, data row, data row → data target; each row is a list of values (value, value, value, value)
■ as records: data source → data record, data record, data record → data target; each record pairs field names with values (id: value, item: value, class: value, amount: value)
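The two stream representations carry the same data: a row is a plain list of values whose meaning comes from a shared field list, while a record carries its field names with it. A minimal sketch of converting between them follows; the helper names and the sample values are invented for illustration, only the field names come from the slide.

```python
fields = ["id", "item", "class", "amount"]

# Invented sample data in row form: values only, order given by `fields`.
rows = [
    [1, "paper", "office", 120.0],
    [2, "toner", "office", 80.5],
]


def rows_to_records(fields, rows):
    """Pair every value with its field name."""
    return [dict(zip(fields, row)) for row in rows]


def records_to_rows(fields, records):
    """Drop the field names and keep values in the order given by `fields`."""
    return [[record.get(name) for name in fields] for record in records]


records = rows_to_records(fields, rows)
round_trip = records_to_rows(fields, records)
```

Rows are compact and suit CSV files or SQL result sets; records are more convenient when fields are optional or when a step needs to address values by name.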
Sources
■ CSV file
■ XLS file
■ SQL query
■ mongo DB
■ Google spreadsheet
■ YAML directory
■ row list
■ record list
Targets
■ CSV file
■ SQL table
■ mongo DB
■ YAML directory
■ HTML table
■ formatted printer (format string such as {x:.2%}, giving 15.00%)
■ row list
■ record list
Record Operations
■ append
■ distinct
■ aggregate
■ merge (join)
■ sample
■ select
■ set select
■ data audit
■ numerical statistics*
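Three of the record operations (append, distinct, aggregate) can be sketched with plain Python over lists of dictionaries. This is an illustration of what such operations might do, not the API of any particular library; the function signatures and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import chain


def append(*streams):
    """Concatenate several record streams into one."""
    return chain(*streams)


def distinct(records, keys):
    """Yield only the first record for each combination of key values."""
    seen = set()
    for record in records:
        key = tuple(record[name] for name in keys)
        if key not in seen:
            seen.add(key)
            yield record


def aggregate(records, key, measure):
    """Sum `measure` for each value of `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[measure]
    return [{key: value, measure: total} for value, total in totals.items()]


# Invented sample data: two sources that overlap on id 2.
source_a = [
    {"id": 1, "class": "office", "amount": 120.0},
    {"id": 2, "class": "it", "amount": 80.5},
]
source_b = [
    {"id": 2, "class": "it", "amount": 80.5},
    {"id": 3, "class": "office", "amount": 30.0},
]

merged = list(distinct(append(source_a, source_b), ["id"]))
totals = aggregate(merged, "class", "amount")
```

Running distinct before aggregate matters here: without it the duplicated record with id 2 would be counted twice in the total for its class.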
Field Operations
■ field map
■ text substitute
■ value threshold*
■ derive*
■ string strip
■ consolidate value to type
■ histogram/bin*
■ set to flag*
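Field operations work inside a single record. The sketch below shows one possible reading of field map, string strip and text substitute as small functions that return a new record; the function signatures, field names and sample values are hypothetical.

```python
import re


def field_map(record, rename=None, drop=()):
    """Rename some fields and drop others."""
    rename = rename or {}
    return {rename.get(k, k): v for k, v in record.items() if k not in drop}


def string_strip(record, fields):
    """Remove leading and trailing whitespace from the given string fields."""
    out = dict(record)
    for name in fields:
        if isinstance(out.get(name), str):
            out[name] = out[name].strip()
    return out


def text_substitute(record, field, pattern, replacement):
    """Apply a regular expression substitution to one field."""
    out = dict(record)
    out[field] = re.sub(pattern, replacement, out[field])
    return out


# Invented sample record with typical problems.
raw = {"ID": 7, "supplier": "  ACME,  s.r.o. ", "note": "x"}

step1 = field_map(raw, rename={"ID": "id"}, drop=("note",))
step2 = string_strip(step1, ["supplier"])
step3 = text_substitute(step2, "supplier", r"\s+", " ")
```

Each function leaves its input untouched, which fits the earlier advice to keep intermediate results: `step1`, `step2` and `step3` can all be inspected after the run.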