Presentation at the Semstats2017 workshop (http://semstats.org/2017/) for the paper "Publishing Linked Statistical Data: Aragón, a Case Study", by Oscar Corcho, Idafen Santana-Pérez, Hugo Lafuente, David Portolés, César Cano, Alfredo Peris, José María Subero.
Publishing Linked Statistical Data: Aragón, a case study
1. Oscar Corcho (1), Idafen Santana-Pérez (1), Hugo Lafuente (2), David Portolés (3), César Cano (4), Alfredo Peris (4) and José María Subero (4)
(1) Ontology Engineering Group, Universidad Politécnica de Madrid
(2) Localidata
(3) Idearium Consultores
(4) Gobierno de Aragón
ocorcho@fi.upm.es
@ocorcho
22/10/2017
SemStats 2017 @ ISWC
2. Publishing Linked Statistical Data: Aragón, a case study. – SemStats 2017
Context
IAEst: Instituto Aragonés de Estadística
o http://www.aragon.es/iaest
o The statistical office of Aragón
o Offering open data through
• Open Data portal in Aragón (http://opendata.aragon.es/)
• Their own portal (our interest is in the “estadística local” (local statistics) database)
4. Context: Existing IAEst data infrastructure
Existing data infrastructure
o Data warehouse infrastructure based on Oracle BI
o Exports into different formats, including CSV
o http://www.aragon.es/DepartamentosOrganismosPublicos/Institutos/InstitutoAragonesEstadistica/AreasGenericas/ci.EstadisticaLocal.detalleDepartamento
Data retrieval and browsing
o Taxonomy-based
o Fixed filters coded in the app
o The user
• Selects the administrative division
• Selects the specific municipality
• Browses the folder structure
o Data retrieved in HTML, PDF or CSV
5. Context: Existing IAEst web app
[Screenshots: predesigned reports offered from Oracle BI; web app for Estadística Local]
6. Context: Existing IAEst data sharing
On the IAEst website
o http://www.aragon.es/DepartamentosOrganismosPublicos/Institutos/InstitutoAragonesEstadistica/AreasGenericas/ci.EstadisticaLocal.detalleDepartamento
On Open Data Aragón
o http://opendata.aragon.es/catalogo/edificios-superficie-y-vivienda-comarcas
7. Goals
Extract those statistical reports, transform them into
RDF according to W3C standards, curate them, link
them to the existing Linked Data from Aragón (mostly
URIs from municipalities and regions) and provide an
API and a new user interface to make use of them
8. Results
An easier-to-maintain data transformation process
o Enriching existing Linked Data APIs from Aragón
o Using GitHub for
• Version control and archival
• Continuous updates: detecting new data and data structures
on a daily basis
• https://github.com/aragonopendata/local-data-aragopedia/
Developer-friendly API
Additional user interface
o Improving data retrieval and browsing capabilities
Side effect: data curation
o Many errors and improvements detected in pre-existing CSV
exports, which have been corrected throughout the process
9. Transformation and publication process
Initial characterisation
• Identify sources
• Identify dimensions and measurements
Transformation
• Daily data download
• Processing (UTF-8)
• Upload into GitHub
• New dimensions/measures annotation
• RDF transformation
Publication and use
• Linked Data APIs
https://github.com/aragonopendata/local-data-aragopedia/
10. Initial characterisation
Identify and download data sources to be published (~1000)
o https://github.com/aragonopendata/local-data-aragopedia/tree/master/data/resource/DatosDescarga-UTF8
Pre-process data (UTF-8 encoding, download error verification and retries)
Identify potential dimensions and measurements
o Analysis of column header names (e.g., municipio, comarca) and of data content (number of distinct values per column)
• https://github.com/aragonopendata/local-data-aragopedia/blob/master/data/resource/heads.txt
o From 700+ dimensions to ~500
• Curated by IAEst experts (e.g., unifying variants such as Male, M, Males, Female, F, Females, Women, Men)
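The header and content analysis described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the threshold `MAX_DISTINCT_FOR_DIMENSION`, the `LABEL_SYNONYMS` table and the function names are hypothetical, and the real curation was done by IAEst experts rather than by a fixed lookup.

```python
import csv
import io

# Assumption: a non-numeric column with at most this many distinct values
# is treated as a dimension. The real cut-off is not given in the slides.
MAX_DISTINCT_FOR_DIMENSION = 20

# Hypothetical curation table mapping raw variants to one curated label.
LABEL_SYNONYMS = {
    "male": "Male", "m": "Male", "males": "Male", "men": "Male",
    "female": "Female", "f": "Female", "females": "Female", "women": "Female",
}


def normalise_label(raw):
    """Map a raw cell value to its curated label, if one is known."""
    cleaned = raw.strip()
    return LABEL_SYNONYMS.get(cleaned.lower(), cleaned)


def _is_number(text):
    try:
        float(text.replace(",", "."))
        return True
    except ValueError:
        return False


def classify_columns(csv_text, known_dimensions=("municipio", "comarca")):
    """Guess, per column, whether it is a dimension or a measure."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    result = {}
    for header in rows[0]:
        values = {normalise_label(row[header] or "") for row in rows}
        non_empty = [v for v in values if v]
        if not non_empty:
            result[header] = "empty"
        elif header.lower() in known_dimensions:
            result[header] = "dimension"   # recognised by header name
        elif all(_is_number(v) for v in non_empty):
            result[header] = "measure"     # purely numeric content
        elif len(non_empty) <= MAX_DISTINCT_FOR_DIMENSION:
            result[header] = "dimension"   # few distinct values
        else:
            result[header] = "unknown"     # left for manual review
    return result


# Made-up sample export, for illustration only.
sample = "municipio,sexo,poblacion\nTeruel,M,100\nTeruel,Female,120\nHuesca,Males,90\n"
print(classify_columns(sample))
```

Columns classified as "unknown" or "empty" would still need a human decision, which matches the expert curation step mentioned above.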
11. Initial characterisation
SKOS concept schemes for each dimension
o https://github.com/aragonopendata/local-data-aragopedia/tree/master/data/dump/DatosTTL/codelists
o Mapping files available in GitHub (e.g., https://github.com/aragonopendata/local-data-aragopedia/blob/master/data/metadata/mapping-tipo-edificio-detalle.xlsx)
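As a rough idea of what a SKOS concept scheme for one dimension looks like, the sketch below writes one as Turtle from a list of labels. The base URI `http://example.org/kos/`, the `@es` language tag and the function names are assumptions made for the example; the project's real URIs and code lists are in the repository linked above.

```python
import re
import unicodedata


def slugify(label):
    """Turn a label into a URI-safe, accent-free, lower-case slug."""
    ascii_text = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def concept_scheme_turtle(base_uri, scheme_name, labels):
    """Serialise one skos:ConceptScheme and its concepts as Turtle text."""
    scheme = f"<{base_uri}{scheme_name}>"
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"{scheme} a skos:ConceptScheme .",
    ]
    for label in labels:
        concept = f"<{base_uri}{scheme_name}/{slugify(label)}>"
        # Escape backslashes and quotes so the literal stays valid Turtle.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines += [
            "",
            f"{concept} a skos:Concept ;",
            f'    skos:prefLabel "{escaped}"@es ;',
            f"    skos:inScheme {scheme} .",
        ]
    return "\n".join(lines)


# Hypothetical base URI and labels, for illustration only.
print(concept_scheme_turtle("http://example.org/kos/", "sexo", ["Hombre", "Mujer"]))
```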
12. Initial characterisation
Measurement properties
o https://github.com/aragonopendata/local-data-aragopedia/blob/master/data/dump/DatosTTL/codelists/properties.ttl
Data Structure Definitions (DSDs)
o https://github.com/aragonopendata/local-data-aragopedia/tree/master/data/dump/DatosTTL/dataStructures
Errors were identified during this phase
o Same concept, different names (e.g., sexo and género)
o Typos in header names
o Columns with no values
o Data belonging to wrong municipalities and districts
o https://github.com/aragonopendata/local-data-aragopedia/blob/master/data/dump/errorReport.txt
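Two of the error types listed above (columns with no values, and likely typos in header names) lend themselves to simple automated checks. The following is a minimal sketch under that assumption; the function names and the 0.85 similarity cut-off are hypothetical, and synonyms such as sexo and género would not be caught by string similarity and still need a curated mapping.

```python
from difflib import SequenceMatcher
from itertools import combinations


def empty_columns(rows):
    """Return the headers whose cells are blank in every row."""
    headers = list(rows[0].keys()) if rows else []
    return [
        h for h in headers
        if all(not (row.get(h) or "").strip() for row in rows)
    ]


def similar_headers(headers, cutoff=0.85):
    """Return pairs of distinct headers that look like spelling variants."""
    pairs = []
    for a, b in combinations(sorted(set(headers)), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff:
            pairs.append((a, b))
    return pairs


# Made-up headers and rows, for illustration only.
print(similar_headers(["municipio", "comarca", "municipo"]))
print(empty_columns([{"a": "1", "b": ""}, {"a": "2", "b": " "}]))
```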
13. Continuous Transformation
Continuous production cycle
o Update RDF as reports are generated, modified or removed
Executed every night
o Retrieves all the reports from the list (generated before)
o Checks whether the reports have already been transformed or whether they contain new data
o Hash signatures for each generated Data Cube
• https://github.com/aragonopendata/local-data-aragopedia/blob/master/data/resource/hashcode.csv
• Used to compare data versions
• If hashes do not match, the Data Cube is marked to be regenerated
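The hash comparison above can be illustrated with a short sketch. The slides do not say which hash function is used, so SHA-256 is an assumption here, as are the function names and the in-memory dictionaries standing in for hashcode.csv and the nightly downloads.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of one downloaded report (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def cubes_to_regenerate(stored_hashes, downloaded):
    """Return the report ids whose content is new or has changed.

    stored_hashes: {report_id: hex digest from the previous run}
    downloaded:    {report_id: raw bytes from tonight's download}
    """
    return sorted(
        report_id
        for report_id, data in downloaded.items()
        if stored_hashes.get(report_id) != content_hash(data)
    )


# Hypothetical report ids and contents, for illustration only.
previous = {"report-a": content_hash(b"x,y\n1,2\n")}
tonight = {"report-a": b"x,y\n1,3\n", "report-b": b"x,y\n5,6\n"}
print(cubes_to_regenerate(previous, tonight))
```

A report that is missing from the stored hashes is treated as changed, so new reports are picked up by the same check.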
14. Continuous Transformation
Each iteration generates a GitHub issue listing the cubes that must be created, modified, etc.
o https://github.com/aragonopendata/local-data-aragopedia/issues
• https://github.com/aragonopendata/local-data-aragopedia/issues/93 (new data)
• https://github.com/aragonopendata/local-data-aragopedia/issues/457 (data cube to delete, new configurations needed)
o When user interaction is needed, this is reflected in the issue text, and the person responsible at IAEst needs to update it
RDF transformation is done according to the configuration file
o https://github.com/aragonopendata/local-data-aragopedia/blob/master/data/metadata/Informe-01-010001-A-TC-TM-TP.xlsx
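To give an idea of a configuration-driven RDF transformation, the sketch below turns one CSV row into an RDF Data Cube observation in Turtle. The configuration layout (a dictionary per column with `kind`, `property` and `codelist`), all `example.org` URIs and the function name are hypothetical; the real configuration lives in the spreadsheet linked above and may be organised differently.

```python
from urllib.parse import quote

QB = "http://purl.org/linked-data/cube#"  # W3C RDF Data Cube namespace


def observation_turtle(obs_uri, dataset_uri, row, config):
    """Serialise one row as a qb:Observation, following the column config."""
    pairs = [
        ("a", f"<{QB}Observation>"),
        (f"<{QB}dataSet>", f"<{dataset_uri}>"),
    ]
    for column, spec in config.items():
        value = row[column].strip()
        if spec["kind"] == "dimension":
            # Dimension values point at a concept in the column's code list.
            obj = f"<{spec['codelist']}{quote(value)}>"
        elif value.lstrip("-").isdigit():
            obj = str(int(value))          # integer measure
        else:
            obj = str(float(value.replace(",", ".")))  # decimal measure
        pairs.append((f"<{spec['property']}>", obj))
    body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
    return f"<{obs_uri}>\n    {body} ."


# Hypothetical configuration and row, for illustration only.
config = {
    "municipio": {
        "kind": "dimension",
        "property": "http://example.org/prop/municipio",
        "codelist": "http://example.org/kos/municipio/",
    },
    "poblacion": {
        "kind": "measure",
        "property": "http://example.org/prop/poblacion",
    },
}
row = {"municipio": "Teruel", "poblacion": "100"}
print(observation_turtle("http://example.org/obs/1", "http://example.org/cube/1", row, config))
```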
15. Continuous Transformation
A new version of the RDF data is stored in GitHub
o https://github.com/aragonopendata/local-data-aragopedia/tree/master/data/dump/DatosTTL/informes
RDF data is also stored in the Open Data Aragón SPARQL endpoint
o http://opendata.aragon.es/sparql
o Reusing the 3cixty KB deployment utilities
o Each cube is stored in its own graph
o Graphs are updated for Data Structure Definition (DSD), properties and SKOS information
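Because each cube sits in its own named graph, a client can restrict a query to one cube with a GRAPH clause. The sketch below only builds the query text and the request URL for the endpoint given above, without sending anything; the graph URI is a made-up placeholder and the function names are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "http://opendata.aragon.es/sparql"  # endpoint named in the slides


def observations_query(graph_uri, limit=10):
    """SPARQL query listing observations stored in one cube's graph."""
    return (
        "PREFIX qb: <http://purl.org/linked-data/cube#>\n"
        "SELECT ?obs ?p ?o\n"
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?obs a qb:Observation ; ?p ?o . }} }}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(graph_uri, limit=10):
    """GET URL carrying the query in the standard 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": observations_query(graph_uri, limit)})


# Hypothetical graph URI, for illustration only.
print(observations_query("http://example.org/graph/cube-1"))
print(request_url("http://example.org/graph/cube-1"))
```

The result format would normally be negotiated through the HTTP Accept header (for example `application/sparql-results+json`) when the request is actually sent.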
16. Data transformation. In summary…
[Flowchart]
o Sources: bi.aragon.es, Google Drive
o Dataset and configuration download
o For each dataset
• New dataset? Yes: generate new configuration and create an issue
• New structure? Yes: create issue
• New data? Yes: regenerate data and create issue
o Targets: GitHub, SPARQL
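One possible reading of the decision flow above, written as code. The order of the checks and the exact actions are inferred from the flowchart labels, so this is a sketch of the logic rather than the project's implementation; the `DatasetState` class and the action strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetState:
    has_configuration: bool  # False means the dataset has not been seen before
    structure_changed: bool  # columns differ from the stored configuration
    data_changed: bool       # content hash differs from the previous run


def plan(state):
    """Return the list of actions for one dataset in the nightly run."""
    if not state.has_configuration:
        # New dataset: a configuration must be generated and reviewed.
        return ["generate configuration", "create issue"]
    if state.structure_changed:
        # New structure: a person has to update the configuration.
        return ["create issue"]
    if state.data_changed:
        # Same structure, new data: the cube can be rebuilt automatically.
        return ["regenerate data", "create issue"]
    return []  # nothing to do


print(plan(DatasetState(has_configuration=True, structure_changed=False, data_changed=True)))
```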
17. Data publication and use
Data can be accessed through
o API (using Elda)
• http://opendata.aragon.es/herramientas/apis?#aragodbpedia
o GitHub (CSVs, RDF)
o SPARQL endpoint
[Diagram: SPARQL, Elda, Linked Data]
18. Data API
http://opendata.aragon.es/herramientas/apis?#aragodbpedia
19. Data publication and use
Aragopedia
o http://opendata.aragon.es/apps/aragopedia/datos
o Where, when and what (dónde, cuándo y qué)
o Data can be downloaded in
• CSV
• JSON
20. Aragopedia
o JSON result of a query for
• Maestrazgo region (where)
• population (what)
• in 1999 (when)
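The where / what / when lookup can be pictured as a filter over observations whose result is serialised as JSON. The sketch below uses an in-memory list with obvious placeholder values (not real statistics); the record layout and the function name are hypothetical and do not reflect Aragopedia's actual JSON schema.

```python
import json


def query(observations, where, what, when):
    """Return, as a JSON string, the observations matching all three facets."""
    matches = [
        obs for obs in observations
        if obs["where"] == where and obs["what"] == what and obs["when"] == when
    ]
    return json.dumps(matches, ensure_ascii=False)


# Placeholder records: the values 1, 2 and 3 are not real figures.
observations = [
    {"where": "Maestrazgo", "what": "population", "when": 1999, "value": 1},
    {"where": "Maestrazgo", "what": "population", "when": 2000, "value": 2},
    {"where": "Teruel", "what": "population", "when": 1999, "value": 3},
]
print(query(observations, where="Maestrazgo", what="population", when=1999))
```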
21. Conclusions (Results)
An easier-to-maintain data transformation process
o Enriching existing Linked Data APIs from Aragón
o Using GitHub for
• Version control and archival
• Continuous updates: detecting new data and data structures
on a daily basis
• https://github.com/aragonopendata/local-data-aragopedia/
Developer-friendly API
Additional user interface
o Improving data retrieval and browsing capabilities
Side effect: data curation
o Many errors and improvements detected in pre-existing CSV
exports, which have been corrected throughout the process