Statistical Data in RDF
Knowledge Engineering Group
Seminar, November 4th 2010
Jindřich Mynarz
@jindrichmynarz
Scope of the talk
• not microdata (e.g., survey data)
• but aggregated data (e.g., averages)
• only RDF
• overview of exis...
RDF
• separation of content and layout
o in tabular data table layout defines
the way of interpretation
• flexible, schema...
Existing statistics in RDF
• CIA World Factbook
• U.S. Census 2000 dataset
• LOIUS - Italian linked university statistics
...
Eurostat data
• Freie Universität Berlin - D2R Server
• riese (RDFizing and Interlinking the EuroStat Data Set
Effort)
• O...
Governmental statistics
• data.gov
• data.gov.uk
o EnAKTing mashups and data visualizations
o population, crime, CO2 emiss...
Data modelling
• what is being modelled?
o the real world
o a part of the real world
o statistics
• two parts of modelling...
Structural semantics
• means of expression for the cube's
structure
• groups, slices, time series
• addressed in Data Cube...
Domain semantics
• how a dataset refers to the
things that it is about
• connecting statistical
observations to the model ...
Vocabularies
• number of ad hoc vocabularies
• riese
• SCOVO
• SCOVOLink
• Data Cube
• SDMX/RDF
SCOVO
• The Statistical Core Vocabulary
• inspired by riese vocabulary
• modelling of dimensions and observations as separ...
Data Cube
• inspired by SCOVO
o added expressive power
• generalization from SDMX/RDF
• re-use of SKOS for codelists
Data Cube
Data Cube
• dimensions (rdf:Property)
• coded values (skos:Concept)
Data Cube
SDMX/RDF
• Statistical Data and Metadata eXchange reformulated in
RDF
• built on top of Data Cube
• contains:
o sdmx
o sdm...
Important parts of modelling
• re-use
• units
• time
• identifiers
• URI patterns
Re-use oriented design
• re-purposing parts of the
existing datasets
• re-using shared
vocabularies
• vocabulary hi-jackin...
Units of measurement
• implicit
o “78693011 mˆ2”, “117 
b”
o eurostat:total_ar
ea_km2
• explicit
o :unit, sdmx-
attribute:...
Modelling of time
• exclusion of the dimension of time (D2R Eurostat, U.S. 
Census 2000)
• time dimension (riese, SDMX/RDF...
Identifiers
• blank nodes
• URIs
• HTTP URIs
URI design patterns
• on the Web
o http://
• human-readable
o what/is/this/about
• clustered by resource type
o type/uniqu...
Following steps
• data conversion
• interlinking dataset's resources
• linking external datasets
• publishing
Legacy datasets
• statistics-specific data formats
• implicit context of interpretation
• parsing, cleaning
• conversion m...
Linking
• re-use by reference
• lightweight intergration
• linkable data
• linking properties
o e.g., owl:sameAs, 
skos:cl...
Publishing data
• new dissemination standards
• exchanging data with the Web
• RDF dumps
• linked data distribution
• SPAR...
Linked open data cloud
Benefits
• data can be intergrated
• open data
• re-usable data
• data available for applications
Integration
• combining and merging with other datasets
• re-use oriented design
Open data
• freedom of information for public sector information
• open licences
o Creative Commons, Open Government Licen...
Anyone can solve the cube
• data is available for individual
analysis
• offices for national statistics
still have the mon...
Building on top of statistical data
• once the data is available
useful applications can be
built on top of it
• data visu...
Questions!
Thank you for attention!
Image credits
Semantic Web Rubik's Cube. http://www.flickr.com/photos/dullhunk/3448804778/
Rubik's Cube. http://www.flickr...
Upcoming SlideShare
Loading in...5
×

Statistical data in RDF

3,121

Published on

Slides for a seminar at Knowledge Engineering Group, November 4th, 2010.

Published in: Technology, Education
0 Comments
7 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,121
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
60
Comments
0
Likes
7
Embeds 0
No embeds

No notes for slide

Statistical data in RDF

  1. 1. Statistical Data in RDF Knowledge Engineering Group Seminar, November 4th 2010 Jindřich Mynarz @jindrichmynarz
  2. 2. Scope of the talk • not microdata (e.g., survey data) • but aggregated data (e.g., averages) • only RDF • overview of existing statistical datasets
  3. 3. RDF • separation of content and layout o in tabular data table layout defines the way of interpretation • flexible, schema-less data format o not overly inclusive, nor overly exclusive
  4. 4. Existing statistics in RDF • CIA World Factbook • U.S. Census 2000 dataset • LOIUS - Italian linked university statistics • Linked Environment Data • EnAKTing datasets • data.gov.uk datasets
  5. 5. Eurostat data • Freie Universität Berlin - D2R Server • riese (RDFizing and Interlinking the EuroStat Data Set Effort) • OntologyCentral - real-time wrapper • Eurostat's own RDF datasets
  6. 6. Governmental statistics • data.gov • data.gov.uk o EnAKTing mashups and data visualizations o population, crime, CO2 emissions, transport, agriculture, education...
  7. 7. Data modelling • what is being modelled? o the real world o a part of the real world o statistics • two parts of modelling o structural semantics o domain semantics
  8. 8. Structural semantics • means of expression for the cube's structure • groups, slices, time series • addressed in Data Cube vocabulary
  9. 9. Domain semantics • how a dataset refers to the things that it is about • connecting statistical observations to the model of the domain described by them • domain is a set of non- information resources
  10. 10. Vocabularies • number of ad hoc vocabularies • riese • SCOVO • SCOVOLink • Data Cube • SDMX/RDF
  11. 11. SCOVO • The Statistical Core Vocabulary • inspired by riese vocabulary • modelling of dimensions and observations as separate resources • lightweight, easy to adopt • SCOVOLink addresses domain semantics
  12. 12. Data Cube • inspired by SCOVO o added expressive power • generalization from SDMX/RDF • re-use of SKOS for codelists
  13. 13. Data Cube
  14. 14. Data Cube • dimensions (rdf:Property) • coded values (skos:Concept)
  15. 15. Data Cube
  16. 16. SDMX/RDF • Statistical Data and Metadata eXchange reformulated in RDF • built on top of Data Cube • contains: o sdmx o sdmx-attribute o sdmx-code o sdmx-concept o sdmx-dimension o sdmx-measure o sdmx-metadata o sdmx-subject 
  17. 17. Important parts of modelling • re-use • units • time • identifiers • URI patterns
  18. 18. Re-use oriented design • re-purposing parts of the existing datasets • re-using shared vocabularies • vocabulary hi-jacking and extension
  19. 19. Units of measurement • implicit o “78693011 mˆ2”, “117  b” o eurostat:total_ar ea_km2 • explicit o :unit, sdmx- attribute:unitMea sure
  20. 20. Modelling of time • exclusion of the dimension of time (D2R Eurostat, U.S.  Census 2000) • time dimension (riese, SDMX/RDF) o dimension:Time, sdmx:TimeRole o time series
  21. 21. Identifiers • blank nodes • URIs • HTTP URIs
  22. 22. URI design patterns • on the Web o http:// • human-readable o what/is/this/about • clustered by resource type o type/unique-id • standardized o {provider 1}/path/to/an/observation o {provider 2}/path/to/an/observation • hierarchical o {broader}/{narrower}  • reflecting the location of an observation in a data cube o {dimension 1}/{dimension 2}
  23. 23. Following steps • data conversion • interlinking dataset's resources • linking external datasets • publishing
  24. 24. Legacy datasets • statistics-specific data formats • implicit context of interpretation • parsing, cleaning • conversion mechanisms o SQL DB wrappers (e.g.,  D2R Server) o real-time exporters (e.g.,  OntologyCentral) o RDFizers (e.g. RDF123) o custom-built scripts
  25. 25. Linking • re-use by reference • lightweight intergration • linkable data • linking properties o e.g., owl:sameAs,  skos:closeMatch
  26. 26. Publishing data • new dissemination standards • exchanging data with the Web • RDF dumps • linked data distribution • SPARQL • RDFa
  27. 27. Linked open data cloud
  28. 28. Benefits • data can be intergrated • open data • re-usable data • data available for applications
  29. 29. Integration • combining and merging with other datasets • re-use oriented design
  30. 30. Open data • freedom of information for public sector information • open licences o Creative Commons, Open Government Licence... • public domain
  31. 31. Anyone can solve the cube • data is available for individual analysis • offices for national statistics still have the monopoly on data collection, but no longer on interpretation of that data • data-driven journalism
  32. 32. Building on top of statistical data • once the data is available useful applications can be built on top of it • data visualizations • data analysis tools
  33. 33. Questions!
  34. 34. Thank you for attention!
  35. 35. Image credits Semantic Web Rubik's Cube. http://www.flickr.com/photos/dullhunk/3448804778/ Rubik's Cube. http://www.flickr.com/photos/bramus/3249196137/ Hypercube. http://commons.wikimedia.org/wiki/File:Hypercube.png PICOL: Pictorial communication language. http://picol.org/ Dictionary. http://www.flickr.com/photos/horiavarlan/4268897748/ Oops! http://www.flickr.com/photos/rore/299375688/ Tape Measure. http://www.flickr.com/photos/wwarby/4915969081/ Rubik's Cube 1. http://www.flickr.com/photos/lifeontheedge/374960949/ Detroit's Skyline. http://www.flickr.com/photos/showmeone/4154861617/ Linked Oped Data Cloud. http://richard.cyganiak.de/2007/10/lod/ Cube. http://followtherhythm.deviantart.com/art/cube-128329792 Data Cube diagram. http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/qb- fig1.png
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×