Validating statistical index data
represented in RDF using
SPARQL queries
WESO Research Group
University of Oviedo, Spain
...
Motivation
The WebIndex Project
Measure impact of the Web in different countries
First publication: 2012, 2 web sites:
Vis...
WebIndex Workflow
Raw Data
Excel
Computed data
Excel
External
sources
All data
RDF Datastore
Visualizations
Technical details
Index made from
61 countries (100 planned for 2013)
85 indicators:
51 Primary (questionnaires)
34 Second...
Country 2009 2010 2011
Spain 4 5 3
Finland 4 5 6
Armenia 1 1 1
Chile 6 8 10.6
Country 2009 2010 2011
Spain 4 5 3
Finland 4...
WebIndex
computation Process (2)
Simplified with one indicator, 3 years and 4 countries
Country 2009 2010 2011
Spain -0.57...
Example of data
representation in RDF
Example:
obs:obsM23 a qb:Observation ;
cex:computation
[ a cex:Z-Score ;
cex:observa...
Computex
Vocabulary for statistical computations
Extends RDF Data Cube
Ontology at: http://purl.org/weso/computex
Some ter...
Validation process
Last year (2012)
Template based validation
MD5 checksum of each observation and some fields
This year (...
Validation approach
SPARQL CONSTRUCT queries instead of ASK
IF no error THEN empty model
ELSE RDF graph with error
informa...
ASK WHERE {
?dim a qb:DimensionProperty .
FILTER NOT EXISTS { ?dim rdfs:range [] }
}
SPARQL queries
RDF Data Cube
We conve...
SPARQL expressivity
SPARQL can express complex validation patterns.
Example: Mean
CONSTRUCT {
[ a cex:Error ;
cex:errorPar...
Limitations of
SPARQL expressivity
Some built-in functions are not standardized
Example: z-score employs standard deviatio...
SPARQL expressivity
CONSTRUCT { # ... omitted for brevity
} WHERE {
?obs cex:computation [a cex:AverageGrowth; cex:observa...
RDF Profiles
Descriptions (with constraints) of dataset families
Can be used to validate (and compute) RDF graphs
Examples...
Cube Profile
W3c Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)
1 base ontology + RDF Schema inference
2...
Computex Profile
1 base ontology
http://purl.org/weso/computex
Imports RDF Data Cube
cex:computex a cex:ValidationProfile ...
Computex Validation tool
Available at:
http://computex.herokuapp.com
Features
Informative error messages
Option to generat...
Computex Validation Tool
http://computex.herokuapp.com
Conclusions
WebIndex = use case for RDF Validation
SPARQL queries as an implementation approach
Challenges:
Expressivity l...
END OF PRESENTATION
COMPLEMENTARY SLIDES
Limits of
SPARQL expressivity
Ranking of values using GROUP_CONCAT
SELECT ?x ?v ?ranking {
?x :value ?v .
{ SELECT (GROUP_...
Limits of
SPARQL expressivity
Ranking of values comparing all values
SELECT ?x ?v (COUNT(*) as ?ranking) WHERE {
?x :value...
Upcoming SlideShare
Loading in …5
×

Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

7,620 views

Published on

Presented at W3c RDF Validation Workshop
https://www.w3.org/2012/12/rdf-val/

Published in: Technology, Education

Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

  1. 1. Validating statistical index data represented in RDF using SPARQL queries WESO Research Group University of Oviedo, Spain Jose Emilio Labra Gayo Jose María Álvarez Rodríguez
  2. 2. Motivation The WebIndex Project Measure impact of the Web in different countries First publication: 2012, 2 web sites: Visualization http://thewebindex.org Data portal http://data.webfoundation.org
  3. 3. WebIndex Workflow Raw Data Excel Computed data Excel External sources All data RDF Datastore Visualizations
  4. 4. Technical details Index made from 61 countries (100 planned for 2013) 85 indicators: 51 Primary (questionnaires) 34 Secondary (external sources) WebIndex RDF data Modeled on top of RDF Data Cube More than 1 million triples Linked data: DBPedia, Organizations, etc. Records data computation
  5. 5. Country 2009 2010 2011 Spain 4 5 3 Finland 4 5 6 Armenia 1 1 1 Chile 6 8 10.6 Country 2009 2010 2011 Spain 4 5 3 Finland 4 5 6 Armenia 1 1 1 Chile 6 8 10.6 Country 2009 2010 2011 Spain 4 5 3 Finland 4 5 6 Armenia 1 1 1 Chile 6 8 10.6 Country 2009 2010 2011 Spain 4 5 3 Finland 4 5 6 Armenia 1 1 1 Chile 6 8 10.6 Country 2009 2010 2011 Spain 4 5 3 Finland 4 6 Armenia 1 Chile 6 8 Country 2009 2010 2011 Spain 4 5 3 Finland 4 6 Armenia 1 Chile 6 8 WebIndex computation process (1) Simplified with one indicator, 3 years and 4 countries Raw Data Country 2009 2010 2011 Spain 4 5 3 Finland 4 5 6 Armenia 1 1 1 Chile 6 8 10.6 Imputed Data Filtered Data (Indicator A) Country 2009 2010 2011 Spain -0.57 -0.57 -0.92 Finland -0.57 -0.57 -0.14 Chile 1.15 1.15 1.06 Normalized Data (z-scores) Country 2009 2010 2011 Spain 4 5 3 Finland 4 6 Armenia 1 Chile 6 8 Country 2009 2010 2011 Spain 4 5 3 Finland 4 5 6 Armenia 1 1 1 Chile 6 8 10.6 Country 2009 2010 2011 Spain -0.57 -0.57 -0.92 Finland -0.57 -0.57 -0.14 Chile 1.15 1.15 1.06 Country 2009 2010 2011 Spain -0.57 -0.57 -0.92 Finland -0.57 -0.57 -0.14 Chile 1.15 1.15 1.06 More details can be found here: http://thewebindex.org/about/methodology/computation/
  6. 6. WebIndex computation Process (2) Simplified with one indicator, 3 years and 4 countries Country 2009 2010 2011 Spain -0.57 -0.57 -0.92 Finland -0.57 -0.57 -0.14 Chile 1.15 1.15 1.06 Normalized Data (z-scores) Country 2009 2010 2011 Spain -0.57 -0.57 -0.92 Finland -0.57 -0.57 -0.14 Chile 1.15 1.15 1.06 Country 2009 2010 2011 Spain -0.57 -0.57 -0.92 Finland -0.57 -0.57 -0.14 Chile 1.15 1.15 1.06 More details can be found here: http://thewebindex.org/about/methodology/computation/ Adjusted data Country A B C D ... Spain 8 7 9.1 7.1 ... Finland 7 8 7.1 8 ... Chile 8 9 7.6 6 ... Group indicators Country Readiness Impact Web Composite Spain 5.7 3.5 5.1 4.5 Finland 5.5 3.9 7.1 4.9 Chile 6.7 4.5 7.6 5.1 Rankings Country Readiness Impact Web Composite Spain 2 3 3 3 Finland 3 2 2 2 Chile 1 1 1 1
  7. 7. Example of data representation in RDF Example: obs:obsM23 a qb:Observation ; cex:computation [ a cex:Z-Score ; cex:observation obs:obsA23 ; cex:slice slice:sliceA09 ; ] ; cex:value -0.57 ; cex:md5-checksum "2917835203..." ; cex:indicator indicator:A ; cex:concept country:ESP ; qb:dataSet dataset:A-Normalized ; # ... other declarations omitted for brevity It declares that the value of this observation was obtained as z-score of obs:obsA23 over slice:sliceA09 Each observation follows the RDF Data Cube vocabulary extented with metadata about how it was obtained
  8. 8. Computex Vocabulary for statistical computations Extends RDF Data Cube Ontology at: http://purl.org/weso/computex Some terms: cex:Concept cex:Indicator cex:Computation cex:WeightSchema qb:Observation ...
  9. 9. Validation process Last year (2012) Template based validation MD5 checksum of each observation and some fields This year (2013) SPARQL based validation 3 levels of validation RDF Data Cube Shapes of data Computation process Ultimate goal: automatically compute the index
  10. 10. Validation approach SPARQL CONSTRUCT queries instead of ASK IF no error THEN empty model ELSE RDF graph with error informationCONSTRUCT { [ a cex:Error ; cex:errorParam [cex:name "obs"; cex:value ?obs ] , [cex:name "value1"; cex:value ?value1 ] , [cex:name "value2"; cex:value ?value2 ] ; cex:msg "Observation has two different values" . ] } WHERE { ?obs a qb:Observation . ?obs cex:value ?value1 . ?obs cex:value ?value2 . FILTER ( ?value1 != ?value2 ) }
  11. 11. ASK WHERE { ?dim a qb:DimensionProperty . FILTER NOT EXISTS { ?dim rdfs:range [] } } SPARQL queries RDF Data Cube We converted RDF Data Cube integrity constraints from ASK to CONSTRUCT queries CONSTRUCT { [ a cex:Error ; cex:errorParam [cex:name "dim"; cex:value ?dim ] ; cex:msg "Every Dimension must have a declared range" . ] } WHERE { ?dim a qb:DimensionProperty . FILTER NOT EXISTS { ?dim rdfs:range [] } }
  12. 12. SPARQL expressivity SPARQL can express complex validation patterns. Example: Mean CONSTRUCT { [ a cex:Error ; cex:errorParam # ...omitted cex:msg "Mean value does not match" ] . } WHERE { ?obs a qb:Observation ; cex:computation ?comp ; cex:value ?val . ?comp a cex:Mean . { SELECT (AVG(?value) as ?mean) ?comp WHERE { ?comp cex:observation ?obs1 . ?obs1 cex:value ?value ; } GROUP BY ?comp } FILTER( abs(?mean - ?val) > 0.0001) }}
  13. 13. Limitations of SPARQL expressivity Some built-in functions are not standardized Example: z-score employs standard deviation. It requires built-in function: sqrt Available in some SPARQL implementations Ranking of values was not obvious (*) 2 solutions: Using GROUP_CONCAT Check a value against all the other values Neither solution is efficient (*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql
  14. 14. SPARQL expressivity CONSTRUCT { # ... omitted for brevity } WHERE { ?obs cex:computation [a cex:AverageGrowth; cex:observations ?ls] ; cex:value ?val . ?ls rdf:first [cex:value ?v1 ] . { SELECT ( SUM(?v_n / ?v_n1)/COUNT(*) as ?meanGrowth) WHERE { ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ; rdf:rest [ rdf:first [ cex:value ?v_n1 ]]] . } } FILTER (abs(?meanGrowth * ?v1 - ?val) > 0.001) } * This solution was provided by Joshua Taylor
  15. 15. RDF Profiles Descriptions (with constraints) of dataset families Can be used to validate (and compute) RDF graphs Examples: Cube (Statistical data) Normalization Update queries + Integrity constraints Computex (statistical computations) Adds more queries Other user defined profiles Cube Computex WebIndexCloudIndex
  16. 16. Cube Profile W3c Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/) 1 base ontology + RDF Schema inference 2 update queries (normalization algorithm) 21 integrity constraint queries cex:cube a cex:ValidationProfile ; cex:name "RDF Data Cube" ; cex:ontologyBase cube:cube.ttl ; cex:expandSteps ( [ cex:name "Closure" ; cex:uri cube:closure.ru ] [ cex:name "Flatten" ; cex:uri cube:flatten.ru ] ) ; cex:integrityQuery [ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ]; # ... .
  17. 17. Computex Profile 1 base ontology http://purl.org/weso/computex Imports RDF Data Cube cex:computex a cex:ValidationProfile ; cex:name "Computex" ; cex:ontologyBase cex:computex.ttl ; cex:import profile:cube.ttl ; cex:expandSteps ( [ cex:name "Copy Raw"; cex:uri cex:copyRaw.sparql ] # Other UPDATE queries ) ; cex:integrityQuery [ cex:name "Adjusted" cex:uri cex:adjusted.sparql ] , # Other integrity queries
  18. 18. Computex Validation tool Available at: http://computex.herokuapp.com Features Informative error messages Option to generate EARL report 2 validation profiles (Cube and Computex) Source code in Scala available at: http://github.com/weso/computex Based on profiles
  19. 19. Computex Validation Tool http://computex.herokuapp.com
  20. 20. Conclusions WebIndex = use case for RDF Validation SPARQL queries as an implementation approach Challenges: Expressivity limits of SPARQL Complexity of some queries Concept of RDF Profiles Declarative, Turtle syntax Intermediate level between OWL and SPARQL
  21. 21. END OF PRESENTATION
  22. 22. COMPLEMENTARY SLIDES
  23. 23. Limits of SPARQL expressivity Ranking of values using GROUP_CONCAT SELECT ?x ?v ?ranking { ?x :value ?v . { SELECT (GROUP_CONCAT(?x;separator="") as ?ordered) { { SELECT ?x { ?x :value ?v . } ORDER BY DESC(?v) } } } BIND (str(?x) as ?xName) BIND (strbefore(?ordered,?xName) as ?before) BIND ((strlen(?before) / strlen(?xName)) + 1 as ?ranking) } ORDER BY ?ranking
  24. 24. Limits of SPARQL expressivity Ranking of values comparing all values SELECT ?x ?v (COUNT(*) as ?ranking) WHERE { ?x :value ?v . [] :value ?u . FILTER( ?v <= ?u ) } GROUP BY ?x ?v ORDER BY ?ranking * This solution was provided by Joshua Taylor

×