2. The Problem
• Knowledge
• Provenance
[Figure: Regulation A, with Art. 12 and Art. 14, paragraph 3, 2nd sentence (shown twice); dates shown: 01-01-2011, 04-02-2011, 11-06-2008, 01-07-2011]
• Open Data: public service falls short
• Large scale validation of CEN MetaLex
• “Linked Open Government Data”
3. Current Situation
Public content services hosted at wetten.nl
4. Wetten.nl XML Service
http://wetten.overheid.nl/xml.php?regelingID=...
• Only available format is BWB XML
• Only current version
• Content at document level
• Identification at document level
• Identifiers are not dereferenceable
• Hardly any metadata (e.g. version date)
• Only available context is position in text
6. Identifiers & Juriconnect
1.0:c:BWBR0005416&artikel=6
vs
http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005416/article=6/date=2005-01-14
vs
http://wetten.overheid.nl/BWBR0005416/TitelII698946/HoofdstukII/Artikel16/geldigheidsdatum_14-01-2005
• Juriconnect?
• URN-based... but no naming server
• (cf. Digital Object Identifiers)
• Named elements do not carry identifier
• No explicit version information, only contextual
7. Sources used...
• List of all regulations in “XML”
• Wetten.nl XML Service
• Metadata in HTML table on wetten.nl
(the “info page”)
• ... so let’s get started already
9. Our Goals
• “Deserialize” regulation content
(e.g. topic-based browsing)
• Extract and reconstruct implicit information
(identifiers, metadata)
• Annotate regulations
(reconstructed metadata, third-party metadata)
• Annotate using regulations
(knowledge based systems, services, business processes ...)
• Accessible and reusable for any other party
(shared vocabularies, standard access)
10. Requirements
• Unique, persistent identification
• Generic XML structure of documents
• Extensible metadata framework
• Flexible web services
11. Technology Choices
• URL-like URIs
• CEN MetaLex XML documents
• Linked Data / RDF metadata
(extensibility to OWL, RIF)
• Transparent REST-services
12. Step 2
Come up with persistent identifiers at element level and a solid versioning scheme
13. Identification
• Web-enabled “URL-like” URIs
• e.g. http://doc.metalex.eu/....
• “Cool” URIs (http://www.w3.org/TR/cooluris/)
• “Accept”-header based dereferencing
• Different types of content at same URI
14. Levels of Identification
• IFLA FRBR levels
• Work
• Expression
• Manifestation
[Figure: FRBR bibliographic entities. An Expression realizes a Work, a Manifestation embodies an Expression, an Item exemplifies a Manifestation. Examples: Work = regulation; Expression = version of regulation; Manifestation = XML version of regulation; Item = XML version of regulation on my harddisk]
15. Transparent Identifiers
• Hierarchical information (work)
http://doc.metalex.eu/id/BWBR0011823/hoofdstuk/1/artikel/1
http://doc.metalex.eu/id/BWBR0011823/artikel/1
• Version and language (expression)
http://doc.metalex.eu/id/BWBR0011823/hoofdstuk/1/artikel/1/nl/2010-09-01
• Format information (manifestation)
http://doc.metalex.eu/doc/BWBR0011823/hoofdstuk/1/artikel/1/nl/2010-09-01/data.xml
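The URI patterns on this slide can be sketched as three small helpers. This is a minimal illustration that only reproduces the example URIs shown above; the function names and the `fmt` parameter are hypothetical and not part of the MetaLex Document Server.

```python
BASE = "http://doc.metalex.eu"


def work_uri(bwb_id, path=()):
    """Work level: regulation identifier plus hierarchical path segments."""
    return "/".join([BASE, "id", bwb_id, *path])


def expression_uri(bwb_id, path, lang, version_date):
    """Expression level: work URI plus language and version date."""
    return "/".join([work_uri(bwb_id, path), lang, version_date])


def manifestation_uri(bwb_id, path, lang, version_date, fmt="xml"):
    """Manifestation level: '/doc/' instead of '/id/', plus a data file name."""
    return "/".join([BASE, "doc", bwb_id, *path, lang, version_date, "data." + fmt])


article = ("hoofdstuk", "1", "artikel", "1")
print(work_uri("BWBR0011823", article))
print(expression_uri("BWBR0011823", article, "nl", "2010-09-01"))
print(manifestation_uri("BWBR0011823", article, "nl", "2010-09-01"))
```

Keeping the three levels as separate functions mirrors the FRBR distinction: the same path segments are reused, and only the suffix (and the `/id/` versus `/doc/` prefix) changes.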
16. Problem
• URIs don’t carry semantics...
• Detect changes:
• which element versions are the same
• ... and which versions are different?
[Figure: Art. 44, paragraph 4, version of 2011-03-26 next to the version of 2011-04-05; from: Besluit prudentiële regels Wft, BWBR0020420]
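One possible way to tell identical element versions from changed ones is to compare content fingerprints. The sketch below is an assumption for illustration only: the slides do not say how changes are detected, and the whitespace normalisation, the SHA-1 choice, and the sample texts are all hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the element text after collapsing whitespace (assumed normalisation)."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def changed_versions(versions):
    """Given (date, text) pairs sorted by date, return the dates where the text changed."""
    changed = []
    previous = None
    for version_date, text in versions:
        current = fingerprint(text)
        if current != previous:
            changed.append(version_date)
        previous = current
    return changed


# Hypothetical element texts; only the first and third differ in content.
versions = [
    ("2011-03-26", "Lid 4  voorbeeldtekst"),
    ("2011-04-05", "Lid 4 voorbeeldtekst"),
    ("2011-05-01", "Lid 4 gewijzigde voorbeeldtekst"),
]
print(changed_versions(versions))
```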
19. Procedure
For each BWB XML file listed,
if update has occurred since latest run,
download latest version,
scrape metadata, and
produce:
Persistent URIs
CEN MetaLex + Citations
Inline RDFa (optional) or RDF graph (optional),
Pajek “.net” files (optional)
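The procedure above can be sketched as a simple update loop. This is a hedged outline, not the actual conversion script: the listing structure (`bwb_id`, `updated`) and the three callables `download`, `scrape_metadata` and `convert` are hypothetical placeholders for the real steps.

```python
from datetime import date


def run_update(listing, last_run, download, scrape_metadata, convert):
    """Process every listed regulation that was updated after the latest run."""
    results = []
    for entry in listing:
        if entry["updated"] <= last_run:
            continue  # no update since the latest run, skip
        xml = download(entry["bwb_id"])              # latest BWB XML version
        metadata = scrape_metadata(entry["bwb_id"])  # e.g. from the "info page"
        # convert() stands in for minting URIs and producing MetaLex, RDF, Pajek output
        results.append(convert(entry["bwb_id"], xml, metadata))
    return results


# Usage with stub callables (all values are illustrative).
listing = [
    {"bwb_id": "BWBR0011823", "updated": date(2011, 4, 5)},
    {"bwb_id": "BWBR0005416", "updated": date(2011, 1, 1)},
]
converted = run_update(
    listing,
    date(2011, 3, 1),
    download=lambda bwb_id: "<xml/>",
    scrape_metadata=lambda bwb_id: {"id": bwb_id},
    convert=lambda bwb_id, xml, metadata: (bwb_id, xml, metadata),
)
print(converted)
```

Passing the steps in as callables keeps the loop testable without network access.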
20. CEN MetaLex
• Straightforward 1:1 mapping
• ... some minor fixes
• Mint URIs on the fly
• Convert citations on the fly
• Generate metadata on the fly
• “inline” inside mcontainer elements
21. Results
(full description in draft ISWC 2011 paper)
Table 1. Conversion performance for 300 randomly selected regulations.
Substitutions   Number     %      Corrections   Number     %
container        22312   29 %     artikel         2525   72 %
hcontainer        3730    5 %     divisie          519   15 %
htitle            3730    5 %     colspec          289    8 %
block            34325   44 %     illustratie       54    2 %
inline           13527   17 %     others            99    3 %
Total            77624            Total           3486
Total no. of regulations    300
Revoked regulations         109   30 %
Correction %                4 %
Excerpts from the paper:
"Lastly, the MDS offers a simple search interface for finding regulations based on the title and version date."
"We ran the MetaLex conversion script on all regulations available through the wetten.nl portal, resulting in a total of 27,687 versions of regulations being con- [...]"
23. Metadata Vocabularies
• “RDFized” BWB elements
• MetaLex ontology
• FRBR type, modification events, structure
• Dublin Core
• title, alternativeTitle, version
• FOAF
• page, homepage
• Simple Event Model (SEM)
• Open Provenance Model vocabulary (OPMV)
• W3C Time Ontology
25. Events & Provenance
[Figure: RDF graph linking an expression of a regulation to its creation event, its generating process and its creation date]
• The expression (version) URI of a regulation: http://doc.metalex.eu/id/BWBR0017869/2009-10-23
• The process that generated the expression: http://doc.metalex.eu/id/process/BWBR0017869/2009-10-23
• The creation event of the regulation: http://doc.metalex.eu/id/event/BWBR0017869/2009-10-23
• The date at which the expression was created: http://doc.metalex.eu/id/date/2009-10-23, with value "2009-10-23"^^xsd:date
• Classes shown: ml:BibliographicExpression, opmv:Artifact, opmv:Process, sem:Event, ml:LegislativeModification, time:Instant, ml:Date, sem:Time
• Properties shown: rdf:type, rdf:value, opmv:wasGeneratedBy, opmv:wasGeneratedAt, ml:resultOf, ml:date, sem:eventType, sem:hasTime, sem:hasTimeStamp, sem:timeType, time:hasEnd, time:inXSDDateTime
27. Document Serving
• RESTful API
• Implement Cool URIs
(Dereference to XML, RDF, .net)
• Shorthands (‘/latest’)
• SPARQL endpoint
• Citation graphs
• Rudimentary (and unpredictable) search
• CSS Stylesheet for CEN MetaLex XML
28. Dereferencing (RDF)
1 Client requests URI http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01 with header "Accept: application/x-turtle"
2 Server redirects to manifestation URI (HTTP 303): http://doc.metalex.eu/doc/BWBR0011823/nl/2010-09-01/data.ttl
3 Server queries triplestore (SPARQL query) for Symmetric Concise Bounded Description (SCBD), see http://www.w3.org/Submission/CBD
4 Triplestore returns SCBD (JSON serialisation of SCBD)
5 MDS returns Turtle (file containing Turtle serialisation of SCBD)
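The triplestore query in step 3 could look roughly like the sketch below. It is a simplification and an assumption: it collects all triples with the resource as subject or object, but does not follow blank nodes as the full CBD definition requires, and the function name `scbd_query` is hypothetical.

```python
def scbd_query(uri):
    """Build a simplified SCBD query: triples with `uri` as subject or as object."""
    # Reject characters that may not appear inside a SPARQL IRI reference.
    if any(ch in uri for ch in '<>"{}|^`\\ '):
        raise ValueError("unsafe character in URI: %r" % uri)
    ref = "<" + uri + ">"
    return (
        "CONSTRUCT { " + ref + " ?p ?o . ?s ?q " + ref + " . }\n"
        "WHERE { { " + ref + " ?p ?o } UNION { ?s ?q " + ref + " } }"
    )


print(scbd_query("http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01"))
```

The resulting string would be sent to the SPARQL endpoint; how the server serialises the result as Turtle is not covered here.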
29. Dereferencing (XML)
1 Client requests URI http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01 with header "Accept: text/xml"
2 Server redirects to manifestation URI (HTTP 303): http://doc.metalex.eu/doc/BWBR0011823/nl/2010-09-01/data.xml
3 Server queries file store for XML manifestation (manifestation glob)
4 If no manifestation exists, extract from parent
5 Triplestore returns URI of manifestation
6 MDS redirects to manifestation URI (HTTP 302), the location of the manifestation: http://doc.metalex.eu/files/BWBR0011823_2010-03-01_mls.xml
(Clients may render XML using CSS stylesheet)
30. Dereferencing (...)
• Other RDF syntaxes
application/rdf+xml, text/rdf+n3
• HTML clients
application/xml, application/xhtml+xml, text/html
• Redirect (303) to Marbles browser
• Pajek clients
text/plain
• Download .net file
• View using Gephi Toolkit
http://gephi.org
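The "Accept"-header based dereferencing can be sketched as a lookup from media type to manifestation file. This is a minimal illustration under stated assumptions: `data.ttl` and `data.xml` appear in the slides, while `data.rdf`, `data.n3` and `data.net` are assumed file names; q-values are ignored, and the redirect of HTML clients to the Marbles browser is left out (such requests get a 406 here).

```python
ID_PREFIX = "http://doc.metalex.eu/id/"
DOC_PREFIX = "http://doc.metalex.eu/doc/"

FORMATS = {
    "application/x-turtle": "data.ttl",  # shown in the slides
    "text/xml": "data.xml",              # shown in the slides
    "application/rdf+xml": "data.rdf",   # assumed file name
    "text/rdf+n3": "data.n3",            # assumed file name
    "text/plain": "data.net",            # Pajek clients; assumed file name
}


def redirect_for(id_uri, accept):
    """Return (status, location) for an identifier URI and an Accept header."""
    if not id_uri.startswith(ID_PREFIX):
        raise ValueError("not an identifier URI: %r" % id_uri)
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()  # drop parameters such as q
        if media_type in FORMATS:
            location = DOC_PREFIX + id_uri[len(ID_PREFIX):] + "/" + FORMATS[media_type]
            return 303, location
    return 406, None


print(redirect_for("http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01", "application/x-turtle"))
```

Taking the first supported media type in header order is a simplification; a fuller implementation would rank by q-value.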
31. Technical Details
• Current situation
• +/- 27 thousand regulations
• 87.9 million triples (legislation.gov.uk: 1.9 billion)
• Updated daily
• Technical details
• Dell PowerEdge T110 II, 32GB RAM
• Garlik 4Store triplestore (http://4store.org)
• Python Django web applications
• Tomcat servlet + Gephi Toolkit API
• See http://doc.metalex.eu
32. Step 5
Use: social network analysis and concept extraction (ongoing work)
33. Network Analysis
• Impact of regulation on other regulations
(combine with work on court rulings)
• Connectedness
• “Importance” of articles
• Analysis tools
• Pajek, Gephi
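A citation graph can be written in Pajek ".net" format and given a first rough "importance" score with a few lines of code. The sketch below is illustrative only: the citation pairs are made up, and using in-degree (number of incoming citations) as the importance measure is an assumption, not the analysis described in the talk.

```python
from collections import Counter


def to_pajek(citations):
    """Serialise a list of (citing, cited) pairs as a Pajek .net document."""
    nodes = sorted({node for pair in citations for node in pair})
    index = {node: i for i, node in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = ["*Vertices %d" % len(nodes)]
    lines += ['%d "%s"' % (index[node], node) for node in nodes]
    lines.append("*Arcs")
    lines += ["%d %d" % (index[src], index[dst]) for src, dst in citations]
    return "\n".join(lines) + "\n"


def in_degree(citations):
    """Count incoming citations per node as a crude importance score."""
    return Counter(cited for _, cited in citations)


# Hypothetical citations between article-level identifiers.
citations = [
    ("A/artikel/1", "B/artikel/2"),
    ("C/artikel/3", "B/artikel/2"),
]
print(to_pajek(citations))
print(in_degree(citations).most_common(1))
```

The resulting text can be saved with a `.net` extension and opened in Pajek or Gephi.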