The MetaLex Document Server - Legal Documents as Versioned Linked Data


  1. 1. The MetaLex Document Server. Rinke Hoekstra, Universiteit van Amsterdam
  2. 2. The Problem
      • Knowledge
      • Provenance
      • Open Data: public service falls short
      • Large-scale validation of CEN MetaLex
      • “Linked Open Government Data”
      Diagram labels: Regulation A; Art 12; Art 14, lid 3, 2e volzin (paragraph 3, 2nd sentence), shown twice; dates 01-01-2011, 04-02-2011, 11-06-2008, 01-07-2011
  3. 3. Current Situation: public content services hosted at wetten.nl
  4. 4. Wetten.nl XML Service
      http://wetten.overheid.nl/xml.php?regelingID=...
      • Only available format is BWB XML
      • Only current version
      • Content at document level
      • Identification at document level
      • Identifiers are not dereferenceable
      • Hardly any metadata (e.g. version date)
      • Only available context is position in text
  5. 5. BWBId Web Service
      http://wetten.overheid.nl/BWBIdService/BWBIdList.xml.zip
      NB: The problem with the XML processing instruction was reported and fixed, but returned sometime last week
  6. 6. Identifiers & Juriconnect
      1.0:c:BWBR0005416&artikel=6
      vs http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005416/article=6/date=2005-01-14
      vs http://wetten.overheid.nl/BWBR0005416/TitelII698946/HoofdstukII/Artikel16/geldigheidsdatum_14-01-2005
      • Juriconnect?
      • URN-based... but no naming server (cf. Digital Object Identifiers)
      • Named elements do not carry an identifier
      • No explicit version information, only contextual
  7. 7. Sources used...
      • List of all regulations in “XML”
      • Wetten.nl XML Service
      • Metadata in HTML table on wetten.nl (the “info page”)
      • ... so let’s get started already
  8. 8. Step 1: Requirements
  9. 9. Our Goals
      • “Deserialize” regulation content (e.g. topic-based browsing)
      • Extract and reconstruct implicit information (identifiers, metadata)
      • Annotate regulations (reconstructed metadata, third-party metadata)
      • Annotate using regulations (knowledge-based systems, services, business processes ...)
      • Accessible and reusable for any other party (shared vocabularies, standard access)
  10. 10. Requirements
      • Unique, persistent identification
      • Generic XML structure of documents
      • Extensible metadata framework
      • Flexible web services
  11. 11. Technology Choices
      • URL-like URIs
      • CEN MetaLex XML documents
      • Linked Data / RDF metadata (extensibility to OWL, RIF)
      • Transparent REST services
  12. 12. Step 2: Come up with persistent identifiers at element level and a solid versioning scheme
  13. 13. Identification
      • Web-enabled “URL-like” URIs
        • e.g. http://doc.metalex.eu/....
      • “Cool” URIs (http://www.w3.org/TR/cooluris/)
        • “Accept”-header based dereferencing
        • Different types of content at same URI
  14. 14. Levels of Identification
      • IFLA FRBR levels
        • Work
        • Expression
        • Manifestation
      Diagram (Bibliographic Entity): an Expression realizes a Work, a Manifestation embodies an Expression, an Item exemplifies a Manifestation.
      Example: Work = regulation; Expression = version of regulation; Manifestation = XML version of regulation; Item = XML version of regulation on my hard disk.
  15. 15. Transparent Identifiers
      • Hierarchical information (work)
        http://doc.metalex.eu/id/BWBR0011823/hoofdstuk/1/artikel/1
        http://doc.metalex.eu/id/BWBR0011823/artikel/1
      • Version and language (expression)
        http://doc.metalex.eu/id/BWBR0011823/hoofdstuk/1/artikel/1/nl/2010-09-01
      • Format information (manifestation)
        http://doc.metalex.eu/doc/BWBR0011823/hoofdstuk/1/artikel/1/nl/2010-09-01/data.xml
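The three identifier levels on this slide can be sketched as simple string construction. This is a minimal sketch, not the MDS implementation: the function names and the `path` tuple representation are assumptions made for illustration; only the URI shapes come from the slide.

```python
BASE = "http://doc.metalex.eu"


def work_uri(bwb_id, path=()):
    """Work level: regulation id plus hierarchical path, e.g. ("artikel", "1")."""
    return "/".join([BASE, "id", bwb_id, *path])


def expression_uri(bwb_id, path, lang, date):
    """Expression level: the work URI followed by language and version date."""
    return "/".join([work_uri(bwb_id, path), lang, date])


def manifestation_uri(bwb_id, path, lang, date, fmt="xml"):
    """Manifestation level: lives under /doc/ and ends in data.<format>."""
    return "/".join([BASE, "doc", bwb_id, *path, lang, date, f"data.{fmt}"])


path = ("hoofdstuk", "1", "artikel", "1")
print(work_uri("BWBR0011823", path))
print(expression_uri("BWBR0011823", path, "nl", "2010-09-01"))
print(manifestation_uri("BWBR0011823", path, "nl", "2010-09-01"))
```

Keeping the hierarchy, version and format in separate path segments is what makes these identifiers "transparent": each level can be derived from the one above by appending segments.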
  16. 16. Problem
      • URIs don’t carry semantics...
      • Detect changes:
        • which element versions are the same
        • ... and which versions are different?
      Example: Art. 44, lid 4 (2011-03-26) vs Art. 44, lid 4 (2011-04-05), from: Besluit prudentiële regels Wft, BWBR0020420
  17. 17. Opaque Identifiers
      http://doc.metalex.eu/BWBR0011823/hoofdstuk/1/artikel/34b0cee26ee5138c74aa2c62caf2c117d3c616e9
      • Content information
        • Unique SHA1 hash of text
      Diagram labels: s1, s2; frbr:realizes; s1t1, s1t2, s1t3, s2t1, s2t2, s2t3, ...; owl:sameAs; AE6, B9C, 3F5
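A minimal sketch of how a content hash can tell identical element versions from changed ones. The whitespace normalisation and the function names are assumptions for illustration; the slide only states that the opaque identifier is a SHA1 hash of the text.

```python
import hashlib


def normalise(text):
    """Collapse runs of whitespace (an assumed normalisation step)."""
    return " ".join(text.split())


def opaque_id(text):
    """Return the 40-character hexadecimal SHA1 digest of the element text."""
    return hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()


def group_same_versions(versions):
    """Group version labels by content hash.

    `versions` maps a version label (e.g. a date) to the element text.
    Labels that share a hash carry identical text.
    """
    groups = {}
    for label, text in versions.items():
        groups.setdefault(opaque_id(text), []).append(label)
    return groups


versions = {
    "2011-03-26": "Example text of an article.",
    "2011-04-01": "Example  text of an article.",   # same text, extra space
    "2011-04-05": "Example text of an amended article.",
}
for digest, labels in group_same_versions(versions).items():
    print(digest, labels)
```

Versions that end up in the same group are candidates for the owl:sameAs links shown in the diagram.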
  18. 18. Step 3: Generic conversion of BWB XML to a generic XML format (CEN MetaLex) and appropriate metadata
  19. 19. Procedure
      For each BWB XML file listed, if an update has occurred since the latest run, download the latest version, scrape metadata, and produce:
      • Persistent URIs
      • CEN MetaLex + citations
      • Inline RDFa (optional) or RDF graph (optional)
      • Pajek “.net” files (optional)
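The procedure on this slide amounts to an incremental update loop. The sketch below shows only its control flow; `download`, `scrape_metadata` and `convert` are hypothetical callables supplied by the caller, and the `listing` dictionary (regulation id mapped to last modification date) is an assumed shape for the BWB id list.

```python
from datetime import date


def run_conversion(listing, last_run, download, scrape_metadata, convert):
    """Convert every regulation that changed after `last_run`.

    Returns a dict mapping regulation id to whatever `convert` produces.
    """
    results = {}
    for bwb_id, last_modified in listing.items():
        if last_modified <= last_run:
            continue  # no update since the latest run, so skip
        xml = download(bwb_id)
        metadata = scrape_metadata(bwb_id)
        results[bwb_id] = convert(xml, metadata)
    return results


# Usage with stand-in callables (no network access involved).
listing = {"BWBR0011823": date(2011, 6, 1), "BWBR0020486": date(2011, 1, 1)}
converted = run_conversion(
    listing,
    last_run=date(2011, 3, 1),
    download=lambda bwb_id: f"<xml for {bwb_id}/>",
    scrape_metadata=lambda bwb_id: {"id": bwb_id},
    convert=lambda xml, metadata: (xml, metadata),
)
print(sorted(converted))
```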
  20. 20. CEN MetaLex
      • Straightforward 1:1 mapping
        • ... some minor fixes
      • Mint URIs on the fly
      • Convert citations on the fly
      • Generate metadata on the fly
        • “inline” inside mcontainer elements
  21. 21. Results (full description in draft ISWC 2011 paper)
      Table 1. Conversion performance for 300 randomly selected regulations.

      Substitutions   Number   %       Corrections   Number   %
      container       22312    29 %    artikel       2525     72 %
      hcontainer      3730     5 %     divisie       519      15 %
      htitle          3730     5 %     colspec       289      8 %
      block           34325    44 %    illustratie   54       2 %
      inline          13527    17 %    others        99       3 %
      Total           77624            Total         3486

      Total no. of regulations: 300
      Revoked regulations: 109 (30 %)
      Correction %: 4 %

      Paper excerpts shown on the slide:
      “Lastly, the MDS offers a simple search interface for finding regulations based on the title and version date.”
      “6 Conclusion and Results”
      “We ran the MetaLex conversion script on all regulations available through the wetten.nl portal, resulting in a total of 27,687 versions of regulations being con-”
  22. 22. Citations
      • Juriconnect citations:
        1.0:v:BWBR0020486&artikel=6
        1.0:c:BWBR0020486&artikel=6
      • MetaLex identifiers:
        http://doc.metalex.eu/id/BWBR0020486/artikel/6
        http://doc.metalex.eu/id/BWBR0020486/artikel/6/2009-01-01
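A minimal sketch of rewriting a Juriconnect citation of the form shown on this slide into a MetaLex identifier. It handles only the `1.0:c:` / `1.0:v:` prefix, a BWB id and `&name=value` pairs; the regular expression, the function name and the optional `date` argument are assumptions for illustration, and the sketch does not claim to cover the full Juriconnect standard.

```python
import re

# Matches e.g. "1.0:c:BWBR0020486&artikel=6" (simplified pattern, an assumption).
JURICONNECT = re.compile(
    r"^1\.0:(?P<kind>[cv]):(?P<bwb>BWB[A-Z]\d+)(?P<rest>(?:&[^&=]+=[^&=]+)*)$"
)


def juriconnect_to_metalex(citation, date=None):
    """Rewrite a Juriconnect citation as a MetaLex URI.

    Each "&name=value" pair becomes a "/name/value" path segment. If `date`
    is given it is appended, as in the dated identifier on the slide.
    """
    match = JURICONNECT.match(citation)
    if match is None:
        raise ValueError(f"not a recognised Juriconnect citation: {citation!r}")
    parts = ["http://doc.metalex.eu/id", match.group("bwb")]
    for pair in filter(None, match.group("rest").split("&")):
        name, value = pair.split("=")
        parts += [name, value]
    if date is not None:
        parts.append(date)
    return "/".join(parts)


print(juriconnect_to_metalex("1.0:c:BWBR0020486&artikel=6"))
print(juriconnect_to_metalex("1.0:v:BWBR0020486&artikel=6", date="2009-01-01"))
```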
  23. 23. Metadata Vocabularies
      • “RDFized” BWB elements
      • MetaLex ontology
        • FRBR type, modification events, structure
      • Dublin Core
        • title, alternativeTitle, version
      • FOAF
        • page, homepage
      • Simple Event Model (SEM)
      • Open Provenance Model vocabulary (OPMV)
      • W3C Time Ontology
  24. 24. Events & Provenance
      Diagram: RDF graph for one version of regulation BWBR0017869, with four annotated nodes:
      • http://doc.metalex.eu/id/date/2009-10-23 (literal "2009-10-23"^^xsd:date): the date at which the expression was created
      • http://doc.metalex.eu/id/event/BWBR0017869/2009-10-23: the creation event of the regulation
      • http://doc.metalex.eu/id/process/BWBR0017869/2009-10-23: the process that generated the expression
      • http://doc.metalex.eu/id/BWBR0017869/2009-10-23: the expression (version) URI of a regulation
      Classes shown: time:Instant, ml:Date, sem:Time, opmv:Process, sem:Event, ml:LegislativeModification, opmv:Artifact, ml:BibliographicExpression
      Properties shown: rdf:type, rdf:value, sem:hasTimeStamp, sem:timeType, time:inXSDDateTime, sem:hasTime, time:hasEnd, ml:date, sem:eventType, opmv:wasGeneratedAt, ml:resultOf, opmv:wasGeneratedBy
  25. 25. Step 4: Publish: The MetaLex Document Server (MDS)
  26. 26. Document Serving
      • RESTful API
        • Implement Cool URIs (dereference to XML, RDF, .net)
        • Shorthands (‘/latest’)
        • SPARQL endpoint
        • Citation graphs
      • Rudimentary (and unpredictable) search
      • CSS stylesheet for CEN MetaLex XML
  27. 27. Dereferencing (RDF)
      Request: http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01 with Accept: application/x-turtle
      1 Client requests URI
      2 Server redirects to manifestation URI (HTTP 303): http://doc.metalex.eu/doc/BWBR0011823/nl/2010-09-01/data.ttl
      3 Server queries triplestore (SPARQL query) for Symmetric Concise Bounded Description (SCBD), http://www.w3.org/Submission/CBD
      4 Triplestore returns SCBD (JSON serialisation of SCBD)
      5 MDS returns Turtle (file containing Turtle serialisation of SCBD)
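From the client side, step 1 is an ordinary HTTP GET with an `Accept` header. The sketch below only builds the request with Python's standard library and leaves the network call commented out, since the service may not be reachable; `build_request` is a name chosen for illustration.

```python
import urllib.request


def build_request(uri, accept="application/x-turtle"):
    """Prepare a GET request that asks for a specific representation."""
    return urllib.request.Request(uri, headers={"Accept": accept})


req = build_request("http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01")
print(req.full_url, req.get_header("Accept"))

# To perform the request (urlopen follows the HTTP 303 redirect by default):
# with urllib.request.urlopen(req) as response:
#     print(response.geturl())   # the manifestation URI after redirection
#     turtle = response.read().decode("utf-8")
```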
  28. 28. Dereferencing (XML)
      Request: http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01 with Accept: text/xml
      1 Client requests URI
      2 Server redirects to manifestation URI (HTTP 303): http://doc.metalex.eu/doc/BWBR0011823/nl/2010-09-01/data.xml
      3 Server queries file store for XML manifestation (glob)
      4 If no manifestation exists, extract from parent (extract)
      5 Triplestore returns URI of manifestation
      6 MDS redirects to manifestation URI (HTTP 302); location of manifestation: http://doc.metalex.eu/files/BWBR0011823_2010-03-01_mls.xml
      (Clients may render XML using CSS stylesheet)
  29. 29. Dereferencing (...)
      • Other RDF syntaxes: application/rdf+xml, text/rdf+n3
      • HTML clients: application/xml, application/xhtml+xml, text/html
        • Redirect (303) to Marbles browser
      • Pajek clients: text/plain
        • Download .net file
        • View using Gephi Toolkit http://gephi.org
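The redirect step of the dereferencing slides can be sketched as a mapping from media type to manifestation URI. This is a simplified illustration, not the MDS code: only `data.ttl` and `data.xml` appear in the slides, so the other file extensions are assumptions, `MARBLES` is a hypothetical placeholder address, and q-values in the `Accept` header are ignored.

```python
from urllib.parse import quote

ID_PREFIX = "http://doc.metalex.eu/id/"
DOC_PREFIX = "http://doc.metalex.eu/doc/"
MARBLES = "http://example.org/marbles?uri="  # hypothetical browser address

EXTENSIONS = {
    "application/x-turtle": "ttl",  # from the slides
    "text/xml": "xml",              # from the slides
    "application/rdf+xml": "rdf",   # assumed extension
    "text/rdf+n3": "n3",            # assumed extension
    "text/plain": "net",            # assumed file name for Pajek output
}
HTML_TYPES = {"application/xml", "application/xhtml+xml", "text/html"}


def negotiate(id_uri, accept):
    """Return (status, location) for the first media type we can serve."""
    if not id_uri.startswith(ID_PREFIX):
        raise ValueError("expected an /id/ URI")
    local = id_uri[len(ID_PREFIX):]
    for part in accept.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in EXTENSIONS:
            return 303, f"{DOC_PREFIX}{local}/data.{EXTENSIONS[media_type]}"
        if media_type in HTML_TYPES:
            return 303, MARBLES + quote(id_uri, safe="")
    return 406, None  # nothing acceptable


uri = "http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01"
print(negotiate(uri, "application/x-turtle"))
print(negotiate(uri, "text/xml"))
```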
  30. 30. Technical Details
      • Current situation
        • +/- 27 thousand regulations
        • 87.9 million triples (legislation.gov.uk: 1.9 billion)
        • Updated daily
      • Technical details
        • Dell PowerEdge II T110, 32GB RAM
        • Garlik 4Store triplestore (http://4store.org)
        • Python Django web applications
        • Tomcat servlet + Gephi Toolkit API
      • See http://doc.metalex.eu
  31. 31. Step 5: Use: social network analysis and concept extraction (ongoing work)
  32. 32. Network Analysis
      • Impact of regulation on other regulations (combine with work on court rulings)
      • Connectedness
      • “Importance” of articles
      • Analysis tools
        • Pajek, Gephi
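A minimal sketch of the kind of citation graph that Pajek or Gephi can load, together with one crude reading of “importance”: the number of incoming citations. The edge list is made-up sample data, the function names are assumptions, and in-degree is only one possible measure, not necessarily the one used in this work.

```python
from collections import Counter


def to_pajek(edges):
    """Serialise directed (citing, cited) pairs in Pajek .net format."""
    vertices = sorted({vertex for edge in edges for vertex in edge})
    index = {vertex: i for i, vertex in enumerate(vertices, start=1)}
    lines = [f"*Vertices {len(vertices)}"]
    lines += [f'{index[vertex]} "{vertex}"' for vertex in vertices]
    lines.append("*Arcs")
    lines += [f"{index[src]} {index[dst]}" for src, dst in edges]
    return "\n".join(lines) + "\n"


def in_degree(edges):
    """Count how often each vertex is cited."""
    return Counter(dst for _, dst in edges)


# Made-up sample citations between articles of two regulations.
edges = [
    ("BWBR0020486/artikel/6", "BWBR0011823/artikel/1"),
    ("BWBR0020486/artikel/7", "BWBR0011823/artikel/1"),
]
print(to_pajek(edges))
print(in_degree(edges).most_common(1))
```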
