Service-Oriented Architecture for automatic markup of documents

Presentation at WLIC IFLA 2014

  • Slide note: add a flow diagram with variants
    and the configuration interface (configuration schemas)
    (an input/output diagram, but with the flow shown inside the box)

Transcript

  • 1. Service-Oriented Architecture for automatic markup of documents. A use case for legal documents.
    Francisco Adolfo Cifuentes-Silva, Library of Congress of Chile (BCN), 2014-08-19
    “Digital law libraries at the crossroads: Innovative solutions to complex challenges.”
  • 2. Outline: Project context; Strategic decisions (SOA, Linked Open Data, Akoma-Ntoso); Automatic markup (Named Entity Recognizer, URI assignment, Structural Markup, Akoma-Ntoso translator); Results and discussion; Conclusions; Future work; Acknowledgements
    Project context
    The project was born in response to two problems:
    1. Obtaining all parliamentary interventions made within the legislative process (Congress sessions and related documents).
    2. Tracing the evolution of, and the discussion around, a law from the moment it is introduced as a bill until it is published as law.
  • 3. Project context
    The project responds to the same two problems: obtaining all parliamentary interventions made within the legislative process, and tracing a law from bill to publication.
    And in an automated way!
  • 4. Project context
    How: two sibling projects.
    - Parliamentary Labor project (PL): obtain all parliamentary interventions made within the legislative process (Congress sessions and related documents).
    - History of the Law project (HL): trace the evolution of, and the discussion around, a law from the moment it is introduced as a bill until it is published as law.
  • 5. Project context
    They are “sibling projects” because both are made possible by processing the same documents:
    - Session dailies
    - Debate reports
    - Reports
    - Amendments
    - Bills
    - etc.
  • 6. Project context
  • 7. Project context: Congress and legal resources
  • 8. Project context: Chilean Congress
    - Senate
    - Chamber of Deputies
  • 9. Project context: Legal resources production
    - Session dailies
    - Debate reports
    - Bills, etc.
  • 10. Project context: Congress and legal resources; Workflow
  • 11. Project context: Business Processes
    - Each type of document has its own process flow
    - BCN implements a Workflow Management System for PL & HL
  • 12. Project context: Congress and legal resources; Workflow; Tools
  • 13. Project context: Support tools
    - Automatic XML Marker
    - Web XML Editor
    - An XSD is the basis of the support tools
  • 14. Project context: Congress and legal resources; Workflow; Tools; XML Storage
  • 15. Project context: XML Storage
    - SVN server for XML documents
    - Allows us to manage all XML versions
    - REST access: HTTP GET, PUT
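
A minimal sketch of what REST access to the XML store could look like from a client, assuming the server accepts plain HTTP GET and PUT on document URLs. The base URL, the document path and the helper names are hypothetical; the slides do not give them. The requests are only built here, not sent.

```python
from urllib.request import Request

# Hypothetical repository base URL; the slides do not give the real one.
REPO_BASE = "http://svn.example.org/repo/xml"


def build_get(doc_path: str) -> Request:
    """Build (but do not send) a GET request for the latest version of a document."""
    return Request(f"{REPO_BASE}/{doc_path}", method="GET")


def build_put(doc_path: str, xml_text: str) -> Request:
    """Build (but do not send) a PUT request that stores a new version of a document."""
    return Request(
        f"{REPO_BASE}/{doc_path}",
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="PUT",
    )
```

Passing either request to `urllib.request.urlopen` would perform the call; under this assumption each PUT would yield a new revision in the version history.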
  • 16. Project context: Congress and legal resources; Workflow; Tools; XML Storage; Information extraction; Linked Open Data
  • 17. Project context: Information Extraction
    New information is extracted from the enriched XML in two formats:
    - Linked Open Data
    - Relational data (fact table)
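
To illustrate the two output formats, here is a small sketch that reads one enriched XML fragment and emits both RDF-style triples and fact-table rows. The XML element and attribute names, the `mentions` predicate and all URIs are made up for the example; the real schema and vocabulary are not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical enriched XML: a speech whose entity mentions already carry URIs.
SAMPLE = (
    '<speech by="http://example.org/person/1" session="http://example.org/session/42">'
    '<p>Text mentioning <entity uri="http://example.org/org/7">an organization</entity>.</p>'
    "</speech>"
)

MENTIONS = "http://example.org/vocab#mentions"  # hypothetical predicate


def extract(xml_text: str):
    """Return (triples, fact_rows) derived from one enriched XML fragment."""
    root = ET.fromstring(xml_text)
    speaker = root.get("by")
    session = root.get("session")
    triples, facts = [], []
    for entity in root.iter("entity"):
        uri = entity.get("uri")
        # Linked Open Data view: one subject-predicate-object triple per mention.
        triples.append((session, MENTIONS, uri))
        # Relational view: one row per mention, ready for a fact table.
        facts.append({"session": session, "speaker": speaker, "entity": uri})
    return triples, facts
```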
  • 18. Project context: Congress and legal resources; Workflow; Tools; XML Storage; Information extraction; Linked Open Data
  • 19. Project context: Congress and legal resources; Workflow; Tools; XML Storage; Information extraction; Linked Open Data
    New data is used for a new process
  • 20. Strategic decisions: Service Oriented Architecture
    Our focus:
    - HTTP is the base
    - REST Web Services
    - W3C Web Standards
  • 21. Strategic decisions: Service Oriented Architecture
    Components: Workflow Management System; Automatic Markup; XML Editor; RDF Triplestore; SVN XML; Mediator; Web Services
  • 22. Strategic decisions: Linked Open Data (LOD)
    BCN has published LOD since 2011:
    - Dataset of legal norms
    - Dataset of legislative documents
    - Datasets and ontologies about:
      - People
      - Geographic places
      - Organizations
      - Others, such as roles, bills, congress structure, etc.
    Please visit http://datos.bcn.cl !!
  • 23. Strategic decisions: Linked Open Data
    For automatic markup we are using:
    - URIs for legal documents
    - URIs for metadata
    - URIs for named entities:
      - URIs for people
      - URIs for organizations
      - URIs for roles
      - URIs for events
      - URIs for locations
      - ...
    URIs for everything
  • 24. Strategic decisions: The definition of an XML Schema
    We need an XML Schema to mark up documents and, eventually, to interchange them, so we have two broad choices:
    - Own XML Schema = low interoperability and reusability, high cost
    - Standard XML Schema = high interoperability and reusability, low cost
  • 25. Strategic decisions
  • 26. Strategic decisions: The definition of an XML Schema
    Standard XML Schema = high interoperability and reusability, low cost
    OK, but why Akoma-Ntoso?
  • 27. Strategic decisions: Akoma-Ntoso
    - XML Schema for legal documents, designed and supported by “great minds” in the OASIS Group
    - Supports many types of documents (session dailies, bills, debate reports, amendments, among others)
  • 28. Strategic decisions: Akoma-Ntoso
    - There is a growing set of tools for working with it, such as Web XML editors or office editor tools, for example:
      - LegisProWeb
      - Bungeni
      - Lime Editor
  • 29. Automatic markup in XML
    Automatic XML Marker: Plain Text -> Named Entities recognition -> URI assignment -> Structural Markup -> Akoma-Ntoso translation -> XML AKN
  • 30. Automatic markup in XML
    Automatic XML Marker: Plain Text -> Named Entities recognition -> URI assignment -> Structural Markup -> Akoma-Ntoso translation -> XML AKN
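
The chain of stages can be sketched as a list of functions applied in order. This is only an illustration of the data flow: every stage below is a hypothetical local placeholder, whereas in the architecture described each stage is a separate web service, and the dictionary keys are invented for the example.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Stage = Callable[[Doc], Doc]


def recognize_entities(doc: Doc) -> Doc:
    # Placeholder: a real stage would call the NER web service.
    return {**doc, "entities": []}


def assign_uris(doc: Doc) -> Doc:
    # Placeholder: a real stage would call the Mediator.
    return {**doc, "uris": {}}


def mark_structure(doc: Doc) -> Doc:
    # Placeholder: a real stage would produce the ad-hoc XML.
    return {**doc, "adhoc_xml": "<document/>"}


def translate_to_akn(doc: Doc) -> Doc:
    # Placeholder: a real stage would convert the ad-hoc XML to Akoma-Ntoso.
    return {**doc, "akn_xml": "<akomaNtoso/>"}


PIPELINE: List[Stage] = [recognize_entities, assign_uris, mark_structure, translate_to_akn]


def run_pipeline(text: str, stages: List[Stage] = PIPELINE) -> Doc:
    """Feed plain text through each stage in order, accumulating results."""
    doc: Doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because each stage only reads and extends the shared document, a single stage can be replaced or retrained without touching the others.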
  • 31. Automatic markup in XML: Named Entity Recognizer (NER)
    - We need to identify entities in the text
    - We are using a Spanish-adapted version of Stanford NER, which uses a CRF classifier
    - The classifier was trained on large documents and achieves over 80% effectiveness in entity recognition
  • 32. Automatic markup in XML: Named Entity Recognizer (NER)
    Web service, written in Java and based on Stanford NER
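
As an illustration of what a client might do with recognizer output, the sketch below groups tagged tokens into entities. It assumes output in a `token/LABEL` form (the slash-tag style that Stanford NER can produce); the label names, the sample sentence and the actual response format of the BCN web service are assumptions, since the slides do not show them.

```python
from typing import List, Tuple


def parse_slash_tags(tagged: str, outside: str = "O") -> List[Tuple[str, str]]:
    """Group consecutive tokens sharing a label into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None
    for item in tagged.split():
        token, _, label = item.rpartition("/")
        if label == current_label and label != outside:
            # Same entity continues.
            current_tokens.append(token)
            continue
        if current_tokens:
            # The label changed, so the previous entity is complete.
            entities.append((" ".join(current_tokens), current_label))
        if label != outside:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical tagged sentence; the label names are illustrative only.
SAMPLE = "El/O senador/O Juan/PERSON Pérez/PERSON visitó/O Valparaíso/LOCATION ./O"
```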
  • 33. Automatic markup in XML
    Automatic XML Marker: Plain Text -> Named Entities recognition -> URI assignment -> Structural Markup -> Akoma-Ntoso translation -> XML AKN
  • 34. Automatic markup in XML: URI assignment
    - Once the NER has found all entities, we need to assign each one its URI
    - The tool for this is called “The Mediator”; it has been developed in collaboration with the Weso Research Group of the University of Oviedo
  • 35. Automatic markup in XML: Mediator output in XML
    Web service, written in Java and based on Apache Lucene
  • 36. Automatic markup in XML: Mediator features
    - Connected to a SPARQL endpoint
    - It allows context information to be set for each work session (e.g. date, chamber, type of document being marked up)
    - Using the context information, it applies a set of heuristics for each entity type to identify the correct URI for each entity
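
One way such a context-based heuristic could work for people is sketched below: a surname alone is ambiguous, but the session date and chamber narrow the candidates to one. The data classes, field names, URIs and the matching rule are all assumptions for illustration; the actual Mediator queries a SPARQL endpoint and a Lucene index, and its heuristics are not described in the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Candidate:
    uri: str
    name: str
    chamber: str
    start: date  # first day in office
    end: date    # last day in office


@dataclass
class Context:
    session_date: date
    chamber: str


def resolve_person(mention: str, ctx: Context, candidates: List[Candidate]) -> Optional[str]:
    """Return the URI of the single candidate compatible with the context, else None."""
    words = mention.lower().split()
    matches = [
        c for c in candidates
        if all(w in c.name.lower().split() for w in words)
        and c.chamber == ctx.chamber
        and c.start <= ctx.session_date <= c.end
    ]
    # Stay undecided when zero or several candidates remain.
    return matches[0].uri if len(matches) == 1 else None


# Hypothetical candidates sharing a surname but serving in different periods.
CANDIDATES = [
    Candidate("http://example.org/person/1", "Juan Pérez", "senate", date(1990, 3, 11), date(1998, 3, 10)),
    Candidate("http://example.org/person/2", "Ana Pérez", "senate", date(2010, 3, 11), date(2018, 3, 10)),
]
```

Returning `None` for ambiguous cases leaves the decision to a later manual review instead of guessing.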
  • 37. Automatic markup in XML
    Automatic XML Marker: Plain Text -> Named Entities recognition -> URI assignment -> Structural Markup -> Akoma-Ntoso translation -> XML AKN
  • 38. Automatic markup in XML: Structural markup
    - The problem is to detect structural sections
    - Combination of methods:
      - Regular expressions
      - Algorithms for detecting sequences
      - Rules and algorithms
  • 39. Automatic markup in XML: Structural markup
    - The combination of methods depends on each document type
    - Finally, the object representation of the document (similar to a DOM) is converted to ad-hoc XML
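
A toy version of this idea, assuming a document whose sections start with Roman-numeral headings and whose items are numbered consecutively: regular expressions find the candidates, a sequence check accepts a numbered line as an item only if it continues the count, and the resulting tree is serialized as ad-hoc XML. The patterns, element names and sample text are invented; the real rules differ per document type and are not given in the slides.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^[IVXLC]+\.\s+(.+)$")  # e.g. "II. AGENDA"
ITEM = re.compile(r"^(\d+)\.\s+(.+)$")        # e.g. "3. Some item"


def mark_structure(text: str) -> ET.Element:
    """Build a small element tree of sections, numbered items and paragraphs."""
    root = ET.Element("document")
    section = None
    expected = 1  # next item number that would continue the sequence
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            section = ET.SubElement(root, "section", title=heading.group(1))
            expected = 1
            continue
        parent = section if section is not None else root
        item = ITEM.match(line)
        if item and int(item.group(1)) == expected:
            ET.SubElement(parent, "item", num=item.group(1)).text = item.group(2)
            expected += 1
        else:
            # Numbered lines that break the sequence are treated as plain text.
            ET.SubElement(parent, "p").text = line
    return root


SAMPLE = """I. ACCOUNT
1. First item
2. Second item
Some paragraph
II. AGENDA
1. Bill discussion
5. Stray number"""
```

Calling `ET.tostring(mark_structure(SAMPLE), encoding="unicode")` gives the ad-hoc XML string for this sample.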
  • 40. Automatic markup in XML: Structural markup
    Web service, written in Java
  • 41. Automatic markup in XML
    Automatic XML Marker: Plain Text -> Named Entities recognition -> URI assignment -> Structural Markup -> Akoma-Ntoso translation -> XML AKN
  • 42. Automatic markup in XML: Akoma-Ntoso translator
    - We need AKN documents for editing, enrichment and extraction
    - AKN is a complex schema
    - The best solution was to build a web service that converts ad-hoc XML to AKN
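
The translation step can be pictured as a tag-renaming walk over the ad-hoc tree. In the sketch below the ad-hoc element names on the left of the map are hypothetical; the names on the right (`debateBody`, `debateSection`, `speech`, `from`, `p`) are Akoma-Ntoso elements. The output is deliberately simplified: it carries no namespace and no `meta` block, so it is not schema-valid AKN.

```python
import xml.etree.ElementTree as ET

# Hypothetical ad-hoc element name -> Akoma-Ntoso element name.
TAG_MAP = {
    "document": "debateBody",
    "section": "debateSection",
    "intervention": "speech",
    "speaker": "from",
    "p": "p",
}


def translate(node: ET.Element) -> ET.Element:
    """Recursively copy a node, renaming its tag according to TAG_MAP."""
    out = ET.Element(TAG_MAP.get(node.tag, node.tag), dict(node.attrib))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(translate(child))
    return out


def to_akn(adhoc_xml: str) -> str:
    """Wrap the translated body in a simplified akomaNtoso/debate skeleton."""
    root = ET.Element("akomaNtoso")
    debate = ET.SubElement(root, "debate")
    debate.append(translate(ET.fromstring(adhoc_xml)))
    return ET.tostring(root, encoding="unicode")


ADHOC = (
    '<document><section><intervention by="#perez">'
    "<speaker>Sr. Pérez</speaker><p>Texto</p>"
    "</intervention></section></document>"
)
```

Keeping the mapping in a table is one reason a separate translator is attractive: supporting another output schema would mostly mean supplying another table and skeleton.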
  • 43. Automatic markup in XML: Akoma-Ntoso translator
    Web service, written in Java
  • 44. Results and discussion
    Positive impact on the work: XML markup time is dramatically reduced compared to manual labeling of documents, which reduces the time and cost of product generation.
  • 45. Results and discussion
    Time for completing a History of the Law in distinct scenarios
  • 46. Conclusions
    - SOA has allowed us to improve each component separately, positively impacting the final result (e.g. datasets, NER training, heuristics)
    - It is possible to integrate additional XML Schemas for the output
  • 47. Conclusions
    - The automatic markup of XML documents, followed by manual enrichment of metadata, provides an excellent source for data extraction
    - Our SOA-based solution allows us to easily integrate exceptions and new cases into the markup
  • 48. Future Work
    Alfonso Pérez, Director of the BCN, has established the concept of the “Semantic Library” as one of the main objectives of the BCN in the institutional strategic plan.
    This new concept implies applying the automatic markup scheme to all BCN areas, developing new markup schemas and possibly facing new challenges in identifying document sections and semantic content.
  • 49. Acknowledgements
    - Library of Congress of Chile Team
    - Developers team:
      - Ricardo Muñoz
      - Claudio Devia
      - Eridan Otto
      - David Vilches
      - Me
  • 50. Thanks for your attention!
    If you need more details, you can contact me:
    - fcifuentes <at> bcn <dot> cl
    - twitter.com/fcifuentes
    - www.slideshare.net/francisco.cifuentes
    - www.linkedin.com/in/fcifuentes