Eclipse For Document Processing


A rapprochement is emerging between two separate markets: business intelligence (BI) and enterprise content management (ECM). In this context, a mid-term objective is to inform BI analytics with content extracted from unstructured documents. Correlating such different data sources would create a wider reasoning space than the somewhat limited enterprise history that BI currently uses.

The main difficulty on the way there is building semantic content silos from enterprise documents.

Xeproc is a technology built on the EMF and GMF frameworks and distributed under the EPL. We expect it to play a role in this challenging migration from content to BI.

Xeproc is based on a simple Ecore model made of documents, components, validations and views. Its original feature is that the associated designer can be instrumented by extending it with interpreters.

Visual feedback from the view and validation specifications helps teams collaboratively define and instantiate the silos.


Published in: Technology


  1. An EMF path* from ECM to BI
     Thierry Jacquin, Xerox Research Centre Europe
     * There is no path to business intelligence: intelligence is the path.
  2. Context of this talk
     • Two separate markets…
       • Business Intelligence
         • Business transaction logs
         • Analytics
       • Enterprise Content Management
         • Document repositories
         • Search
     • … and an emerging rapprochement
       • Analytics on content management
       • Informing traditional BI with semantic content
  3. Enterprise Content Management
     • Document repository
       • Collections
       • Browsing
       • CRUD
       • Metadata & Indexing
       • Search
     • Workflow engine
       • Status driven
       • Role centric
  4. Business Intelligence
     • Accounting database is the raw material
       • Log of transactions
       • Enriched for analytics needs
     • Dedicated warehouses or silos over years
       • Enterprise products
       • Enterprise segments
       • Geographic features
     • Analytics to quantify enterprise practices
       • Remains focused on enterprise data
       • Business analytics eager to incorporate wider business context
  5. Informing BI with semantic content
     [Diagram labels: BI data flow; Soon; Xeproc, an Eclipse-based framework]
  6. Xeproc Path
     • Pointing to stored documents
       • Traceability (qualify and verify)
       • Explicit semantic illustrations
     • Computable metadata formats
       • METS (records and BI)
       • RDF (tuples and inferring)
     • Domain specific semantics
       • Concepts
       • Relations
     • Logical structure of documents
       • Layout analysis
       • Text / image analysis
     [Diagram labels: Visualization; Annotation; Collaboration; Doc. Processing; Enterprise Documents; Enterprise Content Silo]
  7. Xeproc Monitored Path
     • Designer
       • EMF based
       • GMF designed
       • XML instrumented
       • Distributed under EPL
     • Monitoring based on validations and visualizations
  8. Xeproc Model
     Xeproc monitoring extensibility
     • Extension points
     • paletteComponent
       • Pdftoxml
     • xeprocPlayer
       • Full play on demand
     • stepplayer
       • Xsltproc
     • validator
       • XSD / rnc / rng schemas
       • Versus Reference
     • renderer
       • Xslt based
     • renderingViewer
       • SWT browser
       • ATF mozilla
     Java classpath; xeprocURI resolvers
     • URI referenced resources
     • in plug-in, project or remote spaces
     Model elements
     • Component
       • Document processing
         • XML input
         • XML output
     • Validation
       • Schema
       • Reference set
     • View
       • Specification in renderer syntax
       • Applied on step output
       • Manual correction
       • Import and export of annotations
  9. Conclusion
     • EMF base and MDA trend
       • Xpand easy to apply on Xeproc
       • SOA by design, since every model element is a resource URI
       • Specifications in the model clearly separated from engines
       • Design is monitored when engines are plugged into the designer
     • Join the Xeproc Community
     • http://www.xrce.xerox.com/Xeproc
     • Shaman as an experimental test bed
       • Long term digital preservation
       • Data grid context
       • ECAD/MCAD bridge via content silo
     [Diagram label: Domain Specific Language]
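
The model on slide 8 (components that take XML in and produce XML out, with validations applied to each step's output against a schema-like rule or a reference set) can be illustrated with a small sketch. This is not Xeproc code: Xeproc is an EMF/GMF-based Eclipse technology, and every name below (Step, run_chain, tag_titles, every_page_has_title, matches_reference, the sample XML) is hypothetical and chosen only for illustration. Views are left out, and the checks are plain Python functions rather than XSD or RELAX NG schemas.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types: a component maps XML to XML, a validation checks step output.
Transform = Callable[[ET.Element], ET.Element]
Check = Callable[[ET.Element], bool]


@dataclass
class Step:
    name: str
    component: Transform
    validations: List[Check] = field(default_factory=list)


def run_chain(doc: ET.Element, steps: List[Step]) -> List[Tuple[str, bool]]:
    """Feed each step the previous output and record whether its checks pass."""
    report = []
    for step in steps:
        doc = step.component(doc)
        ok = all(check(doc) for check in step.validations)
        report.append((step.name, ok))
        if not ok:
            break  # stop at the first failing step so its output can be inspected
    return report


def tag_titles(root: ET.Element) -> ET.Element:
    """Toy component: mark the first <line> of every <page> as a title."""
    out = ET.fromstring(ET.tostring(root))  # work on a copy of the input
    for page in out.iter("page"):
        first = page.find("line")
        if first is not None:
            first.set("role", "title")
    return out


def every_page_has_title(root: ET.Element) -> bool:
    """Rule-style validation: each page must contain a line tagged as title."""
    return all(
        page.find("line[@role='title']") is not None for page in root.iter("page")
    )


def matches_reference(reference_xml: str) -> Check:
    """'Versus reference' validation: compare canonical forms of output and reference."""
    ref = ET.canonicalize(reference_xml, strip_text=True)

    def check(root: ET.Element) -> bool:
        actual = ET.canonicalize(ET.tostring(root, encoding="unicode"), strip_text=True)
        return actual == ref

    return check


SOURCE = "<doc><page><line>Invoice 42</line><line>Total: 10</line></page></doc>"
REFERENCE = (
    '<doc><page><line role="title">Invoice 42</line>'
    "<line>Total: 10</line></page></doc>"
)

steps = [Step("tag-titles", tag_titles, [every_page_has_title, matches_reference(REFERENCE)])]
report = run_chain(ET.fromstring(SOURCE), steps)
print(report)
```

Keeping the checks separate from the component mirrors the slide's point that specifications are clearly separated from engines: the same step can be replayed with a different validator, or compared against a different reference, without touching the transformation itself.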
