
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic Web stack in a corporate IS environment."



This paper has been selected and presented in the Industry track at ISWC 2012 Boston by Fabrice Lacroix – Antidot.



Linked Enterprise Data: leveraging the Semantic Web stack in a corporate IS environment

Fabrice Lacroix – Antidot – lacroix@antidot.net

The context

Business information systems (IS) have developed incrementally. Each new operating need has generated an ad hoc application: ERP, CRM, EDM, directories, messaging, extranet and so on. IS development has been driven by applications and processes, each new application creating another data silo. Organizations are now facing a new challenge: how to manage and extract value from this disparate, isolated data. Companies need an agile information system that delivers new applications at an ever-increasing pace, built from existing data, without creating a new warehouse or adding complexity.

Over the past twenty years, various solutions attempting to tackle the problems raised by data proliferation have appeared: BI, MDM, SOA. While these tools undoubtedly provide benefits, in most cases they entail a long and costly deployment process and make the overall system even more complex. What's more, none of them can address the challenges of an ever faster changing technological environment.

A versatile IS should:
• Pool data to create information that will provide a new operational service,
• Integrate and distribute data between applications, both internally and externally with its ecosystem,
• Provide an information infrastructure that emphasizes agility and ease of use.

Therefore, we need to look beyond the technological issues and change the paradigm. Instead of focusing on applications, we must place the data at the heart of the approach. And for that, the recent evolution of the Web blazes the trail.

Why we use the Semantic Web stack

Originally designed to serve as a universal document publication system, the Web has radically evolved over the past 15 years.
The Web of Data, also known as the Semantic Web, is the latest iteration of the Web, in which computers can process and exchange information automatically and unambiguously. It goes well beyond simple access to raw data by providing a way of interweaving semantized data. This process, known as Linked Data, creates a decentralized knowledge base in which the value of each piece of information is enhanced by its links to complementary data.

As a software vendor in the realm of information access solutions (enterprise search engines, data management and enrichment), Antidot has long worked on solutions that create a unified informational space drawing on all of a company's documents and data, meshing unstructured and structured information. In 2003 Antidot foresaw in Semantic Web technologies an elegant way to tackle the challenge of enterprise data integration, and in 2006 we started evaluating and integrating them into our solutions. Four years of development and several major projects with various customers and business subjects allowed us to figure out how to use these technologies efficiently. We strongly support Linked Enterprise Data (LED), the application of the Linked Data principles to the corporate IS [1]. However, the way we use the Semantic Web may seem heretical with respect to the conventional principles. Hereafter we report key aspects of our approach.

[1] For more information on our LED approach, read our white paper "Linked Enterprise Data – Principles, Uses and Benefits" – http://bit.ly/LED-EN (PDF, 24 pages, 5.6 MB)

© Antidot – Linked Enterprise Data – ISWC 2012
How we use the Semantic Web stack

The classical Semantic Web architecture for integrating data from various silos relies on a federated principle, where a query is synchronously distributed over the sources through SPARQL endpoints exposed by each of them. This approach presents many scientific and technological challenges, but considering the rationale behind the Web of Data and the need to work in the gigantic open Web space, it seems to be the only reasonable way to make it work.

Though theoretically correct, this approach is not applicable to the corporate IS for a large variety of reasons:
• The corporate information system is built with numerous legacy or closed applications that cannot be adapted or extended with SPARQL endpoints.
• About 80% of the enterprise information realm consists of unstructured or semi-structured data that cannot fit in the model as such.
• Enterprises do not want access to raw data in RDF format. They want to reap valuable information derived from the data, which requires large and complex computations to create these new informational objects.
• The bottom-up approach of mapping silos and their data to RDF to fit the model requires enormous work to define vocabularies or ontologies for each source, which is too heavy an investment.
• Companies dream of seamlessly integrating external data to leverage their internal information. But this external data is mostly available in XML or JSON through Web services, and not yet in RDF, so using SPARQL as a way to query and integrate does not make sense.
• IT departments have invested heavily in their "relational database for storing / XML for exchanging / Web apps for accessing" infrastructure. Their staffs are trained for this paradigm and lack the in-house skills to integrate the graph way of thinking.
• Stability matters most, and Semantic Web technology is unknown, considered new and immature: CIOs are not ready to take the risk of adding load and technological uncertainty to systems that are critical to the company's daily business operations.

For all these reasons, the RDF-SPARQL paradigm as described above is not ready to enter the corporate IS, and it may even face some resistance.

However, we think the Semantic Web is the solution for creating an agile information system. The way we use it at Antidot is tightly related to the architecture of the data processing workflow we set up in projects. Being a long-time software vendor of information access solutions, we rapidly came to the conclusion that there is no good search engine, whatever the technology, if the data quality is not good enough.

To meet this need, we have developed Antidot Information Factory (AIF), a software solution designed specifically to enrich and leverage structured and unstructured data. Antidot Information Factory is an "information generator" that orchestrates large-scale processing of existing data and automates the publishing of enriched or newly created information.

The data processing workflows, named dataflows, always follow the same pattern: Capture – Normalize – Semantize – Enrich – Build – Expose.
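The dataflow pattern just described can be sketched as a chain of stage functions. The following is only a minimal illustration reduced to three of the six stages; the stage signatures, field layout and URIs are invented for the example and are not AIF's actual API:

```python
# Illustrative sketch of the Capture - Normalize - Semantize part of the
# dataflow pattern. Stage names, record fields and URIs are hypothetical.

def capture(source):
    # Extract raw records from a silo (CRM, ERP, files, ...)
    return [{"source": source, "id": "42", "name": "  ACME Corp "}]

def normalize(records):
    # Clean and align field contents so records from different silos can mesh
    return [{**r, "name": r["name"].strip().lower()} for r in records]

def semantize(records):
    # Cherry-pick fields and emit (subject, predicate, object) triples
    return [(f"http://data.mycompany.com/{r['source']}/{r['id']}",
             "http://xmlns.com/foaf/0.1/name",
             r["name"])
            for r in records]

triples = semantize(normalize(capture("crm")))
print(triples)
```

Each downstream stage (Enrich, Build, Expose) would consume the output of the previous one in the same fashion, which is what makes the pipeline easy to extend project by project.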
Harvest and Normalize – These are regular functions as seen in ETL systems: extract the data from the sources, clean it and transform it. We tailor the Normalize process by aligning field contents in order to mesh data coming from different sources (such as records from a CRM and an ERP). For extracting records from relational databases and transforming selected records into RDF, we have developed an R2RML and Direct Mapping compliant module named db2triples [2].

Semantize – This critical step is a cornerstone of our approach. We cherry-pick a subset of interesting fields from each object and create their RDF triples counterpart. Generating the triples requires two actions:

URI generation – The URIs are generated according to a few principles. We chose the form of a URL even though they are not directly dereferenceable: since the sources are not Semantic Web compliant, our solution is in charge of maintaining a mapping allowing the real server access. The path contains the necessary information to access the record in the source system. Example: a record extracted from the CRM will have a URI of the form http://data.mycompany.com/crm/expr_id, where data.mycompany.com points to our solution, crm is a nickname for the CRM source chosen during setup, and expr_id is an expression and/or identifier that unambiguously points to the record and allows backtracking to the original data.

Choosing the predicates – Experience has led us to the conclusion that "big ontologies" and "upfront ontology design" must be avoided in enterprise projects. The idea is not to model or describe each and every aspect of the processes and data inside the company, but to build the necessary information incrementally. Not to mention the fact that, in our approach, the graph is a means and not an end. Therefore, we foster the use of existing vocabularies (like DC, FOAF, Organization, ...) and mesh them as needed.
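The URI scheme described above fits in a one-line helper. This is a sketch using the example domain from the text; the function name is a hypothetical stand-in:

```python
# Sketch of the URI scheme: http://data.mycompany.com/<source nickname>/<expr_id>
# The domain is the example from the text; the helper name is hypothetical.

BASE = "http://data.mycompany.com"  # points to our solution, not to the source

def record_uri(source_nickname: str, expr_id: str) -> str:
    # The path keeps enough information to backtrack to the original record:
    # the source nickname chosen during setup plus an unambiguous identifier.
    return f"{BASE}/{source_nickname}/{expr_id}"

print(record_uri("crm", "contact-1138"))
# http://data.mycompany.com/crm/contact-1138
```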
When the enterprise has defined internal XML formats, we reuse them by transforming tag and attribute names into triple predicates. Pragmatism is the rule.

Unstructured documents like office files, PDF files or email contents don't fit the RDF formalism and cannot be linked to the graph as such. Extra work is necessary:
• First, we transform available metadata (document name, author, creation date, sender and receivers for an email, subject and so forth) into RDF.
• Then, we use text-mining technology to extract named entities like people, organizations, products, etc. from the documents. These entity lists are generated using different sources of the enterprise: directories, CRM or ERP provide people and company names, while products are listed in ERPs or taxonomies. Each annotation generates a triple where the subject is the document URI, the object is the entity URI, and the predicate depends on the entity type but mostly means "quotes" (doc_URI quotes entity_URI).
• Last, we run various specific algorithms designed to do document-versus-document comparison to detect duplicates, different versions of the same document, inclusions, semantically related documents, etc. Each of these relations is inserted into the graph with an appropriate predicate.

By doing so, and thanks to the syntactic alignment done at the Normalize step, we start linking data together, mostly based on shared field values. This creates a first sparse graph.

But the key question here is: "Why do we transform only a subpart of the harvested data into RDF, and what do we do with the rest of it?" Indeed, beyond the fact that text documents are not graph-friendly, as stated above we only transform a selected part of the structured data into RDF:
• From a technical standpoint, we don't feel the technology is mature and stable enough to proceed differently.
• In industrial projects, millions of seed objects are regularly extracted from the sources (invoices, clients, files, etc.), each having tens of fields, and billions of triples do not scale well in available triplestores.
• Transforming only a subpart of the data largely simplifies the task of choosing the predicates, and hence reinforces the choice of using many small available vocabularies instead of big ontologies.
• The data that is not transformed into RDF is stored by Information Factory for later use during the Build step.

The very diversity of enterprise data implies a flexible and pragmatic strategy: the graph is only a part of it.

[2] db2triples is compatible with the R2RML and Direct Mapping Recommendations from May 29th, 2012 and has successfully passed the validation tests (http://www.w3.org/2001/sw/rdb2rdf/implementation-report/). We have open-sourced it and made it available at http://github.com/antidot/db2triples.
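The entity-annotation step of Semantize described above (each named entity found in a document yields a "doc_URI quotes entity_URI" triple) can be sketched as follows. The gazetteer contents, URIs and the quotes predicate are hypothetical, and the naive substring matching stands in for real text-mining technology:

```python
# Sketch of the annotation step: one triple (doc_URI, quotes, entity_URI)
# per named entity found in a document. All URIs below are made up.

QUOTES = "http://data.mycompany.com/vocab/quotes"

# Entity gazetteer built from enterprise sources (directory, CRM, ERP, ...)
ENTITIES = {
    "acme corp": "http://data.mycompany.com/crm/org-7",
    "john smith": "http://data.mycompany.com/directory/person-12",
}

def annotate(doc_uri, text):
    # Naive containment matching; real named-entity extraction is far richer
    lowered = text.lower()
    return [(doc_uri, QUOTES, uri)
            for name, uri in ENTITIES.items() if name in lowered]

triples = annotate("http://data.mycompany.com/edm/doc-99",
                   "Minutes: John Smith met Acme Corp about the renewal.")
print(len(triples))
```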
Enrich – The next step is dedicated to enriching both the objects (the records captured in the sources) and the graph. Depending on the content type and the project needs, we run various algorithms such as text mining, classification, topic detection, etc. This complementary information is included in the graph. We also integrate external information, either by importing data sets into the graph or by querying external sources and mapping the results to RDF triples in order to link them to the graph.

Build – Once the graph is decorated, the key step is to build the knowledge objects that are the real target of the project. We start by executing inference rules (mostly SPARQL CONSTRUCT queries) in order to saturate the graph. Then we extract those objects from the graph: this requires a mix of SELECT queries plus dedicated graph traversal and sub-graph selection algorithms. Moreover, since we have not originally transformed all the data into RDF nor transferred it to the graph, the objects we extract are like skeletons that need to be complemented with the original data that was left outside the graph: this task is completed through specific algorithms and tools embedded in Information Factory, designed to merge RDF objects extracted from the graph with the structured data and documents previously harvested.

Expose – Finally, the knowledge objects created are made available to the IS and the users in various ways, depending on the environment and the needs. They can be dumped in XML files, injected into a database, or indexed and made available through a semantic search engine following the Search-Based Application (SBA) paradigm. Of course, these objects can be loaded in a triplestore and made available in native RDF format through a dedicated SPARQL endpoint.

Hence, we have chosen Semantic Web technology for very pragmatic reasons:
• The RDF/OWL formalism is perfectly suited for modeling.
Its graph nature fits our needs for agility and flexibility. It fosters bottom-up, small-scale projects that offer a quick, inexpensive response to an isolated business need. Each new project gradually enlarges the information graph, without the need to revise or overhaul the initial models.
• The Semantic Web benefits from an ecosystem of existing solutions and tools (triplestores, inference engines, SPARQL endpoints, modeling tools, etc.), as well as partners and skills. The open and standard nature of its formats and protocols guarantees investment sustainability: the created data remains accessible and reusable, independently of the technology providers.
• The momentum around Open Data and Linked Data strengthens the credibility of the approach. Though these concepts are not yet mainstream, CIOs are interested in evaluating the technology and benefits behind them, and they approve of the opportunity to extend toward a full Semantic Web project in the future, leveraging this first investment.

Conclusion

The Linked Enterprise Data strategy and the underlying Semantic Web standards represent a comprehensive response to the challenge of creating an agile, high-performance information system. Our approach has proven to be pragmatic and efficient, delivering the project agility and the data versatility expected. Our value proposition is not the technology itself: we offer to create valuable information in an agile way for business needs.

CIOs do not yet express the need for Semantic Web or Linked Data, and they have not planned to set up a triplestore in their infrastructure. But we think the Semantic Web stack is the right tool. The Linked Enterprise Data approach will prove its value even where we might not be able to convince our customers to dive directly into a global Web of Data approach. And it might allow later projects to use Semantic Web technologies more directly and openly.
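As a closing illustration, the graph saturation performed at the Build step (inference rules, in practice mostly SPARQL CONSTRUCT queries over a triplestore) amounts to applying rules until a fixpoint is reached. A minimal pure-Python stand-in, with a single hypothetical rule and made-up URIs:

```python
# Pure-Python stand-in for graph saturation (in practice: SPARQL CONSTRUCT).
# Hypothetical rule: if X worksFor Y and Y partOf Z, then X worksFor Z.

WORKS_FOR = "ex:worksFor"
PART_OF = "ex:partOf"

def saturate(triples):
    graph = set(triples)
    changed = True
    while changed:  # apply the rule until no new triple appears (fixpoint)
        changed = False
        inferred = {(s, WORKS_FOR, o2)
                    for (s, p, o) in graph if p == WORKS_FOR
                    for (s2, p2, o2) in graph if p2 == PART_OF and s2 == o}
        new = inferred - graph
        if new:
            graph |= new
            changed = True
    return graph

g = saturate({("ex:alice", WORKS_FOR, "ex:salesDept"),
              ("ex:salesDept", PART_OF, "ex:acme")})
print(sorted(g))
```

The saturated graph gains the inferred triple ("ex:alice", "ex:worksFor", "ex:acme"), which is what makes subsequent SELECT queries and sub-graph extraction simpler.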