Our information system, like any other corporate IS, is overflowing with information of all types. Most of this information is unstructured.
And part of it is structured, mostly because of the relational databases underlying the business applications we run internally: CRM, ERP, support tracking, …
Many approaches have been developed to solve this problem of isolated silos. Most of them apply only to structured data (BI, MDM). And in most cases they entail a long and costly deployment process and make the system more complex.
Enterprise search is not a solution. And we know that for sure, since we are a leading vendor of search solutions. The problem lies in the very nature of current search engines: they are document oriented. They read documents, they index documents, they return documents.
This is what we want: agile information, meshed, merged, enriched.
What you see is not data mashup! It is not just data put side by side. Some of the information you see here needs advanced processing that cannot be done on the fly.
The solution is to change the paradigm: forget the applications and the APIs. Just look at the data.
We need to create the Enterprise Graph.
There is a solution: one that was thought out and designed for the Web. If it works for the Web, it should work for you and for us.
The architecture for integrating data from various silos on the Web relies on a federated principle: a query is synchronously distributed over the sources through SPARQL endpoints exposed by each of them. This approach presents many scientific and technological challenges, but given the rationale behind the Web of Data and the need to operate in the gigantic open Web space, it seems to be the only reasonable way to make it work.
Though theoretically correct, this approach is not applicable to the corporate IS, for a large variety of reasons:
• The corporate information system is built from numerous legacy or closed applications that cannot be adapted or extended with SPARQL endpoints.
• The enterprise information realm is made up of 80% unstructured or semi-structured data that cannot fit the model as such.
• Enterprises do not want access to raw data in RDF format. They want to reap valuable information derived from the data, and creating these new informational objects requires large and complex computations.
• The bottom-up approach of mapping silos and their data to RDF requires an enormous amount of work to define vocabularies or ontologies for each source, which is too heavy an investment.
• Companies dream of seamlessly integrating external data to leverage their internal information. But this external data is mostly available in XML or JSON through Web services, not yet in RDF, so using SPARQL as the way to query and integrate it does not make sense.
• IT departments have invested heavily in their “relational database for storing / XML for exchanging / Web apps for accessing” infrastructure. Their staffs are trained for this paradigm. They lack the in-house skills to adopt the graph way of thinking.
• Stability matters most, and Semantic Web technology is unknown, considered new and immature: CIOs are not ready to take the risk of adding load and technological uncertainty to systems that are critical to the company’s daily business operations.
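The federated principle discussed above boils down to a synchronous fan-out: the same query is dispatched to every silo's endpoint and the partial answers are merged. A minimal sketch in plain Python, where in-memory triple sets stand in for real SPARQL endpoints (all names and data are illustrative):

```python
# Minimal sketch of the federated principle: one triple pattern is
# dispatched synchronously to every source, partial results are merged.
# The "endpoints" are plain in-memory triple sets standing in for real
# SPARQL endpoints exposed by each silo.

crm = {("ex:acme", "ex:contact", "ex:alice"),
       ("ex:acme", "ex:status", "ex:customer")}
erp = {("ex:acme", "ex:invoice", "ex:inv42"),
       ("ex:globex", "ex:invoice", "ex:inv43")}

def match(endpoint, s=None, p=None, o=None):
    """Return the triples of one source matching a pattern (None = wildcard)."""
    return {t for t in endpoint
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

def federated_query(endpoints, s=None, p=None, o=None):
    """Fan the pattern out over all sources and merge the answers."""
    results = set()
    for endpoint in endpoints:
        results |= match(endpoint, s, p, o)
    return results

# Everything known about ex:acme, across both silos:
about_acme = federated_query([crm, erp], s="ex:acme")
```

This also makes the weakness visible: every source must answer synchronously, which is exactly what legacy corporate applications cannot do.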
The global RDF repository approach does not work either, because of process issues (modeling effort, required know-how) and technology issues (performance, scalability). Besides, enterprises do not care about technology for its own sake, especially a new one.
We tailor the Normalize process by aligning field contents in order to mesh data coming from different sources (such as records from a CRM and an ERP). For databases, we provide an R2RML and Direct Mapping compliant module named db2triples.
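Field alignment of this kind can be sketched in a few lines. This is not Antidot's actual normalization code, just an illustration of the idea: field contents from different silos are canonicalized so that records describing the same real-world entity become comparable (field names and cleanup rules are assumptions):

```python
# Sketch of the Normalize step: align field contents from different silos
# so that records about the same entity can later be linked in the graph.
# The cleanup rules (accents, legal suffixes) are illustrative.
import re
import unicodedata

def normalize_name(raw):
    """Lowercase, strip accents, drop common legal suffixes, collapse spaces."""
    s = unicodedata.normalize("NFKD", raw)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = re.sub(r"\b(inc|ltd|sa|gmbh)\b\.?", "", s.lower())
    return re.sub(r"\s+", " ", s).strip()

crm_record = {"company": "Société Générale SA"}     # hypothetical CRM field
erp_record = {"client_name": "societe generale"}    # hypothetical ERP field

same_entity = normalize_name(crm_record["company"]) == \
              normalize_name(erp_record["client_name"])
```

After normalization, both records resolve to the same key and can be meshed into one node of the graph.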
“Why do we transform only a subpart of the harvested data into RDF, and what do we do with the rest of it?” Indeed, besides the fact that text documents are not graph-friendly, as stated above, we transform only a selected part of the structured data into RDF. From a technical standpoint, we do not feel the technology is mature and stable enough to proceed differently: in industrial projects, millions of seed objects are regularly extracted from the sources (invoices, clients, files, etc.), each having tens of fields, and billions of triples do not scale well in available triplestores. Transforming only a subpart of the data also greatly simplifies the task of choosing the predicates, which reinforces the choice of using many small available vocabularies instead of big ontologies. The data that is not transformed into RDF is stored by Information Factory for later use during the Build step.
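The order of magnitude behind this argument is easy to check with the figures quoted above (millions of seed objects, each with tens of fields); the exact numbers below are of course only representative:

```python
# Back-of-the-envelope check of the scale argument: transforming every
# field of every object into a triple quickly approaches billions.
objects = 10_000_000          # "millions of seed objects"
fields_per_object = 50        # "tens of fields" each

triples_if_all = objects * fields_per_object   # one triple per field
triples_if_subset = objects * 5                # cherry-pick ~5 linking fields

# 500_000_000 triples if everything is converted, versus
# 50_000_000 when only the fields needed for linking are kept.
```

A tenfold reduction is the difference between a triplestore that struggles and one that works.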
Unstructured documents like office files, PDF files or email content do not fit the RDF formalism and cannot be linked to the graph as such. Extra work is necessary. First, we transform available metadata (document name, author, creation date, sender and receivers for an email, subject, and so forth) into RDF. Then, we use text-mining technology to extract named entities (people, organizations, products, etc.) from the documents. The entity lists are generated from different sources within the enterprise: directories, CRM and ERP provide people and company names, while products are listed in ERPs or taxonomies.
And last, we run various algorithms specifically designed for document-versus-document comparison, to detect duplicates, different versions of the same document, inclusions, semantically related documents, etc. Each of these relations is inserted into the graph with an appropriate predicate.
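One common family of techniques for such comparisons is shingling with Jaccard similarity; the sketch below uses it as a stand-in for the actual algorithms, with illustrative predicates and thresholds:

```python
# Sketch of the document-comparison step: pairwise similarity between
# documents yields new relation triples for the graph. Word shingling +
# Jaccard similarity is one classic technique; predicates and the 0.5
# threshold are illustrative choices, not the product's actual ones.

def shingles(text, k=3):
    """Set of k-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

docs = {
    "doc:1": "quarterly sales report for the acme account",
    "doc:2": "quarterly sales report for the acme account",
    "doc:3": "draft quarterly sales report for the acme account",
}

triples = set()
ids = sorted(docs)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        sim = jaccard(docs[a], docs[b])
        if sim == 1.0:
            triples.add((a, "is_duplicate_of", b))
        elif sim > 0.5:
            triples.add((a, "is_similar_to", b))
```

Each detected relation becomes one more edge in the Enterprise Graph.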
It is like cooking: the rules are your own personal touch. The rules depend on the information and knowledge you want to create by inferring over the graph.
We created the graph by inserting basic triples. Then we grew the graph by enriching it and inferring over it. Now it is time to extract the information we need. For this, we first select the resources we are looking for. Then we follow links to gather the information and create basic objects.
We agree: we would all like to see these technologies invading the information system. We would like to put these stickers on this beautiful zSeries mainframe. But what does that mean? How can we do it?
ISWC 2012 - Industry Track - Linked Enterprise Data: leveraging the Semantic Web stack in a corporate IS environment.
Linked Enterprise Data LEVERAGING THE SEMANTIC WEB STACK IN A CORPORATE ENVIRONMENT ISWC 2012 – BOSTON FABRICE LACROIX – LACROIX@ANTIDOT.NET Copyright Antidot™
Antidot – who we are French-based Software Vendor Since 1999 | Paris, Lyon, Aix-en-Provence Information access | Data management Mission: Provide our customers with innovative customizable solutions that help them create value with their data, and make their employees more aware and efficient.
What we want ERP CRM Production LDAP ECM Support Files
Changing the paradigm Switching from an application view to a data-centric way of thinking.
Bring out the implicit Build the Giant Enterprise Graph
LED Linked Enterprise Data: the application of Semantic Web technologies and Linked Data principles to the enterprise infrastructure
What works for the Web… Federating silos on the Web http://www.w3.org/People/Ivan/CorePresentations/RDFTutorial/Slides.html#(102)
…can’t always be used in corporate IS Legacy apps can’t be "Sparql’ed" 80% un- or semi-structured data don’t fit the model as such Defining vocabularies/ontologies for silos is too complex and expensive Enterprises don’t want RDF per se but valuable information External data is available in XML/JSON through Web Services Staff trained for RDB, XML, Web apps No-risk and stability strategy: SemWeb technology considered new and immature
The RDF/storage approach Setting up a global RDF repository does not work either IT departments are scared off by the "RDF everywhere" activists
Semantic Web technology is still the right solution in a corporate environment BUT it is not an aim JUST use it as a means
Just do it Think of it as a stream paradigm build new objects using existing data without interfering with the existing infrastructure with SemWeb somewhere under the hood
Enterprise Graph HowTo Construct the graph generate triples from data create triples from documents Leverage the graph enrich infer Browse the graph select resources build objects Trash the graph
How: extract & normalize Harvest and normalize as in an ETL: fetch, clean, transform… normalize records (names, IDs) to prepare the linking step For databases: db2triples, an RDB2RDF implementation by Antidot (open source, W3C validated)
How: semantize Don’t transform everything into RDF cherry-pick a subset of interesting fields for each object and create their RDF triples counterpart interesting == needed for linking or inferring Semantize
How: semantize Triples generation Be smart: avoid upfront ontology design, use small vocabularies Be pragmatic: transform XML tags and field names to predicates Be agile: only insert what you need. And when you need more, add more. Semantic Web fuels the modeling, linking and information building process
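The pragmatic mapping advocated on this slide, turning XML tags and field names into predicates and reusing small well-known vocabularies where one fits, can be sketched as a lookup table with a fallback namespace. The table entries and the `ex:` namespace are illustrative:

```python
# Sketch of pragmatic predicate mapping: known field names are mapped to
# small-vocabulary predicates (Dublin Core, FOAF); anything else falls
# back to a local namespace. The table and namespace are illustrative.

KNOWN = {
    "author": "dc:creator",
    "title": "dc:title",
    "created": "dc:date",
    "email": "foaf:mbox",
}

def field_to_predicate(field, local_ns="ex:"):
    """Map a field name to a predicate, preferring small vocabularies."""
    f = field.strip().lower()
    return KNOWN.get(f, local_ns + f)

record = {"Author": "A. Turing", "invoice_id": "42"}   # hypothetical record
triples = {("doc:1", field_to_predicate(k), v) for k, v in record.items()}
```

New fields can be added to the table when and only when they are needed, which is exactly the "be agile" point above.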
Enterprise Graph HowTo Construct the graph generate triples from data create triples from documents Leverage the graph enrich infer Browse the graph select resources build objects Trash the graph
How: semantize Unstructured documents Extract metadata and transform them as needed into RDF ➡ Ex: author => dc:creator Use text mining to extract named entities: people, organizations, products… ➡ generate the entity lists using the data sources: directory for employees, CRM for companies and people, ERP for products ➡ create triples like doc_URI quotes entity_URI
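A gazetteer lookup built from those same enterprise sources is a deliberately simple stand-in for real text mining, but it shows how each match becomes a `doc quotes entity` triple (entity names and URIs are made up for the example):

```python
# Sketch of entity extraction: entity lists come from the enterprise's own
# silos (directory, CRM, ERP); each surface-form match in a document's
# text yields one `doc quotes entity` triple. A plain substring lookup
# stands in for real text-mining technology.

gazetteer = {                                 # surface form -> entity URI
    "alice martin": "person:alice.martin",    # from the directory
    "acme": "org:acme",                       # from the CRM
    "widgetpro": "product:widgetpro",         # from the ERP
}

def extract_quotes(doc_uri, text):
    lowered = text.lower()
    return {(doc_uri, "quotes", uri)
            for form, uri in gazetteer.items() if form in lowered}

triples = extract_quotes("doc:17",
                         "Alice Martin ordered a WidgetPro for Acme.")
```

Real entity extraction needs tokenization and disambiguation, but the output shape, one triple per detected entity, is the same.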
How: semantize Unstructured documents Compare documents using various dedicated algorithms ➡ is the same ➡ is included ➡ is similar ➡ is related Generate new triples ➡ create triples like <docA> is_sub_version_of <docB>
Enterprise Graph HowTo Construct the graph generate triples from data create triples from documents Leverage the graph enrich infer Browse the graph select resources build objects Trash the graph
How: enrich Enrich the graph run specific algorithms to generate more links and triples (classifiers, topic detection, …) insert external data gathered from the LOD or other external datasets or APIs
How: infer Create new knowledge add rules according to your needs IF a coworker is quoted in documents AND this coworker belongs to a business unit THEN the business unit is bound to the documents
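The example rule above can be written as a one-shot forward-chaining pass over the triples. Predicate names (`quotes`, `member_of`, `related_to_unit`) are illustrative:

```python
# The slide's rule as a forward-chaining sketch:
# IF a coworker is quoted in a document AND belongs to a business unit,
# THEN bind the business unit to the document.

graph = {
    ("doc:17", "quotes", "person:alice"),
    ("person:alice", "member_of", "bu:sales"),
}

def apply_rule(graph):
    """One pass of the rule; returns the graph plus the inferred triples."""
    new = set()
    for s, p, o in graph:
        if p != "quotes":
            continue
        for s2, p2, o2 in graph:
            if p2 == "member_of" and s2 == o:
                new.add((s, "related_to_unit", o2))
    return graph | new

enriched = apply_rule(graph)
```

A production rule engine would iterate to a fixpoint and handle rule priorities, but each rule has this same shape: match a pattern, emit new triples.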
Enterprise Graph HowTo Construct the graph generate triples from data create triples from documents Leverage the graph enrich infer Browse the graph select resources build objects Trash the graph
How: build Build: select resources corresponding to object seeds (using SPARQL queries) for each seed, follow links smartly in order to create basic objects
How: build Finalize: decorate the new knowledge objects with the data set apart earlier (not loaded in the triplestore) now we have rich user-actionable objects
Enterprise Graph HowTo Construct the graph generate triples from data create triples from documents Leverage the graph enrich infer Browse the graph select resources build objects Trash the graph
How: expose Make the new information available to users and to the entire IS Relational DB Enrich Harvest Semantize RDF Triplestore (Linked Data) Normalize Classify Annotate Indexation AFS search engine
Conclusion It works! The triples we create and the inference rules we add are dictated by the goal / application ➡ usage and value oriented We benefit from the lazy-flexible-dynamic modeling of RDF-RDFS-OWL ➡ we are agile What matters is the graph. But the graph is not the triplestore ➡ storage independent
There’s an app for that Antidot Information Factory a software solution designed specifically to leverage structured and unstructured data enable large-scale processing of existing data automate publishing of enriched or newly created information. Harvest Normalize Semantize Enrich Build Expose
The Giant Enterprise Graph Now we have a path to let SemWeb enter the enterprise
Discuss Understand Learn Exchange www.antidot.net firstname.lastname@example.org THANKS FOR YOUR ATTENTION QUESTIONS?