The document discusses using a semantic approach to data to enable smart content. It describes how traditional databases store data in silos, partitioned by category and attribute. A new approach uses a NoSQL database with semantics to break down these silos. An example is given of a Saturday Night Live app that allows semantic search of talent, episodes, characters, and other metadata. The operational data hub pattern is proposed to ingest raw data from various sources, harmonize it using semantics, and let both analytical and operational applications access the normalized data. This creates smart content by breaking down data silos.
This pattern for an operational data hub grew out of real-world requirements on ongoing customer projects at Fortune 500 companies. Working closely with the on-site consultants, we captured the common patterns that shaped the overall framework.
At a high level, it comes down to a three-step process:
Data Ingestion: Source documents from multiple systems land in a MarkLogic staging area, taking full advantage of MarkLogic's 'load as is' capability.
Data Harmonization: This is where MarkLogic's secret sauce – its rich indexes and powerful query/scripting capabilities, in either XQuery or JavaScript – is used to standardize a variety of source documents into entity instances, where an entity could be a Customer, a Patient, an Insurance Claim, and so on. The entity instances are documents formatted using the envelope pattern, which is very commonly used in the field to capture the standard attributes of an entity. This is where the complexity of data silos – different models, different schemas, different languages – is turned into standardized data structures within MarkLogic, made available in what is called the 'final' area.
Serving: Serving refers to making the data accessible to any number of consuming applications – operational apps, analytical apps, and so on. Each application has different needs in terms of how it wants to access the data, what formats it can handle, etc. In other words, 'schema on read', which is another area where MarkLogic excels. Again, this is where MarkLogic's indexes and rich query capabilities play an important role. Since all the data is available as standardized entity instances in the final area, it is easy to serve that data in whatever format a consuming application needs.
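The three steps above can be sketched in plain JavaScript. This is a schematic illustration only, not MarkLogic code: the function names (ingest, harmonize, serve), the in-memory staging and final arrays, and the sample customer documents are all assumptions made for the sketch.

```javascript
// 1. Ingestion: raw source documents are kept exactly as they arrive.
const staging = [];
function ingest(rawDoc) {
  staging.push(rawDoc); // 'load as is' – no transformation on the way in
}

// 2. Harmonization: wrap each source document in an envelope that
// holds standardized entity attributes alongside the untouched source.
const finalArea = [];
function harmonize(rawDoc) {
  const envelope = {
    headers: { entityType: "Customer" },
    // Canonical attributes, mapped from whatever shape the source used.
    instance: {
      firstName: rawDoc.fname || rawDoc.first_name,
      lastName: rawDoc.lname || rawDoc.last_name,
    },
    // Original document preserved verbatim inside the envelope.
    source: rawDoc,
  };
  finalArea.push(envelope);
  return envelope;
}

// 3. Serving: schema on read – shape the canonical instance to fit
// each consuming application at query time.
function serve(envelope, format) {
  const e = envelope.instance;
  if (format === "json") return JSON.stringify(e);
  if (format === "xml")
    return `<Customer><firstName>${e.firstName}</firstName><lastName>${e.lastName}</lastName></Customer>`;
  throw new Error(`unsupported format: ${format}`);
}

// Two sources, two shapes, one canonical entity.
ingest({ fname: "Ada", lname: "Lovelace" });
ingest({ first_name: "Alan", last_name: "Turing" });
const envelopes = staging.map((doc) => harmonize(doc));
console.log(serve(envelopes[0], "json"));
console.log(serve(envelopes[1], "xml"));
```

Note how the two source documents use different field names, yet both applications read the same canonical instance – the mapping happens once, in harmonization, and the output shape is decided per consumer at serving time.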
The best practices captured here are:
Separation of staging and final areas – typically these are two different databases, with different indexes, different security policies, and different lifecycle or storage management policies associated with them.
Concept of the envelope pattern to capture standardized entity attributes without throwing away the original source documents.
Implementation best practices – making sure guardrails are in place when implementing each of the steps, particularly data harmonization.
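One guardrail for the harmonization step can be sketched as a validation check that flags an entity instance missing required canonical attributes before it is written to the final area. The function and attribute names below are illustrative assumptions, not MarkLogic APIs.

```javascript
// Required canonical attributes for a hypothetical Customer entity.
const REQUIRED_CUSTOMER_ATTRS = ["firstName", "lastName"];

// Check an envelope before it is promoted to the final area.
function validateEnvelope(envelope) {
  const missing = REQUIRED_CUSTOMER_ATTRS.filter(
    (attr) => envelope.instance == null || envelope.instance[attr] == null
  );
  return { ok: missing.length === 0, missing };
}

// A well-formed envelope passes...
const good = validateEnvelope({
  headers: { entityType: "Customer" },
  instance: { firstName: "Ada", lastName: "Lovelace" },
  source: { fname: "Ada", lname: "Lovelace" },
});

// ...while one with an unmapped attribute is flagged instead of
// silently landing in the final area.
const bad = validateEnvelope({
  headers: { entityType: "Customer" },
  instance: { firstName: "Alan" },
  source: { first_name: "Alan" },
});
console.log(good.ok, bad.ok, bad.missing);
```

A check like this catches mapping gaps at harmonization time, where they are cheap to fix, rather than in a consuming application.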