The Informed Data Lake Strategy
Historically, the volume and extent of data that an enterprise could store, assemble,
analyze, and act upon exceeded the capacity of its computing resources and was too
expensive to manage in full. The solution was to model an extract of the available data
into a data model or schema, presupposing what was "important," and then fit the
incoming data into that structure.
But the economics of data management today allow for the gathering of practically any
data, in any form, skipping the engineered schema on the presumption that
understanding the data can happen on an as-needed basis.
This newer approach of putting data in one place for later use is now described as a
Data Lake. But sooner or later, one has to pay the piper: the Data Lake approach
involves mechanical, time-consuming data preparation and filtering that is often
one-off, consumes a large percentage of the data scientist's time, and provides no
reference to the content or meaning of the data.
The alternative is the Informed Data Lake. The difference between an Informed Data Lake
and the static "dumb" data neatly arranged in a Data Lake is a comprehensive set of
capabilities that provide a graph-based, linked, and contextualized information fabric
(semantic metadata and linked datasets) into which NLP (Natural Language Processing),
sentiment analysis, rules engines, connectors, canonical models for common
domains, and cognitive tools can be plugged to turn "dumb" data into
information assets with speed, agility, reuse, and value.
Today, organizations in pharmaceuticals, life sciences, financial services, retail,
government, and many other industries are seeking ways to make the full
extent of their data more insightful, valuable, and actionable. Informed Data Lakes are
leading the way through graph-based data discovery and investigative analytics tools
and techniques that uncover hidden relationships in your data, while enabling iterative
question-and-answer exploration of that linked data.
The Problem with a Traditional Data Lake Approach
Anyone tasked with analyzing data to understand past events and/or predict the
future knows that the data assembled is always "used." It's secondhand. Data is
captured digitally almost exclusively for purposes other
than analysis. Operational systems automate processes and capture data to record the
transactions. Document data is stored in formats for presentation and written in flowing
prose without obvious structure; it's written to be read (or simply to record), not to be
mined for later analysis. Clickstream data in web applications is captured and stored in
a verbose stream, but has to be reassembled for sense-making.
Organizations implementing or just contemplating a data lake often operate under the
misconception that having the data in one place is an enabler to broader and more
useful analytics leading to better decision-making and better outcomes. There is a large
hurdle facing these kinds of approaches – while the data may be in one place physically
(typically a Hadoop cluster), in essence all that is created is a collection of data siloes,
unlinked and not useful in a broader context, reducing the data lake to nothing more
than a collection of disparate data sources. Here are the issues:
• Data quality issues: the data is not curated, so users have to deal with duplicate
data, stale data, and contextually wrong data
• Data can be in different formats or different languages
• Uncertainty about who is permitted to process and use the data
• The ever-changing nature of diverse data requires that the processing and
analysis of the data be dynamic, evolving as the data changes or as needs change
The effort of adding meaning and context to the data falls on the shoulders of the
analysts, a true time-sink. Data Scientists and Business Analysts are spending 50-80% of
their time preparing and organizing their data and only 20% of their time analyzing it.
But what if data could describe itself?
What if analysts could link and contextualize data from different domains
without having to go through the effort of curating it for themselves?
What if you had an Informed Data Lake to address all those issues and more?
The Argument for an Informed Data Lake
Existing approaches to curating, managing, and analyzing data (and metadata) are mostly
based on relational technology, which, for performance reasons, usually simplifies data
and strips it of meaningful relationships while locking it into a rigid schema. The
traditional approach is to predefine what you want to do with the data, define the
model, and then create successive versions of physical optimizations and subsets for
special uses. Overall, you use the data as you originally designed it; if changes are
needed, going back to redesign and modify is complicated.
The rapid rise of interest in "big data" has spawned a variety of technology approaches
to solve, or at least ease, this problem, such as text analytics and bespoke applications
of AI algorithms. They work. They perform functions that are too time-consuming to do
manually, but they are incomplete because each one is too narrow, aimed at only a
single domain or document type, or too specific in its operation. They mostly defy the
practice of agile reuse because each new source, or even each new special extraction for
a new purpose, has to start from scratch.
Given these limitations, where does one turn for help? An Informed Data Lake is the
answer.
At the heart of the Informed Data Lake approach is the linking and contextualizing of all
forms of data using semantic-based technology. Though descriptions of semantic
technology are often complicated, the concept itself is actually very simple:
- It supplies meaning to data that travels with the data
- The model of the data is updated on the fly as new data enters
- The model also captures and understands the relationships between things, from
which it can perform a certain level of reasoning without programming
- Information from many sources can be linked, not through views or indexes, but
through explicit and implicit relationships that are native to the model
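A minimal sketch can make these points concrete. The triple store below is invented for illustration (the `acme:` names are hypothetical, not drawn from any real vocabulary or product): data arrives together with statements describing what it is, and new facts extend the graph at any time without a schema change.

```python
# Sketch: data carried as subject-predicate-object triples, so its
# meaning travels with it, and new facts extend the model on the
# fly -- no schema migration required.
triples = set()

def add(s, p, o):
    triples.add((s, p, o))

# Data and its description arrive together.
add("acme:order-17", "rdf:type", "acme:Order")
add("acme:order-17", "acme:placedBy", "acme:cust-9")
add("acme:cust-9", "rdf:type", "acme:Customer")

# A new relationship can be added at any time without redesign.
add("acme:Customer", "acme:servedBy", "acme:SalesTeam")

def query(s=None, p=None, o=None):
    """Pattern match over the graph; None acts as a wildcard."""
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if (s is None or s == s2)
            and (p is None or p == p2)
            and (o is None or o == o2)]

# "What kind of thing is order-17, and what is it linked to?"
print(query("acme:order-17", None, None))
```

Because every fact, including the description of the data itself, is just another triple, the same `query` pattern serves both.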
Conceptually, the Informed Data Lake is a departure from the earliest principle of IT:
Parsimonious development derived from a mindset of managing from scarcity1 and
deploying only simplified models. Instead of limiting the amount of data available, or
even the access to it, Informed Data Lakes are driven by the abundance of resources
and data. Semantic metadata provides the ability to find, link and contextualize
information in a vast pool of data.
1 Managing from scarcity has historically driven IT to develop and deploy using the least amount of computing
resources, under the assumption that these resources were precious and expensive. In the current computing
economy, the emphasis has shifted away from scarcity of hardware to scarcity of the time and attention of
knowledge workers.
The Informed Data Lake works because it is based on a dynamic semantic modeling
approach built on graph-driven ontologies. In technical terms, an ontology represents
the meaning and relationships of data in a graph: an extremely compact and efficient
way to define and use disparate data sources via semantic definitions based on business
usage, including terminology and rules that can be managed by business users:
• Source data, application interfaces, operational data, and model metadata are
all described in a consistent ontology framework supporting detailed semantics
of the underlying objects. This means constraints on types, relations, and
description logic, for example, are handled uniformly for all underlying sources.
• The ontology represents both schema and data in the same way. This means
that the description of metadata about the sources also represents a machine-
readable way of representing the data itself for translation, transmission,
query, and storage.
• An ontology can richly describe behavior of services and composite applications
in a way that a relational model can only do by being tightly bound to the
application.
• The ontology is a run-time model, not just a design-time model. The ontology is
used to generate rules, mappings, transforms, queries, and UI because all of the
elements are combined under a single structure.
• There is no reliance on indexes, keys, or positional notation to describe the
elements of the ontology. Implementations do not break when local changes occur.
• An ontological representation encourages both top-down, conceptual
description and bottom-up, source- or silo-based representation of existing
data. In fact, these can be in separate ontologies and easily brought together.
• The ontology is designed to scale across users, applications, and organizations.
Ontologies can easily share elements in an open and standard way, and
ontology tools (for design, query, data exchange, etc.) don't have to change in
any way to reference information across ontologies.
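The "schema and data in the same way" point can be illustrated with a toy graph (the `ex:` names are invented for illustration): class definitions and instance data are both ordinary triples, so one traversal mechanism answers questions about either, including simple subclass reasoning of the kind described above.

```python
# Sketch: schema (a class hierarchy) and instance data live in one
# graph, so a single traversal answers questions about either.
graph = {
    ("ex:Drug",     "rdfs:subClassOf", "ex:Compound"),  # schema fact
    ("ex:aspirin",  "rdf:type",        "ex:Drug"),      # instance fact
    ("ex:trial-42", "ex:tested",       "ex:aspirin"),
}

def types_of(node):
    """Return the node's direct and inherited classes."""
    direct = {o for (s, p, o) in graph if s == node and p == "rdf:type"}
    result = set(direct)
    frontier = list(direct)
    while frontier:
        cls = frontier.pop()
        for (s, p, o) in graph:
            if s == cls and p == "rdfs:subClassOf" and o not in result:
                result.add(o)
                frontier.append(o)
    return result

# Reasoning against the data itself: aspirin is a Drug, and
# because Drug is a subclass of Compound, also a Compound.
print(types_of("ex:aspirin"))
```

Note that adding a new class or instance is just adding another triple to `graph`; nothing about `types_of` has to change.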
Assuming a data lake is built for a broad audience, it is likely that no one party will
have the complete set of data they think is of interest. Instead, it will be a union of all
of those ideas, plus many more that arise as things are discovered, situations evolve,
and new sources of data become available. Thinking in the existing mode of database
schema design, the inadequate metadata features of Hadoop, and managing from
scarcity in general will fail under the magnitude of this effort. What the Informed Data
Lake does is take the guesswork out of what the data means, what it's related to, and
how it can be dynamically linked together, without endless data modeling and
re-engineering.
All of the features and capabilities below are needed to keep a data lake from turning
into a data swamp, where no one quite knows what it contains or if it is reliable.
Informed Data Lake Features:
• Connectors to practically any source
• Graph-based, linked, and contextualized data
• Dynamic ontology mapping
• Auto-generated conceptual models
• Advanced text analytics
• Annotation, harmonization, and canonicalization
• "Canonical" models to simplify ingesting and classifying new sources
• Semantic querying and data enrichment
• Fully customizable dashboards
• Full data provenance adhering to IT standards
Sample Informed Data Lake Capabilities:
• Manage business vocabulary along with technical syntax
• Actively resolve differences in vocabulary across different departments,
business units, and external data
• Support consistent assignment of business policies and constraints across
various applications, users, and sources
• Accurately reflect all logical consequences of changes and dynamically reflect
change in affected areas
• Unify access to content and data
• Assure and manage reuse of linked and contextualized data
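As a sketch of the vocabulary-resolution capability (all field names here are hypothetical), local departmental terms can be mapped onto one shared business term so that records from different units join cleanly:

```python
# Sketch: resolving departmental vocabulary differences by mapping
# local terms onto one shared (canonical) business term.
canonical = {
    "cust_no":    "customer_id",   # sales department
    "client_ref": "customer_id",   # support department
    "customerId": "customer_id",   # web analytics feed
}

def harmonize(record):
    """Rename a record's fields to the shared vocabulary."""
    return {canonical.get(k, k): v for k, v in record.items()}

sales   = harmonize({"cust_no": 9, "region": "EMEA"})
support = harmonize({"client_ref": 9, "tickets": 3})
# Both records now join on the same field name, "customer_id".
print(sales, support)
```

In a real system this mapping would itself live in the ontology rather than in code, so business users could maintain it.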
A vendor providing metadata based on semantic technology is in a unique position to
deliver the capabilities required to build and deploy the Informed Data Lake. Such a
platform is based on open standards and has taken a semantic approach from its
beginnings. In addition, it has incorporated a very rich tool set that includes dozens of
third-party applications that operate seamlessly within the Informed Data Platform™.
This is central to the ability to move the task of data integration and data extraction
up to knowledge integration and knowledge extraction, without which it is impossible
to fuel solutions in areas such as competitive intelligence, insider-trading surveillance,
investigative analytics, Customer 360, and risk and compliance, as well as feeding
existing BI applications (a requirement that is not going away anytime soon).
An Informed Data Lake Solution
The specific design pattern of the Informed Data Lake enables data science because
analytics does not end with a single hypothesis test. Simple examples of “Data
Scientists” building models on the data lake and saving the organization vast sums of
money make good copy, but they do not represent what happens in the real world.
Often, the first dozen hypotheses are either obvious or non-demonstrable. When the
model characterization comes back, it presents additional components to validate and
cross-correlate. It is this discovery process that the data lake somehow needs to
facilitate, and it needs to facilitate it well, otherwise the cost of the analytics is too high
and the process is too slow to realize business value.
To enable that continuous improvement process of deep analytics requires more than a
data strategy; it needs a tool chain to support model refinement, and the best-known
method to date is the Informed Data Lake. The significant pain point for deep analytics
is refinement, and the lower the refinement costs are, the more business value can be
realized.
At some point you may have heard the criticism of BI and OLAP tools that you were
constrained to the questions that were implicit in their models. In fact, the same
criticism has been leveled at data warehouses. The fact remains that both data
warehouses and BI tools limit your questions to those that can be answered, not just
with the available data, but by how it is arranged physically and how well the query
optimizer can resolve the query.
Now imagine what would be possible if you could ask any question of the data in a
massive data lake. This is where the Informed Data Lake comes into play.
Catalog capabilities allow for massive amounts of metadata and instantaneous access to
it, so any user (or process) can "go shopping" for a dataset that interests them.
Because the metadata is constructed in the form of an in-memory graph, linking and
joining data of far different structures, perhaps never linked before, can be done on
the fly.
Through a browser-like interface, the graph can show you not only the typical ways
different datasets can be linked and joined; it can even recommend other datasets that
you haven't considered. This is where the "Informed" really comes in.
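One way such a recommendation could work, sketched here with an invented catalog (dataset and field names are hypothetical), is to score other datasets by the fields they share with the one a user has selected:

```python
# Sketch: a catalog recommends datasets to join by finding other
# datasets that share a linkable field with the one selected.
catalog = {
    "trials":     {"patient_id", "drug", "site"},
    "pharmacy":   {"patient_id", "drug", "refill_date"},
    "weather":    {"date", "region"},
    "physicians": {"site", "specialty"},
}

def recommend(selected):
    """Rank other datasets by how many fields they share."""
    chosen = catalog[selected]
    scored = [(len(chosen & fields), name)
              for name, fields in catalog.items() if name != selected]
    return [name for score, name in sorted(scored, reverse=True) if score]

# "pharmacy" shares two fields with "trials"; "physicians" shares one.
print(recommend("trials"))
```

A semantic catalog would go further, matching on meaning (synonyms, subclasses) rather than exact field names, but the ranking principle is the same.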
Once data is selected, the in-memory graph (processed at up to six million "triples" per
second) is analyzed and traversed to provide the instantaneous joins that would be
impossible in a relational database. The net result is that arbitrarily complex models and
tools can ask any question with unlimited joins, as a result of processing optimized for
multi-core CPUs, very large memory models, and fast interconnects across processing
nodes.
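The traversal-as-join idea can be sketched as follows (entities and predicates are invented; throughput aside, the mechanics are edge-following rather than key matching):

```python
# Sketch: joining three "tables" by walking graph edges instead of
# matching foreign keys at query time.
from collections import defaultdict

# Index triples by (subject, predicate) for O(1) edge lookup.
edges = defaultdict(list)
for s, p, o in [
    ("trial-1",   "enrolled",   "patient-7"),
    ("patient-7", "prescribed", "drug-A"),
    ("drug-A",    "reaction",   "nausea"),
]:
    edges[(s, p)].append(o)

def traverse(start, *predicates):
    """Follow a chain of predicates from a starting node."""
    frontier = [start]
    for p in predicates:
        frontier = [o for s in frontier for o in edges[(s, p)]]
    return frontier

# A three-way "join" (trial -> patient -> drug -> reaction)
# expressed as a single traversal.
print(traverse("trial-1", "enrolled", "prescribed", "reaction"))
```

Because each hop is a direct edge lookup, the cost grows with the path length and result size, not with the number of tables being joined.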
Informed Data Lake in Action
Pharma R&D Intelligence:
Clinical trials involve great quantities of data from many sources: a perfect problem for
an Informed Data Lake. The Informed Data Lake allows the loading, ingestion, and
unification of the data without knowing a priori what analytics will be needed. In
particular, evaluating drug response would link many sources of data, following
participants with the severity and occurrence of adverse drug reactions across multiple
trials, as well as other, as-yet-unknown classes of data.
Clinical trial data investigators and analysts can see the value of the graph-based
approach through linking and contextualization they could not achieve otherwise. They
see many benefits, including:
• Identifying patients for enrollment based on more substantive criteria
• Monitoring in real time to identify safety or operational signals
• Blending data from physicians and CROs (contract research organizations)
Insider Trading and Compliance Surveillance:
In the financial services space, the combination of deep analysis of large datasets with
targeted queries of specific events and people gives firms and regulators an opportunity
to catch wrongdoing early:
• Identify an employee who has an unusually high level of suspicious trading activity
• Spot patterns in which certain employees have histories of making the exact
same trades at the exact same times
• Compare employees' behaviors to their past histories, and spot situations where
employees' trading patterns make sudden, drastic changes
Making sense of data lakes takes discipline because a one-off approach will drain your
best resources of time and patience. The Informed Data Lake approach, complete with a
suite of NLP, AI, graph-based models and semantic technology is the sensible approach.
Your two most expensive assets are staff and time. The Informed Data Lake allows you
to do your work faster and cheaper, with more flexibility and greater accuracy,
which has a major impact on your business. Without the Informed Data Lake, the data is
a bewildering collection of pieces that analysts and data scientists can understand only
in fragments, diluting the value of the data lake.
The whole extended fabric of an ontology solution, with its ability to plug in third-party
capabilities, collapses the many layers of logical and physical models in traditional data
warehousing/business intelligence architectures into a single model. With the Informed
Data Lake approach, tangible benefits accrue:
• Widespread understanding of the model across many domains in the
organization
• Rapid implementation of new studies and applications by expanding the model,
not re-designing it (even small adjustments to relational databases involve
development at the logical, physical, and downstream models, with
time-consuming consequences)
• Application of Solution Accelerators that provide bundled models by
industry/application type that can be modified for your specific need
• "Data Democratization," making data available to users across the organization
for their own data discovery and analytic needs, extracting greater value from it
• Discovering hidden patterns in relationships, something not possible with the
rotational and drill-down capabilities of BI tools
• The ability for iterative question-and-answer exploration, continuous data
discovery, and run-time analytics across huge amounts of data and, more
importantly, linked data from sources not typically associated previously
In conclusion, the Informed Data Lake turns a disparate collection of data sources of
unknown origin, quality, and currency into a facility for almost limitless exploration
and discovery.