Practical Advice About Unstructured Data
By Neil Raden firstname.lastname@example.org
Principal Analyst, Hired Brains Research, LLC
Dealing with unstructured data requires a knowledge integration process as opposed to a data integration process, and it is excruciatingly difficult without a model-based approach.

Analytics and “Used Data”
Anyone tasked with analyzing data to understand past events and/or predict the future knows that the data assembled is always “used.” It’s secondhand. Data is captured digitally almost exclusively for purposes other than analysis. Operational systems automate processes and capture data to record the transactions. Document data is stored in formats for presentation and written in flowing prose without obvious structure; it’s written to be read (or just recorded), not mined for later analysis. Clickstream data in web applications is captured and stored in a verbose stream, but it has to be reassembled for sense-making.
Another fact that analysts know is that a single source of data may be useful, but it
becomes exponentially more useful when it can be combined with other sources, a
process called integration or blending. Integrating data from internal, structured sources, such as ERP and CRM systems, is actually difficult enough, as we learned from data warehousing, but best practices and useful tools emerged over the past two decades. Getting around semantic mismatch across different sources is still an issue and cannot be solved merely by pointing and clicking at column names. However, current
practices are still too time-consuming and rigid – different applications and/or areas of
an organization may integrate the same data from the same sources, but they are
duplicative, incompatible and rigid. No matter how well data warehouses, BI and ETL
served (and continue to serve) organizations, business requirements today demand a
better solution using:
- Multiple File Formats
- Multiple Intake Sources
- Multiple Entities and Concepts
- Learning to capture data from logs and streaming APIs
Integrating external data sources, especially those that we call “unstructured1” is so
challenging that often the time it takes to process it with structured data integration
techniques and manual inspection exceeds the window in which the analysis is useful.
It would seem that the task of pulling together these sources of information to draw insight is nearly impossible, and in most cases, without the proper tools, it is. Those tasked with the job are talented people who more or less straddle the boundary between IT and business domain experts, including those who are identified as “data scientists.” But
it is well documented that these people spend an inordinate amount of time on the data
preparation/data integration tasks, especially duplicating the efforts of others around
them, which robs them of valuable time that should be devoted to discovering insight
and proposing action.
What Actually Is Unstructured Data?
Before presenting some solutions to these hurdles, some precision on the terms
“structured” and “unstructured” is helpful.
Structured data is data stored in a fixed schema, such as relational database tables or other fixed-format files based on a data model. Computer software developed both by in-house IT organizations and software vendors has, for a few decades, used the concept of structured data for consistency, reliability, etc. However, even structured
data can be difficult to mine because the application logic and/or humans who drive the
systems can do all sorts of things to degrade the semantic consistency in the files (the
most widespread ERP system disables referential integrity in many of its tables for performance reasons and handles it in the application logic, making it difficult to just extract its data without invoking its APIs). This becomes an even more difficult problem
when extracting from more than one system when semantic consistency does not exist
across the systems. This was a lesson learned in data warehousing and should be a
warning to those expanding their analytic databases – it takes time.
The naïve definition of unstructured data is anything that isn’t structured, but that isn’t
true. “Unstructured” is really more of a fuzzy class definition than it is a precise one.
Some may suggest that there is no unstructured data, only data whose structure we
can’t elucidate. For example, a 100-page nicely structured paper report placed on top of
a stick of dynamite and ignited would fall to earth in a million shards of paper
fragments. Surely this would meet some criteria of unstructured-ness. But if we could
precisely model the interaction of heat, light, the angle of the sun, ambient
temperature, wind velocity and direction and probably a thousand other variables, we
could find the structure. Of course, we can’t in this case. And that is the essence of unstructured data – finding the structure is more troublesome than applying other techniques to extract whatever latent value is in the data itself.
Data such as streaming telemetry, web logs and many other sources are also lumped into the unstructured category, even if they have a defined record structure. However, they are not based on a defined data model; their structure, such as it is, can change dynamically, and therefore extracting information from them depends on making assumptions about the size and shapes of the records, which is not a reliable method. Twitter is a good example. Records from the Twitter API used to have forty-five fields (the actual text message being only one of them), but the set expands without notice. So the first definition, that unstructured data is all data that isn’t structured, may be the best one after all.
How Hard Are We Working?
The rapid rise of interest in “big data” has spawned a variety of technology approaches
to solve, or at least, ease this problem such as text analytics, sentiment analysis, NLP
and bespoke applications of AI algorithms. They work. They perform functions that are
too time-consuming to do manually but they are incomplete because each one is too
narrow, aimed at only a single domain or document type, or too specific in its operation.
They mostly defy the practice of agile reuse because each new source, or even each new
special extraction for a new purpose, has to start from scratch.
The only workable solution is one that provides all of the tools to make and enhance a “Smart Data Platform” seamlessly, relieving analysts and data scientists of the tedious effort of integrating the inputs and outputs of many operations.
Data Scientists and Business Analysts are spending 50-80% of their time2 preparing and organizing their data (due in large measure to the preponderance of sorting through unstructured data and linked data, both structured and unstructured) and only 20% of their time analyzing it. Furthermore, unstructured data is such an untapped wealth of information that it is estimated that more than 80% of all data is unstructured3. Obviously, whatever the proportion, you simply must have an unstructured data strategy.
Consider that much of the useful analysis you will do must break through the artificial barriers between structured data and unstructured data: old tools, old ways, and the crush of managing from scarcity. What can you do to create interactive data exploration with no boundaries?
Knowledge Integration Solution
The first step in getting control of and value from this disjoint and vast collection of data
is a universal way to represent it and its meaning regardless of the source or format.
Instead of data integration, one has to invest in the concept of knowledge integration
and knowledge extraction. A “Smart Data” platform is needed with a minimum of:
- Advanced Text Analytics
- Annotation, Harmonization and Canonicalization
- Dynamic Ontology Mapping
- Auto-generated conceptual models
- Semantic querying and data enrichment
- Fully customizable dashboards
- Full data provenance adhering to IT standards
A complete solution should be based on open standards and a semantic approach from
its beginnings. In addition, it should incorporate a very rich tool set that includes easy
inclusion of 3rd party applications that operate seamlessly within the Smart Data
Platform. This is central to the ability to move from data integration and data extraction to more advanced knowledge integration and knowledge extraction, without which it is impossible to fuel solutions in the areas of investigatory analytics, Customer 360, competitive intelligence, insider trading surveillance, and risk and compliance, as well as feeding existing BI applications (a requirement that is not going away anytime soon).
It all works because it is based on a dynamic model-based approach.
What is a model-based approach?
At the heart of the model-based approach is integrating all forms of data in semantic
technology. Though descriptions of semantic technology are often complicated, the concept itself is actually very simple:
- It supplies meaning to data that travels with the data
- The model of the data is updated on the fly as new data enters
- It is a single, universal way to represent data from any source
- The model also captures and understands the relationship between things from
which it can actually do a certain level of reasoning without programming
- Information from many sources can be linked, not through views or indexes, but through explicit and implicit relationships that are native to the model
- The “model” is based on an ontology
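These properties can be illustrated with a toy triple store. The sketch below is illustrative only, in plain Python rather than a real RDF engine; the entities and relationships are invented for the example. It shows how facts from different sources link through shared entities, and how a transitive relationship supports a small amount of reasoning without per-source programming:

```python
# A toy semantic model: each fact is a (subject, predicate, object) triple.
# Facts from different sources link simply by sharing an entity name.
triples = {
    ("ER_Visit", "is_a", "Hospital_Event"),
    ("Hospital_Event", "is_a", "Event"),
    ("Tweet_42", "mentions", "flu_symptoms"),
    ("flu_symptoms", "predicts", "ER_Visit"),
}

def types_of(entity):
    """Infer every type of an entity by following 'is_a' transitively --
    a small amount of reasoning that needs no custom code per source."""
    found, frontier = set(), {entity}
    while frontier:
        step = {o for (s, p, o) in triples if p == "is_a" and s in frontier}
        frontier = step - found
        found |= step
    return found

print(sorted(types_of("ER_Visit")))  # ['Event', 'Hospital_Event']
```

A production system would use an ontology language and a graph database for this, but the principle is the same: new facts extend the model, and the reasoning comes along for free.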
Suppose you wish to predict how many people will come to your Emergency Room in
the next month. You’ve done some preliminary research and found a high correlation of
people complaining on Twitter about certain symptoms and the likelihood they will visit
an ER (of course this is an oversimplified description; you would likely combine the
Twitter data with other elements to expand its value and the strength of its prediction). You choose to query and extract “tweets” from the Twitter API and begin to evaluate your
data. Unlike what you see in Twitter, a tweet from the API is a lengthy record with dozens of fields that include sending server, ID, etc., etc. Only one field contains the actual text of the message.
Why is this complicated? Consumers of tweets should tolerate the addition of new fields
and variance in ordering of fields with ease. Not all fields appear in all contexts. It is
generally safe to consider a nulled field an empty set, and the absence of a field as the
same thing. Twitter data is actually some of the more logical and standardized
“unstructured” data, but even Twitter data is a challenge. How do you actually get the
data you are looking for? Even more importantly, how will you extract it repeatedly
from subsequent draws without doing it manually?
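A defensive consumer can encode these rules directly: read only the fields it needs, treat a null field and an absent field identically, and ignore anything unrecognized. A minimal Python sketch (the field names `text`, `created_at`, and `geo` follow the public Twitter API, but the record shape here is an assumption):

```python
import json

def extract_tweet_fields(raw_record, wanted=("text", "created_at", "geo")):
    """Pull only the fields we care about from an API record,
    tolerating added, missing, null, or reordered fields."""
    record = json.loads(raw_record)
    # dict.get returns None for absent keys, so a missing field and an
    # explicitly null field look the same downstream -- as they should.
    return {name: record.get(name) for name in wanted}

# A record with an unknown extra field and a missing "geo" field.
raw = '{"text": "feeling awful", "created_at": "2015-06-01", "new_field": 1}'
print(extract_tweet_fields(raw))
# {'text': 'feeling awful', 'created_at': '2015-06-01', 'geo': None}
```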
The short answer is developing a “model” for that particular data source and applying
that model (and modifying it as things change quickly in the big data world) to quickly
extract and integrate the data for your ongoing analysis. The model should also help
when doing not just the problem at hand, but quick mash-ups as other ideas and issues
arise. The model makes it possible to combine the data from Twitter with any other data
in the model, at will – no need for design and new code and testing.
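One lightweight way to picture such a model is as a declarative mapping from your concepts to paths in the raw records, kept separate from code so it can be revised as the source drifts. This is a hypothetical sketch, not a real platform's format; the mapping entries are invented for illustration:

```python
# A declarative "model" for one source: concept -> path into the raw record.
# When the source changes, you edit this mapping, not the extraction code.
TWITTER_MODEL = {
    "message":   ["text"],
    "timestamp": ["created_at"],
    "latitude":  ["geo", "coordinates", 0],
}

def apply_model(record, model):
    """Walk each path in the model through the nested record, returning
    None wherever the source has drifted or the data is missing."""
    out = {}
    for concept, path in model.items():
        value = record
        for step in path:
            try:
                value = value[step]
            except (KeyError, IndexError, TypeError):
                value = None
                break
        out[concept] = value
    return out

tweet = {"text": "4 hrs in the waiting rm", "created_at": "2015-06-01",
         "geo": {"coordinates": [33.45, -112.07]}}
print(apply_model(tweet, TWITTER_MODEL))
# {'message': '4 hrs in the waiting rm', 'timestamp': '2015-06-01', 'latitude': 33.45}
```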
That is the essence of the model-based approach. The knowledge base not only provides all of the usual metadata one would find in a well-designed data
warehouse or MDM (Master Data Management) environment, but it provides the
“meaning” of the data (as well as models, etc.). What do we mean by meaning?
What is the definition of Neil Armstrong’s walk on the moon? It happened sometime in July of 1969, at some location on the moon, for a duration of a certain number of minutes, etc. But the meaning of that walk was that for the first time a human being stepped onto another world and survived. The true meaning of that is how it affected people, the end of the space race, and how it has affected technology and civilization since. In other words, how it relates to other people and things. Meaning is found in context, and context is the set of relations between things.
As mentioned earlier in this paper, linking data from multiple sources has an
exponential effect in value and usefulness. For example, a tweet may contain a geocode
for the writer, but if your model already contains extensive geographic, economic and
even psychographic information for that geocode, each tweet essentially inherits all of
that information for no cost or effort. The value of this is almost impossible to measure.
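In code terms, this inheritance is just a lookup keyed on the shared geocode: once the enrichment data is in the model, every new tweet picks up the linked attributes for free. A hypothetical sketch (the geocode granularity and attribute names are invented for illustration):

```python
# Enrichment data already in the model, keyed by a coarse geocode
# (here a postal code, purely for illustration).
geo_profile = {
    "85004": {"median_income": 52000, "nearest_er_miles": 1.2},
}

def enrich(tweet, profiles):
    """Each tweet inherits the attributes linked to its geocode,
    at no extra integration cost per tweet."""
    extra = profiles.get(tweet.get("geocode"), {})
    return {**tweet, **extra}

t = {"text": "so sick", "geocode": "85004"}
print(enrich(t, geo_profile))
# {'text': 'so sick', 'geocode': '85004', 'median_income': 52000, 'nearest_er_miles': 1.2}
```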
Big data is often confused with social media like the Twitter example above, but not all
unstructured data is external to the enterprise. Commercial aircraft generate mountains
of streaming telemetry, but real-time analysis combines other data from other devices,
manufacturer specs, etc.; medical devices like chemotherapy infusion machines do too.
This data is captured by the manufacturers and used for preventive maintenance and
detection and alerting of abnormal readings. Charity:Water monitors streaming data
from its water projects around the undeveloped world and combines it with local
weather data and even third-party risk assessment data of troop and militia movements
for threat analysis. The economic realities of big data make integrating and analyzing
this data feasible, but first, you have to marshal all this unruly data to make it usable.
If you believed that you could develop a workable model for predicting visits to your ER by mining text from the Twitter API, how would you go about it? There are, classically, many algorithms for predicting flow in and through queues, but you need to know who shows up at the ER door. Let’s assume that you have a working model for internal
scheduling in the ER, but you need to create as input the flow of patients appearing at
the door. The Twitter feed can provide the time of day and the geographic location of
the sending server (as well as a mountain of other “metadata”). Your job is to extract
these various attributes that you think are significant predictors of the likelihood of a
visit. You will also want to combine this data with historical trend data you have on visits
and start to build some predictive models and test them.
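The attribute-extraction step above can be sketched as a simple feature builder. Everything here is an assumption for illustration: the symptom keyword list, the record shape, and the choice of hour-of-day as the bucketing attribute:

```python
from collections import Counter

def hourly_symptom_counts(tweets, keywords=("fever", "vomiting", "chest pain")):
    """Count symptom mentions per hour of day -- one candidate predictor
    of near-term ER arrivals, to be tested against historical visit data."""
    counts = Counter()
    for t in tweets:
        text = t["text"].lower()
        if any(k in text for k in keywords):
            counts[t["hour"]] += 1
    return counts

tweets = [{"text": "Fever all night", "hour": 2},
          {"text": "best pizza ever", "hour": 2},
          {"text": "vomiting again", "hour": 3}]
print(hourly_symptom_counts(tweets))  # Counter({2: 1, 3: 1})
```

In practice you would feed counts like these, alongside the historical trend data, into whatever predictive model you are testing.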
Much of the data in a Twitter feed is fairly easy to understand, but the “tweet,” the text
message, is where the real nugget of insight comes from. Because of the limitation of
140 characters, the messages are often difficult for a machine to parse:
Amber and I have been In the waiting rm for 4 hrs. Never again do we casually
stroll into the er. Next time we shoot each other too.
Spacing, punctuation, and capitalization – all are pretty informal. And of course the last
sentence – sarcasm. Just trying to pick up keywords, which is what most text analysis
does, clearly will not suffice. True NLP (Natural Language Processing) is needed to make
sense of this. In fact, one might pick up a few tweets from the writer for the sentiment analyzer to get a sense of his/her style.
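Even before full NLP, a normalization pass illustrates why naive keyword matching struggles with text like the example above. A minimal sketch (the shorthand table is a hypothetical fragment; a real pipeline would use a much larger lexicon and a proper NLP library):

```python
import re

# Hypothetical expansion table for common tweet shorthand.
SHORTHAND = {"rm": "room", "hrs": "hours", "er": "emergency room"}

def normalize(text):
    """Lowercase, strip punctuation, and expand known abbreviations --
    a preprocessing step before any real NLP or sentiment analysis."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(SHORTHAND.get(tok, tok) for tok in tokens)

print(normalize("In the waiting rm for 4 hrs. Never again... the er."))
# in the waiting room for 4 hours never again the emergency room
```

Note what this cannot do: no amount of normalization will detect the sarcasm in the last sentence of the tweet; that is where genuine NLP earns its keep.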
Data from Twitter is only one of a myriad of unstructured data sources available, but it is a useful source of information from 300 million monthly active users who send 500 million tweets per day. And in the big data world, that is just one specialized source among hundreds of thousands.
Making sense of unstructured data takes discipline because a one-off approach will
drain your best resources of time and patience. A model-based approach, complete with a suite of NLP, AI, graph-based models and semantics, is the sensible approach.
The whole extended fabric of a complete solution, with its ability to plug in third-party capabilities, collapses the many layers of logical and physical models in traditional data warehousing/business intelligence architectures into a single model. With a model-based approach, useful benefits accrue:
- Widespread understanding of the model across many domains in the organization
- Rapid implementation of new studies and applications by expanding the model, not re-designing it (even small adjustments to relational databases involve development at the logical, physical and downstream models, with time-consuming effort at each step)
- Application of Solution Accelerators that provide bundled models by industry/application type that can be modified for your specific needs
The use of ontologies was hampered a decade ago by poor performance, but the appearance of powerful graph databases and economical distributed computing (Hadoop) makes it an attractive solution.
1 Actually, a great deal of unstructured data is not external at all. Documents, reports, spreadsheets, email, audio, video and picture data can all be found within an enterprise.
3 Various reports use a figure from 80-85%, such as those from IBM and Merrill Lynch, but it is impossible to be precise and, in reality, it does not matter. What matters is how much is relevant to you, but in practice, it is vast.