Becoming Datacentric

Timothy W. Cook, MSc
CEO, Data insights, Inc.
Shareable, Structured, Semantic Model (S3Model)

What Problem(s) Are We Solving?
The ability to share machine-processable information between applications
• Within the organization
• Among organizations

AI
Machine learning
Decision support
Massage!
QUALITY
QUALITY
QUALITY
Bad Input == Bad Output

Source Code
Context
Semantics
Data

“There are no statues of committees”
SDO
Top-Down
Consensus
Slow
Pseudo-Representative
Pseudo-Comprehensive
Consequence: Reality is misrepresented when the SDO-built model does not fit the case

Example
International Statistical Classification of Diseases and Related Health Problems (ICD)
First Revision: 1893
Ninth Revision: 1975
AIDS: Discovered in 1980
Tenth Revision: 1990
For 10 years, systems using ICD-9 had to force AIDS into the 279.1 code for “Deficiency of cell-mediated immunity”
This is why we don’t use ICD
Oh c’mon…

Just The Bullets
(The devil is in the details)

The Bullets
1) Data is a key asset of any organization.
2) Data migrations are costly, both in process and in
information loss. It should be possible to store data for an
unlimited amount of time.
3) Information is data with context.
4) Context is the combination of ontological, temporal and
spatial semantics about when, where and how the data was
collected.
5) Knowledge is derived from information managed over
time.
6) An information provider today cannot know the use cases
of information consumers of tomorrow. Therefore creating
models with complete context that will fit all the use cases
forever is impossible.
7) Data models and information instances must be
computable, sharable, immutable, traceable and uniquely
identifiable.
8) Proper information modeling must be future proof; no
data is ever left behind.
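
Bullets 3, 4 and 7 can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not part of S3Model: the class name `InformationInstance`, its fields, the identifier `example-model-1` and the `position` context entry are all hypothetical. It shows one way an instance can carry its context, be immutable, be uniquely identifiable, and be traceable through a content hash.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field
from typing import Tuple

# (name, value) pairs stored as tuples so the contents cannot be mutated in place.
Pairs = Tuple[Tuple[str, object], ...]


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class InformationInstance:
    """Hypothetical record: data plus the context that makes it information."""
    model_id: str   # identifier of the data model this instance conforms to
    data: Pairs     # the raw values
    context: Pairs  # when, where and how the data was collected
    instance_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def fingerprint(self) -> str:
        """SHA-256 over model, data and context; identical content gives an identical hash."""
        payload = json.dumps(
            {"model": self.model_id, "data": self.data, "context": self.context},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative values only; the model identifier and the position are made up.
reading = InformationInstance(
    model_id="example-model-1",
    data=(("diastolic_bp_mmHg", 84),),
    context=(("measured_on", "2003-03-31"), ("position", "sitting")),
)
print(reading.instance_id, reading.fingerprint())
```

Attempting `reading.model_id = "other"` raises an error, so a correction has to be recorded as a new instance rather than an overwrite of the old one.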

Myth #1: "Big Data" Has a Universally Accepted, Clear Definition
There is no consensus in the scientific literature or in the specialized blogosphere about the definition of Big Data
The various definitions have 3 Vs in common (some references reach 10+ Vs):
• Volume: Existence of gigantic amounts of data
• Variability: Coexistence of structured, non-structured, machine-generated etc. data
• Velocity: Data is produced, and it has to be processed and consumed very fast
Two of these aspects are a particular concern for a data-centric approach: Variability and Velocity

Myth #2: Big Data Is New
Collecting, processing and analyzing vast amounts of data is not a new activity for mankind
• Example: medieval monks and their concordances (correlations of every single word in the Bible)
What is new is the volume and size of the data and the speed at which it can be processed and analyzed

Myth #3: Bigger Data Is Better
This is partially fact: the bigger the sample size, the more precise the estimates are
However, large sample sizes with bad-quality data are dangerously misleading
Precision and reliability are both equally important
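
A short worked calculation illustrates the point. It is a sketch under one simplifying assumption that is not in the slides: bad-quality data is modelled as a constant systematic offset (`BIAS`) added to every measurement, and the values of `SIGMA` and `BIAS` are arbitrary. The standard error of a mean shrinks as the sample grows, but the total error never drops below the offset.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Precision of a sample mean: shrinks with the square root of the sample size."""
    return sigma / math.sqrt(n)


def rmse(bias: float, sigma: float, n: int) -> float:
    """Total error of the mean when every measurement carries the same offset."""
    return math.sqrt(bias ** 2 + standard_error(sigma, n) ** 2)


SIGMA = 10.0  # assumed spread of individual measurements
BIAS = 5.0    # assumed systematic offset caused by poor data quality

for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9}  standard error={standard_error(SIGMA, n):.3f}  "
          f"total error={rmse(BIAS, SIGMA, n):.3f}")
```

Going from 100 to 1,000,000 samples cuts the standard error by a factor of 100, yet the total error stays just above 5: the estimate becomes very precise and remains wrong.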

Myth #4: Big Data Means Big Marketing
The evidence that analyzing Big Data increases the number of customers is uncertain
Big Data is useful when it helps actionable insights emerge
Example: a trending topic on Twitter and more clicks on a certain ad
That has little relevance in strategic areas and public services

Big Data vs. Long Data
Ferguson AR et al. Big data from small data: data-sharing in the 'long tail' of neuroscience. Nature Neuroscience 2014; 17:1442–7. doi:10.1038/nn.3838
Real data is here
Data standards operate here

Data Migration
• Estimated cost: about $10,000/month
• 25%–50% of the cost of acquiring new software
[Figure: diastolic blood pressure (DBP) readings from 1/1/2000 to 9/1/2006, vertical axis from 40 to 120. Series: Diastolic BP, Lower Limit of Normality for DBP, Upper Limit of Normality for DBP, Prehypertension limit.]
February 2, 2003: DBP = 88 mmHg (Normal according to the JNC 6)
March 31, 2003: DBP = 84 mmHg (Normal according to the JNC 6)
May 21, 2003: The JNC 7 Guideline is published
October 3, 2003: 1st prehypertensive measurement
October 20, 2003: 2nd prehypertensive measurement
January 2004: Improper data migration is performed
October 7, 2006: Death by stroke
Later, the hospital is sued because an audit said hypertension should have been diagnosed on 2003-03-31
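
The timeline suggests why temporal context matters. The slides do not say exactly what the migration did wrong; the sketch below assumes it lost the link between each measurement and the guideline in force when it was taken. The function name `guideline_in_force` is hypothetical, and the only facts used are the dates and values listed on the slide.

```python
from datetime import date

# Publication date of JNC 7 as given on the slide.
JNC7_PUBLISHED = date(2003, 5, 21)


def guideline_in_force(measured_on: date) -> str:
    """Return the guideline under which a measurement was originally interpreted."""
    return "JNC 7" if measured_on >= JNC7_PUBLISHED else "JNC 6"


# The two measurements from the slide that carry a DBP value, in mmHg.
timeline = [
    (date(2003, 2, 2), 88),
    (date(2003, 3, 31), 84),
]

for measured_on, dbp in timeline:
    print(measured_on.isoformat(), dbp, "mmHg, interpreted under",
          guideline_in_force(measured_on))
```

Both readings predate JNC 7, so judging them after the fact against the newer guideline, as the audit in the example appears to have done, changes the meaning of existing information.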

Weight?
Dalmatians?
Hospital room no.?
Blood Pressure!
Systolic or Diastolic?
Supine, standing or sitting?
Time of measurement?
Body temperature?
Device type?
Room temperature?
Etc. etc. ad nauseam
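
The questions above amount to a checklist of context that a bare number lacks. As a minimal sketch, assuming a record is a plain dictionary and using hypothetical key names derived from those questions, a consumer could at least detect which context is absent:

```python
from typing import Dict, List

# Hypothetical context keys, one per question raised on the slide.
REQUIRED_CONTEXT = (
    "component",         # systolic or diastolic?
    "position",          # supine, standing or sitting?
    "measured_at",       # time of measurement?
    "body_temperature",
    "device_type",
    "room_temperature",
)


def missing_context(record: Dict[str, object]) -> List[str]:
    """List the context keys that are absent or empty in a record."""
    return [key for key in REQUIRED_CONTEXT if record.get(key) in (None, "")]


bare_value = {"value": 101}
print(missing_context(bare_value))  # every context key is reported as missing
```

A value of 101 with none of these keys could be a weight, a room number or a count of Dalmatians; the check only reports the gap, it cannot fill it.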

Data Model Definitions
UML Models
SQL DBs
Data dictionary documents
CSV headers
PROVIDER
CONSUMER
Human
Computer

Our Previous Findings/Insights
Findings
Data models and data descriptions must be:
• Sharable
• Immutable
• Machine processable
Insights
• We do not have enough trained data scientists to keep up with the exploding amounts of data
• We cannot continue to rely on human sorting and cleaning of data

The Datacentric Framework
MUST:
• Be future proof
• Be transparent
• Be agile (enough) to evolve without reducing the value OR changing the meaning of existing information
AND
• Provide a clear path for existing application-centric industries to transition to the new paradigm

Use of Existing Technology
Technology used must be tested and reliable.
One technology can’t fix all problems.
Old tools are still useful…
…so don’t build new tools just because you don’t understand the old ones
How do multiple tools fit together to solve the problem?

Maturity Matters
JSON-LD
SHACL

Social Issues
Plain old inertia.
One more epicycle.
Build a new language.
We don’t share our data.

Outside the Scope of S3M
• On-the-wire syntax
• Authentication
• Authorization
• Application-level persistence

Implementation Goals
1) Use robust, off-the-shelf technologies where possible.
2) Implement with global and cross-domain usage in mind.
3) Implement with maximum reusability and capability for machine processing as a major goal.
4) There must be a well-defined process that provides for the smooth transition from application-centric to data-centric information processing.

The Structured Semantic Shareable Model (S3M)
S3M is based on the core modelling concepts of openEHR to provide semantics external to applications
• From openEHR, S3M inherited the multilevel model principles
S3M also uses certain conceptual principles from HL7 v3
• From HL7, S3M inherited the XML-based implementation
Innovations exclusive to S3M:
• Separate structure from semantics
• Bottom-up data modeling enabled by CUIDs
• Semantic annotation of XML Schemas (not XML data!) with RDF
Data & Semantics Flow in the S3M Ecosystem

S3M in a Nutshell
Technological Approach
• Uses XML Schema 1.1 to build structural definitions/models (it was designed for this)
• Integrates RDF to define the semantics (it was designed for this)
Data Modeling Approach
• Allows multiple modelers to define models of the same concept that are structurally and semantically different (the consensus & evolving science problems).
• Allows modelers to define the granularity of the model.
• Accommodates data that is outside the normal range, invalid according to constraints or missing completely.
• Allows modelers to use existing ontologies such as those on BioPortal, local ontologies or other URIs that point to valid definitions such as web pages or even PDFs, if the need arises.
Provides a consistent foundation for automated machine processing.
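
The split between structure and semantics can be sketched without any XML tooling. The Python below is an analogy under stated assumptions, not S3M itself, which uses XML Schema 1.1 and RDF: `QuantityModel`, `models_for`, `check`, the identifiers `model-a` and `model-b`, the example.org URI and the numeric limits are all hypothetical. Two modelers describe the same concept with different structures, semantics live in separate triples, and unusual values are flagged rather than dropped.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QuantityModel:
    """Structure only: units and allowed range, no statement of meaning."""
    model_id: str
    units: str
    min_value: float
    max_value: float


# Semantics kept apart from structure as (subject, predicate, object) triples.
Triple = Tuple[str, str, str]

# Hypothetical concept URI; any URI pointing at a valid definition would do.
CONCEPT = "https://example.org/concepts/diastolic-blood-pressure"

# Two modelers, one concept, two structurally different models.
model_a = QuantityModel("model-a", "mmHg", 40.0, 120.0)
model_b = QuantityModel("model-b", "kPa", 5.0, 16.0)

semantics: List[Triple] = [
    (model_a.model_id, "rdfs:isDefinedBy", CONCEPT),
    (model_b.model_id, "rdfs:isDefinedBy", CONCEPT),
]


def models_for(concept: str, triples: List[Triple]) -> List[str]:
    """Find every model whose semantics point at the given concept."""
    return [s for s, p, o in triples if p == "rdfs:isDefinedBy" and o == concept]


def check(model: QuantityModel, value: Optional[float]) -> str:
    """Classify a value instead of rejecting it, so no data is left behind."""
    if value is None:
        return "missing"
    if not (model.min_value <= value <= model.max_value):
        return "out-of-range"
    return "valid"


print(models_for(CONCEPT, semantics))
print(check(model_a, 84.0), check(model_a, 300.0), check(model_a, None))
```

Because both models point at the same concept, a consumer can discover them by meaning and still validate each against its own structural rules.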

S3M History
Project began in November 2009 as a healthcare-specific project.
We modeled all of the NIH CDE, FHIR, a segment of ICD-10 and 11, selected clinical guidelines, a mortality system and a hospital reporting system.
We simplified the core and removed the healthcare-specific components.
Now at version 3.0, based on R&D and peer-reviewed publications, invited presentations and feedback from those events.
We have functioning prototype tools to generate models and convert existing datasets into models and “validate-able” data.

S3M – What’s Missing?
Improved documentation.
• Instead of http://datainsights.tech/S3Model/ and http://datainsights.tech/S3Model/rm/index.html
something more like http://xbrl.squarespace.com/xbrl-for-dummies/
Improved ontology links to one or more core ontologies. COMPLETED!
Training materials.
Sustainable business model. Investors and/or Partners.
A high visibility implementation as a demonstrable proof of concept.
