3. What Problem(s) Are We Solving?
The ability to share machine-processable information between applications:
• Within the organization
• Among organizations
7. What Problem(s) Are We Solving?
"There are no statues of committees."
Standards Development Organizations (SDOs) are:
• Top-down
• Consensus-driven
• Slow
• Pseudo-representative
• Pseudo-comprehensive
Consequence: reality is misrepresented when the SDO-built model does not fit the case.
10. The Bullets
1) Data is a key asset of any organization.
2) Data migrations are costly, both in process and in information loss. It should be possible to store data for an unlimited amount of time.
3) Information is data with context.
4) Context is the combination of ontological, temporal, and spatial semantics about when, where, and how the data was collected.
5) Knowledge is derived from information managed over time.
6) An information provider today cannot know the use cases of information consumers of tomorrow. Therefore, creating models with complete context that will fit all use cases forever is impossible.
7) Data models and information instances must be computable, sharable, immutable, traceable, and uniquely identifiable.
8) Proper information modeling must be future-proof; no data is ever left behind.
12. Myth #1: "Big Data" Has a Universally Accepted, Clear Definition
There is no consensus, in the scientific literature or on the specialized blogosphere, about the definition of Big Data. The various definitions have 3 Vs in common (some references reach 10+ Vs):
• Volume: existence of gigantic amounts of data
• Variability: coexistence of structured, unstructured, machine-generated, etc. data
• Velocity: data is produced, and has to be processed and consumed, very fast
Two of these aspects, Variability and Velocity, are a particular concern for a data-centric approach.
13. Myth #2: Big Data Is New
Collecting, processing, and analyzing vast amounts of data is not a new human activity.
• Example: medieval monks and their concordances (indexes of every single word in the Bible)
What is new is the volume of data and the speed at which it can be processed and analyzed.
14. Myth #3: Bigger Data Is Better
This is partially true: the bigger the sample size, the more precise the estimates. However, large sample sizes with bad-quality data are dangerously misleading. Precision and reliability are both equally important.
15. Myth #4: Big Data Means Big Marketing
The evidence that analyzing Big Data increases the number of customers is uncertain. Big Data is useful when it helps actionable insights emerge. Example: linking a trending topic on Twitter to more clicks on a certain ad. That has little relevance in strategic areas and public services.
16. Big Data vs. Long Data
Ferguson AR, et al. Big data from small data: data-sharing in the 'long tail' of neuroscience. Nature Neuroscience 2014;17:1442–7. doi:10.1038/nn.3838
[Figure from Ferguson AR et al., annotated: "Real data is here" vs. "Data standards operate here."]
17. Data Migration
• Estimated cost: about $10,000/month
• 25%–50% of the cost of acquiring new software
[Chart: diastolic blood pressure (DBP) measurements, January 2000 – September 2006, y-axis 40–120 mmHg, plotted against the lower and upper limits of normality for DBP and the prehypertension limit.]
Annotated timeline from the chart:
• February 2, 2003: DBP = 88 mmHg (normal according to JNC 6)
• March 31, 2003: DBP = 84 mmHg (normal according to JNC 6)
• May 21, 2003: the JNC 7 guideline is published
• October 3, 2003: 1st prehypertensive measurement
• October 20, 2003: 2nd prehypertensive measurement
• January 2004: improper data migration is performed
• October 7, 2006: death by stroke
• Later, the hospital is sued because an audit said hypertension should have been diagnosed on March 31, 2003.
23. Our Previous Findings/Insights
Findings: data models and data descriptions must be sharable, immutable, and machine-processable.
Insights:
• We do not have enough trained data scientists to keep up with the exploding amounts of data.
• We cannot continue to rely on human sorting and cleaning of data.
25. Use of Existing Technology
Technology used must be tested and reliable. One technology can't fix all problems. Old tools are still useful… so don't build new tools just because you don't understand the old ones. How do multiple tools fit together to solve the problem?
28. Social Issues
Plain old inertia.
One more epicycle.
Build a new language.
We don’t share our data.
29. Outside the Scope of S3M
• On-the-wire syntax
• Authentication
• Authorization
• Application-level persistence
30. Implementation Goals
1) Use robust, off-the-shelf technologies where possible.
2) Implement with global and cross-domain usage in mind.
3) Implement with maximum reusability and capability for machine processing as a major goal.
4) There must be a well-defined process that provides for a smooth transition from application-centric to data-centric information processing.
31. The Structured Semantic Shareable Model (S3M)
S3M is based on the core modelling concepts of openEHR to provide semantics external to applications; from openEHR, S3M inherited the multilevel-model principles. S3M also uses certain conceptual principles from HL7 v3; from HL7, S3M inherited the XML-based implementation.
Innovations exclusive to S3M:
• Separation of structure from semantics
• Bottom-up data modeling, enabled by CUIDs
• Semantic annotation of XML Schemas (not XML data!) with RDF (see the sketch below)
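A minimal sketch of what such schema-level annotation might look like. The type name, CUID value, element names, and ontology URI below are hypothetical illustrations, not taken from the S3M specification; the point is that the RDF lives inside the xs:appinfo of a complex type, so the semantics describe the model definition itself rather than the instance data.

<!-- Hypothetical sketch: RDF semantics embedded in an XML Schema definition. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- The type name is a CUID, so independently built models cannot collide. -->
  <xs:complexType name="mc-cjld2cjxh0000qzrmn831i7rn">
    <xs:annotation>
      <xs:appinfo>
        <!-- The RDF describes the schema component, not instance data. -->
        <rdf:Description rdf:about="#mc-cjld2cjxh0000qzrmn831i7rn">
          <rdfs:label>Diastolic Blood Pressure</rdfs:label>
          <!-- Hypothetical ontology link, e.g. a BioPortal term URI. -->
          <rdfs:isDefinedBy rdf:resource="http://purl.bioontology.org/ontology/SNOMEDCT/271650006"/>
        </rdf:Description>
      </xs:appinfo>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="magnitude" type="xs:decimal" minOccurs="0"/>
      <xs:element name="units" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

Because the annotation travels with the schema, any tool that can read the XSD can also harvest the semantics, and instance documents stay lightweight.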
34. S3M in a Nutshell
Technological approach:
• Uses XML Schema 1.1 to build structural definitions/models (it was designed for this).
• Integrates RDF to define the semantics (it was designed for this).
Data modeling approach:
• Allows multiple modelers to define models of the same concept that are structurally and semantically different (addressing the consensus and evolving-science problems).
• Allows modelers to define the granularity of the model.
• Accommodates data that is outside the normal range, invalid according to constraints, or missing completely (see the sketch below).
• Allows modelers to use existing ontologies such as those on BioPortal, local ontologies, or other URIs that point to valid definitions such as web pages or even PDFs, if the need arises.
This provides a consistent foundation for automated machine processing.
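One way such tolerance for out-of-range or missing data can be expressed in XML Schema 1.1 is sketched below. The type and element names are hypothetical, not from the S3M reference model: the value itself is optional, and an assertion requires that a record carry either a value or an explicit exceptional-value reason, so problem data is captured rather than rejected.

<!-- Hypothetical sketch of tolerating missing or invalid data in XSD 1.1. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning"
           vc:minVersion="1.1">
  <xs:complexType name="MeasurementType">
    <xs:sequence>
      <!-- The actual value is optional... -->
      <xs:element name="value" type="xs:decimal" minOccurs="0"/>
      <!-- ...but its absence or invalidity must be explained. -->
      <xs:element name="exceptional-value" minOccurs="0">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="not-measured"/>
            <xs:enumeration value="out-of-range"/>
            <xs:enumeration value="invalid"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
    <!-- XSD 1.1 assertion: a record carries a value or a reason, never neither. -->
    <xs:assert test="exists(value) or exists(exceptional-value)"/>
  </xs:complexType>
</xs:schema>

An instance can then record an exceptional-value such as not-measured instead of silently dropping the observation, which is what keeps the data auditable years later.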
35. S3M History
The project began in November 2009 as a healthcare-specific project. We later simplified the core and removed the healthcare-specific components. We modeled all of the NIH CDEs, FHIR, a segment of ICD-10 and ICD-11, selected clinical guidelines, a mortality system, and a hospital reporting system. We have functioning prototype tools to generate models and convert existing datasets into models and validatable data. S3M is now at version 3.0, based on R&D, peer-reviewed publications, invited presentations, and feedback from those events.
36. S3M – What's Missing?
Improved documentation.
• Instead of http://datainsights.tech/S3Model/ and http://datainsights.tech/S3Model/rm/index.html, something more like http://xbrl.squarespace.com/xbrl-for-dummies/
Improved ontology links to one or more core ontologies. COMPLETED!
Training materials.
Sustainable business model. Investors and/or Partners.
A high-visibility implementation as a demonstrable proof of concept.