Becoming Datacentric
Timothy W. Cook, MSc
CEO, Data insights, Inc.
Shareable, Structured, Semantic Model (S3Model)
An Overarching Principle
What Problem(s) Are We Solving?
The ability to share machine-processable information between applications:
• Within the organization
• Among organizations
AI
Machine learning
Decision support
Massage!
QUALITY
QUALITY
QUALITY
Bad Input == Bad Output
What Problem(s) Are We Solving?
Source Code
Context
Semantics
What Problem(s) Are We Solving?
Data
What Problem(s) Are We Solving?
“There are no statues of committees”
SDO (standards development organization):
• Top-Down
• Consensus
• Slow
• Pseudo-Representative
• Pseudo-Comprehensive
Consequence: Reality is misrepresented when the SDO-built model does not fit the case
Example
International Statistical Classification of Diseases and Related Health Problems (ICD)
First Revision: 1893
Ninth Revision: 1975
AIDS: First clinically reported in 1981
Tenth Revision: 1990
For about 10 years, systems using ICD-9 had to force AIDS into code 279.1, “Deficiency of cell-mediated immunity”
This is why we don’t use ICD
Oh c’mon…
Just The Bullets
(The devil is in the details)
The Bullets
1) Data is a key asset of any organization.
2) Data migrations are costly, both in process and in
information loss. It should be possible to store data for an
unlimited amount of time.
3) Information is data with context.
4) Context is the combination of ontological, temporal and
spatial semantics about when, where and how the data was
collected.
5) Knowledge is derived from information managed over
time.
6) An information provider today cannot know the use cases
of information consumers of tomorrow. Therefore creating
models with complete context that will fit all the use cases
forever is impossible.
7) Data models and information instances must be
computable, sharable, immutable, traceable and uniquely
identifiable.
8) Proper information modeling must be future proof; no
data is ever left behind.
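Bullets 3, 4 and 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Context` and `InformationInstance`, the field names, and the use of `uuid4` for identifiers are hypothetical choices made for this example and are not part of S3Model.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
import uuid


@dataclass(frozen=True)
class Context:
    """Hypothetical context record: the what, when and where of a datum."""
    concept_uri: str       # ontological: what the value means
    recorded_at: datetime  # temporal: when it was collected
    location: str          # spatial: where it was collected


@dataclass(frozen=True)
class InformationInstance:
    """Data plus context, frozen so it cannot be altered after creation."""
    value: float
    units: str
    context: Context
    model_id: str      # identifies the data model this instance follows
    instance_id: str   # uniquely identifies this instance


def record(value, units, context, model_id):
    """Create a new instance with a freshly generated unique identifier."""
    return InformationInstance(value, units, context, model_id, str(uuid.uuid4()))


def is_immutable(instance):
    """Return True if assigning to a field of a frozen instance is rejected."""
    try:
        instance.value = 0.0
    except FrozenInstanceError:
        return True
    return False


# Example values are made up for illustration.
ctx = Context(
    concept_uri="http://example.org/ontology/DiastolicBP",
    recorded_at=datetime(2003, 3, 31, 9, 0, tzinfo=timezone.utc),
    location="Ward 3",
)
reading = record(84.0, "mmHg", ctx, model_id="model-example-1")
```

The bare number 84.0 is data; the same number carried together with `ctx`, a model identifier and its own identifier is closer to what the bullets call information.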
Big Data
MYTHS (AND FACTS)
Myth #1: "Big Data" Has a Universally Accepted, Clear Definition
There is no consensus in the scientific literature or in the specialized blogosphere about the definition of Big Data
The various definitions have three Vs in common (some references reach 10 or more Vs):
• Volume: Existence of gigantic amounts of data
• Variety: Coexistence of structured, unstructured, machine-generated and other kinds of data
• Velocity: Data is produced, and has to be processed and consumed, very fast
Two of these aspects are a particular concern for a data-centric approach: Variety and Velocity
Myth #2: Big Data Is New
Collecting, processing and analyzing large amounts of data is not a new activity for mankind
• Example: medieval monks and their concordances (indexes of every single word in the Bible)
What is new is the volume of data and the speed at which it can be processed and analyzed
Myth #3: Bigger Data Is Better
This is partially fact: the bigger the sample size, the more precise the estimates are
However, large sample sizes with bad quality data are dangerously misleading
Precision and reliability are both equally important
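The point about precision versus reliability can be illustrated with a small simulation. This is a sketch under assumptions chosen only for the example: readings are drawn from a normal distribution with a true mean of 80 and a standard deviation of 10, and "bad quality" is modelled as a constant measurement offset of 5.

```python
import random
import statistics


def sample_mean(n, bias, rng, true_mean=80.0, sd=10.0):
    """Mean of n simulated readings, each shifted by a constant bias."""
    values = [rng.gauss(true_mean, sd) + bias for _ in range(n)]
    return statistics.fmean(values)


def spread_of_estimates(n, bias, trials=200, seed=0):
    """Repeat the sampling and return (average estimate, spread of estimates)."""
    rng = random.Random(seed)
    means = [sample_mean(n, bias, rng) for _ in range(trials)]
    return statistics.fmean(means), statistics.stdev(means)


# Small, clean sample: estimates scatter widely around the true value.
small_clean = spread_of_estimates(n=25, bias=0.0)
# Large, clean sample: estimates scatter much less (more precise).
large_clean = spread_of_estimates(n=2500, bias=0.0)
# Large, biased sample: just as tightly clustered, but around the wrong value.
large_biased = spread_of_estimates(n=2500, bias=5.0)
```

In this setup the large biased sample gives estimates that agree closely with each other and are still far from the true mean, which is why a narrow spread alone says nothing about data quality.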
Myth #4: Big Data Means Big Marketing
The evidence that analyzing Big Data increases the number of customers is uncertain
Big Data is useful when it helps actionable insights emerge
Example: a trending topic on Twitter and more clicks on a certain ad
That has little relevance in strategic areas and public services
Big Data vs. Long Data
Ferguson AR et al. Big data from small data: data-sharing in the 'long tail' of neuroscience. Nature Neuroscience 2014; 17:1442–7. doi:10.1038/nn.3838
Real data is here
Data standards operate here
Data Migration
• Estimated cost about $10,000/month
• 25% to 50% of the cost of acquiring new software
[Chart: diastolic blood pressure (DBP) readings from January 2000 to September 2006 on a scale of 40 to 120, plotted against the lower limit of normality for DBP, the upper limit of normality for DBP, and the prehypertension limit]
February 2, 2003: DBP = 88 mmHg (normal according to the JNC 6)
March 31, 2003: DBP = 84 mmHg (normal according to the JNC 6)
May 21, 2003: The JNC 7 Guideline is published
October 3, 2003: 1st prehypertensive measurement
October 20, 2003: 2nd prehypertensive measurement
January 2004: Improper data migration is performed
October 7, 2006: Death by stroke
Later, the hospital is sued because an audit said hypertension should have been diagnosed on 2003-03-31
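The timeline above suggests that a reading only keeps its meaning if the guideline in force at the time of measurement travels with it. The sketch below is illustrative only. The publication date comes from the slide, but the cut-off values are simplified assumptions for this example (not clinical guidance), "normal" for JNC 6 follows the slide's wording, and all function names are hypothetical.

```python
from datetime import date

# Publication date of JNC 7 as stated on the slide.
JNC7_PUBLISHED = date(2003, 5, 21)


def guideline_in_force(measured_on):
    """Pick the guideline that applied on the day of the measurement."""
    return "JNC 7" if measured_on >= JNC7_PUBLISHED else "JNC 6"


def classify_dbp(dbp_mmhg, guideline):
    """Classify a diastolic reading using simplified, assumed cut-offs."""
    if dbp_mmhg >= 90:
        return "hypertension"
    # Assumption: only JNC 7 has a 'prehypertension' band (80 to 89 here).
    if guideline == "JNC 7" and dbp_mmhg >= 80:
        return "prehypertension"
    return "normal"


def classify_with_context(dbp_mmhg, measured_on):
    """Keep the temporal context: classify under the guideline of that day."""
    guideline = guideline_in_force(measured_on)
    return guideline, classify_dbp(dbp_mmhg, guideline)


# The March 31, 2003 reading from the slide, classified in its own context.
in_context = classify_with_context(84, date(2003, 3, 31))
# The same value reinterpreted under the later guideline, as a migration
# that drops the measurement context might do.
out_of_context = classify_dbp(84, "JNC 7")
```

The same 84 mmHg value receives two different labels depending on whether its temporal context survived, which is the kind of information loss the migration slide warns about.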
Weight?
Dalmatians?
Hospital room no.?
Blood Pressure!
Systolic or Diastolic?
Supine, standing or sitting?
Time of measurement?
Body temperature?
Device type?
Room temperature?
Etc. etc. ad nauseam
Data Model Definitions
UML Models
SQL DBs
Data dictionary documents
CSV headers
PROVIDER
CONSUMER
Human
Computer
Structure vs. Semantics
Our Previous Findings/Insights
Findings
Data models and data descriptions must be:
• Sharable
• Immutable
• Machine processable
Insights
• We do not have enough trained data scientists to keep up with the exploding amounts of data
• We cannot continue to rely on human sorting and cleaning of data
The Datacentric Framework
MUST:
Be future proof
Be transparent
AND
Be agile (enough) to evolve without reducing the value OR changing the meaning of existing information
AND
Provide a clear path for existing application-centric industries to transition to the new paradigm
Use of Existing Technology
Technology used must be tested and reliable.
One technology can’t fix all problems.
Old tools are still useful…
…so don’t build new tools just because you don’t understand the old ones
How do multiple tools fit together to solve the problem?
Maturity Matters
JSON-LD
SHACL
Social Issues
Plain old inertia.
One more epi-cycle.
Build a new language.
We don’t share our data.
Outside the Scope of S3M
On-the-wire syntax
Authentication
Authorization
Application-level persistence
Implementation Goals
1) Use robust, off-the-shelf technologies where possible.
2) Implement with global and cross-domain usage in mind.
3) Implement with maximum reusability and capability for machine processing as a major goal.
4) There must be a well-defined process that provides for the smooth transition from application-centric to data-centric information processing.
The Structured Semantic Shareable Model (S3M)
S3M is based on the core modelling concepts of openEHR to provide semantics external to applications
• From openEHR, S3M inherited the multilevel model principles
S3M also uses certain conceptual principles from HL7 v3
• From HL7, S3M inherited the XML-based implementation
Innovations exclusive to S3M:
• Separate structure from semantics
• Bottom-up data modeling enabled by CUIDs
• Semantic annotation of XML Schemas (not XML data!) with RDF
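One way to picture bottom-up modeling with unique identifiers is a registry in which every published model gets its own ID, so that independent modelers never collide on a name. The sketch below is a hypothetical illustration, not S3M tooling: `ModelRegistry`, the `mc-` prefix and the field layout are invented for this example, and `uuid4` is used as a stand-in for a CUID generator.

```python
import uuid


class ModelRegistry:
    """Hypothetical store of models keyed by a generated unique identifier."""

    def __init__(self):
        self._models = {}

    def publish(self, label, structure):
        """Register a model and return its new, globally unique identifier."""
        model_id = f"mc-{uuid.uuid4()}"  # stand-in for a CUID
        self._models[model_id] = {"label": label, "structure": dict(structure)}
        return model_id

    def get(self, model_id):
        """Look up one specific model by its identifier."""
        return self._models[model_id]

    def models_for(self, label):
        """All identifiers of models that describe the same concept label."""
        return sorted(mid for mid, m in self._models.items() if m["label"] == label)


registry = ModelRegistry()
# Two modelers describe the same concept differently; both models coexist.
clinic_id = registry.publish("Blood pressure", {"systolic": "int", "diastolic": "int"})
research_id = registry.publish(
    "Blood pressure",
    {"systolic": "decimal", "diastolic": "decimal", "position": "string"},
)
```

Because data would refer to a model by its identifier rather than by a shared name, neither modeler needs the other's agreement before publishing, and existing data keeps pointing at the exact model it was created against.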
S3M-based App Development
Data & Semantics Flow in the S3M Ecosystem
S3M in a Nutshell
Technological Approach
• Uses XML Schema 1.1 to build structural definitions/models (it was designed for this)
• Integrates RDF to define the semantics (it was designed for this)
Data Modeling Approach
• Allows multiple modelers to define models of the same concept that are structurally and semantically different (the consensus and evolving-science problems).
• Allows modelers to define the granularity of the model.
• Accommodates data that is outside the normal range, invalid according to constraints, or missing completely.
• Allows modelers to use existing ontologies such as those on BioPortal, local ontologies, or other URIs that point to valid definitions such as web pages or even PDFs, if the need arises.
Provides a consistent foundation for automated machine processing.
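As a rough picture of structure and semantics living in one schema, the sketch below embeds an RDF description inside the `xs:annotation`/`xs:appinfo` element of an XML Schema type and reads both parts back with Python's standard library. The type name `mc-example-dbp`, the `example.org` URI and the element names are assumptions for this example and are not taken from the S3M reference model. The code only parses the schema as XML and performs no schema validation.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical schema: structure in xs:sequence, semantics in xs:appinfo.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <xs:complexType name="mc-example-dbp">
    <xs:annotation>
      <xs:appinfo>
        <rdf:Description rdf:about="mc-example-dbp">
          <rdfs:label>Diastolic blood pressure</rdfs:label>
          <rdfs:isDefinedBy rdf:resource="http://example.org/ontology/DiastolicBP"/>
        </rdf:Description>
      </xs:appinfo>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="magnitude" type="xs:decimal"/>
      <xs:element name="units" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def describe_types(schema_xml):
    """Map each complexType name to its element names and RDF resource links."""
    root = ET.fromstring(schema_xml)
    described = {}
    for ctype in root.iter(f"{{{XS}}}complexType"):
        elements = [e.get("name") for e in ctype.iter(f"{{{XS}}}element")]
        links = []
        for desc in ctype.iter(f"{{{RDF}}}Description"):
            for prop in desc:
                resource = prop.get(f"{{{RDF}}}resource")
                if resource is not None:
                    # Keep the property's local name and the URI it points to.
                    links.append((prop.tag.split("}")[1], resource))
        described[ctype.get("name")] = {"elements": elements, "semantics": links}
    return described


types = describe_types(SCHEMA)
```

Keeping the RDF in the schema rather than in each data instance means the meaning is stated once, next to the structure it describes, and every instance that validates against the type inherits it.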
S3M History
We have functioning prototype tools to generate models and convert existing datasets into models and “validate-able” data.
Now at version 3.0, based on R&D, peer-reviewed publications, invited presentations and feedback from those events.
We simplified the core and removed the healthcare-specific components.
We modeled all of the NIH CDE, FHIR, a segment of ICD-10 and 11, selected clinical guidelines, a mortality system and a hospital reporting system.
Project began in November 2009 as a healthcare-specific project.
S3M – What’s Missing?
Improved documentation.
• Instead of http://datainsights.tech/S3Model/ and http://datainsights.tech/S3Model/rm/index.html, something more like http://xbrl.squarespace.com/xbrl-for-dummies/
Improved ontology links to one or more core ontologies. COMPLETED!
Training materials.
Sustainable business model. Investors and/or Partners.
A high visibility implementation as a demonstrable proof of concept.
Questions?
