FAIR data: Superior data visibility
and reuse without warehousing
Alan Morrison
Data Architecture Best Practices Summit
April 18, 2023
1
Photo: Alain Audet, https://pixabay.com/photos/lake-foggy-lake-nature-landscape-6839357/
Outline of today’s talk
2
● Problem
○ Lack of desiloed, high quality, well-integrated data and logic at scale
○ Shortfalls of data warehousing
● Solution
○ FAIR data and knowledge graphs
■ Blended data + logic centered infrastructure
● Result
○ Case study examples
○ Organic, data-centric systems
○ Zero-copy integration feasibility
Problem: Data quality, siloing and
poor integration
3
Yes, data warehousing focused on the integration problem
4
● Pro: Identified the critical problem to solve
● Con: Advocated a method that doesn’t delve deep enough to solve today’s
problem
● Still have the unified data model challenge
Data warehousing can’t solve today’s integration challenge
5
● Thousands of databases per enterprise (siloing)
● Thousands of applications (code sprawl)
● Data models buried in the app code
● Every app a special snowflake with its own data model
How did we get here? By selling the old as new
6
How data warehousing stopped scaling
“They recognized that these themes ended up in all these legacy apps. Sales rolled up against a
geographic and a product hierarchy, and an organizational hierarchy…. They said, Let’s have
those conformed dimensions and a small number of facts. Let’s bring the facts from all the
different systems and snap them together according to these conformed dimensions….
Brilliant idea, but I think what actually happened over time is the workload just got greater and
greater. The ability of people to actually conform those dimensions kept eroding….”
–Dave McComb, President, Semantic Arts
“Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything!” YouTube video, https://www.youtube.com/watch?v=M5XlGloj4UY&t=564s, 2021
7
Data warehousing model conformance doesn’t scale
“I spent a good 15 years working in financial services at some
pretty big banks. Half of the IT change budget is spent on
integration and the by-products of integration….I saw as the
technology was advancing that the percentage wasn’t going
down – in fact, it was going up. At some point, is the integration
tax going to be 100 percent?”
– Dan DeMers, CEO of Cinchy
“Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything!” YouTube video,
https://www.youtube.com/watch?v=M5XlGloj4UY&t=564s, 2021
8
What’s a data model? What is data integration?
9
An effective data model describes and unifies the contexts necessary for true data
integration. It gives machines enough clues to detect and discover layered context.
“What is data integration?
Let's start with a short list of what data integration is not:
● It's not shoveling data around between systems.
● It's not calling an API.
● It's not creating a data connection to a source system.
It can include one or more of the jobs in the list here above, but what is the ingredient
that cannot be missing?
It's connecting data from different source systems together in a consistent and
coherent data model.”
–Wouter Trappers, CDAO
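Trappers’ “missing ingredient” can be sketched in a few lines: records from two source systems only integrate once both are mapped onto one consistent, shared model. This is a minimal illustration; the systems, field names, and records below are all hypothetical.

```python
# Two systems describe the same customer with different local schemas.
crm_record = {"cust_name": "Acme Corp", "cust_country": "IE"}
billing_record = {"client": "Acme Corp", "country_code": "IE", "balance": 1200}

# The shared model: one vocabulary that every source is mapped onto.
FIELD_MAP = {
    "crm": {"cust_name": "name", "cust_country": "country"},
    "billing": {"client": "name", "country_code": "country", "balance": "balance"},
}

def to_shared_model(source: str, record: dict) -> dict:
    """Translate a source-specific record into the shared vocabulary."""
    return {FIELD_MAP[source][key]: value for key, value in record.items()}

# Once both records speak the shared vocabulary, they merge consistently.
integrated = to_shared_model("crm", crm_record) | to_shared_model("billing", billing_record)
print(integrated)  # {'name': 'Acme Corp', 'country': 'IE', 'balance': 1200}
```

Shoveling the two raw records between systems, or calling an API on either one, would not have produced that coherent result; the mapping onto a common model is what does the integrating.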
Why large-scale integration?
10
Large-scale integration is essential to avoiding observational bias, the kind
captured by the old joke about the drunk looking for his money under the
lamppost: he searches where the light is, even though he knows the money is in
the shadows.
To manage today’s business at scale, enterprises need light and visibility
across departments, organizations and supply networks.
Solution: Scale FAIR data development
using data-centric architecture,
semantics and knowledge graph
methods
11
The Five Commingled Phases of Compute, Networking and Storage
12
● 1st: Mainframe and Green Screens (centralized storage and compute, with minimal networking)
● 2nd: Client-Server and Desktops (application distribution via proprietary and IP networking)
● 3rd: Early Web on Client-Server (simple web hosting + legacy client-server storage)
● 4th: Distributed Cloud and Mobile Devices (commodity servers + storage + some virtualization)
● 5th: “Decoupled” and “Decentralized” Cloud (compute and storage more loosely coupled, virtualized, controlled and data-centric)
The diagram plots these phases over time, moving from more centralized and
application-centric toward less centralized and data-centric. All phases are
still active and evolving.
Data-centric knowledge graphs allow desiloed visibility and interoperation at scale
13
Opportunity: Unitary data + description logic = knowledge
14
FAIR data and its associated description logic span:
● “Data management” (structured data, mostly)
● Knowledge management (internally shared)
● Content management (externally shared)
● Learning management (internal coursework)
FAIR stands for findable, accessible, interoperable, and reusable. Under the
FAIR data umbrella are all heterogeneous types of data/content.
FAIR data is data users can have confidence in for many purposes. Data becomes
FAIR when it disambiguates concepts, individuals and roles, and how they
interact and relate to one another. In a knowledge graph context, documented
knowledge = FAIR data.
Semantics is the path to FAIR, smart, siloless data sharing
15
James Kobielus, 2016
Association of European Libraries, 2017
Compare FAIR and TRUST principles
16
Lin, D., Crabtree, J., Dillo, I. et al. The TRUST Principles for digital repositories. Sci Data 7, 144 (2020).
https://doi.org/10.1038/s41597-020-0486-7
FAIR data leads to TRUSTed data
repositories.
Who’s behind the FAIR data movement? Big pharma, for
one.
“From 2023, drug submissions to the European Medicines Agency (EMA) must
comply with select Identification of Medicinal Products (IDMP) standards. By
developing an IDMP-compliant ontology with machine-ready data, the Alliance will
support the move to automate this process, improving efficiency and patient
safety, reducing costs and time burden, and driving innovation in the drug
development pipeline.
“The project is managed by the Pistoia Alliance, with a project team of
experts from Bayer, Novartis, Roche, Merck KGaA, and GSK.”
17
–Erik Schultes et al., “FAIR Digital Twins for Data-Intensive Research,” Frontiers in Big Data 5 (May 11, 2022), https://doi.org/10.3389/fdata.2022.883341
To create FAIR data, users can start with a single triple
18
Linked Open Data Cloud, 2022
Starter triple for a knowledge graph
A standard knowledge graph consists of triplified, relationship-rich data. The
data model, or ontology, is also described in triples and lives with the rest
of the data, which means ontologies can be managed as data too. Linking one
resource to another merely requires a verb (also called a predicate, or
described edge).
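As a sketch of such a starter triple (all IRIs below are hypothetical examples), a single subject–predicate–object statement can be written both as a bare three-part statement and as JSON-LD, using only the standard library:

```python
import json

# One triple: subject, predicate (the linking verb / described edge), object.
triple = ("ex:Ikea", "ex:sells", "ex:BillyBookcase")

# The same statement expressed as JSON-LD, so it can live alongside
# the rest of the FAIR data and be linked to further triples later.
doc = {
    "@context": {"ex": "http://example.org/"},  # expands the "ex:" prefix
    "@id": "ex:Ikea",                           # the subject
    "ex:sells": {"@id": "ex:BillyBookcase"},    # predicate -> object
}
print(json.dumps(doc, indent=2))
```

Adding a second triple about `ex:BillyBookcase` (its price, say) automatically extends the graph, because the shared identifier is the join point.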
Simple way to start a business knowledge graph
● “Use JSON-LD to atomise your enterprise data down into three-part statements and voila!
You get a connected graph!
● ✨ Decentralize the process by having each team publish their own JSON-LD, for example,
let the sales team publish the sales data and ask them to link each sale to the correct product
and client.
● 🤖 Connect GPT to the JSON-LD that your teams have published. Then, harness the power
of GPT to assist new teams in publishing their JSON-LD and integrating it back into your
enterprise-wide Knowledge Graph.”
Key to scaling external/internal integration: take the schema.org-modeled JSON-LD from the websites
GPT is trained on and connect it with internal data that is also modeled with schema.org.
–#HT Tony Seale, UBS
https://www.linkedin.com/posts/tonyseale_mlops-dataintegration-ai-activity-7052551060237819904-bAZc
19
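Seale’s recipe above can be sketched concretely. In this hypothetical example (all identifiers and URLs are invented for illustration), the sales team and the product team each publish their own schema.org-typed JSON-LD; because both reuse the same `@id` for the product, the two documents snap together into one connected graph:

```python
import json

# Published independently by the sales team: a sale linked to its product and client.
sales_team_doc = {
    "@context": "https://schema.org",
    "@type": "Order",
    "@id": "https://example.com/orders/1001",
    "orderedItem": {"@id": "https://example.com/products/widget-a"},
    "customer": {"@id": "https://example.com/clients/acme"},
}

# Published independently by the product team: the product itself.
product_team_doc = {
    "@context": "https://schema.org",
    "@type": "Product",
    "@id": "https://example.com/products/widget-a",
    "name": "Widget A",
}

# No central warehouse needed: the shared @id is the integration point.
graph = {"@context": "https://schema.org", "@graph": [sales_team_doc, product_team_doc]}
print(json.dumps(graph, indent=2))
```

Each team keeps ownership of its own document; the enterprise-wide knowledge graph is simply the union of what the teams publish.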
To scale FAIR data, use an assisted, hybrid AI approach
20
Amit Sheth, “From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge (Neuro-symbolic AI),” USC Information Sciences Institute on
YouTube, March 2023, https://www.youtube.com/watch?v=xyxQXka6dRY&t=2377s
21
How hybrid AI helps in research
“LLMs have amazing abilities in
manipulating natural language text,
but generating timely and factually
verified recommendations is one
thing LLMs are not naturally great
at.”
–Mike Tung, CEO of Diffbot
Diffbot Blog, April 2023,
https://blog.diffbot.com/generating-company-recommendations-usi
ng-large-language-models-and-knowledge-graphs/
LLMs alone aren’t a reliable research tool
because they hallucinate: you can’t trust
the answers unless you already know them.
Mike Tung recommends more precise
prompting on the query side and answer
verification via a knowledge graph such
as Diffbot’s. Both of these capabilities
harness the precise logical description
missing from current LLM Q&As.
KGs and data-centric architecture
22
Semantic standards allow a desiloed data landscape
23
How shared graph semantics helps
● Boosts the meaningfulness and relevancy of results (poor results stem from a
lack of data and logic transparency and cohesiveness)
● Contextualizes data for management and reuse with relationship logic
● Scales meaningful connections between contexts (relevant relationships
living with the entities)
● Enables Metcalfe’s network-of-networks effect (network_effect^N)
● Enables model-driven development (code once, reuse anywhere)
● Delivers scale efficiencies and economies that reduce energy consumption
24
Case study examples
25
IKEA’s product knowledge graph
26
Katariina Kari, “IKEA’s Knowledge Graph and Why It Has Three Layers,” August 2022,
https://medium.com/flat-pack-tech/ikeas-knowledge-graph-and-why-it-has-three-layers-a38fca436349
Currently designed to be customer-facing; it can evolve for logistics purposes
with more detailed product data.
Blue Brain Nexus: graph-based bioinformatics collaboration
27
Montefiore Health’s Patient-centered Analytical Learning
Machine (“PALM”): personalized medicine at scale
28
Enterprise decentralized app environment: OriginTrail.io
29
https://origintrail.io/
OriginTrail + BSI’s supply chain tracking and tracing
30
OriginTrail and the British Standards Institution (BSI), https://twitter.com/origin_trail/status/1339606640887152642?s=20, Dec. 2020
The Monasteriven whiskey produced in Ireland is tracked and traced from “grain
to glass” with the OriginTrail.io approach.
OT uses a decentralized knowledge graph that connects to one of several
different blockchains.
This method enables shared data reuse and other synergies across the supply
chain.
Solid shared, federated XaaS: Construction industry
31
“TrinPod™: World's first conceptually indexed space-time
digital twin using Solid,” Graphmetrix, 2022,
https://graphmetrix.com/trinpod
Company-specific Solid storage pods and access control can be managed by each
supply chain partner. Graphmetrix, as the digital twin provider, manages the
system and system-level apps.
Digital twins and agents: Better data sharing than APIs?
32
Diagram: sensor nets, digital twins, and autonomous agents (locale: Portsmouth, UK).
Iotics, 2019 and 2023
Final thoughts
33
Organic data, when nurtured, grows from seeds into trees
34
● Rich data ecosystems evolve naturally, in contrast to underdescribed,
fragmented data assets
● Zero-copy integration becomes possible, reducing complexity, labor and
energy waste by up to 90 percent
● Second-order cybernetics (humans in the loop) and precise facts and
contextualization complement probabilistic methods
Seven obstacles to adoption of FAIR data development at scale
35
Thoughts and Reactions?
Feel free to ping me anytime with questions, etc.
Alan Morrison
Data Science Central
LinkedIn | Twitter | Quora | Slideshare
+1 408 205 5109
a.s.morrison@gmail.com
36
