FAIR data_ Superior data visibility and reuse without warehousing.pdf

FAIR data: Superior data visibility and reuse without warehousing
Alan Morrison
Data Architecture Best Practices Summit
April 18, 2023
Photo: Alain Audet at https://pixabay.com/photos/lake-foggy-lake-nature-landscape-6839357/

Outline of today’s talk
● Problem
○ Lack of desiloed, high quality, well-integrated data and logic at scale
○ Shortfalls of data warehousing
● Solution
○ FAIR data and knowledge graphs
■ Blended data + logic centered infrastructure
● Result
○ Case study examples
○ Organic, data-centric systems
○ Zero-copy integration feasibility

Problem: Data quality, siloing and poor integration

Yes, data warehousing focused on the integration problem
● Pro: Identified the critical problem to solve
● Con: Advocated a method that doesn’t delve deep enough to solve today’s
problem
● Still have the unified data model challenge

Data warehousing can’t solve today’s integration challenge
● Thousands of databases per enterprise (siloing)
● Thousands of applications (code sprawl)
● Data models buried in the app code
● Every app a special snowflake with its own data model

How did we get here? By selling the old as new

How data warehousing stopped scaling
“They recognized that these themes ended up in all these legacy apps. Sales rolled up against a
geographic and a product hierarchy, and an organizational hierarchy…. They said, Let’s have
those conformed dimensions and a small number of facts. Let’s bring the facts from all the
different systems and snap them together according to these conformed dimensions….
Brilliant idea, but I think what actually happened over time is the workload just got greater and
greater. The ability of people to actually conform those dimensions kept eroding….”
–Dave McComb, President, Semantic Arts
“Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything!” YouTube video, https://www.youtube.com/watch?v=M5XlGloj4UY&t=564s, 2021

Data warehousing model conformance doesn’t scale
“I spent a good 15 years working in financial services at some
pretty big banks. Half of the IT change budget is spent on
integration and the by-products of integration….I saw as the
technology was advancing that the percentage wasn’t going
down – in fact, it was going up. At some point, is the integration
tax going to be 100 percent?”
– Dan DeMers, CEO of Cinchy
“Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything!” YouTube video,
https://www.youtube.com/watch?v=M5XlGloj4UY&t=564s, 2021

What’s a data model? What is data integration?
An effective data model describes and unifies the contexts necessary for true data integration. It gives machines enough clues to detect and discover layered context.
“What is data integration?
Let's start with a short list of what data integration is not:
● It's not shoveling data around between systems.
● It's not calling an API.
● It's not creating a data connection to a source system.
It can include one or more of the jobs in the list here above, but what is the ingredient that cannot be missing?
It's connecting data from different source systems together in a consistent and coherent data model.”
–Wouter Trappers, CDAO
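A minimal sketch of the point in the quote above: integration as mapping records from different source systems into one consistent model, rather than just moving them. The two source systems, their field names and the `Customer` class are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Shared model that both hypothetical source systems map into."""
    customer_id: str
    name: str
    country: str


# Hypothetical records: two systems describe the same customer differently.
crm_record = {"cust_no": "C-17", "full_name": "Acme Corp", "ctry": "CA"}
billing_record = {"customerId": "C-17", "legalName": "ACME CORP", "country_code": "ca"}


def from_crm(rec: dict) -> Customer:
    """Map a CRM-style record into the shared model."""
    return Customer(rec["cust_no"], rec["full_name"].title(), rec["ctry"].upper())


def from_billing(rec: dict) -> Customer:
    """Map a billing-style record into the shared model."""
    return Customer(rec["customerId"], rec["legalName"].title(), rec["country_code"].upper())


if __name__ == "__main__":
    # Once both records are expressed in the shared model, they can be compared directly.
    print(from_crm(crm_record) == from_billing(billing_record))
```

Copying either record into another database would not have revealed that the two describe the same customer; the shared model is what makes them comparable.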

Why large-scale integration?
Large-scale integration is essential to avoiding observational bias. The analogy of the drunk looking for his money under the lamppost describes the nature of this bias: he looks where the light is, even though he knows the money is in the shadows.
To manage today’s business at scale, enterprises need light and visibility across departments, organizations and supply networks.

Solution: Scale FAIR data development using data-centric architecture, semantics and knowledge graph methods

The Five Commingled Phases of Compute, Networking and Storage
● 1st: Mainframe and Green Screens. Centralized storage and compute, with minimal networking
● 2nd: Client-Server and Desktops. Application distribution via proprietary and IP networking
● 3rd: Early Web (on Client-Server). Simple web hosting + legacy Client-Server storage
● 4th: Distributed Cloud and Mobile Devices. Commodity servers + storage + some virtualization
● 5th: “Decoupled” and “Decentralized” Cloud. Compute and storage more loosely coupled, virtualized, controlled and data-centric
Figure axis labels: Time; Less centralized / More centralized; Application Centric / Data Centric
All phases are still active and evolving

Data-centric knowledge graphs allow desiloed visibility and interoperation at scale

Opportunity: Unitary data + description logic = knowledge
● “Data management” (structured data, mostly)
● Knowledge management (internally shared)
● Content management (externally shared)
● Learning management (internal coursework)
FAIR data and associated description logic
FAIR data is data users can have confidence in for many purposes.
Data becomes FAIR when it disambiguates concepts, individuals and roles and how they interact and relate to one another.
In a knowledge graph context, documented knowledge = FAIR data.
FAIR stands for findable, accessible, interoperable, and reusable. Under the FAIR data umbrella are all heterogeneous types of data/content.

Semantics is the path to FAIR, smart, siloless data sharing
James Kobielus, 2016
Association of European Libraries, 2017

Compare FAIR and TRUST principles
Lin, D., Crabtree, J., Dillo, I. et al. The TRUST Principles for digital repositories. Sci Data 7, 144 (2020).
https://doi.org/10.1038/s41597-020-0486-7
FAIR data leads to TRUSTed data
repositories.

Who’s behind the FAIR data movement? Big pharma, for one.
“From 2023, drug submissions to the European Medicines Agency (EMA) must
comply with select Identification of Medicinal Products (IDMP) standards. By
developing an IDMP-compliant ontology with machine-ready data, the Alliance will
support the move to automate this process, improving efficiency and patient
safety, reducing costs and time burden, and driving innovation in the drug
development pipeline.
“The project is managed by the Pistoia Alliance, with a project team of
experts from Bayer, Novartis, Roche, Merck KGaA, and GSK.”
–Erik Schultes, et al., “FAIR Digital Twins for Data-Intensive Research,” Perspective article, Front. Big Data, 11 May 2022, Sec. Data Science, Volume 5 - 2022 | https://doi.org/10.3389/fdata.2022.883341

To create FAIR data, users can start with a single triple
Linked Open Data Cloud, 2022
Starter triple for a knowledge graph
A standard knowledge graph consists of triplified, relationship-rich
data. The data model, or ontology, is also described in triples and
lives with the rest of the data. Ontologies can also be managed as
data. Linking triples merely requires a verb (or predicate, or
described edge) to link them.
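As a rough illustration of the starter-triple idea, the sketch below keeps a graph as a plain Python set of (subject, predicate, object) tuples and stores a small piece of the data model in the same set. It is not a real triple store, and every `ex:` identifier is a hypothetical placeholder.

```python
# Illustrative only: a tiny in-memory graph built from plain tuples.
# All ex: identifiers are hypothetical.

Triple = tuple[str, str, str]  # (subject, predicate, object)

graph: set[Triple] = {
    # The starter triple: one fact linking two things with a verb.
    ("ex:order-1001", "ex:placedBy", "ex:acme-corp"),
}

# Growing the graph: each new triple reuses an existing node and adds a predicate.
graph |= {
    ("ex:order-1001", "ex:contains", "ex:widget-9"),
    ("ex:acme-corp", "rdf:type", "ex:Customer"),
    ("ex:widget-9", "rdf:type", "ex:Product"),
}

# The data model (ontology) is also expressed as triples and lives in the same graph.
graph |= {
    ("ex:Customer", "rdfs:subClassOf", "ex:Organization"),
    ("ex:placedBy", "rdfs:range", "ex:Customer"),
}


def describe(node: str, triples: set[Triple]) -> list[Triple]:
    """Return every triple in which `node` is the subject or the object."""
    return sorted(t for t in triples if node in (t[0], t[2]))


if __name__ == "__main__":
    for triple in describe("ex:acme-corp", graph):
        print(triple)
```

Because the model triples sit alongside the instance triples, the same `describe` call works on `ex:Customer` as on `ex:acme-corp`, which is one way to read "ontologies can also be managed as data."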

Simple way to start a business knowledge graph
● “Use JSON-LD to atomise your enterprise data down into three-part statements and voila!
You get a connected graph!
● ✨ Decentralize the process by having each team publish their own JSON-LD, for example,
let the sales team publish the sales data and ask them to link each sale to the correct product
and client.
● 🤖 Connect GPT to the JSON-LD that your teams have published. Then, harness the power
of GPT to assist new teams in publishing their JSON-LD and integrating it back into your
enterprise-wide Knowledge Graph.”
Key to scaling external/internal integration: use the schema.org modeled JSON-LD from websites
GPT is trained on and connect it with internal data also modeled with schema.org
–#HT Tony Seale, UBS
https://www.linkedin.com/posts/tonyseale_mlops-dataintegration-ai-activity-7052551060237819904-bAZc
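A hedged sketch of the first two steps in the quote above: a team publishes a sale as JSON-LD using schema.org terms, and the document is broken down into three-part statements. The `example.com` identifiers are invented, and `to_statements` is a deliberately naive helper written for this example, not a conforming JSON-LD processor.

```python
import json

# Hypothetical sales record published by a sales team as JSON-LD.
# The @id URLs are made up; Order, customer and orderedItem are schema.org terms.
sale = {
    "@context": "https://schema.org",
    "@id": "https://example.com/orders/1001",
    "@type": "Order",
    "customer": {"@id": "https://example.com/clients/acme"},
    "orderedItem": {"@id": "https://example.com/products/widget-9"},
}


def to_statements(doc: dict) -> list[tuple[str, str, str]]:
    """Naively flatten one flat JSON-LD node into three-part statements.

    Assumption: the node has an @id and only simple values or {"@id": ...}
    references. Context expansion, nesting, lists and blank nodes are ignored.
    """
    subject = doc["@id"]
    statements = []
    for key, value in doc.items():
        if key in ("@context", "@id"):
            continue
        predicate = "rdf:type" if key == "@type" else key
        obj = value["@id"] if isinstance(value, dict) else str(value)
        statements.append((subject, predicate, obj))
    return statements


if __name__ == "__main__":
    print(json.dumps(sale, indent=2))
    for statement in to_statements(sale):
        print(statement)
```

The links to the product and client are what connect this sale to JSON-LD published by other teams: any document that uses the same `@id` values refers to the same nodes in the graph.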

To scale FAIR data, use an assisted, hybrid AI approach
Amit Sheth, “From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge (Neuro-symbolic AI),” USC Information Sciences Institute on YouTube, March 2023, https://www.youtube.com/watch?v=xyxQXka6dRY&t=2377s

How hybrid AI helps in research
“LLMs have amazing abilities in manipulating natural language text, but generating timely and factually verified recommendations is one thing LLMs are not naturally great at.”
–Mike Tung, CEO of Diffbot
Diffbot Blog, April 2023, https://blog.diffbot.com/generating-company-recommendations-using-large-language-models-and-knowledge-graphs/
LLMs alone aren’t a reliable research tool because they hallucinate. You can’t trust the answers unless you already know them.
Mike Tung recommends more precise prompting on the query side and answer verification via a knowledge graph such as Diffbot. Both of these capabilities harness the precise logical description missing in current LLM Q&As.
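The answer-verification step can be sketched as a lookup of each claimed fact against curated triples. This is an assumption-laden toy, not Diffbot's API or method: `KNOWN_FACTS`, the `ex:` identifiers and the `verify` function are hypothetical, and the claims are assumed to have already been extracted from an LLM answer as triples.

```python
# Hypothetical curated facts; in practice these would come from a knowledge graph.
KNOWN_FACTS = {
    ("ex:acme-corp", "ex:headquarteredIn", "ex:Toronto"),
    ("ex:acme-corp", "ex:industry", "ex:Logistics"),
}


def verify(claims, facts=KNOWN_FACTS):
    """Label each claimed triple as 'supported', 'conflicts' or 'unknown'.

    Assumption: each predicate is treated as single-valued, so a different
    known object for the same subject and predicate counts as a conflict.
    """
    results = {}
    for subject, predicate, obj in claims:
        known_objects = {o for s, p, o in facts if s == subject and p == predicate}
        if obj in known_objects:
            results[(subject, predicate, obj)] = "supported"
        elif known_objects:
            results[(subject, predicate, obj)] = "conflicts"
        else:
            results[(subject, predicate, obj)] = "unknown"
    return results


if __name__ == "__main__":
    # Hypothetical claims extracted from an LLM answer.
    llm_claims = [
        ("ex:acme-corp", "ex:headquarteredIn", "ex:Toronto"),
        ("ex:acme-corp", "ex:industry", "ex:Retail"),
        ("ex:acme-corp", "ex:foundedIn", "ex:1999"),
    ]
    for claim, status in verify(llm_claims).items():
        print(status, claim)
```

Only the "supported" claims can be passed on with confidence; "conflicts" and "unknown" are the cases where a human or a further query needs to step in.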

KGs and data-centric architecture

Semantic standards allow a desiloed data landscape

How shared graph semantics helps
● Boosts meaningful results and relevancy (poor results stem from a lack of data and logic transparency and cohesiveness)
● Contextualizes data for management and reuse with relationship logic
● Scales meaningful connections between contexts (relevant relationships living with entities)
● Enables Metcalfe’s network of networks effect (network_effect^N)
● Enables model-driven development (code once, reuse anywhere)
● Scales efficiencies and economies so that energy consumption is reduced

IKEA’s product knowledge graph
Katariina Kari, “IKEA’s Knowledge Graph and Why It Has Three Layers,” August 2022, https://medium.com/flat-pack-tech/ikeas-knowledge-graph-and-why-it-has-three-layers-a38fca436349
Currently designed to be customer facing; can evolve for logistics purposes with more detailed product data

Blue Brain Nexus: graph-based bioinformatics collaboration

Montefiore Health’s Patient-centered Analytical Learning Machine (“PALM”): Personalized medicine at scale

Enterprise decentralized app environment: OriginTrail.io
https://origintrail.io/

OriginTrail + BSI’s supply chain tracking and tracing
OriginTrail and the British Standards Institution (BSI), https://twitter.com/origin_trail/status/1339606640887152642?s=20, Dec. 2020
The Monasteriven whiskey produced in Ireland is tracked and traced from “grain to glass” with the OriginTrail.io approach.
OT uses a decentralized knowledge graph that connects to one of several different blockchains.
This method enables shared data reuse and other synergies across the supply chain.

SOLID shared, federated XaaS: Construction industry
“TrinPod™: World's first conceptually indexed space-time digital twin using Solid,” Graphmetrix, 2022, https://graphmetrix.com/trinpod
Company-specific SOLID storage pods and access control can be managed by each supply chain partner. Graphmetrix as digital twin provider manages the system and system-level apps.

Digital twins and agents: Better data sharing than APIs?
Figure labels: Autonomous agents, Digital twins, Sensor nets. Locale: Portsmouth, UK
Iotics, 2019 and 2023

Organic data when nurtured grows from seeds into trees
Rich data ecosystems evolve naturally by comparison with underdescribed, fragmented data assets
Zero-copy integration becomes possible, reducing complexity, labor and energy waste by up to 90 percent
Second-order cybernetics (humans in the loop) and precise facts and contextualization complement probabilistic methods

Seven obstacles to adoption of FAIR data development at scale

Thoughts and Reactions?
Feel free to ping me anytime with questions, etc.
Alan Morrison
Data Science Central
LinkedIn | Twitter | Quora | Slideshare
+1 408 205 5109
a.s.morrison@gmail.com
