To Graph or Not to Graph
Knowledge Graph Architectures
and LLMs
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Text2KG Workshop - ESWC - 2024
Effy Xue Li Stefan Grafberger Corey Harper
Daniel Daza
Prof. Paul Groth
James Nevin
Melika Ayoughi
Dr. Frank Nack
Dr. Jacobijn Sandberg Dr. Sebastian Schelter
Stian Soiland-Reyes Thiviyan
Thanapalasingam
Shubha Guha
Dr. Victoria Degeler
Pengyu Zhang
Zeyu Zhang
Fina Polat Erkan Karabulut Danru Xu
Bradley Allen
Dr. Jan-Christoph Kalo
Dr. Hazar Harmouch
Teresa Liberatore Yichun Wang
Thanks
Building knowledge graphs
What does the knowledge graph development cycle look like?
Gytė Tamašauskaitė and Paul Groth. 2023. Defining a
Knowledge Graph Development Process Through a
Systematic Review. ACM Trans. Softw. Eng. Methodol.
32, 1, Article 27 (January 2023)
https://doi.org/10.1145/3522586
Timely question
• Dagstuhl seminar 22372, 11-14.09.2022
• Organised by
• Paul Groth (University of Amsterdam, NL)
• Elena Simperl (King's College London, UK)
• Marieke van Erp (KNAW Humanities Cluster - Amsterdam, NL)
• Denny Vrandecic (Wikimedia - San Francisco, US)
• More information at: https://www.dagstuhl.de/seminars/seminar-calendar/seminar-details/22372
• Other places too: aaai-make.info
Knowledge engineering: before
• Gathering highly curated knowledge
from experts and encoding it into
computational representations in
knowledge bases.
• Mostly manual process, focusing on
how knowledge was structured and
organised rather than the domain data.
• Results used in expert systems,
requiring considerable up-front
investment.
Knowledge engineering: today
Automatic process with human-in-the-loop
Large knowledge bases, drawn from heterogeneous data, using a mix of data
management, machine learning, knowledge representation, crowdsourcing
Given access to data and (off-the-shelf) AI capabilities, costs are a fraction of what
they were decades ago.
This has led to mainstream adoption in search, intelligent assistants, digital twins, supply
chain management, legal compliance etc.
KE Requirements over time
See Allen et al. 2023
https://arxiv.org/abs/2306.15124
LLMs have changed our thinking
Knowledge Engineering Using Large Language Models
Bradley P. Allen¹
University of Amsterdam, Amsterdam, The Netherlands
Lise Stork
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Paul Groth
University of Amsterdam, Amsterdam, The Netherlands
Abstract
Knowledge engineering is a discipline that focuses on the creation and maintenance of processes that
generate and apply knowledge. Traditionally, knowledge engineering approaches have focused on
knowledge expressed in formal languages. The emergence of large language models and their
capabilities to effectively work with natural language, in its broadest sense, raises questions about
the foundations and practice of knowledge engineering. Here, we outline the potential role of LLMs in
knowledge engineering, identifying two central directions: 1) creating hybrid neuro-symbolic
knowledge systems; and 2) enabling knowledge engineering in natural language. Additionally, we
formulate key open research questions to tackle these directions.
2012 ACM Subject Classification Computing methodologies → Natural language processing; Computing
methodologies → Machine learning; Computing methodologies → Philosophical/theoretical foundations
of artificial intelligence; Software and its engineering → Software development methods
Keywords and phrases knowledge engineering, large language models
Digital Object Identifier 10.4230/TGDK.1.1.3
Category Vision
Related Version Previous Version: https://doi.org/10.48550/arXiv.2310.00637
Funding Lise Stork: EU’s Horizon Europe research and innovation programme, the MUHAI project
(grant agreement no. 951846).
Paul Groth: EU’s Horizon Europe research and innovation programme, the ENEXA project (grant
Agreement no. 101070305).
Acknowledgements This work has benefited from Dagstuhl Seminar 22372 “Knowledge Graphs and
Their Role in the Knowledge Engineering of the 21st Century.” We also thank Frank van Harmelen for
conversations on this topic.
Received 2023-06-30 Accepted 2023-08-31 Published 2023-12-19
Editors Aidan Hogan, Ian Horrocks, Andreas Hotho, and Lalana Kagal
Special Issue Trends in Graph Data and Knowledge
1 Introduction
Knowledge engineering (KE) is a discipline concerned with the development and maintenance of
automated processes that generate and apply knowledge [4, 93]. Knowledge engineering rose to
prominence in the nineteen-seventies, when Edward Feigenbaum and others became convinced that
automating knowledge production through the application of research into artificial intelligence
required a domain-specific focus [32]. From the mid-nineteen-seventies to the nineteen-eighties,
knowledge engineering was mainly defined as the development of expert systems for automated
decision-making. By the early nineteen-nineties, however, it became clear that the expert systems
approach, given its dependence on manual knowledge acquisition and rule-based representation
1
Corresponding author
© Bradley P. Allen, Lise Stork, and Paul Groth;
licensed under Creative Commons License CC-BY 4.0
Transactions on Graph Data and Knowledge, Vol. 1, Issue 1, Article No. 3, pp. 3:1–3:19
Transactions on Graph Data and Knowledge
T G D K Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
Bradley P. Allen, Lise Stork, and Paul Groth. Knowledge Engineering Using Large Language Models. In
Special Issue on Trends in Graph Data and Knowledge. Transactions on Graph Data and Knowledge
(TGDK), Volume 1, Issue 1, pp. 3:1-3:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/TGDK.1.1.3
https://king-s-knowledge-graph-lab.github.io/knowledge-prompting-hackathon//
The Multimodal Nature of Knowledge
Embracing the diversity of knowledge forms
• domain knowledge is often best represented in a variety of modalities, i.e., images, taxonomies, or free text,
• each modality with its own data structure and characteristics which should be preserved,
• no easy way of integrating, interfacing with or reasoning over multimodal knowledge in a federated way exists;
• provenance of data is paramount in understanding knowledge within the context in which it was produced;
• fuzzy, incomplete, or complex knowledge is not easily systematized;
• Data standards
• using data standards for describing and reasoning over data can aid in countering unwanted biases via transparency;
• making data comply with data standards can lead to oversimplification or reinterpretation;
• the production of structured domain knowledge, for instance from images or free text, requires domain expertise, and is
therefore labor intensive and costly;
• knowledge evolves, and knowledge-based systems are required to deal with updates in their knowledge bases.
LLMs for KB and LLMs as KB
LLMs for KBs
LLMs for Information Extraction
Relation Extraction & Instruction Tuning
Do Instruction-tuned Large Language Models Help with Relation Extraction?
Xue Li, Fina Polat and Paul Groth. LM-AKBC Workshop at ISWC 2023
Results on REBEL dataset
Results on Post-Hoc Human Eval
Can we preserve relation extraction performance
while preserving in-context capabilities?
Method: Instruction-tune the Dolly LLM with
LoRA using a relation extraction dataset
(REBEL)
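A minimal sketch of how a relation extraction example might be turned into an instruction-tuning record, and how generated text might be parsed back into triples for scoring. The instruction wording, the `(subject | relation | object)` output format and all function names are assumptions made for illustration; they are not the format used in the paper.

```python
import re

# Hypothetical instruction wording, not taken from the paper.
INSTRUCTION = "Extract all (subject | relation | object) triples from the text."

# Matches one "(a | b | c)" group; parts may not contain "|" or parentheses.
TRIPLE_RE = re.compile(r"\(([^|()]+)\|([^|()]+)\|([^|()]+)\)")


def to_instruction_record(text, triples):
    """Build one instruction-tuning record from a sentence and its gold triples."""
    response = "\n".join(f"({s} | {r} | {o})" for s, r, o in triples)
    return {"instruction": INSTRUCTION, "context": text, "response": response}


def parse_triples(generated):
    """Parse generated text back into a set of (subject, relation, object) tuples."""
    return {
        tuple(part.strip() for part in match.groups())
        for match in TRIPLE_RE.finditer(generated)
    }


def precision_recall_f1(predicted, gold):
    """Exact-match scores over two sets of triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Exact matching is the strictest way to score; a post-hoc human evaluation, as reported on the slide, can credit correct triples that the gold annotations miss.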
Language Models as Encoders
https://github.com/dfdazac/blp
Inductive Entity Representations from Text via Link Prediction
Daza, Daniel, Cochez, Michael, and Groth, Paul
In Proceedings of The Web Conference 2021.
DOI:10.1145/3442381.3450141
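A minimal sketch of the idea behind using a language model as an encoder for link prediction: entity embeddings are computed from textual descriptions, so an entity unseen during training can still be scored. The hashed bag-of-words encoder below is a toy stand-in for a pre-trained language model, and TransE is used as one common scoring function; both choices, and all names, are assumptions for illustration rather than the paper's setup.

```python
import hashlib
import math

DIM = 16  # toy embedding size, chosen arbitrarily


def encode_text(description, dim=DIM):
    """Toy stand-in for a language-model encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in description.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = digest[0] % dim
        sign = 1.0 if digest[1] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


if __name__ == "__main__":
    # An entity never seen in training still gets an embedding from its text.
    new_entity = encode_text("aspirin is a drug used to reduce pain and fever")
    candidate = encode_text("pain is an unpleasant sensory experience")
    treats = [0.0] * DIM  # a learned relation vector would go here
    print(transe_score(new_entity, treats, candidate))
```

Because the embedding is a function of text rather than a lookup-table row, the same scoring code applies to entities added after training.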
BioBLP: Domain Specific Attribute Encoders
Daza, Daniel, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael
Cochez, and Paul Groth. BioBLP: a modular framework for learning on
multimodal biomedical knowledge graphs. J Biomed Semant 14, 20 (2023).
https://doi.org/10.1186/s13326-023-00301-y
LLMs for Data Wrangling
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter (2024). Directions Towards
Efficient and Automated Data Wrangling with Large Language Models. Databases and
Machine Learning workshop at ICDE.
Code and experimental results available at https://github.com/Jantory/cpwrangle
Data Wrangling with Large Language Models (LLMs)
• Huge potential of LLMs for long-standing data wrangling tasks such as entity matching,
missing value imputation and error detection [1, 2]
• Automation and scalability challenges (e.g. for data wrangling services in the cloud)
• Manual few-prompt selection from [1] not automatable and scalable
• Disadvantages of automatable alternatives such as fully fine-tuning a model per customer
• High storage costs (for copies of model parameters)
• High computational costs (for model training)
→ We need parameter- and compute-efficient ways to employ LLMs for data wrangling
[1] Narayan et al.: Can Foundation Models Wrangle Your Data?, VLDB’22
[2] Fernandez et al.: How large language models will disrupt data management, VLDB’23
Parameter Efficient Fine-Tuning
Transfer Learning techniques for LLMs
• Manual prompt engineering -- no training (+), hard to automate (-)
• Full finetuning (FT) -- high performance (+), requires substantial computational resources (-)
• Parameter Efficient Tuning (PEFT) -- fewer parameters trained (+), on par performance (+)
Prefix-tuning [1]
LoRA adapter [2]
[1] Li et al., “Prefix-tuning: Optimizing continuous prompts for generation,” ACL’21.
[2] Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR’22.
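A minimal sketch of why a LoRA adapter trains so few parameters. The frozen weight matrix W (d_out x d_in) is left untouched and a low-rank update B·A is added, with B of shape d_out x r and A of shape r x d_in, scaled by alpha / r. Plain Python lists are used to keep the example dependency-free; the function names and the example sizes are illustrative assumptions.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Example sizes only: a 512 x 512 projection with rank 8.
    print(lora_trainable_params(512, 512, 8), "vs", full_params(512, 512))
```

With B initialised to zeros, as in the LoRA paper, the adapted layer starts out identical to the frozen one and only drifts from it as A and B are trained.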
Results on Prediction Quality
How does prediction quality vary among different PEFT methods and base models?
LLM               Method     # of Parameter Updates  Mean Predictive Score
GPT3 (175B)       Zero-Shot  -                       66.71
-                 AutoML     -                       76.88
T5-small (60.5M)  Prompt     48K                     81.94
                  P-tune     212K                    80.11
                  Prefix     309K                    67.66
                  LoRA       296K                    90.96
                  Finetune   60,500K                 89.95
T5-base (223M)    Prompt     67K                     81.22
                  P-tune     312K                    85.09
                  Prefix     914K                    84.49
                  LoRA       892K                    92.03
                  Finetune   223,000K                90.36
T5-large (783M)   Prompt     74K                     82.04
                  P-tune     369K                    76.62
                  Prefix     2,435K                  88.65
                  LoRA       2,362K                  92.24
                  Finetune   770,000K                Train Failed
Evaluated four PEFT methods (Prompt, P-tune,
Prefix, LoRA) on three variants of Google’s T5
model on benchmark data from Narayan et al.
Findings:
• PEFT methods outperform GPT3 baseline and
AutoML in many settings
• LoRA provides highest performance
• Applying PEFT methods to larger models
provides higher performance
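As a worked check on the parameter counts in the results table, the sketch below computes the share of parameters each PEFT method updates relative to full fine-tuning of the same model. The numbers are copied from the table; the dictionary layout and function name are illustrative.

```python
# Parameter updates in thousands, copied from the results table.
UPDATES_K = {
    "T5-small": {"Prompt": 48, "P-tune": 212, "Prefix": 309, "LoRA": 296, "Finetune": 60_500},
    "T5-base": {"Prompt": 67, "P-tune": 312, "Prefix": 914, "LoRA": 892, "Finetune": 223_000},
    "T5-large": {"Prompt": 74, "P-tune": 369, "Prefix": 2_435, "LoRA": 2_362, "Finetune": 770_000},
}


def fraction_updated(model, method):
    """Share of parameters updated by `method`, relative to full fine-tuning."""
    row = UPDATES_K[model]
    return row[method] / row["Finetune"]


if __name__ == "__main__":
    for model in UPDATES_K:
        for method in ("Prompt", "P-tune", "Prefix", "LoRA"):
            share = 100 * fraction_updated(model, method)
            print(f"{model:9s} {method:7s} {share:.3f}%")
```

For example, LoRA on T5-base updates 892K of 223,000K parameters, which is 0.4%; every PEFT row in the table stays below 1%.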
Results on Computational Efficiency
How does computational efficiency vary among different PEFT methods and base
models?
Training time per epoch on AMGO dataset
Mean inference throughput over all datasets
Training Times for FT on AMGO Dataset: 38s, 109s, and 312s,
respectively.
Findings:
• Only minor differences in training and inference times
between PEFT methods, parameter size has highest impact
• PEFT methods are designed for parameter efficiency (two
orders of magnitude fewer parameters than full fine-tuning),
but not for compute efficiency!
Even the fastest method, Prefix-Tuning, is only twice as fast as
full fine-tuning on T5-base
LMs as KBs
Prompt-contexts for obtaining knowledge from LLMs
Knowledge-centric Prompt Composition for Knowledge Base Construction from Pre-trained
Language Models. Xue Li, Anthony Hughes, Majlinda Llugiqi, Fina Polat, Paul Groth and
Fajar J. Ekaputra. LM-KBC Workshop at ISWC 2023
https://github.com/effyli/lm-kbc/
Task: retrieve the object of a triple
given the subject and relation
LM-AKBC challenge 2023
2nd Place!
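A minimal sketch of composing a prompt to obtain the object(s) of a triple from a language model: a per-relation template is filled for a few demonstration subjects and then for the query subject. The relation name, template wording, demonstrations and function names are hypothetical and do not reproduce the prompts in the linked repository.

```python
# Hypothetical relation name, template and demonstrations.
TEMPLATES = {
    "borders": "Which countries share a border with {subject}? Answer:",
}
EXAMPLES = {
    "borders": [("Netherlands", ["Belgium", "Germany"]), ("Portugal", ["Spain"])],
}


def compose_prompt(subject, relation, templates, examples, k=2):
    """Assemble a few-shot prompt: up to k filled demonstrations, then the query."""
    template = templates[relation]
    lines = []
    for demo_subject, demo_objects in examples.get(relation, [])[:k]:
        lines.append(template.format(subject=demo_subject) + " " + ", ".join(demo_objects))
    lines.append(template.format(subject=subject))
    return "\n".join(lines)


def parse_objects(completion):
    """Split a comma-separated completion into a list of object strings."""
    return [item.strip() for item in completion.split(",") if item.strip()]


if __name__ == "__main__":
    print(compose_prompt("France", "borders", TEMPLATES, EXAMPLES))
```

Keeping templates and demonstrations per relation makes it easy to vary the knowledge placed in the prompt without touching the rest of the pipeline.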
• SemEval competition SHROOM
  • https://helsinki-nlp.github.io/shroom/
• Goal: detect whether a given LLM output contains a hallucination or not
• 4th Best Performing system
LLMs as Curators:
Dealing with hallucinations
SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection
BP Allen, F Polat, P Groth
18th International Workshop on Semantic Evaluation (SemEval-2024)
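A minimal sketch of zero- and few-shot classification for hallucination detection: one prompt builder that becomes few-shot when demonstrations are supplied, and a majority vote over several sampled answers. The prompt wording, the voting step and the `llm` callable (any function mapping a prompt string to an answer string) are assumptions for illustration, not a description of the submitted system.

```python
from collections import Counter

LABELS = ("Hallucination", "Not Hallucination")


def build_prompt(source, output, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [
        "Decide whether the output is supported by the source.",
        f"Answer with '{LABELS[0]}' or '{LABELS[1]}'.",
    ]
    for demo_source, demo_output, demo_label in examples:
        parts.append(f"Source: {demo_source}\nOutput: {demo_output}\nAnswer: {demo_label}")
    parts.append(f"Source: {source}\nOutput: {output}\nAnswer:")
    return "\n\n".join(parts)


def normalise(answer):
    """Map a free-text answer onto one of the two labels."""
    return LABELS[1] if answer.strip().lower().startswith("not") else LABELS[0]


def classify(llm, prompt, n_samples=5):
    """Query the model n_samples times; return the majority label and its vote share."""
    votes = Counter(normalise(llm(prompt)) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

The vote share can double as a rough confidence estimate; an odd `n_samples` avoids ties between the two labels.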
LLMs as Curators: Class membership relation evaluation by an LLM
[Figure: pipeline diagram. Labels: domain knowledge in natural language; corpus C; pre-training; knowledge graph G; sampling; (e, instance-of, c); decision; formula fragment "= arg max L(T | (e, instance-of, o))", left-hand side not recoverable]
For more details: Wednesday 14:40 LLM-KE Track
Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models
by Bradley Allen and Paul Groth
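A minimal sketch of the evaluation loop suggested by the slide: sample (e, instance-of, c) triples from a knowledge graph, ask a judge for a decision on each, and collect the triples it rejects for human review. The `judge` callable stands in for an LLM call; the graph contents, function names and the sampling strategy are assumptions for illustration.

```python
import random


def sample_membership_triples(graph, n, seed=0):
    """Draw up to n (entity, 'instance-of', class) triples, reproducibly."""
    triples = sorted(t for t in graph if t[1] == "instance-of")
    rng = random.Random(seed)
    return rng.sample(triples, min(n, len(triples)))


def flag_disagreements(triples, judge):
    """Return the triples the judge rejects; `judge` stands in for an LLM decision."""
    return [t for t in triples if not judge(*t)]


if __name__ == "__main__":
    graph = {
        ("Amsterdam", "instance-of", "city"),
        ("Amsterdam", "instance-of", "country"),  # deliberately wrong
        ("Netherlands", "instance-of", "country"),
        ("Amsterdam", "located-in", "Netherlands"),
    }

    def toy_judge(entity, relation, cls):
        # Hard-coded stand-in: a real judge would prompt an LLM with the
        # class definition and a description of the entity.
        return (entity, cls) != ("Amsterdam", "country")

    sample = sample_membership_triples(graph, n=10)
    print(flag_disagreements(sample, toy_judge))
```

Rejected triples are candidates for review rather than confirmed errors, since either the graph or the judge may be wrong.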
Architectures
RAG
Source: The Future of Work With AI - Microsoft March 2023 Event
https://www.youtube.com/watch?v=Bf-dbS9CcRU&ab_channel=Microsoft
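A minimal sketch of the retrieval-augmented generation (RAG) pattern: rank passages against the question, then place the top passages in the prompt handed to a generative model. Bag-of-words cosine similarity is used here as a dependency-free stand-in for a real retriever; the prompt wording and function names are illustrative assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble the prompt that would be sent to the generative model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France",
        "LoRA adapts large language models",
        "Amsterdam is the capital of the Netherlands",
    ]
    question = "capital of the Netherlands"
    print(build_prompt(question, retrieve(question, docs, k=1)))
```

The retriever is the interchangeable part: the same prompt assembly works whether passages come from a text index, a vector store or a knowledge graph query.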
SPARQL Queries over Text
Groth, P., Scerri, A., Daniel, R., & Allen, B. (2019). End-to-end learning for answering structured queries directly over text.
Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019) (Vol. 2377, pp. 57–70). CEUR.
https://ceur-ws.org/Vol-2377/#paper7
Querying text like a DB
Mohammed Saeed, Nicola De Cao, and Paolo Papotti. 2023. Querying Large Language Models with SQL. https://doi.org/10.48550/arXiv.2304.00472
Rethinking DB architecture
Matthias Urban and Carsten Binnig: "CAESURA:
Language Models as Multi-Modal Query Planners",
CIDR’2024
https://github.com/DataManagementLab/caesura
Adaptive Source Architecture
[Figure: architecture diagram. Components: Input Documents; Input Schema; Prompt Templates; Assembled Prompt(s); Generative LLM; Knowledge Graph Elements; Training Data; Queries]
Conclusion
• LLMs through the notion of encoders allow us to take advantage of
more of what’s in our KGs
• LLMs as robust information extractors can be live components in
systems. Break away from pipeline view of information extraction.
• LLMs can be effectively treated as information sources and curators.
• This allows new flexible architectures that take advantage of the
different formats of knowledge
Paul Groth | @pgroth | pgroth.com | indelab.org

 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 

Recently uploaded (20)

Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 

To Graph or Not to Graph Knowledge Graph Architectures and LLMs

  • 1. To Graph or Not to Graph Knowledge Graph Architectures and LLMs Prof. Paul Groth | @pgroth | pgroth.com | indelab.org Text2KG Workshop - ESWC - 2024
  • 2. Thanks: Effy Xue Li, Stefan Grafberger, Corey Harper, Daniel Daza, Prof. Paul Groth, James Nevin, Melika Ayoughi, Dr. Frank Nack, Dr. Jacobijn Sandberg, Dr. Sebastian Schelter, Stian Soiland-Reyes, Thiviyan Thanapalasingam, Shubha Guha, Dr. Victoria Degeler, Pengyu Zhang, Zeyu Zhang, Fina Polat, Erkan Karabulut, Danru Xu, Bradley Allen, Dr. Jan-Christoph Kalo, Dr. Hazar Harmouch, Teresa Liberatore, Yichun Wang
  • 4. What does the knowledge graph development cycle look like?
  • 5.
  • 6. Gytė Tamašauskaitė and Paul Groth. 2023. Defining a Knowledge Graph Development Process Through a Systematic Review. ACM Trans. Softw. Eng. Methodol. 32, 1, Article 27 (January 2023) https://doi.org/10.1145/3522586
  • 7.
  • 8. Timely question • Dagstuhl seminar 22372, 11-14.09.2022 • Organised by • Paul Groth (University of Amsterdam, NL) • Elena Simperl (King's College London, UK) • Marieke van Erp (KNAW Humanities Cluster - Amsterdam, NL) • Denny Vrandecic (Wikimedia - San Francisco, US) • More information at: https://www.dagstuhl.de/seminars/seminar-calendar/seminar-details/22372 • Other places too: aaai-make.info
  • 9. Knowledge engineering: before • Gathering highly curated knowledge from experts and encoding it into computational representations in knowledge bases. • Mostly manual process, focusing on how knowledge was structured and organised rather than the domain data. • Results used in expert systems, requiring considerable up-front investment.
  • 10. Knowledge engineering: today • Automatic process with human-in-the-loop. • Large knowledge bases, drawn from heterogeneous data, using a mix of data management, machine learning, knowledge representation, crowdsourcing. • Provided access to data and (off-the-shelf) AI capabilities, costs are a fraction of what they were decades ago. • This has led to mainstream adoption in search, intelligent assistants, digital twins, supply chain management, legal compliance etc.
  • 11. KE Requirements over time. See Allen et al. 2023 https://arxiv.org/abs/2306.15124
  • 12. LLMs have changed our thinking. Knowledge Engineering Using Large Language Models. Bradley P. Allen (University of Amsterdam, Amsterdam, The Netherlands; corresponding author), Lise Stork (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), Paul Groth (University of Amsterdam, Amsterdam, The Netherlands). Abstract: Knowledge engineering is a discipline that focuses on the creation and maintenance of processes that generate and apply knowledge. Traditionally, knowledge engineering approaches have focused on knowledge expressed in formal languages. The emergence of large language models and their capabilities to effectively work with natural language, in its broadest sense, raises questions about the foundations and practice of knowledge engineering. Here, we outline the potential role of LLMs in knowledge engineering, identifying two central directions: 1) creating hybrid neuro-symbolic knowledge systems; and 2) enabling knowledge engineering in natural language. Additionally, we formulate key open research questions to tackle these directions. 2012 ACM Subject Classification: Computing methodologies → Natural language processing; Computing methodologies → Machine learning; Computing methodologies → Philosophical/theoretical foundations of artificial intelligence; Software and its engineering → Software development methods. Keywords and phrases: knowledge engineering, large language models. Digital Object Identifier: 10.4230/TGDK.1.1.3. Category: Vision. Related Version: Previous Version: https://doi.org/10.48550/arXiv.2310.00637. Funding: Lise Stork: EU’s Horizon Europe research and innovation programme, the MUHAI project (grant agreement no. 951846). Paul Groth: EU’s Horizon Europe research and innovation programme, the ENEXA project (grant agreement no. 101070305).
Acknowledgements: This work has benefited from Dagstuhl Seminar 22372 “Knowledge Graphs and Their Role in the Knowledge Engineering of the 21st Century.” We also thank Frank van Harmelen for conversations on this topic. Received 2023-06-30. Accepted 2023-08-31. Published 2023-12-19. Editors: Aidan Hogan, Ian Horrocks, Andreas Hotho, and Lalana Kagal. Special Issue: Trends in Graph Data and Knowledge. 1 Introduction: Knowledge engineering (KE) is a discipline concerned with the development and maintenance of automated processes that generate and apply knowledge [4, 93]. Knowledge engineering rose to prominence in the nineteen-seventies, when Edward Feigenbaum and others became convinced that automating knowledge production through the application of research into artificial intelligence required a domain-specific focus [32]. From the mid-nineteen-seventies to the nineteen-eighties, knowledge engineering was mainly defined as the development of expert systems for automated decision-making. By the early nineteen-nineties, however, it became clear that the expert systems approach, given its dependence on manual knowledge acquisition and rule-based representation © Bradley P. Allen, Lise Stork, and Paul Groth; licensed under Creative Commons License CC-BY 4.0. Bradley P. Allen, Lise Stork, and Paul Groth. Knowledge Engineering Using Large Language Models. In Special Issue on Trends in Graph Data and Knowledge. Transactions on Graph Data and Knowledge (TGDK), Volume 1, Issue 1, pp. 3:1-3:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/TGDK.1.1.3
  • 14. The Multimodal Nature of Knowledge
  • 15. Embracing the diversity of knowledge forms • domain knowledge is often best represented in a variety of modalities, i.e., images, taxonomies, or free text, • each modality with its own data structure and characteristics which should be preserved, • no easy way of integrating, interfacing with or reasoning over multimodal knowledge in a federated way exists; • provenance of data is paramount in understanding knowledge within the context in which it was produced; • fuzzy, incomplete, or complex knowledge is not easily systematized; • Data standards • using data standards for describing and reasoning over data can aid in countering unwanted biases via transparency; • making data comply with data standards can lead to oversimplification or reinterpretation; • the production of structured domain knowledge, for instance from images or free text, requires domain expertise, and is therefore labor intensive and costly; • knowledge evolves, and knowledge-based systems are required to deal with updates in their knowledge bases.
  • 16. LLMs for KB and LLMs as KB
  • 18. LLMs for Information Extraction
  • 19. 27.09.23 Relation Extraction & Instruction Tuning. Do Instruction-tuned Large Language Models Help with Relation Extraction? Xue Li, Fina Polat and Paul Groth. LM-AKBC Workshop at ISWC 2023. Results on REBEL dataset. Results on Post-Hoc Human Eval. Can we preserve relation extraction performance while preserving in-context capabilities? Method: Instruction-tune the Dolly LLM with LoRA using a relation extraction dataset (REBEL)
  • 20. Language Models as Encoders https://github.com/dfdazac/blp Inductive Entity Representations from Text via Link Prediction Daza, Daniel, Cochez, Michael, and Groth, Paul In Proceedings of The Web Conference 2021. DOI:10.1145/3442381.3450141
  • 21. BioBLP: Domain Specific Attribute Encoders Daza, Daniel, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, and Paul Groth. BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs. J Biomed Semant 14, 20 (2023). https://doi.org/10.1186/s13326-023-00301-y
  • 22. Daza, Daniel, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, and Paul Groth. BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs. J Biomed Semant 14, 20 (2023). https://doi.org/10.1186/s13326-023-00301-y
  • 23. LLMs for Data Wrangling. Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter (2024). Directions Towards Efficient and Automated Data Wrangling with Large Language Models. Databases and Machine Learning workshop at ICDE. Code and experimental results available at https://github.com/Jantory/cpwrangle
  • 24. Data Wrangling with Large Language Models (LLMs) • Huge potential of LLMs for long-standing data wrangling tasks such as entity matching, missing value imputation and error detection [1, 2] • Automation and scalability challenges (e.g. for data wrangling services in the cloud) • Manual few-prompt selection from [1] not automatable and scalable • Disadvantages of automatable alternatives such as fully fine-tuning a model per customer • High storage costs (for copies of model parameters) • High computational costs (for model training) → We need parameter- and compute-efficient ways to employ LLMs for data wrangling [1] Narayan et al.: Can Foundation Models Wrangle Your Data?, VLDB’22 [2] Fernandez et al.: How large language models will disrupt data management, VLDB’23
  • 25. Parameter Efficient Fine-Tuning. Transfer Learning techniques for LLMs • Manual prompt engineering -- no training (+), hard to automate (-) • Full finetuning (FT) -- high performance (+), requires substantial computational resources (-) • Parameter Efficient Tuning (PEFT) -- fewer parameters trained (+), on par performance (+) Prefix-tuning [1] LoRA adapter [2] [1] Li et al., “Prefix-tuning: Optimizing continuous prompts for generation,” ACL’21. [2] Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR’22.
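The LoRA adapter named on this slide can be illustrated with a minimal NumPy sketch. It is not the code used in the cited work: the class name `LoRALinear`, the layer sizes, and the `alpha` scaling value are assumptions chosen for illustration. The idea shown is that the pretrained weight `W` stays frozen while only two small matrices `A` and `B` are trained.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (illustrative sketch)."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

    def full_params(self):
        return self.W.size

# Hypothetical sizes, chosen only to make the parameter counts easy to read.
layer = LoRALinear(d_in=512, d_out=512, rank=8)
x = np.ones((2, 512))
y = layer.forward(x)
```

Because `B` starts at zero, the adapted layer initially computes the same output as the frozen layer, and training only has to move the `rank * (d_in + d_out)` adapter parameters.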
  • 26. Results on Prediction Quality. How does prediction quality vary among different PEFT methods and base models?

    LLM               Method     # of Parameter Updates   Mean Predictive Score
    GPT3 (175B)       Zero-Shot  -                        66.71
    -                 AutoML     -                        76.88
    T5-small (60.5M)  Prompt     48K                      81.94
                      P-tune     212K                     80.11
                      Prefix     309K                     67.66
                      LoRA       296K                     90.96
                      Finetune   60,500K                  89.95
    T5-base (223M)    Prompt     67K                      81.22
                      P-tune     312K                     85.09
                      Prefix     914K                     84.49
                      LoRA       892K                     92.03
                      Finetune   223,000K                 90.36
    T5-large (783M)   Prompt     74K                      82.04
                      P-tune     369K                     76.62
                      Prefix     2,435K                   88.65
                      LoRA       2,362K                   92.24
                      Finetune   770,000K                 Train Failed

    Evaluated four PEFT methods (Prompt, P-tune, Prefix, LoRA) on three variants of Google’s T5 model on benchmark data from Narayan et al. Findings: • PEFT methods outperform GPT3 baseline and AutoML in many settings • LoRA provides highest performance • Applying PEFT methods to larger models provides higher performance
  • 27. Results on Computational Efficiency. How does computational efficiency vary among different PEFT methods and base models? [Figure panels: Training time per epoch on AMGO dataset; Mean inference throughput over all datasets] Training Times for FT on AMGO Dataset: 38s, 109s, and 312s, respectively. Findings: • Only minor differences in training and inference times between PEFT methods; parameter size has the highest impact • PEFT methods are designed for parameter efficiency (two orders of magnitude fewer parameters than full finetuning), but not for compute efficiency! Even the fastest method, Prefix-Tuning, is only twice as fast as full fine-tuning on t5-base
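As a quick check of the prediction-quality figures, the short Python sketch below compares LoRA with full finetuning using only numbers copied from this slide. The dictionary layout and the helper name `lora_vs_finetune` are assumptions made for the example.

```python
# (parameter updates in thousands, mean predictive score), copied from the slide
results = {
    "T5-small": {"LoRA": (296, 90.96), "Finetune": (60_500, 89.95)},
    "T5-base": {"LoRA": (892, 92.03), "Finetune": (223_000, 90.36)},
}

def lora_vs_finetune(model):
    """Return (fraction of parameters LoRA updates, score gain of LoRA over finetuning)."""
    lora_k, lora_score = results[model]["LoRA"]
    ft_k, ft_score = results[model]["Finetune"]
    return lora_k / ft_k, lora_score - ft_score

for model in results:
    fraction, gain = lora_vs_finetune(model)
    print(f"{model}: LoRA updates {fraction:.2%} of the parameters, score gain {gain:+.2f}")
```

On these numbers LoRA updates well under 1% of the parameters that full finetuning touches while scoring slightly higher on both models.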
  • 29. Prompt-contexts for obtaining knowledge from LLMs. Knowledge-centric Prompt Composition for Knowledge Base Construction from Pre-trained Language Models. Xue Li, Anthony Hughes, Majlinda Llugiqi, Fina Polat, Paul Groth and Fajar J. Ekaputra. LM-KBC Workshop at ISWC 2023 https://github.com/effyli/lm-kbc/ Task: retrieve the object of a triple given the subject and relation. LM-AKBC challenge 2023: 2nd Place!
  • 30. LLMs as Curators: Dealing with hallucinations • SemEval competition SHROOM • https://helsinki-nlp.github.io/shroom/ • Goal: detect whether a given LLM output contains hallucination or not • 4th Best Performing system. SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection. BP Allen, F Polat, P Groth. 18th International Workshop on Semantic Evaluation (SemEval-2024)
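To make the task on this slide concrete, here is a minimal Python sketch of composing a prompt that asks a language model for the objects of a triple. It is an illustration only, not the prompt format from the cited paper: the function name `compose_prompt`, the field labels, and the example relation name are all hypothetical.

```python
def compose_prompt(subject, relation, context="", examples=()):
    """Build a few-shot prompt asking for the objects of (subject, relation, ?)."""
    blocks = []
    if context:
        # Optional background text about the subject, placed before the examples.
        blocks.append(f"Context: {context}")
    for ex_subject, ex_relation, ex_objects in examples:
        blocks.append(
            f"Subject: {ex_subject}\nRelation: {ex_relation}\n"
            f"Objects: {', '.join(ex_objects)}"
        )
    # The final block is left open so the model completes the object list.
    blocks.append(f"Subject: {subject}\nRelation: {relation}\nObjects:")
    return "\n\n".join(blocks)

# Hypothetical relation name and example, used only to show the layout.
prompt = compose_prompt(
    subject="Netherlands",
    relation="borders-country",
    examples=[("Portugal", "borders-country", ["Spain"])],
)
```

The returned string can be sent to any completion-style model; parsing the model's answer back into objects is left out of the sketch.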
  • 31. LLMs as Curators: Class membership relation evaluation by an LLM domain knowledge in natural language corpus C = arg max L ( 𝑇 | (e, instance-of, o) ) knowledge graph G pre-training sampling (e, instance-of, c) decision For more details: Wednesday 14:40 LLM-KE Track Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models by Bradley Allen and Paul Groth
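The sample-then-decide loop on this slide could be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method from the cited paper: `build_question`, `flag_doubtful`, and the `toy_llm` stand-in (a hard-coded lookup in place of a real model call) are hypothetical names.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def build_question(triple: Triple) -> str:
    """Turn an (entity, instance-of, class) triple into a yes/no question."""
    entity, _, cls = triple
    return f"Is {entity} an instance of {cls}? Answer yes or no."

def flag_doubtful(triples: List[Triple], ask_llm: Callable[[str], str]) -> List[Triple]:
    """Return the triples the model does not confirm, e.g. for human review."""
    doubtful = []
    for triple in triples:
        answer = ask_llm(build_question(triple)).strip().lower()
        if not answer.startswith("yes"):
            doubtful.append(triple)
    return doubtful

def toy_llm(prompt: str) -> str:
    # Stand-in for a real LLM call: answers "yes" only to one known question.
    known = {"Is Amsterdam an instance of city? Answer yes or no.": "yes"}
    return known.get(prompt, "no")

# Triples that would be sampled from the knowledge graph; made up for this example.
sampled = [("Amsterdam", "instance-of", "city"), ("Amsterdam", "instance-of", "river")]
doubtful = flag_doubtful(sampled, toy_llm)
```

Passing the model in as a callable keeps the curation loop independent of any particular LLM provider.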
  • 33. RAG Source: The Future of Work With AI - Microsoft March 2023 Event https://www.youtube.com/watch?v=Bf-dbS9CcRU&ab_channel=Microsoft
  • 34. SPARQL Queries over Text. Groth, P., Scerri, A., Daniel, R., & Allen, B. (2019). End-to-end learning for answering structured queries directly over text. Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019) (Vol. 2377, pp. 57–70). CEUR. https://ceur-ws.org/Vol-2377/#paper7
  • 35. Querying text like a DB Mohammed Saeed, Nicola De Cao, and Paolo Papotti. 2023. Querying Large Language Models with SQL. https://doi.org/10.48550/arXiv.2304.00472
  • 36. Rethinking DB architecture Matthias Urban and Carsten Binnig: "CAESURA: Language Models as Multi-Modal Query Planners", CIDR’2024 https://github.com/DataManagementLab/caesura
  • 37. Adaptive Source Architecture [Diagram labels: Input Documents; Input Schema; Generative LLM; Prompt Templates; Assembled Prompt(s); Knowledge Graph Elements; Training Data; Queries]
  • 38. Conclusion • LLMs, through the notion of encoders, allow us to take advantage of more of what’s in our KGs • LLMs as robust information extractors can be live components in systems. Break away from the pipeline view of information extraction. • LLMs can be effectively treated as information sources and curators. • This allows new flexible architectures that take advantage of the different formats of knowledge. Paul Groth | @pgroth | pgroth.com | indelab.org
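One way to read the components on this slide is that input documents and an input schema are combined with prompt templates into assembled prompts for a generative LLM. The Python sketch below shows only that assembly step, under assumptions: the template wording, the function name `assemble_prompts`, and the example relations are hypothetical and not taken from the talk.

```python
from string import Template

# Hypothetical prompt template; the wording is an assumption for illustration.
PROMPT_TEMPLATE = Template(
    "Extract triples that use only these relations: $relations\n"
    "Document:\n$document\n"
    "Triples:"
)

def assemble_prompts(documents, schema_relations, template=PROMPT_TEMPLATE):
    """Combine each input document with the input schema into one assembled prompt."""
    relations = ", ".join(schema_relations)
    return [template.substitute(relations=relations, document=doc) for doc in documents]

# Made-up documents and schema relations.
documents = ["Ada works at Acme.", "Acme is based in Utrecht."]
schema_relations = ["worksAt", "basedIn"]
prompts = assemble_prompts(documents, schema_relations)
```

Each assembled prompt would then go to the generative LLM, whose output is parsed into knowledge graph elements; that step depends on the model and is not shown.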