Data and Web Science Group
GPT4 versus BERT
Which Model Is More Suitable for Web Data Integration?
November 16, 2023
Prof. Dr. Christian Bizer
19th International Conference on
Web Information Systems and Technologies
Hello
– Prof. Dr. Christian Bizer
– Chair of Information Systems:
Web-based Systems
– Research Areas:
– Large-scale data integration
– Information extraction from semi-structured sources
– Knowledge base construction
– Analysis of the adoption of semantic web technologies
– Email: christian.bizer@uni-mannheim.de
Christian Bizer: GPT versus BERT for Data Integration. WEBIST, November 16, 2023
Structured Data on the Web
Web APIs
Web of Data
Data Portals
The (Web) Data Integration Process
Data Discovery → Schema Matching → Data Translation → Entity Matching → Data Fusion → Clean and Complete
Outline
1. Entity Matching
– BERT-based Methods
– GPT-based Methods
2. Table Annotation
– BERT-based Methods
– GPT-based Methods
3. Conclusions
1. Entity Matching
Goal: Find all records that refer to the same real-world entity.
Brand   | Product                     | Model No.  | RAM   | Color | Release
Samsung | Galaxy                      | S21        | 64    | Blue  | 2021/1/29
Samung  | Gal.                        | S 21 TGB12 | 64 GB | blau  | Feb. 2021
NULL    | Galaxy S20 Blue TGB12 64GB  | NULL       | 64000 | NULL  | 2020/1/29
Christophides, et al.: An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys, 2020.
Barlaug and Gulla: Neural Networks for Entity Matching: A Survey. TKDD, 2021.
50 Years of Entity Matching
Luna Dong: ML for Entity Linkage. Data Integration and Machine Learning: A Natural Synergy. Tutorial at SIGMOD 2018.
Entity Matching Benchmarks
Anna Primpeli and Christian Bizer: Profiling Entity Matching Benchmark Tasks. CIKM 2020.
Papadakis, et al.: A Critical Re-evaluation of Benchmark Datasets for Learning-Based Matching Algorithms. Arxiv, 2023.
Type       | Dataset        | Topic         | # Pairs | # Matches | # Attrib. | # Sources
Structured | iTunes-Amazon  | Music         | 539     | 132       | 8         | 2
Structured | DBLP-ACM       | Bibliographic | 12,363  | 2,220     | 4         | 2
Structured | DBLP-Scholar   | Bibliographic | 28,707  | 5,347     | 4         | 2
Structured | Walmart-Amazon | Products      | 10,242  | 962       | 5         | 2
Textual    | Abt-Buy        | Products      | 9,575   | 1,028     | 3         | 2
Textual    | Amazon-Google  | Products      | 11,460  | 1,167     | 3         | 2
Textual    | WDC Computers  | Products      | 33,359  | 6,146     | 4         | 745
Textual    | WDC Products   | Products      | 28,000  | 9,471     | 4         | 3,259
DeepMatcher (2018)
– Embeddings: FastText
– Summarization: Bi-RNN with attention
– Similarity computation: element-wise difference and multiplication, concatenation
– Classification: Fully connected neural net, cross entropy loss
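The similarity computation step can be illustrated with a small sketch. It is not DeepMatcher's actual code: the function name `comparison_features` and the toy vectors are hypothetical, and using the absolute value for the difference is an assumption made so that the feature does not depend on the order of the two records.

```python
def comparison_features(u, v):
    """Compare two summarized attribute vectors of equal length.

    Returns the concatenation of the element-wise (absolute) difference
    and the element-wise product. A classifier would consume this vector.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diff = [abs(a - b) for a, b in zip(u, v)]  # assumption: absolute difference
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod


# Toy attribute summaries (made-up values, not real embeddings).
left = [0.5, 1.0, -2.0]
right = [0.5, 0.0, 1.0]
features = comparison_features(left, right)
print(features)
```

The concatenated vector has twice the dimensionality of the attribute summaries and would be passed to the fully connected classification layer.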
Mudgal, Sidharth, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD, 2018.
Evaluation: DeepMatcher versus Magellan
– DeepMatcher outperforms traditional methods on textual data
– mixed results on structured data
Type       | Dataset              | DeepMatcher F1 | Magellan F1 | Difference
Structured | iTunes-Amazon        | 88.5           | 91.2        | -2.7
Structured | DBLP-ACM             | 98.4           | 98.4        | +0.0
Structured | DBLP-Scholar         | 92.3           | 94.7        | +2.4
Structured | Walmart-Amazon       | 66.9           | 71.9        | -5.0
Textual    | Abt-Buy              | 62.8           | 43.6        | +19.2
Textual    | Amazon-Google        | 69.3           | 49.1        | +20.1
Textual    | WDC Computer - Large | 89.5           | 64.5        | +25.0
Textual    | WDC Computer - Small | 70.5           | 57.6        | +12.9
Konda, et al.: Magellan: Toward Building Entity Matching Management. PVLDB, 2016.
Transformers started to win all benchmarks in NLP
– Self-supervised pre-training on large text corpora
– Fine-tuning for downstream tasks
https://huggingface.co/docs/transformers/index
DITTO (2021)
– applies BERT, DistilBERT, RoBERTa for entity matching
– Entity serialization for BERT
– A pair of entity descriptions is turned into a single sequence
– [CLS] Entity Description 1 [SEP] Entity Description 2 [SEP]
– Entity Description = [COL] attr1 [VAL] val1 . . . [COL] attrk [VAL] valk
Li, et al.: Deep Entity Matching with Pre-trained Language Models. PVLDB, 2021.
[CLS][COL] Title [VAL] DYMO D1 - Glossy tape [COL] Price [VAL] 1,99 €
[SEP][COL] Title [VAL] DYMO 45017 D1 Tape [COL] Price [VAL] 2,19 € [SEP]
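The serialization scheme above can be sketched in a few lines of Python. This is an illustration only: the function names `serialize_entity` and `serialize_pair` are hypothetical, and in practice a BERT tokenizer usually inserts the [CLS] and [SEP] tokens itself, so writing them out as text is an assumption made for readability.

```python
def serialize_entity(record):
    """Turn one record into '[COL] attr [VAL] value' segments."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())


def serialize_pair(left, right):
    """Join two serialized entity descriptions into a single sequence."""
    return f"[CLS] {serialize_entity(left)} [SEP] {serialize_entity(right)} [SEP]"


# The two offers from the example above.
left = {"Title": "DYMO D1 - Glossy tape", "Price": "1,99 €"}
right = {"Title": "DYMO 45017 D1 Tape", "Price": "2,19 €"}
sequence = serialize_pair(left, right)
print(sequence)
```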
DITTO: Architecture
– [CLS] token summarizes the pair of entities
– linear layer on top of [CLS] token for matching decision
DITTO: Evaluation
– consistent improvement for structured data
– large performance gain for textual data
Type       | Dataset              | DITTO F1 | DeepMatcher F1 (DITTO gain) | Magellan F1 (DITTO gain)
Structured | iTunes-Amazon        | 97.0     | 88.5 (+8.5)                 | 91.2 (+5.8)
Structured | DBLP-ACM             | 99.0     | 98.4 (+0.6)                 | 98.4 (+0.6)
Structured | DBLP-Scholar         | 95.6     | 92.3 (+3.3)                 | 94.7 (+0.9)
Structured | Walmart-Amazon       | 86.8     | 66.9 (+19.9)                | 71.9 (+14.9)
Textual    | Abt-Buy              | 89.3     | 62.8 (+26.5)                | 43.6 (+45.7)
Textual    | Amazon-Google        | 75.6     | 69.3 (+6.3)                 | 49.1 (+26.5)
Textual    | WDC Computer - Large | 91.7     | 89.5 (+3.2)                 | 64.5 (+27.2)
Textual    | WDC Computer - Small | 80.8     | 70.5 (+10.3)                | 57.6 (+23.2)
Zeakis, et al.: Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. PVLDB, 2023.
Contrastive Pretraining in Vision
– maximizes distance between classes in the embedding space
– uses large batches containing many positive and negative examples
Khosla, et al.: Supervised Contrastive Learning. NeurIPS 2020.
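A minimal sketch of a supervised contrastive loss in the spirit of Khosla et al. may help to make the idea concrete. It is written in plain Python for readability and rests on several assumptions: the function name `supcon_loss`, the temperature of 0.1, the toy embeddings, and the choice to average over anchors that have at least one positive are all illustrative and are not taken from the slides.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss for one batch.

    For every anchor, embeddings with the same label are pulled closer
    and all other embeddings in the batch are pushed away.
    """
    # L2-normalize so that the dot product equals the cosine similarity.
    normed = []
    for z in embeddings:
        norm = math.sqrt(sum(x * x for x in z))
        normed.append([x / norm for x in z])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without positives do not contribute
        sims = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n) if a != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


labels = [0, 0, 1, 1]
# Same-class embeddings close together versus same-class embeddings far apart.
separated = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]
loss_separated = supcon_loss(separated, labels)
loss_mixed = supcon_loss(mixed, labels)
print(loss_separated, loss_mixed)
```

The loss is small when embeddings of the same class already cluster together and large when they are interleaved with other classes, which is why large batches with many positives and negatives are useful.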
Supervised Contrastive Pretraining for Entity Matching (2022)
Peeters, Bizer: Supervised Contrastive Learning for Product Matching. WWW Companion 2022.
Evaluation: Supervised Contrastive Pretraining (SupCon)
Large improvements for smaller training sets
Method                  | WDC Computers xlarge | WDC Computers large | WDC Computers medium | WDC Computers small | Amazon-Google | Abt-Buy
# Training Pairs        | ~68K                 | ~23K                | ~8K                  | ~3K                 | ~9K           | ~7.5K
DeepMatcher             | 88.95                | 84.32               | 69.85                | 61.22               | 70.70         | 62.80
RoBERTa                 | 94.73                | 94.68               | 91.90                | 86.37               | 74.10         | 91.05
Ditto                   | 95.45                | 91.70               | 88.62                | 80.76               | 75.58         | 89.33
R-SupCon                | 98.33                | 98.16               | 97.66                | 93.18               | 79.28         | 93.70
R-SupCon + augmentation | 98.33                | 98.50               | 98.50                | 95.21               | 76.14         | 94.29
Δ to best baseline      | +0.84                | +1.60               | +6.60                | +8.84               | +3.70         | +3.24
Potential Reasons for the Good Performance of BERT-based Methods
– Serialization allows the model to pay attention to all attributes
  – no strict separation between attributes
– WordPiece tokenizer breaks unknown terms into pieces
  – no problems with out-of-vocabulary terms
– Transfer learning from pre-training texts
  – different surface forms may already be close in embedding space
– Contextualization of the embeddings
  – potentially more suited for capturing differing semantics
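How a WordPiece-style tokenizer avoids out-of-vocabulary problems can be shown with a toy example. The sketch below implements greedy longest-match-first splitting; the vocabulary is made up for illustration and the function name `wordpiece_tokenize` is hypothetical, so this is not the tokenizer shipped with any particular BERT model.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces.

    Pieces that continue a word carry the '##' prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary: the model number 'tgb12' itself is not in the vocabulary.
toy_vocab = {"galaxy", "tg", "##b", "##1", "##2", "##12"}
model_number_pieces = wordpiece_tokenize("tgb12", toy_vocab)
print(model_number_pieces)
```

An unseen model number is therefore represented by several known pieces instead of a single unknown token.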
Drawbacks of BERT-based Methods: Overfitting to Seen Entities
– Benchmark: WDC Products, training set: 9500 pairs
– Test set (seen): offers for the same products as in training, 4500 pairs
– Test set (unseen): offers for different products, 4500 pairs
Peeters, Bizer: WDC Products: A Multi-dimensional Entity Matching Benchmark. EDBT 2024.
[Figure: chart with annotated F1 changes: -24.65%, -8.92%, -7.44%, -6.43%, +0.78%, -28.73%]
Drawbacks of BERT-based Methods: Require Thousands of Training Examples
– WDC Products, Small training set: 5,000 pairs
– WDC Products, Large training set: 24,335 pairs
– significant effort for acquiring training labels
– continuous labeling and retraining necessary to cover new entities
[Figure: chart with annotated F1 changes: -6.73%, -13.69%, -13.77%, -19.77%, -4.03%, -12.29%]
Can Large Language Models (LLMs) address these drawbacks?
Entity Matching using Large Language Models
Peeters, Bizer: Entity Matching using Large Language Models. Arxiv, 2023.
Narayan, et al.: Can Foundation Models Wrangle Your Data? PVLDB, 2022.
Variations in the Prompt Formulation
Variations
– general vs. domain-specific wording
– complex vs. simple task description
– free-form vs. forced (restricted) answering
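The three dimensions above span eight prompt designs, which can be enumerated programmatically. The sketch below is only an illustration: the question wording is borrowed from prompts shown elsewhere in this deck, while the general ("entity") wording and the assignment of the two questions to "simple" and "complex" are assumptions, as are the names `NOUNS`, `TASKS` and `build_prompts`.

```python
from itertools import product

# Wording per dimension. The general wording is an assumption.
NOUNS = {
    "domain": ("product descriptions", "product"),
    "general": ("entity descriptions", "entity"),
}
# Assumed mapping of the two question styles to simple/complex.
TASKS = {
    "complex": "Do the following two {noun} refer to the same {thing}?",
    "simple": "Do the following two {noun} match?",
}
FORCE_SUFFIX = " Answer with 'Yes' if they do and 'No' if they do not."


def build_prompts():
    """Return one prompt per combination, keyed like 'domain-simple-force'."""
    prompts = {}
    for wording, task, answer in product(NOUNS, TASKS, ("force", "free")):
        noun, thing = NOUNS[wording]
        text = TASKS[task].format(noun=noun, thing=thing)
        if answer == "force":
            text += FORCE_SUFFIX  # restrict the answer format
        prompts[f"{wording}-{task}-{answer}"] = text
    return prompts


prompts = build_prompts()
for name, text in prompts.items():
    print(name, "->", text)
```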
Peeters, Bizer: Entity Matching using Large Language Models. Arxiv, 2023.
Impact of Prompt Variations
– Models: gpt3.5-turbo-0301, gpt3.5-turbo-0613, gpt4-0613
– Benchmark: WDC Products, test set: 1250 pairs
Prompt/Model          | Turbo03 | Turbo06 | GPT4
domain-complex-force  | 75.55   | 74.96   | 88.35
domain-complex-free   | 68.66   | 64.93   | 89.61
domain-simple-force   | 79.17   | 38.24   | 83.72
domain-simple-free    | 75.17   | 72.52   | 84.50
general-complex-force | 76.51   | 60.62   | 85.83
general-complex-free  | 65.87   | 67.83   | 86.72
general-simple-force  | 78.33   | 14.02   | 77.39
general-simple-free   | 79.70   | 69.71   | 83.41
Mean                  | 74.47   | 56.43   | 84.27
Standard deviation    | 4.28    | 17.87   | 3.42
Open-Source Models
– Models: SOLAR-0-70B-16Bit, StableBeluga2-70B (4 GPUs, 275GB VRAM)
– Benchmark: WDC Products, test set: 1250 pairs
Prompt/Model          | Turbo03 | GPT4  | SOLAR | Beluga2
domain-complex-force  | 75.55   | 88.35 | 67.93 | 63.61
domain-complex-free   | 68.66   | 89.61 | 72.95 | 54.97
domain-simple-force   | 79.17   | 83.72 | 26.71 | 44.19
domain-simple-free    | 75.17   | 84.50 | 53.44 | 43.79
general-complex-force | 76.51   | 85.83 | 56.52 | 54.97
general-complex-free  | 65.87   | 86.72 | 71.98 | 51.38
general-simple-force  | 78.33   | 77.39 | 11.28 | 40.00
general-simple-free   | 79.70   | 83.41 | 31.02 | 30.16
Mean                  | 74.47   | 84.27 | 51.43 | 47.58
Standard deviation    | 4.28    | 3.42  | 20.13 | 8.78
Difference to GPT4 (mean F1): SOLAR -32.84, Beluga2 -36.69
Difference to GPT4 (domain-complex-free): SOLAR -16.66, Beluga2 -34.64
Prompt as Hyperparameter
No single prompt works best for all model/dataset combinations
GPT-based versus BERT-based Entity Matching Methods
– GPT results are zero-shot: No task-specific training data!
– RoBERTa and DITTO are fine-tuned using 5K to 22K training pairs
Model           | WDC   | Abt-Buy | Wal-Ama | Ama-Goog | DBLP-Sch
Turbo03         | 79.70 | 87.39   | 74.81   | 63.72    | 84.13
GPT4            | 89.61 | 95.78   | 89.67   | 76.38    | 89.82
RoBERTa         | 77.53 | 91.21   | 87.02   | 79.27    | 93.88
Ditto           | 84.90 | 91.31   | 86.39   | 80.07    | 94.31
Δ Best GPT/BERT | 4.71  | 4.47    | 2.65    | -3.69    | -4.49
Peeters, Bizer: Entity Matching using Large Language Models. Arxiv, 2023.
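The row "Δ Best GPT/BERT" is the best GPT-based F1 minus the best BERT-based F1 per dataset. The short sketch below recomputes it from the F1 values on this slide; the variable and function names are illustrative only.

```python
# F1 values as reported on this slide.
F1 = {
    "Turbo03": {"WDC": 79.70, "Abt-Buy": 87.39, "Wal-Ama": 74.81, "Ama-Goog": 63.72, "DBLP-Sch": 84.13},
    "GPT4":    {"WDC": 89.61, "Abt-Buy": 95.78, "Wal-Ama": 89.67, "Ama-Goog": 76.38, "DBLP-Sch": 89.82},
    "RoBERTa": {"WDC": 77.53, "Abt-Buy": 91.21, "Wal-Ama": 87.02, "Ama-Goog": 79.27, "DBLP-Sch": 93.88},
    "Ditto":   {"WDC": 84.90, "Abt-Buy": 91.31, "Wal-Ama": 86.39, "Ama-Goog": 80.07, "DBLP-Sch": 94.31},
}
GPT_MODELS = ("Turbo03", "GPT4")
BERT_MODELS = ("RoBERTa", "Ditto")


def delta_best(dataset):
    """Best zero-shot GPT F1 minus best fine-tuned BERT F1 for one dataset."""
    best_gpt = max(F1[m][dataset] for m in GPT_MODELS)
    best_bert = max(F1[m][dataset] for m in BERT_MODELS)
    return round(best_gpt - best_bert, 2)


deltas = {dataset: delta_best(dataset) for dataset in F1["GPT4"]}
print(deltas)
```

Positive values mean that the best zero-shot GPT model beats the best fine-tuned BERT-based model on that dataset.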
BERT-based Methods: Generalization to Other Datasets
– RoBERTa and DITTO trained on WDC and applied to other datasets
– BERT-based methods: Hardly any transfer between datasets
– GPT-based methods: Transfer from pre-training plus emergent effects
Model            | WDC   | Abt-Buy | Wal-Ama | Ama-Goog | DBLP-Sch
RoBERTa (seen)   | 77.53 | 91.21   | 87.02   | 79.27    | 93.88
DITTO (seen)     | 84.90 | 91.31   | 86.39   | 80.07    | 94.31
RoBERTa (unseen) | -     | 55.52   | 36.46   | 31.00    | 29.64
Δ RoBERTa unseen | -     | -35.69  | -50.56  | -48.27   | -64.24
DITTO (unseen)   | -     | 48.74   | 31.55   | 33.12    | 32.82
Δ DITTO unseen   | -     | -42.57  | -54.84  | -46.95   | -61.49
In-Context Learning via Demonstrations
USER: Do the following two product descriptions match?
Product 1: ‘DYMO D1 19 mm x 7 m’
Product 2: ‘Dymo D1 (19mm x 7m – BoW)’
ASSISTANT: Yes.
USER: Do the following two product descriptions match?
Product 1: ‘DYMO D1 Tape 24mm’
Product 2: ‘Dymo D1 19mm x 7m’
ASSISTANT: No.
USER: Do the following two product descriptions match?
Answer with 'Yes' if they do and 'No' if they do not.
Product 1: 'Title: DYMO D1 - Glossy tape - black on white - Roll (1.9cm x 7m) - 1 roll(s)'
Product 2: 'Title: DYMO 45017 D1 Tape 12mm x 7m sort p rd, S0720570'
ASSISTANT: No.
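The demonstration-based prompt above can be assembled as a list of chat messages. The sketch below assumes the widely used role/content message layout of chat-style LLM APIs; the function name `build_messages` is hypothetical and no actual API call is made.

```python
QUESTION = "Do the following two product descriptions match?"
FORCE = "Answer with 'Yes' if they do and 'No' if they do not."


def build_messages(demonstrations, query):
    """Build few-shot chat messages.

    demonstrations: list of (product1, product2, is_match) tuples
    query: (product1, product2) pair that the model should decide on
    """
    messages = []
    for left, right, is_match in demonstrations:
        messages.append({
            "role": "user",
            "content": f"{QUESTION}\nProduct 1: '{left}'\nProduct 2: '{right}'",
        })
        messages.append({"role": "assistant", "content": "Yes." if is_match else "No."})
    left, right = query
    messages.append({
        "role": "user",
        "content": f"{QUESTION}\n{FORCE}\nProduct 1: '{left}'\nProduct 2: '{right}'",
    })
    return messages


demos = [
    ("DYMO D1 19 mm x 7 m", "Dymo D1 (19mm x 7m – BoW)", True),
    ("DYMO D1 Tape 24mm", "Dymo D1 19mm x 7m", False),
]
query = (
    "Title: DYMO D1 - Glossy tape - black on white - Roll (1.9cm x 7m) - 1 roll(s)",
    "Title: DYMO 45017 D1 Tape 12mm x 7m sort p rd, S0720570",
)
messages = build_messages(demos, query)
for message in messages:
    print(message["role"].upper() + ":", message["content"])
```

Whether the demonstrations are selected at random or by similarity to the query pair is a separate design choice that this sketch leaves open.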
In-Context Learning via Matching Rules
SYSTEM: Your task is to decide if two product descriptions match.
The following rules need to be observed:
1. The brand of matching products must be the same if available
2. Model numbers of matching products must be the same if available
3. Additional features of matching products must be the same if available
4. Matching attributes may not have the exact same surface form.
5. If an attribute is missing for one description, it is likely still a match if the existing attributes match.
USER: Do the following two product descriptions match?
Answer with 'Yes' if they do and 'No' if they do not.
Product 1: 'Title: DYMO D1 - Glossy tape - black on white - Roll (1.9cm x 7m) - 1 roll(s)'
Product 2: 'Title: DYMO 45017 D1 Tape 12mm x 7m sort p rd, S0720570'
ASSISTANT: No.
Results: In-Context Learning
Mean F1 over all 5 benchmark datasets
– GPT3.5 and open-source models benefit from in-context learning
– For GPT4 the additional guidance is harmful!
Prompt/Model       | Shots | Turbo03 | Turbo06 | GPT4  | SOLAR | Beluga2
Fewshot-related    | 6     | 71.87   | 70.28   | 85.22 | 63.10 | 68.27
Fewshot-related    | 10    | 71.89   | 69.93   | 86.64 | 64.64 | 69.23
Fewshot-random     | 6     | 79.25   | 75.75   | 85.11 | 79.73 | 77.33
Fewshot-random     | 10    | 80.62   | 77.32   | 85.77 | 80.71 | 77.13
Hand-written rules | 0     | 78.59   | 70.95   | 85.77 | 77.24 | 75.49
Learned rules      | 0     | 57.17   | 67.33   | 85.04 | 75.69 | 75.29
Best zero-shot     | 0     | 77.95   | 74.95   | 88.25 | 76.97 | 69.80
Δ Best zero-shot   | -     | 2.67    | 2.37    | -1.61 | 3.74  | 7.53
Impact of In-Context Learning for GPT4 and the Amazon-Google Dataset
– For some datasets, in-context learning is needed to nudge GPT4 in the right direction!
Model                   | Amazon-Google
Ditto                   | 80.07
GPT4 Zeroshot           | 76.38
Δ GPT4 Zeroshot/Ditto   | -3.69
GPT4 Random10           | 78.76
GPT4 Related10          | 85.21
Δ GPT4 Related10/Ditto  | +5.21
Peeters, Bizer: Entity Matching using Large Language Models. Arxiv, 2023.
Confidence Scores
USER: Do the following two product descriptions refer to the same product?
Provide a confidence score for your decision, 100% referring to full confidence. …
ASSISTANT: No. Confidence: 95%
[Figure: ROC curve showing the quality of the confidence scores]
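To evaluate such confidence values, the answer text first has to be parsed. The sketch below extracts the decision and the confidence from an answer in the format shown above. The regular expression, the function names, and the mapping of a decision plus confidence to a single match score (which a ROC analysis would need) are assumptions made for illustration.

```python
import re

# Matches answers such as "No. Confidence: 95%".
_ANSWER = re.compile(
    r"^\s*(yes|no)\b.*?confidence:\s*(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE | re.DOTALL,
)


def parse_answer(text):
    """Return (is_match, confidence in [0, 1]) or None if parsing fails."""
    found = _ANSWER.search(text)
    if found is None:
        return None
    is_match = found.group(1).lower() == "yes"
    confidence = float(found.group(2)) / 100.0
    return is_match, confidence


def match_score(is_match, confidence):
    """Assumed mapping: a confident 'No' becomes a low match score."""
    return confidence if is_match else 1.0 - confidence


parsed = parse_answer("No. Confidence: 95%")
print(parsed, match_score(*parsed))
```

Ranking all pairs by this score and sweeping a threshold yields the points of a ROC curve.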
Explanations
– Free Text
USER: Do the following two product descriptions refer to the same product?
Please explain your decision using the following format …
Product 1: ZEBRA 800033-801 BLACK RIBBON (1000 PRINTS) GBP 9.5
Product 2: Zebra Black Monochrome Ribbon 800011-101 - 1000 Prints GBP £12.99
ASSISTANT: Yes.
attribute=brand; importance=0.20; values=ZEBRA, Zebra; similarity=1.00
attribute=model; importance=0.40; values=800033-801, 800011-101; similarity=0.00
attribute=color; importance=0.10; values=Black, Black; similarity=1.00
attribute=prints; importance=0.20; values=1000, 1000; similarity=1.00
attribute=price; importance=0.10; values=GBP 9.5, GBP £12.99; similarity=0.00
False positive!
Explanations for BERT-based entity matchers see:
Paganelli, Matteo et al.: Analyzing How BERT Performs Entity Matching. PVLDB, 2022.
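Structured explanations of this kind can be parsed and aggregated automatically. The sketch below reads the attribute lines from the example and computes an importance-weighted similarity; this aggregation is an illustrative assumption and not necessarily how the explanations were analyzed in the talk. The function names are hypothetical.

```python
def parse_explanation(lines):
    """Parse 'attribute=...; importance=...; values=...; similarity=...' lines."""
    rows = []
    for line in lines:
        fields = dict(part.split("=", 1) for part in line.split("; "))
        rows.append({
            "attribute": fields["attribute"],
            "importance": float(fields["importance"]),
            "values": fields["values"],
            "similarity": float(fields["similarity"]),
        })
    return rows


def weighted_similarity(rows):
    """Sum of importance times similarity over all attributes."""
    return sum(row["importance"] * row["similarity"] for row in rows)


explanation = [
    "attribute=brand; importance=0.20; values=ZEBRA, Zebra; similarity=1.00",
    "attribute=model; importance=0.40; values=800033-801, 800011-101; similarity=0.00",
    "attribute=color; importance=0.10; values=Black, Black; similarity=1.00",
    "attribute=prints; importance=0.20; values=1000, 1000; similarity=1.00",
    "attribute=price; importance=0.10; values=GBP 9.5, GBP £12.99; similarity=0.00",
]
rows = parse_explanation(explanation)
score = weighted_similarity(rows)
print(round(score, 2))
```

For this false positive the weighted similarity is only about 0.5, because the most important attribute (the model number) does not match even though the model answered "Yes".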
Error Analysis (and Prompt Breeding?)
USER: The following list contains false positive and false negative product pairs from
the output of a product matching system. Given the product pairs and associated
attribute and importance information, come up with a set of error classes that explain
why the system fails on these examples. … 10 false positives and 10 false negatives …
ASSISTANT:
False Positives Error Classes:
1. Model Number Confusion
2. Semantic Misunderstanding
3. Minor Differences Overlooked
4. Attribute Importance Misalignment
False Negatives Error Classes:
1. Description Misinterpretation
2. Price and Currency Differences
3. Missing Information
4. Different Measurement Units
Fernando, et al.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. ArXiv, 2023.
Summary: GPT-based versus BERT-based Matching Methods
1. GPT-based matchers require less task-specific training data
– GPT4 zero-shot outperforms fine-tuned BERT models in many cases
2. GPT-based matchers are more robust to unseen entities
– important for Web use cases that often involve unseen entities
3. Both approaches reduce the feature engineering effort
– no information extraction necessary
– less value normalization necessary due to pre-training
4. GPT-based matchers can explain matching decisions
2. Table Annotation
Goal: Annotate table columns with terms from a shared vocabulary.
Use Cases:
1. data lake indexing for search
2. schema matching via global schema
Column Type Annotation (CTA)
Column Property Annotation (CPA)
Table Annotation Benchmarks
SemTab Table Annotation Evaluation Campaign: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
Deng, et al.: TURL: Table Understanding through Representation Learning. PVLDB 2020.
Korini, et al.: SOTAB: The WDC Schema.org Table Annotation Benchmark. SemTab Proceedings, 2022.
Task | Dataset          | # Tables | Vocabulary | # Terms | # Sources
CTA  | WikiTables       | 410,000  | Freebase   | 255     | 1
CTA  | GitTables SemTab | 6,892    | DBpedia    | 122     | 1
CTA  | WDC SOTAB V2     | 45,378   | Schema.org | 82      | 44,268
CTA  | WDC SOTAB Small  | 103      | Schema.org | 32      | 103
CPA  | WikiTables       | 53,000   | Freebase   | 121     | 1
CPA  | WDC SOTAB V2     | 29,723   | Schema.org | 110     | 29,540
TURL (2020)
– aims at learning generic table representations that are useful across a wide range of tasks
– Pre-training: Self-supervised table representation learning
– Fine-tuning: For 6 specific downstream tasks
Deng, et al.: TURL: Table Understanding through Representation Learning. PVLDB 2020.
Pujara, et al.: From Tables to Knowledge: Recent Advances in Table Understanding. Tutorial at KDD2021.
[Figure labels: text in table cells; Wikipedia entities in table cells]
DoDuo (2022)
– directly fine-tunes BERT for column and relation annotation tasks
– a table cell can pay attention to all neighboring cells
– exploits synergies between CTA and CPA task using multi-task learning
Suhara, et al.: Annotating Columns with Pre-trained Language Models. SIGMOD 2022.
Evaluation Results: Table Annotation
– Column Type Annotation (CTA): WikiTables
– Column Property Annotation (CPA): WikiTables
– good results around 90% F1 for both tasks
– use lots of training data for pre-training and fine-tuning
– all labels are covered in the training data, no unseen ones
CTA results:
Method          | F1    | P     | R
TURL (TinyBERT) | 88.86 | 90.54 | 87.23
DoDuo (BERT)    | 92.45 | 92.45 | 92.21
CPA results:
Method          | F1    | P     | R
TURL (TinyBERT) | 90.94 | 91.18 | 90.69
DoDuo (BERT)    | 91.72 | 91.97 | 91.47
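As a reminder of how the three reported metrics relate, F1 is the harmonic mean of precision (P) and recall (R). The short sketch below recomputes F1 from the rounded TURL values; small deviations from the reported numbers are expected because the published P and R are themselves rounded. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


turl_cta = f1_score(90.54, 87.23)  # reported F1: 88.86
turl_cpa = f1_score(91.18, 90.69)  # reported F1: 90.94
print(round(turl_cta, 2), round(turl_cpa, 2))
```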
Can Large Language Models (LLMs) do better for Column Type Annotation?
Korini, Bizer: Column Type Annotation using ChatGPT. VLDB Workshops, 2023.
Feuer: ArcheType: A Novel Framework for Column Type Annotation using Large Language Models. Arxiv, 2023.
CTA as Column Classification
– Benchmark: SOTAB Small
– Topics: Restaurants, Events, Albums, Artists
SYSTEM: Classify the column given to you into one of these types that are separated
by comma: RestaurantName, ArtistName, AlbumName, EventName, PriceRange,
AddressRegion, Country, Telephone, PaymentAccepted, PostalCode, Coordinate,
DayOfWeek, Time, RestaurantDescription, Review, Date, DateTime, Organization,
EventDescription, EventStatusType, EventAttendanceModeEnumeration, Currency,
Telephone, MusicRecordingName, Duration
USER: Column: 7:30 AM 7:00 AM 10:00 AM 5:00 PM 11:00 AM
Type:
ASSISTANT: Time
Using the Whole Table as Context for Disambiguation
SYSTEM: Classify the columns of a given table with only one of the following classes
that are separated with comma: RestaurantName, ArtistName, AlbumName,
PostalCode, AddressRegion, … {32 semantic types are listed here}
USER: Table: Column 1 || Column 2 || Column 3 || Column 4 \n
Friends Pizza || 2525 || Cash Visa MasterCard || 7:30 AM \n …
Classes:
ASSISTANT: RestaurantName, PostalCode, PaymentAccepted, Time
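The table serialization used in the USER message can be sketched as follows. The generic "Column i" header and the " || " separator follow the example prompt; the function name `serialize_table` is hypothetical and the handling of empty tables is an assumption.

```python
def serialize_table(rows, sep=" || "):
    """Serialize table rows into the 'Column 1 || Column 2 || ...' text format."""
    if not rows:
        raise ValueError("table must contain at least one row")
    width = len(rows[0])
    header = sep.join(f"Column {i}" for i in range(1, width + 1))
    body = [sep.join(str(cell) for cell in row) for row in rows]
    return "\n".join([header] + body)


table = [["Friends Pizza", "2525", "Cash Visa MasterCard", "7:30 AM"]]
table_text = serialize_table(table)
print("Table: " + table_text)
```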
Providing Explicit Instructions
– we instruct the model by providing reasoning steps
– we explicitly specify that the input is a table
SYSTEM: Classify the columns of a given table with only one of the following classes
that are separated with comma: {32 semantic types are listed here}
Instructions: 1. Look at the input given to you and make a table out of it.
2. Look at the cell values in detail.
3. For each column, select a class that best represents the meaning of all cells.
4. Answer with the selected class for each column with the format Column1: class.
USER: Table: Column 1 || Column 2 || Column 3 || Column 4 \n Friends Pizza ||
2525 || Cash Visa MasterCard || 7:30 AM \n …
Classes:
ASSISTANT: Column1: RestaurantName \n Column2: PostalCode \n Column3:
PaymentAccepted \n Column4: Time
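Answers in the "Column1: class" format need to be mapped back to columns before they can be evaluated. The sketch below parses such an answer and discards labels outside the allowed vocabulary; the regular expression, the function name `parse_column_answer`, and the handling of out-of-vocabulary labels are assumptions for illustration.

```python
import re


def parse_column_answer(answer, allowed):
    """Map column index to predicted label; unknown labels become None."""
    predictions = {}
    for found in re.finditer(r"Column\s*(\d+)\s*:\s*([A-Za-z]+)", answer):
        index, label = int(found.group(1)), found.group(2)
        predictions[index] = label if label in allowed else None
    return predictions


allowed_labels = {"RestaurantName", "PostalCode", "PaymentAccepted", "Time"}
answer = "Column1: RestaurantName\nColumn2: PostalCode\nColumn3: PaymentAccepted\nColumn4: Time"
predictions = parse_column_answer(answer, allowed_labels)
print(predictions)
```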
Column Type Annotation Results
– Benchmark: SOTAB Small, 32 terms, zero-shot
– Table approach with instructions works best for OpenAI models.
– open-source models are confused by complete tables
– these are zero-shot results without using task-specific training data
→ The models “know” the terms from pre-training.
Prompt / Model      | GPT03 | GPT4  | StableBeluga2 | Falcon40B
Column              | 45.85 | 86.31 | 75.55         | 21.42
Table               | 37.90 | 94.19 | 20.63         | -
Column+Instructions | 78.61 | 92.36 | 74.84         | 11.67
Table+Instructions  | 85.25 | 95.14 | 53.82         | 2.7
Korini, Bizer: Column Type Annotation using ChatGPT. VLDB Workshops, 2023.
GPT-based versus BERT-based Table Annotation Methods
– Training example (shot): Annotated column
– RoBERTa fine-tuned using concatenated cell values
– DoDuo fine-tuned by embedding complete tables with column labels
– RoBERTa using 1600 examples performs worse than GPT4 zero-shot
– DoDuo confused due to low number of training tables
Model   | Shots | F1    | Δ F1
GPT4    | 0     | 95.14 | -
RoBERTa | 356   | 89.73 | -5.41
RoBERTa | 1600  | 86.79 | -8.35
DoDuo   | 356   | 6.37  | -88.77
DoDuo   | 1600  | 53.6  | -41.54
Challenge: Large Vocabularies
• Idea: Split CTA into two steps
1. predict topic of complete table
2. perform CTA using reduced set of topic-specific labels
• Advantages:
1. save token space for large vocabularies
2. simplify the annotation task as the model chooses from smaller set of labels
Step 1: Table topic prediction prompt → ChatGPT → Model Answer
Step 2: CTA prompt with topic-specific labels → ChatGPT → Model Answer
Step 1: Table Topic Prediction
SYSTEM: Your task is to classify if a table describes restaurants, events, music
recordings or hotels.
SYSTEM: Your instructions are: 1. Look at the input given to you and make a table
out of it. 2. Look at the cell values in detail. 3. Decide if the table describes a
Restaurant, Event, Music Recording or Hotel. 4. Answer with Restaurant, Event,
Music Recording or Hotel.
USER: Classify this table: Column 1 || Column 2 || Column 3 || Column 4 \n
Friends Pizza || 2525 || Cash Visa MasterCard || 7:30 AM \n …
ASSISTANT: Restaurant
Step 2: Column Type Annotation
– the first system message uses only relevant subset of all labels
– e.g. only 11 out of 32 labels belonging to the “Restaurant” topic
SYSTEM: Your task is to classify the columns of a given table with only one of the
following classes that are separated with comma: {relevant subset of all labels}
Your instructions are: 1. Look at the input given to you and make a table out of it. 2.
Look at the cell values in detail. 3. For each column, select a class …
USER: Classify these table columns: Column 1 || Column 2 || Column 3 || Column 4 \n
Friends Pizza || 2525 || Cash Visa MasterCard || 7:30 AM \n …
ASSISTANT: Column1: RestaurantName \n Column2: PostalCode \n Column3:
PaymentAccepted \n Column4: Time
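The two steps can be chained as in the following sketch. It is a simplified illustration under stated assumptions: `ask_llm` stands for any function that sends a system and a user message to a chat model and returns the answer text, the topic-specific label subsets in `TOPIC_LABELS` are hypothetical (the real subsets are not listed on the slides), and the instruction parts of the prompts are omitted for brevity. A stub replaces the real model so that the example runs offline.

```python
# Hypothetical topic-specific label subsets (not taken from the slides).
TOPIC_LABELS = {
    "Restaurant": ["RestaurantName", "PostalCode", "PaymentAccepted", "Time"],
    "Event": ["EventName", "Date", "EventStatusType"],
    "Music Recording": ["MusicRecordingName", "ArtistName", "Duration"],
    "Hotel": ["AddressRegion", "Country", "Telephone"],
}

TOPIC_SYSTEM = ("Your task is to classify if a table describes restaurants, "
                "events, music recordings or hotels.")


def two_step_cta(table_text, ask_llm):
    """Step 1: predict the table topic. Step 2: annotate with topic labels only."""
    topic = ask_llm(TOPIC_SYSTEM, f"Classify this table: {table_text}").strip()
    if topic not in TOPIC_LABELS:
        raise ValueError(f"unexpected topic: {topic!r}")
    cta_system = ("Your task is to classify the columns of a given table with only one "
                  "of the following classes that are separated with comma: "
                  + ", ".join(TOPIC_LABELS[topic]))
    answer = ask_llm(cta_system, f"Classify these table columns: {table_text}")
    return topic, answer


# Stub model for demonstration; it records the system prompts it receives.
calls = []


def fake_llm(system, user):
    calls.append(system)
    if system == TOPIC_SYSTEM:
        return "Restaurant"
    return "Column1: RestaurantName"


topic, answer = two_step_cta("Column 1 \n Friends Pizza", fake_llm)
print(topic, "|", answer)
```

Because the second prompt lists only the labels of the predicted topic, it consumes fewer tokens than a prompt with the full vocabulary, at the cost of propagating any topic prediction error.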
Results: Two-Step Approach
F1                  | GPT03 | GPT4  | StableBeluga2 | Falcon40B
Column              | 45.85 | 86.31 | 75.55         | 21.42
Table               | 37.90 | 94.19 | 20.63         | -
Column+instructions | 78.61 | 92.36 | 74.84         | 11.67
Table+instructions  | 85.25 | 95.14 | 53.82         | 2.7
Two-step Pipeline   | 89.47 | 94.95 | 31.57         | -
– Two-step approach helps GPT03 to handle label space
– GPT4 does not require additional guidance for SOTAB Small
3. Conclusions
1. GPT-based methods require less task-specific training data
– high zero-shot performance of GPT4
2. GPT-based methods are more robust to unseen entities
– important for Web use cases that often involve unseen entities
3. BERT-based methods are cheaper to run
– no API usage fees, less GPU required
4. GPT's ability to generate explanations might increase users' trust in the integration results
Staying Up To Date
– Papers with Code collects results for all discussed benchmarks
https://paperswithcode.com/task/entity-resolution/
https://paperswithcode.com/task/table-annotation/
Thank you.
Email: christian.bizer@uni-mannheim.de
Web: https://www.uni-mannheim.de/dws/people/professors/prof-dr-christian-bizer/

Building an enterprise Natural Language Search Engine with ElasticSearch and ...
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best Practices
 
Using Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsUsing Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale Analytics
 
13 pv-do es-18-bigdata-v3
13 pv-do es-18-bigdata-v313 pv-do es-18-bigdata-v3
13 pv-do es-18-bigdata-v3
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
 
Big Data
Big DataBig Data
Big Data
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Business Intelligence.pptx
Business Intelligence.pptxBusiness Intelligence.pptx
Business Intelligence.pptx
 
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.pptFDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.ppt
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...
 
DAS Slides: Graph Databases — Practical Use Cases
DAS Slides: Graph Databases — Practical Use CasesDAS Slides: Graph Databases — Practical Use Cases
DAS Slides: Graph Databases — Practical Use Cases
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVABDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
 

More from Chris Bizer

Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product Matching
Chris Bizer
 
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
Chris Bizer
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Chris Bizer
 
Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)
Chris Bizer
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web Tables
Chris Bizer
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and Applications
Chris Bizer
 
Extending Tables with Data from over a Million Websites
 Extending Tables with Data from over a Million Websites Extending Tables with Data from over a Million Websites
Extending Tables with Data from over a Million Websites
Chris Bizer
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical Domains
Chris Bizer
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications.
Chris Bizer
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Chris Bizer
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
Chris Bizer
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of Data
Chris Bizer
 

More from Chris Bizer (12)

Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product Matching
 
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
 
Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web Tables
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and Applications
 
Extending Tables with Data from over a Million Websites
 Extending Tables with Data from over a Million Websites Extending Tables with Data from over a Million Websites
Extending Tables with Data from over a Million Websites
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical Domains
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications.
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of Data
 

Recently uploaded

Bitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docx
Bitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docxBitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docx
Bitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docx
SFC Today
 
How Salesforce Development in the UK is Driving Digital Transformation
How Salesforce Development in the UK is Driving Digital TransformationHow Salesforce Development in the UK is Driving Digital Transformation
How Salesforce Development in the UK is Driving Digital Transformation
Sweet Potato Tec
 
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
elbertablack
 
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
samyanvichadda
 
UMN degree offer diploma Transcript
UMN degree offer diploma TranscriptUMN degree offer diploma Transcript
UMN degree offer diploma Transcript
cenocb
 
202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影
202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影
202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影
ffg01100
 
Top 50 Data Science Jobs on LinkedIn.docx
Top 50 Data Science Jobs on LinkedIn.docxTop 50 Data Science Jobs on LinkedIn.docx
Top 50 Data Science Jobs on LinkedIn.docx
analyticsinsightmaga
 
Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...
Edward Blurock
 
Web development Platform Constraints.pptx
Web development Platform Constraints.pptxWeb development Platform Constraints.pptx
Web development Platform Constraints.pptx
ssuser2f6682
 
Portugal Dreamin 24 - How to easily use an API with Flows
Portugal Dreamin 24  - How to easily use an API with FlowsPortugal Dreamin 24  - How to easily use an API with Flows
Portugal Dreamin 24 - How to easily use an API with Flows
Thierry TROUIN ☁
 
Team Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public servicesTeam Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public services
Bangladesh Network Operators Group
 
Open Source TCP or Netflow Log Server Using Graylog
Open Source TCP or Netflow Log Server Using GraylogOpen Source TCP or Netflow Log Server Using Graylog
Open Source TCP or Netflow Log Server Using Graylog
Bangladesh Network Operators Group
 
Trump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination ShirtTrump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination Shirt
exgf28
 
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
shamrisumri
 
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptxDraya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
ashishkumarrana9
 
Use of Ontologies in Chemical Kinetic Database CHEMCONNECT
Use of Ontologies in Chemical Kinetic Database CHEMCONNECTUse of Ontologies in Chemical Kinetic Database CHEMCONNECT
Use of Ontologies in Chemical Kinetic Database CHEMCONNECT
Edward Blurock
 
IPv6 Deployment Planning and Security Considerations
IPv6 Deployment Planning and Security ConsiderationsIPv6 Deployment Planning and Security Considerations
IPv6 Deployment Planning and Security Considerations
Bangladesh Network Operators Group
 
2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage
Zsolt Nemeth
 
University of California, Riverside diploma
University of California, Riverside diplomaUniversity of California, Riverside diploma
University of California, Riverside diploma
eufdev
 
Study of international anticancer research trends.pdf
Study of international anticancer research trends.pdfStudy of international anticancer research trends.pdf
Study of international anticancer research trends.pdf
Preston University
 

Recently uploaded (20)

Bitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docx
Bitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docxBitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docx
Bitcoin vs Ethereum Which Crypto Performed Better in Q2, 2024.docx
 
How Salesforce Development in the UK is Driving Digital Transformation
How Salesforce Development in the UK is Driving Digital TransformationHow Salesforce Development in the UK is Driving Digital Transformation
How Salesforce Development in the UK is Driving Digital Transformation
 
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
 
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
 
UMN degree offer diploma Transcript
UMN degree offer diploma TranscriptUMN degree offer diploma Transcript
UMN degree offer diploma Transcript
 
202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影
202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影
202254.com香蕉影视,在线观看《我才不要和你做朋友呢》在线观看最新电影,香蕉影视在线观看《我才不要和你做朋友呢》在线观看高清电影
 
Top 50 Data Science Jobs on LinkedIn.docx
Top 50 Data Science Jobs on LinkedIn.docxTop 50 Data Science Jobs on LinkedIn.docx
Top 50 Data Science Jobs on LinkedIn.docx
 
Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...
 
Web development Platform Constraints.pptx
Web development Platform Constraints.pptxWeb development Platform Constraints.pptx
Web development Platform Constraints.pptx
 
Portugal Dreamin 24 - How to easily use an API with Flows
Portugal Dreamin 24  - How to easily use an API with FlowsPortugal Dreamin 24  - How to easily use an API with Flows
Portugal Dreamin 24 - How to easily use an API with Flows
 
Team Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public servicesTeam Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public services
 
Open Source TCP or Netflow Log Server Using Graylog
Open Source TCP or Netflow Log Server Using GraylogOpen Source TCP or Netflow Log Server Using Graylog
Open Source TCP or Netflow Log Server Using Graylog
 
Trump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination ShirtTrump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination Shirt
 
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
 
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptxDraya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
 
Use of Ontologies in Chemical Kinetic Database CHEMCONNECT
Use of Ontologies in Chemical Kinetic Database CHEMCONNECTUse of Ontologies in Chemical Kinetic Database CHEMCONNECT
Use of Ontologies in Chemical Kinetic Database CHEMCONNECT
 
IPv6 Deployment Planning and Security Considerations
IPv6 Deployment Planning and Security ConsiderationsIPv6 Deployment Planning and Security Considerations
IPv6 Deployment Planning and Security Considerations
 
2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage
 
University of California, Riverside diploma
University of California, Riverside diplomaUniversity of California, Riverside diploma
University of California, Riverside diploma
 
Study of international anticancer research trends.pdf
Study of international anticancer research trends.pdfStudy of international anticancer research trends.pdf
Study of international anticancer research trends.pdf
 

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?

  • 1. Data and Web Science Group. GPT4 versus BERT: Which Model Is More Suitable for Web Data Integration? Prof. Dr. Christian Bizer. 19th International Conference on Web Information Systems and Technologies, November 16, 2023.
  • 2. Hello
      – Prof. Dr. Christian Bizer
      – Chair of Information Systems: Web-based Systems
      – Research Areas:
          – Large-scale data integration
          – Information extraction from semi-structured sources
          – Knowledge base construction
          – Analysis of the adoption of semantic web technologies
      – Email: christian.bizer@uni-mannheim.de
      Christian Bizer: GPT versus BERT for Data Integration. WEBIST, November 16, 2023
  • 3. Structured Data on the Web: Web APIs, Web of Data, Data Portals
  • 4. The (Web) Data Integration Process: Data Discovery, Schema Matching, Data Translation, Entity Matching, Data Fusion (Clean and Complete)
  • 5. Outline
      1. Entity Matching
          – BERT-based Methods
          – GPT-based Methods
      2. Table Annotation
          – BERT-based Methods
          – GPT-based Methods
      3. Conclusions
  • 6. 1. Entity Matching
      Goal: Find all records that refer to the same real-world entity.
      | Brand   | Product                     | Model No.  | RAM   | Color | Release   |
      |---------|-----------------------------|------------|-------|-------|-----------|
      | Samsung | Galaxy                      | S21        | 64    | Blue  | 2021/1/29 |
      | Samung  | Gal.                        | S 21 TGB12 | 64 GB | blau  | Feb. 2021 |
      | NULL    | Galaxy S20 Blue TGB12 64GB  | NULL       | 64000 | NULL  | 2020/1/29 |
      Vassilis, et al.: End-to-End Entity Resolution for Big Data. ACM Surveys, 2020.
      Barlaug and Gulla: Neural Networks for Entity Matching: A Survey. TKDD, 2021.
  • 7. 50 Years of Entity Matching
      Luna Dong: ML for Entity Linkage. Data Integration and Machine Learning: A Natural Synergy. Tutorial at SIGMOD 2018.
  • 8. Entity Matching Benchmarks
      | Type       | Dataset        | Topic         | # Pairs | # Matches | # Attrib. | # Sources |
      |------------|----------------|---------------|---------|-----------|-----------|-----------|
      | Structured | iTunes-Amazon  | Music         | 539     | 132       | 8         | 2         |
      | Structured | DBLP-ACM       | Bibliographic | 12,363  | 2,220     | 4         | 2         |
      | Structured | DBLP-Scholar   | Bibliographic | 28,707  | 5,347     | 4         | 2         |
      | Structured | Walmart-Amazon | Products      | 10,242  | 962       | 5         | 2         |
      | Textual    | Abt-Buy        | Products      | 9,575   | 1,028     | 3         | 2         |
      | Textual    | Amazon-Google  | Products      | 11,460  | 1,167     | 3         | 2         |
      | Textual    | WDC Computers  | Products      | 33,359  | 6,146     | 4         | 745       |
      | Textual    | WDC Products   | Products      | 28,000  | 9,471     | 4         | 3259      |
      Anna Primpeli and Christian Bizer: Profiling Entity Matching Benchmark Tasks. CIKM 2020.
      Papadakis, et al.: A Critical Re-evaluation of Benchmark Datasets for Learning-Based Matching Algorithms. Arxiv, 2023.
  • 9. DeepMatcher (2018)
      – Embeddings: FastText
      – Summarization: Bi-RNN with attention
      – Similarity computation: element-wise difference and multiplication, concatenation
      – Classification: Fully connected neural net, cross entropy loss
      Mudgal, Sidharth, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD, 2018.
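The similarity computation step on the DeepMatcher slide can be illustrated with a minimal sketch in plain Python. It assumes each attribute has already been summarized into a fixed-length vector, and it assumes the element-wise difference is taken as an absolute value; the function name `compare_summaries` and the sample vectors are hypothetical and not taken from the DeepMatcher code base.

```python
def compare_summaries(u, v):
    """Build a similarity feature vector from two attribute summary vectors."""
    if len(u) != len(v):
        raise ValueError("summary vectors must have the same length")
    # Element-wise difference (taken as absolute value here, an assumption).
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    # Element-wise multiplication.
    product = [a * b for a, b in zip(u, v)]
    # Concatenate both views into one feature vector for the classifier.
    return abs_diff + product


# Hypothetical summaries of the same attribute in two records.
features = compare_summaries([0.2, 0.8, -0.5], [0.1, 0.9, 0.5])
```

The resulting vector has twice the length of a summary and would be the input of the fully connected classification layer.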
  • 10. Evaluation: DeepMatcher versus Magellan
      – DeepMatcher outperforms traditional methods on textual data
      – mixed results on structured data
      | Type       | Dataset              | DeepMatcher F1 | Magellan F1 | Difference |
      |------------|----------------------|----------------|-------------|------------|
      | Structured | iTunes-Amazon        | 88.5           | 91.2        | -2.7       |
      | Structured | DBLP-ACM             | 98.4           | 98.4        | +0.0       |
      | Structured | DBLP-Scholar         | 92.3           | 94.7        | +2.4       |
      | Structured | Walmart-Amazon       | 66.9           | 71.9        | -5.0       |
      | Textual    | Abt-Buy              | 62.8           | 43.6        | +19.2      |
      | Textual    | Amazon-Google        | 69.3           | 49.1        | +20.1      |
      | Textual    | WDC Computer - Large | 89.5           | 64.5        | +25.0      |
      | Textual    | WDC Computer - Small | 70.5           | 57.6        | +12.9      |
      Konda, et al.: Magellan: Toward Building Entity Matching Management Systems. PVLDB, 2016.
  • 11. Transformers started to win all benchmarks in NLP
      – Self-supervised pre-training on large text corpora
      – Fine-tuning for downstream tasks
      https://huggingface.co/docs/transformers/index
  • 12. DITTO (2021)
      – applies BERT, DistilBERT, RoBERTa for entity matching
      – Entity serialization for BERT
          – Pair of entity descriptions are turned into single sequence
          – [CLS] Entity Description 1 [SEP] Entity Description 2 [SEP]
          – Entity Description = [COL] attr1 [VAL] val1 . . . [COL] attrk [VAL] valk
      Example: [CLS][COL] Title [VAL] DYMO D1 - Glossy tape [COL] Price [VAL] 1,99 € [SEP][COL] Title [VAL] DYMO 45017 D1 Tape [COL] Price [VAL] 2,19 € [SEP]
      Yuliang, et al: Deep entity matching with pre-trained language models. PVLDB, 2021.
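The serialization scheme on the DITTO slide can be sketched in a few lines of Python. This is an illustration of the format only, not DITTO's own code: the function names are hypothetical, and in practice the tokenizer of the language model usually adds the [CLS] and [SEP] markers itself.

```python
def serialize_entity(record):
    """Serialize one record as '[COL] attr [VAL] value' segments."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())


def serialize_pair(left, right):
    """Turn a pair of records into the single sequence a BERT-style model reads."""
    return f"[CLS] {serialize_entity(left)} [SEP] {serialize_entity(right)} [SEP]"


# The two offers from the slide's example.
offer_a = {"Title": "DYMO D1 - Glossy tape", "Price": "1,99 €"}
offer_b = {"Title": "DYMO 45017 D1 Tape", "Price": "2,19 €"}
sequence = serialize_pair(offer_a, offer_b)
```

Because attribute names and values end up in one token sequence, the model can attend across attributes instead of comparing them column by column.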
  • 13. DITTO: Architecture
      – [CLS] token summarizes the pair of entities
      – linear layer on top of [CLS] token for matching decision
  • 14. DITTO: Evaluation
      – constant improvement for structured data
      – large performance gain for textual data
      | Type       | Dataset              | DITTO F1 | DeepMatcher F1 | Δ DITTO vs. DeepMatcher | Magellan F1 | Δ DITTO vs. Magellan |
      |------------|----------------------|----------|----------------|-------------------------|-------------|----------------------|
      | Structured | iTunes-Amazon        | 97.0     | 88.5           | +8.5                    | 91.2        | +5.8                 |
      | Structured | DBLP-ACM             | 99.0     | 98.4           | +0.6                    | 98.4        | +0.6                 |
      | Structured | DBLP-Scholar         | 95.6     | 92.3           | +3.3                    | 94.7        | +0.9                 |
      | Structured | Walmart-Amazon       | 86.8     | 66.9           | +19.9                   | 71.9        | +14.9                |
      | Textual    | Abt-Buy              | 89.3     | 62.8           | +26.5                   | 43.6        | +45.7                |
      | Textual    | Amazon-Google        | 75.6     | 69.3           | +6.3                    | 49.1        | +26.5                |
      | Textual    | WDC Computer - Large | 91.7     | 89.5           | +3.2                    | 64.5        | +27.2                |
      | Textual    | WDC Computer - Small | 80.8     | 70.5           | +10.3                   | 57.6        | +23.2                |
      Zeakis, et al.: Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. PVLDB, 2023.
  • 15. Contrastive Pretraining in Vision
      – maximizes distance between classes in the embedding space
      – uses large batches containing many positive and negative examples
      Khosla, et al.: Supervised Contrastive Learning. NeurIPS 2020.
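To make the idea of pulling same-class embeddings together and pushing other classes apart more concrete, here is a minimal pure-Python sketch of a supervised contrastive loss in the style of Khosla et al. It is written for readability rather than speed, the function name and the temperature value are assumptions, and real training code would use a tensor library and much larger batches.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have a positive."""
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity of the anchor to every other example.
        sims = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n)
            if a != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a positive for this anchor.
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Two tight clusters: labels that follow the clusters give a low loss,
# labels that cut across them give a high loss.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supcon_loss(points, [0, 0, 1, 1])
loss_mixed = supcon_loss(points, [0, 1, 0, 1])
```

In the product matching setting, the class label would be the product identifier, so that all offers for the same product count as positives for one another.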
  • 16. Supervised Contrastive Pretraining for Entity Matching (2022)
      [Figure: model diagram with a component labeled "(Frozen)"]
      Peeters, Bizer: Supervised Contrastive Learning for Product Matching. WWW Companion 2022.
  • 17. Evaluation: Supervised Contrastive Pretraining (SupCon)
      Large improvements for smaller training sets
      |                    | Abt-Buy | Amazon-Google | WDC Computers (small) | WDC Computers (medium) | WDC Computers (large) | WDC Computers (xlarge) |
      |--------------------|---------|---------------|-----------------------|------------------------|-----------------------|------------------------|
      | # Training Pairs   | ~7.5K   | ~9K           | ~3K                   | ~8K                    | ~23K                  | ~68K                   |
      | DeepMatcher        | 62.80   | 70.70         | 61.22                 | 69.85                  | 84.32                 | 88.95                  |
      | RoBERTa            | 91.05   | 74.10         | 86.37                 | 91.90                  | 94.68                 | 94.73                  |
      | Ditto              | 89.33   | 75.58         | 80.76                 | 88.62                  | 91.70                 | 95.45                  |
      | R-SupCon           | 93.70   | 79.28         | 93.18                 | 97.66                  | 98.16                 | 98.33                  |
      | R-SupCon+augmen    | 94.29   | 76.14         | 95.21                 | 98.50                  | 98.50                 | 98.33                  |
      | Δ to best baseline | +3.24   | +3.70         | +8.84                 | +6.60                  | +1.60                 | +0.84                  |
  • 18. Potential Reasons for the Good Performance of BERT-based Methods
      – Serialization allows the model to pay attention to all attributes
          – no strict separation between attributes
      – WordPiece tokenizer breaks unknown terms into pieces
          – no problems with out-of-vocabulary terms
      – Transfer learning from pre-training texts
          – different surface forms may already be close in embedding space
      – Contextualization of the embeddings
          – potentially more suited for capturing differing semantics
  • 19. Drawbacks of BERT-based Methods: Overfitting to Seen Entities
      – Benchmark: WDC Products, training set: 9500 pairs
      – Test set (seen): offers for the same products as in training, 4500 pairs
      – Test set (unseen): offers for different products, 4500 pairs
      [Figure: chart annotated with F1 differences: -24.65%, -8.92%, -7.44%, -6.43%, +0.78%, -28.73%]
      Peeters, Bizer: WDC Products: A Multi-dimensional Entity Matching Benchmark. EDBT 2024.
  • 20. Drawbacks of BERT-based Methods: Require Thousands of Training Examples
      – WDC Products, Small training set: 5,000 pairs
      – WDC Products, Large training set: 24,335 pairs
      – significant effort for acquiring training labels
      – continuous labeling and retraining necessary to cover new entities
      [Figure: chart annotated with F1 differences: -6.73%, -13.69%, -13.77%, -19.77%, -4.03%, -12.29%]
  • 21. Can Large Language Models (LLMs) address these drawbacks?
  • 22. Entity Matching using Large Language Models
      Peeters, Bizer: Entity Matching using Large Language Models. Arxiv, 2023.
      Narayan, et al.: Can Foundation Models Wrangle Your Data? PVLDB, 2022.
  • 23. Variations in the Prompt Formulation
      – general vs. domain-specific wording
      – complex vs. simple task description
      – free-form vs. forced (restricted) answering
      Peeters, Bizer: Entity Matching using Large Language Models. Arxiv, 2023.
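The three binary design dimensions on this slide span eight prompt variants, which matches the eight prompt names (such as domain-complex-force) used in the result tables. The sketch below enumerates them in Python. Only the dimension names come from the slides; the concrete wording of the "general" and "complex" variants is an assumption made for illustration, while the simple, forced wording follows the example prompt shown on the in-context learning slide.

```python
from itertools import product

# Hypothetical wording per dimension; only the dimension names come from the slides.
WORDING = {"domain": "product descriptions", "general": "entity descriptions"}
TASK = {
    "complex": "Do the following two {noun} refer to the same real-world entity?",
    "simple": "Do the following two {noun} match?",
}
ANSWER = {"force": " Answer with 'Yes' if they do and 'No' if they do not.", "free": ""}


def build_prompts():
    """Return a mapping from variant name to prompt text for all combinations."""
    prompts = {}
    for wording, task, answer in product(WORDING, TASK, ANSWER):
        name = f"{wording}-{task}-{answer}"
        prompts[name] = TASK[task].format(noun=WORDING[wording]) + ANSWER[answer]
    return prompts


PROMPTS = build_prompts()
```

Treating the variant name as a key makes it straightforward to run every prompt against every model and compare the resulting F1 scores.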
  • 24. Impact of Prompt Variations
      – Models: gpt3.5-turbo-0301, gpt3.5-turbo-0613, gpt4-0613
      – Benchmark: WDC Products, test set: 1250 pairs
      | Prompt/Model          | Turbo03 | Turbo06 | GPT4  |
      |-----------------------|---------|---------|-------|
      | domain-complex-force  | 75.55   | 74.96   | 88.35 |
      | domain-complex-free   | 68.66   | 64.93   | 89.61 |
      | domain-simple-force   | 79.17   | 38.24   | 83.72 |
      | domain-simple-free    | 75.17   | 72.52   | 84.50 |
      | general-complex-force | 76.51   | 60.62   | 85.83 |
      | general-complex-free  | 65.87   | 67.83   | 86.72 |
      | general-simple-force  | 78.33   | 14.02   | 77.39 |
      | general-simple-free   | 79.70   | 69.71   | 83.41 |
      | Mean                  | 74.47   | 56.43   | 84.27 |
      | Standard deviation    | 4.28    | 17.87   | 3.42  |
  • 25. Open-Source Models
      – Models: SOLAR-0-70B-16Bit, StableBeluga2-70B (4 GPUs, 275GB VRAM)
      – Benchmark: WDC Products, test set: 1250 pairs
      | Prompt/Model          | Turbo03 | GPT4  | SOLAR | Beluga2 |
      |-----------------------|---------|-------|-------|---------|
      | domain-complex-force  | 75.55   | 88.35 | 67.93 | 63.61   |
      | domain-complex-free   | 68.66   | 89.61 | 72.95 | 54.97   |
      | domain-simple-force   | 79.17   | 83.72 | 26.71 | 44.19   |
      | domain-simple-free    | 75.17   | 84.50 | 53.44 | 43.79   |
      | general-complex-force | 76.51   | 85.83 | 56.52 | 54.97   |
      | general-complex-free  | 65.87   | 86.72 | 71.98 | 51.38   |
      | general-simple-force  | 78.33   | 77.39 | 11.28 | 40.00   |
      | general-simple-free   | 79.70   | 83.41 | 31.02 | 30.16   |
      | Mean                  | 74.47   | 84.27 | 51.43 | 47.58   |
      | Standard deviation    | 4.28    | 3.42  | 20.13 | 8.78    |
      [Annotations on the table: -32.84% F1, -36.69% F1, -16.66% F1, -34.64% F1]
  • 26. Prompt as Hyperparameter
    No single prompt works best for all model/dataset combinations.
  • 27. GPT-based versus BERT-based Entity Matching Methods
    – GPT results are zero-shot: no task-specific training data!
    – RoBERTa and Ditto are fine-tuned using 5K to 22K training pairs
    Model               WDC     Abt-Buy   Wal-Ama   Ama-Goog   DBLP-Sch
    Turbo03             79.70   87.39     74.81     63.72      84.13
    GPT4                89.61   95.78     89.67     76.38      89.82
    RoBERTa             77.53   91.21     87.02     79.27      93.88
    Ditto               84.90   91.31     86.39     80.07      94.31
    Δ Best GPT/BERT      4.71    4.47      2.65     -3.69      -4.49
    Peeters, Bizer: Entity Matching using Large Language Models. arXiv, 2023.
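The "Δ Best GPT/BERT" row can be reproduced by taking, per dataset, the best GPT-based score minus the best BERT-based score. The following is a minimal sketch using the F1 values from the slide; the helper names are illustrative assumptions.

```python
# F1 scores copied from the slide, one list entry per dataset.
DATASETS = ["WDC", "Abt-Buy", "Wal-Ama", "Ama-Goog", "DBLP-Sch"]
GPT = {
    "Turbo03": [79.70, 87.39, 74.81, 63.72, 84.13],
    "GPT4": [89.61, 95.78, 89.67, 76.38, 89.82],
}
BERT = {
    "RoBERTa": [77.53, 91.21, 87.02, 79.27, 93.88],
    "Ditto": [84.90, 91.31, 86.39, 80.07, 94.31],
}


def best_per_dataset(models):
    """Column-wise maximum over all models of one family."""
    return [max(column) for column in zip(*models.values())]


def delta_best(gpt, bert):
    """Best GPT-based F1 minus best BERT-based F1, per dataset."""
    return [round(g - b, 2) for g, b in zip(best_per_dataset(gpt), best_per_dataset(bert))]


deltas = dict(zip(DATASETS, delta_best(GPT, BERT)))
print(deltas)
```

Positive values mean the zero-shot GPT model beats the fine-tuned BERT-based matchers on that dataset.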
  • 28. BERT-based Methods: Generalization to Other Datasets
    – RoBERTa and Ditto trained on WDC and applied to other datasets
    – BERT-based methods: hardly any transfer between datasets
    – GPT-based methods: transfer from pre-training plus emergent effects
    Model                WDC     Abt-Buy   Wal-Ama   Ama-Goog   DBLP-Sch
    RoBERTa (seen)       77.53   91.21     87.02     79.27      93.88
    Ditto (seen)         84.90   91.31     86.39     80.07      94.31
    RoBERTa (unseen)     -       55.52     36.46     31.00      29.64
    Δ RoBERTa (unseen)   -       -35.69    -50.56    -48.27     -64.24
    Ditto (unseen)       -       48.74     31.55     33.12      32.82
    Δ Ditto (unseen)     -       -42.57    -54.84    -46.95     -61.49
  • 29. In-Context Learning via Demonstrations
    USER: Do the following two product descriptions match?
      Product 1: 'DYMO D1 19 mm x 7 m'
      Product 2: 'Dymo D1 (19mm x 7m – BoW)'
    ASSISTANT: Yes.
    USER: Do the following two product descriptions match?
      Product 1: 'DYMO D1 Tape 24mm'
      Product 2: 'Dymo D1 19mm x 7m'
    ASSISTANT: No.
    USER: Do the following two product descriptions match? Answer with 'Yes' if they do and 'No' if they do not.
      Product 1: 'Title: DYMO D1 - Glossy tape - black on white - Roll (1.9cm x 7m) - 1 roll(s)'
      Product 2: 'Title: DYMO 45017 D1 Tape 12mm x 7m sort p rd, S0720570'
    ASSISTANT: No.
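The demonstration prompt can be assembled programmatically. The sketch below assumes the role/content dictionary layout commonly used by chat-style APIs; `pair_message` and `build_messages` are hypothetical helper names, and no API call is made.

```python
QUESTION = "Do the following two product descriptions match?"
FORCE_SUFFIX = " Answer with 'Yes' if they do and 'No' if they do not."


def pair_message(left, right, forced=False):
    """Format one product pair as a user question."""
    text = QUESTION + (FORCE_SUFFIX if forced else "")
    return f"{text}\nProduct 1: '{left}'\nProduct 2: '{right}'"


def build_messages(demonstrations, query):
    """Turn labeled demonstrations plus one query pair into a chat history."""
    messages = []
    for left, right, is_match in demonstrations:
        messages.append({"role": "user", "content": pair_message(left, right)})
        messages.append({"role": "assistant", "content": "Yes." if is_match else "No."})
    # The final question uses the forced Yes/No answer format.
    messages.append({"role": "user", "content": pair_message(*query, forced=True)})
    return messages


demos = [
    ("DYMO D1 19 mm x 7 m", "Dymo D1 (19mm x 7m – BoW)", True),
    ("DYMO D1 Tape 24mm", "Dymo D1 19mm x 7m", False),
]
query = (
    "Title: DYMO D1 - Glossy tape - black on white - Roll (1.9cm x 7m) - 1 roll(s)",
    "Title: DYMO 45017 D1 Tape 12mm x 7m sort p rd, S0720570",
)
messages = build_messages(demos, query)
```

Whether demonstrations are chosen at random or by similarity to the query pair is a separate selection step that this sketch leaves to the caller.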
  • 30. In-Context Learning via Matching Rules
    SYSTEM: Your task is to decide if two product descriptions match. The following rules need to be observed:
      1. The brand of matching products must be the same if available
      2. Model numbers of matching products must be the same if available
      3. Additional features of matching products must be the same if available
      4. Matching attributes may not have the exact same surface form.
      5. If an attribute is missing for one description, it is likely still a match if the existing attributes match.
    USER: Do the following two product descriptions match? Answer with 'Yes' if they do and 'No' if they do not.
      Product 1: 'Title: DYMO D1 - Glossy tape - black on white - Roll (1.9cm x 7m) - 1 roll(s)'
      Product 2: 'Title: DYMO 45017 D1 Tape 12mm x 7m sort p rd, S0720570'
    ASSISTANT: No.
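Forced Yes/No answers like the one above can be scored against gold labels with F1, the metric reported in the result tables. This is a minimal sketch; the parsing rule (a reply counts as a match if it starts with "yes") is an assumption, not the evaluation code behind the slides.

```python
def parse_answer(reply):
    """Map a model reply to True (match) or False (non-match)."""
    return reply.strip().lower().startswith("yes")


def f1_score(predictions, labels):
    """F1 of the positive (match) class."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(l and not p for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


replies = ["Yes.", "No.", "Yes", "no"]
gold = [True, False, False, True]
predictions = [parse_answer(r) for r in replies]
score = f1_score(predictions, gold)
print(score)
```

Free-form answers would need a more tolerant parser, which is one practical reason to prefer the forced answer format.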
  • 31. Results: In-Context Learning
    Mean F1 over all 5 benchmark datasets
    – GPT3.5 and open-source models benefit from in-context learning
    – For GPT4 the additional guidance is harmful!
    Prompt/Model         Shots   Turbo03   Turbo06   GPT4    SOLAR   Beluga2
    Fewshot-related      6       71.87     70.28     85.22   63.10   68.27
    Fewshot-related      10      71.89     69.93     86.64   64.64   69.23
    Fewshot-random       6       79.25     75.75     85.11   79.73   77.33
    Fewshot-random       10      80.62     77.32     85.77   80.71   77.13
    Hand-written rules   0       78.59     70.95     85.77   77.24   75.49
    Learned rules        0       57.17     67.33     85.04   75.69   75.29
    Best zero-shot       0       77.95     74.95     88.25   76.97   69.80
    Δ Best zero-shot     -        2.67      2.37     -1.61    3.74    7.53
  • 32. Impact of In-Context Learning for GPT4 on the Amazon-Google Dataset
    – For some datasets, in-context learning is needed to nudge GPT4 in the right direction!
    Model                      Amazon-Google
    Ditto                      80.07
    GPT4 (zero-shot)           76.38
    Δ GPT4 zero-shot/Ditto     -3.69
    GPT4 (random, 10 shots)    78.76
    GPT4 (related, 10 shots)   85.21
    Δ GPT4 related-10/Ditto    +5.21
    Peeters, Bizer: Entity Matching using Large Language Models. arXiv, 2023.
  • 33. Confidence Scores
    USER: Do the following two product descriptions refer to the same product? Provide a confidence score for your decision, 100% referring to full confidence. …
    ASSISTANT: No. Confidence: 95%
    The ROC curve shows the quality of the confidence scores.
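One way to evaluate such replies is to convert each decision plus confidence into a single match score and compute the area under the ROC curve. The sketch below makes two assumptions that are not taken from the slides: a "No" with confidence c is mapped to a match score of 1 - c, and the AUC is computed by pairwise comparison in plain Python.

```python
import re


def match_score(reply):
    """Turn a reply such as 'No. Confidence: 95%' into a match score in [0, 1]."""
    found = re.search(r"(\d+(?:\.\d+)?)\s*%", reply)
    if found is None:
        raise ValueError(f"no confidence value in reply: {reply!r}")
    confidence = float(found.group(1)) / 100.0
    is_match = reply.strip().lower().startswith("yes")
    # Assumption: confidence in a 'No' decision counts against a match.
    return confidence if is_match else 1.0 - confidence


def roc_auc(scores, labels):
    """Probability that a random match outscores a random non-match (ties count 0.5)."""
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one match and one non-match")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


replies = ["Yes. Confidence: 90%", "Yes. Confidence: 80%",
           "No. Confidence: 70%", "No. Confidence: 95%"]
gold = [True, False, True, False]
scores = [match_score(r) for r in replies]
auc = roc_auc(scores, gold)
print(auc)
```

An AUC near 1.0 would indicate that the stated confidences rank matches above non-matches; a value near 0.5 would mean they carry little information.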
  • 34. Explanations – Free Text
    USER: Do the following two product descriptions refer to the same product? Please explain your decision using the following format …
      Product 1: ZEBRA 800033-801 BLACK RIBBON (1000 PRINTS) GBP 9.5
      Product 2: Zebra Black Monochrome Ribbon 800011-101 - 1000 Prints GBP £12.99
    ASSISTANT: Yes.
      attribute=brand; importance=0.20; values=ZEBRA, Zebra; similarity=1.00
      attribute=model; importance=0.40; values=800033-801, 800011-101; similarity=0.00
      attribute=color; importance=0.10; values=Black, Black; similarity=1.00
      attribute=prints; importance=0.20; values=1000, 1000; similarity=1.00
      attribute=price; importance=0.10; values=GBP 9.5, GBP £12.99; similarity=0.00
    False positive!
    For explanations of BERT-based entity matchers, see: Paganelli, Matteo et al.: Analyzing How BERT Performs Entity Matching. PVLDB, 2022.
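Explanations in this `key=value; key=value` format are easy to post-process. The sketch below parses the five lines from the slide and sums importance times similarity; that weighted sum is an illustrative aggregate introduced here, not a score the model or the slides report.

```python
def parse_explanation(line):
    """Parse one 'attribute=...; importance=...; values=...; similarity=...' line."""
    record = {}
    for field in line.split(";"):
        key, _, value = field.strip().partition("=")
        record[key] = value
    record["importance"] = float(record["importance"])
    record["similarity"] = float(record["similarity"])
    return record


EXPLANATION = """\
attribute=brand; importance=0.20; values=ZEBRA, Zebra; similarity=1.00
attribute=model; importance=0.40; values=800033-801, 800011-101; similarity=0.00
attribute=color; importance=0.10; values=Black, Black; similarity=1.00
attribute=prints; importance=0.20; values=1000, 1000; similarity=1.00
attribute=price; importance=0.10; values=GBP 9.5, GBP £12.99; similarity=0.00"""

records = [parse_explanation(line) for line in EXPLANATION.splitlines()]
# Hypothetical aggregate: importance-weighted similarity over all attributes.
weighted = sum(r["importance"] * r["similarity"] for r in records)
print(round(weighted, 2))
```

In this example the most important attribute (the model number) has similarity 0.00, which is consistent with the answer "Yes" being a false positive.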
  • 35. Error Analysis (and Prompt Breeding?)
    USER: The following list contains false positive and false negative product pairs from the output of a product matching system. Given the product pairs and associated attribute and importance information, come up with a set of error classes that explain why the system fails on these examples. … 10 false positives and 10 false negatives …
    ASSISTANT:
      False Positives Error Classes:
        1. Model Number Confusion
        2. Semantic Misunderstanding
        3. Minor Differences Overlooked
        4. Attribute Importance Misalignment
      False Negatives Error Classes:
        1. Description Misinterpretation
        2. Price and Currency Differences
        3. Missing Information
        4. Different Measurement Units
    Fernando, et al.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. arXiv, 2023.
  • 36. Summary: GPT-based versus BERT-based Matching Methods
    1. GPT-based matchers require less task-specific training data
       – zero-shot GPT4 outperforms fine-tuned BERT models in many cases
    2. GPT-based matchers are more robust to unseen entities
       – important for Web use cases that often involve unseen entities
    3. Both approaches reduce the feature engineering effort
       – no information extraction necessary
       – less value normalization necessary due to pre-training
    4. GPT-based matchers can explain matching decisions
  • 37. 2. Table Annotation
    Goal: Annotate table columns with terms from a shared vocabulary.
    Use Cases:
      1. data lake indexing for search
      2. schema matching via global schema
    Figure: data integration process (Data Discovery, Schema Matching, Data Translation, Entity Matching, Data Fusion, Clean and Complete), labeled "annotate".
  • 38. Column Type Annotation (CTA)
  • 39. Column Property Annotation (CPA)
  • 40. Table Annotation Benchmarks
    Task   Dataset            # Tables   Vocabulary   # Terms   # Sources
    CTA    WikiTables         410,000    Freebase     255       1
    CTA    GitTables SemTab   6,892      DBpedia      122       1
    CTA    WDC SOTAB V2       45,378     Schema.org   82        44,268
    CTA    WDC SOTAB Small    103        Schema.org   32        103
    CPA    WikiTables         53,000     Freebase     121       1
    CPA    WDC SOTAB V2       29,723     Schema.org   110       29,540
    References:
    – SemTab Table Annotation Evaluation Campaign: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
    – Deng, et al.: TURL: Table Understanding through Representation Learning. PVLDB 2020.
    – Korini, et al.: SOTAB: The WDC Schema.org table annotation benchmark. SemTab Proceedings, 2022.
  • 41. TURL (2020)
    – aims at learning generic table representations that are useful across a wide range of tasks
    – Pre-training: self-supervised table representation learning
    – Fine-tuning: for 6 specific downstream tasks
    Figure labels: text in table cells; Wikipedia entities in table cells.
    References:
    – Deng, et al.: TURL: Table Understanding through Representation Learning. PVLDB 2020.
    – Pujara, et al.: From Tables to Knowledge: Recent Advances in Table Understanding. Tutorial at KDD 2021.
  • 42. DoDuo (2022)
    – directly fine-tunes BERT for column and relation annotation tasks
    – a table cell can pay attention to all neighboring cells
    – exploits synergies between the CTA and CPA tasks using multi-task learning
    Suhara, et al.: Annotating Columns with Pre-trained Language Models. SIGMOD 2022.
  • 43. Evaluation Results: Table Annotation
    – good results around 90% F1 for both tasks
    – both methods use lots of training data for pre-training and fine-tuning
    – all labels are covered in the training data, no unseen ones
    Column Type Annotation (CTA): WikiTables
    Method            F1      P       R
    TURL (TinyBERT)   88.86   90.54   87.23
    DoDuo (BERT)      92.45   92.45   92.21
    Column Property Annotation (CPA): WikiTables
    Method            F1      P       R
    TURL (TinyBERT)   90.94   91.18   90.69
    DoDuo (BERT)      91.72   91.97   91.47
  • 44. Can Large Language Models (LLMs) do better for Column Type Annotation?
    References:
    – Korini, Bizer: Column Type Annotation using ChatGPT. VLDB Workshops, 2023.
    – Feuer: ArcheType: A Novel Framework for Column Type Annotation using Large Language Models. arXiv, 2023.
  • 45. CTA as Column Classification
    – Benchmark: SOTAB Small
    – Topics: Restaurants, Events, Albums, Artists
    SYSTEM: Classify the column given to you into one of these types that are separated by comma: RestaurantName, ArtistName, AlbumName, EventName, PriceRange, AddressRegion, Country, Telephone, PaymentAccepted, PostalCode, Coordinate, DayOfWeek, Time, RestaurantDescription, Review, Date, DateTime, Organization, EventDescription, EventStatusType, EventAttendanceModeEnumeration, Currency, Telephone, MusicRecordingName, Duration
    USER: Column: 7:30 AM 7:00 AM 10:00 AM 5:00 PM 11:00 AM Type:
    ASSISTANT: Time
  • 46. Using the Whole Table as Context for Disambiguation
    SYSTEM: Classify the columns of a given table with only one of the following classes that are separated with comma: RestaurantName, ArtistName, AlbumName, PostalCode, AddressRegion, … {32 semantic types are listed here}
    USER: Table: Column 1 || Column 2 || Column 3 || Column 4 \n Friends Pizza || 2525|| Cash Visa MasterCard || 7:30 AM \n … Classes:
    ASSISTANT: RestaurantName, PostalCode, PaymentAccepted, Time
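The table in the user message is serialized row by row with `||` as cell separator. A minimal sketch of such a serializer follows; the function name and the generic `Column i` header follow the prompt shown above, while the exact whitespace handling is an assumption.

```python
def serialize_table(rows):
    """Serialize rows of cell values with ' || ' separators and a generic header."""
    if not rows:
        raise ValueError("table has no rows")
    n_columns = len(rows[0])
    header = " || ".join(f"Column {i}" for i in range(1, n_columns + 1))
    lines = [header]
    for row in rows:
        lines.append(" || ".join(str(cell) for cell in row))
    return "\n".join(lines)


table = [
    ["Friends Pizza", "2525", "Cash Visa MasterCard", "7:30 AM"],
]
user_message = "Table:\n" + serialize_table(table) + "\nClasses:"
print(user_message)
```

Using generic column headers keeps the model from relying on source-specific header names, so the prediction depends on the cell values.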
  • 47. Providing Explicit Instructions
    – we instruct the model by providing reasoning steps
    – we explicitly specify that the input is a table
    SYSTEM: Classify the columns of a given table with only one of the following classes that are separated with comma: {32 semantic types are listed here}
      Instructions:
      1. Look at the input given to you and make a table out of it.
      2. Look at the cell values in detail.
      3. For each column, select a class that best represents the meaning of all cells.
      4. Answer with the selected class for each column with the format Column1: class.
    USER: Table: Column 1 || Column 2 || Column 3 || Column 4 \n Friends Pizza || 2525|| Cash Visa MasterCard || 7:30 AM \n … Classes:
    ASSISTANT: Column1: RestaurantName \n Column2: PostalCode \n Column3: PaymentAccepted \n Column4: Time
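The `Column1: class` answer format makes the reply machine-readable. The sketch below parses such a reply and discards labels outside the vocabulary; the function name and the choice to return `None` for missing or unknown labels are assumptions.

```python
import re


def parse_column_answers(reply, n_columns, vocabulary):
    """Extract one label per column from lines like 'Column1: RestaurantName'."""
    labels = [None] * n_columns
    for line in reply.splitlines():
        found = re.match(r"\s*Column\s*(\d+)\s*:\s*(\S+)", line)
        if found is None:
            continue
        index = int(found.group(1)) - 1
        label = found.group(2)
        # Keep only in-range columns and labels from the given vocabulary.
        if 0 <= index < n_columns and label in vocabulary:
            labels[index] = label
    return labels


vocabulary = {"RestaurantName", "PostalCode", "PaymentAccepted", "Time"}
reply = "Column1: RestaurantName\nColumn2: PostalCode\nColumn3: PaymentAccepted\nColumn4: Time"
labels = parse_column_answers(reply, 4, vocabulary)
print(labels)
```

Columns that come back as `None` can be counted as errors or re-queried, depending on the evaluation setup.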
  • 48. Column Type Annotation Results
    – Benchmark: SOTAB Small, 32 terms, zero-shot
    – The table approach with instructions works best for OpenAI models.
    – Open-source models are confused by complete tables.
    – These are zero-shot results without using task-specific training data → the models "know" the terms from pre-training.
    Prompt / Model        GPT03   GPT4    Stable Beluga2   Falcon40B
    Column                45.85   86.31   75.55            21.42
    Table                 37.90   94.19   20.63            -
    Column+Instructions   78.61   92.36   74.84            11.67
    Table+Instructions    85.25   95.14   53.82            2.7
    Korini, Bizer: Column Type Annotation using ChatGPT. VLDB Workshops, 2023.
  • 49. GPT-based versus BERT-based Table Annotation Methods
    – Training example (shot): annotated column
    – RoBERTa fine-tuned using concatenated cell values
    – DoDuo fine-tuned by embedding complete tables with column labels
    – RoBERTa using 1600 examples performs worse than GPT4 zero-shot
    – DoDuo confused due to low number of training tables
    Model     Shots   F1      Δ F1
    GPT4      0       95.14   -
    RoBERTa   356     89.73   -5.41
    RoBERTa   1600    86.79   -8.35
    DoDuo     356     6.37    -88.77
    DoDuo     1600    53.6    -41.54
  • 50. Challenge: Large Vocabularies
    Idea: Split CTA into two steps
      1. predict the topic of the complete table
      2. perform CTA using a reduced set of topic-specific labels
    Advantages:
      1. saves token space for large vocabularies
      2. simplifies the annotation task as the model chooses from a smaller set of labels
    Step 1: Table topic prediction prompt → ChatGPT model → answer
    Step 2: CTA prompt with topic-specific labels → ChatGPT model → answer
  • 51. Step 1: Table Topic Prediction
    SYSTEM: Your task is to classify if a table describes restaurants, events, music recordings or hotels.
    SYSTEM: Your instructions are:
      1. Look at the input given to you and make a table out of it.
      2. Look at the cell values in detail.
      3. Decide if the table describes a Restaurant, Event, Music Recording or Hotel.
      4. Answer with Restaurant, Event, Music Recording or Hotel.
    USER: Classify this table: Column 1 || Column 2 || Column 3 || Column 4 \n Friends Pizza || 2525|| Cash Visa MasterCard || 7:30 AM \n …
    ASSISTANT: Restaurant
  • 52. Step 2: Column Type Annotation
    – the first system message contains only the relevant subset of all labels
    – e.g. only 11 out of 32 labels belonging to the "Restaurant" topic
    SYSTEM: Your task is to classify the columns of a given table with only one of the following classes that are separated with comma: {relevant subset of all labels}
    SYSTEM: Your instructions are:
      1. Look at the input given to you and make a table out of it.
      2. Look at the cell values in detail.
      3. For each column, select a class …
    USER: Classify these table columns: Column 1 || Column 2 || Column 3 || Column 4 \n Friends Pizza || 2525|| Cash Visa MasterCard || 7:30 AM \n …
    ASSISTANT: Column1: RestaurantName \n Column2: PostalCode \n Column3: PaymentAccepted \n Column4: Time
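The two steps can be chained as a small pipeline: the predicted topic selects the label subset used in the second prompt. The sketch below is an illustration under stated assumptions: `ask_model` stands in for any function that sends a prompt to a language model and returns its reply, the label subsets in `LABELS_BY_TOPIC` are hypothetical and incomplete, and the prompt wording is shortened.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical, incomplete topic-to-label mapping for illustration only.
LABELS_BY_TOPIC: Dict[str, List[str]] = {
    "Restaurant": ["RestaurantName", "PostalCode", "PaymentAccepted", "Time"],
    "Event": ["EventName", "Date", "EventStatusType"],
}


def two_step_cta(table_text: str, ask_model: Callable[[str], str]) -> Tuple[str, str]:
    """Step 1 predicts the table topic; step 2 annotates with topic-specific labels."""
    topics = " or ".join(LABELS_BY_TOPIC)
    topic = ask_model(f"Decide if the table describes a {topics}.\n{table_text}").strip()
    labels = LABELS_BY_TOPIC.get(topic)
    if labels is None:
        raise ValueError(f"unexpected topic: {topic!r}")
    cta_prompt = (
        "Classify the columns of the table with only one of the following classes: "
        + ", ".join(labels) + "\n" + table_text
    )
    return topic, ask_model(cta_prompt)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Restaurant" if prompt.startswith("Decide") else "Column1: RestaurantName"


topic, answer = two_step_cta("Column 1\nFriends Pizza", fake_model)
print(topic, "|", answer)
```

A wrong topic in step 1 removes the correct labels from step 2, so the first step needs to be reliable for the pipeline to pay off.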
  • 53. Results: Two-Step Approach
    F1                    GPT03   GPT4    Stable Beluga2   Falcon40B
    Column                45.85   86.31   75.55            21.42
    Table                 37.90   94.19   20.63            -
    Column+instructions   78.61   92.36   74.84            11.67
    Table+instructions    85.25   95.14   53.82            2.7
    Two-step Pipeline     89.47   94.95   31.57            -
    – The two-step approach helps GPT03 to handle the label space
    – GPT4 does not require additional guidance for SOTAB Small
  • 54. 3. Conclusions
    1. GPT-based methods require less task-specific training data
       – high zero-shot performance of GPT4
    2. GPT-based methods are more robust to unseen entities
       – important for Web use cases that often involve unseen entities
    3. BERT-based methods are cheaper to run
       – no API usage fees, fewer GPU resources required
    4. GPT's ability to generate explanations might increase users' trust in the integration results
  • 55. Staying Up To Date
    – Papers with Code collects results for all discussed benchmarks
      https://paperswithcode.com/task/entity-resolution/
      https://paperswithcode.com/task/table-annotation/
  • 56. Thank you.
    Email: christian.bizer@uni-mannheim.de
    Web: https://www.uni-mannheim.de/dws/people/professors/prof-dr-christian-bizer/