Data and Web Science Group
Using the Semantic Web as Training Data for Product Matching
Prof. Dr. Christian Bizer
OpenKG Forum, Nov. 25, 2019, Hangzhou, China
Product Matching
– Does an offer on one website refer to the same product as another offer on a different website?
– Core challenge in e‐commerce
– Necessary for building price comparison portals
– Necessary for building product knowledge graphs
Christian Bizer: Using the Semantic Web as Training Data. Open KG Forum, Hangzhou, 2019.11.25 2
Why is the task hard?
– For marketing reasons, merchants present the same product differently
– Heterogeneous product title
– Heterogeneous product description
– Heterogeneous specification tables
– Heterogeneous categorization
State of the Art: Product Matching
Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
Dong: ML for Entity Linkage. DI&ML tutorial at SIGMOD 2018. 
– Deep Learning 
– combining embeddings and RNNs
– Large Training Sets >100K examples
– owned by large companies such as Walmart, Amazon, Alibaba
– Matching Performance: F1 >90%
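For reference, F1 is the harmonic mean of precision and recall over the matcher's pair-wise decisions. A minimal sketch follows; the counts of true positives, false positives, and false negatives are hypothetical and are not taken from the talk.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for pair-wise matching decisions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(p, r, f1)
```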
Question
Can we achieve the same matching 
performance using the Semantic Web 
as a source of training data?
Schema.org Annotations
– Search engines have asked site owners since 2011 to annotate their data in order to enrich search results
– 675 types: Event, Place, Local Business, Product, Review, Person
– Encoding: Microdata, RDFa, JSON‐LD
Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.25
Annotation Example 
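The example on this slide is an image and is not part of the transcript. As a stand-in, the sketch below embeds a hypothetical schema.org Product annotation in JSON‐LD and extracts it with the Python standard library. The product name, GTIN, and price are invented for illustration, and the regular expression is a simplification that a real extractor would replace with an HTML parser.

```python
import json
import re

# Hypothetical product page; all values are invented for illustration.
HTML = """
<html><body>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Phone 64GB black",
  "gtin13": "0000000000000",
  "offers": {"@type": "Offer", "price": "299.00", "priceCurrency": "EUR"}
}
</script>
</body></html>
"""


def extract_json_ld(html: str) -> list:
    """Return all JSON-LD objects embedded in <script> tags of an HTML page."""
    pattern = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    return [json.loads(block) for block in pattern.findall(html)]


# Keep only the objects typed as schema.org Product.
products = [obj for obj in extract_json_ld(HTML) if obj.get("@type") == "Product"]
product = products[0]
print(product["name"], product["gtin13"], product["offers"]["price"])
```

The identifier property (here `gtin13`) is the part that matters for the rest of the talk: it is what allows offers from different shops to be grouped.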
Web Data Commons – Structured Data
– extracts all Microformat, Microdata, RDFa, JSON‐LD data from the Common Crawl
– analyzes and provides the extracted data for download
– statistics of some extraction runs
  – 2018 CC Corpus: 2.5 billion HTML pages → 31.5 billion RDF triples
  – 2017 CC Corpus: 3.1 billion HTML pages → 38.2 billion RDF triples
  – 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
  – 2010 CC Corpus: 2.8 billion HTML pages → 5.1 billion RDF triples
– Download
  – http://webdatacommons.org/structureddata/
Language and Top‐Level‐Domain Distribution of Common Crawl
– Languages: English 44%, Chinese 8%, Russian 7%, German 5%, Japanese 5%, French 4%, Spanish 4%, Other 23%
– Top‐level domains: com 52%, org 6%, net 4%, uk 2%, de 4%, jp 2%, ru 6%, cn 1%, other 23%
Overall Adoption 2018
http://webdatacommons.org/structureddata/2018‐12/
944 million HTML pages out of the 2.5 billion pages
provide semantic annotations (37.1%).
9.6 million pay-level-domains (PLDs) out of the
32.8 million pay-level-domains covered by the crawl
provide semantic annotations (29.3%).
Frequently used Schema.org Classes
Top Classes               # Websites (PLDs)
                          Microdata    JSON‐LD
schema:WebPage            1,124,583    121,393
schema:Product            812,205      40,169
schema:Offer              676,899      57,756
schema:BreadcrumbList     621,344      205,971
schema:Article            612,361      57,082
schema:Organization       510,069      1,349,775
schema:PostalAddress      502,615      176,500
schema:ImageObject        360,875      111,946
schema:Blog               337,843      12,174
schema:Person             324,349      335,784
schema:LocalBusiness      294,390      249,017
schema:AggregateRating    258,078      23,105
schema:Review             124,022      6,622
schema:Place              92,127       66,396
schema:Event              88,130       63,605
http://webdatacommons.org/structureddata/2018‐12/ 
Attributes used to Describe Products
Top Attributes                    PLDs Microdata
                                  #          %
schema:Product/name               754,812    92 %
schema:Product/offers             645,994    79 %
schema:Offer/price                639,598    78 %
schema:Offer/priceCurrency        606,990    74 %
schema:Product/image              573,614    70 %
schema:Product/description        520,307    64 %
schema:Offer/availability         477,170    58 %
schema:Product/url                364,889    44 %
schema:Product/sku                160,343    19 %
schema:Product/aggregateRating    141,194    17 %
schema:Product/brand              113,209    13 %
schema:Product/category           62,170     7 %
schema:Product/productID          47,088     5 %
…                                 …          …
http://webdatacommons.org/structureddata/2018‐12/stats/html‐md.xlsx
Using Product ID Annotations
as Supervision for Product Matching
– Some e‐shops annotate product IDs
– Most e‐shops do not
Example offer:
The Samsung Galaxy S4 is the entertaining and helpful companion for your mobile life. It connects you with your loved ones. It lets you experience and capture unforgettable moments together. It simplifies your everyday life.
UPC 610214632623
000214632623
Properties                    PLDs
                              #          %
schema:Product/name           754,812    92 %
schema:Product/description    520,307    64 %
schema:Product/sku            160,343    19 %
schema:Product/productID      47,088     5 %
schema:Product/mpn            12,882     1.6 %
schema:Product/gtin13         7,994      1 %
Learn How to Match Products using
Schema.org Data as Supervision 
[Figure: clusters of offers having the same product ID are used to learn a matcher. The matcher then decides for unseen offers without product IDs whether they refer to the same product.]
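The supervision idea on this slide can be sketched as follows: offers that share a product ID become positive training pairs, and offers from different ID clusters become negative pairs. The clusters, offer titles, and the random choice of negatives are assumptions made for illustration; the slide does not say how negatives were selected for the published training sets.

```python
import itertools
import random

# Hypothetical clusters: product ID -> titles of offers annotated with that ID.
clusters = {
    "id-A": ["Example Phone 64GB black", "ExamplePhone 64 GB (black)", "Phone Example 64GB"],
    "id-B": ["Example Camera 24MP body", "Camera Example 24 MP"],
    "id-C": ["Example Watch steel 40mm"],
}


def build_pairs(clusters, negatives_per_offer=1, seed=0):
    """Create (offer_a, offer_b, label) pairs from ID clusters.

    Pairs from the same cluster are labelled 1 (match), pairs from
    different clusters are labelled 0 (non-match).
    """
    rng = random.Random(seed)
    pairs = []
    for cluster_id, offers in clusters.items():
        # Positives: every pair of offers inside one cluster.
        for a, b in itertools.combinations(offers, 2):
            pairs.append((a, b, 1))
        # Negatives: pair each offer with offers sampled from other clusters.
        others = [
            offer
            for other_id, other_offers in clusters.items()
            if other_id != cluster_id
            for offer in other_offers
        ]
        for a in offers:
            for b in rng.sample(others, min(negatives_per_offer, len(others))):
                pairs.append((a, b, 0))
    return pairs


pairs = build_pairs(clusters)
print(len(pairs), sum(label for _, _, label in pairs))
```

A matcher trained on such pairs never sees the IDs themselves, only the textual attributes, which is what lets it generalize to offers without ID annotations.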
Data Cleansing
for Building the Clusters
– Filtering of product offers with annotated product identifiers: 121M offers out of 812M
– Removal of listing pages: 58M offers
– Filtering by identifier value length: 26M offers
– Cluster creation based on identifier value co‐occurrence: 16.4M clusters
– Splitting of wrong clusters caused by category IDs: 16.6M clusters
Resulting corpus:
– All languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
– English offers: 16M offers, 43K distinct websites, 10M clusters (products)
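Two of the cleansing steps, filtering by identifier value length and clustering by identifier value co‐occurrence, can be sketched with a union‐find structure. This is only an illustration under stated assumptions: the offers, the identifier values, and the length thresholds `MIN_LEN` and `MAX_LEN` are hypothetical, and the later step that splits clusters caused by category IDs is not covered.

```python
from collections import defaultdict

# Hypothetical offers: offer key -> set of annotated identifier values.
offers = {
    "shop1/o1": {"610214632623", "SM-G9500001"},
    "shop2/o7": {"610214632623"},
    "shop3/o2": {"SM-G9500001", "12"},  # "12" is too short to identify a product
    "shop4/o9": {"4000000000001"},
    "shop5/o3": {"12"},
}

MIN_LEN, MAX_LEN = 8, 25  # assumed thresholds, not taken from the talk


def cluster_by_identifier(offers):
    """Group offers that share at least one plausible identifier value."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # identifier value -> first offer seen with that value
    for offer, ids in offers.items():
        kept = {i for i in ids if MIN_LEN <= len(i) <= MAX_LEN}
        if not kept:
            continue  # no usable identifier, offer is dropped
        parent.setdefault(offer, offer)
        for ident in kept:
            if ident in first_seen:
                union(offer, first_seen[ident])
            else:
                first_seen[ident] = offer

    groups = defaultdict(set)
    for offer in parent:
        groups[find(offer)].add(offer)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_by_identifier(offers)
print(clusters)
```

In this toy input, the offers from shop2 and shop3 share no identifier with each other but end up in one cluster because each shares a value with the shop1 offer.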
Cluster Size Distribution by Category
A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large‐Scale Product Matching. 
ECNLP 2019 Workshop @ WWW2019.
Pre‐Assembled Training Sets
http://webdatacommons.org/largescaleproductcorpus/v2/
– Four categories: computers, cameras, watches, shoes
– Four sizes: small, medium, large, xlarge
– 9,000 to 214,000 examples
– 93.4% of the pairs are correct (evaluation sample: 900 pairs)
– Statistics about xlarge training set
Gold Standard
– Mixture of random and difficult borderline pairs.
– All pairs are manually verified.
http://webdatacommons.org/largescaleproductcorpus/v2/
Comparison to Existing Benchmark 
Datasets
Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
Not public 
Our Datasets
Results: Traditional Learning Methods
Magellan: xlarge set
Category          Method          Features                         Precision  Recall  F1
Computers         XGBoost         title‐description‐brand+specs    0.74       0.55    0.62
Cameras           XGBoost         title‐description‐brand+specs    0.72       0.58    0.64
Watches           XGBoost         title‐description‐brand+specs    0.76       0.50    0.60
Shoes             RandomForest    title‐description‐brand+specs    0.74       0.51    0.60
All categories    RandomForest    title‐description‐brand+specs    0.48       0.77    0.59

Word Co‐Occurrence: xlarge set
Category          Method          Features                         Precision  Recall  F1
Computers         LinearSVC       title‐description‐brand+specs    0.86       0.80    0.83
Cameras           LinearSVC       title‐description‐brand+specs    0.83       0.65    0.73
Watches           LogisticReg     title‐description‐brand+specs    0.87       0.68    0.77
Shoes             LogisticReg     title                            0.87       0.61    0.72
All categories    RandomForest    title‐description‐brand+specs    0.85       0.65    0.75
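One plausible reading of the word co‐occurrence features is a binary vector per offer pair, with one entry per vocabulary word that is set when the word occurs in both offers. The sketch below implements that reading; the whitespace tokenization, the example titles, and the exact feature definition are assumptions, since the slide only names the method.

```python
def tokenize(text):
    """Lower-case whitespace tokenization; returns the set of distinct words."""
    return set(text.lower().split())


def cooccurrence_vector(offer_a, offer_b, vocabulary):
    """Return 1 for each vocabulary word that occurs in both offers, else 0."""
    shared = tokenize(offer_a) & tokenize(offer_b)
    return [1 if word in shared else 0 for word in vocabulary]


# Hypothetical offer title pairs, invented for illustration.
pairs = [
    ("Example Phone 64GB black", "example phone 64gb"),
    ("Example Phone 64GB black", "Example Camera 24MP"),
]

# The vocabulary is the sorted set of all words seen in any offer.
vocabulary = sorted(set().union(*(tokenize(a) | tokenize(b) for a, b in pairs)))
vectors = [cooccurrence_vector(a, b, vocabulary) for a, b in pairs]
print(vocabulary)
print(vectors)
```

Vectors of this kind can be passed to any standard classifier, such as the linear SVM, logistic regression, or random forest models listed in the table.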
DeepMatcher Framework
‐ attribute embedding: word‐based or character‐based
‐ embedding training: pre‐trained (word2vec, GloVe, fastText) or learned
‐ attribute summarization: SIF, RNN, Attention, Hybrid
‐ attribute comparison: fixed distance, learnable distance 
‐ classification: fully connected neural net
Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
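The stages listed above (embedding, attribute summarization, attribute comparison, classification) can be illustrated with a toy pipeline in plain Python. This is not the DeepMatcher library or its API: the two‐dimensional embeddings are invented, summarization is a plain average rather than SIF, an RNN, or attention, comparison is a fixed element‐wise absolute difference, and the classifier is a single logistic unit with hand‐set weights instead of a trained neural net.

```python
import math

# Toy 2-dimensional word embeddings; values are invented for illustration.
EMBEDDINGS = {
    "phone": [1.0, 0.0],
    "camera": [0.0, 1.0],
    "black": [0.5, 0.5],
}
UNKNOWN = [0.0, 0.0]


def summarize(title):
    """Attribute summarization: average the word vectors of an attribute value."""
    vectors = [EMBEDDINGS.get(word, UNKNOWN) for word in title.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def compare(u, v):
    """Attribute comparison with a fixed distance: element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(u, v)]


def classify(diff, weights=(-4.0, -4.0), bias=2.0):
    """Classification: one logistic unit with hand-set (not learned) weights."""
    z = bias + sum(w * d for w, d in zip(weights, diff))
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(title_a, title_b):
    """Run the summarize -> compare -> classify stages on two titles."""
    return classify(compare(summarize(title_a), summarize(title_b)))


same = match_probability("black phone", "phone black")
different = match_probability("black phone", "black camera")
print(round(same, 3), round(different, 3))
```

In the real framework, every stage has learnable parameters that are trained jointly on labelled pairs; the sketch only shows how the stages connect.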
Results: DeepMatcher
– Near human‐level matching performance just using web data.
– The 6.6% errors in the training data are averaged out in the learning.
Category          Features                   Precision  Recall  F1
Computers         title‐description‐brand    0.91       0.97    0.95
Cameras           title                      0.94       0.93    0.92
Watches           title                      0.95       0.97    0.96
Shoes             title                      0.96       0.99    0.95
All categories    title‐description‐brand    0.91       0.93    0.92
Learning Curves
Deep Matcher vs. Baselines
– Results get really good starting at 100K training examples
– Gap Deep Matcher vs. Random Forest @200K: 17% F1
Learning Curves:
Deep Matcher Configurations
– RNN, fastText embeddings pre‐trained on Wikipedia
– end2end training which adjusts embeddings: +2‐3% F1
Back to Our Question
Can we achieve the same matching performance using 
the Semantic Web instead of manually labeled training data?
Answer: Yes
Implications:
– Potential to save money on building matchers
– Potential to save money on maintaining matchers
Thank you.
– Paper:
Bizer, Primpeli, Peeters: Using the Semantic Web as a source of training data. 
Datenbank‐Spektrum, 19, 127‐135, 2019.
– Training Data and Gold Standard
http://webdatacommons.org/largescaleproductcorpus/v2/
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 

Using the Semantic Web as Training Data for Product Matching

  • 1. Data and Web Science Group. Using the Semantic Web as Training Data for Product Matching. Prof. Dr. Christian Bizer, OpenKG Forum, Nov. 25, 2019, Hangzhou, China
  • 2. Product Matching
      – Does an offer on one website refer to the same product as another offer on a different website?
      – Core Challenge in E-Commerce
      – Necessary for building price comparison portals
      – Necessary for building product knowledge graphs
      Christian Bizer: Using the Semantic Web as Training Data. Open KG Forum, Hangzhou, 2019.11.25
  • 3. Why is the task hard?
      – For marketing reasons, merchants present the same product differently
      – Heterogeneous product title
      – Heterogeneous product description
      – Heterogeneous specification tables
      – Heterogeneous categorization
  • 4. State of the Art: Product Matching
      – Deep Learning
          – combining embeddings and RNNs
      – Large Training Sets >100K examples
          – owned by large companies such as Walmart, Amazon, Alibaba
      – Matching Performance: F1 >90%
      Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
      Dong: ML for Entity Linkage. DI&ML tutorial at SIGMOD 2018.
  • 5. Question
      Can we achieve the same matching performance using the Semantic Web as a source of training data?
  • 6. Schema.org Annotations
      – Search engines have asked site owners since 2011 to annotate data for enriching search results
      – 675 Types: Event, Place, Local Business, Product, Review, Person
      – Encoding: Microdata, RDFa, JSON-LD
      Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.25
  • 7. Annotation Example
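The annotation example on this slide did not survive extraction. As a rough illustration of what a Schema.org product annotation carries, the sketch below parses a JSON-LD snippet and collects the identifier properties mentioned later in the deck (`sku`, `productID`, `mpn`, `gtin13`) plus `gtin12`. The snippet, the `sku` value, and the helper `extract_product_ids` are made up for illustration; only the property names come from Schema.org.

```python
import json

# Identifier properties of schema:Product considered in this sketch.
ID_PROPERTIES = ("sku", "productID", "mpn", "gtin12", "gtin13")

def extract_product_ids(jsonld_text):
    """Return the identifier properties found on Product items in a JSON-LD string."""
    data = json.loads(jsonld_text)
    items = data if isinstance(data, list) else [data]
    found = {}
    for item in items:
        if item.get("@type") != "Product":
            continue  # ignore other types such as Organization or WebPage
        for prop in ID_PROPERTIES:
            if prop in item:
                found[prop] = str(item[prop])
    return found

# Hypothetical annotation, as it might appear in a page's JSON-LD script element.
snippet = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Samsung Galaxy S4",
  "sku": "EXAMPLE-SKU-1",
  "gtin12": "610214632623",
  "offers": {"@type": "Offer", "price": "299.00", "priceCurrency": "EUR"}
}
"""

ids = extract_product_ids(snippet)
```

Offers whose annotations yield at least one such identifier are the ones that can later be grouped into clusters.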
  • 8. Web Data Commons – Structured Data
      – extracts all Microformat, Microdata, RDFa, JSON-LD data from the Common Crawl
      – analyzes and provides the extracted data for download
      – statistics of some extraction runs
          – 2018 CC Corpus: 2.5 billion HTML pages, 31.5 billion RDF triples
          – 2017 CC Corpus: 3.1 billion HTML pages, 38.2 billion RDF triples
          – 2014 CC Corpus: 2.0 billion HTML pages, 20.4 billion RDF triples
          – 2010 CC Corpus: 2.8 billion HTML pages, 5.1 billion RDF triples
      – Download: http://webdatacommons.org/structureddata/
  • 9. Language and Top-Level-Domain Distribution of Common Crawl
      – Language: English 44%, Chinese 8%, Russian 7%, German 5%, Japanese 5%, French 4%, Spanish 4%, Other 23%
      – Top-level domain: com 52%, org 6%, ru 6%, net 4%, de 4%, uk 2%, jp 2%, cn 1%, other 23%
  • 10. Overall Adoption 2018
      – 944 million HTML pages out of the 2.5 billion pages provide semantic annotations (37.1%).
      – 9.6 million pay-level-domains (PLDs) out of the 32.8 million pay-level-domains covered by the crawl provide semantic annotations (29.3%).
      http://webdatacommons.org/structureddata/2018-12/
  • 11. Frequently used Schema.org Classes
      Top Classes               # Websites (PLDs)
                                Microdata    JSON-LD
      schema:WebPage            1,124,583    121,393
      schema:Product              812,205     40,169
      schema:Offer                676,899     57,756
      schema:BreadcrumbList       621,344    205,971
      schema:Article              612,361     57,082
      schema:Organization         510,069  1,349,775
      schema:PostalAddress        502,615    176,500
      schema:ImageObject          360,875    111,946
      schema:Blog                 337,843     12,174
      schema:Person               324,349    335,784
      schema:LocalBusiness        294,390    249,017
      schema:AggregateRating      258,078     23,105
      schema:Review               124,022      6,622
      schema:Place                 92,127     66,396
      schema:Event                 88,130     63,605
      http://webdatacommons.org/structureddata/2018-12/
  • 12. Attributes used to Describe Products
      Top Attributes                    PLDs (Microdata)
                                        #          %
      schema:Product/name               754,812    92%
      schema:Product/offers             645,994    79%
      schema:Offer/price                639,598    78%
      schema:Offer/priceCurrency        606,990    74%
      schema:Product/image              573,614    70%
      schema:Product/description        520,307    64%
      schema:Offer/availability         477,170    58%
      schema:Product/url                364,889    44%
      schema:Product/sku                160,343    19%
      schema:Product/aggregateRating    141,194    17%
      schema:Product/brand              113,209    13%
      schema:Product/category            62,170     7%
      schema:Product/productID           47,088     5%
      …                                 …          …
      http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
      Example description (translated from German): "The Samsung Galaxy S4 is the entertaining and helpful companion for your mobile life. It connects you with your loved ones. It lets you experience and capture unforgettable moments together. It simplifies your everyday life." UPC 610214632623 000214632623
  • 13. Using Product ID Annotations as Supervision for Product Matching
      – Some e-shops annotate product IDs
      – Most e-shops do not
      Properties                    PLDs
                                    #          %
      schema:Product/name           754,812    92%
      schema:Product/description    520,307    64%
      schema:Product/sku            160,343    19%
      schema:Product/productID       47,088    5%
      schema:Product/mpn             12,882    1.6%
      schema:Product/gtin13           7,994    1%
  • 14. Learn How to Match Products using Schema.org Data as Supervision
      – Clusters of offers having the same product ID: used to learn the matcher
      – Unseen offers without product IDs: passed to the matcher, which decides "Same product?"
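The idea on this slide can be sketched as turning clusters into labeled training pairs: two offers from the same cluster form a matching pair, two offers from different clusters form a non-matching pair. The function `build_pairs` and the sample titles are hypothetical, and pairing every offer with every other is a simplifying assumption; the slides do not say how pairs are actually selected.

```python
from itertools import combinations

def build_pairs(clusters):
    """Derive (offer_a, offer_b, label) triples from ID-based clusters.

    clusters maps a cluster id to the list of offer titles in that cluster.
    label is 1 for a pair from the same cluster and 0 otherwise.
    """
    pairs = []
    # Positive pairs: every two offers that share a cluster.
    for offers in clusters.values():
        for a, b in combinations(offers, 2):
            pairs.append((a, b, 1))
    # Negative pairs: every two offers taken from two different clusters.
    for (_, offers_a), (_, offers_b) in combinations(clusters.items(), 2):
        for a in offers_a:
            for b in offers_b:
                pairs.append((a, b, 0))
    return pairs

# Made-up clusters for illustration.
clusters = {
    "p1": ["Samsung Galaxy S4 16GB", "Galaxy S4 by Samsung, black"],
    "p2": ["Canon EOS 700D body"],
}
pairs = build_pairs(clusters)
```

On real data the number of negative pairs grows quadratically, so some form of sampling would be needed; that choice is left open here.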
  • 15. Data Cleansing for Building the Clusters
      – Filtering of product offers with annotated product identifiers: 121M offers out of 812M
      – Removal of listing pages: 58M offers
      – Filtering by identifier value length: 26M offers
      – Cluster creation based on identifier value co-occurrence: 16.4M clusters
      – Split wrong clusters due to category IDs: 16.6M clusters
      – All Languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
      – English Offers: 16M offers, 43K distinct websites, 10M clusters (products)
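Two of the steps above, filtering by identifier value length and cluster creation based on identifier value co-occurrence, can be sketched with a union-find structure: offers that share any identifier value end up in the same cluster. The function `cluster_offers`, the threshold `min_len=8`, and the sample data are assumptions for illustration; the slides give neither the actual length threshold nor the implementation.

```python
from collections import defaultdict

def cluster_offers(offers, min_len=8):
    """Group offers that share at least one identifier value.

    offers maps an offer id to a list of identifier strings.
    Identifier values shorter than min_len are ignored (assumed threshold).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_offer_with = {}  # identifier value -> first offer seen carrying it
    for offer_id, identifiers in offers.items():
        kept = [value for value in identifiers if len(value) >= min_len]
        if not kept:
            continue  # offer has no usable identifier
        find(offer_id)  # register the offer as its own cluster
        for value in kept:
            if value in first_offer_with:
                union(first_offer_with[value], offer_id)
            else:
                first_offer_with[value] = offer_id

    clusters = defaultdict(set)
    for offer_id in parent:
        clusters[find(offer_id)].add(offer_id)
    return sorted(clusters.values(), key=sorted)

# Made-up offers: a, b and c are linked through two shared identifier values.
offers = {
    "a": ["610214632623"],
    "b": ["610214632623", "000214632623"],
    "c": ["000214632623"],
    "d": ["123"],           # too short, dropped
    "e": ["999999999999"],  # singleton cluster
}
result = cluster_offers(offers)
```

Because clusters are merged transitively, a single identifier shared by unrelated products (such as a category ID) would merge them, which is why the pipeline includes a later step that splits wrong clusters.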
  • 16. Cluster Size Distribution by Category
      A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. ECNLP 2019 Workshop @ WWW2019.
  • 17. Pre-Assembled Training Sets
      – Four categories: computers, cameras, watches, shoes
      – Four sizes: small, medium, large, xlarge
      – 9,000 to 214,000 examples
      – 93.4% of the pairs are correct (evaluation sample: 900 pairs)
      – Statistics about xlarge training set
      http://webdatacommons.org/largescaleproductcorpus/v2/
  • 18. Gold Standard
      – Mixture of random and difficult borderline pairs.
      – All pairs are manually verified.
      http://webdatacommons.org/largescaleproductcorpus/v2/
  • 19. Comparison to Existing Benchmark Datasets
      [Comparison table not recoverable; surviving labels: "Not public", "Our Datasets", "1,200"]
      Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
  • 20. Results: Traditional Learning Methods
      Magellan: xlarge set
      Category        Method        Features                        Precision  Recall  F1
      Computers       XGBoost       title-description-brand+specs   0.74       0.55    0.62
      Cameras         XGBoost       title-description-brand+specs   0.72       0.58    0.64
      Watches         XGBoost       title-description-brand+specs   0.76       0.50    0.60
      Shoes           RandomForest  title-description-brand+specs   0.74       0.51    0.60
      All categories  RandomForest  title-description-brand+specs   0.48       0.77    0.59
      Word Co-Occurrence: xlarge set
      Category        Method        Features                        Precision  Recall  F1
      Computers       LinearSVC     title-description-brand+specs   0.86       0.80    0.83
      Cameras         LinearSVC     title-description-brand+specs   0.83       0.65    0.73
      Watches         LogisticReg   title-description-brand+specs   0.87       0.68    0.77
      Shoes           LogisticReg   title                           0.87       0.61    0.72
      All categories  RandomForest  title-description-brand+specs   0.85       0.65    0.75
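The slide does not spell out how the word co-occurrence features are built. One plausible reading, sketched below as an assumption, is a binary vector over a vocabulary where a position is 1 if the word occurs in both offers of a pair. The helpers `tokenize` and `cooccurrence_features` and the sample titles are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace into a set of words."""
    return set(text.lower().split())

def cooccurrence_features(offer_a, offer_b, vocabulary):
    """Binary vector: 1 where a vocabulary word appears in both offers."""
    shared = tokenize(offer_a) & tokenize(offer_b)
    return [1 if word in shared else 0 for word in vocabulary]

# Made-up offer titles for illustration.
title_a = "Samsung Galaxy S4 16GB black"
title_b = "samsung galaxy s4 smartphone 16gb"

# In practice the vocabulary would come from the whole training set.
vocabulary = sorted(tokenize(title_a) | tokenize(title_b))
features = cooccurrence_features(title_a, title_b, vocabulary)
```

A vector like this, computed for every training pair, would then be the input to one of the classifiers named in the table, such as a linear SVM or logistic regression.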
  • 21. DeepMatcher Framework
      – word-based, character-based
      – pre-trained (word2vec, GloVe, fastText), learned
      – attribute summarization: SIF, RNN, Attention, Hybrid
      – attribute comparison: fixed distance, learnable distance
      – classification: fully connected neural net
      Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
  • 22. Results: DeepMatcher
      – Near human-level matching performance just using web data.
      – The 6.6% errors in the training data are averaged out in the learning.
      Category        Features                  Precision  Recall  F1
      Computers       title-description-brand   0.91       0.97    0.95
      Cameras         title                     0.94       0.93    0.92
      Watches         title                     0.95       0.97    0.96
      Shoes           title                     0.96       0.99    0.95
      All categories  title-description-brand   0.91       0.93    0.92
  • 23. Learning Curves: DeepMatcher vs. Baselines
      – Results get really good starting at 100K training examples
      – Gap between DeepMatcher and Random Forest at 200K examples: 17% F1
  • 24. Learning Curves: DeepMatcher Configurations
      – RNN, fastText embeddings pre-trained on Wikipedia
      – End-to-end training that adjusts the embeddings: +2-3% F1
  • 25. Back to Our Question
      Can we achieve the same matching performance using the Semantic Web instead of manually labeled training data?
      Answer: Yes
      Implications:
      – Potential to save money on building matchers
      – Potential to save money on maintaining matchers
  • 26. Thank you.
      – Paper: Bizer, Primpeli, Peeters: Using the Semantic Web as a source of training data. Datenbank-Spektrum, 19, 127-135, 2019.
      – Training Data and Gold Standard: http://webdatacommons.org/largescaleproductcorpus/v2/