Using the Semantic Web as Training Data for Product Matching

Data and Web Science Group
Using the Semantic Web as Training Data for Product Matching
Prof. Dr. Christian Bizer
OpenKG Forum, Nov. 25, 2019, Hangzhou, China

Product Matching
– Does an offer on one website refer to the same product as another offer on a different website?
– Core challenge in e‐commerce
– Necessary for building price comparison portals
– Necessary for building product knowledge graphs
Christian Bizer: Using the Semantic Web as Training Data. Open KG Forum, Hangzhou, 2019.11.25

Why is the task hard?
– For marketing reasons, merchants present the same product differently
– Heterogeneous product titles
– Heterogeneous product descriptions
– Heterogeneous specification tables
– Heterogeneous categorization

State of the Art: Product Matching
Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
Dong: ML for Entity Linkage. DI&ML tutorial at SIGMOD 2018.
– Deep learning, combining embeddings and RNNs
– Large training sets (>100K examples), owned by large companies such as Walmart, Amazon, Alibaba
– Matching performance: F1 > 90%

Question
Can we achieve the same matching performance using the Semantic Web as a source of training data?

Schema.org Annotations
– Search engines have asked site owners since 2011 to annotate data for enriching search results
– 675 types, e.g. Event, Place, LocalBusiness, Product, Review, Person
– Encoding: Microdata, RDFa, JSON‐LD
Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.25

Annotation Example
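
A schema.org product annotation in JSON‐LD could look like the following minimal sketch. The product values and the helper name `extract_offer` are made up for illustration; only the property names (`name`, `sku`, `productID`, `mpn`, `gtin13`, `offers`, `price`, `priceCurrency`) come from the schema.org vocabulary.

```python
import json

# Hypothetical JSON-LD snippet as it could be embedded in a shop's HTML page.
ANNOTATION = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Phone X 16GB black",
  "sku": "EPX-16-BLK",
  "gtin13": "4006381333931",
  "offers": {
    "@type": "Offer",
    "price": "299.00",
    "priceCurrency": "EUR"
  }
}
"""

# Identifier properties that can serve as supervision for matching.
ID_PROPERTIES = ("sku", "productID", "mpn", "gtin13")


def extract_offer(jsonld_text):
    """Return name, price and annotated identifiers of a single Product.

    Simplification: assumes one top-level Product with a single Offer object;
    real pages may use lists or nested graphs.
    """
    data = json.loads(jsonld_text)
    if data.get("@type") != "Product":
        return None
    offer = data.get("offers", {})
    return {
        "name": data.get("name"),
        "price": offer.get("price"),
        "currency": offer.get("priceCurrency"),
        "identifiers": {p: data[p] for p in ID_PROPERTIES if p in data},
    }


print(extract_offer(ANNOTATION))
```

The `identifiers` entry is the part that matters for the rest of the talk: offers from different shops that carry the same identifier value can be grouped without manual labeling.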

Web Data Commons – Structured Data
– extracts all Microformat, Microdata, RDFa, JSON‐LD data from the Common Crawl
– analyzes and provides the extracted data for download
– statistics of some extraction runs
– 2018 CC Corpus: 2.5 billion HTML pages → 31.5 billion RDF triples
– Download: http://webdatacommons.org/structureddata/

Language and Top‐Level‐Domain Distribution of Common Crawl
– Language: English 44%, Chinese 8%, Russian 7%, German 5%, Japanese 5%, French 4%, Spanish 4%, other 23%
– Top‐level domain: com 52%, org 6%, net 4%, uk 2%, de 4%, jp 2%, ru 6%, cn 1%, other 23%

Overall Adoption 2018
– 944 million HTML pages out of the 2.5 billion pages provide semantic annotations (37.1%).
– 9.6 million pay‐level‐domains (PLDs) out of the 32.8 million pay‐level‐domains covered by the crawl provide semantic annotations (29.3%).
http://webdatacommons.org/structureddata/2018‐12/

Frequently used Schema.org Classes
Number of websites (PLDs) using each class, by encoding:

Top Classes              Microdata      JSON‐LD
schema:WebPage           1,124,583      121,393
schema:Product             812,205       40,169
schema:Offer               676,899       57,756
schema:BreadcrumbList      621,344      205,971
schema:Article             612,361       57,082
schema:Organization        510,069    1,349,775
schema:PostalAddress       502,615      176,500
schema:ImageObject         360,875      111,946
schema:Blog                337,843       12,174
schema:Person              324,349      335,784
schema:LocalBusiness       294,390      249,017
schema:AggregateRating     258,078       23,105
schema:Review              124,022        6,622
schema:Place                92,127       66,396
schema:Event                88,130       63,605

http://webdatacommons.org/structureddata/2018‐12/

Attributes used to Describe Products
Number and share of PLDs (Microdata) using each attribute:

Top Attributes                       # PLDs     %
schema:Product/name                 754,812    92%
schema:Product/offers               645,994    79%
schema:Offer/price                  639,598    78%
schema:Offer/priceCurrency          606,990    74%
schema:Product/image                573,614    70%
schema:Product/description          520,307    64%
schema:Offer/availability           477,170    58%
schema:Product/url                  364,889    44%
schema:Product/sku                  160,343    19%
schema:Product/aggregateRating      141,194    17%
schema:Product/brand                113,209    13%
schema:Product/category              62,170     7%
schema:Product/productID             47,088     5%
…                                         …      …

http://webdatacommons.org/structureddata/2018‐12/stats/html‐md.xlsx
Example schema:Product/description value: The Samsung Galaxy S4 is the entertaining and helpful companion for your mobile life. It connects you with your loved ones. It lets you experience and capture unforgettable moments together. It simplifies your everyday life.
Example identifier values: UPC 610214632623, 000214632623

Using Product ID Annotations as Supervision for Product Matching
– Some e‐shops annotate product IDs
– Most e‐shops do not

Properties                     # PLDs      %
schema:Product/name           754,812    92%
schema:Product/description    520,307    64%
schema:Product/sku            160,343    19%
schema:Product/productID       47,088     5%
schema:Product/mpn             12,882   1.6%
schema:Product/gtin13           7,994     1%

Learn How to Match Products using Schema.org Data as Supervision
[Figure: clusters of offers having the same product ID are used to learn a matcher. The matcher then decides for unseen offers without product IDs: same product?]
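
Turning ID clusters into labeled training pairs can be sketched as follows. This is an illustration under assumptions, not the procedure used to build the published training sets: the function name `build_pairs`, the toy offers, and the random choice of negatives are hypothetical simplifications.

```python
import random
from itertools import combinations

# Toy clusters: cluster id -> offer titles sharing the same product ID.
CLUSTERS = {
    "id-1": ["Galaxy S4 16GB black", "Samsung Galaxy S4 (16 GB), black"],
    "id-2": ["iPhone 5 32GB white", "Apple iPhone 5 white 32 GB",
             "iPhone 5 (white, 32GB)"],
}


def build_pairs(clusters, negatives_per_positive=1, seed=0):
    """Return (offer_a, offer_b, label) triples.

    label 1: both offers come from the same cluster (same product ID).
    label 0: the offers come from different clusters.
    """
    rng = random.Random(seed)
    cluster_ids = sorted(clusters)
    pairs = []
    for cid in cluster_ids:
        others = [c for c in cluster_ids if c != cid and clusters[c]]
        for a, b in combinations(clusters[cid], 2):
            pairs.append((a, b, 1))
            if not others:
                continue
            # Assumption: negatives are drawn at random from other clusters.
            for _ in range(negatives_per_positive):
                other = rng.choice(others)
                pairs.append((a, rng.choice(clusters[other]), 0))
    return pairs


for a, b, label in build_pairs(CLUSTERS):
    print(label, "|", a, "|", b)
```

Random negatives are usually easy to separate; a harder training set would pick negatives that are textually similar to the positive offer.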

Data Cleansing for Building the Clusters
Pipeline steps and the data remaining after each step:
– Filtering of product offers with annotated product identifiers: 121M offers out of 812M
– Removal of listing pages: 58M offers
– Filtering by identifier value length: 26M offers
– Cluster creation based on identifier value co‐occurrence: 16.4M clusters
– Split wrong clusters due to category IDs: 16.6M clusters
Resulting corpus:
– All languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
– English offers: 16M offers, 43K distinct websites, 10M clusters (products)
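
Cluster creation based on identifier value co‐occurrence can be read as finding connected components: identifier values that appear on the same offer are linked, and offers whose identifiers are linked end up in one cluster. The sketch below uses a union‐find structure to show the idea. The function name, the toy offers, and the length threshold `min_length=8` are assumptions for illustration, not values taken from the talk.

```python
def cluster_offers(offers, min_length=8):
    """offers: dict offer_id -> set of identifier values.

    Returns a sorted list of clusters (each a sorted list of offer ids).
    Identifier values shorter than min_length are ignored (hypothetical
    threshold standing in for the length filter).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    filtered = {
        oid: sorted(v for v in ids if len(v) >= min_length)
        for oid, ids in offers.items()
    }
    # Link all identifier values that co-occur on the same offer.
    for ids in filtered.values():
        for value in ids[1:]:
            union(ids[0], value)

    clusters = {}
    for oid, ids in filtered.items():
        if ids:  # offers without a usable identifier are dropped
            clusters.setdefault(find(ids[0]), []).append(oid)
    return sorted(sorted(members) for members in clusters.values())


OFFERS = {
    "shopA/1": {"610214632623"},
    "shopB/7": {"610214632623", "GT-I9505ZKA"},
    "shopC/3": {"GT-I9505ZKA"},
    "shopD/9": {"885909727445"},
    "shopE/2": {"123"},  # too short, filtered out
}
print(cluster_offers(OFFERS))
```

Transitive linking is also what makes the later splitting step necessary: a single category ID annotated as a product ID would merge many unrelated products into one component.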

Cluster Size Distribution by Category
A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large‐Scale Product Matching.
ECNLP 2019 Workshop @ WWW2019.

Pre‐Assembled Training Sets
http://webdatacommons.org/largescaleproductcorpus/v2/
– Four categories: computers, cameras, watches, shoes
– Four sizes: small, medium, large, xlarge
– 9,000 to 214,000 examples
– 93.4% of the pairs are correct (evaluation sample: 900 pairs)
– Statistics about xlarge training set

Gold Standard
– Mixture of random and difficult borderline pairs.
– All pairs are manually verified.

Comparison to Existing Benchmark Datasets
Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
[Comparison table not recoverable; surviving labels: 1,200 · Not public · Our Datasets]

Results: Traditional Learning Methods
Magellan: xlarge set

Category         Method         Features                         Precision   Recall   F1
Computers        XGBoost        title‐description‐brand+specs    0.74        0.55     0.62
Cameras          XGBoost        title‐description‐brand+specs    0.72        0.58     0.64
Watches          XGBoost        title‐description‐brand+specs    0.76        0.50     0.60
Shoes            RandomForest   title‐description‐brand+specs    0.74        0.51     0.60
All categories   RandomForest   title‐description‐brand+specs    0.48        0.77     0.59

Word Co‐Occurrence: xlarge set

Category         Method         Features                         Precision   Recall   F1
Computers        LinearSVC      title‐description‐brand+specs    0.86        0.80     0.83
Cameras          LinearSVC      title‐description‐brand+specs    0.83        0.65     0.73
Watches          LogisticReg    title‐description‐brand+specs    0.87        0.68     0.77
Shoes            LogisticReg    title                            0.87        0.61     0.72
All categories   RandomForest   title‐description‐brand+specs    0.85        0.65     0.75
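
One plausible reading of the word co‐occurrence features is a binary vector per offer pair that marks which vocabulary words occur in both offers. The sketch below builds such vectors with the standard library only. The tokenization and the function names are assumptions; the resulting vectors would then be fed to a classifier such as those listed in the table.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_vocabulary(pairs):
    """Map every token seen in any offer to a vector position."""
    tokens = sorted({t for a, b in pairs for t in tokenize(a) | tokenize(b)})
    return {tok: i for i, tok in enumerate(tokens)}


def cooccurrence_vector(offer_a, offer_b, vocab):
    """Binary vector: 1 where a vocabulary word occurs in both offers."""
    shared = tokenize(offer_a) & tokenize(offer_b)
    vector = [0] * len(vocab)
    for tok in shared:
        if tok in vocab:
            vector[vocab[tok]] = 1
    return vector


PAIRS = [
    ("Samsung Galaxy S4 16GB black", "Galaxy S4 black 16GB Samsung phone"),
    ("Samsung Galaxy S4 16GB black", "Apple iPhone 5 white"),
]
VOCAB = build_vocabulary(PAIRS)
for a, b in PAIRS:
    print(sum(cooccurrence_vector(a, b, VOCAB)), "shared words |", a, "|", b)
```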

DeepMatcher Framework
‐ attribute embedding: word‐based or character‐based; pre‐trained (word2vec, GloVe, fastText) or learned
‐ attribute summarization: SIF, RNN, Attention, Hybrid
‐ attribute comparison: fixed distance, learnable distance
‐ classification: fully connected neural net
Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
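
To make the summarization and comparison steps concrete, the sketch below combines a SIF‐style weighted average of word vectors with a fixed distance (cosine similarity). It is not DeepMatcher code: the two‐dimensional embeddings and word probabilities are toy values, the function names are hypothetical, and the principal‐component removal step of the original SIF method is omitted.

```python
import math

# Toy word vectors and unigram probabilities (made up for illustration).
EMBEDDINGS = {
    "galaxy": [1.0, 0.0],
    "s4": [0.9, 0.1],
    "iphone": [0.0, 1.0],
    "the": [0.5, 0.5],
}
WORD_PROB = {"the": 0.05, "galaxy": 0.0001, "s4": 0.0001, "iphone": 0.0001}


def sif_summary(tokens, embeddings, word_prob, a=1e-3):
    """Average word vectors with weight a / (a + p(w)).

    Frequent words such as "the" get a small weight, rare words a large one.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        if tok not in embeddings:
            continue
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
        count += 1
    return [x / count for x in total] if count else total


def cosine(u, v):
    """Fixed (non-learned) similarity between two attribute summaries."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


title_a = sif_summary(["the", "galaxy", "s4"], EMBEDDINGS, WORD_PROB)
title_b = sif_summary(["galaxy", "s4"], EMBEDDINGS, WORD_PROB)
title_c = sif_summary(["the", "iphone"], EMBEDDINGS, WORD_PROB)
print(cosine(title_a, title_b), cosine(title_a, title_c))
```

In the framework, the per‐attribute comparison values would be passed on to the classification layer; a learnable distance replaces `cosine` with a trained network.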

Results: DeepMatcher
– Near human‐level matching performance just using web data.
– The 6.6% errors in the training data are averaged out in the learning.

Category         Features                    Precision   Recall   F1
Computers        title‐description‐brand     0.91        0.97     0.95
Cameras          title                       0.94        0.93     0.92
Watches          title                       0.95        0.97     0.96
Shoes            title                       0.96        0.99     0.95
All categories   title‐description‐brand     0.91        0.93     0.92

Learning Curves
Deep Matcher vs. Baselines
– Results get really good starting at 100K training examples
– Gap Deep Matcher vs. Random Forest @200K: 17% F1

Learning Curves: Deep Matcher Configurations
– RNN, fastText embeddings pre‐trained on Wikipedia
– End‐to‐end training, which adjusts the embeddings: +2‐3% F1

Back to Our Question
Can we achieve the same matching performance using the Semantic Web instead of manually labeled training data?
Answer: Yes
Implications:
– Potential to save money on building matchers
– Potential to save money on maintaining matchers

Thank you.
– Paper:
Bizer, Primpeli, Peeters: Using the Semantic Web as a source of training data.
Datenbank‐Spektrum, 19, 127‐135, 2019.
– Training Data and Gold Standard
