Talk at the OpenKG Forum co-located with JIST2019 about using schema.org annotations from the Web for training product matchers.
See also:
http://webdatacommons.org/largescaleproductcorpus/v2/
http://jist2019.openkg.cn/index.php/openkgasia-forum/
Using the Semantic Web as Training Data for Product Matching
1. Data and Web Science Group
Using the Semantic Web as Training Data for Product Matching
Prof. Dr. Christian Bizer
OpenKG Forum, Nov. 25, 2019, Hangzhou, China
2. Data and Web Science Group
Product Matching
– Does an offer on one website refer to the same product as
another offer on a different website?
– Core Challenge in E‐Commerce
– Necessary for building
price comparison portals
– Necessary for building
product knowledge graphs
Christian Bizer: Using the Semantic Web as Training Data. Open KG Forum, Hangzhou, 2019.11.25 2
3. Data and Web Science Group
Why is the task hard?
– For marketing reasons, merchants present the same
product differently
– Heterogeneous
product title
– Heterogeneous
product description
– Heterogeneous
specification tables
– Heterogeneous
categorization
4. Data and Web Science Group
State of the Art: Product Matching
Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
Dong: ML for Entity Linkage. DI&ML tutorial at SIGMOD 2018.
– Deep Learning
– combining embeddings and RNNs
– Large Training Sets >100K examples
– owned by large companies such as Walmart, Amazon, Alibaba
– Matching Performance: F1 >90%
5. Data and Web Science Group
Question
Can we achieve the same matching
performance using the Semantic Web
as a source of training data?
6. Data and Web Science Group
Schema.org Annotations
– Since 2011, search engines (Google, Microsoft, Yahoo, Yandex) have asked
site owners to annotate data for enriching search results
– 675 types, including Event, Place, LocalBusiness, Product, Review, Person
– Encoding: Microdata, RDFa, JSON‐LD
Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.25
7. Data and Web Science Group
Annotation Example
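The example shown on this slide did not survive extraction. As a stand-in, here is a minimal sketch of what a schema.org product annotation in JSON‐LD can look like and how it could be read with the Python standard library. The page content, product name, identifiers, and the class name JsonLdExtractor are all hypothetical and chosen only for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical product page; all names and values are illustrative only.
HTML = """
<html><body>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Phone X1 64GB Black",
  "sku": "EX-X1-64-BK",
  "gtin13": "0000000000000",
  "brand": {"@type": "Brand", "name": "ExampleBrand"},
  "offers": {
    "@type": "Offer",
    "price": "299.00",
    "priceCurrency": "EUR",
    "availability": "https://schema.org/InStock"
  }
}
</script>
</body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of all JSON-LD script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content is delivered as raw text, possibly in several chunks.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


extractor = JsonLdExtractor()
extractor.feed(HTML)
product = extractor.items[0]
print(product["name"], product["offers"]["price"], product["offers"]["priceCurrency"])
```

Microdata and RDFa annotations carry the same schema.org properties but are embedded in HTML attributes, so they would need a different extraction step.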
8. Data and Web Science Group
Web Data Commons – Structured Data
– extracts all Microformat, Microdata,
RDFa, JSON‐LD data from the Common Crawl
– analyzes and provides the extracted data for download
– statistics of some extraction runs
– 2018 CC Corpus: 2.5 billion HTML pages, 31.5 billion RDF triples
– 2017 CC Corpus: 3.1 billion HTML pages, 38.2 billion RDF triples
– 2014 CC Corpus: 2.0 billion HTML pages, 20.4 billion RDF triples
– 2010 CC Corpus: 2.8 billion HTML pages, 5.1 billion RDF triples
– Download
– http://webdatacommons.org/structureddata/
9. Data and Web Science Group
Language and Top‐Level‐Domain
Distribution of Common Crawl
Language: English 44%, Chinese 8%, Russian 7%, German 5%, Japanese 5%, French 4%, Spanish 4%, Other 23%
Top‐level domain: com 52%, org 6%, net 4%, uk 2%, de 4%, jp 2%, ru 6%, cn 1%, other 23%
10. Data and Web Science Group
Overall Adoption 2018
http://webdatacommons.org/structureddata/2018‐12/
944 million HTML pages out of the 2.5 billion pages
provide semantic annotations (37.1%).
9.6 million pay-level-domains (PLDs) out of the
32.8 million pay-level-domains covered by the crawl
provide semantic annotations (29.3%).
12. Data and Web Science Group
Attributes used to Describe Products
Top Attributes (Microdata)        # PLDs     % PLDs
schema:Product/name               754,812    92 %
schema:Product/offers             645,994    79 %
schema:Offer/price                639,598    78 %
schema:Offer/priceCurrency        606,990    74 %
schema:Product/image              573,614    70 %
schema:Product/description        520,307    64 %
schema:Offer/availability         477,170    58 %
schema:Product/url                364,889    44 %
schema:Product/sku                160,343    19 %
schema:Product/aggregateRating    141,194    17 %
schema:Product/brand              113,209    13 %
schema:Product/category            62,170     7 %
schema:Product/productID           47,088     5 %
…                                 …          …
http://webdatacommons.org/structureddata/2018‐12/stats/html‐md.xlsx
Example offer description (translated from German): "The Samsung Galaxy S4 is
the entertaining and helpful companion for your mobile life. It connects you
with your loved ones. It lets you experience and capture unforgettable moments
together. It simplifies your everyday life."
UPC 610214632623
000214632623
13. Data and Web Science Group
Using Product ID Annotations
as Supervision for Product Matching
– Some e‐shops annotate product IDs
– Most e‐shops do not
Properties                     # PLDs     % PLDs
schema:Product/name            754,812    92 %
schema:Product/description     520,307    64 %
schema:Product/sku             160,343    19 %
schema:Product/productID        47,088     5 %
schema:Product/mpn              12,882     1.6 %
schema:Product/gtin13            7,994     1 %
14. Data and Web Science Group
Learn How to Match Products using
Schema.org Data as Supervision
[Diagram: Clusters of product offers having the same product ID → Learn → Matcher. Unseen product offers without product IDs → Matcher → Same product?]
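The idea of using identifier clusters as supervision can be sketched as follows: offers that share a product ID yield matching pairs, and offers from different clusters yield non‐matching pairs. This is a minimal sketch under stated assumptions. The offers, the function name build_pairs, and the random sampling of negatives are hypothetical and are not taken from the talk; the actual training sets may select pairs differently.

```python
from collections import defaultdict
from itertools import combinations
import random

# Hypothetical offers: (offer_id, product_identifier, title)
offers = [
    ("o1", "ID-A", "Example Phone X1 64GB black"),
    ("o2", "ID-A", "ExampleBrand X1 smartphone 64 GB"),
    ("o3", "ID-A", "X1 64GB (black) by ExampleBrand"),
    ("o4", "ID-B", "Example Camera Z5 body only"),
    ("o5", "ID-B", "ExampleBrand Z5 digital camera"),
    ("o6", "ID-C", "Example Watch W2 steel"),
]


def build_pairs(offers, negatives_per_positive=1, seed=0):
    """Derive labeled offer pairs from clusters of offers sharing an identifier."""
    clusters = defaultdict(list)
    for offer_id, identifier, _title in offers:
        clusters[identifier].append(offer_id)

    # Positive pairs (label 1): two offers inside the same cluster.
    positives = [
        (a, b, 1)
        for members in clusters.values()
        for a, b in combinations(members, 2)
    ]

    # Negative pairs (label 0): two offers from different clusters,
    # sampled at random here (an assumption made for this sketch).
    cluster_of = {offer_id: identifier for offer_id, identifier, _ in offers}
    candidates = [
        (a, b, 0)
        for a, b in combinations(sorted(cluster_of), 2)
        if cluster_of[a] != cluster_of[b]
    ]
    rng = random.Random(seed)
    k = min(len(candidates), negatives_per_positive * len(positives))
    negatives = rng.sample(candidates, k)
    return positives, negatives


positives, negatives = build_pairs(offers)
print(len(positives), "positive pairs,", len(negatives), "negative pairs")
```

A matcher trained on such pairs can then be applied to unseen offers that carry no product ID.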
15. Data and Web Science Group
Data Cleansing
for Building the Clusters
Cleansing steps, in order, with the data remaining after each step:
– Filtering of product offers with annotated product identifiers: 121M offers out of 812M
– Removal of listing pages: 58M offers
– Filtering by identifier value length: 26M offers
– Cluster creation based on identifier value co‐occurrence: 16.4M clusters
– Splitting of wrong clusters due to category IDs: 16.6M clusters
All Languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
English Offers: 16M offers, 43K distinct websites, 10M clusters (products)
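Two of the cleansing steps, filtering by identifier value length and cluster creation based on identifier value co‐occurrence, can be sketched with a union‐find structure. This is an illustrative sketch only: the length thresholds, the sample offers, and the function name cluster_offers are assumptions, and the removal of listing pages and the splitting of clusters caused by category IDs are not modeled.

```python
# Hypothetical offers, each annotated with one or more identifier values.
offers = {
    "o1": ["0000000000017"],
    "o2": ["0000000000017", "EX-MPN-0001"],
    "o3": ["EX-MPN-0001"],
    "o4": ["0000000000024"],
    "o5": ["12"],  # too short to be a plausible product identifier
}

MIN_LEN, MAX_LEN = 8, 25  # assumed thresholds, not taken from the talk


def cluster_offers(offers, min_len=MIN_LEN, max_len=MAX_LEN):
    """Group offers that are connected through shared identifier values."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Step 1: keep only identifier values within the accepted length range.
    kept = {}
    for offer_id, identifiers in offers.items():
        valid = [i for i in identifiers if min_len <= len(i) <= max_len]
        if valid:
            kept[offer_id] = valid

    # Step 2: link every offer to each of its identifier values, so offers
    # sharing a value (directly or transitively) end up in one component.
    for offer_id, identifiers in kept.items():
        for identifier in identifiers:
            union(("offer", offer_id), ("id", identifier))

    clusters = {}
    for offer_id in kept:
        clusters.setdefault(find(("offer", offer_id)), set()).add(offer_id)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_offers(offers)
print(clusters)
```

Because clustering is transitive, a single wrong identifier value (for example a category ID used in place of a product ID) can merge unrelated products, which is why a separate splitting step is needed afterwards.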
16. Data and Web Science Group
Cluster Size Distribution by Category
A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large‐Scale Product Matching.
ECNLP 2019 Workshop @ WWW2019.
17. Data and Web Science Group
Pre‐Assembled Training Sets
http://webdatacommons.org/largescaleproductcorpus/v2/
– Four categories: computers, cameras, watches, shoes
– Four sizes: small, medium, large, xlarge
– 9,000 to 214,000 examples
– 93.4% of the pairs are correct (evaluation sample: 900 pairs)
– Statistics about xlarge training set
18. Data and Web Science Group
Gold Standard
– Mixture of random and difficult borderline pairs.
– All pairs are manually verified.
http://webdatacommons.org/largescaleproductcorpus/v2/
19. Data and Web Science Group
Comparison to Existing Benchmark
Datasets
[Table: existing benchmark datasets vs. our datasets; labels in the figure: "Not public", "Our Datasets"]
Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
20. Data and Web Science Group
Results: Traditional Learning Methods
Magellan: xlarge set
Category          Method and Features                                   Precision   Recall   F1
Computers         XGBoost title‐description‐brand+specs                 0.74        0.55     0.62
Cameras           XGBoost title‐description‐brand+specs                 0.72        0.58     0.64
Watches           XGBoost title‐description‐brand+specs                 0.76        0.50     0.60
Shoes             RandomForest title‐description‐brand+specs            0.74        0.51     0.60
All categories    RandomForest title‐description‐brand+specs            0.48        0.77     0.59

Word Co‐Occurrence: xlarge set
Category          Method and Features                                   Precision   Recall   F1
Computers         LinearSVC title‐description‐brand+specs               0.86        0.80     0.83
Cameras           LinearSVC title‐description‐brand+specs               0.83        0.65     0.73
Watches           LogisticReg title‐description‐brand+specs             0.87        0.68     0.77
Shoes             LogisticReg title                                     0.87        0.61     0.72
All categories    RandomForest title‐description‐brand+specs            0.85        0.65     0.75
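One plausible reading of the word co‐occurrence baseline is a binary feature vector per offer pair, with one position per vocabulary word that is set to 1 when the word occurs in both offers. The sketch below illustrates that reading. The tokenizer, the sample titles, and the function names are assumptions and may differ from the feature construction used in the experiments.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_vocabulary(pairs):
    """Map every token seen in any offer of any pair to a vector position."""
    tokens = {tok for left, right in pairs for tok in tokenize(left) | tokenize(right)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def cooccurrence_vector(left, right, vocab):
    """Binary vector: 1 at the positions of words occurring in both offers."""
    vector = [0] * len(vocab)
    for tok in tokenize(left) & tokenize(right):
        if tok in vocab:
            vector[vocab[tok]] = 1
    return vector


# Hypothetical offer title pairs.
pairs = [
    ("Example Phone X1 64GB black", "ExampleBrand X1 smartphone 64GB"),
    ("Example Phone X1 64GB black", "Example Camera Z5 body"),
]
vocab = build_vocabulary(pairs)
vectors = [cooccurrence_vector(left, right, vocab) for left, right in pairs]
print([sum(v) for v in vectors])
```

Vectors of this kind can be passed to any standard classifier, such as the linear SVM, logistic regression, or random forest models named in the table.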
21. Data and Web Science Group
DeepMatcher Framework
‐ attribute embedding: word‐based or character‐based; pre‐trained (word2vec, GloVe, fastText) or learned
‐ attribute summarization: SIF, RNN, Attention, Hybrid
‐ attribute comparison: fixed distance, learnable distance
‐ classification: fully connected neural net
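The stages listed above (embedding, summarization, comparison, classification) can be illustrated with a toy pipeline in plain Python. This is not the DeepMatcher API and not a trained model: the hash‐based embed function stands in for real pre‐trained embeddings, the summarizer is a SIF‐style weighted average without the principal‐component removal step, and the weights and bias are untrained placeholders. All names are hypothetical.

```python
import hashlib
import math

DIM = 8  # illustrative embedding size


def embed(token, dim=DIM):
    """Deterministic pseudo-embedding from a hash (stand-in for fastText etc.)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(byte / 255.0) * 2.0 - 1.0 for byte in digest[:dim]]


def summarize(text, word_prob, a=1e-3):
    """SIF-style summary: average of embeddings weighted by a / (a + p(word))."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * DIM
    total = [0.0] * DIM
    for tok in tokens:
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, value in enumerate(embed(tok)):
            total[i] += weight * value
    return [value / len(tokens) for value in total]


def compare(u, v):
    """Fixed-distance comparison: element-wise absolute difference."""
    return [abs(x - y) for x, y in zip(u, v)]


def classify(features, weights, bias):
    """Single logistic unit standing in for the fully connected classifier."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def match_score(left, right, word_prob, weights, bias):
    left_summary = summarize(left, word_prob)
    right_summary = summarize(right, word_prob)
    return classify(compare(left_summary, right_summary), weights, bias)


# Untrained placeholder parameters; a real matcher learns these from data.
weights, bias = [-1.0] * DIM, 2.0
word_prob = {"the": 0.05}  # hypothetical unigram probabilities

same = match_score("example phone x1", "example phone x1", word_prob, weights, bias)
different = match_score("example phone x1", "example camera z5", word_prob, weights, bias)
print(round(same, 3), round(different, 3))
```

In the learnable variants named on the slide, the summarization and comparison stages contain trainable parameters as well, and the whole pipeline is trained jointly on labeled offer pairs.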
Mudgal, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
22. Data and Web Science Group
Results: DeepMatcher
– Near human‐level matching performance just using web data.
– The 6.6% errors in the training data are averaged out in the learning.
Category          Features                   Precision   Recall   F1
Computers         title‐description‐brand    0.91        0.97     0.95
Cameras           title                      0.94        0.93     0.92
Watches           title                      0.95        0.97     0.96
Shoes             title                      0.96        0.99     0.95
All categories    title‐description‐brand    0.91        0.93     0.92
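For readers less familiar with the metrics in the result tables, the following sketch shows how precision, recall, and F1 are computed from the counts of true positives, false positives, and false negatives. The counts in the usage example are made up and do not correspond to any row of the tables.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 (harmonic mean of the two) from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct matches found, 10 wrong matches, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```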
23. Data and Web Science Group
Learning Curves
DeepMatcher vs. Baselines
– Results get really good starting at 100K training examples
– Gap between DeepMatcher and Random Forest at 200K training examples: 17% F1
24. Data and Web Science Group
Learning Curves:
DeepMatcher Configurations
– RNN, fastText embeddings pre‐trained on Wikipedia
– End‐to‐end training, which adjusts the embeddings: +2‐3% F1
25. Data and Web Science Group
Back to Our Question
Can we achieve the same matching performance using
the Semantic Web instead of manually labeled training data?
Answer: Yes
Implications:
– Potential to save money on building matchers
– Potential to save money on maintaining matchers
26. Data and Web Science Group
Thank you.
– Paper:
Bizer, Primpeli, Peeters: Using the Semantic Web as a source of training data.
Datenbank‐Spektrum, 19, 127‐135, 2019.
– Training Data and Gold Standard:
http://webdatacommons.org/largescaleproductcorpus/v2/