
Using the Semantic Web as Training Data for Product Matching

Talk at the OpenKG Forum co-located with JIST2019 about using schema.org annotations from the Web for training product matchers.
See also:
http://webdatacommons.org/largescaleproductcorpus/v2/
http://jist2019.openkg.cn/index.php/openkgasia-forum/

  1. Using the Semantic Web as Training Data for Product Matching
     Prof. Dr. Christian Bizer, Data and Web Science Group
     OpenKG Forum, Nov. 25, 2019, Hangzhou, China
  2. Product Matching
     – Does an offer on one website refer to the same product as another offer on a different website?
     – Core challenge in e-commerce
     – Necessary for building price comparison portals
     – Necessary for building product knowledge graphs
     Christian Bizer: Using the Semantic Web as Training Data. Open KG Forum, Hangzhou, 2019.11.25
  3. Why is the task hard?
     – For marketing reasons, merchants present the same product differently
     – Heterogeneous product titles
     – Heterogeneous product descriptions
     – Heterogeneous specification tables
     – Heterogeneous categorization
  4. State of the Art: Product Matching
     – Deep learning, combining embeddings and RNNs
     – Large training sets (>100K examples), owned by large companies such as Walmart, Amazon, Alibaba
     – Matching performance: F1 > 90%
     Sources: Mudgal et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018. Dong: ML for Entity Linkage. DI&ML tutorial at SIGMOD 2018.
  5. Question
     Can we achieve the same matching performance using the Semantic Web as a source of training data?
  6. Schema.org Annotations
     – Since 2011, the search engines behind Schema.org have asked site owners to annotate data for enriching search results
     – 675 types, including Event, Place, LocalBusiness, Product, Review, Person
     – Encodings: Microdata, RDFa, JSON-LD
     Christian Bizer: Completing Knowledge Graphs. JIST2019, Hangzhou, 2019.11.25
  7. Annotation Example
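Slide 7 shows an annotation example whose image is not part of this transcript. As a minimal sketch of what a schema.org product annotation can look like and how it can be read, the Python snippet below pulls a Product out of a JSON-LD script block using only the standard library. The HTML fragment, all product values, and the class name JsonLdExtractor are invented for illustration and are not taken from the talk.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical page fragment; every value below is made up.
HTML = """
<html><body>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Phone 64GB", "sku": "EX-64-BLK", "gtin13": "0000000000017",
 "offers": {"@type": "Offer", "price": "199.00", "priceCurrency": "EUR"}}
</script>
</body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(HTML)
products = [item for item in extractor.items if item.get("@type") == "Product"]

for product in products:
    print(product["name"], product.get("sku"), product.get("gtin13"))
```

The identifier properties read here (sku, gtin13) are the kind of annotation that the later slides use as supervision.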
  8. Web Data Commons – Structured Data
     – Extracts all Microformat, Microdata, RDFa, and JSON-LD data from the Common Crawl
     – Analyzes the extracted data and provides it for download
     – Statistics of some extraction runs:
       – 2018 CC corpus: 2.5 billion HTML pages, 31.5 billion RDF triples
       – 2017 CC corpus: 3.1 billion HTML pages, 38.2 billion RDF triples
       – 2014 CC corpus: 2.0 billion HTML pages, 20.4 billion RDF triples
       – 2010 CC corpus: 2.8 billion HTML pages, 5.1 billion RDF triples
     – Download: http://webdatacommons.org/structureddata/
  9. Language and Top-Level-Domain Distribution of Common Crawl
     – Languages: English 44%, Chinese 8%, Russian 7%, German 5%, Japanese 5%, French 4%, Spanish 4%, other 23%
     – Top-level domains: com 52%, org 6%, net 4%, uk 2%, de 4%, jp 2%, ru 6%, cn 1%, other 23%
  10. Overall Adoption 2018
      – 944 million of the 2.5 billion HTML pages provide semantic annotations (37.1%).
      – 9.6 million of the 32.8 million pay-level domains (PLDs) covered by the crawl provide semantic annotations (29.3%).
      Source: http://webdatacommons.org/structureddata/2018-12/
  11. Frequently used Schema.org Classes
      Number of websites (PLDs) using each class, by encoding:

      Class                     Microdata    JSON-LD
      schema:WebPage            1,124,583    121,393
      schema:Product            812,205      40,169
      schema:Offer              676,899      57,756
      schema:BreadcrumbList     621,344      205,971
      schema:Article            612,361      57,082
      schema:Organization       510,069      1,349,775
      schema:PostalAddress      502,615      176,500
      schema:ImageObject        360,875      111,946
      schema:Blog               337,843      12,174
      schema:Person             324,349      335,784
      schema:LocalBusiness      294,390      249,017
      schema:AggregateRating    258,078      23,105
      schema:Review             124,022      6,622
      schema:Place              92,127       66,396
      schema:Event              88,130       63,605

      Source: http://webdatacommons.org/structureddata/2018-12/
  12. Attributes used to Describe Products
      Number and share of PLDs using each attribute (Microdata):

      Attribute                         PLDs       %
      schema:Product/name               754,812    92%
      schema:Product/offers             645,994    79%
      schema:Offer/price                639,598    78%
      schema:Offer/priceCurrency        606,990    74%
      schema:Product/image              573,614    70%
      schema:Product/description        520,307    64%
      schema:Offer/availability         477,170    58%
      schema:Product/url                364,889    44%
      schema:Product/sku                160,343    19%
      schema:Product/aggregateRating    141,194    17%
      schema:Product/brand              113,209    13%
      schema:Product/category           62,170     7%
      schema:Product/productID          47,088     5%
      …                                 …          …

      Example values shown on the slide:
      – Description (translated from German): "The Samsung Galaxy S4 is the entertaining and helpful companion for your mobile life. It connects you with your loved ones. It lets you experience and capture unforgettable moments together. It simplifies your everyday life."
      – Product identifiers: UPC 610214632623, 000214632623

      Source: http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
  13. Using Product ID Annotations as Supervision for Product Matching
      – Some e-shops annotate product IDs
      – Most e-shops do not

      Property                      PLDs       %
      schema:Product/name           754,812    92%
      schema:Product/description    520,307    64%
      schema:Product/sku            160,343    19%
      schema:Product/productID      47,088     5%
      schema:Product/mpn            12,882     1.6%
      schema:Product/gtin13         7,994      1%
  14. Learn How to Match Products using Schema.org Data as Supervision
      – Diagram: clusters of offers having the same product ID are used to learn a matcher
      – The matcher then decides for unseen offers without product IDs: same product?
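The idea on slide 14 can be sketched in a few lines of Python: offers that share a cluster (that is, a product ID) yield positive training pairs, and offers from different clusters yield negative pairs. This is only an illustration under stated assumptions. The offers, the function name build_pairs, and the random sampling of negatives are hypothetical; the slides do not say how non-matching pairs were selected.

```python
import random
from itertools import combinations

# Hypothetical offers: (offer_id, cluster_id, title). All values are invented.
offers = [
    ("o1", "c1", "Example Phone 64GB black"),
    ("o2", "c1", "ExamplePhone 64 GB (Black)"),
    ("o3", "c2", "Example Phone 128GB black"),
    ("o4", "c3", "Sample Camera 24MP body"),
    ("o5", "c3", "Sample Cam 24 MP, body only"),
]


def build_pairs(offers, negatives_per_positive=1, seed=0):
    """Return (offer_a, offer_b, label) triples; label 1 = same product."""
    cluster_of = {offer_id: cluster_id for offer_id, cluster_id, _ in offers}

    by_cluster = {}
    for offer_id, cluster_id in cluster_of.items():
        by_cluster.setdefault(cluster_id, []).append(offer_id)

    # Positives: every pair of offers inside the same ID cluster.
    positives = [
        (a, b, 1)
        for members in by_cluster.values()
        for a, b in combinations(sorted(members), 2)
    ]

    # Negatives: pairs across clusters, sampled at random (an assumption).
    cross = [
        (a, b, 0)
        for a, b in combinations(sorted(cluster_of), 2)
        if cluster_of[a] != cluster_of[b]
    ]
    random.Random(seed).shuffle(cross)
    negatives = cross[: negatives_per_positive * len(positives)]
    return positives + negatives


pairs = build_pairs(offers)
for a, b, label in pairs:
    print(a, b, label)
```

In practice, harder negatives (offers that look similar but carry different IDs) would likely be more informative than purely random ones.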
  15. Data Cleansing for Building the Clusters
      – Filtering of product offers with annotated product identifiers: 121M offers out of 812M
      – Removal of listing pages: 58M offers
      – Filtering by identifier value length: 26M offers
      – Cluster creation based on identifier value co-occurrence: 16.4M clusters
      – Splitting of wrong clusters caused by category IDs: 16.6M clusters
      – All languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
      – English offers: 16M offers, 43K distinct websites, 10M clusters (products)
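Two of the cleansing steps on slide 15, filtering by identifier value length and clustering by identifier co-occurrence, can be sketched with a small union-find structure. This is a simplified illustration, not the pipeline used for the corpus: the function name cluster_offers, the example offers, and the minimum length of 8 characters are assumptions, and the removal of listing pages and the splitting of clusters polluted by category IDs are left out.

```python
def cluster_offers(offer_ids, min_len=8):
    """Group offers that share at least one identifier value.

    offer_ids maps an offer key to the set of identifier values annotated
    on it. min_len is a hypothetical threshold for dropping short values.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_offer_with = {}
    for offer, values in offer_ids.items():
        kept = {v for v in values if len(v) >= min_len}
        if not kept:
            continue  # no usable identifier left on this offer
        parent.setdefault(offer, offer)
        for value in kept:
            if value in first_offer_with:
                union(first_offer_with[value], offer)
            else:
                first_offer_with[value] = offer

    clusters = {}
    for offer in parent:
        clusters.setdefault(find(offer), set()).add(offer)
    return list(clusters.values())


# Hypothetical offers with invented identifier values.
example = {
    "shopA/1": {"0000000000017", "EX-64-BLK-01"},
    "shopB/9": {"0000000000017"},
    "shopC/3": {"EX-64-BLK-01"},
    "shopD/5": {"0000000000024"},
    "shopE/2": {"123"},  # too short, dropped by the length filter
}
clusters = cluster_offers(example)
print(sorted(sorted(c) for c in clusters))
```

Because clusters are connected components, shopB/9 and shopC/3 end up together even though they share no identifier directly; both co-occur with shopA/1.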
  16. Cluster Size Distribution by Category
      Source: A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. ECNLP 2019 Workshop @ WWW2019.
  17. Pre-Assembled Training Sets
      – Four categories: computers, cameras, watches, shoes
      – Four sizes: small, medium, large, xlarge
      – 9,000 to 214,000 examples
      – 93.4% of the pairs are correct (evaluation sample: 900 pairs)
      – Statistics about xlarge training set
      http://webdatacommons.org/largescaleproductcorpus/v2/
  18. Gold Standard
      – Mixture of random and difficult borderline pairs
      – All pairs are manually verified
      http://webdatacommons.org/largescaleproductcorpus/v2/
  19. Comparison to Existing Benchmark Datasets
      – Comparison table not preserved in the transcript; surviving labels: "1,200", "Not public", "Our Datasets"
      Source: Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
  20. Results: Traditional Learning Methods

      Magellan, xlarge set
      Category          Method          Features                         Precision    Recall    F1
      Computers         XGBoost         title-description-brand+specs    0.74         0.55      0.62
      Cameras           XGBoost         title-description-brand+specs    0.72         0.58      0.64
      Watches           XGBoost         title-description-brand+specs    0.76         0.50      0.60
      Shoes             RandomForest    title-description-brand+specs    0.74         0.51      0.60
      All categories    RandomForest    title-description-brand+specs    0.48         0.77      0.59

      Word Co-Occurrence, xlarge set
      Category          Method          Features                         Precision    Recall    F1
      Computers         LinearSVC       title-description-brand+specs    0.86         0.80      0.83
      Cameras           LinearSVC       title-description-brand+specs    0.83         0.65      0.73
      Watches           LogisticReg     title-description-brand+specs    0.87         0.68      0.77
      Shoes             LogisticReg     title                            0.87         0.61      0.72
      All categories    RandomForest    title-description-brand+specs    0.85         0.65      0.75
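To make the word co-occurrence baseline and the reported metrics on slide 20 more concrete, here is a minimal Python sketch. It assumes that the feature vector for an offer pair is binary, with a 1 for each vocabulary word that appears in both offers, and it computes precision, recall, and F1 in the usual way. The function names, vocabulary, titles, and labels are hypothetical; tokenization is a plain whitespace split, which is a simplification.

```python
def cooccurrence_vector(text_a, text_b, vocabulary):
    """Binary vector: 1 where a vocabulary word occurs in both texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    shared = words_a & words_b
    return [1 if word in shared else 0 for word in vocabulary]


def precision_recall_f1(y_true, y_pred):
    """Metrics for the positive class (label 1 = same product)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented vocabulary and offer titles.
vocabulary = ["example", "phone", "64gb", "black", "camera"]
features = cooccurrence_vector(
    "Example Phone 64GB black", "example phone 64gb (black)", vocabulary
)
print(features)

# Invented gold labels and predictions for five offer pairs.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Vectors like this would then be fed to a standard classifier such as the LinearSVC, logistic regression, or random forest models named in the table.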
  21. DeepMatcher Framework
      – Embeddings: word-based or character-based; pre-trained (word2vec, GloVe, fastText) or learned
      – Attribute summarization: SIF, RNN, Attention, Hybrid
      – Attribute comparison: fixed distance, learnable distance
      – Classification: fully connected neural net
      Source: Mudgal et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
  22. Results: DeepMatcher
      – Near human-level matching performance using only web data
      – The 6.6% of erroneous pairs in the training data are averaged out during learning

      Category          Features                   Precision    Recall    F1
      Computers         title-description-brand    0.91         0.97      0.95
      Cameras           title                      0.94         0.93      0.92
      Watches           title                      0.95         0.97      0.96
      Shoes             title                      0.96         0.99      0.95
      All categories    title-description-brand    0.91         0.93      0.92
  23. Learning Curves: DeepMatcher vs. Baselines
      – Strong results start at around 100K training examples
      – Gap between DeepMatcher and Random Forest at 200K examples: 17% F1
  24. Learning Curves: DeepMatcher Configurations
      – RNN, fastText embeddings pre-trained on Wikipedia
      – End-to-end training, which adjusts the embeddings: +2-3% F1
  25. Back to Our Question
      Can we achieve the same matching performance using the Semantic Web instead of manually labeled training data?
      Answer: Yes
      Implications:
      – Potential to save money on building matchers
      – Potential to save money on maintaining matchers
  26. Thank you.
      – Paper: Bizer, Primpeli, Peeters: Using the Semantic Web as a Source of Training Data. Datenbank-Spektrum, 19, 127-135, 2019.
      – Training data and gold standard: http://webdatacommons.org/largescaleproductcorpus/v2/
