Learning Regular Expressions 
for the Extraction of Product Attributes 
from E-commerce Microdata 
Petar Petrovski, Volha Bryl, Christian Bizer 
Data and Web Science Research Group 
University of Mannheim, Germany 
LD4IE @ ISWC'2014, October 20, 2014, Riva del Garda, Italy 
School of Business Informatics and Mathematics
Outline 
1. HTML-embedded Data on the Web 
2. Data Integration Pipeline 
3. Learning Regular Expressions 
4. Evaluation 
– Extraction of product attributes 
– Identity resolution for products 
5. Conclusions 
Learning Regular Expressions for the Extraction of Product Attributes from E-Commerce Microdata. 
Petar Petrovski, Volha Bryl, Chris Bizer
HTML-embedded Data 
More and more websites semantically mark up the 
content of their HTML pages. 
Microformats 
Microdata 
RDFa 
Schema.org 
• Search engines (Google, Bing, Yahoo!, Yandex) ask site owners to embed 
data to enrich search results. 
• 200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, … 
• Encoding: Microdata or RDFa 
Usage of Schema.org Data @ Google 
• Data snippets within search results 
• Data snippets within info boxes 
Websites Containing Structured Data 
(November 2013) 
585 million of the 2.2 billion pages contain 
Microformat, Microdata or RDFa data (26%). 
1.7 million websites (PLDs) out of 12.8 million 
provide Microformat, Microdata or RDFa data (13%) 
http://webdatacommons.org/structureddata/ 
Google, October 2013: 
15% of all websites provide structured data. 
Top Classes, Microdata (2013) 
• schema = Schema.org 
• datavoc = Google's Rich Snippet Vocabulary 
The Data Integration Pipeline 
• Objective: integrate all data found on the web describing a 
specific entity (e.g. product or organization) 
• Motivation: enables creation of powerful applications, e.g. 
comparison shopping portals 
• Our use case: product data, electronics & computers 
• Product classification and data fusion are out of the scope of this presentation 
• More details in Petrovski, Bryl, Bizer. Integrating Product Data from Websites 
offering Microdata Markup. DEOS @ WWW 2014 
Web Data Commons Dataset 
• Web Data Commons project: extracts structured data from the Common 
Crawl corpora 
– http://webdatacommons.org/ 
– http://commoncrawl.org/ 
• Our evaluation dataset is extracted from Common Crawl 2012 
– 3 billion HTML pages, 40.6 million websites 
– 7.3 billion statements describing 1.15 billion things 
– 9.4 million product offers from 9240 e-shops 
• 1.9 million products with English descriptions longer than 20 words 
Problem: Product Matching by Titles and Descriptions 
• Offer 1 
– Title: AppleMacBook Air MC968/A 11.6-Inch Laptop 
– Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD 
Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 
enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best 
resolution… 
• Offer 2 
– Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 
GB, 64 GB, Mac OS X Lion 10.7 
– Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD 
and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, 
IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon 
• Different descriptions follow different levels of detail 
• Various abbreviations can be found describing the same features 
• Often imprecise values due to rounding in numeric values can be found 
• Most common attributes 
– Title: 89% 
– Description: 67% 
– Others: scarce
Product Feature Extraction 
• Low precision (69%) for identity resolution without product feature 
extraction, reason – lack of structure 
• We developed the Free Text Preprocessor 
– Makes the data more structured by extracting new property-value 
pairs from free-text properties 
– https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor 
• With pre-processing, precision goes up to 85% 
We used the Silk framework for identity resolution: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/ 
Free Text Preprocessor by Example 
<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" . 
<http://wdc.org/resource/2> <http://schema.org/Product/description> 
"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. 
Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" . 
<http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" . 
<http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" . 
<http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" . 
<http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" . 
The preprocessor worked efficiently when the regular expressions for extracting 
certain attributes were configured manually. 
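The extraction illustrated above can be sketched in a few lines of Python. This is only an illustration, not the Free Text Preprocessor itself: the function name `extract_attributes` and the two hand-written patterns are assumptions made for this example, and the first match of each pattern is taken as the value.

```python
import re

# Hypothetical hand-written patterns, one per target property (assumed for illustration).
PATTERNS = {
    "Storage": r"\d+\s?GB",
    "Display": r"\d+(?:\.\d+)?-inch",
}

def extract_attributes(text, patterns=PATTERNS):
    """Return {property: first matching substring} for every pattern that matches."""
    found = {}
    for prop, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            found[prop] = match.group(0)
    return found

description = (
    "Memory size: 8GB. Colour: Graphite Generation: 6th generation. "
    "Memory type: Integrated. Weight: 21.1g. Radio: With Radio. "
    "Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch"
)
print(extract_attributes(description))
```

Each extracted pair would then be written back as a new property-value statement for the same resource.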
Solution: Learning Regular Expressions 
• No more manual configuration 
– Familiarity with regex syntax and a deep understanding of 
the data are no longer required from the user 
• Approach: Genetic Programming 
• Based on 
– Li et al. Regular expressions learning for information extraction. 
EMNLP’08. 
– Langdon et al. Creating regular expressions as mRNA motifs with 
GP to predict human exon splitting. GECCO’09. 
– Bartoli et al. Automatic generation of regular expressions from 
examples with genetic programming. GECCO’12. 
Learning Regular Expressions 
• Every individual is a tree representing a valid regex 
• Possible nodes 
– Concatenate node || 
– Possessive quantifiers 
• *+, ++, ?+, {m,n}+ 
– Group operator () 
– Character class node [] 
• Terminal nodes 
– Constants (d, 5, abc) 
– Ranges (a-z, 0-9) 
– Character classes (\w or \d) 
– Wildcards (.) 
– Whitespaces (\s) 
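A minimal sketch of such a tree representation, assuming one small Python class per node type. The class names and the `render` method are hypothetical, and the group node is rendered as a non-capturing group, as in the learned expressions reported in the evaluation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Terminal:
    text: str  # constant, range, character class, wildcard or whitespace

    def render(self) -> str:
        return self.text

@dataclass
class Concat:
    children: List[object]

    def render(self) -> str:
        return "".join(child.render() for child in self.children)

@dataclass
class Quantifier:
    child: object
    symbol: str  # e.g. "*+", "++", "?+"

    def render(self) -> str:
        return self.child.render() + self.symbol

@dataclass
class Group:
    child: object

    def render(self) -> str:
        return "(?:" + self.child.render() + ")"

@dataclass
class CharClass:
    children: List[object]
    negated: bool = False

    def render(self) -> str:
        inner = "".join(child.render() for child in self.children)
        return "[" + ("^" if self.negated else "") + inner + "]"

# One individual: a group of digits followed by lowercase letters, repeated.
tree = Quantifier(
    Group(Concat([
        Quantifier(Terminal(r"\d"), "++"),
        Quantifier(CharClass([Terminal("a-z")]), "*+"),
    ])),
    "++",
)
print(tree.render())  # (?:\d++[a-z]*+)++
```

Python's `re` module accepts possessive quantifiers only from version 3.11, so the sketch renders the pattern as a string without compiling it.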
Training Data and Initial Population 
• Input set T: pairs of strings (t,s) 
– t is a text string 
– s is a substring of t that must be detected by the regular expression 
– if s is empty the pair is considered as a negative example 
• Generating initial population 
– Population size = 2 * |T| 
– Half generated from examples 
• Each digit is replaced by \d 
• Each character sequence is replaced by \w 
– Half generated randomly by the ramped half-and-half method 
• Half full/bushy trees 
• Half diverse trees 
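The example-based half of the initial population can be sketched as follows. The helper `seed_pattern` is hypothetical. Two details are assumptions made so that each seed still matches its own example: a letter run becomes `\w+` rather than a bare `\w`, and all other characters are escaped literally.

```python
import re

# Split an example into single digits, letter runs, and any other single character.
TOKEN = re.compile(r"(?P<digit>\d)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)

def seed_pattern(example):
    """Generalise a positive example s into a seed regular expression."""
    parts = []
    for match in TOKEN.finditer(example):
        if match.lastgroup == "digit":
            parts.append(r"\d")
        elif match.lastgroup == "letters":
            parts.append(r"\w+")  # assumption: quantified so the seed matches its example
        else:
            parts.append(re.escape(match.group(0)))
    return "".join(parts)

# Input set T: pairs (t, s); an empty s marks a negative example.
training_set = [
    ("Memory size: 8GB.", "8GB"),
    ("Colour: Graphite", ""),
]
population_size = 2 * len(training_set)
seeds = [seed_pattern(s) for _, s in training_set if s]
print(population_size, seeds)
```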
Operators: Crossover 
• Two point crossover 
• Individuals for crossover are selected with 
tournament selection method 
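Tournament selection can be sketched as below. The tournament size `k` is an assumption, since the slides do not state the value used.

```python
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Draw k individuals at random and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

# Toy usage: individuals are strings and fitness is their length.
population = ["a", "bbb", "cc", "dddd"]
parent_1 = tournament_select(population, len)
parent_2 = tournament_select(population, len)
print(parent_1, parent_2)
```

A larger `k` increases selection pressure, because weak individuals are less likely to win a tournament.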
Operators: Mutation 
• Two step mutation process 
1. Selecting the crossover operator 
2. Executing headless chicken crossover 
• cross an individual from the population with a randomly generated 
individual
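A minimal sketch of headless chicken crossover on a simplified tree encoding. The nested-list representation, the terminal set and all function names are assumptions for illustration, and a single subtree swap stands in for the two-point crossover named on the previous slide.

```python
import random

# A tree is either a leaf string or a list [label, child, child, ...].
TERMINALS = [r"\d", r"\w", r"\s", "."]

def paths(tree, prefix=()):
    """Yield the index path of every node in the tree."""
    yield prefix
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, subtree):
    """Return a copy of tree with the node at path swapped for subtree."""
    if not path:
        return subtree
    copy = list(tree)
    copy[path[0]] = replace(tree[path[0]], path[1:], subtree)
    return copy

def random_tree(rng, depth=2):
    """Grow a small random tree of concatenations over TERMINALS."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return ["concat", random_tree(rng, depth - 1), random_tree(rng, depth - 1)]

def crossover(receiver, donor, rng):
    """Graft a random subtree of donor onto a random node of receiver."""
    target = rng.choice(list(paths(receiver)))
    graft = get(donor, rng.choice(list(paths(donor))))
    return replace(receiver, target, graft)

def headless_chicken(individual, rng):
    """Mutate by crossing the individual with a freshly generated random one."""
    return crossover(individual, random_tree(rng), rng)

rng = random.Random(0)
parent = ["concat", r"\d", ["concat", r"\w", "."]]
print(headless_chicken(parent, rng))
```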
Fitness Function 
• Matthews correlation coefficient 
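A sketch of how the Matthews correlation coefficient could score one candidate expression against the training pairs (t, s). The way outcomes are counted here is an assumption: a positive example counts as a true positive only if the first match equals s, and a negative example counts as a false positive if anything matches.

```python
import math
import re

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def fitness(pattern, examples):
    """Score a regex on (text, target) pairs; an empty target is a negative example."""
    regex = re.compile(pattern)
    tp = tn = fp = fn = 0
    for text, target in examples:
        match = regex.search(text)
        found = match.group(0) if match else ""
        if target:
            if found == target:
                tp += 1
            else:
                fn += 1
        elif found:
            fp += 1
        else:
            tn += 1
    return mcc(tp, tn, fp, fn)

examples = [("Memory size: 8GB.", "8GB"), ("Colour: Graphite", "")]
print(fitness(r"\d+GB", examples), fitness(r"\w+", examples))
```

The coefficient ranges from -1 to 1, so an expression that extracts the wrong span and also fires on negative examples is penalised rather than merely scored zero.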
Attribute Extraction: Experimental Setting 
• 5,000 products from the WDC dataset 
• Training set 
– 500 product specifications from the Amazon catalogue 
– Positive examples: from the property to be extracted 
– Negative examples: from other properties at random or 
random text 
• 5 attributes 
– Model, Storage, Display, Processor, Dimension 
Attribute Extraction: Evaluation 
• Learned regular expressions 
– Model: (?:[^\d]+\s[a-z0-9]+)*+ 
– Storage: (?:\d+[^B]+[B]+)++ 
– Display: \d+.[^nc]*+nc[^o]*+ 
– Processor: \d+\s?[^z]++z 
– Dimension: \d[^.]x?[\d]++ 
• F-measure 
– 89.4% for numeric or simple combination of numbers and 
letters (Display, Storage, Processor, Dimension) 
– 94.2% for Dimension (best) 
– 77.2% for Model (worst) 
F-Measure: Model Property 
F-Measure: Dimension Property 
Identity Resolution: Experimental Setting 
• Input 
– 5,000 products with extracted product attributes 
– Matching against 20 electronics products from the Amazon 
product catalogue 
• Gold standard 
– 5,000 links manually annotated, 2,500 positive/2,500 negative 
• Baseline 
– Pairwise matching of just title and description 
– Jaccard similarity measure, extracting patterns with regex 
• Tool: Silk Link Discovery Framework 
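For reference, the Jaccard similarity used by the baseline can be sketched as below. Lowercasing and whitespace tokenisation are assumptions made for this example; the measure as implemented in Silk may tokenise differently.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase whitespace-token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # convention chosen here for two empty strings
    return len(tokens_a & tokens_b) / len(union)

title_1 = "AppleMacBook Air MC968/A 11.6-Inch Laptop"
title_2 = "Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7"
print(jaccard(title_1, title_2))
```

The two titles describe the same laptop yet share only the token "air", which illustrates why matching on raw titles alone is fragile.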
Identity Resolution: Evaluation 
Approach                       Precision %   Recall %   F-Measure % 
Baseline                       69            90         78.1 
Manual configuration *         85            80         82.4 
Learned Regular Expressions    80            84         81.9 
* See Petrovski, Bryl, Bizer. Integrating Product Data from Websites offering Microdata Markup. DEOS @ WWW 2014 
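The F-measure column is the harmonic mean of precision and recall. A short check, using only the rounded precision and recall values from the table (the function name `f_measure` is chosen for this sketch):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

rows = {
    "Baseline": (69, 90),
    "Manual configuration": (85, 80),
    "Learned regular expressions": (80, 84),
}
for name, (precision, recall) in rows.items():
    print(f"{name}: {f_measure(precision, recall):.2f}")
```

The recomputed values agree with the reported ones to within the rounding of the precision and recall inputs.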
Conclusions 
• By using Microdata, thousands of websites help us to 
understand their content 
• We have presented the 5-step data integration pipeline 
– From Microdata markup to an integrated dataset 
• The pre-processing (attribute extraction) step is crucial for the 
precision of data integration 
– In cases where the input data is not structured enough 
• Learning regular expressions allows us to achieve matching 
quality similar to that of manually configured pre-processing 
Conclusions 
• Future work 
– Change the “select first match” strategy in attribute extraction 
• Rank all available matches and then use this information during extraction 
– Look at an elitist strategy 
• Keep top 1% when it comes to breeding 
– Apply the approach to other domains 
• Local businesses, job announcements, addresses, … 
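The elitist strategy mentioned above could look roughly like this. It is a sketch of the proposed future work, not something the presented system does; the function name and the rule of keeping at least one individual are assumptions.

```python
import math

def split_elite(population, fitness, fraction=0.01):
    """Return (elite, rest): the top fraction by fitness and the remaining individuals."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, math.ceil(len(ranked) * fraction))  # keep at least one individual
    return ranked[:n_elite], ranked[n_elite:]

# Toy usage: individuals are strings and fitness is their length.
elite, rest = split_elite(["a", "bbb", "cc", "dddd"], len, fraction=0.25)
print(elite, rest)
```

The elite would be copied unchanged into the next generation, while the rest is replaced by offspring from crossover and mutation.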
Questions? 

Botany krishna series 2nd semester Only Mcq type questions
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 

Learning Regular Expressions for the Extraction of Product Attributes from E-commerce Microdata

  • 1. Learning Regular Expressions for the Extraction of Product Attributes from E-commerce Microdata Petar Petrovski, Volha Bryl, Christian Bizer Data and Web Science Research Group University of Mannheim, Germany LD4IE @ ISWC'2014, October 20, 2014, Riva del Garda, Italy School of Business Informatics and Mathematics
  • 2. Outline
      1. HTML-embedded Data on the Web
      2. Data Integration Pipeline
      3. Learning regular expression
      4. Evaluation
        – Extraction of product attributes
        – Identity resolution for products
      5. Conclusions
  • 3. HTML-embedded Data
      More and more websites semantically mark up the content of their HTML pages.
      – Microformats
      – Microdata
      – RDFa
  • 4. Schema.org
      • ask site owners to embed data to enrich search results.
      • 200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, …
      • Encoding: Microdata or RDFa
  • 5. Usage of Schema.org Data @ Google
      • Data snippets within search results
      • Data snippets within info boxes
  • 6. Websites Containing Structured Data (November 2013)
      • 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26%).
      • 1.7 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13%)
      • http://webdatacommons.org/structureddata/
      • Google, October 2013: 15% of all websites provide structured data.
  • 7. Top Classes, Microdata (2013)
      • schema = Schema.org
      • datavoc = Google’s Rich Snippet Vocabulary
  • 8. Outline
      1. HTML-embedded Data on the Web
      2. Data Integration Pipeline
      3. Learning regular expression
      4. Evaluation
        – Extraction of product attributes
        – Identity resolution for products
      5. Conclusions
  • 9. The Data Integration Pipeline
      • Objective: integrate all data found on the web describing a specific entity (e.g. product or organization)
      • Motivation: enables creation of powerful applications, e.g. comparison shopping portals
      • Our use case: product data, electronics & computers
      • Product classification and data fusion are out of the scope of this presentation
      • More details in Petrovski, Bryl, Bizer. Integrating Product Data from Websites offering Microdata Markup. DEOS @ WWW 2014
  • 10. Web Data Commons Dataset
      • Web Data Commons project: extracts structured data from the Common Crawl corpora
        – http://webdatacommons.org/
        – http://commoncrawl.org/
      • Our evaluation dataset is extracted from Common Crawl 2012
        – 3 billion HTML pages, 40.6 million websites
        – 7.3 billion statements describing 1.15 billion things
        – 9.4 million product offers from 9240 e-shops
      • 1.9 million products with English descriptions longer than 20 words
  • 11. Problem: Product Matching by Titles and Descriptions
      Most common attributes: Title: 89%, Description: 67%, others: scarce
      Title: AppleMacBook Air MC968/A 11.6-Inch Laptop
      Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best resolution…
      Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7
      Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon
      • Different descriptions follow different levels of detail
      • Various abbreviations can be found describing same features
      • Often imprecise values due to rounding in numeric values can be found
  • 12. Product Feature Extraction
      • Low precision (69%) for identity resolution without product feature extraction, reason – lack of structure
  • 13. Product Feature Extraction
      • Low precision (69%) for identity resolution without product feature extraction, reason – lack of structure
      • We developed the Free Text Preprocessor
        – Makes the data more structured by extracting new property-value pairs from free-text properties
        – https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor
      • With pre-processing precision goes up to 85%
      We used the Silk framework for identity resolution: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
  • 14. Free Text Preprocessor by Example
      <http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
      <http://wdc.org/resource/2> <http://schema.org/Product/description> "Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
  • 15. Free Text Preprocessor by Example
      <http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
      <http://wdc.org/resource/2> <http://schema.org/Product/description> "Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .
  • 16. Free Text Preprocessor by Example
      <http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
      <http://wdc.org/resource/2> <http://schema.org/Product/description> "Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .
      The preprocessor worked efficiently when the regular expressions for extracting certain attributes were configured manually.
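The example on slides 14 to 16 can be illustrated with a minimal sketch of manually configured extraction. The two patterns and the function name `extract_pairs` are assumptions made for illustration; the slides do not show the actual manual configuration of the Free Text Preprocessor.

```python
import re

# Hypothetical hand-written patterns (not taken from the slides).
PATTERNS = {
    "Storage": re.compile(r"Memory size:\s*(\d+\s?GB)"),
    "Display": re.compile(r"Display:\s*(\d+(?:\.\d+)?-inch)"),
}

def extract_pairs(description):
    """Return new property-value pairs found in a free-text description."""
    pairs = {}
    for prop, pattern in PATTERNS.items():
        match = pattern.search(description)
        if match:
            pairs[prop] = match.group(1)
    return pairs

description = (
    "Memory size: 8GB. Colour: Graphite Generation: 6th generation. "
    "Memory type: Integrated. Weight: 21.1g. Radio: With Radio. "
    "Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch"
)
pairs = extract_pairs(description)
for prop, value in pairs.items():
    # Emit the pairs in the triple notation used on the slides.
    print(f'<http://wdc.org/resource/2> <http://schema.org/Product/{prop}> "{value}" .')
```

Writing such patterns by hand requires knowing both regex syntax and the data, which is the effort the learning approach aims to remove.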
  • 17. Outline
      1. HTML-embedded Data on the Web
      2. Data Integration Pipeline
      3. Learning regular expression
      4. Evaluation
        – Extraction of product attributes
        – Identity resolution for products
      5. Conclusions
  • 18. Solution: Learning Regular Expressions
      • No more manual configuration
        – Familiarity with regex syntax and deep understanding of the data is no longer required from the user
      • Approach: Genetic Programming
      • Based on
        – Li et al. Regular expressions learning for information extraction. EMNLP’08.
        – Langdon et al. Creating regular expressions as mRNA motifs with GP to predict human exon splitting. GECCO’09.
        – Bartoli et al. Automatic generation of regular expressions from examples with genetic programming. GECCO’12.
  • 19. Learning Regular Expressions
      • Every individual is a tree representing a valid regex
      • Possible nodes
        – Concatenate node ||
        – Possessive quantifiers: *+, ++, ?+, {m,n}+
        – Group operator ()
        – Character class node []
      • Terminal nodes
        – Constants (d, 5, abc)
        – Ranges (a-z, 0-9)
        – Character classes (\w or \d)
        – Wildcards (.)
        – Whitespaces (\s)
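The tree representation from slide 19 can be sketched as a handful of node classes that each render to regex text. The class names and the choice to render groups as non-capturing `(?:...)` are assumptions for illustration; the slides only list the node kinds.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Terminal:
    """Constant, range, character class shorthand, wildcard or whitespace."""
    text: str
    def render(self) -> str:
        return self.text

@dataclass
class Concat:
    """Concatenate node: children rendered left to right."""
    children: List[object]
    def render(self) -> str:
        return "".join(child.render() for child in self.children)

@dataclass
class Group:
    """Group operator, rendered here as a non-capturing group (assumption)."""
    child: object
    def render(self) -> str:
        return "(?:" + self.child.render() + ")"

@dataclass
class CharClass:
    """Character class node [], optionally negated."""
    children: List[object]
    negated: bool = False
    def render(self) -> str:
        inner = "".join(child.render() for child in self.children)
        return "[" + ("^" if self.negated else "") + inner + "]"

@dataclass
class Quantifier:
    """Quantifier node; symbol is e.g. '+', '*+' or '++'."""
    child: object
    symbol: str
    def render(self) -> str:
        return self.child.render() + self.symbol

# Tree for the Storage expression reported on slide 26.
storage_tree = Quantifier(
    Group(Concat([
        Quantifier(Terminal(r"\d"), "+"),
        Quantifier(CharClass([Terminal("B")], negated=True), "+"),
        Quantifier(CharClass([Terminal("B")]), "+"),
    ])),
    "++",
)
print(storage_tree.render())
```

Because every node renders its own fragment, swapping subtrees between two individuals still yields a syntactically valid expression as long as the node types are compatible.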
  • 20. Training Data and Initial Population
      • Input set T: pairs of strings (t,s)
        – t is a text string
        – s is the substring of t that the regular expression must detect
        – if s is empty the pair is considered as a negative example
      • Generating initial population
        – Population size = 2 * |T|
        – Half generated from examples
          • Each digit is replaced by \d
          • Each character sequence is replaced by \w
        – Half generated randomly by the ramped half-and-half method
          • Half full/bushy trees
          • Half diverse trees
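The example-based half of the initial population can be sketched as a small generalization function. Two details are assumptions not stated on the slide: a run of letters becomes `\w+` (rather than a single `\w`), and any other character is kept as an escaped literal.

```python
import re

def generalize(example):
    """Turn a positive example string into a seed regex."""
    parts = []
    # Tokens: a single digit, a run of non-digit word characters, or one other character.
    for token in re.findall(r"\d|[^\W\d]+|\W", example):
        if re.fullmatch(r"\d", token):
            parts.append(r"\d")
        elif re.fullmatch(r"[^\W\d]+", token):
            parts.append(r"\w+")
        else:
            parts.append(re.escape(token))
    return "".join(parts)

def seed_population(training_pairs):
    """Seed regexes from the positive pairs (t, s) where s is non-empty."""
    return [generalize(s) for _, s in training_pairs if s]

seeds = seed_population([
    ("Memory size: 8GB.", "8GB"),
    ("Display: 1.5-inch", "1.5-inch"),
    ("Colour: Graphite", ""),  # negative example, produces no seed
])
print(seeds)
```

A seed produced this way already matches its own example and near variants such as "2.7-inch", which gives the search a better starting point than purely random trees.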
  • 21. Operators: Crossover
      • Two-point crossover
      • Individuals for crossover are selected with tournament selection method
  • 22. Operators: Mutation
      • Two step mutation process
        1. Selecting the crossover operator
        2. Executing headless chicken crossover
          • cross an individual from the population with a randomly generated individual
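The operators on slides 21 and 22 can be sketched as follows. This is a simplified illustration under stated assumptions: individuals are flat token lists instead of trees, the tournament size `k` is hypothetical, and `random_individual` is a caller-supplied generator; none of these details are given on the slides.

```python
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Pick k random individuals and return the fittest one."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

def two_point_crossover(a, b, rng=random):
    """Swap the segment between two cut points of two token lists."""
    n = min(len(a), len(b))
    i, j = sorted(rng.sample(range(n + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def headless_chicken(individual, random_individual, rng=random):
    """Mutation: cross an individual with a freshly generated random one."""
    child, _ = two_point_crossover(individual, random_individual(rng), rng)
    return child

rng = random.Random(0)
parent_a = [r"\d", "+", "G", "B"]
parent_b = [r"\w", "*", "x", r"\s"]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
mutant = headless_chicken(parent_a, lambda r: [r.choice([r"\d", r"\w", "."]) for _ in range(4)], rng)
```

With flat lists a crossover can produce an invalid expression; operating on trees, as the slides describe, avoids that because only whole subtrees are exchanged.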
  • 23. Fitness Function
      • Matthews correlation coefficient
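The Matthews correlation coefficient is computed from the confusion counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below applies it to training pairs (t, s). How a match is turned into a confusion count is an assumption here (an exact match of s counts as a true positive, any non-empty match on a negative example as a false positive); the slide names only the coefficient.

```python
import math
import re

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def fitness(regex, training_pairs):
    """Score a candidate regex on pairs (t, s); empty s marks a negative example."""
    tp = tn = fp = fn = 0
    for text, expected in training_pairs:
        match = re.search(regex, text)
        found = match.group(0) if match else ""
        if expected:
            if found == expected:
                tp += 1
            else:
                fn += 1
        elif found:
            fp += 1
        else:
            tn += 1
    return mcc(tp, tn, fp, fn)

pairs = [
    ("Memory size: 8GB.", "8GB"),
    ("64 GB SSD", "64 GB"),
    ("Colour: Graphite", ""),
    ("With Radio", ""),
]
score = fitness(r"\d+\s?GB", pairs)
```

Unlike plain accuracy, the coefficient stays informative when positive and negative examples are unbalanced, ranging from -1 (always wrong) to 1 (always right).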
  • 24. Outline
      1. HTML-embedded Data on the Web
      2. Data Integration Pipeline
      3. Learning regular expression
      4. Evaluation
        – Extraction of product attributes
        – Identity resolution for products
      5. Conclusions
  • 25. Attribute Extraction: Experimental Setting
      • 5,000 products from the WDC dataset
      • Training set
        – 500 product specifications from the Amazon catalogue
        – Positive examples: from the property to be extracted
        – Negative examples: from other properties at random or random text
      • 5 attributes
        – Model, Storage, Display, Processor, Dimension
  • 26. Attribute Extraction: Evaluation
      • Learned regular expressions
        – Model: (?:[^\d]+\s[a-z0-9]+)*+
        – Storage: (?:\d+[^B]+[B]+)++
        – Display: \d+.[^nc]*+nc[^o]*+
        – Processor: \d+\s?[^z]++z
        – Dimension: \d[^.]x?[\d]++
      • F-measure
        – 89.4% for numeric or simple combination of numbers and letters (Display, Storage, Processor, Dimension)
        – 94.2% for Dimension (best)
        – 77.2% for Model (worst)
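To see how the learned expressions behave, the sketch below applies three of them to short value strings using a "select first match" strategy. One deliberate substitution: the possessive quantifiers (`++`, `*+`) are replaced by their greedy forms (`+`, `*`) so the code runs on Python versions whose `re` module lacks possessive quantifiers. Greedy and possessive matching can differ on some inputs, so this is an approximation of the learned patterns, not a faithful copy.

```python
import re

# Greedy rewrites of the learned patterns (assumption: "++" -> "+", "*+" -> "*").
LEARNED = {
    "Storage": r"(?:\d+[^B]+[B]+)+",
    "Display": r"\d+.[^nc]*nc[^o]*",
    "Processor": r"\d+\s?[^z]+z",
}

def extract_first(attribute, text):
    """Return the first match of the attribute's pattern, or None."""
    match = re.search(LEARNED[attribute], text)
    return match.group(0) if match else None

storage = extract_first("Storage", "Memory size: 8GB")
display = extract_first("Display", "Display: 1.5-inch")
processor = extract_first("Processor", "1.6GHz")
print(storage, display, processor)
```

On longer free text the first match is not always the intended value, because a pattern such as the Display one can start at an earlier, unrelated number; this is the limitation that ranking all available matches would address.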
  • 27. F-Measure: Model Property
  • 28. F-Measure: Dimension Property
  • 29. Identity Resolution: Experimental Setting
      • Input
        – 5,000 products with extracted product attributes
        – Matching against 20 electronics products from the Amazon product catalogue
      • Gold standard
        – 5,000 links manually annotated, 2,500 positive/2,500 negative
      • Baseline
        – Pairwise matching of just title and description
        – Jaccard similarity measure, extracting patterns with regex
      • Tool: Silk Link Discovery Framework
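The Jaccard similarity used in the baseline is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; lower-casing and whitespace tokenization are assumptions, since the slides do not say how titles and descriptions were tokenized.

```python
def jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

similarity = jaccard("Apple MacBook Air", "apple macbook pro")
print(similarity)
```

Token overlap alone treats "4 GB" and "4096 MB" as unrelated, which is one reason matching on extracted attributes can improve precision over matching raw titles and descriptions.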
  • 30. Identity Resolution: Evaluation
      Approach                       Precision %   Recall %   F-Measure %
      Baseline                       69            90         78.1
      Manual configuration *         85            80         82.4
      Learned Regular Expressions    80            84         81.9
      * See Petrovski, Bryl, Bizer. Integrating Product Data from Websites offering Microdata Markup. DEOS @ WWW 2014
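Assuming the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall), the last column can be recomputed from the first two. The sketch below does this; small deviations are expected because the precision and recall values on the slide are themselves rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

results = {
    "Baseline": (69, 90, 78.1),
    "Manual configuration": (85, 80, 82.4),
    "Learned Regular Expressions": (80, 84, 81.9),
}
recomputed = {name: f1(p, r) for name, (p, r, _) in results.items()}
for name, (p, r, reported) in results.items():
    print(f"{name}: reported {reported}, recomputed {recomputed[name]:.2f}")
```

The recomputed values agree with the slide to within 0.1 percentage points, which is consistent with the F1 reading.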
  • 31. Conclusions
      • By using Microdata, thousands of websites help us to understand their content
      • We have presented the 5-step data integration pipeline
        – From Microdata markup to an integrated dataset
      • Pre-processing (attribute extraction) step is crucial for the precision of data integration
        – In cases where the input data is not structured enough
      • Learning regular expressions allows us to achieve similar matching quality to that of manually configured pre-processing
  • 32. Conclusions
      • Future work
        – Change the “select first match” strategy in attribute extraction
          • Rank all available matches and then use this information during extraction
        – Look at elitist strategy
          • Keep top 1% when it comes to breeding
        – Apply the approach to other domains
          • Local businesses, job announcements, addresses, …
  • 33. Questions?

Editor's Notes

  1. Rather 500+ classes
  2. Be aware: Only sample, taken using PageRank 2012: 369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%). 2.29 million websites (PLDs) out of 40.6 million provide Microformat, Microdata or RDFa data (5.65%) March 2014 CC data already available!
  3. The content and the vocabularies are very focused towards the major consumers (Google, Yahoo, Bing, Facebook). Providing structured data has become an SEO topic. The data structures are rather simple (mostly atomic entities).
  4. Based on Anything To Triples (any23) library for extracting structured data: http://any23.apache.org Code available at: https://subversion.assembla.com/svn/commondata/
  5. Already give an example of the data and the problems at this point, as it is necessary for understanding the baseline. Maybe change slide accordingly.
  6. Questions about the example!!! Intermediate || nodes - ??
  7. Initial version: Training set 20 products from amazon.com WDC product subset 5,000 products from the WDC product dataset
  8. Explanation of the learned regular expressions:
      – Model: matches zero or more words possessively (*+) that don’t start with a digit [^\d] and have characters in the range [a-z0-9]
      – Storage: matches one or more groups (?:)++ that have the pattern digit, followed by not B, followed by B
      – Display: matches input that has a digit one or more times, followed by any character, followed by characters that are not “n” or “c”, followed by the characters “nc”…
      – Processor: same as Display
      – Dimension: matches two things: something like 14x12… (and so on if present; based on matching subsequences in a string) and numbers with more than 3 digits
  9. F is F1?
  10. Extracting patterns: From the Title: .*(\w+_[a_zA-Z0-9]+)_\d.*(gb|hd|p[x]|inche?s?|m).*\$ From the Description: number/unit_of_measurement
  11. I would also show this slide! ----- Meeting Notes (10/25/13 15:41) ----- link to other tools yuima
  12. NB compared to KNN and SVM Features generation 4 step process – tokenizing and removing stop words, pruning, n-grams, TF-IDF ~3600 features Training the model Naïve Bayes Classifier Starting from 9.4 million products: Products with English descriptions longer than 20 words => 1,986,359 products from 9,240 e-shops
  13. NB compared to KNN and SVM Features generation 4 step process – tokenizing and removing stop words, pruning, n-grams, TF-IDF ~3600 features