Learning Regular Expressions for the Extraction of Product Attributes from E-commerce Microdata
1. Learning Regular Expressions
for the Extraction of Product Attributes
from E-commerce Microdata
Petar Petrovski, Volha Bryl, Christian Bizer
Data and Web Science Research Group
University of Mannheim, Germany
LD4IE @ ISWC'2014, October 20, 2014, Riva del Garda, Italy
School of Business Informatics and Mathematics
2. Outline
1. HTML-embedded Data on the Web
2. Data Integration Pipeline
3. Learning regular expression
4. Evaluation
– Extraction of product attributes
– Identity resolution for products
5. Conclusions
Learning Regular Expressions for the Extraction of Product Attributes from E-Commerce Microdata.
Petar Petrovski, Volha Bryl, Chris Bizer
3. HTML-embedded Data
More and more websites semantically mark up the
content of their HTML pages.
Microformats
Microdata
RDFa
4. Schema.org
• Search engines ask site owners to embed
data to enrich search results.
• 200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, …
• Encoding: Microdata or RDFa
5. Usage of Schema.org Data @ Google
• Data snippets within search results
• Data snippets within info boxes
6. Websites Containing Structured Data
(November 2013)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26%).
1.7 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13%)
http://webdatacommons.org/structureddata/
Google, October 2013:
15% of all websites provide structured data.
7. Top Classes, Microdata (2013)
• schema = Schema.org
• datavoc = Google's Rich Snippet Vocabulary
8. Outline
1. HTML-embedded Data on the Web
2. Data Integration Pipeline
3. Learning regular expression
4. Evaluation
– Extraction of product attributes
– Identity resolution for products
5. Conclusions
9. The Data Integration Pipeline
• Objective: integrate all data found on the web describing a
specific entity (e.g. product or organization)
• Motivation: enables creation of powerful applications, e.g.
comparison shopping portals
• Our use case: product data, electronics & computers
• Product classification and data fusion are out of the scope of this presentation
• More details in Petrovski, Bryl, Bizer. Integrating Product Data from Websites
offering Microdata Markup. DEOS @ WWW 2014
10. Web Data Commons Dataset
• Web Data Commons project: extracts structured data from the Common
Crawl corpora
– http://webdatacommons.org/
– http://commoncrawl.org/
• Our evaluation dataset is extracted from Common Crawl 2012
– 3 billion HTML pages, 40.6 million websites
– 7.3 billion statements describing 1.15 billion things
– 9.4 million product offers from 9,240 e-shops
• 1.9 million products with English descriptions longer than 20 words
11. Problem: Product Matching by Titles and Descriptions
Title: Apple MacBook Air MC968/A 11.6-Inch Laptop
Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best resolution…
Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7
Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon
• Different descriptions follow different levels of detail
• Various abbreviations can be found describing the same features
• Often imprecise values due to rounding in numeric values can be found
• Most common attributes: Title (89%), Description (67%); others: scarce
13. Product Feature Extraction
• Low precision (69%) for identity resolution without product feature
extraction, reason – lack of structure
• We developed the Free Text Preprocessor
– Makes the data more structured by extracting new property-value
pairs from free-text properties
– https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor
• With pre-processing, precision goes up to 85%
We used the Silk framework for identity resolution: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
14. Free Text Preprocessor by Example
<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
<http://wdc.org/resource/2> <http://schema.org/Product/description>
"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g.
Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
15. Free Text Preprocessor by Example
<http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .
<http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .
<http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .
<http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .
16. Free Text Preprocessor by Example
The preprocessor worked efficiently when the regular expressions for
extracting certain attributes were configured manually.
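A minimal sketch of what manually configured extraction could look like, in Python. The two patterns and the `extract_attributes` helper are illustrative assumptions, not the actual configuration or API of the Silk Free Text Preprocessor. The input is the iPod nano description from the example.

```python
import re

# Hypothetical hand-written patterns, one per attribute (illustration only).
MANUAL_PATTERNS = {
    "Storage": re.compile(r"\b\d+\s?(?:GB|TB|MB)\b", re.IGNORECASE),
    "Display": re.compile(r"\b\d+(?:\.\d+)?-inch\b", re.IGNORECASE),
}


def extract_attributes(text):
    """Return {attribute: first matching substring} for every pattern that matches."""
    found = {}
    for name, pattern in MANUAL_PATTERNS.items():
        match = pattern.search(text)  # only the first match is used
        if match:
            found[name] = match.group(0)
    return found


description = (
    "Memory size: 8GB. Colour: Graphite Generation: 6th generation. "
    "Memory type: Integrated. Weight: 21.1g. Radio: With Radio. "
    "Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch"
)
attributes = extract_attributes(description)
```

Each extracted pair would then be emitted as a new property-value statement for the product, as in the Storage and Display triples of the example.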
17. Outline
1. HTML-embedded Data on the Web
2. Data Integration Pipeline
3. Learning regular expression
4. Evaluation
– Extraction of product attributes
– Identity resolution for products
5. Conclusions
18. Solution: Learning Regular Expressions
• No more manual configuration
– Familiarity with regex syntax and deep understanding of
the data is no longer required from the user
• Approach: Genetic Programming
• Based on
– Li et al. Regular expressions learning for information extraction.
EMNLP’08.
– Langdon et al. Creating regular expressions as mRNA motifs with
GP to predict human exon splitting. GECCO’09.
– Bartoli et al. Automatic generation of regular expressions from
examples with genetic programming. GECCO’12.
19. Learning Regular Expressions
• Every individual is a tree representing a valid regex
• Possible nodes
– Concatenate node ||
– Possessive quantifiers
• *+, ++, ?+, {m,n}+
– Group operator ()
– Character class node []
• Terminal nodes
– Constants (d, 5, abc)
– Ranges (a-z, 0-9)
– Character classes (\w or \d)
– Wildcards (.)
– Whitespace (\s)
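The tree representation can be sketched as follows. This is one possible encoding, not the authors' implementation: the class names are hypothetical, and the sample tree assumes that the escapes in the slide text are `\d` and `\s`. Rendering a tree yields the regex string of the individual.

```python
from dataclasses import dataclass


class Node:
    """Base class for regex tree nodes (hypothetical naming)."""

    def render(self) -> str:
        raise NotImplementedError


@dataclass(frozen=True)
class Terminal(Node):
    text: str  # constant, range, character class, wildcard or whitespace

    def render(self) -> str:
        return self.text


@dataclass(frozen=True)
class Concat(Node):
    left: Node
    right: Node

    def render(self) -> str:
        return self.left.render() + self.right.render()


@dataclass(frozen=True)
class Quantified(Node):
    child: Node
    quantifier: str  # e.g. "*+", "++", "?+", "{1,3}+"

    def render(self) -> str:
        return self.child.render() + self.quantifier


@dataclass(frozen=True)
class Group(Node):
    child: Node

    def render(self) -> str:
        return "(?:" + self.child.render() + ")"


@dataclass(frozen=True)
class CharClass(Node):
    child: Node
    negated: bool = False

    def render(self) -> str:
        return "[" + ("^" if self.negated else "") + self.child.render() + "]"


# A tree for a pattern of the shape "digits, optional space, anything up to a z".
tree = Concat(
    Quantified(Terminal(r"\d"), "+"),
    Concat(
        Quantified(Terminal(r"\s"), "?"),
        Concat(Quantified(CharClass(Terminal("z"), negated=True), "++"), Terminal("z")),
    ),
)
regex_string = tree.render()
```

Because every node renders to a syntactically valid fragment, swapping subtrees between two individuals still yields a valid regex, which is what makes the tree form convenient for genetic operators.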
20. Training Data and Initial Population
• Input set T: pairs of strings (t,s)
– t is a text string
– s is a substring of t that the regular expression must extract
– if s is empty, the pair is treated as a negative example
• Generating initial population
– Population size = 2 * |T|
– Half generated from examples
• Each digit is replaced by \d
• Each character sequence is replaced by \w
– Half generated randomly by the ramped half-and-half method
• Half full/bushy trees
• Half diverse trees
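The example-driven half of the initial population can be sketched like this. It is a rough illustration under assumptions: a run of letters becomes `\w+` (the slide only says it is replaced by the word class), all other characters are kept as escaped literals, and the randomly generated half is omitted. The function names are hypothetical.

```python
import re


def generalize_example(s: str) -> str:
    """Turn a positive example string into a seed regex."""
    parts = []
    for token in re.findall(r"\d|[A-Za-z]+|.", s, flags=re.DOTALL):
        if re.fullmatch(r"\d", token):
            parts.append(r"\d")  # each digit becomes \d
        elif re.fullmatch(r"[A-Za-z]+", token):
            parts.append(r"\w+")  # assumption: a letter run becomes \w+
        else:
            parts.append(re.escape(token))  # punctuation etc. stays literal
    return "".join(parts)


def seed_population(examples):
    """Build seed regexes from the positive examples (pairs with non-empty s)."""
    return [generalize_example(s) for _, s in examples if s]


examples = [
    ("Memory size: 8GB.", "8GB"),
    ("Colour: Graphite", ""),  # negative example: nothing to extract
    ("Display: 1.5-inch", "1.5-inch"),
]
seeds = seed_population(examples)
```

A seed derived from "8GB" already matches other storage values such as "4GB", so the search starts from individuals that cover part of the training set.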
21. Operators: Crossover
• Two-point crossover
• Individuals for crossover are selected with the
tournament selection method
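Tournament selection can be sketched as below. The tournament size `k`, the toy population and its fitness scores are made up for illustration; the slides do not state the parameters that were used.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Draw k distinct individuals at random and return the fittest of them."""
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)


# Toy population of regex strings with invented fitness values.
scores = {r"\d+": 0.2, r"\d+\s?GB": 0.9, r"\w+": 0.1, r"\d+[^B]+B": 0.7}
population = list(scores)

rng = random.Random(0)
parent_a = tournament_select(population, scores.get, k=2, rng=rng)
parent_b = tournament_select(population, scores.get, k=2, rng=rng)
```

A small `k` keeps selection pressure low, so weaker individuals still get a chance to contribute subtrees; a tournament over the whole population always returns the current best.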
22. Operators: Mutation
• Two step mutation process
1. Selecting the crossover operator
2. Executing headless chicken crossover
• cross an individual from the population with a randomly generated
individual
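Headless chicken crossover can be sketched as follows. To keep the example short, a flat list of regex fragments stands in for the regex tree, and a single fixed two-point crossover is used, so step 1 (choosing the crossover operator) is not modelled. The fragment set and all function names are assumptions.

```python
import random
import re

# Assumption: each fragment is a valid regex on its own, so any
# concatenation of fragments is a valid regex as well.
FRAGMENTS = [r"\d+", r"\w+", r"\s?", ".", "[a-z]+", "[0-9]+", "GB"]


def random_individual(rng, length=3):
    """Generate a random individual as a list of fragments."""
    return [rng.choice(FRAGMENTS) for _ in range(length)]


def two_point_crossover(a, b, rng):
    """Replace the slice between two cut points of `a` with a slice of `b`."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    k, m = sorted(rng.sample(range(len(b) + 1), 2))
    return a[:i] + b[k:m] + a[j:]


def headless_chicken_mutation(individual, rng):
    """Mutate by crossing the individual with a freshly generated random one."""
    return two_point_crossover(individual, random_individual(rng), rng)


rng = random.Random(42)
parent = [r"\d+", r"\s?", "GB"]
child = headless_chicken_mutation(parent, rng)
compiled = re.compile("".join(child))  # the mutant is still a valid regex
```

The point of the technique is that mutation reuses the crossover machinery: the second parent carries no evolved information, so the exchanged material acts as random noise.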
23. Fitness Function
• Matthews correlation coefficient
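The Matthews correlation coefficient is (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it for a candidate regex over (t, s) training pairs. How matches are counted is an assumption made for illustration: the first match must equal s to count as a true positive, any non-empty match on a negative example is a false positive, and a wrong or missing match on a positive example is a false negative.

```python
import math
import re


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def fitness(pattern, examples):
    """Score a regex string against (text, expected_substring) pairs."""
    regex = re.compile(pattern)
    tp = tn = fp = fn = 0
    for text, expected in examples:
        match = regex.search(text)
        found = match.group(0) if match else ""
        if expected == "":  # negative example
            if found == "":
                tn += 1
            else:
                fp += 1
        elif found == expected:
            tp += 1
        else:
            fn += 1
    return mcc(tp, tn, fp, fn)


examples = [
    ("Memory size: 8GB.", "8GB"),
    ("64 GB SSD", "64 GB"),
    ("Colour: Graphite", ""),
    ("Weight: 21.1g.", ""),
]
good = fitness(r"\d+\s?GB", examples)
poor = fitness(r"\d+", examples)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so a regex that matches everything (or nothing) is not rewarded the way it could be by precision or recall alone.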
24. Outline
1. HTML-embedded Data on the Web
2. Data Integration Pipeline
3. Learning regular expression
4. Evaluation
– Extraction of product attributes
– Identity resolution for products
5. Conclusions
25. Attribute Extraction: Experimental Setting
• 5,000 products from the WDC dataset
• Training set
– 500 product specifications from the Amazon catalogue
– Positive examples: from the property to be extracted
– Negative examples: from other properties at random or
random text
• 5 attributes
– Model, Storage, Display, Processor, Dimension
26. Attribute Extraction: Evaluation
• Learned regular expressions
– Model: (?:[^\d]+\s[a-z0-9]+)*+
– Storage: (?:\d+[^B]+[B]+)++
– Display: \d+.[^nc]*+nc[^o]*+
– Processor: \d+\s?[^z]++z
– Dimension: \d[^.]x?[\d]++
• F-measure
– 89.4% for numeric values or simple combinations of numbers and
letters (Display, Storage, Processor, Dimension)
– 94.2% for Dimension (best)
– 77.2% for Model (worst)
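A sketch of applying a learned pattern with a first-match strategy. Two assumptions are made here: the Storage pattern is read with `\d` for the digit class, and its possessive `++` is written as a greedy `+` so that it also compiles on Python versions before 3.11 (for this input both forms give the same match). `extract_first` is a hypothetical helper.

```python
import re

# Assumed reading of the learned Storage pattern, greedy instead of possessive.
STORAGE = re.compile(r"(?:\d+[^B]+[B]+)+")


def extract_first(pattern, text):
    """Return the first match of the pattern in the text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


description = (
    "Memory size: 8GB. Colour: Graphite Generation: 6th generation. "
    "Memory type: Integrated. Weight: 21.1g."
)
storage = extract_first(STORAGE, description)
```

On longer texts the first match can start at an unrelated number and run up to the next "B", which is one reason why taking only the first match can be misleading.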
27. F-Measure: Model Property
28. F-Measure: Dimension Property
29. Identity Resolution: Experimental Setting
• Input
• 5,000 products with extracted product attributes
• Matching against 20 electronics products from the Amazon
product catalogue
• Gold standard
• 5,000 links manually annotated, 2,500 positive/2,500 negative
• Baseline
• Pairwise matching of just title and description
• Jaccard similarity measure, extracting patterns with regex
• Tool: Silk Link Discovery Framework
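For reference, the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The sketch below uses lowercased whitespace tokens, which is an assumption; it does not reproduce how Silk tokenizes or compares values.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercased whitespace-token sets of a and b."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here for two empty strings
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Shortened, illustrative titles in the style of the MacBook Air example.
similarity = jaccard(
    "Apple MacBook Air 11.6-Inch Laptop",
    "Apple MacBook Air 11-in",
)
```

The two titles share three of six distinct tokens; "11.6-Inch" and "11-in" count as different tokens, which illustrates why plain token overlap struggles with abbreviated or rounded values.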
30. Identity Resolution: Evaluation
                               Precision %   Recall %   F-Measure %
Baseline                       69            90         78.1
Manual configuration *         85            80         82.4
Learned Regular Expressions    80            84         81.9
* See Petrovski, Bryl, Bizer. Integrating Product Data from Websites offering Microdata Markup. DEOS @ WWW 2014
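A quick arithmetic check, assuming the F-measure reported here is the balanced F1 score, F1 = 2PR / (P + R), computed from the precision and recall values of the evaluation.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per configuration, as reported in the evaluation.
results = {
    "Baseline": (69, 90),
    "Manual configuration": (85, 80),
    "Learned Regular Expressions": (80, 84),
}
f_scores = {name: f1(p, r) for name, (p, r) in results.items()}
```

The computed values are roughly 78.1, 82.4 and 81.95, which agrees with the reported F-measures up to rounding of the precision and recall inputs.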
31. Conclusions
• By using Microdata, thousands of websites help us to
understand their content
• We have presented the 5-step data integration pipeline
– From Microdata markup to an integrated dataset
• Pre-processing (attribute extraction) step is crucial for the
precision of data integration
– In cases where the input data is not structured enough
• Learning regular expressions allows us to achieve matching quality
similar to that of manually configured pre-processing
32. Conclusions
• Future work
– Change the “select first match” strategy in attribute extraction
• Rank all available matches and then use this information during extraction
– Look at elitist strategy
• Keep top 1% when it comes to breeding
– Apply the approach to other domains
• Local businesses, job announcements, addresses, …
33. Questions?
Editor's Notes
Rather 500+ classes
Be aware: Only sample, taken using PageRank
2012:
369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%).
2.29 million websites (PLDs) out of 40.6 million provide Microformat, Microdata or RDFa data (5.65%)
March 2014 CC data already available!
The content and the vocabularies are very focused on the major consumers (Google, Yahoo, Bing, Facebook)
Providing structured data has become an SEO topic
The data structures are rather simple (mostly atomic entities)
Based on Anything To Triples (any23) library for extracting structured data: http://any23.apache.org
Code available at: https://subversion.assembla.com/svn/commondata/
Already give an example of the data and the problems at this point, as it is necessary for understanding the baseline.
Maybe change slide accordingly.
Questions about the example!!!
Intermediate || nodes - ??
Initial version:
Training set
20 products from amazon.com
WDC product subset
5,000 products from the WDC product dataset
Model: Matches zero or more words greedily(*+) that don’t start with a digit [^\d] and have characters in range [a-z0-9]
Storage: Matches one or more groups (?:)++ that have the pattern digit, followed by not B, followed by B
Display: Matches the input that has a digit one or more times followed by any character followed by characters that are not “nc” followed by the characters “nc”…
Processor: ---- Same as display -----
Dimension: Matches two things: something like 14x12… (and so on if it has; based on matching subsequences in a string) and numbers with more than 3 digits.
F is F1?
Extracting patterns:
From the Title: .*(\w+_[a_zA-Z0-9]+)_\d.*(gb|hd|p[x]|inche?s?|m).*\$
From the Description: number/unit_of_measurement
I would also show this slide!
----- Meeting Notes (10/25/13 15:41) -----
link to other tools yuima
NB compared to KNN and SVM
Features generation
4 step process – tokenizing and removing stop words, pruning, n-grams, TF-IDF
~3600 features
Training the model
Naïve Bayes Classifier
Starting from 9.4 million products:
Products with English descriptions longer than 20 words
=> 1,986,359 products from 9,240 e-shops