Unsupervised Extraction of Attributes and
Their Values from Product Description
Keiji Shinzato and Satoshi Sekine
Rakuten Institute of Technology
17th Oct. 2013
The 6th International Joint Conference on Natural Language Processing
2
What is Rakuten?
• Biggest e-commerce company in Japan.
• B2B2C model.
• Statistics:
– # of merchants: 40K+
– # of products: 100M+
– # of product categories: 40K+
• Each product page is assigned to a single product category by its merchant.
• Product information offered by merchants is described in a variety of ways.
– Not well organized :-(
3
Examples of product pages (wine category)
Table
Itemizations
Product data is offered by merchants using various methods.
4
Examples of product pages (wine category)
Product data is offered by merchants using various methods.
Full texts
5
Goal
• Develop an unsupervised methodology for
constructing structured data from full texts.
Full texts (unstructured data) ⇒ Structured data:
Attribute         Value
Color             Red
Production area   Italy, Tuscany
Grape variety     Merlot, Cabernet sauvignon, Petit verdot, Cabernet franc
Vintage           2010
Volume            750ml
6
Unsupervised information extraction
• Distant supervision [Mintz+ 2009]
– Construct an annotated corpus using an existing
Knowledge Base (KB).
– Train a model from the constructed corpus.
KB entry:
Founder: Hiroshi Mikitani
Training data for founder-company information extraction:
Hiroshi Mikitani is founder and CEO of the online marketing company Rakuten.
Machine learning ⇒ Extraction model
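The annotation step of distant supervision can be illustrated with a minimal sketch. It assumes the KB holds (founder, company) pairs and that a sentence mentioning both members of a pair is taken as a positive training example; the function name `label_sentence` and the pairing of the founder entry with Rakuten are illustrative assumptions, not details given on the slide.

```python
def label_sentence(sentence, kb):
    """Return one training example per KB pair whose two members both occur in the sentence.

    kb is a list of (founder, company) pairs (a hypothetical KB layout).
    """
    examples = []
    for founder, company in kb:
        if founder in sentence and company in sentence:
            examples.append({"founder": founder, "company": company, "text": sentence})
    return examples


# Hypothetical KB with a single entry, mirroring the slide's example.
kb = [("Hiroshi Mikitani", "Rakuten")]
sentence = "Hiroshi Mikitani is founder and CEO of the online marketing company Rakuten."
examples = label_sentence(sentence, kb)
```

Sentences that mention only one member of a pair yield no example, which is one way such a corpus ends up with missing annotations.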
7
Problem of existing KBs
• Wikipedia
– Infobox is not tailored towards e-commerce.
• Freebase
– Only available in English.
– Attributes and values are limited even in English.
Production area
Grape variety
Winery
Attributes in the infobox for the
wine article in Wikipedia.
Attributes for users seeking
their favorite wines.
Vintage
Gap
1. Construct a KB for product information extraction.
2. Remove false-positive and false-negative annotations in the automatically constructed corpus.
8
Agenda
• Background
• Overview of our approach
– Knowledge base induction
– Training data construction
– Extraction model training
– Product page structuring
• Experiments
• Conclusion and future work
9
Overview of our approach
Input: Product pages in the category C
Pages for model construction Pages that we want to structure
Winery: Bodegas Carchelo
Type: Medium body
Grape: Monastrell 40%, Syrah 40%,
Cabernet Sauvignon 20%
Type Red
Country Italy / Tuscany
Grape Sangiovese
Year 2011
Pages including
tables or itemizations
Unstructured pages
10
Overview of our approach
Input: Product pages in the category C
– Pages for model construction: pages including tables or itemizations, and unstructured pages
– Pages that we want to structure
1. Knowledge base induction ⇒ Knowledge base (KB): <attr1, value1>, <attr2, value2>, <attr1, value3>, ...
2. Training data construction ⇒ Annotated pages
3. Extraction model training ⇒ Extraction model
4. Product page structuring
Output: Structured data
11
KB induction - Extraction of attributes and their values -
• Attribute acquisition:
– Assumption: Expressions that often appear in table headers can be regarded as attributes.
– Extract expressions enclosed by <TH> tags.
• Attribute value extraction:
– Extract attribute-value pairs using regular expression patterns [Yoshinaga and Torisawa 2006].
– Store each <attr., val.> pair in the KB along with the number of merchants that use it in tables or itemizations (merchant frequency, MF).
Examples with MF:
<Production area, France> (29),
<Region, Italy> (13)
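The KB induction step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it pairs each `<TH>` cell with the `<TD>` cells that follow it in the same row, rather than using the regular expression patterns of [Yoshinaga and Torisawa 2006], and the names `ThTdPairs` and `induce_kb` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ThTdPairs(HTMLParser):
    """Collect (header, cell) pairs from rows shaped like <tr><th>..</th><td>..</td></tr>."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._tag = None      # "th" or "td" while inside such a cell
        self._header = None   # most recent <th> text in the current row
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._header = None
        elif tag in ("th", "td"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "th":
            self._header = text
        elif self._header and text:
            self.pairs.append((self._header, text))
        self._tag = None


def induce_kb(pages):
    """pages: iterable of (merchant_id, html). Returns {(attr, value): merchant frequency}."""
    merchants = defaultdict(set)
    for merchant_id, html in pages:
        parser = ThTdPairs()
        parser.feed(html)
        for pair in parser.pairs:
            merchants[pair].add(merchant_id)
    # Count distinct merchants, so a merchant repeating a pair on many pages counts once.
    return {pair: len(ids) for pair, ids in merchants.items()}
```

Counting distinct merchants instead of pages keeps a single merchant with many near-identical pages from dominating the frequencies.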
12
KB induction - Attribute synonym discovery -
• Assumption: Two attributes can be regarded as synonyms of each other if
– they do not appear together in the same structured data, and
– they share an identical popular value.
• Regard attribute pairs satisfying both conditions as synonyms.
• Aggregate similar pairs of attribute synonyms using the cosine measure.
Non-synonym: <Alcohol, 15 degree>, <Temperature, 15 degree>
Synonym: <Production area, France>, <Region, France>
Aggregation: (Production area, Region), (Country, Production area) ⇒ (Country, Region, Production area)
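The two synonym conditions can be sketched as below, under stated assumptions: a "popular" value is taken to be one of the `top_k` values of an attribute by merchant frequency, and the cosine-based aggregation is replaced here by a plain transitive merge of pairs that share an attribute. The function names and the data layout are hypothetical.

```python
from itertools import combinations


def synonym_pairs(kb, records, top_k=1):
    """kb: {attr: {value: MF}}; records: list of attribute sets, one per table or itemization."""
    pairs = []
    for a, b in combinations(sorted(kb), 2):
        # Condition 1: the two attributes never occur in the same structured data.
        co_occur = any(a in r and b in r for r in records)
        # Condition 2: they share a popular value (assumed: top_k values by MF).
        top_a = set(sorted(kb[a], key=kb[a].get, reverse=True)[:top_k])
        top_b = set(sorted(kb[b], key=kb[b].get, reverse=True)[:top_k])
        if not co_occur and top_a & top_b:
            pairs.append((a, b))
    return pairs


def merge_pairs(pairs):
    """Merge pairs that share an attribute into groups (stand-in for cosine-based aggregation)."""
    groups = []
    for a, b in pairs:
        hit = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*hit)
        groups = [g for g in groups if g not in hit] + [merged]
    return groups
```

With the slide's example, `merge_pairs([("Production area", "Region"), ("Country", "Production area")])` yields a single group containing Country, Region, and Production area, while Alcohol and Temperature stay apart as long as they co-occur in some table.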
13
Attribute: ぶどう品種 (Grape variety)
– Synonyms: ブドウ品種 (Grape variety), 葡萄品種 (Grape variety), 使用品種 (Usage variety), 品種 (Variety)
– Values: シャルドネ (Chardonnay) [59], メルロー (Merlot) [36], シラー (Syrah) [29], リースリング (Riesling) [29], グルナッシュ (Grenache) [22]
Attribute: 内容量 (Volume)
– Synonyms: 容量 (Content)
– Values: 750ML [147], 720ML [64], 375ML [49], 500ML [41], 1500ML [22]
Attribute: 産地 (Production area)
– Synonyms: 原産地呼称AOC (Appellations of origin), 原産地 (Region of origin), 国 (Country), 生産地域 (Production region), 地域 (Region), 生産地 (Production region)
– Values: フランス (France) [45], イタリア (Italy) [30], スペイン (Spain) [30], チリ (Chile) [25], ボルドー (Bordeaux) [22]
Attribute: 生産者 (Producer)
– Synonyms: 製造元 (Manufacturer), 生産者名 (Name of producers)
– Values: ファルネーゼ (Farnese) [9], マス デ モニストロル (Mas de Monistrol) [4], ルロワ (Leroy) [3], M. シャプティエ (M. Chapoutier) [3], マストロベラルディーノ (Mastroberardino) [3]
Attribute: タイプ (Type)
– Values: 辛口 (Dry) [34], 赤 (Red) [24], 白 (White) [23], フルボディ (Full body) [23], やや甘口 (Slightly sweet) [15]
15
Training data construction
• Simple longest string matching between full texts and attribute values in the KB.
• Problems in automatic annotation:
– Incorrect annotation (false positive)
• The flavor of the <grape_variety> grape </grape_variety> is quite faint. (Here "grape" is not a grape variety.)
– Missing annotation (false negative)
• Chateau Talbot is a famous winery in <production_area> France </production_area>. (The winery name "Chateau Talbot" is left unannotated.)
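A minimal sketch of longest string matching against the KB is given below. It works at the character level, which suits unsegmented Japanese text, and assumes each value maps to a single attribute; the function name `annotate` and the `{value: attribute}` layout are illustrative assumptions.

```python
def annotate(text, kb):
    """Greedy left-to-right longest match of KB values; returns text with inline tags.

    kb: {value: attribute}. Values shared by several attributes are not handled here.
    """
    values = sorted(kb, key=len, reverse=True)  # try longer values first
    out, i = [], 0
    while i < len(text):
        match = next((v for v in values if text.startswith(v, i)), None)
        if match:
            attr = kb[match]
            out.append(f"<{attr}>{match}</{attr}>")
            i += len(match)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Because the matcher looks only at strings and not at context, it reproduces both problems listed above: any occurrence of a KB value is tagged, and anything absent from the KB is silently skipped.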
16
Incorrect annotation filtering
• Assumption: Attribute values with low MFs in structured data and high MFs in unstructured data are likely to be incorrect.
N_M … # of merchants offering a product in a category.
M_S … # of merchants offering structured data in a category.
MF_D(v) … # of merchants describing the value v in full texts.
MF_S(v) … # of merchants describing the value v in structured data.
$$\mathrm{Score}(v) = \frac{MF_D(v) / N_M}{MF_S(v) / M_S}$$
Numerator: likelihood that the value v occurs in full texts.
Denominator: likelihood that the value v occurs in structured data.
We regard attribute values with scores greater than 30 as incorrect, and remove sentences including such values from the corpus.
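The score and the threshold of 30 can be written out directly. The merchant counts in the usage note are made-up numbers chosen only to illustrate the two regimes; the function names are hypothetical.

```python
def score(mf_d, mf_s, n_m, m_s):
    """Ratio of a value's full-text likelihood (MF_D / N_M) to its structured-data likelihood (MF_S / M_S)."""
    # mf_s is at least 1 for any value taken from the KB, so the division is safe.
    return (mf_d / n_m) / (mf_s / m_s)


def is_incorrect(mf_d, mf_s, n_m, m_s, threshold=30.0):
    """True when the value is far more common in full texts than in structured data."""
    return score(mf_d, mf_s, n_m, m_s) > threshold
```

For example, with 1,000 merchants of which 200 offer structured data, a value used by 800 merchants in full texts but only 2 in structured data scores about 80 and is filtered, whereas a value used by 300 merchants in full texts and 45 in structured data scores about 1.3 and is kept.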
17
Missing annotation filtering
• Induce frequently occurring token sequences in attribute values with PrefixSpan [Pei+ 2001].
• Remove sentences containing a string that is not annotated and matches an induced pattern.
– Chateau Talbot is a famous winery in <production_area> France </production_area>.
KB entries:
<Winery, Chateau Lanessan>
<Winery, Chateau Fontareche>
<Winery, Chateau Latour>
Induced pattern:
[chateau] [ANY_TOKEN]
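The filter can be sketched as follows. Note that this sketch does not implement PrefixSpan: it substitutes a much simpler frequency count over the leading token of two-token values, which is enough to induce the `[chateau] [ANY_TOKEN]` pattern from the example. All names and the `min_count` parameter are hypothetical.

```python
from collections import Counter

ANY = "ANY_TOKEN"


def induce_patterns(values, min_count=2):
    """values: token lists of KB values for one attribute.

    Returns patterns such as ("chateau", ANY) whose leading token is frequent.
    """
    counts = Counter()
    for toks in values:
        if len(toks) == 2:
            counts[(toks[0].lower(), ANY)] += 1
    return {p for p, c in counts.items() if c >= min_count}


def has_missing_annotation(tokens, annotated, patterns):
    """tokens: sentence tokens; annotated: indices of tokens already covered by an annotation."""
    for i in range(len(tokens) - 1):
        if i in annotated or i + 1 in annotated:
            continue
        if (tokens[i].lower(), ANY) in patterns:
            return True  # an unannotated span matches an induced pattern
    return False
```

Sentences for which `has_missing_annotation` returns True would be dropped from the training corpus, so the model is never taught that "Chateau Talbot" is a non-entity.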
19
Extraction model training
• Algorithm: Conditional random fields [Lafferty+ 2001]
• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
• Features:
– Token: Surface form of the token.
– Base: Base form of the token.
– PoS: Part-of-Speech tag of the token.
– Char. type: Types of characters in the token.
– Prefix: Double character prefix of the token.
– Suffix: Double character suffix of the token.
– The above features of ±3 tokens surrounding the token.
They are frequently employed in the task of Japanese NER.
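The feature set and the Start/End (IOBES) tagging can be sketched as below. The sketch assumes tokens arrive already analyzed as dictionaries with `surface`, `base`, and `pos` keys (produced by some morphological analyzer that is not shown), and the character classes in `char_type` are a coarse guess at what "types of characters" covers. It only builds features and tags; no CRF library is called.

```python
def char_type(token):
    """Coarse character classes present in a token, joined with '+'."""
    kinds = set()
    for ch in token:
        if "\u3040" <= ch <= "\u309f":
            kinds.add("hiragana")
        elif "\u30a0" <= ch <= "\u30ff":
            kinds.add("katakana")
        elif "\u4e00" <= ch <= "\u9fff":
            kinds.add("kanji")
        elif ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("alpha")
        else:
            kinds.add("other")
    return "+".join(sorted(kinds))


def token_features(tokens, i, window=3):
    """Features of token i and of the tokens within +/- window positions around it."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if not 0 <= j < len(tokens):
            continue
        s = tokens[j]["surface"]
        feats[f"token[{off}]"] = s
        feats[f"base[{off}]"] = tokens[j]["base"]
        feats[f"pos[{off}]"] = tokens[j]["pos"]
        feats[f"chartype[{off}]"] = char_type(s)
        feats[f"prefix[{off}]"] = s[:2]   # two-character prefix
        feats[f"suffix[{off}]"] = s[-2:]  # two-character suffix
    return feats


def to_iobes(n_tokens, spans):
    """spans: (start, end_exclusive, label) over token indices. Returns one IOBES tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            tags[end - 1] = f"E-{label}"
            for k in range(start + 1, end - 1):
                tags[k] = f"I-{label}"
    return tags
```

The per-token feature dictionaries and tag sequences produced this way are in the shape most CRF toolkits accept, though the exact input format depends on the toolkit chosen.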
22
Experiments
• Evaluation of KB
– Extracted attributes
– Aggregated attribute synonyms
– Extracted attribute-values
• Evaluation of the quality of annotated corpora
• Evaluation of extraction models
23
Experimental setting
• Category:
– Selected eight major categories in Rakuten.
• Wine, T-shirts, Printer ink, Shampoo, Golf ball, and others.
• Attribute:
– Selected the top eight attributes in each category according to their merchant frequencies.
• Training dataset:
– Randomly sampled 100K sentences for each category.
• Evaluation dataset:
– A tailored annotated corpus comprising 1,776 product pages gathered from the categories.
24
Compared models
• KB match:
– Matching attribute values in the KB, and then filtering out problematic annotations.
• Model w/o filters:
– Training models on a corpus to which neither filter is applied.
• Model w/ incorrect annotation filter:
– Training models on a corpus to which only the filter for incorrect annotations is applied.
• Model w/ missing annotation filter:
– Training models on a corpus to which only the filter for missing annotations is applied.
25
Evaluation of extraction models
Model                                 P (%)   R (%)   F score
KB match                              57.14   29.29   37.21
Model w/o filters                     52.60   54.49   53.14
Model w/ incorrect annotation filter  60.46   54.23   56.84
Model w/ missing annotation filter    50.47   59.71   54.43
Model of the proposed method          57.05   59.66   58.15
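For reference, precision, recall, and F score over sets of extracted triples can be computed as in the sketch below. How the slide's figures were matched and averaged across categories and attributes is not stated, so this is only the textbook definition applied to hypothetical (page, attribute, value) triples.

```python
def prf(gold, predicted):
    """Precision, recall, and F1 between two sets of (page, attribute, value) triples."""
    tp = len(gold & predicted)  # exact-match true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

With four gold triples and three predictions of which two are correct, this gives a precision of 2/3, a recall of 1/2, and an F score of 4/7.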
26
Evaluation of extraction models
+30.4 pt. Recall was dramatically improved over KB match.
⇒ Contexts surrounding a value and patterns of tokens in a value are successfully captured.
27
Evaluation of extraction models
+7.9 pt. The incorrect annotation filter improved precision (52.60 ⇒ 60.46).
28
Evaluation of extraction models
The incorrect annotation filter improved precision.
+5.2 pt. The missing annotation filter improved recall (54.49 ⇒ 59.71).
29
Evaluation of extraction models
+5.1 pt.
The incorrect annotation filter improved precision.
The missing annotation filter improved recall.
⇒ The precision and recall of the proposed method are enhanced by employing both filters.
30
Error trend
• Randomly selected 50 attribute values judged as
incorrect in wine and shampoo categories.
Type # of err.
Automatic annotation 36
Incorrect KB entry 23
Over generation by learned patterns 15
Extraction from unrelated regions 12
Others 14
31
Automatic annotation error
• 土壌が<産地>ボルドー</産地>のポムロールと非常に似ている。
The soil is very similar to that of the Pomerol region of <production_area>Bordeaux</production_area>.
• <成分>ヒアルロン酸</成分> 以上の保水力がある。
It has a higher water-holding capacity than <constituent>hyaluronan</constituent>.
• <タイプ>白</タイプ>カビチーズに合わせるとより楽しめます。
<type>White</type> mold cheese will enhance the taste of the wine.
• 輸出は全体の<アルコール>10%</アルコール>程度。
The amount of exports is approximately <alcohol>10%</alcohol> of the total.
32
Related work
• Product information extraction
– (Semi-)supervised methodology [Ghani+ 2006, Probst+ 2007, Davidov+ 2010, Bakalov+ 2011, Putthividhya+ 2011]
⇒ Training data or initial seeds are required.
– Unsupervised methodology [Yoshinaga+ 2006, Dalvi+ 2009, Gulhane+ 2010, Mauge+ 2012, Bing+ 2012]
⇒ Not aimed at full texts, or limited in the size of texts they handle.
• Unsupervised NER / Unsupervised IE
– Many attempts based on distant supervision [Nadeau+ 2006, Whitelaw+ 2008, Nothman+ 2008, Mintz+ 2009, Ritter+ 2011]
⇒ Wikipedia and Freebase are used as resources.
33
Conclusion and future work
• Distant-supervision-based approach for extracting attributes and their values from product pages.
– Construction of a knowledge base.
– Removal of false-positive and false-negative annotations from the automatically constructed corpus.
• Evaluated the performance of KB induction, automatic annotation, and extraction models across multiple categories.
• Future work
– Improve the annotation quality by considering contexts.
– Construct a KB with wide coverage and high quality.
34
Thank you for your kind attention!

 

Recently uploaded (20)

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 

Unsupervised Extraction of Attributes and Their Values from Product Description

  • 1. Unsupervised Extraction of Attributes and Their Values from Product Description. Keiji Shinzato and Satoshi Sekine, Rakuten Institute of Technology. 17th Oct. 2013, The 6th International Joint Conference on Natural Language Processing.
  • 2. What is Rakuten?
      • Biggest e-commerce company in Japan.
      • B2B2C model.
      • Statistics:
        – # of merchants: 40K+
        – # of products: 100M+
        – # of product categories: 40K+
      • Each product page is assigned to a single product category by its merchant.
      • Merchants describe product information in various ways.
        – Not well organized :-(
  • 3. Examples of product pages (wine category): tables and itemizations. Product data is offered by merchants using various methods.
  • 4. Examples of product pages (wine category): full texts. Product data is offered by merchants using various methods.
  • 5. Goal
      • Develop an unsupervised methodology for constructing structured data from full texts.
      Full texts (unstructured data) → Structured data:
        Attribute        | Value
        Color            | Red
        Production area  | Italy, Tuscany
        Grape variety    | Merlot, Cabernet sauvignon, Petit verdot, Cabernet franc
        Vintage          | 2010
        Volume           | 750ml
  • 6. Unsupervised information extraction
      • Distant supervision [Mintz+ 2009]
        – Construct an annotated corpus using an existing Knowledge Base (KB).
        – Train a model from the constructed corpus.
      Example: with the KB entry "Founder: Hiroshi Mikitani", the sentence "Hiroshi Mikitani is founder and CEO of the online marketing company Rakuten." becomes training data for founder-company information extraction. Machine learning on such data yields the extraction model.
  • 7. Problem of existing KBs
      • Wikipedia
        – Infobox is not tailored towards e-commerce.
      • Freebase
        – Only available in English.
        – Attributes and values are limited even in English.
      Figure: gap between the attributes in the infobox for the wine article in Wikipedia and the attributes for users seeking their favorite wines (labels: Production area, Grape variety, Winery, Vintage).
      1. Construct KB for product information extraction.
      2. Remove false-positive and false-negative annotations in the automatically constructed corpus.
  • 8. Agenda
      • Background
      • Overview of our approach
        – Knowledge base induction
        – Training data construction
        – Extraction model training
        – Product page structuring
      • Experiments
      • Conclusion and future work
  • 9. Overview of our approach
      Input: Product pages in the category C
        – Pages for model construction
          · Pages including tables or itemizations
          · Unstructured pages
        – Pages that we want to structure
      Examples of structured data on product pages:
        – Winery: Bodegas Carchelo; Type: Medium body; Grape: Monastrell 40%, Syrah 40%, Cabernet Sauvignon 20%
        – Type: Red; Country: Italy / Tuscany; Grape: Sangiovese; Year: 2011
  • 10. Overview of our approach
      Input: Product pages in the category C
        – Pages for model construction
          · Pages including tables or itemizations
          · Unstructured pages
        – Pages that we want to structure
      1. Knowledge base induction → Knowledge base (KB): <attr1, value1>, <attr2, value2>, <attr1, value3>, ...
      2. Training data construction → Annotated pages
      3. Extraction model training → Extraction model
      4. Product page structuring → Output: Structured data
  • 11. KB induction - Extraction of attributes and their values -
      • Attribute acquisition:
        – Assumption: Expressions that are often located in table headers can be considered as attributes.
        – Extract expressions enclosed by <TH> tags.
      • Attribute value extraction:
        – Extract attribute-value pairs using regular expression patterns [Yoshinaga and Torisawa 2006].
        – Store <attr., val.> in the KB along with its merchant frequency (MF), the number of merchants that use it in tables or itemizations.
      Examples (MF in parentheses): <Production area, France> (29), <Region, Italy> (13)
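The attribute acquisition and merchant-frequency counting on slide 11 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it pairs each <TH> cell with the <TD> cells that follow it in the same row, standing in for the regular expression patterns of [Yoshinaga and Torisawa 2006], which the slides do not spell out. The names `ThTdPairParser` and `induce_kb` and the `(merchant_id, html)` input shape are assumptions made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ThTdPairParser(HTMLParser):
    """Collect (header, cell) pairs from rows like <tr><th>A</th><td>V</td></tr>."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._tag = None      # cell currently being read: "th", "td", or None
        self._header = None   # most recent <th> text in the current row
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._header = None
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "th":
            self._header = text
        elif self._header and text:
            self.pairs.append((self._header, text))
        self._tag = None


def induce_kb(pages):
    """pages: iterable of (merchant_id, html).

    Returns {(attribute, value): merchant frequency}, where the merchant
    frequency counts distinct merchants, not pages.
    """
    merchants = defaultdict(set)
    for merchant_id, html in pages:
        parser = ThTdPairParser()
        parser.feed(html)
        for pair in parser.pairs:
            merchants[pair].add(merchant_id)
    return {pair: len(ids) for pair, ids in merchants.items()}
```

Counting merchants rather than pages keeps one merchant who repeats the same table on many product pages from dominating the KB.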
  • 12. KB induction - Attribute synonym discovery -
      • Assumption: Attributes can be seen as synonyms of one another if
        – they are not included in the same structured data, and
        – they share an identical popular value.
      • Regard attribute pairs satisfying the conditions as synonyms.
      • Aggregate similar pairs of attribute synonyms by computing the cosine measure.
      Synonym: <Production area, France> and <Region, France>
      Non-synonym: <Alcohol, 15 degree> and <Temperature, 15 degree>
      Aggregation example: (Production area, Region), (Country, Production area) → (Country, Region, Production area)
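A minimal sketch of the two synonym conditions on slide 12, under stated assumptions: the slides do not define "popular", so the cut-off `min_mf` is hypothetical, as are the function name and the input shapes. The cosine-based aggregation step is not covered here.

```python
from itertools import combinations


def find_synonym_pairs(records, kb, min_mf=10):
    """records: list of {attribute: value} dicts, one per table or itemization.
    kb: {(attribute, value): merchant frequency}.
    min_mf: hypothetical cut-off for what counts as a "popular" value.
    """
    # Attribute pairs that appear together in at least one structured record.
    cooccurring = set()
    for record in records:
        cooccurring.update(frozenset(p) for p in combinations(sorted(record), 2))

    # Popular values observed for each attribute.
    popular = {}
    for (attr, value), mf in kb.items():
        if mf >= min_mf:
            popular.setdefault(attr, set()).add(value)

    synonyms = set()
    for a, b in combinations(sorted(popular), 2):
        if frozenset((a, b)) in cooccurring:
            continue  # used side by side, so they name different things
        if popular[a] & popular[b]:
            synonyms.add((a, b))
    return synonyms
```

With the slide's examples, Alcohol and Temperature share the value "15 degree" but appear in the same structured data, so they are rejected, while Production area and Region share "France" and never co-occur, so they are accepted.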
  • 13. Attributes with their synonyms and values (counts in brackets):
      ぶどう品種 (Grape variety)
        Synonyms: ブドウ品種 (Grape variety), 葡萄品種 (Grape variety), 使用品種 (Usage variety), 品種 (Variety)
        Values: シャルドネ [59] (Chardonnay), メルロー [36] (Merlot), シラー [29] (Syrah), リースリング [29] (Riesling), グルナッシュ [22] (Grenache)
      内容量 (Volume)
        Synonyms: 容量 (Content)
        Values: 750ML [147], 720ML [64], 375ML [49], 500ML [41], 1500ML [22]
      産地 (Production area)
        Synonyms: 原産地呼称AOC (Appellations of origin), 原産地 (Region of origin), 国 (Country), 生産地域 (Production region), 地域 (Region), 生産地 (Production region)
        Values: フランス [45] (France), イタリア [30] (Italy), スペイン [30] (Spain), チリ [25] (Chile), ボルドー [22] (Bordeaux)
      生産者 (Producer)
        Synonyms: 製造元 (Manufacturer), 生産者名 (Name of producers)
        Values: ファルネーゼ [9] (Farnese), マス デ モニストロル [4] (Mas de Monistrol), ルロワ [3] (Leroy), M. シャプティエ [3] (M. Chapoutier), マストロベラルディーノ [3] (Mastroberardino)
      タイプ (Type)
        Values: 辛口 [34] (Dry), 赤 [24] (Red), 白 [23] (White), フルボディ [23] (Full body), やや甘口 [15] (Slightly sweet)
  • 14. Overview of our approach (same pipeline diagram as slide 10)
  • 15. Training data construction
      • Simple longest string matching between full texts and attribute-values in KB.
      • Problems in automatic annotation:
        – Incorrect annotation (false-positive)
          · The flavor of the <grape_variety> grape </grape_variety> is quite a little.
        – Missing annotation (false-negative)
          · Chateau Talbot is a famous winery in <production_area> France </production_area>.
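The longest string matching on slide 15 could look roughly like the sketch below. It assumes the text has already been tokenized (Japanese text would need a morphological analyzer first) and that KB values are stored as token tuples. `annotate` and its argument shapes are illustrative names, not taken from the slides.

```python
def annotate(tokens, kb_values):
    """tokens: list of str.
    kb_values: {tuple of tokens: attribute name}.
    Returns (start, end, attribute) spans with end exclusive.
    """
    max_len = max((len(v) for v in kb_values), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so that "Cabernet Sauvignon"
        # wins over the shorter KB value "Cabernet".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            attr = kb_values.get(tuple(tokens[i:i + n]))
            if attr is not None:
                match = (i, i + n, attr)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched value
        else:
            i += 1
    return spans
```

Because the matcher ignores context, it reproduces both problems listed on the slide: any KB string is tagged wherever it occurs, and values missing from the KB are never tagged.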
  • 16. Incorrect annotation filtering
      • Assumption: Attribute values with low MFs in structured data and high MFs in unstructured data are likely to be incorrect.
      N_M … # of merchants offering a product in a category.
      M_S … # of merchants offering structured data in a category.
      MF_D(v) … # of merchants describing the value v in full texts.
      MF_S(v) … # of merchants describing the value v in structured data.
      $\mathrm{Score}(v) = \dfrac{MF_D(v) / N_M}{MF_S(v) / M_S}$
      The numerator is the likelihood of the value v occurring in full texts; the denominator is the likelihood of the value v occurring in structured data.
      We regard attribute values with scores greater than 30 as incorrect, and remove sentences including such values from the corpus.
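As a worked illustration of the score on slide 16, the sketch below computes the ratio and applies the threshold of 30 stated on the slide. The function names are hypothetical and the merchant counts in the usage note are made up for the example.

```python
def incorrectness_score(mf_d, n_m, mf_s, m_s):
    """Ratio of a value's merchant share in full texts to its share in structured data.

    mf_d: merchants describing the value in full texts
    n_m:  merchants offering a product in the category
    mf_s: merchants describing the value in structured data (at least 1 for KB values)
    m_s:  merchants offering structured data in the category
    """
    # Algebraically equal to (mf_d / n_m) / (mf_s / m_s); written as one
    # division to limit floating-point rounding.
    return (mf_d * m_s) / (n_m * mf_s)


def is_incorrect(mf_d, n_m, mf_s, m_s, threshold=30.0):
    """Apply the slide's rule: scores greater than 30 mark a value as incorrect."""
    return incorrectness_score(mf_d, n_m, mf_s, m_s) > threshold
```

For instance, a value written in full texts by 600 of 1,000 merchants but listed in structured data by only 2 of 200 scores (600/1000) / (2/200) = 60 and would be filtered, whereas one with 50 of 1,000 and 45 of 200 scores about 0.22 and would be kept.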
  • 17. Missing annotation filtering
      • Induce frequently occurring token sequences in attribute values with PrefixSpan [Pei+ 2001].
      • Remove sentences containing a string that is not annotated and matches an induced pattern.
        – Chateau Talbot is a famous winery in <production_area> France </production_area>.
      KB values: <Winery, Chateau Lanessan>, <Winery, Chateau Fontareche>, <Winery, Chateau Latour>
      Induced pattern: [chateau] [ANY_TOKEN]
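The filter on slide 17 can be illustrated with a deliberately simplified stand-in for PrefixSpan: instead of mining general sequential patterns, the sketch only counts shared first tokens of two-token values, which is enough to recover the slide's [chateau] [ANY_TOKEN] pattern. The function names and the `min_support` parameter are assumptions made for this example.

```python
from collections import Counter


def induce_prefix_patterns(values, min_support=3):
    """values: iterable of token tuples (KB values of one attribute).

    Returns the lowercased first tokens that start at least `min_support`
    two-token values, each standing for a pattern [token] [ANY_TOKEN].
    """
    counts = Counter(v[0].lower() for v in values if len(v) == 2)
    return {tok for tok, c in counts.items() if c >= min_support}


def has_missing_annotation(tokens, spans, prefixes):
    """True if an unannotated [prefix] [ANY_TOKEN] sequence occurs in tokens.

    spans: (start, end, attribute) annotations with end exclusive.
    """
    covered = set()
    for start, end, _ in spans:
        covered.update(range(start, end))
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in prefixes and i not in covered and i + 1 not in covered:
            return True
    return False
```

Sentences for which `has_missing_annotation` returns true would be dropped from the training corpus, as in the "Chateau Talbot" example where only "France" was annotated.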
  • 18. Overview of our approach (same pipeline diagram as slide 10)
  • 19. Extraction model training
      • Algorithm: Conditional random fields [Lafferty+ 2001]
      • Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
      • Features:
        – Token: Surface form of the token.
        – Base: Base form of the token.
        – PoS: Part-of-Speech tag of the token.
        – Char. type: Types of characters in the token.
        – Prefix: Double character prefix of the token.
        – Suffix: Double character suffix of the token.
        – The above features of ±3 tokens surrounding the token.
      These features are frequently employed in Japanese NER.
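To make the tagging scheme and features on slide 19 concrete, here is a small sketch that converts annotation spans to IOBES tags and builds a feature dictionary per token. It is an illustration only: the base form and PoS features are left out because they require a morphological analyzer, the character-type ranges are a rough approximation, and all function and feature names are hypothetical. Training the CRF itself is not shown.

```python
def to_iobes(n_tokens, spans):
    """Convert (start, end, attribute) spans (end exclusive) to IOBES tags."""
    tags = ["O"] * n_tokens
    for start, end, attr in spans:
        if end - start == 1:
            tags[start] = f"S-{attr}"          # single-token value
        else:
            tags[start] = f"B-{attr}"          # first token
            tags[end - 1] = f"E-{attr}"        # last token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{attr}"          # inside tokens
    return tags


def char_type(token):
    """Rough character classes in a token, joined with '+'."""
    kinds = set()
    for ch in token:
        if ch.isdigit():
            kinds.add("digit")
        elif "\u3040" <= ch <= "\u309f":
            kinds.add("hiragana")
        elif "\u30a0" <= ch <= "\u30ff":
            kinds.add("katakana")
        elif "\u4e00" <= ch <= "\u9fff":
            kinds.add("kanji")
        elif ch.isalpha():
            kinds.add("alpha")
        else:
            kinds.add("other")
    return "+".join(sorted(kinds))


def token_features(tokens, i, window=3):
    """Surface, character type, 2-char prefix and suffix for tokens i-3 .. i+3."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            feats[f"{offset}:token"] = tok
            feats[f"{offset}:ctype"] = char_type(tok)
            feats[f"{offset}:prefix"] = tok[:2]
            feats[f"{offset}:suffix"] = tok[-2:]
    return feats
```

One feature dictionary per token, paired with its IOBES tag, is the kind of input that sequence-labeling toolkits typically expect.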
  • 20. Overview of our approach (same pipeline diagram as slide 10)
  • 21. Agenda (same outline as slide 8)
  • 22. Experiments
      • Evaluation of KB
        – Extracted attributes
        – Aggregated attribute synonyms
        – Extracted attribute-values
      • Evaluation of the quality of annotated corpora
      • Evaluation of extraction models
  • 23. Experimental setting
      • Category:
        – Selected eight major categories in Rakuten.
          · Wine, T-shirts, Printer ink, Shampoo, Golf ball, and others.
      • Attribute:
        – Selected the top eight attributes in each category according to the merchant frequencies of the attributes.
      • Training dataset:
        – Randomly picked 100K sentences for each category.
      • Evaluation dataset:
        – Tailored annotated corpus comprising 1,776 product pages gathered from the categories.
  • 24. Compared models
      • KB match:
        – Matching attribute values in the KB, and then filtering out problematic annotations.
      • Model w/o filters:
        – Model trained on a corpus to which neither filter is applied.
      • Model w/ incorrect annotation filter:
        – Model trained on a corpus to which only the filter for incorrect annotations is applied.
      • Model w/ missing annotation filter:
        – Model trained on a corpus to which only the filter for missing annotations is applied.
  • 25. Evaluation of extraction models
        Model                                 | P (%) | R (%) | F score
        KB match                              | 57.14 | 29.29 | 37.21
        Model w/o filters                     | 52.60 | 54.49 | 53.14
        Model w/ incorrect annotation filter  | 60.46 | 54.23 | 56.84
        Model w/ missing annotation filter    | 50.47 | 59.71 | 54.43
        Model of the proposed method          | 57.05 | 59.66 | 58.15
  • 26. Evaluation of extraction models (same results table as slide 25)
      +30.4 pt. Recall was dramatically improved.
      ⇒ Contexts surrounding a value and patterns of tokens in a value are successfully captured.
  • 27. Evaluation of extraction models (same results table as slide 25)
      +7.9 pt. The incorrect annotation filter improved precision.
  • 28. Evaluation of extraction models (same results table as slide 25)
      +5.2 pt. The incorrect annotation filter improved precision. The missing annotation filter improved recall.
  • 29. Evaluation of extraction models (same results table as slide 25)
      +5.1 pt. The incorrect annotation filter improved precision. The missing annotation filter improved recall.
      ⇒ The precision and recall of the proposed method are enhanced by employing both filters.
  • 30. Error trend
      • Randomly selected 50 attribute values judged as incorrect in wine and shampoo categories.
        Type                                  | # of err.
        Automatic annotation                  | 36
        Incorrect KB entry                    | 23
        Over generation by learned patterns   | 15
        Extraction from unrelated regions     | 12
        Others                                | 14
  • 31. Automatic annotation error
      • 土壌が<産地>ボルドー</産地>のポムロールと非常に似ている。
        The soil is very similar to that of the Pomerol region of <production_area> Bordeaux </production_area>.
      • <成分>ヒアルロン酸</成分> 以上の保水力がある。
        It has a higher water-holding ability than <constituent> hyaluronan </constituent> has.
      • <タイプ>白</タイプ>カビチーズに合わせるとより楽しめます。
        <type>White</type> mold cheese will enhance the taste of the wine.
      • 輸出は全体の<アルコール>10%</アルコール>程度。
        The amount of exports is approximately <alcohol>10 %</alcohol> of the total.
  • 32. Related work
      • Product information extraction
        – (Semi-)supervised methodology [Ghani+ 2006, Probst+ 2007, Davidov+ 2010, Bakalov+ 2011, Putthividhya+ 2011] ⇒ Training data or initial seeds are required.
        – Unsupervised methodology [Yoshinaga+ 2006, Dalvi+ 2009, Gulhane+ 2010, Mauge+ 2012, Bing+ 2012] ⇒ Not for full texts or limited to the size of texts.
      • Unsupervised NER / Unsupervised IE
        – Many attempts based on distant supervision [Nadeau+ 2006, Whitelaw+ 2008, Nothman+ 2008, Mintz+ 2009, Ritter+ 2011] ⇒ Wikipedia and Freebase are resources.
  • 33. Conclusion and future work
      • Distant supervision based approach for extracting attributes and their values from product pages.
        – Construction of knowledge base.
        – Removal of false-positive and false-negative annotations from the automatically constructed corpus.
      • Evaluated the performance of KB induction, automatic annotation, and extraction models under multiple categories.
      • Future work
        – Improve the annotation quality by considering contexts.
        – Construct KB with wide coverage and high quality.
  • 34. Thank you for your kind attention!