[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
Rakuten Technology Conference 2013
"Building Structured Data from Product Descriptions"
Keiji Shinzato (Rakuten)

Published in: Technology

Transcript

  • 1. Building Structured Data from Product Descriptions Keiji Shinzato
  • 2. Product information extraction
       Description: "An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany."
       Extracted data:
       - Type: Red
       - Grape variety: Sangiovese
       - Region: Italy, Tuscany
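  • 3. Background
       - Structured data play a crucial role in making Rakuten a more attractive service.
           - Faceted navigation, recommendation, and market analysis.
       Example product: ベリンダ・コーリー キアンティ 2011 750ml
       Description: トスカーナ州キャンティ地区のサンジョベーゼ種を主体につくられる、イタリアを代表する赤ワインの一つ。 (One of Italy's representative red wines, made mainly from Sangiovese grapes of the Chianti district in Tuscany.)
       Attribute / Value:
       - Type: 赤 (Red)
       - Region: イタリア, トスカーナ州キャンティ地区 (Italy, Chianti district of Tuscany)
       - Grape: サンジョベーゼ (Sangiovese)
       - Vintage: 2011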
  • 4. Faceted navigation. Reference: http://www.amazon.com/
  • 5. Background
       - Structured data play a crucial role in making Rakuten a more attractive service.
           - Faceted navigation, recommendation, and market analysis.
       - An unsupervised methodology is required.
           - 100 million products / 40,000 categories.
       Example product and attribute table as on slide 3.
  • 6. A table is a useful clue, but…
       Product page including a table (WINE > CHILE, Montes Alpha M 2009):
       - Type: Red
       - Region: Chile
       - Grape: Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
       - Year: 2009
       Product page consisting of sentences (WINE > CHILE, Montes Alpha M 2009): "Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a …"
       Figure shown on the slide: 38%
  • 7. Product information extraction
       Product page (unstructured), WINE > CHILE, Montes Alpha M 2009: "Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …"
       Structured data (Attribute / Value):
       - Type: Red
       - Region: Chile
       - Grape: Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
       - Vintage: 2009
       - Company: Montes
       Issues:
       - Issue 1: How do we know the attributes for a category?
       - Issue 2: How do we extract attribute values from full texts?
  • 8. Attribute name collection
       Analyze a large amount of table data to collect the attributes of an object.
       Figure labels: "Attribute names of Wine", "Attribute values".
       Reference: http://item.rakuten.co.jp/redbox/odm3000728/
  • 9. Attribute value database (wine)
       - ぶどう品種 (Grape variety): Chardonnay, Chardonnay 100%, Merlot, Riesling, Syrah, Grenache, Merlot, Tempranillo, Sangiovese, Syrah100%
       - 内容量 (Volume): 750ML, 720ML, 375ML, 500ML, 1500ML, 360ML, 200ML, 3000ML, 1800ML, 1000ML
       - 産地 (Region): France, Italy, Spain, Chile, German, Australia, America, Bordeaux, Champagne, Argentina
       - 生産者 (Winery): Farnese, Mas de Monistrol, Leroy, M. Chapoutier, Mastroberardino, Santero, Saltarelli, Cavicchioli, Fontodi, Ca'Rugate
       - 味わい (Taste): Dry, Full body, Medium body, Slightly sweet, Sweet, Medium dry, Extremely sweet, Medium dry, Red Full body, Middle sweet
       Precision is high, but coverage is low.
  • 10. Product information extraction
       Product page (unstructured) and structured data for Montes Alpha M 2009, as on slide 7.
       Issues:
       - Issue 1: How do we know the attributes for each category?
       - Issue 2: How do we extract attribute values from product descriptions?
  • 11. Unsupervised attribute value extraction: distant supervision approach
       - Construction: semi-structured data ⇒ database, e.g. <Region, Margaux>, <Color, White>, …
       - Annotation: database ⇒ product page including entries in the database, e.g. "Chateau d’Issan 1994. This is a wine from Margaux. ..."
       - Generation: annotated pages ⇒ rule, e.g. "wine from x ⇒ x is a Region"
       Rules are generated by a machine learning algorithm.
  • 12. Corpus with attribute-value annotations (wine)
       - J: <産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。
         E: A spicy and gorgeous wine that is said to be the most aromatic in <production_area>Alsace</production_area>.
       - J: 最もお手頃で、<生産者>ドメーヌ・ペゴー</生産者>の美味しさを気軽に楽しめる、とっても嬉しい一本なのです
         E: This is a very nice wine because we can easily enjoy the taste of <winery>Domaine Pegau</winery> at the best price.
       - J: <ぶどう品種>ソーヴィニヨン・ブラン</ぶどう品種>種の特長がよく表れたワイン。
         E: A wine in which the characteristics of <grape_variety>Sauvignon Blanc</grape_variety> are well expressed.
       - J: <タイプ>白</タイプ>身魚の塩焼きやシンプルな味付けのソテー、焼き牡蠣、豚のしょうが焼き、ボンゴレビアンコなどと。
         E: With salt-grilled <type>white</type> fish, simply seasoned sautés, grilled oysters, ginger pork, vongole bianco, and so on. (Here 白, "white", is part of 白身魚, "white-fleshed fish", so the matched string does not denote a wine type.)
  • 13. Unsupervised attribute value extraction: distant supervision approach (overview diagram repeated from slide 11)
  • 14. Extraction rule generation
       - Algorithm: Conditional random fields [Lafferty+ 2001]
       - Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
       - Features:
           - Token: Surface form of the token.
           - Base: Base form of the token.
           - PoS: Part-of-speech tag of the token.
           - Char. type: Types of characters in the token.
           - Prefix: Two-character prefix of the token.
           - Suffix: Two-character suffix of the token.
           - The above features of the ±3 tokens surrounding the token.
         These features are frequently employed in Japanese named entity recognition.
  • 15. Unsupervised attribute value extraction: distant supervision approach (overview diagram repeated from slide 11)
  • 16. Unsupervised attribute value extraction: distant supervision approach
       Apply the rules to a product page: "Terre di matraja Bianco 2012. This is a wine from Tuscany. ..."
       - Rule: wine from x ⇒ x is a Region
       - Rule: 1800 < x <= 2013 ⇒ x is a Vintage
       Result (Attribute / Value):
       - Region: Tuscany
       - Vintage: 2012
       - Grape: Chardonnay
  • 17. Performance (F-score)
       - Wine: 43.8 pt. without ML, 60.1 pt. with ML
       - Shampoo: 24.1 pt. without ML, 71.5 pt. with ML
  • 18. Wine / Japanese
       Description: "An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany."
       - Type: Red
       - Grape variety: Sangiovese
       - Region: Italy, Tuscany
  • 19. Shampoo / Japanese
       Description: "MCH Natural shampoo 1000ml" is a shampoo consisting of cypress oil and charcoal.
       - Category: Shampoo
       - Product name: MCH Natural shampoo 1000ml
       - Ingredient: Cypress oil, Charcoal
  • 20. Video game / French
       - Product type: Nintendo 64, Nintendo DS
       - Saga: Mario
  • 21. Conclusion
       - We are developing a technique for extracting product information from unstructured data.
           - Independent of category and language.
       - Useful services can be built on structured product data.
       - Our paper is available on the web.
           - ACL anthology: http://aclweb.org/anthology//I/I13/
  • 22. Thank you for listening!
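
Slides 8 and 9 describe collecting attribute names and values from the tables found on product pages. A minimal sketch of that collection step follows. The function name `build_attribute_db`, the `(name, value)` row format, and the `min_count` frequency threshold are assumptions made for illustration; the slides do not say how table rows are parsed or filtered.

```python
from collections import Counter, defaultdict


def build_attribute_db(tables, min_count=2):
    """Collect attribute names and their values from table rows.

    `tables` is a list of tables; each table is a list of (name, value) rows.
    Values seen fewer than `min_count` times are dropped (an assumed filter).
    """
    counts = defaultdict(Counter)
    for rows in tables:
        for name, value in rows:
            counts[name.strip()][value.strip()] += 1
    return {
        name: {v for v, c in values.items() if c >= min_count}
        for name, values in counts.items()
    }


# Hypothetical table rows in the style of the wine examples on the slides.
tables = [
    [("Type", "Red"), ("Region", "Chile"), ("Grape", "Merlot")],
    [("Type", "Red"), ("Region", "Italy"), ("Grape", "Sangiovese")],
    [("Type", "White"), ("Region", "Chile"), ("Grape", "Merlot")],
]
db = build_attribute_db(tables)
```

A threshold like this trades coverage for precision, which matches the remark on slide 9 that the resulting database is precise but has low coverage.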
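
The annotation step on slides 11 and 12 marks every database entry that appears in a product description. The sketch below does this with plain longest-match string search. The function name `annotate` and the matching strategy are assumptions, not the method stated in the talk, and the example sentences are illustrative.

```python
import re


def annotate(sentence, db):
    """Wrap each database value found in `sentence` in its attribute tag."""
    # Try longer values first so that a long entry wins over a shorter
    # entry contained in it.
    entries = sorted(
        ((value, attr) for attr, values in db.items() for value in values),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    spans = []
    for value, attr in entries:
        for m in re.finditer(re.escape(value), sentence):
            # Keep the match only if it does not overlap an earlier one.
            if all(m.end() <= s or m.start() >= e for s, e, _ in spans):
                spans.append((m.start(), m.end(), attr))
    out, pos = [], 0
    for start, end, attr in sorted(spans):
        out.append(sentence[pos:start])
        out.append(f"<{attr}>{sentence[start:end]}</{attr}>")
        pos = end
    out.append(sentence[pos:])
    return "".join(out)


db = {"Region": {"Margaux"}, "Type": {"white"}}
good = annotate("This is a wine from Margaux.", db)
noisy = annotate("Goes well with grilled white fish.", db)
```

The second call tags "white" as a wine type even though it describes the fish, the same kind of noise as the last example on slide 12. Distant supervision accepts such errors in exchange for not needing hand-labeled data.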
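
Slide 14 lists the chunk tag scheme and the token features used to train the extraction model. The sketch below shows one way to produce IOBES tags and the per-token feature dictionary with a ±3 window. It assumes tokens arrive as `(surface, base, pos)` tuples from some external morphological analyzer, and `char_type` is a simplified stand-in for the character-type feature; both are assumptions. Training the conditional random field itself is left to a CRF toolkit and is not shown.

```python
def iobes_tags(n_tokens, spans):
    """Return IOBES tags for spans given as (start, end_exclusive, label)."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"  # single-token value
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


def char_type(surface):
    """Coarse character type (simplified; assumed for illustration)."""
    if surface.isdigit():
        return "digit"
    if surface.isalpha():
        return "alpha"
    return "other"


def token_features(tokens, i, window=3):
    """Features of token i and of the tokens within +/- `window` positions."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if not 0 <= j < len(tokens):
            continue
        surface, base, pos = tokens[j]
        feats[f"{off}:token"] = surface
        feats[f"{off}:base"] = base
        feats[f"{off}:pos"] = pos
        feats[f"{off}:ctype"] = char_type(surface)
        feats[f"{off}:prefix"] = surface[:2]
        feats[f"{off}:suffix"] = surface[-2:]
    return feats


# Hypothetical analyzer output for "wine from Margaux".
tokens = [("wine", "wine", "NN"), ("from", "from", "IN"), ("Margaux", "Margaux", "NNP")]
tags = iobes_tags(len(tokens), [(2, 3, "Region")])
feats = token_features(tokens, 2)
```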
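
Slide 16 includes a numeric rule, `1800 < x <= 2013 ⇒ x is a Vintage`. A minimal sketch of that rule is shown below. Restricting candidates to standalone four-digit numbers is an assumption added here; the slide gives only the numeric range.

```python
import re


def extract_vintage(text, low=1800, high=2013):
    """Return numbers x in `text` with low < x <= high.

    Only standalone four-digit numbers are considered (an assumption).
    """
    candidates = re.findall(r"(?<!\d)\d{4}(?!\d)", text)
    return [int(c) for c in candidates if low < int(c) <= high]


vintages = extract_vintage("Terre di matraja Bianco 2012")
```

For the product name on slide 16 this yields the vintage 2012. A volume such as 750ml is ignored because it has only three digits.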
