
Organizing Big Data for Text in Rakuten

Rakuten has various kinds of text data, such as query keywords, product descriptions, and reviews from our users. The data is collected on our servers continuously, and its size grows by the hour. However, we need to convert this massive unstructured data into structured data in order to take advantage of big data for text in our businesses. In this presentation, we talk about methodologies for automatically organizing unstructured data using natural language processing techniques so that we can help Rakuten's business.

Published in: Technology


  1. 1. Organizing Big Data for Text in Rakuten. October 28, 2017. Keiji Shinzato, Rakuten Institute of Technology, Rakuten, Inc.
  2. 2. Text in Rakuten: • Search Queries • Reviews from Users • Product Descriptions • Etc. → (Understanding) → Valuable Information: • User Interest • User Experience • Product Features • Etc.
  3. 3. The number of products has risen 2.6 times compared to five years ago. • 100M products in 2012 [1] to 258M products in 2017 [2]. How much time would you need to read the descriptions of 258M products? A. 4 years B. 40 years C. 400 years D. 4,000 years. [1] https://corp.rakuten.co.jp/about/history.html [2] https://www.rakuten.co.jp/ (as of October 2nd, 2017)
  4. 4. The number of products has risen 2.6 times compared to five years ago. • 100M products in 2012 [1] to 258M products in 2017 [2]. How much time would you need to read the descriptions of 258M products? A. 4 years B. 40 years C. 400 years D. 4,000 years. Technology to organize big data for text is critical! [1] https://corp.rakuten.co.jp/about/history.html [2] https://www.rakuten.co.jp/ (as of October 2nd, 2017)
  5. 5. • Information Extraction from Product Data • Sentiment Analysis on Review Data
  6. 6. Unstructured Data: Crafted from sleek spazzolato leather (black), the Dorian shopper is an elegant carryall that's perfect for your essentials. 10"H x 13"L x 6"D. RALPH LAUREN. Structured Data (Attribute: Value): Brand: Ralph Lauren; Color: Black; Material: Leather; Size: 10"H x 13"L x 6"D. Application: Faceted Navigation / Recommendation / Market Research. The bag image is designed by Freepik (http://www.freepik.com/free-vector/set-of-woman-s-bags-in-flat-style_960523.htm)
  7. 7. Difficulty: • Ambiguity: パーカー (luxury pen brand and hoodie), PUMA (sports brand and knife brand) • Diversity (long tail): 風と光 (a natural foods company). Dictionary-based approach: • Easy to control system behavior by editing entries in the dictionary. • Easy to understand errors.
  8. 8. Input: Product Titles and their Genres. Morphological Analysis: • Tokenization • PoS Tagging. Extraction (uses the Brand Dictionary): • List tokens matched with the dictionary entries. • Extract the leftmost candidate. Normalization (uses the Synonym Dictionary): • Retrieve brand IDs corresponding to extracted brands. Output: Data with Brand IDs.
  9. 9. Brand Dictionary (Brand expression | Relevant Genre): 力王 | Unknown; 中部電磁器工業 | Computers & Networking; キメラパーク | Unknown; シュガーローズ | Women's Clothing; サスクワッチファブリックス | Women's Clothing; 藤栄 | Home Decor, Housewares & Furniture; ミキモト | Unknown; エドウィンゴルフ | Sports & Outdoors; AKI WORLD | Sports & Outdoors; 工房飛竜 | Toys, Hobbies & Games; パーカー | Home & Office Supplies; ハイライトキャバレー | Men's Clothing; 杉野 | Unknown; カウネット | Kitchen, Dining & Bar. Brand expressions are stored with their relevant genres. • 190K entries. Relevant genres are critical for disambiguation. • Use only brand expressions whose relevant genre matches the genre of the given product. • Retrieve パーカー only for products in Home & Office Supplies.
  10. 10. Screenshot of https://item.rakuten.co.jp/brandol-ec/gu-295710-j8400-8106/ as of October 10th, 2017.
  11. 11. 1. Create training data (Brand dictionary + Product data in ICHIBA → Annotated text). 2. Train and run machine learning models (→ Candidates). 3. Assign new relevant genres to existing brands. 4. Check manually (→ New entries). 5. Update (→ Brand dictionary).
  12. 12. Input: Product Titles and their Genres. Morphological Analysis: • Tokenization • PoS Tagging. Extraction (uses the Brand Dictionary): • List tokens matched with the dictionary entries. • Extract the leftmost candidate. Normalization (uses the Synonym Dictionary): • Retrieve brand IDs corresponding to extracted brands. Output: Data with Brand IDs.
  13. 13. Synonym Dictionary (Genre | ID | Synonym): … ; Shoes, Bags,… | B2449 | NIKE, ナイキ; Electronics | B2450 | SONY, ソニー; … Extraction Results (Genre | Product | Brand ID | Label): Shoes | ナイキ | B2449 | NIKE; Shoes | NIKE | B2449 | NIKE; Bags | NIKE | B2449 | NIKE; Interior | ナイキ | -- | ナイキ. Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm) The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm) The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
  14. 14. Synonym Dictionary (Genre | ID | Synonym): … ; Shoes, Bags,… | B2449 | NIKE, ナイキ; Electronics | B2450 | SONY, ソニー; … Extraction Results (Genre | Product | Brand ID): Shoes | ナイキ | B2449; Shoes | NIKE | B2449; Bags | NIKE | B2449; Interior | ナイキ | --. Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm) The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm) The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
  15. 15. Synonym Dictionary (Genre | ID | Synonym): … ; Shoes, Bags,… | B2449 | NIKE, ナイキ; Electronics | B2450 | SONY, ソニー; … Extraction Results (Genre | Product | Brand ID): Shoes | ナイキ | B2449; Shoes | NIKE | B2449; Bags | NIKE | B2449; Interior | ナイキ | B3510 (NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html). Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm) The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm) The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
  16. 16. Synonym Dictionary (Genre | ID | Synonym): … ; Shoes, Bags,… | B2449 | NIKE, ナイキ; Electronics | B2450 | SONY, ソニー; … Extraction Results (Genre | Product | Brand ID): Shoes | ナイキ | B2449; Shoes | NIKE | B2449; Bags | NIKE | B2449; Interior | ナイキ | B3510 (NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html). Information on when an entry can be used is important. Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm) The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm) The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
  17. 17. Find candidates automatically, and then check them manually. • JAN code • Wikipedia • Semantic similarity. Result: 206K triplets of <genre, brand id, synonyms>.
  18. 18. Brands were manually assigned to 500 randomly selected product titles. • Percentage of product titles including brands: 69.6% (348/500). Performance: • Precision: 89.2% (224/251) • Recall: 64.4% (224/348). We can automatically extract correct brands for 100M of the 260M products!
  19. 19. • Information Extraction from Product Data • Sentiment Analysis on Review Data
  20. 20. "I ordered this a week ago, but no response from the store." 176,502 reviews. Aspects: Stock, Information, Payment, Service, Package, Shipping. Snapshot of https://review.rakuten.co.jp/shop/4/261122_261122/cpmj-i0h5i-97x3lm_1_1/?l2-id=review_PC_sl_body_05 as of October 16th, 2017.
  21. 21. • What aspects should we design? • How do we develop the system to perform it? Input (Merchant Review): s1: Item was nicely packaged. s2: A tracking # was given, but never worked. s3: Will shop again. Output (Aspect / Sentiment Polarity): s1: Package / Pos; s2: Shipping / Neg; s3: Repeat / Pos. The robot image is designed by Freepik (http://www.freepik.com/free-vector/cute-robots-collection_713858.htm)
  22. 22. # | Aspect | Example: 1 | 配送 (Shipping) | 迅速な配送ありがとうございました。 (Thank you for the quick shipping.); 2 | 対応 (Service) | 今まで買い物した店舗で一番対応が遅かった。 (I've never seen such slow service!); 3 | 連絡 (Communication) | 注文受付の自動送信メールが届いたきり一週間何の連絡もなし。 (No contact for a week after ordering it.); 4 | 店舗 (Shop) | 信頼できるショップ様でした。 (They are a reliable store.); 5 | 商品 (Item) | 安全に使用できそうで、これからが楽しみです。 (I'm looking forward to using this product.); 6 | リピート (Repeat) | また利用したいと思います。 (I'm going to purchase an item again.); 7 | 梱包 (Package) | 梱包も破損のないよう、しっかりとされていました。 (It was tightly packaged to prevent damage.)
  23. 23. # | Aspect | Example: 8 | 品揃え (Stock/variety) | 商品が多いので助かります。 (They have a big inventory.); 9 | 情報 (Information) | マネキンの身長を記載してあったのでかなり参考になりました。 (The description about the height of a mannequin is very useful.); 10 | キャンセル/返品 (Cancel/return) | しかしたまに断りなく遅れたりキャンセルされている点に不満です。 (I'm not satisfied because they suddenly canceled without any notification.); 11 | 価格 (Price) | 商品が安く、購入でき、まんぞくです。 (I'm satisfied with purchasing the item at a low price.); 12 | 楽天 (Rakuten) | 楽天の全サービスに信用がなくなりました。 (Because of this experience, I can't trust any services in Rakuten.); 13 | 支払い (Payment) | 決済方法にEdyが使える方がよいと思います。 (It would be better if Rakuten Edy were acceptable.); 14 | その他 (Other) | レビューがもう少し増えるといいですね。 (I hope the number of reviews increases.)
  24. 24. • What aspects should we design? • How do we develop the system to perform it? Input (Merchant Review): s1: Item was nicely packaged. s2: A tracking # was given, but never worked. s3: Will shop again. Output (Aspect / Sentiment Polarity): s1: Package / Pos; s2: Shipping / Neg; s3: Repeat / Pos. The robot image is designed by Freepik (http://www.freepik.com/free-vector/cute-robots-collection_713858.htm)
  25. 25. Annotated 1,510 reviews (5,277 sentences). • 配送も迅速で良かったです。 (I was very pleased at how quickly I received it.) → Shipping/Positive • いつになっても商品が来ず、問い合わせても返信がない。 (No shipment, no reply to inquiry.) → Shipping/Negative, Communication/Negative. Cost: 103 hours for a well-trained annotator.
  26. 26. Train models using the passive-aggressive algorithm and CRF. Features: • Bag-of-words, aspect dictionary, sentiment polarity dictionary, and syntactic information. Performance: • Aspect classification: Precision 82.6%, Recall 46.8% • Sentiment classification: Precision 84.8%, Recall 77.5%
  27. 27. • It is important to develop techniques that automatically pull valuable information from Big Data for Text. • e.g., reviews → users' experience • Rakuten develops techniques in-house to exploit Big Data for Text in its services. • Information extraction from product descriptions • Sentiment analysis on reviews of merchants
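
The dictionary-based brand extraction on slides 8 to 16 (match tokens against the brand dictionary, keep only entries whose relevant genre matches the product's genre, take the leftmost candidate, then look up a brand ID in the synonym dictionary) can be sketched roughly as follows. This is a minimal illustration, not Rakuten's implementation: the function names, the tiny dictionaries, and the sample titles are hypothetical, and whitespace splitting stands in for the morphological analysis the slides describe.

```python
from typing import List, Optional, Tuple

# Brand dictionary: brand expression -> genres in which it counts as a brand.
# Entries are illustrative, loosely based on the examples in the slides.
BRAND_DICTIONARY = {
    "NIKE": {"Shoes", "Bags"},
    "ナイキ": {"Shoes", "Bags", "Interior"},
    "パーカー": {"Home & Office Supplies"},
}

# Synonym dictionary: (genre, brand expression) -> brand ID.
SYNONYM_DICTIONARY = {
    ("Shoes", "NIKE"): "B2449",
    ("Shoes", "ナイキ"): "B2449",
    ("Bags", "NIKE"): "B2449",
    ("Bags", "ナイキ"): "B2449",
    ("Interior", "ナイキ"): "B3510",
}


def tokenize(title: str) -> List[str]:
    """Assumed stand-in for morphological analysis: split on whitespace."""
    return title.split()


def extract_brand(title: str, genre: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (brand expression, brand ID or None) for the leftmost match."""
    for token in tokenize(title):  # left to right, so the first hit is leftmost
        # Genre check: the expression must be a brand in this product's genre.
        if genre in BRAND_DICTIONARY.get(token, set()):
            # Normalization: the brand ID depends on both genre and expression.
            return token, SYNONYM_DICTIONARY.get((genre, token))
    return None
```

With these sample entries, `extract_brand("パーカー メンズ", "Men's Clothing")` returns `None`, because パーカー is registered only for Home & Office Supplies, while ナイキ resolves to a different brand ID in Interior than in Shoes.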
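
The evaluation figures on slide 18 can be reproduced from the raw counts. The sketch below recomputes precision and recall and then shows one plausible reading of the "100M of 260M products" claim: scaling the share of sampled titles that received a correct brand (224/500) to the full catalog. That scaling step is an assumption about how the figure was derived, not something the slides state, and the variable names are illustrative.

```python
# Counts reported on slide 18.
SAMPLED_TITLES = 500       # randomly selected product titles
TITLES_WITH_BRAND = 348    # titles that actually contain a brand
EXTRACTED = 251            # titles for which the system output a brand
CORRECT = 224              # extracted brands that were correct

precision = CORRECT / EXTRACTED        # correct / everything extracted
recall = CORRECT / TITLES_WITH_BRAND   # correct / everything extractable

# Assumed extrapolation: share of all sampled titles with a correct brand,
# applied to the 260M products mentioned on the slide.
TOTAL_PRODUCTS = 260_000_000
estimated_correct = CORRECT / SAMPLED_TITLES * TOTAL_PRODUCTS

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
print(f"estimated products with a correct brand = {estimated_correct:,.0f}")
```

Under this reading the estimate comes to a little over 116M products, which is consistent with the rounded "100M" on the slide.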

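
Slide 26 says the review models are trained with the passive-aggressive algorithm and CRF. As a rough illustration of the passive-aggressive part only, here is a minimal binary sentiment classifier using the PA-I update rule on bag-of-words features. It is a sketch under stated assumptions: the training sentences are the English examples from slides 21 and 25, the function names are hypothetical, and the aspect dictionary, sentiment polarity dictionary, syntactic features, and CRF mentioned on the slide are all omitted.

```python
import re
from collections import Counter, defaultdict

# (sentence, label) pairs: +1 = positive, -1 = negative (from slides 21 and 25).
TRAINING_DATA = [
    ("Item was nicely packaged.", 1),
    ("A tracking # was given, but never worked.", -1),
    ("Will shop again.", 1),
    ("I was very pleased at how quickly I received it.", 1),
    ("No shipment, no reply to inquiry.", -1),
]


def featurize(sentence):
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def train(data, epochs=5, c=1.0):
    """PA-I: update only when the hinge loss is positive, step capped at c."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for sentence, label in data:
            x = featurize(sentence)
            score = sum(weights[t] * v for t, v in x.items())
            loss = max(0.0, 1.0 - label * score)
            if loss > 0.0:
                tau = min(c, loss / sum(v * v for v in x.values()))
                for t, v in x.items():
                    weights[t] += tau * label * v
    return weights


def predict(weights, sentence):
    """Return 'Pos' if the linear score is positive, otherwise 'Neg'."""
    score = sum(weights.get(t, 0.0) * v for t, v in featurize(sentence).items())
    return "Pos" if score > 0 else "Neg"


WEIGHTS = train(TRAINING_DATA)
```

A production system would need the full annotated set (1,510 reviews on slide 25) and a multi-class variant for the 14 aspects; five sentences are only enough to show the update rule.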