
Elasticsearch in Hatena Bookmark

Slides presented at the 11th Elasticsearch Study Meetup (第11回Elasticsearch勉強会)

Published in: Technology

  1. 1. Elasticsearch in Hatena Bookmark Shunsuke KOZAWA
  2. 2. About Me ● Shunsuke KOZAWA ○ Hatena id: skozawa ○ Twitter: @5kozawa ● 2007 - 2012 ○ Research: Natural Language Processing ○ Ph.D. in Information Science ● 2012 - ○ Hatena Inc. ■ Hatena Bookmark ■ Ad-tech
  3. 3. Hatena Bookmark Social Bookmark Service
  4. 4. Search Engine History in Hatena Bookmark ● 2005 - 2007: MySQL LIKE ● 2008 - 2012: Sedue (by Preferred Infrastructure) ● 2012 - 2014/06: Solr ● 2014/06 - : Elasticsearch ref. http://bookmark.hatenastaff.com/entry/2014/06/27/180000
  5. 5. System Architecture
  6. 6. Mapping (partial) of Hatena Bookmark
     {
       "entry": {
         "properties": {
           "url": { "type": "string" },
           "title": { "type": "string" },
           "content": { "type": "string" },
           "count": { "type": "integer" },
           "created": { "type": "date" },
           "bookmark": { … }
         }
       }
     }
     where "bookmark" is a nested field:
     "bookmark": {
       "type": "nested",
       "properties": {
         "user": { "type": "string" },
         "tag": { "type": "string" },
         "comment": { "type": "string" },
         "created": { "type": "date" }
       }
     }
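Because `bookmark` is mapped as `nested`, a tag lookup would need a `nested` query so that conditions are evaluated against one bookmark object at a time. The slides do not show the tag query itself, so the following is only a minimal sketch: `build_tag_query` is a hypothetical helper name, and the use of a `match` clause on `bookmark.tag` is an assumption.

```python
import json


def build_tag_query(tag):
    """Build a nested query matching entries that have a bookmark tagged `tag`.

    Assumption: tags are matched with a `match` clause; the original
    slides do not show which clause Hatena Bookmark actually uses.
    """
    return {
        "query": {
            "nested": {
                "path": "bookmark",  # the nested field from the mapping
                "query": {"match": {"bookmark.tag": tag}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tag_query("elasticsearch"), indent=2))
```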
  7. 7. Features powered by Elasticsearch ● Entry Search ○ Tag Search ○ Title Search ○ Content Search ○ URL Search ● Related Entry ● Issue ● Topic ● Bookmark Counter
  8. 8. Tag/Title Search
  9. 9. Tag/Title Search Search by “Elasticsearch”
  10. 10. Tag/Title Search ● Sorting ● Filter by the number of bookmarks ● Filter by timestamp
  11. 11. Tag/Title Search
      {
        "sort": { "created": "desc" },
        "query": {
          "filtered": {
            "query": {
              "bool": {
                "must": [ { "match_phrase": { "title": "elasticsearch" } } ]
              }
            },
            "filter": {
              "bool": {
                "must": [
                  { "range": { "count": { "gte": 3 } } },
                  { "range": { "created": { "from": "2015-05-01T00:00:00", "to": "2015-07-15T00:00:00" } } }
                ]
              }
            }
          }
        }
      }
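The title search request combines a phrase match with two range filters and a sort on `created`. A small Python sketch of a builder for that request body follows; `build_title_search` and its parameter names are hypothetical, and the `filtered` query is the Elasticsearch 1.x form used at the time of the talk.

```python
def build_title_search(phrase, min_count, created_from, created_to):
    """Build a title search body: phrase match, filtered by count and date.

    Hypothetical helper; field names follow the mapping shown in the slides.
    """
    filters = [
        # Keep only entries with at least `min_count` bookmarks.
        {"range": {"count": {"gte": min_count}}},
        # Keep only entries created inside the given time window.
        {"range": {"created": {"from": created_from, "to": created_to}}},
    ]
    return {
        "sort": {"created": "desc"},  # newest entries first
        "query": {
            "filtered": {
                "query": {
                    "bool": {"must": [{"match_phrase": {"title": phrase}}]}
                },
                "filter": {"bool": {"must": filters}},
            }
        },
    }


if __name__ == "__main__":
    body = build_title_search(
        "elasticsearch", 3, "2015-05-01T00:00:00", "2015-07-15T00:00:00"
    )
    print(body)
```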
  12. 12. Content Search
  13. 13. Concept Search ● Simple Content Search ○ High recall, but low precision ○ Precision is important in Hatena Bookmark ● Concept Search ○ Query Expansion ■ Use search results retrieved by tag search ■ Expand queries with TF-IDF, IDF, and RIDF ● Term Vector API ○ Retrieve using expanded queries ■ e.g. 「京都」 (Kyoto) -> 「祇園、寺、神社、桜、京、...」 (Gion, temple, shrine, cherry blossom, capital, ...) ref. "Improving the precision of full-text search in Hatena Bookmark" (はてなブックマークの全文検索の精度改善) https://speakerdeck.com/takuyaa/hatenabutukumakuquan-wen-jian-suo-falsejing-du-gai-shan
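The query expansion step can be illustrated with plain Python. The slides do not say how TF-IDF, IDF, and RIDF are combined, so this is only a sketch under assumptions: `expansion_terms` ranks terms from the tag-search hits by TF-IDF alone, `ridf` computes residual IDF in its usual Poisson-based form, and the term statistics are passed in directly rather than fetched through the Term Vector API. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def expansion_terms(docs, corpus_df, corpus_size, top_k=5):
    """Rank candidate expansion terms by TF-IDF.

    docs: list of token lists (assumed to come from tag-search hits)
    corpus_df: term -> number of documents in the index containing it
    corpus_size: total number of documents in the index
    """
    tf = Counter(token for doc in docs for token in doc)
    scores = {}
    for term, freq in tf.items():
        df = corpus_df.get(term, 1)  # unseen terms are treated as very rare
        scores[term] = freq * math.log(corpus_size / df)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def ridf(df, cf, n):
    """Residual IDF: observed IDF minus the IDF expected under a Poisson model.

    df: document frequency, cf: collection frequency, n: number of documents.
    """
    observed_idf = -math.log2(df / n)
    expected_idf = -math.log2(1.0 - math.exp(-cf / n))
    return observed_idf - expected_idf


if __name__ == "__main__":
    hits = [["kyoto", "temple", "the"], ["temple", "shrine", "the"]]
    df = {"kyoto": 10, "temple": 20, "shrine": 15, "the": 1000}
    print(expansion_terms(hits, df, corpus_size=1000, top_k=3))
```

Terms that occur in nearly every document, such as "the" here, score close to zero and drop out of the expansion, which is the point of weighting by IDF.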
  14. 14. URL Search http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F http://www.elastic.co/
  15. 15. URL Search http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F
      {
        "query": {
          "filtered": {
            "filter": {
              "bool": {
                "should": [ { "prefix": { "url": "http://www.elastic.co/" } } ]
              }
            }
          }
        }
      }
      http://www.elastic.co/
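The step from the `entrylist` page URL to the prefix filter can be sketched as follows. This is an illustration only: `build_url_search` is a hypothetical helper, and it assumes the target URL is passed in the `url` query parameter as shown on the slide.

```python
from urllib.parse import parse_qs, urlparse


def build_url_search(entrylist_url):
    """Turn an entrylist page URL into a prefix-filter request body."""
    params = parse_qs(urlparse(entrylist_url).query)
    target = params["url"][0]  # parse_qs percent-decodes the value
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {"should": [{"prefix": {"url": target}}]}
                }
            }
        }
    }


if __name__ == "__main__":
    page = "http://b.hatena.ne.jp/entrylist?url=http%3A%2F%2Fwww.elastic.co%2F"
    print(build_url_search(page))
```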
  16. 16. URL Subdomain Search hatenablog.com *.hatenablog.com
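The slide does not show how subdomain search is implemented in Elasticsearch, so the sketch below only illustrates the matching rule it implies: a search for `hatenablog.com` should cover the bare domain and any `*.hatenablog.com` host. `host_matches` is a hypothetical helper, not part of the original system.

```python
from urllib.parse import urlparse


def host_matches(url, domain):
    """Return True if the URL's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    # The leading dot prevents "nothatenablog.com" from matching.
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    print(host_matches("http://skozawa.hatenablog.com/entry/1", "hatenablog.com"))
```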
  17. 17. Related Entry ref. "Development of related-article recommendation based on Hatena Bookmark" (はてなブックマークに基づく関連記事レコメンドの開発) http://www.slideshare.net/shunsukekozawa5/hatena-engineer-seminar-5
  18. 18. Issue ● Made by editors at Hatena ● Entries in special features
  19. 19. Issue ● Writing Query DSL is hard for non-engineers ● Made by editors at Hatena ● Entries in special features
  20. 20. Edit page for Issue
  21. 21. Edit page for Issue Friendly for non-engineers
  22. 22. Edit page for Issue Friendly for non-engineers; the form input is translated into Query DSL:
      {
        "query": {
          "bool": {
            "must": [ { "range": { "count": { "gte": 5 } } } ],
            "should": [ (tags, keywords, urls) ],
            "must_not": [ (tags, keywords, urls) ],
            "minimum_should_match": 1
          }
        },
        "sort": { "created": "desc" }
      }
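The translation from form input to Query DSL could look roughly like the sketch below. The slide only gives the outer `bool` structure with `(tags, keywords, urls)` placeholders, so the clause chosen for each kind of condition (a nested `match` for tags, `match_phrase` on `title` for keywords, `prefix` for URLs) is an assumption, and `to_clause` and `build_issue_query` are hypothetical names.

```python
def to_clause(kind, value):
    """Map one form condition to a query clause (clause choice is assumed)."""
    if kind == "tag":
        return {
            "nested": {
                "path": "bookmark",
                "query": {"match": {"bookmark.tag": value}},
            }
        }
    if kind == "keyword":
        return {"match_phrase": {"title": value}}
    if kind == "url":
        return {"prefix": {"url": value}}
    raise ValueError("unknown condition kind: %r" % (kind,))


def build_issue_query(should=(), must_not=(), min_count=5):
    """Translate (kind, value) conditions from the edit page into a bool query."""
    return {
        "query": {
            "bool": {
                "must": [{"range": {"count": {"gte": min_count}}}],
                "should": [to_clause(k, v) for k, v in should],
                "must_not": [to_clause(k, v) for k, v in must_not],
                # At least one of the `should` conditions has to match.
                "minimum_should_match": 1,
            }
        },
        "sort": {"created": "desc"},
    }


if __name__ == "__main__":
    print(build_issue_query(
        should=[("tag", "elasticsearch"), ("url", "http://www.elastic.co/")],
        must_not=[("keyword", "spam")],
    ))
```

Keeping the mapping in one small function means editors only pick a condition type and a value, while the DSL details stay in code.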
  23. 23. Topic Estimate topics from entries in Hatena Bookmark
  24. 24. Topic Page Entries related to the topic
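  25. 25. Topic by Elasticsearch ● Acquire topic keywords ○ Two-layered Significant Terms Aggregation ● Acquire entries related to the topic ○ Function Score Query ○ Retrieve using topic keywords and their scores 官邸、首相、ドローン、落下、カメラ (official residence, prime minister, drone, fall, camera) ● 首相官邸にドローン落下 けが人はなし :日本経済新聞 (Drone falls on the Prime Minister's Official Residence, no injuries: Nikkei) ● 首相官邸の屋上にドローン落下、微量の放射線を検出| Reuters (Drone falls on the roof of the Prime Minister's Official Residence, trace radiation detected | Reuters) ref. "How Hatena Bookmark's topic pages are made" (はてなブックマークのトピックページの作り方) http://codezine.jp/article/detail/8767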
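The two request shapes named on this slide can be sketched as Python dictionaries. The slides do not give the actual requests, so everything specific here is an assumption: the aggregated field (`title`), the aggregation names, the sizes, and the way keyword scores are applied as `weight` functions with `score_mode` set to `sum`. The helper names are hypothetical.

```python
def build_topic_keywords_agg(field="title", size=10):
    """Two-layered significant_terms: an outer layer with a nested inner layer."""
    return {
        "size": 0,  # only aggregation results are needed, not hits
        "aggs": {
            "topics": {
                "significant_terms": {"field": field, "size": size},
                "aggs": {
                    "keywords": {
                        "significant_terms": {"field": field, "size": size}
                    }
                },
            }
        },
    }


def build_topic_entries_query(keyword_scores, field="title"):
    """Score entries by the summed weights of the topic keywords they contain."""
    functions = [
        {"filter": {"term": {field: keyword}}, "weight": score}
        for keyword, score in keyword_scores.items()
    ]
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "should": [
                            {"term": {field: keyword}}
                            for keyword in keyword_scores
                        ],
                        "minimum_should_match": 1,
                    }
                },
                "functions": functions,
                "score_mode": "sum",      # add up the matching keyword weights
                "boost_mode": "replace",  # use the function score as the final score
            }
        }
    }


if __name__ == "__main__":
    print(build_topic_keywords_agg())
    print(build_topic_entries_query({"drone": 2.0, "camera": 0.5}))
```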
  26. 26. Bookmark Counter ● Count the number of bookmarks on a website ○ Count by Sum Aggregation ○ e.g. http://d.hatena.ne.jp/
      {
        "query": {
          "prefix": { "url": "http://d.hatena.ne.jp/" }
        },
        "aggs": {
          "total_count": {
            "sum": { "field": "count" }
          }
        }
      }
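A short Python sketch of the counter: one helper builds the request body from the slide, and another reads the sum out of the aggregation result. `build_bookmark_counter` and `read_total` are hypothetical names, `"size": 0` is an added assumption to skip returning hits, and the response dictionary in the example is hand-written for illustration rather than real output.

```python
def build_bookmark_counter(site_url):
    """Sum the `count` field over all entries whose URL starts with `site_url`."""
    return {
        "size": 0,  # assumption: hits are not needed, only the aggregation
        "query": {"prefix": {"url": site_url}},
        "aggs": {"total_count": {"sum": {"field": "count"}}},
    }


def read_total(response):
    """Extract the summed bookmark count from a search response dict."""
    # A sum aggregation reports its result as a float under "value".
    return int(response["aggregations"]["total_count"]["value"])


if __name__ == "__main__":
    print(build_bookmark_counter("http://d.hatena.ne.jp/"))
    # Hand-written response shape, for illustration only.
    print(read_total({"aggregations": {"total_count": {"value": 1234.0}}}))
```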
  27. 27. Conclusion ● Elasticsearch in Hatena Bookmark ● Features powered by Elasticsearch ○ Tag / Title / Content / URL Search ○ Related entry ○ Issue ○ Topic ○ Bookmark Counter
