
Oktavia Search Engine - pyconjp2014

  1. Yoshiki Shibukawa, DeNA Co., Ltd. PyConJP, 9/14/2014.
  2. Yoshiki Shibukawa. Works for DeNA Co., Ltd. Contact: @shibu_jp (Twitter), yoshiki.shibukawa (Facebook), yoshiki@shibu.jp (mail). Languages: C/C++, Python, JavaScript. Founder of sphinx-users.jp. San Francisco -> Tokyo.
  3. The basics of existing search engines. The structure of Oktavia. Oktavia API examples.
  4. In some cases, an inverted index is not a good fit for East Asian languages. FM-index is a completely different search algorithm. I published a new PyPI module yesterday. It includes only the essential part of Oktavia. I will add more features.
  5. AM.txt (0): Good morning; Hi. PM.txt (1): Good afternoon; Good evening; Hi.
  6. Inverted index (word -> document IDs): Good -> 0, 1; Morning -> 0; Afternoon -> 1; Evening -> 1; Hi -> 0, 1. To search, split the query string into words, look up each word in the table, and intersect the results: Good Morning → (0, 1) and (0,) → (0,)
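The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Oktavia code: the names `build_inverted_index` and `search` are hypothetical, and lowercasing the words is an assumption made so that "Good Morning" matches "Good morning".

```python
from collections import defaultdict

# The two sample documents from the previous slide, keyed by document ID.
documents = {
    0: ["Good morning", "Hi"],                    # AM.txt
    1: ["Good afternoon", "Good evening", "Hi"],  # PM.txt
}


def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, lines in docs.items():
        for line in lines:
            for word in line.lower().split():
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_inverted_index(documents)
print(sorted(search(index, "Good Morning")))  # [0]
```

Splitting on whitespace is the step that fails for text without spaces, which is the problem the following slides address.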
  7. English: It is nice weather to go out to PyConJP. Chinese: 这是不错的天气出去PyConJP. Japanese: 今日はPyConJPに出かけるにはいい天気ですね. Korean※: 그것은 PyConJP 에 외출 좋은 날씨 입니다. ※Korean has spaces between groups of words, but not between each word.
  8. 今日はPyConJPに出かけるにはいい天気ですね → 今日|は|PyConJP|に|出かける|に|は|いい|天気|です|ね. Split the text into words with a natural language processor such as ChaSen, MeCab, or Kuromoji. This needs deep knowledge of each language and a big dictionary.
  9. Word -> Doc ID: 今日 -> 0; は -> 0, 0; PyConJP -> 0; に -> 0, 0; 出かける -> 0; いい -> 0; 天気 -> 0; です -> 0; ね -> 0. The document becomes a list of words, so it can use the same inverted index backend. The same word splitter is needed both when creating the index and when searching.
  10. N-gram (2-gram, 3-gram): split a query word into fixed-length strings, then search for each chunk, using each chunk as a word. 2-gram: こんにちは → こん|んに|にち|ちは. 3-gram: こんにちは → こんに|んにち|にちは.
  11. Word -> (Doc ID, Position): こん -> (0, 0); んに -> (0, 1); にち -> (0, 2); ちは -> (0, 3). It can still use the inverted index algorithm. The index file becomes big. It can’t handle words shorter than the chunk size. こんにちは → こん / んに / にち / ちは → (0, 0) / (0, 1) / (0, 2) / (0, 3) → (0, 0)
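The 2-gram table can be reproduced with a short Python sketch. The function names (`ngrams`, `build_ngram_index`, `search_ngram`) are hypothetical and chosen for illustration; the index stores (document ID, position) pairs as on the slide, and a hit requires the chunks to appear at consecutive positions.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Split text into overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n=2):
    """Map each chunk to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, chunk in enumerate(ngrams(text, n)):
            index[chunk].add((doc_id, pos))
    return index


def search_ngram(index, query, n=2):
    """Return sorted (doc_id, position) pairs where the whole query occurs."""
    chunks = ngrams(query, n)
    if not chunks:
        return []  # the query is shorter than the chunk size
    hits = []
    for doc_id, pos in index.get(chunks[0], set()):
        # Every following chunk must sit exactly one position further on.
        if all((doc_id, pos + k) in index.get(chunk, set())
               for k, chunk in enumerate(chunks)):
            hits.append((doc_id, pos))
    return sorted(hits)


index = build_ngram_index({0: "こんにちは"})
print(search_ngram(index, "こんにちは"))  # [(0, 0)]
print(search_ngram(index, "こ"))          # []
```

The second call shows the limitation named on the slide: a one-character query produces no 2-gram chunk, so nothing can be looked up.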
  12. Inverted index summary. Languages that have spaces: split the document by spaces (simple, but spaces are needed). East Asian languages: N-gram (still simple, but the index becomes huge) or NLP (works perfectly with Asian languages, but an NLP processor and a dictionary are needed).
  13. It provides a search engine for the browser. Inverted index. It didn’t support Japanese. I sent some patches. But they were not enough…
  14. FM-index was developed by Paolo Ferragina and Giovanni Manzini. FM-index is not popular in Western countries. It is completely different from existing algorithms. Existing algorithms are enough for Western languages. It is popular in genome analysis. I made a new search engine using this algorithm.
  15. Estimated Time: 15min
  16. Oktavia is a search engine that works in the web browser. It is written in Python and JSX (an altJS made by DeNA; see http://jsx.github.io/). It uses FM-index as its backend search algorithm.
  17. JSX is similar to ActionScript 3: class statements (no prototype!), strict type checking, no “this” hell, performance optimization.
  18. FM-index is the fastest algorithm that uses a compressed index file. FM-index doesn’t need word splitting.
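To make the idea concrete, here is a toy Python sketch of the core of FM-index: a Burrows-Wheeler transform plus backward search. It is not Oktavia's implementation. The names `bwt`, `build_fm_index`, and `locate` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, whereas a real FM-index keeps these structures in compressed form.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform of text and its suffix array."""
    text += "\0"  # sentinel that sorts before every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sort
    return "".join(text[i - 1] for i in sa), sa


def build_fm_index(text):
    """Build the C table, the occurrence table, and the suffix array."""
    last, sa = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in the text that sort before ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in last[:i]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ, sa


def locate(fm, pattern):
    """Return the sorted start positions of pattern, using backward search."""
    c_table, occ, sa = fm
    if not pattern:
        return []
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


text = "今日はPyConJPに出かけるにはいい天気ですね"
fm = build_fm_index(text)
print(locate(fm, "天気"))  # [19]
print(locate(fm, "に"))    # [10, 15]
```

The query is matched character by character from its last character to its first, so no word splitter or dictionary is involved at index time or at search time.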
  19. Oktavia adds extra information: it adds region information to the source text. Example: Ep4.txt ("Use the Force, Luke.") and Ep5.txt ("No, I am your father."). You can add as much metadata as you want. Metadata types: Section (documents and sections), Block (code blocks and so on), Splitter (word splitter), Table (rows and columns).
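One way to picture region information is a sorted list of start offsets over the concatenated source text, so that a hit position can be mapped back to its section. The sketch below is an assumption about how such a lookup could work, not Oktavia's actual metadata format; the class `SectionIndex` and its methods are hypothetical.

```python
import bisect


class SectionIndex:
    """Hypothetical helper: remembers where each section starts in the text."""

    def __init__(self):
        self.text = ""
        self.starts = []  # start offset of each section, in ascending order
        self.names = []   # section name at the same list index

    def add(self, name, content):
        """Append a section and record the offset where it begins."""
        self.starts.append(len(self.text))
        self.names.append(name)
        self.text += content

    def section_at(self, position):
        """Return the name of the section that contains the given offset."""
        if not 0 <= position < len(self.text):
            raise IndexError("position is outside the indexed text")
        return self.names[bisect.bisect_right(self.starts, position) - 1]


sections = SectionIndex()
sections.add("Ep4.txt", "Use the Force, Luke.")
sections.add("Ep5.txt", "No, I am your father.")

hit = sections.text.find("father")  # stand-in for a position found by search
print(sections.section_at(hit))     # Ep5.txt
```

Because the search itself runs over one long string, the same lookup could serve blocks or table cells by keeping a separate offset list per metadata type.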
  20. Read source → generate index file (API) → read index file (API) → search result. CLI tool. Browser search program.
  21. Read source → generate index file (API) → read index file (API) → show search result. CLI tool. Browser search program. I published it yesterday. It supports Python 2.6, 2.7, 3.3, and 3.4.
  22. Use the Oktavia API to implement a search feature in your application.
  23. Build the JSX version:
      $ git clone git@github.com:shibukawa/oktavia.git
      $ cd oktavia
      $ npm install
      $ ./node_modules/.bin/grunt build
      The important outputs are web/bin/oktavia-jquery-ui.js and web/bin/oktavia-web-runtime.js.
  24. Creating the index: dump the index file in base64 encoding and create a file in the following style: var searchIndex = 'aGVsbG8gd29ybGQ…..='; Concatenate it with the JSX web search runtime (web/bin/oktavia-web-runtime.js). Add web/bin/oktavia-jquery-ui.js to your website. It reads the index and the runtime in a WebWorker, sends requests, and shows the results.
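The base64 dump step can be sketched with Python's standard `base64` module. The helper name `dump_index_as_js` is hypothetical, and the bytes `b"hello world"` are a placeholder for a real index file, which this sketch does not generate.

```python
import base64


def dump_index_as_js(index_bytes, variable_name="searchIndex"):
    """Wrap binary index data in a JavaScript variable declaration."""
    encoded = base64.b64encode(index_bytes).decode("ascii")
    return "var {0} = '{1}';".format(variable_name, encoded)


# Placeholder payload; a real index file would be read in binary mode instead.
js_source = dump_index_as_js(b"hello world")
print(js_source)  # var searchIndex = 'aGVsbG8gd29ybGQ=';
```

Base64 output contains only letters, digits, "+", "/", and "=", so it can be placed inside a single-quoted JavaScript string without escaping.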
  25. Estimated Time: 23min
  26. Oktavia provides APIs for building a better search engine of your own. The most important part for the user experience is adjusting the scoring (sorting and filtering). In some cases users feel that “not available” is important information, but in other cases it is just noise.
  27. I want to buy some bottles of wine as a gift! Cabernet Sauvignon [Sold Out], from France. Pinot noir [Sold Out], from Chile. Zinfandel [Sold Out], from USA. Photo by Josh Kenzer under CC-NC-SA.
  28. I want to buy the “My Little Pony DVD”! Season One $32. Season Two $32. Season Three [Sold out].
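The two shopping examples suggest that how "sold out" hits are treated should be a per-application choice. The sketch below shows one possible shape for that choice; the function `arrange`, the `in_stock` field, and the three policy names are all hypothetical and are not part of the Oktavia API.

```python
def arrange(results, sold_out="demote"):
    """Filter or reorder search results according to a sold-out policy."""
    if sold_out == "hide":
        # Gift shopping: unavailable items are just noise.
        return [r for r in results if r["in_stock"]]
    if sold_out == "demote":
        # Stable sort: in-stock items first, original order otherwise kept.
        return sorted(results, key=lambda r: not r["in_stock"])
    if sold_out == "keep":
        # Series shopping: "Season Three is sold out" is useful information.
        return list(results)
    raise ValueError("unknown policy: {0}".format(sold_out))


dvds = [
    {"name": "Season Three", "in_stock": False},
    {"name": "Season One", "in_stock": True},
    {"name": "Season Two", "in_stock": True},
]
print([r["name"] for r in arrange(dvds, "demote")])
# ['Season One', 'Season Two', 'Season Three']
```

With the "hide" policy the wine search from the previous slide would return an empty list, which is the signal to show a different message instead of three unavailable bottles.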
  29. Oktavia class (oktavia.py): main entry point for creating an index and searching. Metadata classes (metadata.py): Section, Block, Splitter, Table. Query and Result classes (TBD).
  30. Sorry, I am still working on it… In the future, the following code will work:
  31. In some cases, an inverted index is not a good fit for East Asian languages. FM-index is a completely different search algorithm. I published a new PyPI module yesterday. It includes only the essential part of Oktavia. I will add more features.
  32. Office hour: 13:40-14:10. Message: Facebook (yoshiki.shibukawa), Twitter (@shibu_jp, @shibukawa).
