DeNA Co, Ltd. 
Yoshiki Shibukawa 
9/14/2014 PyConJP
! Yoshiki Shibukawa 
! Works for DeNA Co, Ltd.
! @shibu_jp (twitter) 
! yoshiki.shibukawa (Facebook) 
! yoshiki@shibu.jp (mail) 
! Languages 
! C/C++, Python, JavaScript 
! Founder of sphinx-users.jp 
! San Francisco -> Tokyo
! The Basics of Existing Search Engines
! The structure of Oktavia 
! Oktavia API examples
! In some cases, an inverted index is not a good fit for East Asian languages.
! FM-index is a completely different search algorithm.
! I published a new PyPI module yesterday
! It includes only the essential part of Oktavia
! I will add more features.
AM.txt (0)
• Good morning
• Hi
PM.txt (1)
• Good afternoon
• Good evening
• Hi
Word        Document ID
Good        0, 1
Morning     0
Afternoon   1
Evening     1
Hi          0, 1
! Word -> Document
! Split the query string into words, look up each word in the table, and show the result.
Good Morning
→ (0, 1) and (0,)
→ (0,)
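The lookup above can be sketched in a few lines of Python. This is only an illustration of the general inverted index idea, not code from any particular search engine; the function names and the lowercase/punctuation normalization are assumptions made for the example.

```python
from collections import defaultdict

# The two sample documents from the slide, keyed by document ID.
documents = {
    0: "Good morning. Hi",                  # AM.txt
    1: "Good afternoon. Good evening. Hi",  # PM.txt
}

def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase (assumed normalization)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return the IDs of documents that contain every query word (AND search)."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= set(index.get(word, ()))
    return sorted(result)

index = build_index(documents)
print(index["good"])                  # [0, 1]
print(search(index, "Good Morning"))  # [0]
```

The query is split with the same tokenizer that was used for indexing, which is the property that becomes a problem for languages without spaces.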
English
• It is nice weather to go out to PyConJP.
Chinese
• 这是不错的天气出去PyConJP
Japanese
• 今日はPyConJPに出かけるにはいい天気ですね
Korean※
• 그것은 PyConJP 에 외출 좋은 날씨 입니다
※ Korean has spaces between groups of words, but not between individual words.
今日はPyConJPに出かけるにはいい天気ですね
→ 今日|は|PyConJP|に|出かける|に|は|いい|天気|です|ね
! Split words using a natural language processor such as ChaSen, MeCab, or Kuromoji.
! It needs deep knowledge of each language and a big dictionary.
Word        Doc ID
今日        0
は          0, 0
PyConJP     0
に          0, 0
出かける    0
いい        0
天気        0
です        0
ね          0
! The document becomes a list of words, so it can use the same inverted index backend.
! The same word splitter is needed both when creating the index and when searching.
! Split a query word into fixed-length strings, then search for each chunk
! Use each chunk as a word
! 2-gram
こんにちは → こん|んに|にち|ちは
! 3-gram
こんにちは → こんに|んにち|にちは
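As a minimal sketch, the chunking step is a sliding window over the string. The function name `ngrams` is chosen for this example only.

```python
def ngrams(text, n):
    """Return every overlapping substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("こんにちは", 2))  # ['こん', 'んに', 'にち', 'ちは']
print(ngrams("こんにちは", 3))  # ['こんに', 'んにち', 'にちは']
```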
Word    Doc/Pos ID
こん    (0, 0)
んに    (0, 1)
にち    (0, 2)
ちは    (0, 3)
! It can still use an inverted index algorithm.
! The index file becomes big.
! It can't handle words shorter than the chunk size.
こんにちは
→ こん / んに / にち / ちは
→ (0, 0) / (0, 1) / (0, 2) / (0, 3)
→ (0, 0)
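The search above can be sketched as follows, assuming a 2-gram index that stores a (document ID, position) pair for every chunk. A hit is reported only when the query's chunks occur at consecutive positions in the same document. All names here are illustrative.

```python
from collections import defaultdict

N = 2  # chunk size (2-gram)

def ngrams(text, n=N):
    """Return every overlapping substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs):
    """Map each chunk to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for pos, chunk in enumerate(ngrams(text)):
            index[chunk].add((doc_id, pos))
    return index

def search(index, query):
    """Return (doc_id, position) for each place the whole query occurs."""
    chunks = ngrams(query)
    if not chunks:
        return []  # a query shorter than the chunk size cannot be looked up
    hits = []
    for doc_id, pos in sorted(index.get(chunks[0], ())):
        # Every following chunk must sit at the next consecutive position.
        if all((doc_id, pos + k) in index.get(chunk, ())
               for k, chunk in enumerate(chunks)):
            hits.append((doc_id, pos))
    return hits

index = build_index(["こんにちは"])
print(search(index, "こんにちは"))  # [(0, 0)]
print(search(index, "にちは"))      # [(0, 2)]
print(search(index, "は"))          # []
```

The last call shows the limitation from the slide: a one-character query produces no 2-gram chunk, so it cannot be found.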
Inverted Index
Language type          Method                     Pros                                   Cons
Has spaces             Split document by space    Simple                                 Space is needed
East Asian language    N-gram                     Still simple                           Index becomes huge
East Asian language    NLP                        Works perfectly with Asian languages   NLP processor and dictionary are needed
! It provides a search engine for the browser.
! Inverted Index 
! It didn’t support Japanese. 
! I sent some patches. 
! But they were not enough…
! Developed by… 
! Paolo Ferragina 
! Giovanni Manzini 
! FM-index is not popular in western countries.
! It is completely different from existing algorithms.
! Existing algorithms are good enough for western languages.
! It is popular in genome analysis.
! I made a new search engine using this algorithm.
! A search engine that works in the web browser.
! Written in Python and JSX (an altJS made by DeNA; see http://jsx.github.io/ )
! It uses FM-index as its backend search algorithm.
! JSX is similar to ActionScript 3
! Class statement (no prototype!) 
! Strict type checking 
! No “this” hell 
! Performance optimization
! FM-index is the fastest algorithm that 
uses a compressed index file. 
! FM-index doesn’t need word splitting.
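To show why no word splitting is needed, here is a toy sketch of the FM-index backward search. It is not Oktavia's implementation: the suffix sort and the occurrence counting are deliberately naive, nothing is compressed, and the function names and the "\0" sentinel are assumptions for this example (the text must not contain "\0").

```python
def build_fm_index(text, sentinel="\0"):
    """Build a toy FM-index: suffix array, BWT string, and the C table."""
    text += sentinel  # unique terminator that sorts before every other character
    # Naive suffix sort; real implementations use much faster construction.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character that precedes each sorted suffix (wraps for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # C[ch] = number of characters in the text that are strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    return sa, bwt, c_table

def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i]. Naive scan; real indexes use rank structures."""
    return bwt.count(ch, 0, i)

def find(fm, pattern):
    """Return the sorted start positions of pattern, using backward search."""
    sa, bwt, c_table = fm
    lo, hi = 0, len(bwt)  # current suffix-array range [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

fm = build_fm_index("今日はPyConJPに出かけるにはいい天気ですね")
print(find(fm, "に"))       # [10, 15]
print(find(fm, "には"))     # [15]
print(find(fm, "PyConJP"))  # [3]
```

The pattern is matched character by character from its last character to its first, so any substring can be searched without a dictionary or a chunk size.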
! Oktavia adds extra information
! Add region information to source text.
  Ep4.txt: Use the Force, Luke.
  Ep5.txt: No, I am your father.
! You can add as much metadata as you like.
! Section (documents and sections)
! Block (code block and so on)
! Splitter (word splitter)
! Table (rows and columns)
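One way to picture region information: all source text is joined into one string, and the start offset of each region is remembered so that a match position can be mapped back to a name. The `Sections` class below is a hypothetical helper written for this illustration, not Oktavia's Section class, and `str.find` stands in for the real search.

```python
import bisect

class Sections:
    """Hypothetical helper: remembers where each named region starts."""

    def __init__(self):
        self.text = ""    # all region contents joined together
        self.starts = []  # start offset of each region, in ascending order
        self.names = []   # region names, parallel to self.starts

    def add(self, name, content):
        self.starts.append(len(self.text))
        self.names.append(name)
        self.text += content

    def name_at(self, position):
        """Return the name of the region that contains the character offset."""
        if not 0 <= position < len(self.text):
            raise IndexError(position)
        return self.names[bisect.bisect_right(self.starts, position) - 1]

sections = Sections()
sections.add("Ep4.txt", "Use the Force, Luke.")
sections.add("Ep5.txt", "No, I am your father.")

pos = sections.text.find("father")  # stand-in for an index search
print(sections.name_at(pos))        # Ep5.txt
```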
[Diagram: processing steps (Read Source, Generate Index, File API; Read Index, File API, Search, Result) for the CLI tool and the browser search program]
[Diagram: processing steps (Read Source, Generate Index, File API; Read Index, File API, Show Search Result) for the CLI tool and the browser search program]
! I published it yesterday.
! It supports Python 2.6, 2.7, 3.3, and 3.4.
! Use the Oktavia API to implement a search feature in your application
! Build the JSX version
$ git clone git@github.com:shibukawa/oktavia.git 
$ cd oktavia 
$ npm install 
$ ./node_modules/.bin/grunt build 
! web/bin/oktavia-jquery-ui.js and web/bin/oktavia-web-runtime.js are the important files.
! Creating an index
! Dump the index file as base64 and create a file in the following style:
var searchIndex = 'aGVsbG8gd29ybGQ…..=';
! Concatenate it with the JSX web search runtime (web/bin/oktavia-web-runtime.js).
! Add web/bin/oktavia-jquery-ui.js to your website.
! It loads the index and runtime in a WebWorker, sends search requests to it, and shows the results.
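The base64 step could look roughly like this in Python. The helper name `to_js_source` is made up for this sketch, and `b"hello world"` is placeholder data standing in for the real binary index.

```python
import base64

def to_js_source(index_bytes, var_name="searchIndex"):
    """Return a JavaScript statement holding the index as a base64 string."""
    encoded = base64.b64encode(index_bytes).decode("ascii")
    return "var {0} = '{1}';\n".format(var_name, encoded)

# Placeholder bytes; a real index would be the dumped binary index file.
js_source = to_js_source(b"hello world")
print(js_source)  # var searchIndex = 'aGVsbG8gd29ybGQ=';
```

The resulting text would then be written to a file and concatenated with the runtime as described above.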
! Oktavia provides APIs for building a better search engine of your own.
! The most important part of the user experience is the adjustment of scoring (sorting and filtering).
! In some cases, users feel that “not available” is important information, but in other cases, it is just noise.
! I want to buy a bottle of wine as a gift!
Cabernet Sauvignon [Sold Out]
• From France
Pinot noir [Sold Out]
• From Chile
Zinfandel [Sold Out]
• From USA
Photo by Josh Kenzer under CC-NC-SA
! I want to buy “My Little Pony DVD”!
Season One: $32
Season Two: $32
Season Three: [Sold out]
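The two shopping examples suggest that the handling of sold-out items should be a per-application choice. The sketch below is purely illustrative: the `rank` function, the `sold_out` policies, and the result dictionaries with their scores are all hypothetical and are not part of the Oktavia API.

```python
def rank(results, sold_out="keep"):
    """Order hits by score; sold_out is 'keep', 'demote', or 'hide'."""
    if sold_out == "hide":
        results = [r for r in results if r["in_stock"]]
    if sold_out == "demote":
        # In-stock items first, then descending score within each group.
        return sorted(results, key=lambda r: (not r["in_stock"], -r["score"]))
    return sorted(results, key=lambda r: -r["score"])

# Hypothetical hits with made-up scores.
dvds = [
    {"name": "Season Three", "in_stock": False, "score": 0.9},
    {"name": "Season One", "in_stock": True, "score": 0.8},
    {"name": "Season Two", "in_stock": True, "score": 0.7},
]

# A fan looking for a specific title wants to see that it is sold out.
print([r["name"] for r in rank(dvds, "keep")])
# ['Season Three', 'Season One', 'Season Two']

# A gift shopper only cares about what can be bought now.
print([r["name"] for r in rank(dvds, "hide")])
# ['Season One', 'Season Two']
```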
! Oktavia class (oktavia.py) 
! Main entry point for creating and searching an index.
! Metadata classes (metadata.py) 
! Section 
! Block 
! Splitter 
! Table 
! Query, Result classes (TBD)
! Sorry, I am still working on this… In the future, the
following code will work:
! In some cases, an inverted index is not a good fit for East Asian languages.
! FM-index is a completely different search algorithm.
! I published a new PyPI module yesterday
! It includes only the essential part of Oktavia
! I will add more features.
! Office Hour 
! 13:40-14:10 
! Message 
! Facebook(yoshiki.shibukawa) 
! Twitter(@shibu_jp, @shibukawa)

Oktavia Search Engine - pyconjp2014
