
Fuzzy search on plone & search for east asian language


  1. Fuzzy Search on Plone and Search for East Asian Language
     CMS communications Inc, Manabu TERADA
     terada@cmscom.jp
     http://www.cmscom.jp
     4 Oct 2013, Plone Conference 2013 in Brasilia
     ©2013 CMScom info@cmscom.jp
  2. Who am I? (お前だれよ?, "Who are you?")
     •Manabu TERADA (寺田 学) @terapyon
     •Advisory Board Member of Plone Foundation
     •Chair of PyCon APAC 2013 in Japan
     •Owner of CMS communications Inc.
     •Member of Plone Users Group Japan
     •Authors
  3. Contents
     •About Japanese Language and other Languages
     •Fuzzy Search on Plone
       •About the product
       •Basic technology
       •Dependencies
       •Demo
       •Structure of the product
       •Future plans
  4. Language Questions
     ありがとう, Thank you, Obrigado, Gracias, 谢谢, 감사 합니다, ขอบคุณ, Спасибо, شكرا
  5. Language Questions
     ありがとう: 日本語 (Japanese)
     Thank you: English
     Obrigado: Portuguese
     Gracias: Spanish
     谢谢: Chinese
     감사 합니다: Korean
     ขอบคุณ: Thai
     Спасибо: Russian
     شكرا: Arabic
  6. Language Questions
     •Double bytes
     ありがとう (日本語), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  8. Language Questions
     •Left to Right (LTR) or Right to Left (RTL)
     ありがとう (日本語), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  10. Language Questions
     •No white space?
     ありがとう (日本語), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  11. Language Questions
     •No white space
     ありがとう (日本語), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  12. Japanese
     •Can you read this Japanese?
     •私は寺田学です。日本の東京から来ました。ブラジルに来たのは初めてです。
     •I am Manabu TERADA. I came from Tokyo, Japan. This is my first time in Brazil.
     •私 は 寺田 学 です。日本 の 東京 から 来ました。ブラジル に 来た のは 初めて です。
  13. Japanese
     •Japanese doesn't use white space to split words.
     •Japanese has 3 different character sets:
       •Hiragana, Katakana, Kanji
       •Hiragana and Katakana have about 50 characters each
       •Kanji has over 2000 characters
     •Japanese has homonyms written with different characters, and characters with more than one reading.
  14. Japanese
     •They all have the same meaning.
       •Kyoto ← Roma-ji
       •京都 ← Kanji
       •きょうと ← Hiragana
       •キョウト ← Katakana
  15. Japanese
     •Can you read these?
       •橋 → ハシ → Hashi
       •端 → ハシ → Hashi
       •箸 → ハシ → Hashi
     •They have different meanings.
     •We can tell them apart from context.
  16. Japanese and other Languages
     •We have a lot of languages.
     •We have a lot of rules.
     •We have a lot of issues.
     •I want Plone to have solutions for them.
  17. Fuzzy Search on Plone
     Fuzzy Search
  18. Fuzzy Search on Plone
     •Name: c2.search.fuzzy
     •Version: 1.0a5 (alpha release)
     https://pypi.python.org/pypi/c2.search.fuzzy
     https://bitbucket.org/cmscom/c2.search.fuzzy
  19. About
  20. Fuzzy Search on Plone
     •We want search suggestions like the ones Google gives.
     •On an intranet, we can NOT use Google.
  21. Fuzzy Search on Plone
     •It does NOT use Solr. I know Solr works well,
     •but it is difficult to install, configure, and implement.
     •Also, I wanted to build my own system.
  22. Basic technology
     •This system is not difficult.
     •Keywords
       •Levenshtein Distance
       •Sorted list
       •Automata system
  23. Basic technology
     The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other. The phrase edit distance is often used to refer specifically to Levenshtein distance. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is closely related to pairwise string alignments.
     From Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
  24. Basic technology
     Levenshtein Distance
     •Base word: plone
     •Distance zero (case is ignored)
       •PLONE, Plone, pLone
     •Distance one
       •Phone, plene, plne, lone, ploneg, .....
     •Distance two
       •one, plo, polne, ......
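The distances on slide 24 can be checked with a short dynamic-programming routine. This is a minimal sketch, not code from c2.search.fuzzy; the function names `levenshtein` and `distance_ignoring_case` are made up for illustration, and lowercasing both words is an assumption that reproduces the "distance zero" examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # previous[j] is the distance between the already processed prefix
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def distance_ignoring_case(a: str, b: str) -> int:
    # Assumption: case differences do not count as edits.
    return levenshtein(a.lower(), b.lower())


for word in ["PLONE", "Phone", "plne", "ploneg", "polne", "plo"]:
    print(word, distance_ignoring_case("plone", word))
```

Note that "polne" is at distance two because plain Levenshtein distance has no transposition operation: swapping two neighbouring letters costs two substitutions.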
  25. Basic technology
     Sorted list
     •Ordered container (List) or
     •Can get Order of words
     Sorted Order from Unicode (by alphabet)
     For example (the G20 countries):
     ['Argentina', 'Australia', 'Brazil', 'Canada', 'China', 'European Union', 'France', 'Germany', 'India', 'Indonesia', 'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia', 'South Africa', 'South Korea', 'Turkey', 'United Kingdom', 'United States']
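One way to keep such a list in order as words arrive is Python's standard `bisect` module. The sketch below is an assumption about how a sorted word list could be maintained, not the product's code; `insert_sorted` is a hypothetical helper.

```python
import bisect


def insert_sorted(words: list, word: str) -> None:
    """Insert word at its sorted position, skipping exact duplicates."""
    pos = bisect.bisect_left(words, word)
    if pos == len(words) or words[pos] != word:
        words.insert(pos, word)


words = []
for name in ["Japan", "Brazil", "Argentina", "brazil", "China", "Brazil"]:
    insert_sorted(words, name)

print(words)
```

Python compares strings by Unicode code point, so every uppercase ASCII letter sorts before every lowercase one. That is one reason to normalize words (for example to lower case) before they go into the index.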
  26. Basic technology
     From @hiratara's slide: http://www.slideshare.net/hiratara/levenshtein-automata
  27. Basic technology
     Levenshtein Automata
     •I found a good blog entry:
       •Damn Cool Algorithms: Levenshtein Automata
       •http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
       •https://gist.github.com/Arachnid/491973
     •It uses only Python!!
  28. Basic technology
     Index
     •It creates its own index, like a sorted list, whenever Plone content is created or modified.
     Search
     •It searches its own index when we type into the search box.
     •Correctly spelled words within a small distance are shown from that index.
     •This works because every indexed word appears in Plone content.
  29. Basic technology
     •For example,
       •We want to show words within distance one (the default).
       •From the G20 countries list:
         •Brezil → Brazil
         •Japon → Japan
     •It uses the automata system for speed.
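The G20 example can be reproduced with a small automaton-style matcher. This is a hedged sketch, not the code from the referenced gist or from c2.search.fuzzy: instead of compiling a real DFA it uses one row of the edit-distance table as the automaton state, and the names `LevenshteinAutomaton` and `suggest` are invented here.

```python
class LevenshteinAutomaton:
    """Accepts every string within max_edits edits of word.

    A state is one row of the edit-distance table; feeding one input
    character produces the next row.
    """

    def __init__(self, word: str, max_edits: int = 1):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        return list(range(len(self.word) + 1))

    def step(self, state, char):
        new_state = [state[0] + 1]
        for i, word_char in enumerate(self.word):
            cost = 0 if word_char == char else 1
            new_state.append(min(
                new_state[i] + 1,   # skip a character of self.word
                state[i] + cost,    # match or substitute
                state[i + 1] + 1,   # skip the input character
            ))
        return new_state

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # Row minima never decrease, so once every entry exceeds the
        # budget no further input can lead to a match.
        return min(state) <= self.max_edits


def suggest(query, words, max_edits=1):
    """Return the words within max_edits of query, ignoring case."""
    automaton = LevenshteinAutomaton(query.lower(), max_edits)
    matches = []
    for word in words:
        state = automaton.start()
        for char in word.lower():
            state = automaton.step(state, char)
            if not automaton.can_match(state):
                break  # give up on this word early
        else:
            if automaton.is_match(state):
                matches.append(word)
    return matches


G20 = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
       'European Union', 'France', 'Germany', 'India', 'Indonesia',
       'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
       'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
       'United States']

print(suggest("Brezil", G20))
print(suggest("Japon", G20))
```

This sketch still visits every word and only saves time by abandoning a word as soon as it cannot match. It does not use the sorted order of the index to skip whole ranges of words.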
  30. Dependencies
     We need only Python.
  31. Dependencies
     •We use MeCab for Japanese support.
     •Japanese doesn't use white space to split words.
     •(same as Chinese and Korean)
  32. Dependencies
     •Supported languages
       •English and other European languages
       •MAYBE: Arabic
     •Chinese and Korean
       •They need a word-splitting system
       •I don't know one.
  33. Demo
     •View the video on YouTube
     http://youtu.be/e5DHsF7Gi70
  34. Structure of the product
     •Index data is stored in ZODB as a List object.
     •When content is created or modified, the List is updated and kept sorted.
     •The List holds Dicts: each Dict key is the phonetic form (or the lower-case form in English), and the value is the list of original words.
     Example index data:
     [{'argentina' : ['Argentina', 'argentina', 'ARGENTINA']},
      {'australia': ['Australia']},
      {'brazil' : ['Brazil']},
      {'きょうと' : ['京都', 'キョウト']}]
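The index shape on slide 34 can be built with plain Python objects. The sketch below is an illustration under assumptions: it holds the data in memory rather than in ZODB, `FuzzyIndex` and its methods are hypothetical names, and the phonetic form of a Japanese word is passed in by the caller because deriving it needs a tool such as MeCab.

```python
import bisect


class FuzzyIndex:
    """Sorted normalized keys, each mapped to its original spellings."""

    def __init__(self):
        self.keys = []       # normalized keys, kept sorted
        self.originals = {}  # key -> list of original words

    def add(self, word, phonetic=None):
        # Assumption: use the phonetic form when one is supplied,
        # otherwise fall back to the lower-case form.
        key = phonetic if phonetic is not None else word.lower()
        if key not in self.originals:
            bisect.insort(self.keys, key)
            self.originals[key] = []
        if word not in self.originals[key]:
            self.originals[key].append(word)

    def as_list(self):
        """Return the data in the list-of-dicts shape from the slide."""
        return [{key: self.originals[key]} for key in self.keys]


index = FuzzyIndex()
for word in ["Brazil", "Argentina", "argentina", "ARGENTINA", "Australia"]:
    index.add(word)
index.add("京都", phonetic="きょうと")
index.add("キョウト", phonetic="きょうと")

print(index.as_list())
```

Keeping the keys in a separate sorted list makes lookups by key cheap while the dictionary preserves every original spelling for display.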
  35. Structure of the product
     •Search
       •The automata system checks the List for words within a small distance of the input word.
       •JavaScript shows the original words, taken from the Dict values, under the search box.
  36. Structure of the product
     for Japanese
     •I'm using MeCab for splitting words and getting their phonetic form.
     •Both the phonetic form and the original word are stored,
     •because Japanese has homonyms written with different characters.
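Getting the reading of a kanji word such as 京都 requires a morphological analyzer like MeCab, which is not shown here. One small step that needs only the standard library is folding katakana into hiragana so that キョウト and きょうと share a single key. This is a sketch of that one step, assuming the index keys are hiragana as in the slide 34 example; `to_hiragana` is a hypothetical helper, not part of the product.

```python
KATAKANA_FIRST = ord("ァ")         # U+30A1
KATAKANA_LAST = ord("ヶ")          # U+30F6
KANA_OFFSET = ord("ァ") - ord("ぁ")  # the two blocks are 0x60 apart


def to_hiragana(text: str) -> str:
    """Replace katakana letters with hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in text
    )


print(to_hiragana("キョウト"))
print(to_hiragana("ハシ"))
```

Kanji pass through unchanged, so 橋, 端 and 箸 only collapse onto the shared key はし once an analyzer has supplied their reading.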
  37. Future plans
     •Now, I'm using ZODB to store the index.
     •I want an option to store it in an RDBMS. I'm trying to develop it.
     •I want to support more languages.
     •Please help me support more languages.
  38. Thanks
     •Japanese & East Asian languages
       •We still have some problems in Plone.
       •I think Plone works well with multiple languages.
       •I hope Plone keeps working well.
       •Developers, please never forget other languages.
     •Fuzzy search
       •I would like to receive bug reports.
       •Please try the product.
  39. Special thanks
     •Supported by
       •ike @rokujyouhitoma
       •@hiratara
     •Referenced web site
       •http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
  40. Contact me
     •Twitter: @terapyon
     •Facebook: https://www.facebook.com/terapyon
