Fuzzy Search on Plone & Search for East Asian Language

©2013 CMScom info@cmscom.jp
Fuzzy Search on Plone and Search for East Asian Language
CMS communications Inc.
Manabu TERADA terada@cmscom.jp
http://www.cmscom.jp
4 / Oct / 2013
Plone Conference 2013 in Brasilia

Who am I? (お前だれよ？, "Who are you?")
•Manabu TERADA (寺田学) @terapyon
•Advisory Board Member of Plone Foundation
•Chair of PyCon APAC 2013 in Japan
•Owner of CMS communications Inc.
•Member of Plone Users Group Japan
•Authors

Contents
•About the Japanese language and other languages
•Fuzzy Search on Plone
•About the product
•Basic technology
•Dependencies
•Demo
•Structure of the product
•Future plans

Language Questions
•ありがとう: Japanese (日本語)
•Thank you: English
•Obrigado: Portuguese
•Gracias: Spanish
•谢谢: Chinese
•감사합니다: Korean
•ขอบคุณ: Thai
•Спасибо: Russian
•شكرا: Arabic

Language Questions
•Double-byte characters?
•Left to Right (LTR) or Right to Left (RTL)?
•No white space between words?

Japanese
•Can you read this Japanese?
•私は寺田学です。日本の東京から来ました。ブラジルに来たのは初めてです。
•I am Manabu TERADA. I came from Tokyo, Japan. This is my first time in Brazil.

Japanese
•Japanese doesn't use white space to separate words.
•Japanese has 3 different scripts:
•Hiragana, Katakana, Kanji
•Hiragana and Katakana each have about 50 characters
•There are over 2,000 Kanji characters
•Japanese has homonyms written with different characters, and a single character can have several different readings.

Japanese
•These all mean the same thing.
•Kyoto ← Roma-ji
•京都 ← Kanji
•きょうと ← Hiragana
•キョウト ← Katakana

Japanese
•Can you read these?
•橋 → ハシ → Hashi (bridge)
•端 → ハシ → Hashi (edge)
•箸 → ハシ → Hashi (chopsticks)
•They have different meanings.
•We can tell them apart from context.

Japanese and other Languages
•We have a lot of languages.
•We have a lot of rules.
•We have a lot of issues.
•I want Plone to have solutions for them.

Fuzzy Search on Plone
Fuzzy Search
•Name: c2.search.fuzzy
•Version: 1.0a5 (alpha release)
https://pypi.python.org/pypi/c2.search.fuzzy
https://bitbucket.org/cmscom/c2.search.fuzzy

About
•We want search suggestions like Google's.
•On an intranet, we can NOT use Google.

•It does NOT use Solr. I know Solr works well,
•but it is difficult to install, configure, and implement.
•And I want to build my own system.

Basic technology
•This system is not difficult.
•Keywords:
•Levenshtein Distance
•Sorted list
•Automata system

Basic technology
The Levenshtein distance is a string metric for
measuring the difference between two sequences.
Informally, the Levenshtein distance between two
words is the minimum number of single-character
edits (insertion, deletion, substitution) required to
change one word into the other. The phrase edit
distance is often used to refer specifically to
Levenshtein distance. It is named after Vladimir
Levenshtein, who considered this distance in 1965.
It is closely related to pairwise string alignments.
From Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
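
The definition quoted above can be sketched in a few lines of Python. This is the textbook dynamic-programming formulation, shown only as an illustration; the function name `levenshtein` is chosen here and is not taken from c2.search.fuzzy.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of one word rather than with the product of both lengths.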

Basic technology
Levenshtein Distance
•Base word: plone
•Distance zero (case is ignored)
•PLONE, Plone, pLone
•Distance one
•Phone, plene, plne, lone, ploneg, ...
•Distance two
•one, plo, polne, ...
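
The distance groups above can be reproduced by generating every string one edit away from the base word, then applying the same step again for distance two. This is a rough sketch under two assumptions made here: words are lower-cased first, and only the letters a to z are used for insertions and substitutions. The helper name `one_edit_variants` is hypothetical.

```python
import string


def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings reachable from word by at most one insertion, deletion or substitution."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {head + tail[1:] for head, tail in splits if tail}
    substitutions = {head + c + tail[1:] for head, tail in splits if tail for c in alphabet}
    inserts = {head + c + tail for head, tail in splits for c in alphabet}
    return deletes | substitutions | inserts


base = "plone"
# Exactly one edit away: drop the base word itself.
distance_one = one_edit_variants(base) - {base}
# Exactly two edits away: one more edit, minus everything already closer.
distance_two = {v for w in distance_one for v in one_edit_variants(w)} - distance_one - {base}

print("Phone".lower() in distance_one)  # True
print("polne" in distance_two)          # True
```

Enumerating variants grows quickly with word length and alphabet size, which is one reason to prefer a check against the words actually stored in an index.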

Basic technology
Sorted list
•An ordered container (a list), or
•anything that can return words in order
Sorted by Unicode code point (alphabetical order)
['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
'European Union', 'France', 'Germany', 'India', 'Indonesia',
'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
'United States']
Example: the G20 members
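
A sorted list can be kept in order with Python's standard `bisect` module, which also answers "what is the first word at or after this string?" without scanning the whole list. The sketch below is only an illustration; the helper name `first_word_at_or_after` is made up here and the product may organise its list differently.

```python
import bisect

words = []
for name in ["Japan", "Brazil", "Canada", "Argentina"]:
    bisect.insort(words, name)  # insert while keeping the list sorted

print(words)  # ['Argentina', 'Brazil', 'Canada', 'Japan']


def first_word_at_or_after(sorted_words, probe):
    """Return the first word >= probe in code point order, or None if there is none."""
    i = bisect.bisect_left(sorted_words, probe)
    return sorted_words[i] if i < len(sorted_words) else None


print(first_word_at_or_after(words, "Br"))  # Brazil
```

Python compares strings by Unicode code point, so upper-case ASCII letters sort before lower-case ones; normalising case before inserting avoids surprises.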

Basic technology
From @hiratara's slides: http://www.slideshare.net/hiratara/levenshtein-automata

Basic technology
Levenshtein Automata
•I found a good blog entry:
•Damn Cool Algorithms: Levenshtein Automata
•http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
•https://gist.github.com/Arachnid/491973
•It uses only Python!!
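
To give a feel for the idea, here is a simplified automaton that uses one row of the edit-distance table as its state. It is not the NFA-to-DFA construction from the blog entry or the gist, and it is not the product's code; it is a smaller variant with the same purpose, namely to reject a candidate as soon as no continuation can stay within the allowed distance. All names are chosen for this sketch.

```python
class LevenshteinAutomaton:
    """Accepts every string within max_edits edits of word."""

    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        # Distance from the empty input to each prefix of word.
        return list(range(len(self.word) + 1))

    def step(self, state, char):
        # Consume one input character and compute the next table row.
        new_state = [state[0] + 1]
        for i in range(len(state) - 1):
            cost = 0 if self.word[i] == char else 1
            new_state.append(min(new_state[i] + 1, state[i] + cost, state[i + 1] + 1))
        # Values above max_edits are all equally hopeless, so cap them.
        return [min(x, self.max_edits + 1) for x in new_state]

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        return min(state) <= self.max_edits


def matches(word, candidate, max_edits=1):
    automaton = LevenshteinAutomaton(word, max_edits)
    state = automaton.start()
    for char in candidate:
        state = automaton.step(state, char)
        if not automaton.can_match(state):
            return False  # dead state: stop reading early
    return automaton.is_match(state)


print(matches("brazil", "brezil"))  # True
```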

Basic technology
Index
•It creates its own index, like a sorted list, when Plone
content is created or modified.
Search
•It searches this index when we type into the
search box.
•Correctly spelled words within a small distance of the
input are shown from the index.
•This works because the index only contains words that appear inside Plone content.

Basic technology
•For example,
•We want to show words within distance one (the default).
•From the G20 countries list:
•Brezil → Brazil
•Japon → Japan
•And it uses an automaton to increase speed.
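
The G20 example can be tried with a short script. This sketch checks every word in the list with a linear-time "at most one edit" test; it deliberately leaves out the automaton speed-up, and the names `within_one_edit` and `suggest` are hypothetical.

```python
G20 = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
       'European Union', 'France', 'Germany', 'India', 'Indonesia',
       'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
       'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
       'United States']


def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]          # one extra character in b


def suggest(text, words=G20):
    """Return the words within distance one of text, ignoring case."""
    return [w for w in words if within_one_edit(text.lower(), w.lower())]


print(suggest("Brezil"))  # ['Brazil']
print(suggest("Japon"))   # ['Japan']
```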

Dependencies
We need only Python.

Dependencies
•We use MeCab for Japanese support.
•Japanese doesn't use white space to separate words
•(the same as Chinese and Korean)

Dependencies
•Supported languages
•English and other European languages
•MAYBE: Arabic
•Chinese and Korean
•They need a word-splitting system
•I don't know of one.

Demo
•View the video on YouTube
http://youtu.be/e5DHsF7Gi70

Structure of the product
•Index data is stored in ZODB as a List object.
•When content is created or modified, the List is
updated and kept sorted.
•The List holds Dicts; each Dict key is the phonetic reading (or the lower-case
form in English), and the value is the list of original words.
[{'argentina': ['Argentina', 'argentina', 'ARGENTINA']},
 {'australia': ['Australia']},
 {'brazil': ['Brazil']},
 {'きょうと': ['京都', 'キョウト']}]
Example index data
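
One way to maintain such an index is sketched below with plain Python objects. It is an assumption-laden illustration, not the product's code: nothing here touches ZODB, the class name `FuzzyIndex` is invented, and the layout is a slight variation on the example (a sorted list of keys plus one dict from key to original words, rather than a list of one-key dicts).

```python
import bisect


class FuzzyIndex:
    """Sorted normalised keys, each mapped to the original spellings seen in content."""

    def __init__(self):
        self.keys = []       # sorted list of normalised keys
        self.originals = {}  # key -> list of original words, in insertion order

    def add(self, word, key=None):
        # For English the key is the lower-case form; for Japanese the caller
        # passes the phonetic reading as key.
        if key is None:
            key = word.lower()
        if key not in self.originals:
            bisect.insort(self.keys, key)
            self.originals[key] = []
        if word not in self.originals[key]:
            self.originals[key].append(word)


index = FuzzyIndex()
for word in ["Brazil", "Argentina", "argentina", "ARGENTINA", "Australia"]:
    index.add(word)
index.add("京都", key="きょうと")
index.add("キョウト", key="きょうと")

print(index.keys)  # ['argentina', 'australia', 'brazil', 'きょうと']
```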

•Search
•The input word is checked against the List by the
automaton to find words within a small distance.
•The original words from the Dict values are shown
under the search box by JavaScript.

For Japanese
•I'm using MeCab for splitting words and getting their phonetic readings.
•It stores the phonetic reading and the original word,
•because Japanese has homonyms written with different
characters.
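
The phonetic key in the example index is written in hiragana. MeCab is not called in the sketch below; it only shows one possible normalisation step, converting a katakana reading to hiragana, on the assumption (made for this sketch) that readings arrive in katakana. The function name is hypothetical.

```python
KATAKANA_START = ord("ァ")          # U+30A1
KATAKANA_END = ord("ヶ")            # U+30F6
OFFSET = ord("ァ") - ord("ぁ")      # katakana and hiragana blocks are 0x60 apart


def katakana_to_hiragana(text):
    """Map each katakana letter to its hiragana counterpart; leave other characters alone."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


print(katakana_to_hiragana("キョウト"))  # きょうと
```

With a key like this, 京都 and キョウト can be stored under the same entry, as in the example index data.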

Future plans
•Now, I'm using ZODB to store the index.
•I want to add an option to store it in an RDBMS. I'm trying
to develop it.
•I want to support more languages.
•Please help me support more languages.

Thanks
•Japanese & East Asian languages
•We still have some problems in Plone.
•I think Plone works well with multiple languages.
•I hope Plone will continue to work well.
•Developers, please never forget other languages.
•Fuzzy search
•I would like to get bug reports.
•Please try the product.

Special thanks
• Supported by
• ike @rokujyouhitoma
• @hiratara
• Referenced web site
• http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Contact me
• Twitter: @terapyon
• Facebook: https://www.facebook.com/terapyon
