©2013 CMScom info@cmscom.jp
Fuzzy Search on Plone and Search for East Asian Language
CMS communications Inc,
Manabu TERADA terada@cmscom.jp
http://www.cmscom.jp
4 / Oct / 2013
Plone Conference 2013 in Brasilia
Who am I? (お前だれよ?)
•Manabu TERADA (寺田 学) @terapyon
•Advisory Board Member of Plone Foundation
•Chair of PyCon APAC 2013 in Japan
•Owner of CMS communications Inc.
•Member of Plone Users Group Japan
•Authors
Contents
•About Japanese Language and other Languages
•Fuzzy Search on Plone
•About the product
•Basic technology
•Dependencies
•Demo
•Structure of the product
•Future plans
Language Questions
ありがとう Thank you Obrigado
Gracias 谢谢 감사 합니다
ขอบคุณ Спасибо شكرا
Language Questions
ありがとう: Japanese (日本語)
Thank you: English
Obrigado: Portuguese
Gracias: Spanish
谢谢: Chinese
감사 합니다: Korean
ขอบคุณ: Thai
Спасибо: Russian
شكرا: Arabic
Language Questions
•Double bytes
•Left to Right (LTR) or Right to Left (RTL)
•No white space
Japanese
•Can you read this Japanese?
•私は寺田学です。日本の東京から来ました。ブラジルに来たのは初めてです。
•I am Manabu TERADA. I came from Tokyo, Japan. This is my first time in Brazil.
•私 は 寺田 学 です。日本 の 東京 から 来ました。ブラジル に 来た のは 初めて です。
Japanese
•Japanese doesn't use white space to separate words.
•Japanese uses three different scripts:
•Hiragana, Katakana, Kanji
•Hiragana and Katakana each have about 50 characters
•Kanji has over 2,000 characters
•Japanese has many homonyms: the same reading can be written with different characters, and the same character can have different readings.
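The points above can be checked with a few lines of Python. This is only a minimal sketch: the helper name script_of is made up for illustration, and it relies on the official Unicode character names from the standard unicodedata module to tell the three scripts apart.

```python
import unicodedata

def script_of(ch):
    """Return a coarse script label for one character, based on its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"

sentence = "私は寺田学です。"

# str.split() finds no white space, so the whole sentence comes back as one "word".
print(sentence.split())

# Each character can still be labelled with its script.
print([(ch, script_of(ch)) for ch in sentence])
```

Splitting on white space is therefore not enough for Japanese; some kind of word-splitting step is needed before indexing.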
Japanese
•These all have the same meaning.
•Kyoto ← Romaji
•京都 ← Kanji
•きょうと ← Hiragana
•キョウト ← Katakana
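Hiragana and Katakana can be converted into each other mechanically, because the two Unicode blocks are laid out in parallel. The sketch below assumes only that layout (Hiragana U+3041 to U+3096, Katakana 0x60 higher); the function names are hypothetical. Kanji cannot be converted this way and needs a dictionary.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KANA_OFFSET = 0x60  # distance from a Hiragana code point to its Katakana twin

def hiragana_to_katakana(text):
    """Shift every Hiragana letter to Katakana; leave other characters alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )

def katakana_to_hiragana(text):
    """Shift every Katakana letter to Hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if HIRAGANA_START + KANA_OFFSET <= ord(ch) <= HIRAGANA_END + KANA_OFFSET
        else ch
        for ch in text
    )

print(hiragana_to_katakana("きょうと"))
print(katakana_to_hiragana("キョウト"))
```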
Japanese
•Can you read these?
•橋 → ハシ → Hashi
•端 → ハシ → Hashi
•箸 → ハシ → Hashi
•They have different meanings (bridge, edge, chopsticks).
•We tell them apart by context.
Japanese and other Languages
•We have a lot of languages.
•We have a lot of rules.
•We have a lot of issues.
•I want to have solutions for these in Plone.
Fuzzy Search on Plone
Fuzzy Search
Fuzzy Search on Plone
•Name: c2.search.fuzzy
•1.0a5 (alpha release)
https://pypi.python.org/pypi/c2.search.fuzzy
https://bitbucket.org/cmscom/c2.search.fuzzy
Fuzzy Search on Plone
•We want search suggestions like Google's.
•On an intranet, we can NOT use Google.
Fuzzy Search on Plone
•It does NOT use Solr. I know Solr works well,
•but it's difficult to install/configure/implement,
•and I want to build my own system.
Basic technology
•This system is not difficult.
•Keywords
•Levenshtein Distance
•Sorted list
•Automata system
Basic technology
The Levenshtein distance is a string metric for
measuring the difference between two sequences.
Informally, the Levenshtein distance between two
words is the minimum number of single-character
edits (insertion, deletion, substitution) required to
change one word into the other. The phrase edit
distance is often used to refer specifically to
Levenshtein distance. It is named after Vladimir
Levenshtein, who considered this distance in 1965.
It is closely related to pairwise string alignments.
From Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
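The definition above can be written as a short dynamic-programming function. This is a minimal textbook sketch, not code taken from c2.search.fuzzy; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("plone", "phone"))
print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of one word rather than with the product of both lengths.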
Basic technology
Levenshtein Distance
•base word: plone
•Distance zero (ignoring case)
•PLONE, Plone, pLone
•Distance one
•Phone, plene, plne, lone, ploneg, ...
•Distance two
•one, plo, polne, ...
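The groups on this slide can be reproduced by lower-casing both words before measuring the distance. The sketch below uses the recursive definition of the Levenshtein distance with memoization; the names distance and group_by_distance are hypothetical, and lower-casing is assumed to be how case is ignored.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def distance(a, b):
    """Recursive Levenshtein distance, memoized so repeated suffixes are cheap."""
    if not a or not b:
        return len(a) + len(b)
    if a[0] == b[0]:
        return distance(a[1:], b[1:])
    return 1 + min(
        distance(a[1:], b),      # delete from a
        distance(a, b[1:]),      # insert into a
        distance(a[1:], b[1:]),  # substitute
    )

def group_by_distance(base, candidates):
    """Map each case-insensitive distance to the candidates at that distance."""
    groups = {}
    for word in candidates:
        groups.setdefault(distance(base.lower(), word.lower()), []).append(word)
    return groups

groups = group_by_distance("plone", [
    "PLONE", "Plone", "pLone",
    "Phone", "plene", "plne", "lone", "ploneg",
    "one", "plo", "polne",
])
print(groups)
```

Note that "polne" lands at distance two: swapping two neighbouring letters counts as two substitutions in plain Levenshtein distance.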
Basic technology
Sorted list
•Ordered container (List) or 
•Can get the order of words
Sorted in Unicode code point order (alphabetical)
['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
'European Union', 'France', 'Germany', 'India', 'Indonesia',
'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
'United States']
For example (G20 countries)
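A sorted list like the one above can be maintained with the standard bisect module, which finds positions by binary search. This is a minimal sketch with hypothetical helper names (add_word, first_at_or_after); it is not the storage code of the product.

```python
import bisect

def add_word(sorted_words, word):
    """Insert word into sorted_words, keeping it sorted and free of duplicates."""
    pos = bisect.bisect_left(sorted_words, word)
    if pos == len(sorted_words) or sorted_words[pos] != word:
        sorted_words.insert(pos, word)

def first_at_or_after(sorted_words, word):
    """Return the first stored word >= word, or None if there is none."""
    pos = bisect.bisect_left(sorted_words, word)
    return sorted_words[pos] if pos < len(sorted_words) else None

countries = []
for name in ["Japan", "Brazil", "Argentina", "Japan", "Canada"]:
    add_word(countries, name)

print(countries)
print(first_at_or_after(countries, "Bra"))
```

Being able to jump to "the first word at or after X" is what lets a search skip over ranges of the index instead of visiting every word.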
Basic technology
From @hiratara's slides: http://www.slideshare.net/hiratara/levenshtein-automata
Basic technology
Levenshtein Automata
•I found a good blog entry:
• Damn Cool Algorithms: Levenshtein Automata
•http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
•https://gist.github.com/Arachnid/491973
•It uses only Python!!
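To give a feel for the idea, here is a simplified automaton in pure Python. It is not the NFA-to-DFA construction from the blog post or the gist: as an assumption made for brevity, each state is simply one row of the edit-distance table, capped at max_edits + 1. The class and function names are hypothetical.

```python
class LevenshteinAutomaton:
    """Accepts every string within max_edits edits of word.

    A state is one row of the edit-distance table, capped at max_edits + 1.
    """

    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        cap = self.max_edits + 1
        return tuple(min(i, cap) for i in range(len(self.word) + 1))

    def step(self, state, ch):
        """Feed one input character and return the next state."""
        cap = self.max_edits + 1
        row = [min(state[0] + 1, cap)]
        for i, wc in enumerate(self.word):
            row.append(min(row[i] + 1, state[i] + (wc != ch), state[i + 1] + 1, cap))
        return tuple(row)

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        """False once no continuation of the input can be accepted."""
        return min(state) <= self.max_edits

def accepts(automaton, candidate):
    state = automaton.start()
    for ch in candidate:
        state = automaton.step(state, ch)
        if not automaton.can_match(state):
            return False  # dead state: stop reading early
    return automaton.is_match(state)

plone = LevenshteinAutomaton("plone", 1)
print([w for w in ["phone", "plone", "polne", "ploneg", "one"] if accepts(plone, w)])
```

The benefit over computing a full distance per word is the early exit in can_match: as soon as a prefix is too far away, every index word sharing that prefix can be skipped.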
Basic technology
Index
•It builds its own index, like a sorted list, whenever Plone content is created or modified.
Search
•It searches this index when we type into the search box.
•Index words within a small distance of the input are shown as the correct spelling.
•Because they come from the index, every suggestion actually appears in Plone content.
Basic technology
•For example,
•We want to show words within distance one (the default).
•From the G20 countries list.
•Brezil → Brazil
•Japon → Japan
•It uses the automata approach for speed.
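The two corrections above can be reproduced with a straightforward filter over the G20 list. This sketch deliberately uses a plain distance function with a length check instead of an automaton, so it shows the behaviour rather than the speed-up; the names edit_distance and suggest are hypothetical.

```python
G20 = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
       'European Union', 'France', 'Germany', 'India', 'Indonesia',
       'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
       'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
       'United States']

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def suggest(query, words, max_distance=1):
    """Return the words whose lower-cased form is within max_distance of the query."""
    q = query.lower()
    hits = []
    for word in words:
        w = word.lower()
        if abs(len(w) - len(q)) > max_distance:
            continue  # the length gap alone already exceeds the limit
        if edit_distance(q, w) <= max_distance:
            hits.append(word)
    return hits

print(suggest("Brezil", G20))
print(suggest("Japon", G20))
```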
Dependencies
We need only Python.
Dependencies
•We use MeCab for Japanese support.
•Japanese doesn't use white space to separate words.
•(same as Chinese and Korean)
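A sketch of the splitting step follows. It assumes the MeCab Python binding (for example the mecab-python3 package) and its "-Owakati" output mode, which prints words separated by spaces. If MeCab or its dictionary is not installed, the sketch falls back to a crude, hypothetical splitter that cuts wherever the script changes; that fallback is only a stand-in and is much less accurate than a real morphological analyzer.

```python
import unicodedata

def _script(ch):
    """Coarse script label derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    for prefix, label in (("HIRAGANA", "hiragana"),
                          ("KATAKANA", "katakana"),
                          ("CJK UNIFIED", "kanji")):
        if name.startswith(prefix):
            return label
    return "other"

def split_on_script_change(text):
    """Crude fallback: start a new token whenever the script changes."""
    tokens = []
    for ch in text:
        if tokens and _script(tokens[-1][-1]) == _script(ch):
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens

def tokenize(text):
    """Split Japanese text into words with MeCab if available, else fall back."""
    try:
        import MeCab
        tagger = MeCab.Tagger("-Owakati")  # wakati = space-separated words
        return tagger.parse(text).split()
    except Exception:  # MeCab or its dictionary is missing or unusable
        return split_on_script_change(text)

print(tokenize("日本の東京から来ました"))
```

The exact tokens depend on which backend runs and on the MeCab dictionary in use, so no expected output is shown here.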
Dependencies
•Supported languages
•English and other European languages
•MAYBE: Arabic
•Chinese and Korean
•They need a working word-splitting system
•I don't know of one.
Demo
•View the video on YouTube
http://youtu.be/e5DHsF7Gi70
Structure of the product
•Index data is stored in ZODB as a List object.
•When content is created or modified, the List is updated and kept sorted.
•Each List item is a Dict: the key is the phonetic reading (or the lower-cased word in English) and the value is the list of original words.
[{'argentina' : ['Argentina', 'argentina', 'ARGENTINA']},
{'australia': ['Australia']},
{'brazil' : ['Brazil']},
{'きょうと' : ['京都', 'キョウト']}]
Example index data
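An in-memory sketch of such an index is shown below. It is an assumption-laden variation, not the product's ZODB code: instead of a list of single-key dicts it keeps one sorted list of keys plus one dict from key to original words, and the class name FuzzyIndex is made up. The Japanese readings are passed in by hand here.

```python
import bisect

class FuzzyIndex:
    """Sorted normalized keys plus a mapping from each key to its original words."""

    def __init__(self):
        self.keys = []       # normalized keys, kept sorted
        self.originals = {}  # normalized key -> original spellings, in insertion order

    def add(self, word, key=None):
        """Index word under key (defaults to the lower-cased word)."""
        key = word.lower() if key is None else key
        if key not in self.originals:
            bisect.insort(self.keys, key)
            self.originals[key] = []
        if word not in self.originals[key]:
            self.originals[key].append(word)

index = FuzzyIndex()
for w in ["Brazil", "Argentina", "argentina", "ARGENTINA", "Australia"]:
    index.add(w)
index.add("京都", key="きょうと")
index.add("キョウト", key="きょうと")

print(index.keys)
print(index.originals["argentina"])
```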
Structure of the product
•Search
•The automaton checks the List for words within a small distance of the input word.
•The original words from the Dict values are shown under the search box by JavaScript.
Structure of the product
For Japanese
•I'm using MeCab to split words and get their phonetic readings.
•Both the phonetic reading and the original word are stored,
•because in Japanese the same reading can be written with different characters.
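Why the reading is used as the key can be shown with a tiny example. The READINGS table below is hand-written and hypothetical; in the product the readings come from MeCab. The function name build_reading_index is also made up for this sketch.

```python
from collections import defaultdict

# Hypothetical hand-written readings (in Hiragana), standing in for MeCab output.
READINGS = {
    "橋": "はし",
    "端": "はし",
    "箸": "はし",
    "京都": "きょうと",
    "キョウト": "きょうと",
}

def build_reading_index(words, readings):
    """Group original words under their phonetic reading."""
    index = defaultdict(list)
    for word in words:
        index[readings[word]].append(word)
    return dict(index)

reading_index = build_reading_index(["橋", "端", "箸", "京都", "キョウト"], READINGS)
print(reading_index["はし"])
print(reading_index["きょうと"])
```

A fuzzy match on the reading then returns every written form, and the user picks the one that fits the context.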
Future plans
•Currently, I'm using ZODB to store the index.
•I want to add an option to store it in an RDBMS. I'm working on it.
•I want to support more languages.
•Please help me support more languages.
Thanks
•Japanese & East Asian languages
•We still have some problems in Plone.
•I think Plone works well with multiple languages.
•I hope Plone continues to work well.
•Developers, please never forget other languages.
•Fuzzy search
•I want to get bug reports.
•Please try the product.
Special thanks
• Supported by
• ike @rokujyouhitoma
• @hiratara
• Reference web site
• http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
Contact me
• Twitter: @terapyon
• Facebook: https://www.facebook.com/terapyon

More Related Content

Similar to Fuzzy search on plone & search for east asian language

Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Databricks
 
Prototyping Accessibility - WordCamp Europe 2018
Prototyping Accessibility - WordCamp Europe 2018Prototyping Accessibility - WordCamp Europe 2018
Prototyping Accessibility - WordCamp Europe 2018
Adrian Roselli
 
Graduates Gone Mad: Innovations in Software
Graduates Gone Mad: Innovations in SoftwareGraduates Gone Mad: Innovations in Software
Graduates Gone Mad: Innovations in Software
Alper Kanat
 
Communication tool & Environment for Remote Worker
Communication tool & Environment for Remote WorkerCommunication tool & Environment for Remote Worker
Communication tool & Environment for Remote Worker
Shotaro Sakamaki
 
LocJAM Japan Presentation - Kyoto Study Group (December 2016)
LocJAM Japan Presentation - Kyoto Study Group (December 2016)LocJAM Japan Presentation - Kyoto Study Group (December 2016)
LocJAM Japan Presentation - Kyoto Study Group (December 2016)
Anthony Teixeira - French Video Game Translator
 
Cerebro general overiew eng
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew eng
CineSoft
 
Subtle Encipherment Hall
Subtle Encipherment HallSubtle Encipherment Hall
Subtle Encipherment Hall
VenkateshwarGS
 
FEC2017-Introduction-to-programming
FEC2017-Introduction-to-programmingFEC2017-Introduction-to-programming
FEC2017-Introduction-to-programming
Henrikki Tenkanen
 
Welcome to the Brixton Library Technology Initiative
Welcome to the Brixton Library Technology InitiativeWelcome to the Brixton Library Technology Initiative
Welcome to the Brixton Library Technology Initiative
Basil Bibi
 
Building Large Sustainable Apps
Building Large Sustainable AppsBuilding Large Sustainable Apps
Building Large Sustainable Apps
Buğra Oral
 
Python Capitulo uno curso de programacion
Python Capitulo uno curso de programacionPython Capitulo uno curso de programacion
Python Capitulo uno curso de programacion
Jesus Vilchez Sandoval
 
Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analytics
Erik Tromp
 
Prototyping Accessibility: Booster 2019
Prototyping Accessibility: Booster 2019Prototyping Accessibility: Booster 2019
Prototyping Accessibility: Booster 2019
Adrian Roselli
 
Open source and free technologies for study skills
Open source and free technologies for study skillsOpen source and free technologies for study skills
Open source and free technologies for study skills
E.A. Draffan
 
ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...
ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...
ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...
Dr. Haxel Consult
 
2019 Fall SourceCon Sourcing Tools Roundtable
2019 Fall SourceCon Sourcing Tools Roundtable2019 Fall SourceCon Sourcing Tools Roundtable
2019 Fall SourceCon Sourcing Tools Roundtable
Susanna Frazier
 
How to Implement Domain Driven Design in Real Life SDLC
How to Implement Domain Driven Design  in Real Life SDLCHow to Implement Domain Driven Design  in Real Life SDLC
How to Implement Domain Driven Design in Real Life SDLC
Abdul Karim
 
python classes in thane
python classes in thanepython classes in thane
python classes in thane
faizrashid1995
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013
Iván Montes
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
Julien SIMON
 

Similar to Fuzzy search on plone & search for east asian language (20)

Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
Prototyping Accessibility - WordCamp Europe 2018
Prototyping Accessibility - WordCamp Europe 2018Prototyping Accessibility - WordCamp Europe 2018
Prototyping Accessibility - WordCamp Europe 2018
 
Graduates Gone Mad: Innovations in Software
Graduates Gone Mad: Innovations in SoftwareGraduates Gone Mad: Innovations in Software
Graduates Gone Mad: Innovations in Software
 
Communication tool & Environment for Remote Worker
Communication tool & Environment for Remote WorkerCommunication tool & Environment for Remote Worker
Communication tool & Environment for Remote Worker
 
LocJAM Japan Presentation - Kyoto Study Group (December 2016)
LocJAM Japan Presentation - Kyoto Study Group (December 2016)LocJAM Japan Presentation - Kyoto Study Group (December 2016)
LocJAM Japan Presentation - Kyoto Study Group (December 2016)
 
Cerebro general overiew eng
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew eng
 
Subtle Encipherment Hall
Subtle Encipherment HallSubtle Encipherment Hall
Subtle Encipherment Hall
 
FEC2017-Introduction-to-programming
FEC2017-Introduction-to-programmingFEC2017-Introduction-to-programming
FEC2017-Introduction-to-programming
 
Welcome to the Brixton Library Technology Initiative
Welcome to the Brixton Library Technology InitiativeWelcome to the Brixton Library Technology Initiative
Welcome to the Brixton Library Technology Initiative
 
Building Large Sustainable Apps
Building Large Sustainable AppsBuilding Large Sustainable Apps
Building Large Sustainable Apps
 
Python Capitulo uno curso de programacion
Python Capitulo uno curso de programacionPython Capitulo uno curso de programacion
Python Capitulo uno curso de programacion
 
Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analytics
 
Prototyping Accessibility: Booster 2019
Prototyping Accessibility: Booster 2019Prototyping Accessibility: Booster 2019
Prototyping Accessibility: Booster 2019
 
Open source and free technologies for study skills
Open source and free technologies for study skillsOpen source and free technologies for study skills
Open source and free technologies for study skills
 
ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...
ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...
ICIC 2014 High volume, High Quality Patent Translation across Multiple Domain...
 
2019 Fall SourceCon Sourcing Tools Roundtable
2019 Fall SourceCon Sourcing Tools Roundtable2019 Fall SourceCon Sourcing Tools Roundtable
2019 Fall SourceCon Sourcing Tools Roundtable
 
How to Implement Domain Driven Design in Real Life SDLC
How to Implement Domain Driven Design  in Real Life SDLCHow to Implement Domain Driven Design  in Real Life SDLC
How to Implement Domain Driven Design in Real Life SDLC
 
python classes in thane
python classes in thanepython classes in thane
python classes in thane
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 

More from Manabu Terada

SI業界の営業の役割と存在意義を一緒に考えよう
SI業界の営業の役割と存在意義を一緒に考えようSI業界の営業の役割と存在意義を一緒に考えよう
SI業界の営業の役割と存在意義を一緒に考えよう
Manabu Terada
 
私とコミュニティとPython
私とコミュニティとPython私とコミュニティとPython
私とコミュニティとPython
Manabu Terada
 
Plone 5 & アクセシビリティ at OSC 2015 Tokyo fall
Plone 5 & アクセシビリティ at OSC 2015 Tokyo fallPlone 5 & アクセシビリティ at OSC 2015 Tokyo fall
Plone 5 & アクセシビリティ at OSC 2015 Tokyo fall
Manabu Terada
 
Plone + AWS at Plone Symposium tokyo 2015
Plone + AWS at Plone Symposium tokyo 2015Plone + AWS at Plone Symposium tokyo 2015
Plone + AWS at Plone Symposium tokyo 2015Manabu Terada
 
Osc2015 Tokyo Spring Plone by terada
Osc2015 Tokyo Spring Plone by teradaOsc2015 Tokyo Spring Plone by terada
Osc2015 Tokyo Spring Plone by terada
Manabu Terada
 
Plone conf 2014report by terada
Plone conf 2014report by teradaPlone conf 2014report by terada
Plone conf 2014report by terada
Manabu Terada
 
PloneConf 2014 CDN terada
PloneConf 2014 CDN teradaPloneConf 2014 CDN terada
PloneConf 2014 CDN terada
Manabu Terada
 
Planning plone Symposium Tokyo 2015
Planning plone Symposium Tokyo 2015Planning plone Symposium Tokyo 2015
Planning plone Symposium Tokyo 2015
Manabu Terada
 
OSC 2014 Tokyo fall plone_terada
OSC 2014 Tokyo fall plone_teradaOSC 2014 Tokyo fall plone_terada
OSC 2014 Tokyo fall plone_terada
Manabu Terada
 
PyCon JP 2014 plone terada
PyCon JP 2014 plone teradaPyCon JP 2014 plone terada
PyCon JP 2014 plone terada
Manabu Terada
 
Varnish 4 Release Party in Tokyo (terada)
Varnish 4 Release Party in Tokyo (terada)Varnish 4 Release Party in Tokyo (terada)
Varnish 4 Release Party in Tokyo (terada)
Manabu Terada
 
Ja sakai conf 2014 edx by Manabu TERADA
Ja sakai conf 2014 edx by Manabu TERADAJa sakai conf 2014 edx by Manabu TERADA
Ja sakai conf 2014 edx by Manabu TERADAManabu Terada
 
Reporting of PyCon APAC at ploneconf / PyCon BR
Reporting of  PyCon APAC at ploneconf / PyCon BRReporting of  PyCon APAC at ploneconf / PyCon BR
Reporting of PyCon APAC at ploneconf / PyCon BR
Manabu Terada
 
PyCon asiapacific 2013 bengkeat
PyCon asiapacific 2013 bengkeatPyCon asiapacific 2013 bengkeat
PyCon asiapacific 2013 bengkeat
Manabu Terada
 
PyCon APAC session Frontpage for iqbal
PyCon APAC session Frontpage for iqbalPyCon APAC session Frontpage for iqbal
PyCon APAC session Frontpage for iqbal
Manabu Terada
 
Pyconapac2014taiwan
Pyconapac2014taiwanPyconapac2014taiwan
Pyconapac2014taiwan
Manabu Terada
 
PyCon APAC 2013 Apac session terada
PyCon APAC 2013 Apac session teradaPyCon APAC 2013 Apac session terada
PyCon APAC 2013 Apac session terada
Manabu Terada
 
グリーンコンサート視察報告 (寺田)
グリーンコンサート視察報告 (寺田)グリーンコンサート視察報告 (寺田)
グリーンコンサート視察報告 (寺田)
Manabu Terada
 
Plone talk 201308_terada
Plone talk 201308_teradaPlone talk 201308_terada
Plone talk 201308_teradaManabu Terada
 

More from Manabu Terada (20)

SI業界の営業の役割と存在意義を一緒に考えよう
SI業界の営業の役割と存在意義を一緒に考えようSI業界の営業の役割と存在意義を一緒に考えよう
SI業界の営業の役割と存在意義を一緒に考えよう
 
私とコミュニティとPython
私とコミュニティとPython私とコミュニティとPython
私とコミュニティとPython
 
Plone 5 & アクセシビリティ at OSC 2015 Tokyo fall
Plone 5 & アクセシビリティ at OSC 2015 Tokyo fallPlone 5 & アクセシビリティ at OSC 2015 Tokyo fall
Plone 5 & アクセシビリティ at OSC 2015 Tokyo fall
 
Plone + AWS at Plone Symposium tokyo 2015
Plone + AWS at Plone Symposium tokyo 2015Plone + AWS at Plone Symposium tokyo 2015
Plone + AWS at Plone Symposium tokyo 2015
 
Osc2015 Tokyo Spring Plone by terada
Osc2015 Tokyo Spring Plone by teradaOsc2015 Tokyo Spring Plone by terada
Osc2015 Tokyo Spring Plone by terada
 
Plone conf 2014report by terada
Plone conf 2014report by teradaPlone conf 2014report by terada
Plone conf 2014report by terada
 
PloneConf 2014 CDN terada
PloneConf 2014 CDN teradaPloneConf 2014 CDN terada
PloneConf 2014 CDN terada
 
Planning plone Symposium Tokyo 2015
Planning plone Symposium Tokyo 2015Planning plone Symposium Tokyo 2015
Planning plone Symposium Tokyo 2015
 
OSC 2014 Tokyo fall plone_terada
OSC 2014 Tokyo fall plone_teradaOSC 2014 Tokyo fall plone_terada
OSC 2014 Tokyo fall plone_terada
 
PyCon JP 2014 plone terada
PyCon JP 2014 plone teradaPyCon JP 2014 plone terada
PyCon JP 2014 plone terada
 
WPD tokyo opening
WPD tokyo openingWPD tokyo opening
WPD tokyo opening
 
Varnish 4 Release Party in Tokyo (terada)
Varnish 4 Release Party in Tokyo (terada)Varnish 4 Release Party in Tokyo (terada)
Varnish 4 Release Party in Tokyo (terada)
 
Ja sakai conf 2014 edx by Manabu TERADA
Ja sakai conf 2014 edx by Manabu TERADAJa sakai conf 2014 edx by Manabu TERADA
Ja sakai conf 2014 edx by Manabu TERADA
 
Reporting of PyCon APAC at ploneconf / PyCon BR
Reporting of  PyCon APAC at ploneconf / PyCon BRReporting of  PyCon APAC at ploneconf / PyCon BR
Reporting of PyCon APAC at ploneconf / PyCon BR
 
PyCon asiapacific 2013 bengkeat
PyCon asiapacific 2013 bengkeatPyCon asiapacific 2013 bengkeat
PyCon asiapacific 2013 bengkeat
 
PyCon APAC session Frontpage for iqbal
PyCon APAC session Frontpage for iqbalPyCon APAC session Frontpage for iqbal
PyCon APAC session Frontpage for iqbal
 
Pyconapac2014taiwan
Pyconapac2014taiwanPyconapac2014taiwan
Pyconapac2014taiwan
 
PyCon APAC 2013 Apac session terada
PyCon APAC 2013 Apac session teradaPyCon APAC 2013 Apac session terada
PyCon APAC 2013 Apac session terada
 
グリーンコンサート視察報告 (寺田)
グリーンコンサート視察報告 (寺田)グリーンコンサート視察報告 (寺田)
グリーンコンサート視察報告 (寺田)
 
Plone talk 201308_terada
Plone talk 201308_teradaPlone talk 201308_terada
Plone talk 201308_terada
 

Recently uploaded

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 

Recently uploaded (20)

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 

Fuzzy search on plone & search for east asian language

  • 1. ©2013 CMScom info@cmscom.jp Fuzzy Search on Plone and Search for East Asian Language CMS communications Inc, Manabu TERADA terada@cmscom.jp http://www.cmscom.jp 4 / Oct / 2013 Plone Conference 2013 in Brasilia
  • 2. Who am I? (お前だれよ?, "Who are you?") • Manabu TERADA (寺田 学) @terapyon • Advisory Board Member of the Plone Foundation • Chair of PyCon APAC 2013 in Japan • Owner of CMS communications Inc. • Member of Plone Users Group Japan • Author
  • 3. Contents • About Japanese and other languages • Fuzzy Search on Plone • About the product • Basic technology • Dependencies • Demo • Structure of the product • Future plans
  • 4. Language Questions • ありがとう, Thank you, Obrigado, Gracias, 谢谢, 감사 합니다, ขอบคุณ, Спасибо, شكرا
  • 5. Language Questions • ありがとう (Japanese), Thank you (English), Obrigado (Portuguese), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic)
  • 6. Language Questions • Double-byte characters • ありがとう (Japanese), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  • 8. Language Questions • Left to Right (LTR) or Right to Left (RTL) • ありがとう (Japanese), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  • 10. Language Questions • No white space between words? • ありがとう (Japanese), Thank you (English), Gracias (Spanish), 谢谢 (Chinese), 감사 합니다 (Korean), ขอบคุณ (Thai), Спасибо (Russian), شكرا (Arabic), Obrigado (Portuguese)
  • 12. Japanese • Can you read this Japanese? • 私は寺田学です。日本の東京から来ました。ブラジルに来たのは初めてです。 • I am Manabu TERADA. I came from Tokyo, Japan. This is my first time in Brazil. • With word breaks: 私 は 寺田 学 です。日本 の 東京 から 来ました。ブラジル に 来た のは 初めて です。
  • 13. Japanese • Japanese doesn't use white space to separate words. • Japanese has three kinds of characters: Hiragana, Katakana, and Kanji. • Hiragana and Katakana have about 50 characters each. • Kanji has over 2,000 characters. • The same word can be written with different characters, and different words (homonyms) can be written with the same characters.
  • 14. Japanese • These all have the same meaning: • Kyoto ← Romaji • 京都 ← Kanji • きょうと ← Hiragana • キョウト ← Katakana
  • 15. Japanese • Can you read these? • 橋 (bridge) → ハシ → Hashi • 端 (edge) → ハシ → Hashi • 箸 (chopsticks) → ハシ → Hashi • They have different meanings. • We can tell them apart from context.
  • 16. Japanese and other Languages • We have a lot of languages. • We have a lot of rules. • We have a lot of issues. • I want Plone to have solutions for them.
  • 17. Fuzzy Search on Plone
  • 18. Fuzzy Search on Plone • Name: c2.search.fuzzy • Version: 1.0a5 (alpha release) • https://pypi.python.org/pypi/c2.search.fuzzy • https://bitbucket.org/cmscom/c2.search.fuzzy
  • 19. About
  • 20. Fuzzy Search on Plone • We want search suggestions like Google's. • On an intranet, we can NOT use Google.
  • 21. Fuzzy Search on Plone • It does NOT use Solr. I know Solr works well, • but it's difficult to install, configure, and implement. • And I want to build my own system.
  • 22. Basic technology • This system is not difficult. • Keywords: • Levenshtein distance • Sorted list • Automata
  • 23. Basic technology • "The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other. The phrase edit distance is often used to refer specifically to Levenshtein distance. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is closely related to pairwise string alignments." From Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
  • 24. Basic technology • Levenshtein distance • Base word: plone • Distance zero: PLONE, Plone, pLone • Distance one: Phone, plene, plne, lone, ploneg, ... • Distance two: one, plo, polne, ...
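The distances listed on this slide can be checked with a small dynamic-programming sketch. This is not code from c2.search.fuzzy: the function name `levenshtein` is hypothetical, and lower-casing both words first is an assumption, made so that PLONE, Plone and pLone come out at distance zero as the slide lists them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # Assumption: compare case-insensitively, as the distance-zero examples suggest.
    a, b = a.lower(), b.lower()
    # At the start of iteration i, previous[j] is the distance
    # between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


for word in ["PLONE", "Phone", "plne", "ploneg", "one", "polne"]:
    print(word, levenshtein("plone", word))
```

Note that a swap of two neighboring letters (polne) counts as two edits under this definition, which matches the slide.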
  • 25. Basic technology • Sorted list • An ordered container (List), or anything that can give the order of words • Sorted in Unicode (alphabetical) order • For example, the G20 countries: ['Argentina', 'Australia', 'Brazil', 'Canada', 'China', 'European Union', 'France', 'Germany', 'India', 'Indonesia', 'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia', 'South Africa', 'South Korea', 'Turkey', 'United Kingdom', 'United States']
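One reason a sorted list is useful: the position of any string, whether it is in the list or not, can be found by binary search. Below is a minimal sketch with Python's standard `bisect` module and the G20 list from the slide. The helper name `first_at_or_after` is hypothetical, and the slides do not say whether the product itself uses `bisect`.

```python
import bisect

G20 = sorted([
    'Argentina', 'Australia', 'Brazil', 'Canada', 'China', 'European Union',
    'France', 'Germany', 'India', 'Indonesia', 'Italy', 'Japan', 'Mexico',
    'Russia', 'Saudi Arabia', 'South Africa', 'South Korea', 'Turkey',
    'United Kingdom', 'United States',
])


def first_at_or_after(words, probe):
    """Return the first word >= probe in code-point order, or None."""
    i = bisect.bisect_left(words, probe)
    return words[i] if i < len(words) else None


print(first_at_or_after(G20, 'Br'))     # Brazil
print(first_at_or_after(G20, 'Japon'))  # Mexico
```

The second call shows why sorted order alone is not enough for fuzzy matching: the nearest neighbor in the ordering (Mexico) is not the nearest word by edit distance (Japan).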
  • 26. Basic technology • From @hiratara's slides: http://www.slideshare.net/hiratara/levenshtein-automata
  • 27. Basic technology • Levenshtein Automata • I found a good blog entry: • Damn Cool Algorithms: Levenshtein Automata • http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata • https://gist.github.com/Arachnid/491973 • It uses only Python!!
  • 28. Basic technology • Index • The product builds its own index, like a sorted list, whenever Plone content is created or modified. • Search • When we type into the search box, it searches that index. • Correctly spelled words within a small distance of the input are shown from the index. • They can be suggested because they appear in Plone content.
  • 29. Basic technology • For example, • We want to show words within distance one (the default). • From the G20 countries list: • Brezil → Brazil • Japon → Japan • And it uses an automaton for speed.
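The G20 lookup can be sketched as follows. For brevity this does not build the NFA/DFA described in the blog post referenced earlier. It is a simpler stand-in whose state is one row of the Levenshtein table, which still lets a candidate be abandoned as soon as it can no longer match. The names `LevenshteinAutomaton` and `suggest` are hypothetical, and lower-casing is an assumption.

```python
class LevenshteinAutomaton:
    """Accepts every string within max_edits edits of `word`.

    The state is one row of the Levenshtein dynamic-programming table:
    state[i] is the distance between the input read so far and word[:i].
    """

    def __init__(self, word, max_edits=1):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        return list(range(len(self.word) + 1))

    def step(self, state, char):
        new_state = [state[0] + 1]
        for i, expected in enumerate(self.word):
            cost = 0 if expected == char else 1
            new_state.append(min(new_state[i] + 1, state[i] + cost, state[i + 1] + 1))
        # Cap values so the number of distinct states stays finite.
        return [min(v, self.max_edits + 1) for v in new_state]

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        return min(state) <= self.max_edits


def suggest(query, words, max_edits=1):
    """Return the words within max_edits of query (case-insensitive)."""
    automaton = LevenshteinAutomaton(query.lower(), max_edits)
    found = []
    for word in words:
        state = automaton.start()
        for char in word.lower():
            state = automaton.step(state, char)
            if not automaton.can_match(state):
                break  # no continuation of this word can match
        else:
            if automaton.is_match(state):
                found.append(word)
    return found


G20 = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China', 'European Union',
       'France', 'Germany', 'India', 'Indonesia', 'Italy', 'Japan', 'Mexico',
       'Russia', 'Saudi Arabia', 'South Africa', 'South Korea', 'Turkey',
       'United Kingdom', 'United States']

print(suggest('Brezil', G20))  # ['Brazil']
print(suggest('Japon', G20))   # ['Japan']
```

This sketch still visits every word in the list. A fuller implementation would also use the sorted order to skip ranges of words that cannot match, which is where most of the speed-up comes from.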
  • 31. Dependencies • We use MeCab for Japanese support. • Japanese doesn't have white space for splitting words. • (Same as Chinese and Korean.)
  • 32. Dependencies • Supported languages • English and other European languages • MAYBE: Arabic • Chinese and Korean • They need a word-splitting system. • I don't know of one.
  • 33. Demo • View the video on YouTube: http://youtu.be/e5DHsF7Gi70
  • 34. Structure of the product • Index data is stored in ZODB as a List object. • When content is created or modified, the List is updated and kept sorted. • The List holds Dicts: each Dict key is the phonetic form (or the lower-cased word in English), and the value is the list of original words. • Example index data: [{'argentina': ['Argentina', 'argentina', 'ARGENTINA']}, {'australia': ['Australia']}, {'brazil': ['Brazil']}, {'きょうと': ['京都', 'キョウト']}]
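A minimal sketch of the index layout described on this slide, using plain Python objects instead of ZODB persistence. The names `FuzzyIndex`, `add`, `keys` and `originals` are hypothetical. It keeps a sorted list of keys beside a single dict rather than a list of one-key dicts, which is a simplification of the slide's example data and not the product's actual storage.

```python
import bisect


class FuzzyIndex:
    """Sorted normalized keys, each mapped to the original spellings seen."""

    def __init__(self):
        self.keys = []       # sorted list of normalized keys
        self.originals = {}  # key -> list of original words

    def add(self, word, key=None):
        # Assumption: the key defaults to the lower-cased word; for Japanese
        # the caller would pass the phonetic reading instead.
        key = word.lower() if key is None else key
        if key not in self.originals:
            bisect.insort(self.keys, key)  # keeps self.keys sorted
            self.originals[key] = []
        if word not in self.originals[key]:
            self.originals[key].append(word)


index = FuzzyIndex()
for w in ['Brazil', 'Argentina', 'argentina', 'ARGENTINA', 'Australia']:
    index.add(w)
index.add('京都', key='きょうと')
index.add('キョウト', key='きょうと')

print(index.keys)
print(index.originals['argentina'])
```

Separating the sorted keys from the original spellings means fuzzy matching only has to run against one normalized form per word, while the suggestion shown to the user can still be the spelling that appears in the content.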
  • 35. Structure of the product • Search • The automaton checks the List for words within a small distance of the input word. • The original words from the matching Dict values are shown under the search box by JavaScript.
  • 36. Structure of the product • For Japanese • I'm using MeCab for splitting words and getting their phonetic readings. • Both the phonetic reading and the original word are stored, • because in Japanese the same word can be written with different characters.
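The example index data uses one hiragana key (きょうと) for both 京都 and キョウト. Getting the reading of a kanji word needs a morphological analyzer such as MeCab, which is not shown here. The sketch below covers only an assumed final step: folding a katakana reading into hiragana so that both spellings share one key. The function name is hypothetical, and the slides do not say how the product normalizes readings.

```python
def katakana_to_hiragana(text: str) -> str:
    """Map katakana (U+30A1..U+30F6) to the matching hiragana (U+3041..U+3096)."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:
            # The two Unicode blocks are laid out in parallel, 0x60 apart.
            out.append(chr(code - 0x60))
        else:
            # Leave kanji, Latin letters, the long-vowel mark, etc. untouched.
            out.append(ch)
    return ''.join(out)


print(katakana_to_hiragana('キョウト'))  # きょうと
```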
  • 37. Future plans • Now, I'm using ZODB to store the index. • I want an option to store it in an RDBMS. I'm trying to develop it. • I want to support more languages. • Please help me support more languages.
  • 38. Thanks • Japanese & East Asian languages • We still have some problems in Plone. • I think Plone works well with multiple languages. • I hope Plone continues to work well. • Developers, please never forget other languages. • Fuzzy search • I would like to receive bug reports. • Please try the product.
  • 39. Special thanks • Supported by • ike @rokujyouhitoma • @hiratara • Referred web site • http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
  • 40. Contact me • Twitter: @terapyon • Facebook: https://www.facebook.com/terapyon