Japanese linguistics
in Apache Lucene™ and Apache Solr™

             May 9th, 2012

             Christian Moen
          christian@atilika.com
About me
•   MSc. in computer science, University of Oslo, Norway
•   Worked with search at FAST (now Microsoft) for 10 years
     •   5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
     •   5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
•   Founded アティリカ株式会社 in 2009
     •   We help companies innovate using search technologies and good ideas
     •   We know information retrieval, natural language processing and big data
     •   We are based in Tokyo, but we have clients everywhere
•   Newbie Lucene & Solr Committer
     •   Mostly been working on Japanese language support (Kuromoji) so far
•   Please write to me at christian@atilika.com or cm@apache.org
Today’s topics

•   Japanese 101 - ordering beer and toasting


•   Japanese language processing


•   Japanese features in Lucene/Solr
Japanese 101
ビールください
 bi-ru kudasai

A beer, please
ありがとうございます!
 arigatō gozaimasu!

Thank you very much!
乾杯!
kanpai!

Cheers!
JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

  Shall we go for a beer near JR Shinjuku station?
JR新宿駅の近くにビールを飲みに行こうか?

Romaji - ローマ字
・Latin characters (26+)
・Used for proper nouns, etc.

Katakana - カタカナ
・Phonetic script (~50)
・Typically used for loan words

Kanji - 漢字
・Chinese characters (50,000+)
・Used for stems & proper nouns

Hiragana - ひらがな
・Phonetic script (~50)
・Used for inflections & particles
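
The four scripts can be told apart mechanically from Unicode code point ranges. The sketch below is only an illustration and is not part of Lucene: the function names `script_of` and `script_runs` are made up for this example, and the ranges are deliberately coarse (ASCII letters, the Hiragana and Katakana blocks, and the main CJK Unified Ideographs block).

```python
from itertools import groupby


def script_of(ch):
    """Return a coarse script label for one character (illustrative ranges only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    return "other"


def script_runs(text):
    """Group consecutive characters that share the same script label."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


runs = script_runs("JR新宿駅の近くにビールを飲みに行こうか")
for script, chunk in runs:
    print(f"{script:9} {chunk}")
```

Note that script boundaries are not word boundaries: 新宿駅 comes out as one kanji run although it is two words, and 近く is split because it mixes kanji and hiragana.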
JR新宿駅の近くにビールを飲みに行こうか?
? What are the words in this sentence?
! Words are implicit in Japanese - there
  is no white space that separates them
JR新宿駅の近くにビールを飲みに行こうか?
? How do we index this for search, then?
! We need to segment text into tokens first
! Two major approaches for segmentation

          1. n-gramming
          2. morphological analysis
            (statistical approach)
n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

A window of n=2 characters slides over the text one character at a time:

JR R新 新宿 宿駅 駅の の近 近く ...
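
The sliding window can be sketched in a couple of lines. This is a minimal illustration that treats every character alike, as the slide does; the function name `char_ngrams` is made up for this example and is not a Lucene API.

```python
def char_ngrams(text, n=2):
    """Return every overlapping run of n characters, in text order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("JR新宿駅の近く")
print(" ".join(bigrams))  # JR R新 新宿 宿駅 駅の の近 近く
```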
Problems with n-gramming
    JR ●   R新 ×   新宿 ●   宿駅 ×   駅の ×   の近 ×   近く ●   ...

    宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’

•   Does not preserve meaning well and often changes semantics
     •   Hurts ranking and search precision (many false positives)
•   Also generates many terms per document or query
     •   Increases index size and slows down search
•   Still sometimes appropriate for certain search applications
     •   Compliance, e-commerce with special product names, ...
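
The false-positive problem can be reproduced with a toy bigram index. This is a sketch under stated assumptions: the second document ("江戸時代の宿駅", roughly "post stations of the Edo period") is invented for the example, and `build_index` and `search` are hypothetical helpers, not Lucene calls.

```python
def char_ngrams(text, n=2):
    """Return every overlapping run of n characters, in text order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query, n=2):
    """Return the ids of documents containing every n-gram of the query."""
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    return set.intersection(*[index.get(gram, set()) for gram in grams])


docs = {
    "beer": "JR新宿駅の近くにビールを飲みに行こうか",
    "history": "江戸時代の宿駅",  # invented example document
}
index = build_index(docs)
hits = search(index, "宿駅")
term_count = len(char_ngrams(docs["beer"]))
print(sorted(hits), term_count)
```

A query for 宿駅 (post town) also returns the beer sentence, because the bigram 宿駅 straddles 新宿 and 駅. The sentence also yields 19 bigram terms, compared with the 13 word tokens (not counting the question mark) that morphological analysis produces.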
Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

JR  新宿  駅  の  近く  に  ビール  を  飲み  に  行こ  う  か  ?
(all 14 tokens are marked ●)

  •   Tokens reflect what a Japanese speaker considers to be words
  •   Machine-learned statistical approach
       •   Conditional Random Fields (CRFs) decoded using Viterbi
       •   Also does part-of-speech tagging, extracts readings for kanji, etc.
  •   Several statistical models available with high accuracy (F > 0.97)
       •   Models/dictionaries are available as IPADIC, UniDic, ...
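
The Viterbi decoding step can be illustrated with a toy dictionary. This is a heavily simplified sketch, not the Kuromoji implementation: the costs in `TOY_COSTS` are invented, only per-word costs are used (a real model also scores the connection between adjacent parts of speech), and unknown characters simply get a flat penalty.

```python
def segment(text, word_costs, unknown_cost=100):
    """Return the cheapest split of text into dictionary words (Viterbi search)."""
    n = len(text)
    max_len = max(len(word) for word in word_costs)
    best = [float("inf")] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: start offset of the last token in that split
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_costs:
                cost = word_costs[piece]
            elif end - start == 1:
                cost = unknown_cost  # fallback: any single character may be a token
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


# Invented costs: lower means more likely. 宿駅, 新, 近 and く are included
# on purpose so that the search has competing paths to choose between.
TOY_COSTS = {
    "JR": 5, "新宿": 5, "駅": 5, "宿駅": 9, "新": 6, "の": 2,
    "近く": 5, "近": 6, "く": 6, "に": 2, "ビール": 5, "を": 2,
    "飲み": 5, "行こ": 5, "う": 3, "か": 3,
}

tokens = segment("JR新宿駅の近くにビールを飲みに行こうか", TOY_COSTS)
print(" ".join(tokens))
```

With these costs the path 新宿 + 駅 (cost 10) beats 新 + 宿駅 (cost 15), so the misleading token 宿駅 never appears in the output.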
How does this actually work?
Demo
Japanese support in
  Lucene and Solr
Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6

! Available out-of-the-box

! Easy to use with reasonable defaults

! Provides sophisticated Japanese linguistics

! Customisable
How do we use it?

      ! Use JapaneseAnalyzer



      ! Use field type “text_ja”
        in example schema.xml
Demo
Feature summary / text_ja analyzer chain

JapaneseTokenizer
    Segments Japanese text into tokens with very high accuracy
    •   Token attributes for part-of-speech, base form, readings, etc.
    •   Compound segmentation with compound synonyms
    •   Segmentation is customisable using user dictionaries

JapaneseBaseFormFilter
    Adjective and verb lemmatisation (by reduction)

JapanesePartOfSpeechStopFilter
    Stop-words removal based on part-of-speech tags
    See example/solr/conf/lang/stoptags_ja.txt

CJKWidthFilter
    Character width normalisation (fast Unicode NFKC subset)

StopFilter
    Stop-words removal
    See example/solr/conf/lang/stopwords_ja.txt

JapaneseKatakanaStemFilter
    Normalises common katakana spelling variations

LowerCaseFilter
    Lowercases
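
Stop-word removal by part-of-speech tag can be sketched as a simple filter over tagged tokens. This is a minimal illustration, not the Lucene filter: the tags are assigned by hand, and the English tag names (`noun`, `particle`, ...) are hypothetical simplifications of the dictionary tags that a stoptags file would actually list.

```python
def pos_stop_filter(tokens, stop_tags):
    """Drop tokens whose part-of-speech tag is listed in stop_tags."""
    return [(surface, pos) for surface, pos in tokens if pos not in stop_tags]


# Hand-tagged tokens for the example sentence (simplified, hypothetical tag names).
tokens = [
    ("JR", "noun"), ("新宿", "noun"), ("駅", "noun"), ("の", "particle"),
    ("近く", "noun"), ("に", "particle"), ("ビール", "noun"), ("を", "particle"),
    ("飲み", "verb"), ("に", "particle"), ("行こ", "verb"),
    ("う", "auxiliary-verb"), ("か", "particle"),
]
stop_tags = {"particle", "auxiliary-verb"}

kept = [surface for surface, _ in pos_stop_filter(tokens, stop_tags)]
print(" ".join(kept))  # JR 新宿 駅 近く ビール 飲み 行こ
```

Filtering on tags rather than on a word list removes grammatical tokens without having to enumerate them, which is why the chain runs this step in addition to the ordinary StopFilter.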
Feature details
Compound nouns
? How do we deal with compound nouns?
    Japanese                        English
    関西国際空港                    Kansai International Airport
    シニアソフトウェアエンジニア    Senior Software Engineer

! These are one word in Japanese, so
  searching for 空港 (airport) doesn’t match

! We need to segment the compounds, too
Compound segmentation

関西国際空港 (Kansai International Airport)
    関西 (Kansai)    国際 (International)    空港 (Airport)

シニアソフトウェアエンジニア (Senior Software Engineer)
    シニア (Senior)    ソフトウェア (Software)    エンジニア (Engineer)

 ! We are using a heuristic to implement this
Compound synonym tokens
    Position 1        Position 2        Position 3
    関西              国際              空港
    関西国際空港 (the whole compound, kept as a synonym at position 1)

•   Segment compounds into their parts
    •   Good for recall - we can also search and match 空港 (airport)
•   We keep the compound itself as a synonym
    •   Good for precision with an exact hit because of IDF
•   Approach benefits both precision and recall for overall good ranking
    •   JapaneseTokenizer actually returns a graph of tokens
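
The position layout above can be sketched as a list of (term, position) pairs. This is an illustration only: the segmentation is supplied by hand rather than found by the heuristic, and `compound_tokens` and `build_index` are hypothetical helpers, not Lucene APIs.

```python
def compound_tokens(compound, parts):
    """Return (term, position) pairs: the parts in sequence, plus the whole
    compound stacked on the first position as a synonym."""
    tokens = [(part, position) for position, part in enumerate(parts, start=1)]
    tokens.insert(1, (compound, 1))
    return tokens


def build_index(docs):
    """Map each term to the ids of the documents that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for term, _position in tokens:
            index.setdefault(term, set()).add(doc_id)
    return index


tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])
index = build_index({"doc1": tokens})
print(tokens)
print(index["空港"], index["関西国際空港"])
```

Both the part 空港 and the full compound 関西国際空港 find the document: the first gives recall, and the second is a rarer term, so an exact hit on it scores higher under IDF.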
Character width normalisation
? How do we deal with character widths?
    Half-width・半角        Full-width・全角
    Lucene                  Ｌｕｃｅｎｅ
    ｶﾀｶﾅ                    カタカナ
    123                     １２３

! Use CJKWidthFilter to normalise them
  (Unicode NFKC subset)

    Input text        Ｌｕｃｅｎｅ      ｶﾀｶﾅ          １２３
    CJKWidthFilter    Lucene            カタカナ      123
                      half-width        full-width    half-width
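
The effect of width normalisation can be reproduced with the standard library. Note the assumption: this sketch applies full Unicode NFKC, which is a superset of what the slide describes for CJKWidthFilter, so it also rewrites other compatibility characters that a width-only filter would leave alone.

```python
import unicodedata


def normalise_width(text):
    """Fold full-width Latin/digits to half-width and half-width katakana to
    full-width by applying Unicode NFKC (which does more than width folding)."""
    return unicodedata.normalize("NFKC", text)


result = [normalise_width(token) for token in ["Ｌｕｃｅｎｅ", "ｶﾀｶﾅ", "１２３"]]
print(result)  # ['Lucene', 'カタカナ', '123']
```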
Katakana end-vowel stemming
? A common spelling variation in katakana is
  the long-vowel mark (ー) at the end of a word
    English     Japanese spelling variations
    manager     マネージャー    マネージャ    マネジャー

! We use JapaneseKatakanaStemFilter to
  normalise/stem the end vowel for long terms

    Input text                    コピー    マネージャー    マネージャ    マネジャー
    JapaneseKatakanaStemFilter    コピー    マネージャ      マネージャ    マネジャ
                                  copy      manager         manager       “manager”
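
The stemming rule in the table can be sketched as follows. This is an illustration rather than the Lucene source: the function names are made up, and the `minimum_length=4` threshold is an assumption chosen to match "long terms" on the slide, so that コピー (3 characters) is left untouched.

```python
PROLONGED_SOUND_MARK = "ー"  # U+30FC, the katakana long-vowel mark


def is_katakana(token):
    """True if every character lies in the Katakana block (U+30A0..U+30FF)."""
    return all(0x30A0 <= ord(ch) <= 0x30FF for ch in token)


def stem_katakana(token, minimum_length=4):
    """Strip one trailing long-vowel mark from sufficiently long katakana tokens."""
    if (len(token) >= minimum_length
            and is_katakana(token)
            and token.endswith(PROLONGED_SOUND_MARK)):
        return token[:-1]
    return token


stems = [stem_katakana(t) for t in ["コピー", "マネージャー", "マネージャ", "マネジャー"]]
print(stems)  # ['コピー', 'マネージャ', 'マネージャ', 'マネジャ']
```

The length guard matters: without it コピー (copy) would become コピ, which is no longer a recognisable word. The rule only handles the trailing mark, so マネジャ still differs from マネージャ, as the quotation marks in the table suggest.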
Lemmatisation
? Japanese adjectives and verbs are highly
  inflected, how do we deal with that?

Dictionary form
    買う (kau, to buy)

Inflected forms (not exhaustive)
    買いなさい　　　　　買いませんでしたら　　買える　　　　　　　買わせられる
    買いなさるな　　　　買いませんでしたり　　買おう　　　　　　　買わせる
    買いましたら　　　　買いませんなら　　　　買った　　　　　　　買わない
    買いましたり　　　　買うだろう　　　　　　買ったら　　　　　　買わないだろう
    買いまして　　　　　買うでしょう　　　　　買ったり　　　　　　買わないで
    買いましょう　　　　買うな　　　　　　　　買って　　　　　　　買わないでしょう
    買います　　　　　　買うまい　　　　　　　買わせない　　　　　買わなかった
    買いますまい　　　　買え　　　　　　　　　買わせます　　　　　買わなかったら
    買いませば　　　　　買えない　　　　　　　買わせません　　　　買わなかったり
    買いません　　　　　買えば　　　　　　　　買わせられない　　　買わなければ
    買いませんで　　　　買えます　　　　　　　買わせられます　　　買われない
    買いませんでした　　買えません　　　　　　買わせられません　　買われます

 ! Use JapaneseBaseFormFilter to normalise
   inflected adjectives and verbs to dictionary form
   (lemmatisation by reduction)
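
Lemmatisation by reduction means the base form is looked up, not generated: the tokenizer already attaches a dictionary base form to each token, and the filter simply swaps it in. In the sketch below the (surface, base form) pairs are written by hand as an assumption about what an analyser would emit for 買った and 買わない; they are not produced by a real tokenizer, and `base_form_filter` is a hypothetical name.

```python
def base_form_filter(tokens):
    """Replace each surface form with its dictionary base form when one is known."""
    return [base if base else surface for surface, base in tokens]


# Hand-written (surface, base form) pairs; None means no separate base form.
bought = [("買っ", "買う"), ("た", None)]       # 買った, "bought"
not_buy = [("買わ", "買う"), ("ない", None)]    # 買わない, "do not buy"

print(base_form_filter(bought))   # ['買う', 'た']
print(base_form_filter(not_buy))  # ['買う', 'ない']
```

Both inflected stems reduce to 買う, so a query for any one form can match documents that use another.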
User dictionaries
•   Own dictionaries can be used for ad hoc
    segmentation, i.e. to override default model
•   File format is simple and there’s no need to
    assign weights, etc. before using them
•   Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
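
The file format above can be read with a few lines of code. This is a sketch of a reader, not the Lucene loader: the field names (`surface`, `segments`, `readings`, `pos`) are inferred from the comments in the example, and the check that segments and readings have the same count is an assumption about what a well-formed entry looks like.

```python
import csv

SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""


def parse_user_dictionary(text):
    """Parse 'surface,segmentation,readings,pos' lines, skipping blanks and # comments."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    entries = []
    for surface, segmentation, readings, pos in csv.reader(lines):
        segments = segmentation.split(" ")
        kana = readings.split(" ")
        if len(segments) != len(kana):
            raise ValueError(f"{surface}: {len(segments)} segments but {len(kana)} readings")
        entries.append({"surface": surface, "segments": segments,
                        "readings": kana, "pos": pos})
    return entries


entries = parse_user_dictionary(SAMPLE)
for entry in entries:
    print(entry["surface"], "->", entry["segments"], entry["pos"])
```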
Japanese focus in 4.0
•   Improvements in JapaneseTokenizer
     •   Improved search mode for katakana compounds
     •   Improved unknown word segmentation
     •   Some performance improvements
•   CharFilters for various character normalisations
     •   Dates and numbers
     •   Repetition marks (odoriji)
•   Japanese spell-checker
     •   Robert and Koji almost got this into 3.6, but it was
         postponed because it required API changes
Acknowledgements
Robert Muir
Thanks for the heavy lifting integrating Kuromoji into Lucene,
for always reviewing my patches quickly and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues
Q&A
ありがとうございました!
 arigatō gozaimashita!

Thank you very much!

More Related Content

What's hot

온라인 게임과 소셜 게임 서버는 어떻게 다른가?
온라인 게임과 소셜 게임 서버는 어떻게 다른가?온라인 게임과 소셜 게임 서버는 어떻게 다른가?
온라인 게임과 소셜 게임 서버는 어떻게 다른가?Seok-ju Yun
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Spark Summit
 
Wavelet matrix implementation
Wavelet matrix implementationWavelet matrix implementation
Wavelet matrix implementationMITSUNARI Shigeo
 
Building REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with ScalaBuilding REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with ScalaKnoldus Inc.
 
RustによるGPUプログラミング環境
RustによるGPUプログラミング環境RustによるGPUプログラミング環境
RustによるGPUプログラミング環境KiyotomoHiroyasu
 
Luigi PyData presentation
Luigi PyData presentationLuigi PyData presentation
Luigi PyData presentationElias Freider
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaJiangjie Qin
 
ドメイン名の話 (データベース/SQL)
ドメイン名の話 (データベース/SQL)ドメイン名の話 (データベース/SQL)
ドメイン名の話 (データベース/SQL)tsudaa
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafkaconfluent
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaJoe Stein
 
How YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLYugabyte
 
Domain Modeling Made Functional (DevTernity 2022)
Domain Modeling Made Functional (DevTernity 2022)Domain Modeling Made Functional (DevTernity 2022)
Domain Modeling Made Functional (DevTernity 2022)Scott Wlaschin
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafkaJiangjie Qin
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...HostedbyConfluent
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
 
NDC12_Lockless게임서버설계와구현
NDC12_Lockless게임서버설계와구현NDC12_Lockless게임서버설계와구현
NDC12_Lockless게임서버설계와구현noerror
 
C#을 사용한 빠른 툴 개발
C#을 사용한 빠른 툴 개발C#을 사용한 빠른 툴 개발
C#을 사용한 빠른 툴 개발흥배 최
 

What's hot (20)

온라인 게임과 소셜 게임 서버는 어떻게 다른가?
온라인 게임과 소셜 게임 서버는 어떻게 다른가?온라인 게임과 소셜 게임 서버는 어떻게 다른가?
온라인 게임과 소셜 게임 서버는 어떻게 다른가?
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Wavelet matrix implementation
Wavelet matrix implementationWavelet matrix implementation
Wavelet matrix implementation
 
Building REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with ScalaBuilding REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with Scala
 
RustによるGPUプログラミング環境
RustによるGPUプログラミング環境RustによるGPUプログラミング環境
RustによるGPUプログラミング環境
 
Luigi PyData presentation
Luigi PyData presentationLuigi PyData presentation
Luigi PyData presentation
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 
ドメイン名の話 (データベース/SQL)
ドメイン名の話 (データベース/SQL)ドメイン名の話 (データベース/SQL)
ドメイン名の話 (データベース/SQL)
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache Kafka
 
Memory in go
Memory in goMemory in go
Memory in go
 
How YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQL
 
Domain Modeling Made Functional (DevTernity 2022)
Domain Modeling Made Functional (DevTernity 2022)Domain Modeling Made Functional (DevTernity 2022)
Domain Modeling Made Functional (DevTernity 2022)
 
pg_trgmと全文検索
pg_trgmと全文検索pg_trgmと全文検索
pg_trgmと全文検索
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
NDC12_Lockless게임서버설계와구현
NDC12_Lockless게임서버설계와구현NDC12_Lockless게임서버설계와구현
NDC12_Lockless게임서버설계와구현
 
暗号技術入門
暗号技術入門暗号技術入門
暗号技術入門
 
C#을 사용한 빠른 툴 개발
C#을 사용한 빠른 툴 개발C#을 사용한 빠른 툴 개발
C#을 사용한 빠른 툴 개발
 

Japanese Linguistics in Lucene and Solr

  • 1. Japanese linguistics in Apache Lucene™ and Apache Solr™ May 9th, 2012 Christian Moen christian@atilika.com
  • 2. About me
      • MSc. in computer science, University of Oslo, Norway
      • Worked with search at FAST (now Microsoft) for 10 years
          • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
          • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
      • Founded アティリカ株式会社 in 2009
          • We help companies innovate using search technologies and good ideas
          • We know information retrieval, natural language processing and big data
          • We are based in Tokyo, but we have clients everywhere
      • Newbie Lucene & Solr Committer
          • Mostly been working on Japanese language support (Kuromoji) so far
      • Please write me on christian@atilika.com or cm@apache.org
  • 4. Today’s topics
      • Japanese 101 - ordering beer and toasting
      • Japanese language processing
      • Japanese features in Lucene/Solr
  • 15. JR新宿駅の近くにビールを飲みに行こうか?
      • JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?
      • Shall we go for a beer near JR Shinjuku station?
  • 21. The four scripts in JR新宿駅の近くにビールを飲みに行こうか?
      • Romaji - ローマ字
          • Latin characters (26+)
          • Used for proper nouns, etc.
      • Katakana - カタカナ
          • Phonetic script (~50)
          • Typically used for loan words
      • Kanji - 漢字
          • Chinese characters (50,000+)
          • Used for stems & proper nouns
      • Hiragana - ひらがな
          • Phonetic script (~50)
          • Used for inflections & particles
  • 24. JR新宿駅の近くにビールを飲みに行こうか?
      • What are the words in this sentence?
      • Words are implicit in Japanese - there is no white space that separates them
  • 26. JR新宿駅の近くにビールを飲みに行こうか?
      • How do we index this for search, then?
      • We need to segment text into tokens first
  • 27. Two major approaches for segmentation
      1. n-gramming
      2. morphological analysis (statistical approach)
  • 35. n-gramming (n=2)
      • JR新宿駅の近くにビールを飲みに行こうか? (Shall we go for a beer near JR Shinjuku station?)
      • A window of n=2 characters slides over the text one character at a time, and each window becomes a term
      • Resulting terms: JR, R新, 新宿, 宿駅, 駅の, の近, 近く, ...
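The sliding window on the n-gramming slides can be sketched in a few lines of Python. This is only an illustration of character n-gramming as shown on the slides, not the code Lucene uses; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in text, in order."""
    # The window starts at each index that still leaves room for n characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # The first part of the example sentence from the slides.
    for term in char_ngrams("JR新宿駅の近く"):
        print(term)
```

For the eight characters of JR新宿駅の近く this yields the seven bigrams listed on the slide, starting with JR and R新.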
  • 47. Problems with n-gramming
      • Terms: JR ●, R新 ×, 新宿 ●, 宿駅 ×, 駅の ×, の近 ×, 近く ● (● = a word in the sentence, × = not a word in the sentence)
      • 宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’
      • Does not preserve meaning well and often changes semantics
          • Impacts on ranking - search precision (many false positives)
      • Also generates many terms per document or query
          • Impacts on index size and performance
      • Still sometimes appropriate for certain search applications
          • Compliance, e-commerce with special product names, ...
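The false-positive problem can be made concrete with a toy matcher. The sketch below assumes a deliberately naive search model in which a query matches when all of its bigrams occur among the document's bigrams; `bigrams` and `bigram_match` are hypothetical helpers, not Lucene APIs.

```python
def bigrams(text: str) -> list[str]:
    """Character bigrams of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_match(query: str, document: str) -> bool:
    """Naive match: every bigram of the query must occur in the document."""
    query_terms = bigrams(query)
    if not query_terms:
        return False  # a one-character query produces no bigrams at all
    document_terms = set(bigrams(document))
    return all(term in document_terms for term in query_terms)


if __name__ == "__main__":
    document = "JR新宿駅の近く"  # "near JR Shinjuku station"
    # 宿駅 ("post town") is not a word in the document, yet it matches,
    # because the bigram straddles the boundary between 新宿 and 駅.
    print(bigram_match("宿駅", document))
    # 空港 ("airport") shares no bigram with the document and does not match.
    print(bigram_match("空港", document))
```

The first query is the false positive described on the slide: the document is about Shinjuku station, not about a post town.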
  • 53. Morphological analysis
      • JR新宿駅の近くにビールを飲みに行こうか? (Shall we go for a beer near JR Shinjuku station?)
      • Segmented: JR | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ?
      • Tokens reflect what a Japanese speaker considers to be words
      • Machine-learned statistical approach
          • Conditional Random Fields (CRFs) decoded using Viterbi
      • Also does part-of-speech tagging, extracts readings for kanji, etc.
      • Several statistical models available with high accuracy (F > 0.97)
      • Models/dictionaries are available as IPADIC, UniDic, ...
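To give a feel for the Viterbi decoding mentioned above, here is a heavily simplified sketch: a lowest-cost path search over a word lattice using per-word costs only. Real analysers such as Kuromoji also use connection costs between adjacent tokens and a learned model; the lexicon, the cost values and the function name `segment` below are all invented for illustration.

```python
def segment(text: str, lexicon: dict[str, int], unknown_cost: int = 10) -> list[str]:
    """Split text into the lowest-cost sequence of lexicon words.

    Single characters missing from the lexicon are allowed at unknown_cost,
    so a segmentation always exists.
    """
    n = len(text)
    max_len = max(len(word) for word in lexicon)
    best = [float("inf")] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in that path
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon:
                cost = lexicon[word]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back pointers from the end of the text to recover the path.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


if __name__ == "__main__":
    # Toy costs, chosen by hand: lower means more likely.
    toy_lexicon = {"JR": 2, "新宿": 2, "駅": 2, "宿駅": 4, "の": 1,
                   "近く": 2, "に": 1, "新": 5, "宿": 5, "近": 5, "く": 5}
    print(segment("JR新宿駅の近くに", toy_lexicon))
```

With these toy costs the path through 新宿 and 駅 is cheaper than the path through 新 and 宿駅, so the misleading term 宿駅 does not appear in the output.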
  • 54. How does this actually work?
  • 55. Demo
  • 56. Japanese support in Lucene and Solr
  • 62. Japanese in Lucene/Solr
      • New feature in Lucene/Solr 3.6
      • Available out-of-the-box
      • Easy to use with reasonable defaults
      • Provides sophisticated Japanese linguistics
      • Customisable
  • 65. How do we use it?
      • Use JapaneseAnalyzer
      • Use field type “text_ja” in example schema.xml
  • 66. Demo
  • 73. Feature summary / text_ja analyzer chain
      • JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy
          • Token attributes for part-of-speech, base form, readings, etc.
          • Compound segmentation with compound synonyms
          • Segmentation is customisable using user dictionaries
      • JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
      • JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags
          • See example/solr/conf/lang/stoptags_ja.txt
      • CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
      • StopFilter: Stop-words removal
          • See example/solr/conf/lang/stopwords_ja.txt
      • JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
      • LowerCaseFilter: Lowercases
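The later stages of the chain above can be approximated in plain Python to show what each one does to a token. This is a rough sketch, not the Lucene implementation: it uses full Unicode NFKC where CJKWidthFilter applies only a subset, it assumes the katakana stemming rule is "drop a trailing prolonged sound mark from katakana tokens of at least four characters", and the function names and the stop-word set are invented for the example. Tokenisation, base-form reduction and part-of-speech filtering are left out.

```python
import unicodedata

PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(token: str) -> bool:
    """True if every character lies in the Unicode Katakana block."""
    return bool(token) and all("\u30a0" <= ch <= "\u30ff" for ch in token)


def katakana_stem(token: str, min_length: int = 4) -> str:
    """Assumed rule: drop a trailing ー from sufficiently long katakana tokens."""
    if len(token) >= min_length and is_katakana(token) and token.endswith(PROLONGED_SOUND_MARK):
        return token[:-1]
    return token


def normalise(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Width normalisation, stop-word removal, katakana stemming, lowercasing."""
    result = []
    for token in tokens:
        token = unicodedata.normalize("NFKC", token)  # full-width ＪＲ -> JR, half-width ｻ -> サ
        if token in stopwords:
            continue
        token = katakana_stem(token)
        result.append(token.lower())
    return result


if __name__ == "__main__":
    # Full-width Latin, a particle, half-width katakana and a short katakana word.
    print(normalise(["ＪＲ", "の", "ｻｰﾊﾞｰ", "コピー"], stopwords={"の"}))
```

The order of the steps follows the order of the filters on the slide, so that stop words and katakana variants are compared after their character widths have been unified.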
  • 78. Compound nouns
      • How do we deal with compound nouns?
      • Examples (Japanese, English):
          • 関西国際空港, Kansai International Airport
          • シニアソフトウェアエンジニア, Senior Software Engineer
      • These are one word in Japanese, so searching for 空港 (airport) doesn’t match
      • We need to segment the compounds, too
  • 82. Compound segmentation
      • 関西国際空港 (Kansai International Airport) → 関西 (Kansai), 国際 (International), 空港 (Airport)
      • シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior), ソフトウェア (Software), エンジニア (Engineer)
      • We are using a heuristic to implement this
  • 83. Compound synonym tokens
      • Token positions: 関西 (position 1), 国際 (position 2), 空港 (position 3), with the compound 関西国際空港 kept alongside them
      • Segment the compounds into their parts
          • Good for recall: we can also search and match 空港 (airport)
      • We keep the compound itself as a synonym
          • Good for precision with an exact hit because of IDF
      • The approach benefits both precision and recall for overall good ranking
      • JapaneseTokenizer actually returns a graph of tokens
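The position layout on this slide can be sketched in plain Java. This is an illustration only, not the JapaneseTokenizer implementation: the class `CompoundSynonymSketch`, its `Token` type, and the choice to start the compound at position 1 with a span covering all parts are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of compound synonym tokens; not Lucene code. */
final class CompoundSynonymSketch {

    static final class Token {
        final String term;
        final int position;       // 1-based start position, as on the slide
        final int positionLength; // how many positions the token spans

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** Emits the whole compound plus each part at consecutive positions. */
    static List<Token> expand(String compound, List<String> parts) {
        List<Token> tokens = new ArrayList<>();
        // Assumption: the compound starts with the first part and spans every part.
        tokens.add(new Token(compound, 1, parts.size()));
        for (int i = 0; i < parts.size(); i++) {
            tokens.add(new Token(parts.get(i), i + 1, 1));
        }
        return tokens;
    }

    /** True if any token carries exactly the query term. */
    static boolean matches(List<Token> tokens, String queryTerm) {
        for (Token token : tokens) {
            if (token.term.equals(queryTerm)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = expand("関西国際空港", Arrays.asList("関西", "国際", "空港"));
        for (Token token : tokens) {
            System.out.println(token.term + " position=" + token.position
                    + " length=" + token.positionLength);
        }
        System.out.println(matches(tokens, "空港"));        // part match helps recall
        System.out.println(matches(tokens, "関西国際空港")); // exact compound match helps precision
    }
}
```

Both lookups succeed because the parts and the compound are indexed side by side, which is the recall and precision trade-off the slide describes.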
  • 88. Character width normalisation
      • How do we deal with character widths?
      • Half-width (半角): Lucene, ｶﾀｶﾅ, 123
      • Full-width (全角): Ｌｕｃｅｎｅ, カタカナ, １２３
      • Use CJKWidthFilter to normalise them (Unicode NFKC subset)
      • Example: input text Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３ → CJKWidthFilter → Lucene (half-width) カタカナ (full-width) 123 (half-width)
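The effect of width folding can be approximated with the JDK's own Unicode normaliser. This sketch is a stand-in, not CJKWidthFilter itself: full NFKC also rewrites other compatibility characters, whereas the slide describes the filter as a fast subset. The class name `WidthNormalisationSketch` is hypothetical, and the input uses Unicode escapes so that the full-width and half-width characters survive copy and paste.

```java
import java.text.Normalizer;

/** Hypothetical stand-in for width folding using full NFKC from the JDK. */
final class WidthNormalisationSketch {

    /** Folds full-width Latin and digits to ASCII and half-width katakana to full-width. */
    static String normaliseWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width "Lucene", half-width katakana "katakana", full-width "123".
        String input = "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45 "
                + "\uFF76\uFF80\uFF76\uFF85 "
                + "\uFF11\uFF12\uFF13";
        // Prints ASCII "Lucene", full-width katakana, ASCII "123".
        System.out.println(normaliseWidth(input));
    }
}
```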
  • 90. Katakana end-vowel stemming
      • A common spelling variation in katakana is an end long-vowel sound
      • English: manager. Japanese spelling variations: マネージャー, マネージャ, マネジャー
      • We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms
      • Example (input text → JapaneseKatakanaStemFilter output):
          • コピー (copy) → コピー
          • マネージャー (manager) → マネージャ
          • マネージャ (manager) → マネージャ
          • マネジャー (“manager”) → マネジャ
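The stemming rule shown in the example can be sketched as follows. This is a minimal reimplementation of the idea for illustration, not the filter's source: the class name `KatakanaStemSketch` and the minimum length of 4 characters are assumptions chosen so that the short term コピー is left untouched, as on the slide.

```java
/** Hypothetical sketch of katakana end-vowel stemming; not Lucene code. */
final class KatakanaStemSketch {

    private static final char PROLONGED_SOUND_MARK = '\u30FC'; // the long-vowel mark
    private static final int MINIMUM_LENGTH = 4;               // assumed threshold for "long terms"

    /** Drops one trailing long-vowel mark from katakana terms of at least MINIMUM_LENGTH chars. */
    static String stem(String term) {
        if (term.length() >= MINIMUM_LENGTH
                && term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK
                && isKatakana(term)) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** True if every char is in the Unicode Katakana block (which includes the long-vowel mark). */
    private static boolean isKatakana(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] terms = {"コピー", "マネージャー", "マネージャ", "マネジャー"};
        for (String term : terms) {
            System.out.println(term + " -> " + stem(term));
        }
    }
}
```

Note that マネージャ and マネジャ still differ after stemming, which matches the slide: only the trailing long vowel is normalised, not the one in the middle of the word.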
  • 94. Lemmatisation
      • Japanese adjectives and verbs are highly inflected. How do we deal with that?
      • Dictionary form: 買う (kau, to buy)
      • Inflected forms (not exhaustive): 買いなさい, 買いませんでしたら, 買える, 買わせられる, 買いなさるな, 買いましたら, 買いませんでしたり, 買いませんなら, 買おう, 買った, 買わせる, 買わない, 買いましたり, 買うだろう, 買ったら, 買わないだろう, 買いまして, 買いましょう, 買うでしょう, 買うな, 買ったり, 買って, 買わないで, 買わないでしょう, 買わせない, 買います, 買うまい, 買わなかった, 買いますまい, 買え, 買わせます, 買わなかったら, 買いませば, 買えない, 買わせません, 買わなかったり, 買いません, 買えば, 買わせられない, 買わなければ, 買いませんで, 買えます, 買わせられます, 買われない, 買いませんでした, 買えません, 買わせられません, 買われます
      • Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
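Lemmatisation by reduction can be pictured as replacing each token's surface form with the base form that the tokenizer already looked up in its dictionary. The sketch below assumes a hypothetical `Token` type carrying both values. The segmentation of 買った into 買っ and た is written by hand for illustration and is not output captured from Kuromoji.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of base-form replacement; not Lucene code. */
final class BaseFormSketch {

    static final class Token {
        final String surface;  // the text as it appeared in the input
        final String baseForm; // dictionary form, or null if none is recorded

        Token(String surface, String baseForm) {
            this.surface = surface;
            this.baseForm = baseForm;
        }
    }

    /** Returns the terms to index: the base form where one exists, else the surface form. */
    static List<String> toBaseForms(List<Token> tokens) {
        List<String> terms = new ArrayList<>();
        for (Token token : tokens) {
            terms.add(token.baseForm != null ? token.baseForm : token.surface);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hand-written tokens for the inflected form "買った" (bought).
        List<Token> tokens = Arrays.asList(
                new Token("買っ", "買う"),
                new Token("た", null));
        System.out.println(toBaseForms(tokens));
    }
}
```

No stemming rules are involved in this approach: the filter only swaps in a value that came from the dictionary, which is why the slide calls it lemmatisation by reduction.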
  • 95. User dictionaries
      • Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
      • File format is simple and there’s no need to assign weights, etc. before using them
      • Example custom dictionary:
          # Custom segmentation and POS entry for long entries
          関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
          # Custom reading and POS for former sumo wrestler Asashoryu
          朝青龍,朝青龍,アサショウリュウ,カスタム人名
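The four comma-separated fields in the example (surface form, space-separated segmentation, space-separated readings, part-of-speech) can be read with a few lines of Java. This is a simplified sketch, not Lucene's user dictionary loader: the class `UserDictionarySketch`, its `Entry` type, and the rule that segments and readings must have equal counts are assumptions, and quoted or escaped commas are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser for the user dictionary lines shown on the slide. */
final class UserDictionarySketch {

    static final class Entry {
        final String surface;
        final List<String> segments;
        final List<String> readings;
        final String partOfSpeech;

        Entry(String surface, List<String> segments, List<String> readings, String partOfSpeech) {
            this.surface = surface;
            this.segments = segments;
            this.readings = readings;
            this.partOfSpeech = partOfSpeech;
        }
    }

    /** Parses dictionary lines, skipping blank lines and '#' comments. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split(",", -1);
            if (fields.length != 4) {
                throw new IllegalArgumentException("Expected 4 fields: " + line);
            }
            List<String> segments = Arrays.asList(fields[1].trim().split("\\s+"));
            List<String> readings = Arrays.asList(fields[2].trim().split("\\s+"));
            // Assumption: one reading per segment.
            if (segments.size() != readings.size()) {
                throw new IllegalArgumentException("Segment/reading count mismatch: " + line);
            }
            entries.add(new Entry(fields[0].trim(), segments, readings, fields[3].trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(Arrays.asList(
                "# Custom segmentation and POS entry for long entries",
                "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞",
                "朝青龍,朝青龍,アサショウリュウ,カスタム人名"));
        for (Entry entry : entries) {
            System.out.println(entry.surface + " -> " + entry.segments + " " + entry.partOfSpeech);
        }
    }
}
```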
  • 96. Japanese focus in 4.0
      • Improvements in JapaneseTokenizer
          • Improved search mode for katakana compounds
          • Improved unknown word segmentation
          • Some performance improvements
      • CharFilters for various character normalisations
          • Dates and numbers
          • Repetition marks (odoriji)
      • Japanese spell-checker
          • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
  • 97. Acknowledgements
      • Robert Muir: thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
      • Michael McCandless: thanks for streaming Viterbi and synonym compounds!
      • Uwe Schindler: thanks for performance improvements and being the policeman
      • Simon Willnauer: thanks for doing the Kuromoji code donation process so well
      • Gaute Lambertsen & Gerry Hocks: thanks for presentation feedback and being great colleagues
  • 98. Q&A