Japanese Linguistics in Lucene and Solr

Presented by Christian Moen, Founder and CEO of Atilika Inc. See the conference video at http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

This talk gives an introduction to searching Japanese text and an overview of the new Japanese search features available out-of-the-box in Lucene and Solr.

Atilika developed a new Japanese morphological analyzer (Kuromoji) in 2010 when they couldn't find any easy-to-use, high-quality morphological analyzer in Java that was good for both search and other Japanese NLP tasks. Kuromoji was built with the goal of donating it to the Apache Software Foundation in order to make Japanese work well for both Lucene and Solr, and is now a standard part of these software packages.

Published in: Technology, Education

Japanese Linguistics in Lucene and Solr

  1. 1. Japanese linguistics in Apache Lucene™ and Apache Solr™. May 9th, 2012. Christian Moen, christian@atilika.com
  2. 2. About me
     • MSc. in computer science, University of Oslo, Norway
     • Worked with search at FAST (now Microsoft) for 10 years
         • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
         • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
     • Founded Atilika Inc. (アティリカ株式会社) in 2009
         • We help companies innovate using search technologies and good ideas
         • We know information retrieval, natural language processing and big data
         • We are based in Tokyo, but we have clients everywhere
     • Newbie Lucene & Solr Committer
         • Mostly been working on Japanese language support (Kuromoji) so far
     • Please write to me at christian@atilika.com or cm@apache.org
  4. 4. Today’s topics
     • Japanese 101: ordering beer and toasting
     • Japanese language processing
     • Japanese features in Lucene/Solr
  7. 7. Japanese 101
  9. 9. ビールください (bi-ru kudasai): A beer, please
  11. 11. ありがとうございます! (arigatō gozaimasu!): Thank you very much!
  13. 13. 乾杯! (kanpai!): Cheers!
  15. 15. JR新宿駅の近くにビールを飲みに行こうか? (JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?): Shall we go for a beer near JR Shinjuku station?
  21. 21. The four scripts in JR新宿駅の近くにビールを飲みに行こうか?
     • Romaji (ローマ字): Latin characters (26+); used for proper nouns, etc. (here: JR)
     • Katakana (カタカナ): phonetic script (~50); typically used for loan words (here: ビール)
     • Kanji (漢字): Chinese characters (50,000+); used for stems & proper nouns (here: 新宿駅, 近, 飲, 行)
     • Hiragana (ひらがな): phonetic script (~50); used for inflections & particles (here: の, に, を, か)
  24. 24. JR新宿駅の近くにビールを飲みに行こうか?
     • What are the words in this sentence?
     • Words are implicit in Japanese: there is no white space that separates them
  26. 26. JR新宿駅の近くにビールを飲みに行こうか?
     • How do we index this for search, then?
     • We need to segment text into tokens first
  27. 27. Two major approaches for segmentation: (1) n-gramming, (2) morphological analysis (statistical approach)
  35. 35. n-gramming (n=2)
     • Input: JR新宿駅の近くにビールを飲みに行こうか? (Shall we go for a beer near JR Shinjuku station?)
     • A two-character window slides over the text one character at a time: JR, R新, 新宿, 宿駅, 駅の, の近, 近く, ...
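The sliding window on the n-gramming slides can be sketched in a few lines of Python. This is only an illustration of the idea as the slides present it, not the way any particular Lucene filter is implemented; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text, n=2):
    """Return every overlapping run of n characters, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか?"

# The first seven bigrams are the ones shown on the slide.
print(char_ngrams(sentence)[:7])
# ['JR', 'R新', '新宿', '宿駅', '駅の', 'の近', '近く']
```

A text of k characters always yields k - 1 bigrams, whether or not each one is a meaningful word.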
  47. 47. Problems with n-gramming
     • Bigrams that are words in the sentence (●) and bigrams that are not (×): JR ●, R新 ×, 新宿 ●, 宿駅 ×, 駅の ×, の近 ×, 近く ●, ...
     • 宿駅 is a change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’
     • Does not preserve meaning well and often changes semantics
         • Impacts ranking and search precision (many false positives)
     • Also generates many terms per document or query
         • Impacts index size and performance
     • Still sometimes appropriate for certain search applications
         • Compliance, e-commerce with special product names, ...
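The false-positive problem can be made concrete with a toy bigram index. The matching rule below (every query bigram must occur somewhere in the document) is an assumption chosen for brevity, and `bigrams` and `matches` are hypothetical helper names, not Lucene APIs.

```python
def bigrams(text):
    """Overlapping two-character terms, as on the n=2 slides."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query, document):
    """Toy AND match: every bigram of the query must occur in the document."""
    index = set(bigrams(document))
    return all(term in index for term in bigrams(query))


document = "JR新宿駅の近くにビールを飲みに行こうか?"

print(len(bigrams(document)))     # index terms produced by one short sentence
print(matches("宿駅", document))  # True: 'post town' hits a text about 新宿駅
print(matches("空港", document))  # False: no shared bigram
```

The query 宿駅 matches only because the characters happen to be adjacent in 新宿駅, which is the change of semantics the slide warns about.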
  53. 53. Morphological analysis
     • Input: JR新宿駅の近くにビールを飲みに行こうか? (Shall we go for a beer near JR Shinjuku station?)
     • Tokens (all 14 marked ●): JR | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ?
     • Tokens reflect what a Japanese speaker considers to be words
     • Machine-learned statistical approach
         • Conditional Random Fields (CRFs) decoded using Viterbi
     • Also does part-of-speech tagging, extracts readings for kanji, etc.
     • Several statistical models available with high accuracy (F > 0.97)
     • Models/dictionaries are available as IPADIC, UniDic, ...
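To give a feel for "decoded using Viterbi", here is a minimal sketch of lowest-cost segmentation by dynamic programming. It is a simplification under stated assumptions: the lexicon and its costs are invented for this example, only per-word costs are used (real analyzers typically also score the transition between adjacent tokens), and unknown text falls back to single characters at a fixed penalty.

```python
import math

# Hypothetical toy lexicon: surface form -> cost (lower means more likely).
LEXICON = {
    "JR": 1.0, "新宿": 1.0, "新": 3.0, "宿": 3.0, "宿駅": 4.0, "駅": 1.0,
    "の": 0.5, "近く": 1.0, "近": 3.0, "く": 3.0, "に": 0.5,
}
UNKNOWN_COST = 10.0                      # penalty for an unknown single character
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Return the token sequence with the lowest total cost."""
    n = len(text)
    best = [math.inf] * (n + 1)          # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)                 # back[i]: start of the last token on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                cost = LEXICON[word]
            elif end - start == 1:
                cost = UNKNOWN_COST      # single-character fallback keeps every position reachable
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    tokens = []
    i = n
    while i > 0:                         # walk the back pointers from the end
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


print(segment("JR新宿駅の近くに"))
# ['JR', '新宿', '駅', 'の', '近く', 'に']
```

With these costs the path 新宿 + 駅 (total 2.0) beats 新 + 宿駅 (total 7.0), so the misleading token 宿駅 never appears in the output.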
  54. 54. How does this actually work?
  55. 55. Demo
  56. 56. Japanese support in Lucene and Solr
  62. 62. Japanese in Lucene/Solr
     • New feature in Lucene/Solr 3.6
     • Available out-of-the-box
     • Easy to use with reasonable defaults
     • Provides sophisticated Japanese linguistics
     • Customisable
  65. 65. How do we use it?
     • Use JapaneseAnalyzer
     • Use field type “text_ja” in example schema.xml
  66. 66. Demo
  73. 73. Feature summary / text_ja analyzer chain
     • JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy
         • Token attributes for part-of-speech, base form, readings, etc.
         • Compound segmentation with compound synonyms
         • Segmentation is customisable using user dictionaries
     • JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
     • JapanesePartOfSpeechStopFilter: Stop-word removal based on part-of-speech tags. See example/solr/conf/lang/stoptags_ja.txt
     • CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
     • StopFilter: Stop-word removal. See example/solr/conf/lang/stopwords_ja.txt
     • JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
     • LowerCaseFilter: Lowercases
  74. 74. Feature details
  78. 78. Compound nouns: how do we deal with compound nouns?
     • 関西国際空港: Kansai International Airport
     • シニアソフトウェアエンジニア: Senior Software Engineer
     • These are one word in Japanese, so searching for 空港 (airport) doesn’t match
     • We need to segment the compounds, too
  82. 82. Compound segmentation
     • 関西国際空港 (Kansai International Airport) → 関西 (Kansai) + 国際 (International) + 空港 (Airport)
     • シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) + ソフトウェア (Software) + エンジニア (Engineer)
     • We are using a heuristic to implement this
  86. 86. Compound synonym tokens
     • Position 1: 関西, position 2: 国際, position 3: 空港; the compound 関西国際空港 is kept alongside its parts
     • Segment the compound into its parts
         • Good for recall: we can also search and match 空港 (airport)
     • We keep the compound itself as a synonym
         • Good for precision with an exact hit because of IDF
     • Approach benefits both precision and recall for overall good ranking
         • JapaneseTokenizer actually returns a graph of tokens
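One way to picture the token graph is to give every token a start position and a span. The sketch below is an assumption about how such a layout could look, not the tokenizer's actual data structure; `Token` and `expand_compound` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position: int          # 1-based position where the token starts
    position_length: int   # number of positions the token spans


def expand_compound(compound, parts, start=1):
    """Emit the compound as a synonym spanning all of its parts, then the parts."""
    if "".join(parts) != compound:
        raise ValueError("parts must concatenate to the compound")
    tokens = [Token(compound, start, len(parts))]
    tokens += [Token(part, start + i, 1) for i, part in enumerate(parts)]
    return tokens


tokens = expand_compound("関西国際空港", ["関西", "国際", "空港"])
for token in tokens:
    print(token)

# Both the part and the whole compound are searchable terms.
terms = {token.term for token in tokens}
print("空港" in terms, "関西国際空港" in terms)
```

A query for 空港 matches through the part (recall), while a query for the full compound matches the rarer compound term, which scores higher under IDF (precision).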
  88. 88. Character width normalisation: how do we deal with character widths?
     • Half-width (半角) vs. full-width (全角): Lucene vs. Ｌｕｃｅｎｅ, ｶﾀｶﾅ vs. カタカナ, 123 vs. １２３
     • Use CJKWidthFilter to normalise them (Unicode NFKC subset)
     • Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
     • After CJKWidthFilter: Lucene (half-width) カタカナ (full-width) 123 (half-width)
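The effect of width normalisation can be reproduced with Python's standard `unicodedata` module. Note the swap: this sketch applies full NFKC, whereas the slide describes CJKWidthFilter as a fast subset of it, so full NFKC also rewrites characters that a width-only filter would leave alone.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin/digits to ASCII and half-width katakana to full-width."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters, half-width katakana, full-width digits.
print(normalize_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
# Lucene カタカナ 123
```

Running the same normalisation at index time and at query time makes the two spellings of each word collapse to a single term.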
  90. 90. Katakana end-vowel stemming
     • A common spelling variation in katakana is a long-vowel mark (ー) at the end of a word
     • English “manager” has the Japanese spelling variations マネージャー, マネージャ and マネジャー
     • We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms
     • Input text: コピー マネージャー マネージャ マネジャー
     • After JapaneseKatakanaStemFilter: コピー (copy) マネージャ (manager) マネージャ (manager) マネジャ (“manager”)
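The stemming rule on the slide can be sketched as follows. The minimum length of 4 is an assumption made so that the short word コピー is left untouched, as in the slide's example; check the filter's documentation for its actual default. `stem_katakana` is a hypothetical name.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term):
    """True if every character lies in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term, minimum_length=4):
    """Drop a trailing long-vowel mark from sufficiently long katakana terms."""
    if (len(term) >= minimum_length
            and is_katakana(term)
            and term.endswith(PROLONGED_SOUND_MARK)):
        return term[:-1]
    return term


for word in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(word, "->", stem_katakana(word))
```

Only the trailing mark is removed, so マネージャ and マネジャ stay distinct terms. This matches the slide, where the last variant is glossed as “manager” in quotes.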
  94. 94. Lemmatisation: Japanese adjectives and verbs are highly inflected, how do we deal with that?
     • Dictionary form: 買う (kau), to buy
     • Inflected forms (not exhaustive): 買いなさい, 買いませんでしたら, 買える, 買わせられる, 買いなさるな, 買いましたら, 買いませんでしたり, 買いませんなら, 買おう, 買った, 買わせる, 買わない, 買いましたり, 買うだろう, 買ったら, 買わないだろう, 買いまして, 買いましょう, 買うでしょう, 買うな, 買ったり, 買って, 買わないで, 買わないでしょう, 買わせない, 買います, 買うまい, 買わなかった, 買いますまい, 買え, 買わせます, 買わなかったら, 買いませば, 買えない, 買わせません, 買わなかったり, 買いません, 買えば, 買わせられない, 買わなければ, 買いませんで, 買えます, 買わせられます, 買われない, 買いませんでした, 買えません, 買わせられません, 買われます
     • Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
  95. 95. User dictionaries
     • Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
     • File format is simple and there’s no need to assign weights, etc. before using them
     • Example custom dictionary:
         # Custom segmentation and POS entry for long entries
         関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
         # Custom reading and POS for former sumo wrestler Asashoryu
         朝青龍,朝青龍,アサショウリュウ,カスタム人名
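The four comma-separated columns in the example (surface form, space-separated segmentation, space-separated readings, part-of-speech label) are easy to read programmatically. The parser below is a sketch inferred from the two sample lines only; the names `UserEntry` and `parse_user_dictionary` are hypothetical and it does not handle quoting or escaped commas.

```python
from typing import NamedTuple


class UserEntry(NamedTuple):
    surface: str
    segments: list
    readings: list
    pos: str


def parse_user_dictionary(text):
    """Parse 'surface,segmentation,readings,pos' lines; skip blanks and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        surface, segmentation, readings, pos = line.split(",")
        entry = UserEntry(surface, segmentation.split(), readings.split(), pos)
        if len(entry.segments) != len(entry.readings):
            raise ValueError("one reading is expected per segment: " + line)
        entries.append(entry)
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

for entry in parse_user_dictionary(SAMPLE):
    print(entry.surface, "->", entry.segments, entry.readings, entry.pos)
```

The first entry forces a three-way split of the airport name, while the second keeps the wrestler's name as a single token and only supplies its reading and part-of-speech label.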
  96. 96. Japanese focus in 4.0
     • Improvements in JapaneseTokenizer
         • Improved search mode for katakana compounds
         • Improved unknown word segmentation
         • Some performance improvements
     • CharFilters for various character normalisations
         • Dates and numbers
         • Repetition marks (odoriji)
     • Japanese spell-checker
         • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
  97. 97. Acknowledgements
     • Robert Muir: Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
     • Michael McCandless: Thanks for streaming Viterbi and synonym compounds!
     • Uwe Schindler: Thanks for performance improvements + being the policeman
     • Simon Willnauer: Thanks for doing the Kuromoji code donation process so well
     • Gaute Lambertsen & Gerry Hocks: Thanks for presentation feedback and being great colleagues
  98. 98. Q&A
  99. 99. ありがとうございました! (arigatō gozaimashita!): Thank you very much!
