
Introduction to Japanese tokenizers


In this talk, Wanasit will share what he learned about Japanese NLP after trying to build a Japanese tokenizer from scratch.

Doing Natural Language Processing (NLP) or text processing for Japanese has many challenges. One of the most basic and obvious problems is tokenization (i.e., splitting text into a list of words).

Unlike English, where words are typically separated by spaces, Japanese text (e.g. 日本語の自然言語処理を行うには…) offers no such rule of thumb for splitting. This requires tokenizers and NLP tools to be a lot more sophisticated.


1. Introduction to Japanese tokenizers. WebHack 2020-11-10, by Wanasit T.
2. About Me ● GitHub: @wanasit ○ Text / NLP projects ● Manager, Software Engineer @ Indeed ○ Search Quality (Metadata) team ○ Work on NLP problems for Jobs / Resumes
3. Disclaimer 1. This talk is NOT related to any of Indeed’s technology 2. I’m not Japanese (or a native speaker) ○ But I built a Japanese tokenizer in my free time
4. Today’s Topics ● NLP and Tokenization (for Japanese) ● Lattice-based Tokenizers (MeCab-style tokenizers) ● How it works ○ Dictionary ○ Tokenization
5. NLP and Tokenization
6. NLP and Tokenization ● How does a computer represent text? ● String (or Char[] or Byte[]) ■ "Abc" ■ "Hello World"
7. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally" Source: NBC News
8. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally" ● What’s the topic? ● Who is winning? Where? Source: NBC News
10. NLP and Tokenization ● Tokenization / Segmentation ● The first step in solving NLP problems is usually identifying the words in the string ○ Input: string, char[] (or byte[]) ○ Output: a list of meaningful words (or tokens)
11. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally".split(/\W+/) > ["Biden", "is", "projected", "winner", "in", ...]
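In Python, the same naive split looks like this (a sketch of the whitespace/punctuation approach that works for English but, as the next slides show, not for Japanese):

```python
import re

headline = ("Biden is projected winner in Michigan, "
            "Wisconsin as tense nation watch final tally")

# Split on any run of non-word characters (spaces, commas, ...)
print(re.split(r"\W+", headline))
# ['Biden', 'is', 'projected', 'winner', 'in', 'Michigan', 'Wisconsin', ...]
```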
12. Japanese Tokenization "バイデン氏がミシガン州勝利、大統領にむけ“王手”" Source: TBS News
14. Japanese Tokenization "バイデン氏がミシガン州勝利、大統領にむけ“王手”" ● No spaces separating the words ● Q: How do you split this into words? Source: TBS News
15. Japanese Tokenization ● Use prior Japanese knowledge (a dictionary) ○ が, に, …, 氏, 州, …, バイデン ● Consider the context and combination of characters ● Consider the likelihood ○ e.g. 東京都 => [東京, 都], or [東, 京都]
16. Lattice-based Tokenizers
17. Lattice-based Tokenizers ● a.k.a. MeCab-based tokenizers (or Viterbi tokenizers) ● How: ○ From a Dictionary (required) ○ Build a Lattice (or a graph) from surface dictionary terms ○ Run the Viterbi algorithm to find the best connected path
18. Lattice-Based Tokenizers ● Most tokenizers are re-implementations of MeCab (C/C++) on different platforms: ○ Kuromoji, Sudachi (Java), Kotori (Kotlin) ○ Janome, SudachiPy (Python) ○ Kagome (Go) ○ ...
19. Non-Lattice-Based Tokenizers ● Is lattice-based the only approach? ● Mostly yes, but there are also: ○ Juman++, Nagisa (RNN) ○ SentencePiece (unsupervised, used in BERT) ● Out of scope for this presentation
20. How it works > Dictionary
21. Dictionary ● Lattice-based tokenizers need a dictionary ○ To recognize predefined terms and grammar ● Dictionaries can often be downloaded as plugins, e.g. ○ $ brew install mecab ○ $ brew install mecab-ipadic
22. Dictionary ● The recommended beginner dictionary is MeCab’s IPADIC ● Available from this website
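As a quick usage sketch, assuming MeCab and a dictionary such as IPADIC are installed along with the mecab-python3 binding (the exact output columns depend on the dictionary version):

```python
import MeCab  # pip install mecab-python3 (plus a dictionary package)

tagger = MeCab.Tagger()
print(tagger.parse("東京都に住む"))
# One token per line with part-of-speech features, roughly:
# 東京    名詞,固有名詞,地域,一般,...
# 都      名詞,接尾,地域,...
# に      助詞,格助詞,一般,...
# 住む    動詞,自立,...,基本形,...
# EOS
```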
23. Dictionary - Term Table / Lexicon / CSV files

    Surface Form | Context ID (left) | Context ID (right) | Cost | Type | Form | Spelling | ...
    東京   | 1293 | 1293 | 3003 | 名詞 (noun, place) | -                  | トウキョウ       | ...
    京都   | 1293 | 1293 | 2135 | 名詞 (noun, place) | -                  | キョウト         | ...
    東京塚 | 1293 | 1293 | 8676 | 名詞 (noun, place) | -                  | ヒガシキョウヅカ | ...
    行く   | 992  | 992  | 8852 | 動詞 (verb)        | 基本形 (base)      | イク             | ...
    行か   | 1002 | 1002 | 7754 | 動詞 (verb)        | 未然形 (irrealis)  | イカ             | ...
    いく   | 992  | 992  | 9672 | 動詞 (verb)        | 基本形 (base)      | イク             | ...
24. Dictionary - Term Table ● Surface Form: how the term appears in the string ● Context ID (left/right): IDs used for connecting terms together (see later) ● Cost: how common the term is ○ The higher the cost, the less common or less likely the term
25. Dictionary - Connection Table / Connection Cost

    Context ID (from) | Context ID (to) | Cost
    992  | 992  | 3003
    992  | 993  | 2135
    992  | 1293 | -1000
    992  | 1294 | -1000

    ● The connection cost between types of terms ● The lower the cost, the more likely the connection ● e.g. 992 (v-ru) followed by 992 (v-ru): cost = 3003 (unlikely) ● 992 (v-ru) followed by 1294 (noun): cost = -1000 (likely)
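Putting the two tables together: the cost of a candidate tokenization is the sum of each term's word cost plus the connection costs between adjacent context IDs. A sketch with invented numbers (the 東 and 都 entries below are hypothetical, added to mirror the earlier 東京都 example):

```python
# Hypothetical term table: surface -> (context id, word cost)
TERMS = {
    "東京": (1293, 3003),
    "京都": (1293, 2135),
    "東": (1293, 5000),   # invented for illustration
    "都": (1293, 4000),   # invented for illustration
}
# Hypothetical connection table: (left id, right id) -> cost
CONN = {(1293, 1293): 1000}

def path_cost(tokens):
    total, prev_id = 0, None
    for surface in tokens:
        ctx_id, word_cost = TERMS[surface]
        if prev_id is not None:
            total += CONN.get((prev_id, ctx_id), 0)  # connection cost
        total += word_cost                           # word cost
        prev_id = ctx_id
    return total

print(path_cost(["東京", "都"]))  # 3003 + 1000 + 4000 = 8003
print(path_cost(["東", "京都"]))  # 5000 + 1000 + 2135 = 8135 -> [東京, 都] wins
```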
26. Dictionary - Term Table Term table sizes: ● Kotori (default) ~380,000 terms (3.7 MB) ● MeCab-IPADict ~400,000 terms (12.2 MB) ● Sudachi - Small ~750,000 terms (39.8 MB) ● Sudachi - Full ~2,800,000 terms (121 MB)
27. Dictionary - Term Table Term table sizes: ● Kotori (default) ~380,000 terms (3.7 MB) ● MeCab-IPADict ~400,000 terms (12.2 MB) ● Sudachi - Small ~750,000 terms (39.8 MB) ● Sudachi - Full ~2,800,000 terms (121 MB) ○ Includes terms like: "ヽ(`ー`)ノ"
28. Dictionary - Term Table ● What about words not in the table? ○ e.g. "ワナシット タナキットルンアン" ○ The “Unknown-Term Extraction” problem ○ Typically handled with heuristic rules ■ e.g. if there is a run of consecutive katakana, treat it as a noun ● Out of scope for this presentation
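A minimal sketch of that katakana heuristic (real tokenizers drive unknown-term handling from per-character-class rules in the dictionary; this is only the simplest illustration):

```python
import re

# The katakana Unicode block (U+30A0-U+30FF), which also covers the
# prolonged sound mark "ー"
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")

def unknown_katakana_nouns(text):
    """Treat each run of consecutive katakana as one unknown noun."""
    return KATAKANA_RUN.findall(text)

print(unknown_katakana_nouns("ワナシット タナキットルンアン"))
# ['ワナシット', 'タナキットルンアン']
```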
29. How it works > Tokenization
30. Lattice-Based Tokenization Given: ● The Dictionary ● Input: "東京都に住む" Tokenizer: 1. Find all terms in the input and build a lattice 2. Find the minimum-cost path through the lattice
31. Step 1: Finding all terms
32. Step 1: Finding all terms ● For each index i ○ Find all terms in the dictionary starting at the i-th location ● A string / pattern-matching problem ○ Requires an efficient lookup data structure for the dictionary ○ e.g. Trie, Finite State Transducer
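A minimal trie sketch for this step (production tokenizers use compact structures such as double-array tries or FSTs, but the common-prefix lookup is the same idea):

```python
def build_trie(words):
    """Build a nested-dict trie; '#' marks the end of a dictionary term."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = word
    return root

def terms_starting_at(trie, text, i):
    """Common-prefix search: all dictionary terms matching text[i:]."""
    matches, node = [], trie
    for j in range(i, len(text)):
        node = node.get(text[j])
        if node is None:
            break
        if "#" in node:
            matches.append(node["#"])
    return matches

trie = build_trie(["東", "東京", "京都", "都", "に", "住む"])
print(terms_starting_at(trie, "東京都に住む", 0))  # ['東', '東京']
```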
33. Step 2: Finding the minimum cost ● Viterbi Algorithm (Dynamic Programming) ● For each node, from left to right ○ Find the minimum-cost path leading to that node ○ Reuse the selected path when considering the following nodes
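A sketch of the whole algorithm on a toy dictionary (the costs are invented; for brevity the term lookup scans substrings directly rather than using the trie from step 1, and BOS/EOS connection costs are simplified away):

```python
# Toy tables with invented numbers, mirroring the earlier sketches
TERMS = {
    "東": (1293, 5000), "東京": (1293, 3003), "京都": (1293, 2135),
    "都": (1293, 4000), "に": (100, 500), "住む": (992, 800),
}
CONN = {(1293, 1293): 1000, (1293, 100): -500, (100, 992): -500}

def tokenize(text):
    n = len(text)
    # states[i] maps a right-context id -> (cost, tokens) of the cheapest
    # path covering text[:i] that ends with that context id
    states = [{} for _ in range(n + 1)]
    states[0][None] = (0, [])  # None stands in for a BOS context id
    for i in range(n):
        for left_id, (cost, tokens) in states[i].items():
            for j in range(i + 1, n + 1):  # every term starting at i
                entry = TERMS.get(text[i:j])
                if entry is None:
                    continue
                ctx_id, word_cost = entry
                total = cost + CONN.get((left_id, ctx_id), 0) + word_cost
                if ctx_id not in states[j] or total < states[j][ctx_id][0]:
                    states[j][ctx_id] = (total, tokens + [text[i:j]])
    return min(states[n].values())[1] if states[n] else None

print(tokenize("東京都に住む"))  # ['東京', '都', 'に', '住む']
```

Because the best path to each (position, context) pair is computed once and then reused, the search stays efficient even when the lattice contains many overlapping terms.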
34. Summary
35. Introduction to Japanese Tokenizers ● Introduction to NLP and Tokenization ● Lattice-based tokenizers (MeCab and others) ○ Dictionary ■ Term table, Connection Cost, ... ○ Tokenization Algorithms ■ Pattern Matching, Viterbi Algorithm, ...
36. Learn more: ● Kotori (on GitHub), a Japanese tokenizer written in Kotlin ○ Small and performant (fastest among JVM-based tokenizers) ○ Supports multiple dictionary formats ● Article: How Japanese Tokenizers Work (by Wanasit) ● Article: 日本語形態素解析の裏側を覗く! ("A peek behind Japanese morphological analysis", Cookpad developer blog) ● Book: 自然言語処理の基礎 ("Foundations of Natural Language Processing", by Manabu Okumura)
