Kuromoji FST
2015/06/25
Yoshinari Fujinuma

  1. Kuromoji FST
     2015/06/25, Yoshinari Fujinuma
  2. Overview
     • Motivation
     • Building FST
     • How freezing works
     • How equivalent-state detection works
     • Compiled FST and Virtual Machine
  3. Motivation
     • Efficient key-value store for dictionary lookup during tokenization
     • String -> integers
     • int -> token info
  4. Why FST and not Trie?
     • Finite State Transducer (FST) = Finite State Automaton + Output
     • Able to merge not only prefixes but suffixes too
     • e.g. “can”, “cats”, “dogs”
  5. Overview of how the build works
     [Diagram: list of sorted words and list of integers -> FST Builder -> Object-based FST -> FST Compiler -> Compiled FST]
  6. How Building / Compiling works
     • Two variables are the key
       • previous word (prev)
       • current word (current)
     1. Skip the common prefix between prev and current
     2. Make arcs to the temp states
     3. Freeze (finalize) the states whose suffix differs between prev and current
  7. Toy example
     • cat -> 0
     • cats -> 1
     • catx -> 2
  8. Initializing states
     • Initialization
     [Diagram: frozen states: none; temp states: none]
  9. Freezing states
     • prev word = “”, current word = cat
     [Diagram: frozen states: none; temp states: arcs c/0, a/0, t/0]
  10. Add Arc to suffix
      • prev word = cat, current word = cats
      [Diagram: frozen states: none; temp states: arcs c/0, a/0, t/0, s/1]
  11. Freeze differing suffix
      • prev word = cats, current word = catx
      [Diagram: frozen states and temp states; arcs c/0, a/0, t/0, s/1, x/2]
  12. Freeze differing suffix
      • prev word = cats, current word = catx
      [Diagram: frozen states and temp states; arcs c/0, a/0, t/0, s/1, x/2; HashCode 1]
  13. Freeze differing suffix
      • prev word = cats, current word = catx
      [Diagram: frozen states and temp states; arcs c/0, a/0, t/0, s/1, x/2; HashCode 1]
  14. Merge Equivalent states
      • prev word = catx, current word = “”
      [Diagram: frozen states and temp states; arcs c/0, a/0, t/0, s/1, x/2]
  15. Freezing states
      • prev word = catx, current word = “”
      [Diagram: frozen states: arcs c/0, a/0, t/0, s/1, x/2; temp states: none]
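
The build steps walked through on the slides above (skip the common prefix, add arcs to temp states, freeze the differing suffix, merge equivalent states) can be sketched as follows. This is a minimal illustration, not the code from the slides: the slides refer to Java names such as HashMap and State.hashCode(), while this sketch uses Python for brevity. The names `State`, `signature`, `final_output`, `build`, `freeze_suffix` and `lookup` are hypothetical, and the way leftover output is pushed down shared arcs is an assumption about how outputs are kept consistent. It assumes sorted, distinct, non-empty words with non-negative integer outputs.

```python
class State:
    def __init__(self):
        self.arcs = {}          # char -> [output, destination State]
        self.final = False      # True if some word ends in this state
        self.final_output = 0   # output added when a word ends here

    def signature(self):
        # Two frozen states with equal signatures are interchangeable.
        arcs = tuple((ch, self.arcs[ch][0], id(self.arcs[ch][1]))
                     for ch in sorted(self.arcs))
        return (self.final, self.final_output, arcs)


def build(pairs):
    """pairs: (word, non-negative int) tuples, sorted by word, no duplicates."""
    frozen = {}          # signature -> frozen State
    temp = [State()]     # temp[i] = unfrozen state reached after prev[:i]
    prev = ""

    def freeze_suffix(keep):
        # Freeze temp states deeper than `keep`, deepest first, reusing
        # an equivalent frozen state whenever one already exists.
        for i in range(len(prev), keep, -1):
            state = frozen.setdefault(temp[i].signature(), temp[i])
            temp[i - 1].arcs[prev[i - 1]][1] = state
        del temp[keep + 1:]

    for word, output in pairs:
        # Slide step 1: skip the common prefix between prev and current.
        p = 0
        while p < min(len(prev), len(word)) and prev[p] == word[p]:
            p += 1
        # Slide step 3: freeze the states on prev's differing suffix.
        freeze_suffix(p)
        # Keep on the shared arcs only the output both words agree on and
        # push the remainder one state further down (an assumption here).
        for i in range(p):
            arc = temp[i].arcs[word[i]]
            common = min(arc[0], output)
            extra = arc[0] - common
            arc[0] = common
            for next_arc in temp[i + 1].arcs.values():
                next_arc[0] += extra
            if temp[i + 1].final:
                temp[i + 1].final_output += extra
            output -= common
        # Slide step 2: make arcs to new temp states for current's suffix.
        for i in range(p, len(word)):
            temp.append(State())
            temp[i].arcs[word[i]] = [output if i == p else 0, temp[i + 1]]
        temp[-1].final = True
        prev = word

    freeze_suffix(0)     # same as the final step with current word = ""
    root = frozen.setdefault(temp[0].signature(), temp[0])
    return root, frozen


def lookup(root, word):
    """Return the integer stored for word, or -1 if it is not in the FST."""
    state, total = root, 0
    for ch in word:
        arc = state.arcs.get(ch)
        if arc is None:
            return -1
        total += arc[0]
        state = arc[1]
    return total + state.final_output if state.final else -1


root, frozen = build([("cat", 0), ("cats", 1), ("catx", 2)])
print(lookup(root, "cats"), len(frozen))
```

In the toy example the states reached by s/1 and x/2 have the same signature, so only one of them is kept in the frozen set.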
  16. Equivalent state detection
      • We want to merge equivalent states!
      • Key-value store using HashMap
        • Key: State.hashCode()
        • Value: State object
      • Collisions are resolved by chaining
  17. Arc Equivalence
      • Same transition character
      • Same destination state
      • Same output
      [Diagram: two arcs labeled c/0]
  18. State Equivalence
      • The sets of outgoing arcs are equivalent
      • Both states are of the same type
      [Diagram: two arcs labeled c/0]
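
The equivalence rules above can be sketched as a hash table with chaining. This is an illustration under assumptions, not the code from the slides: `FrozenStates`, `find_or_add`, `hash_code`, `arcs_equivalent` and `states_equivalent` are hypothetical names, a Python dict of lists stands in for the HashMap keyed by State.hashCode(), and "type of state" is modeled as a final / non-final flag.

```python
class State:
    def __init__(self, final=False):
        self.final = final   # assumed meaning of "type of state"
        self.arcs = []       # list of (char, output, destination State)


def arcs_equivalent(a, b):
    # Same transition character, same output, same destination state.
    return a[0] == b[0] and a[1] == b[1] and a[2] is b[2]


def states_equivalent(s, t):
    if s.final != t.final or len(s.arcs) != len(t.arcs):
        return False
    by_char = lambda arc: arc[0]
    return all(arcs_equivalent(a, b)
               for a, b in zip(sorted(s.arcs, key=by_char),
                               sorted(t.arcs, key=by_char)))


def hash_code(state):
    # Equivalent states must hash alike; different states may collide.
    arcs = sorted((ch, out, id(dst)) for ch, out, dst in state.arcs)
    return hash((state.final, tuple(arcs)))


class FrozenStates:
    def __init__(self):
        self.buckets = {}    # hash code -> chain (list) of frozen states

    def find_or_add(self, state):
        chain = self.buckets.setdefault(hash_code(state), [])
        for candidate in chain:          # walk the chain on a collision
            if states_equivalent(candidate, state):
                return candidate         # merge: reuse the existing state
        chain.append(state)
        return state


pool = FrozenStates()
end = pool.find_or_add(State(final=True))
first, second = State(), State()
first.arcs.append(("s", 1, end))
second.arcs.append(("s", 1, end))
print(pool.find_or_add(first) is pool.find_or_add(second))
```

Destination states are compared by identity, which works only if destinations are frozen (and therefore already merged) before the states that point to them.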
  19. How Compiled FST works
      • Generates a “Program”
      • Running a Program = looking up a word in a dictionary
      • The Program runs on a Virtual Machine which we implemented
      [Diagram: a word (e.g. “cat”) and the compiled FST (the “Program”) go into the Virtual Machine, which returns an integer if the word exists in the dictionary, or -1 if not]
  20. Program
      • List of Instructions, 11 bytes each
      • Operation code (Op code)
        • Match or Accept, Match, Fail
      Instruction layout: op code (1 byte), transition char (2 bytes), output (4 bytes), target address (4 bytes)
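
The 11-byte instruction layout above (1 + 2 + 4 + 4 bytes) can be illustrated with Python's `struct` module. Only the field sizes come from the slide; the numeric op code values, the big-endian byte order, the signedness of each field and the names `encode` and `decode` are assumptions made for this sketch.

```python
import struct

# Hypothetical op code values; the slide names the operations only.
FAIL, MATCH, MATCH_OR_ACCEPT = 0, 1, 2

# ">" = big-endian without padding (an assumption):
# B = op code (1 byte), H = transition char (2 bytes),
# i = output (4 bytes), I = target address (4 bytes).
INSTRUCTION = struct.Struct(">BHiI")


def encode(op, char, output, target):
    """Pack one instruction into 11 bytes."""
    return INSTRUCTION.pack(op, ord(char), output, target)


def decode(raw):
    """Unpack 11 bytes into (op, char, output, target)."""
    op, code, output, target = INSTRUCTION.unpack(raw)
    return op, chr(code), output, target


raw = encode(MATCH, "t", 0, 2)
print(len(raw), decode(raw))
```

A 2-byte transition character holds any code point up to 65535, which covers a single UTF-16 code unit.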
  21. Match
      • Transition to a given address
      • Accumulator += output
      addr  op     char  output  target
      0     Fail   None  None    None
      1     M / A  x     2       0
      2     M / A  s     1       0
      3     Fail
      4     Match  t     0       2
      …
  22. Fail
      • Stop running the Program and return -1
      • e.g. “tss”
      addr  op     char  output  target
      0     Fail   None  None    None
      1     M / A  x     2       0
      2     M / A  s     1       0
      3     Fail
      4     Match  t     0       2
      …
  23. Match or Accept
      • If the current character is the final char, stop running the program and return the accumulator
      • Else, behave like Match
  24. Instructions vs. Arcs
      • What instructions represent
      addr  op     char  output  target
      0     Fail   None  None    None
      1     M / A  x     2       0
      2     M / A  s     1       0
      3     Fail
      4     Match  t     0       2
      …
      [Diagram: arcs t/0, s/1, x/2]
  25. Virtual Machine running backwards
      addr  op     char  output  target
      0     Fail   None  None    None
      1     M / A  x     2       0
      2     M / A  s     1       0
      3     Fail
      4     Match  t     0       2
      5     Fail
      6     Match  a     0       4
      7     Fail
      8     Match  c     0       6
      • Because of freezing from suffixes
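
The three operations can be put together in a small interpreter for the toy program. This is a sketch under assumptions rather than the Virtual Machine from the slides: execution is assumed to start at the last instruction and to step to the next lower address when the character does not match, and `run`, `PROGRAM` and the op code constants are hypothetical names. It also differs from the slide listing in one respect: the t instruction is marked Match or Accept here so that “cat” from the toy example is accepted, whereas the listing shows it as Match.

```python
FAIL, MATCH, MATCH_OR_ACCEPT = 0, 1, 2   # hypothetical op code values

# (op, transition char, output, target address), indexed by address.
PROGRAM = [
    (FAIL, None, None, None),        # 0
    (MATCH_OR_ACCEPT, "x", 2, 0),    # 1
    (MATCH_OR_ACCEPT, "s", 1, 0),    # 2
    (FAIL, None, None, None),        # 3
    (MATCH_OR_ACCEPT, "t", 0, 2),    # 4 (Match on the slide; see note above)
    (FAIL, None, None, None),        # 5
    (MATCH, "a", 0, 4),              # 6
    (FAIL, None, None, None),        # 7
    (MATCH, "c", 0, 6),              # 8
]


def run(program, word):
    """Look up word; return its integer, or -1 if it is not accepted."""
    pc = len(program) - 1    # assumed entry point: the last instruction
    accumulator = 0
    pos = 0
    while pos < len(word):
        op, char, output, target = program[pc]
        if op == FAIL:
            return -1
        if char == word[pos]:
            accumulator += output
            if op == MATCH_OR_ACCEPT and pos == len(word) - 1:
                return accumulator
            pc = target      # Match: transition to the given address
            pos += 1
        else:
            pc -= 1          # linear search: try the next arc of this state
    return -1                # input ended without an accepting instruction


for word in ["cat", "cats", "catx", "ca", "tss"]:
    print(word, run(PROGRAM, word))
```

Each state's arcs sit at consecutive addresses that end, going downward, in a Fail instruction, so a failed linear search over a state's arcs always terminates with -1.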
  26. Use of Cache
      • The lookup for the next state is done by linear search
      • The number of outgoing arcs from the start state is large
      • Therefore, we cache those outgoing arcs
  27. Summary
      • FST is theoretically more compact than tries
      • Implemented FST Builder which builds
        • Object-based FST
        • Compiled FST, compact form
      • Uses a Virtual Machine to run the compiled program (= look up a word)
  28. References
      • Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
      • Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787
      • Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1
      • This code is available at https://github.com/atilika/fst
