Kuromoji FST
2015/06/25
Yoshinari Fujinuma
Overview
• Motivation
• Building FST
• How freezing works
• How equivalent-state detection works
• Compiled FST and Virtual Machine
Motivation
• Efficient key-value store for dictionary lookup during tokenization
• String -> integers
• int -> token info
Why FST and not Trie?
• Finite State Transducer (FST) = Finite State Automaton + Output
• Unlike a trie, which merges only prefixes, an FST can merge suffixes too
• e.g. “can”, “cats”, “dogs”
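As a rough illustration of the suffix-sharing point, the sketch below counts states for the three example words, first in a trie and then after merging nodes that have identical sub-structure. It ignores outputs entirely (it treats the dictionary as a plain automaton), and the helper names are hypothetical.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker, not a real state
    return root


def count_trie_states(node):
    """Count trie nodes, skipping the end-of-word markers."""
    return 1 + sum(count_trie_states(child)
                   for ch, child in node.items() if ch != "")


def signature_id(node, ids):
    """Give the same id to every node that accepts the same set of suffixes."""
    sig = tuple(sorted(
        (ch, -1 if ch == "" else signature_id(child, ids))
        for ch, child in node.items()
    ))
    return ids.setdefault(sig, len(ids))


trie = build_trie(["can", "cats", "dogs"])
ids = {}
signature_id(trie, ids)
print(count_trie_states(trie), "trie states,", len(ids), "states after merging")
```

With outputs attached to arcs, two states can only be merged if their outputs also agree, so the saving in a real transducer may be smaller than in this output-free count.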
Overview of how the build works
• List of sorted words, list of integers -> FST Builder -> Object-based FST
• Object-based FST -> FST Compiler -> Compiled FST
How Building / Compiling works
• Two variables are the key:
• previous word (prev)
• current word (current)
1. Skip the common prefix of prev and current
2. Make arcs to temp states for the rest of current
3. Freeze (finalize) the states on the suffix of prev that differs from current
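The three steps can be sketched as a function that only describes what one step would do, without building any states. States are named here by the prefix that reaches them; `plan_step` and its return shape are hypothetical and exist only for illustration.

```python
def common_prefix_len(prev, current):
    """Length of the longest common prefix of the two words."""
    n = 0
    for a, b in zip(prev, current):
        if a != b:
            break
        n += 1
    return n


def plan_step(prev, current):
    """Describe one build step for the pair (prev, current)."""
    p = common_prefix_len(prev, current)           # step 1: skip shared prefix
    new_arcs = list(current[p:])                   # step 2: arcs to new temp states
    # step 3: states of prev below the shared prefix, deepest first
    to_freeze = [prev[:i] for i in range(len(prev), p, -1)]
    return p, new_arcs, to_freeze


for prev, current in [("", "cat"), ("cat", "cats"), ("cats", "catx"), ("catx", "")]:
    print(repr(prev), repr(current), plan_step(prev, current))
```

The last pair uses an empty current word to flush the builder, so every remaining state of the final word is frozen.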
Toy example
• cat -> 0
• cats -> 1
• catx -> 2
Initializing states
• Initialization
• Frozen states: (none)
• Temp states: (no arcs yet)
Freezing states
• prev word = “”, current word = cat
• Frozen states: (none)
• Temp states: c/0 -> a/0 -> t/0
Add Arc to suffix
• prev word = cat, current word = cats
• Frozen states: (none)
• Temp states: c/0 -> a/0 -> t/0 -> s/1
Freeze differing suffix
• prev word = cats, current word = catx
• Arcs: c/0 -> a/0 -> t/0, then s/1 and x/2 leaving the state after t
• The state reached by s/1 is frozen (HashCode 1)
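Merge Equivalent states
• prev word = catx, current word = “”
• Arcs: c/0 -> a/0 -> t/0, then s/1 and x/2 leaving the state after t
• The state reached by x/2 is equivalent to the frozen state reached by s/1, so the two are merged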
Freezing states
• prev word = catx, current word = “”
• Frozen states: c/0 -> a/0 -> t/0, then s/1 and x/2 leaving the state after t
• Temp states: (none)
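The walkthrough above can be reproduced with a small builder. This is a minimal sketch for sorted, duplicate-free, non-empty words with non-negative integer outputs, written in Python for brevity; it is not the code from the repository listed in the references. The names `State`, `freeze`, `build` and `lookup` are hypothetical, and the `final_output` field is an assumption added so that outputs stay correct when a shorter word carries a larger value than a longer one.

```python
class State:
    def __init__(self):
        self.arcs = {}          # char -> [output, target State]
        self.final = False
        self.final_output = 0   # assumption: extra output added on acceptance

    def key(self):
        # Targets are already frozen here, so identity is enough to compare them.
        arcs = tuple(sorted((c, out, id(t)) for c, (out, t) in self.arcs.items()))
        return (self.final, self.final_output, arcs)


def freeze(state, register):
    """Return the registered equivalent of state, registering it if new."""
    return register.setdefault(state.key(), state)


def build(entries):
    """entries: (word, int) pairs sorted by word, no duplicates, no empty word."""
    register, temp, prev = {}, [State()], ""
    for word, value in entries:
        p = 0
        while p < min(len(prev), len(word)) and prev[p] == word[p]:
            p += 1
        # Freeze the part of prev that current does not share, deepest first.
        for i in range(len(prev), p, -1):
            temp[i - 1].arcs[prev[i - 1]][1] = freeze(temp[i], register)
        del temp[p + 1:]
        # Fresh temp states for the new suffix.
        for ch in word[p:]:
            nxt = State()
            temp[-1].arcs[ch] = [0, nxt]
            temp.append(nxt)
        temp[-1].final = True
        # Shared arcs keep only the output both words agree on; push the rest down.
        for j in range(p):
            arc = temp[j].arcs[word[j]]
            common = min(arc[0], value)
            excess = arc[0] - common
            arc[0] = common
            for other in temp[j + 1].arcs.values():
                other[0] += excess
            if temp[j + 1].final:
                temp[j + 1].final_output += excess
            value -= common
        temp[p].arcs[word[p]][0] = value   # remaining output goes on the first new arc
        prev = word
    # Flush: freeze every state of the last word, then the start state.
    for i in range(len(prev), 0, -1):
        temp[i - 1].arcs[prev[i - 1]][1] = freeze(temp[i], register)
    return freeze(temp[0], register), register


def lookup(root, word):
    state, total = root, 0
    for ch in word:
        arc = state.arcs.get(ch)
        if arc is None:
            return -1
        total += arc[0]
        state = arc[1]
    return total + state.final_output if state.final else -1


root, register = build([("cat", 0), ("cats", 1), ("catx", 2)])
print([lookup(root, w) for w in ("cat", "cats", "catx", "ca")])
```

For the toy dictionary, the arcs s/1 and x/2 end up pointing at the same frozen state, which matches the merge step in the walkthrough.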
Equivalent state detection
• We want to merge equivalent states!
• Key-value store using HashMap
• Key: State.hashCode()
• Value: State Object
• Collisions are resolved by chaining
Arc Equivalence
• Two arcs (e.g. two arcs labelled c/0) are equivalent when they have:
• Same transition character
• Same destination state
• Same output
State Equivalence
• Their sets of outgoing arcs are equivalent
• Both states are of the same type
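The equivalence rules and the hash-keyed store can be sketched together. The classes below are hypothetical stand-ins (Python dataclasses instead of the `State` object and `HashMap` named on the slide), destination states are represented by an integer id, and "type of state" is assumed to mean final versus non-final.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    label: str    # transition character
    output: int
    target: int   # id of the (already frozen) destination state


@dataclass(frozen=True)
class State:
    arcs: frozenset   # set of outgoing Arc objects
    final: bool       # assumed meaning of "type of state"


class Register:
    """Store of frozen states keyed by hash code, with chaining on collisions."""

    def __init__(self):
        self.buckets = {}   # hash code -> list of states sharing that hash

    def intern(self, state):
        chain = self.buckets.setdefault(hash(state), [])
        for existing in chain:
            if existing == state:     # field-by-field equality from the dataclass
                return existing       # reuse the equivalent frozen state
        chain.append(state)
        return state


reg = Register()
a = State(frozenset({Arc("c", 0, 7)}), False)
b = State(frozenset({Arc("c", 0, 7)}), False)
print(reg.intern(a) is a, reg.intern(b) is a)
```

Comparing the full state after the hash lookup is what makes chaining safe: two different states that happen to share a hash code end up in the same bucket but are never merged.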
How Compiled FST works
• Generates a “Program”
• Running a Program = looking up a word in a dictionary
• The Program runs on a Virtual Machine which we implemented
• Input: the compiled FST (= “Program”) and a word, e.g. “cat”
• Output: an integer if the word exists in the dictionary, or -1 if not
Program
• List of Instructions, 11 bytes each
• Operation code (Op code): Match or Accept, Match, Fail
• Instruction layout:
• Op code: 1 byte
• Transition char: 2 bytes
• Output: 4 bytes
• Target address: 4 bytes
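The 11-byte layout can be sketched with Python's `struct` module. The byte order, the numeric op code values, and the use of signed 4-byte integers are assumptions made for this illustration; the slides only give the field widths.

```python
import struct

# Assumed layout: big-endian, no padding.
# B = op code (1 byte), H = transition char (2 bytes),
# i = output (4 bytes), i = target address (4 bytes)
INSTRUCTION = struct.Struct(">BHii")

# Hypothetical op code numbering; the slides do not give the actual values.
FAIL, MATCH, MATCH_OR_ACCEPT = 0, 1, 2


def encode(op, char, output, target):
    """Pack one instruction; char must fit in a 2-byte code unit."""
    return INSTRUCTION.pack(op, ord(char), output, target)


def decode(raw):
    """Unpack one 11-byte instruction into (op, char, output, target)."""
    op, code_unit, output, target = INSTRUCTION.unpack(raw)
    return op, chr(code_unit), output, target


raw = encode(MATCH, "t", 0, 2)
print(INSTRUCTION.size, decode(raw))
```

A fixed instruction width means the instruction at address n starts at byte offset n * 11, so a target address can be followed without any index structure.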
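Match
• Transition to the given target address
• Accumulator += output
address | op code | transition char | output | target address
0 | Fail | None | None | None
1 | Match or Accept | x | 2 | 0
2 | Match or Accept | s | 1 | 0
3 | Fail
4 | Match | t | 0 | 2
...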
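Fail
• Stop running the Program and return -1
• e.g. “tss”
address | op code | transition char | output | target address
0 | Fail | None | None | None
1 | Match or Accept | x | 2 | 0
2 | Match or Accept | s | 1 | 0
3 | Fail
4 | Match | t | 0 | 2
...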
Match or Accept
• If the current character is the final character of the word: stop running the program and return the accumulator
• Otherwise: behave like Match
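Instructions vs. Arcs
• What instructions represent: arcs, e.g. instruction 1 = x/2, instruction 2 = s/1, instruction 4 = t/0
address | op code | transition char | output | target address
0 | Fail | None | None | None
1 | Match or Accept | x | 2 | 0
2 | Match or Accept | s | 1 | 0
3 | Fail
4 | Match | t | 0 | 2
...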
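Virtual Machine running backwards
address | op code | transition char | output | target address
0 | Fail | None | None | None
1 | Match or Accept | x | 2 | 0
2 | Match or Accept | s | 1 | 0
3 | Fail
4 | Match | t | 0 | 2
5 | Fail
6 | Match | a | 0 | 4
7 | Fail
8 | Match | c | 0 | 6
• The Virtual Machine starts at the highest address and works toward lower ones, because states are frozen starting from the suffixes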
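A minimal interpreter for the program listed above might look like the following. It is a sketch under several assumptions that the slides do not spell out: execution starts at the last instruction, a mismatch moves to the next lower address until a Fail is reached, and the output of an accepting instruction is added to the accumulator before it is returned. The function name `run` and the tuple encoding are hypothetical.

```python
FAIL, MATCH, MATCH_OR_ACCEPT = "Fail", "Match", "Match or Accept"

# (op code, transition char, output, target address), indexed by address.
PROGRAM = [
    (FAIL, None, None, None),        # 0
    (MATCH_OR_ACCEPT, "x", 2, 0),    # 1
    (MATCH_OR_ACCEPT, "s", 1, 0),    # 2
    (FAIL, None, None, None),        # 3
    (MATCH, "t", 0, 2),              # 4
    (FAIL, None, None, None),        # 5
    (MATCH, "a", 0, 4),              # 6
    (FAIL, None, None, None),        # 7
    (MATCH, "c", 0, 6),              # 8
]


def run(program, word):
    """Look up word; return its integer, or -1 if it is not in the dictionary."""
    pc = len(program) - 1            # assumed entry point: the last instruction
    accumulator = 0
    for i, ch in enumerate(word):
        # Linear search downward through the arcs of the current state.
        while True:
            op, char, output, target = program[pc]
            if op == FAIL:
                return -1
            if char == ch:
                break
            pc -= 1
        accumulator += output
        if op == MATCH_OR_ACCEPT and i == len(word) - 1:
            return accumulator
        pc = target
    return -1                        # input ended without reaching an accept


print(run(PROGRAM, "cats"), run(PROGRAM, "catx"), run(PROGRAM, "tss"))
```

With the program exactly as listed, instruction 4 is a plain Match, so this sketch returns -1 for "cat"; accepting it would presumably require that instruction to be Match or Accept.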
Use of Cache
• The next state is found by linear search over the outgoing arcs
• The number of outgoing arcs from the start state is large
• Therefore, we cache those outgoing arcs
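One way such a cache could work is to scan the start state's instructions once and remember the address of each arc by its transition character. The slides do not say how the cache is actually stored, so the dictionary below, the function name `build_start_cache`, and the three-arc program are all hypothetical.

```python
FAIL, MATCH_OR_ACCEPT = "Fail", "Match or Accept"

# Hypothetical program for the words "a" -> 0, "b" -> 1, "c" -> 2:
# (op code, transition char, output, target address), indexed by address.
PROGRAM = [
    (FAIL, None, None, None),        # 0
    (MATCH_OR_ACCEPT, "c", 2, 0),    # 1
    (MATCH_OR_ACCEPT, "b", 1, 0),    # 2
    (MATCH_OR_ACCEPT, "a", 0, 0),    # 3  (start state's first arc)
]


def build_start_cache(program):
    """Map each transition char of the start state to its instruction address."""
    cache = {}
    pc = len(program) - 1            # assumed entry point: the last instruction
    while program[pc][0] != FAIL:
        cache[program[pc][1]] = pc
        pc -= 1
    return cache


cache = build_start_cache(PROGRAM)
print(cache.get("b"), cache.get("z"))
```

The first character of a lookup then costs one dictionary access instead of a scan over every arc of the start state; later states keep using linear search.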
Summary
• An FST is theoretically more compact than a trie
• Implemented FST Builder which builds
• Object-based FST
• Compiled FST, compact form
• Uses Virtual Machine to run the compiled program (= look up a word)
References
• Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
• Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787
• Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1
• This code is available at https://github.com/atilika/fst
