SlideShare a Scribd company logo
1 of 36
Download to read offline
Intelligent Pinyin IME
                 Peng Wu
Content

Pinyin IM Background
OSS Pinyin IM Survey
NLP-based Pinyin IM vs non-NLP
sunpinyin vs novel-pinyin
libpinyin - joint efforts
libpinyin - rational
libpinyin roadmap
libpinyin goals
1. Pinyin IM Background
Pinyin IM Background
1. On Windows:
a. most Pinyin IME use sophisticated
techniques (such as NLP...) to improve the
correction rate of pinyin to characters
conversion, without user manually choose
every Chinese word.
b. but ABC Pinyin IM (likes ibus-table) is still
there, some users refused to use new input
approach.
2. On Linux:
a. We are improving, but there are still a long
way to go.
2. OSS Pinyin IM Survey
OSS Pinyin IM Survey

1. Maximum Forward Match:
a. Fcitx
b. ibus-pinyin
2. Uni-gram(or word frequencies based.):
a. scim-pinyin
3. N-gram:
a. sunpinyin
b. novel-pinyin
Maximum Forward Match



Match steps:
1. match the first longest word and forward the
cursor,
2. repeat above steps until all pinyin has been
converted.
uni-gram



1. Statistical-based.
2. Match steps:
1. Try to find the sentence with the most frequent
words.
Concrete Example


Example: zhong'guo'ren
P(中国人|zhong'guo'ren)
= P(中国人)
= P(中国)*P(人)
= 0.001 * 0.001
= 1e-6
Concrete Example(Cont.)


P(种果人|zhong'guo'ren)
= P(种果人)
= P(种果)*P(人)
= 0.0001 * 0.001
= 1e-7
< 1e-6 = P(中国人|zhong'guo'ren)
So we will choose 中国人 as a result.
n-gram



1. Use more sophisticated Statistical Math Model.
2. Match steps:
Find the most possible sentence by using
Statistical Language Model.
(n-gram will be explained later.)
3. NLP-based Pinyin IM
     vs non-NLP
A hypothesis pinyin IM

1. Record all Chinese pinyin/sentence pairs in a
2TB database.
Pros:
1. Nearly 100% correction rates.
Cons:
1. No such huge disk storage to store all pinyin/
sentence pairs.
2. No such powerful CPU can do a pinyin-
sentence conversion in 1 second.
NLP-based Pinyin IM
            vs non-NLP

1. non-NLP based:
Logic is simple, but have problems on
performance and correction rate.
2. NLP based:
Use mathematic models, and computing
intensive, yields more correction rates; or yields
similar correction rates with less computation
time and less disk space.
4. sunpinyin vs
 novel-pinyin
sunpinyin overview


* Built on a back-off n-gram language model (3-gram)
* Supports multiple pinyin schemes (Quanpin & Double Pinyin)
* Supports fuzzy-segmentation, fuzzy-syllables, and auto-
correcting.
* Supports 2-gram history cache, or user's customized model
* Ported to various OS/platforms: iBus, XIM, MacOS
* dual-licensed with CDDL+LGPLv2.1
* Applying the acceptance of Debian, available on Ubuntu,
Fedora and Mandriva ...
sunpinyin intro


1. n-gram based algorithms.
2. Mathematic Model describes:
Try to find the maximum possibility sentence
which can pronounces the given sentence.
3. In recent 3 years, sun-pinyin has been re-
factored to easier be understood.
sunpinyin Math Model

To calculate the probability of sentence S=(W1,W2,W3...
Wn)

P(S) = P(W1).P(W2|W1).P(W3|W1,W2).P(W4|W1,W2,W3)...P(Wn|W1,W2,...Wn-1)




In reality, due to the data sparseness, it's impossible to calculate the
probability in this way. A particle method is, to assume the P(Wi|W1,
W2,...Wi-1) only depends on the previous N words, i.e., Wi-N+1,Wi-N+2,...
Wi-1.

Particularly, we have unigram (N=0,context-free grammar), bigram (N=1),
trigram (N=2), and fourgram (N=3). The most commonly used is bigram,
trigram.
Concrete Example


Example: zhong'guo'ren
P(中国人|zhong'guo'ren)
= P(中国人)
= P(中国)*P(人|中国)
= 0.01 * 0.1
= 0.001
Concrete Example(Cont.)


P(种果人|zhong'guo'ren)
= P(种果人)
= P(种果)*P(人|种果)
= 0.01 * 0.01
= 0.0001
< 0.001 = P(中国人|zhong'guo'ren)
So we will choose 中国人 as a result.
novel-pinyin overview

* Based on an interpolation smoothing of Hidden
Markov Model. (bi-gram)
* Complete support for ShuangPin schemes,
fuzzy pinyin and incomplete pinyin, inherited from
scim-pinyin.
* Improved user input self-learning.
* Appears on some distro, SUSE, Mandriva, aur
etc.
novel-pinyin intro


1. HMM-based (Hidden Markov Model)
algorithms.
2. Mathematic Model describes:
Try to find the maximum possibility sentence
which really pronounces the given sentence.
3. Clear interface design from the beginning,
although algorithms are also complicated as
sunpinyin.
novel-pinyin Math Model



    H stands for Hanzi sequences, P stands for pinyin
                        sequences.
P(H|P) stands for the possibility of the corresponding Hanzi
                   when Pinyin is given.
Math Model (Cont.)
Concrete Example



Example: zhong'guo'ren
P(中国人|zhong'guo'ren)
= P(中国人) *P(zhong'guo'ren|中国人)
= P(中国)*P(人|中国)*P(zhong'guo|中国)*P(ren|人)
= 0.01 * 0.1 * 0.7 * 0.5
= 3.5*10^-4
Concrete Example(Cont.)


P(种果人|zhong'guo'ren)
= P(种果人)*P(zhong'guo'ren|种果人)
= P(种果)*P(人|种果)*P(zhong'guo|种果)*P(ren|人)
= 0.01 * 0.01 * 0.8 * 0.5
= 4.0*10^-5
< 3.5*10^-4 = P(中国人|zhong'guo'ren)
So we will choose 中国人 as a result.
5. libpinyin - joint efforts
libpinyin - joint efforts

1. sunpinyin and novel-pinyin has been competing for 2-3
years.
2. when one input method finished some features, users will
request the other one to implement such features. Combines
the core part can save efforts to re-invent the wheels.
3. in the past two projects both have very limited developer
resources, merging the core part will greatly alleviate this
problem.
6. libpinyin - rational
libpinyin - rational


1. Unique library to process pinyin relative tasks for Chinese,
just likes libchewing, libanthy.
2. NLP-based approach which provides higher corrections
with reasonable speed and space cost.
3. Save efforts to avoid re-developing similar functionality
when need to deal with pinyin or other Chinese tasks, just
use libpinyin directly.
7. libpinyin roadmap
libpinyin roadmap

1. Compare features between ibus-pinyin, novel-pinyin
and sunpinyin, try to figure out the super set of features.
2. Discuss various possible implementations of each
component.
3. Pick up best component implementations from ibus-
pinyin, novel-pinyin and sunpinyin, different
implementations can co-exist.
4. Source code is compiling time configurable, by
specifying the needed flags when compiling, this library
can precisely simulates the behaviors of ibus-pinyin,
novel-pinyin or sunpinyin.
5. Merge source codes, and coding more...
8. libpinyin goals
libpinyin goals

1. Code must be easily maintainable.
2. Consistent interface between components, each component
can evolves independently.
3. Project code can make progress, without re-writing the entire
code base for brand new idea.
4. Ability to include new NLP models, and work seamless with
related components.
5. Consistent coding styles, and the most important thing is to
ease code reading.
6. For hard algorithms, better to write in a separate module,
instead to keep two similar algorithms together(eg, back-off and
interpolation), to ease code reading. Everything which makes
code ease to understand is welcome. :)
Q&A
Thanks

More Related Content

What's hot

10 things I learned building Nomad packs
10 things I learned building Nomad packs10 things I learned building Nomad packs
10 things I learned building Nomad packsBram Vogelaar
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time OptimizationKan-Ru Chen
 
Introduction to char device driver
Introduction to char device driverIntroduction to char device driver
Introduction to char device driverVandana Salve
 
Comprendre les scripts shell auto-extractible
Comprendre les scripts shell auto-extractibleComprendre les scripts shell auto-extractible
Comprendre les scripts shell auto-extractibleThierry Gayet
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal BootloaderSatpal Parmar
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 
Linux SD/MMC device driver
Linux SD/MMC device driverLinux SD/MMC device driver
Linux SD/MMC device driver艾鍗科技
 
Interception de signal avec dump de la pile d'appel
Interception de signal avec dump de la pile d'appelInterception de signal avec dump de la pile d'appel
Interception de signal avec dump de la pile d'appelThierry Gayet
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLinaro
 
Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Jian-Hong Pan
 
Power shell の基本操作と処理の自動化 v2_20120514
Power shell の基本操作と処理の自動化 v2_20120514Power shell の基本操作と処理の自動化 v2_20120514
Power shell の基本操作と処理の自動化 v2_20120514junichi anno
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerNikita Popov
 
Applications Android - cours 7 : Ressources et adaptation au matériel
Applications Android - cours 7 : Ressources et adaptation au matérielApplications Android - cours 7 : Ressources et adaptation au matériel
Applications Android - cours 7 : Ressources et adaptation au matérielAhmed-Chawki Chaouche
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixDocker, Inc.
 

What's hot (20)

10 things I learned building Nomad packs
10 things I learned building Nomad packs10 things I learned building Nomad packs
10 things I learned building Nomad packs
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time Optimization
 
Introduction to char device driver
Introduction to char device driverIntroduction to char device driver
Introduction to char device driver
 
Comprendre les scripts shell auto-extractible
Comprendre les scripts shell auto-extractibleComprendre les scripts shell auto-extractible
Comprendre les scripts shell auto-extractible
 
Introduction to Linux
Introduction to LinuxIntroduction to Linux
Introduction to Linux
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
Linux SD/MMC device driver
Linux SD/MMC device driverLinux SD/MMC device driver
Linux SD/MMC device driver
 
Interception de signal avec dump de la pile d'appel
Interception de signal avec dump de la pile d'appelInterception de signal avec dump de la pile d'appel
Interception de signal avec dump de la pile d'appel
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel Awareness
 
A practical guide to buildroot
A practical guide to buildrootA practical guide to buildroot
A practical guide to buildroot
 
Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021
 
Power shell の基本操作と処理の自動化 v2_20120514
Power shell の基本操作と処理の自動化 v2_20120514Power shell の基本操作と処理の自動化 v2_20120514
Power shell の基本操作と処理の自動化 v2_20120514
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizer
 
Applications Android - cours 7 : Ressources et adaptation au matériel
Applications Android - cours 7 : Ressources et adaptation au matérielApplications Android - cours 7 : Ressources et adaptation au matériel
Applications Android - cours 7 : Ressources et adaptation au matériel
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
 
tmux
tmuxtmux
tmux
 
Embedded Android : System Development - Part II (Linux device drivers)
Embedded Android : System Development - Part II (Linux device drivers)Embedded Android : System Development - Part II (Linux device drivers)
Embedded Android : System Development - Part II (Linux device drivers)
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
 

Similar to libpinyin

Python2 unicode-pt1
Python2 unicode-pt1Python2 unicode-pt1
Python2 unicode-pt1abadger1999
 
Python cs.1.12 week 9 10 2020-2021 covid 19 for g10 by eng.osama ghandour
Python cs.1.12 week 9 10  2020-2021 covid 19 for g10 by eng.osama ghandourPython cs.1.12 week 9 10  2020-2021 covid 19 for g10 by eng.osama ghandour
Python cs.1.12 week 9 10 2020-2021 covid 19 for g10 by eng.osama ghandourOsama Ghandour Geris
 
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Yusuke Oda
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmMeetupDataScienceRoma
 
Intellectual technologies
Intellectual technologiesIntellectual technologies
Intellectual technologiesPolad Saruxanov
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on pythonShubham Yadav
 
Sure interview algorithm-1103
Sure interview algorithm-1103Sure interview algorithm-1103
Sure interview algorithm-1103Sure Interview
 
Feng’s classification
Feng’s classificationFeng’s classification
Feng’s classificationNarayan Kandel
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingParrotAI
 
2021 04-04-google nmt
2021 04-04-google nmt2021 04-04-google nmt
2021 04-04-google nmtJAEMINJEONG5
 
Unlocking the Power of Integer Programming
Unlocking the Power of Integer ProgrammingUnlocking the Power of Integer Programming
Unlocking the Power of Integer ProgrammingFlorian Wilhelm
 
P, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-HardP, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-HardAnimesh Chaturvedi
 
SPIRE2015-Improved Practical Compact Dynamic Tries
SPIRE2015-Improved Practical Compact Dynamic TriesSPIRE2015-Improved Practical Compact Dynamic Tries
SPIRE2015-Improved Practical Compact Dynamic TriesAndreas Poyias
 
Python programming for Oceanography
Python programming for OceanographyPython programming for Oceanography
Python programming for OceanographyHafez Ahmad
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Python cs.1.12 week 10 2020 2021 covid 19 for g10 by eng.osama mansour
Python cs.1.12 week 10  2020 2021 covid 19 for g10 by eng.osama mansourPython cs.1.12 week 10  2020 2021 covid 19 for g10 by eng.osama mansour
Python cs.1.12 week 10 2020 2021 covid 19 for g10 by eng.osama mansourOsama Ghandour Geris
 

Similar to libpinyin (20)

NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Python2 unicode-pt1
Python2 unicode-pt1Python2 unicode-pt1
Python2 unicode-pt1
 
Python cs.1.12 week 9 10 2020-2021 covid 19 for g10 by eng.osama ghandour
Python cs.1.12 week 9 10  2020-2021 covid 19 for g10 by eng.osama ghandourPython cs.1.12 week 9 10  2020-2021 covid 19 for g10 by eng.osama ghandour
Python cs.1.12 week 9 10 2020-2021 covid 19 for g10 by eng.osama ghandour
 
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
Learning to Generate Pseudo-code from Source Code using Statistical Machine T...
 
Deep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigmDeep Learning for Machine Translation - A dramatic turn of paradigm
Deep Learning for Machine Translation - A dramatic turn of paradigm
 
AI applications in education, Pascal Zoleko, Flexudy
AI applications in education, Pascal Zoleko, FlexudyAI applications in education, Pascal Zoleko, Flexudy
AI applications in education, Pascal Zoleko, Flexudy
 
Intellectual technologies
Intellectual technologiesIntellectual technologies
Intellectual technologies
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on python
 
Chapter08.pptx
Chapter08.pptxChapter08.pptx
Chapter08.pptx
 
Sure interview algorithm-1103
Sure interview algorithm-1103Sure interview algorithm-1103
Sure interview algorithm-1103
 
Feng’s classification
Feng’s classificationFeng’s classification
Feng’s classification
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
2021 04-04-google nmt
2021 04-04-google nmt2021 04-04-google nmt
2021 04-04-google nmt
 
Unlocking the Power of Integer Programming
Unlocking the Power of Integer ProgrammingUnlocking the Power of Integer Programming
Unlocking the Power of Integer Programming
 
P, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-HardP, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-Hard
 
SPIRE2015-Improved Practical Compact Dynamic Tries
SPIRE2015-Improved Practical Compact Dynamic TriesSPIRE2015-Improved Practical Compact Dynamic Tries
SPIRE2015-Improved Practical Compact Dynamic Tries
 
Python programming for Oceanography
Python programming for OceanographyPython programming for Oceanography
Python programming for Oceanography
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Python cs.1.12 week 10 2020 2021 covid 19 for g10 by eng.osama mansour
Python cs.1.12 week 10  2020 2021 covid 19 for g10 by eng.osama mansourPython cs.1.12 week 10  2020 2021 covid 19 for g10 by eng.osama mansour
Python cs.1.12 week 10 2020 2021 covid 19 for g10 by eng.osama mansour
 
Unit -1 CAP.pptx
Unit -1 CAP.pptxUnit -1 CAP.pptx
Unit -1 CAP.pptx
 

Recently uploaded

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Recently uploaded (20)

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

libpinyin

  • 2. Content Pinyin IM Background OSS Pinyin IM Survey NLP-based Pinyin IM vs non-NLP sunpinyin vs novel-pinyin libpinyin - joint efforts libpinyin - rational libpinyin roadmap libpinyin goals
  • 3. 1. Pinyin IM Background
  • 4. Pinyin IM Background 1. On Windows: a. most Pinyin IME use sophisticated techniques (such as NLP...) to improve the correction rate of pinyin to characters conversion, without user manually choose every Chinese word. b. but ABC Pinyin IM (likes ibus-table) is still there, some users refused to use new input approach. 2. On Linux: a. We are improving, but there are still a long way to go.
  • 5. 2. OSS Pinyin IM Survey
  • 6. OSS Pinyin IM Survey 1. Maximum Forward Match: a. Fcitx b. ibus-pinyin 2. Uni-gram(or word frequencies based.): a. scim-pinyin 3. N-gram: a. sunpinyin b. novel-pinyin
  • 7. Maximum Forward Match Match steps: 1. match the first longest word and forward the cursor, 2. repeat above steps until all pinyin has been converted.
  • 8. uni-gram 1. Statistical-based. 2. Match steps: 1. Try to find the sentence with the most frequent words.
  • 9. Concrete Example Example: zhong'guo'ren P(中国人|zhong'guo'ren) = P(中国人) = P(中国)*P(人) = 0.001 * 0.001 = 1e-6
  • 10. Concrete Example(Cont.) P(种果人|zhong'guo'ren) = P(种果人) = P(种果)*P(人) = 0.0001 * 0.001 = 1e-7 < 1e-6 = P(中国人|zhong'guo'ren) So we will choose 中国人 as a result.
  • 11. n-gram 1. Use more sophisticated Statistical Math Model. 2. Match steps: Find the most possible sentence by using Statistical Language Model. (n-gram will be explained later.)
  • 12. 3. NLP-based Pinyin IM vs non-NLP
  • 13. A hypothesis pinyin IM 1. Record all Chinese pinyin/sentence pairs in a 2TB database. Pros: 1. Nearly 100% correction rates. Cons: 1. No such huge disk storage to store all pinyin/ sentence pairs. 2. No such powerful CPU can do a pinyin- sentence conversion in 1 second.
  • 14. NLP-based Pinyin IM vs non-NLP 1. non-NLP based: Logic is simple, but have problems on performance and correction rate. 2. NLP based: Use mathematic models, and computing intensive, yields more correction rates; or yields similar correction rates with less computation time and less disk space.
  • 15. 4. sunpinyin vs novel-pinyin
  • 16. sunpinyin overview * Built on a back-off n-gram language model (3-gram) * Supports multiple pinyin schemes (Quanpin & Double Pinyin) * Supports fuzzy-segmentation, fuzzy-syllables, and auto- correcting. * Supports 2-gram history cache, or user's customized model * Ported to various OS/platforms: iBus, XIM, MacOS * dual-licensed with CDDL+LGPLv2.1 * Applying the acceptance of Debian, available on Ubuntu, Fedora and Mandriva ...
  • 17. sunpinyin intro 1. n-gram based algorithms. 2. Mathematic Model describes: Try to find the maximum possibility sentence which can pronounces the given sentence. 3. In recent 3 years, sun-pinyin has been re- factored to easier be understood.
  • 18. sunpinyin Math Model To calculate the probability of sentence S=(W1,W2,W3... Wn) P(S) = P(W1).P(W2|W1).P(W3|W1,W2).P(W4|W1,W2,W3)...P(Wn|W1,W2,...Wn-1) In reality, due to the data sparseness, it's impossible to calculate the probability in this way. A particle method is, to assume the P(Wi|W1, W2,...Wi-1) only depends on the previous N words, i.e., Wi-N+1,Wi-N+2,... Wi-1. Particularly, we have unigram (N=0,context-free grammar), bigram (N=1), trigram (N=2), and fourgram (N=3). The most commonly used is bigram, trigram.
  • 19. Concrete Example Example: zhong'guo'ren P(中国人|zhong'guo'ren) = P(中国人) = P(中国)*P(人|中国) = 0.01 * 0.1 = 0.001
  • 20. Concrete Example(Cont.) P(种果人|zhong'guo'ren) = P(种果人) = P(种果)*P(人|种果) = 0.01 * 0.01 = 0.0001 < 0.001 = P(中国人|zhong'guo'ren) So we will choose 中国人 as a result.
  • 21. novel-pinyin overview * Based on an interpolation smoothing of Hidden Markov Model. (bi-gram) * Complete support for ShuangPin schemes, fuzzy pinyin and incomplete pinyin, inherited from scim-pinyin. * Improved user input self-learning. * Appears on some distro, SUSE, Mandriva, aur etc.
  • 22. novel-pinyin intro 1. HMM-based (Hidden Markov Model) algorithms. 2. Mathematic Model describes: Try to find the maximum possibility sentence which really pronounces the given sentence. 3. Clear interface design from the beginning, although algorithms are also complicated as sunpinyin.
  • 23. novel-pinyin Math Model H stands for Hanzi sequences, P stands for pinyin sequences. P(H|P) stands for the possibility of the corresponding Hanzi when Pinyin is given.
  • 25. Concrete Example Example: zhong'guo'ren P(中国人|zhong'guo'ren) = P(中国人) *P(zhong'guo'ren|中国人) = P(中国)*P(人|中国)*P(zhong'guo|中国)*P(ren|人) = 0.01 * 0.1 * 0.7 * 0.5 = 3.5*10^-4
  • 26. Concrete Example(Cont.) P(种果人|zhong'guo'ren) = P(种果人)*P(zhong'guo'ren|种果人) = P(种果)*P(人|种果)*P(zhong'guo|种果)*P(ren|人) = 0.01 * 0.01 * 0.8 * 0.5 = 4.0*10^-5 < 3.5*10^-4 = P(中国人|zhong'guo'ren) So we will choose 中国人 as a result.
  • 27. 5. libpinyin - joint efforts
  • 28. libpinyin - joint efforts 1. sunpinyin and novel-pinyin has been competing for 2-3 years. 2. when one input method finished some features, users will request the other one to implement such features. Combines the core part can save efforts to re-invent the wheels. 3. in the past two projects both have very limited developer resources, merging the core part will greatly alleviate this problem.
  • 29. 6. libpinyin - rational
  • 30. libpinyin - rational 1. Unique library to process pinyin relative tasks for Chinese, just likes libchewing, libanthy. 2. NLP-based approach which provides higher corrections with reasonable speed and space cost. 3. Save efforts to avoid re-developing similar functionality when need to deal with pinyin or other Chinese tasks, just use libpinyin directly.
  • 32. libpinyin roadmap 1. Compare features between ibus-pinyin, novel-pinyin and sunpinyin, try to figure out the super set of features. 2. Discuss various possible implementations of each component. 3. Pick up best component implementations from ibus- pinyin, novel-pinyin and sunpinyin, different implementations can co-exist. 4. Source code is compiling time configurable, by specifying the needed flags when compiling, this library can precisely simulates the behaviors of ibus-pinyin, novel-pinyin or sunpinyin. 5. Merge source codes, and coding more...
  • 34. libpinyin goals 1. Code must be easily maintainable. 2. Consistent interface between components, each component can evolves independently. 3. Project code can make progress, without re-writing the entire code base for brand new idea. 4. Ability to include new NLP models, and work seamless with related components. 5. Consistent coding styles, and the most important thing is to ease code reading. 6. For hard algorithms, better to write in a separate module, instead to keep two similar algorithms together(eg, back-off and interpolation), to ease code reading. Everything which makes code ease to understand is welcome. :)
  • 35. Q&A