ELUTE
Essential Libraries and Utilities of
Text Engineering
Tian-Jian Jiang
Why bother?
Since we already have...
(lib)TaBE
• Traditional Chinese Word Segmentation
• with Big5 encoding
• Traditional Chinese Syllable-to-Word Conversion
• with Big5 encoding
• for bo-po-mo-fo transcription system
1999 - 2001?
How about...
libchewing
• Now hacking for
• UTF-8 encoding
• Pinyin transcription system
• and looking for
• an alternative algorithm
• a better dictionary
We got a problem
•23:43 < s*****> 到底 (3023) + 螫舌 (0) 麼 (2536) 東西 (6024) = 11583
•23:43 < s*****> 到底是 (829) + 什麼東西 (337) = 1166
•23:43 < s*****> 到底螫舌麼東西 beats 到底這什麼東西 by a landslide
•00:02 < s*****> k***: 「什麼」 gets crowded out by 「什麼東西」
•00:02 < s*****> k***: so 20445 ends up getting killed off by a mere 337 :P
Word Segmentation
Review
Heuristic Rules*
• Maximum matching -- Simple vs. Complex: 下雨天真正討厭
• 下雨 天真 正 討厭 vs. 下雨天 真正 討厭
• Maximum average word length
• 國際化
• Minimum variance of word lengths
• 研究 生命 起源
• Maximum degree of morphemic freedom of single-character word
• 主要 是 因為
* Refer to MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/
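The simple form of maximum matching can be sketched in a few lines. This is only an illustration: the function name, the `max_word_len` parameter, and the toy dictionary are assumptions made for this example and are not taken from MMSEG or ELUTE.

```python
def forward_maximum_matching(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                pos += length
                break
    return words


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"下雨", "下雨天", "天真", "真正", "討厭"}

print(" ".join(forward_maximum_matching("下雨天真正討厭", TOY_DICTIONARY)))
```

With this toy dictionary the greedy pass yields 下雨天 真正 討厭. If 下雨天 is removed from the dictionary, the same pass yields 下雨 天真 正 討厭, which shows how sensitive the heuristic is to dictionary coverage.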
Graphical Models
• Markov chain family
• Statistical Language Model (SLM)
• Hidden Markov Model (HMM)
• Exponential models
• Maximum Entropy (ME)
• Conditional Random Fields (CRF)
• Applications
• Probabilistic Context-Free Grammar (PCFG) Parser
• Head-driven Phrase Structure Grammar (HPSG) Parser
• Link Grammar Parser
What is a language model?
A probability distribution
over surface patterns of texts.
The Italian Who Went to Malta
•One day ima gonna Malta to bigga hotel.
•Ina morning I go down to eat breakfast.
•I tella waitress I wanna two pissis toasts.
•She brings me only one piss.
•I tella her I want two piss. She say go to the toilet.
•I say, you no understand, I wanna piss onna my plate.
•She say you better no piss onna plate, you sonna ma bitch.
•I don’t even know the lady and she call me sonna ma bitch!
For that Malta waitress,
P(“I want to piss”) > P(“I want two pieces”)
Do the Math
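• Conditional probability:
• P(A|B) = P(A, B) / P(B)
• Bayes’ theorem:
• P(A|B) = P(B|A) P(A) / P(B)
• Information theory:
• Noisy channel model
• Î = argmax_i P(i|o) = argmax_i P(o|i) P(i)
• Language model: P(i)
I → [Noisy channel p(o|i)] → O → [Decoder] → Î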
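A minimal sketch of noisy-channel decoding, reusing the waitress example. All probabilities below are made-up numbers chosen for illustration, and the names `LANGUAGE_MODEL`, `CHANNEL_MODEL`, and `decode` are assumptions of this sketch, not part of ELUTE.

```python
import math

# Hypothetical numbers, for illustration only.
LANGUAGE_MODEL = {"two pieces": 0.2, "to piss": 0.8}   # P(i), the listener's prior
CHANNEL_MODEL = {                                       # P(o | i)
    ("two piss", "two pieces"): 0.6,
    ("two piss", "to piss"): 0.4,
}


def decode(observed):
    """Return argmax_i P(observed | i) * P(i), computed in log space."""
    def score(intended):
        likelihood = CHANNEL_MODEL.get((observed, intended), 0.0)
        prior = LANGUAGE_MODEL[intended]
        if likelihood == 0.0 or prior == 0.0:
            return -math.inf
        return math.log(likelihood) + math.log(prior)

    return max(LANGUAGE_MODEL, key=score)


print(decode("two piss"))
```

Here the channel slightly favors the intended phrase (0.6 vs. 0.4), but the listener's language model outweighs it (0.6 × 0.2 = 0.12 vs. 0.4 × 0.8 = 0.32), so the decoder picks the wrong reading.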
Shannon’s Game
• Predict next word by history
• P(wn|w1…wn-1)
• Maximum Likelihood Estimation
• P(wn|w1…wn-1) = C(w1…wn) / C(w1…wn-1)
• C(w1…wn) : Frequency of n-gram w1…wn
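For the bigram case, maximum likelihood estimation reduces to dividing bigram counts by history counts. The sketch below assumes a whitespace-tokenized toy corpus; the corpus and the function name `bigram_mle` are invented for this example.

```python
from collections import Counter


def bigram_mle(tokens):
    """Return P(w2 | w1) = C(w1 w2) / C(w1) for every observed bigram."""
    history_counts = Counter(tokens[:-1])          # C(w1), counted as a history
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # C(w1 w2)
    return {
        (w1, w2): count / history_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


# Hypothetical toy corpus, for illustration only.
TOY_CORPUS = "I want two pieces I want to eat".split()
probabilities = bigram_mle(TOY_CORPUS)
print(probabilities[("want", "two")])
```

Any bigram that never occurs in the corpus is simply missing from the result, i.e. it gets probability zero, which is the problem that smoothing addresses.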
Once in a Blue Moon
• A cat has seen...
• 10 sparrows
• 4 barn swallows
• 1 Chinese Bulbul
• 1 Pacific Swallow
• How likely is it that the next bird is unseen?
(1+1) / (10 + 4 + 1 + 1)
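The fraction above reads like the Good-Turing estimate of unseen probability mass: the number of species seen exactly once divided by the total number of sightings. That reading is an assumption of this sketch (the slide does not name the method), as are the function and variable names.

```python
from fractions import Fraction


def unseen_probability(counts):
    """Estimate P(next item is a new type) as N1 / N (N1 = types seen exactly once)."""
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return Fraction(singletons, total)


# Sightings from the slide.
SIGHTINGS = {"sparrow": 10, "barn swallow": 4, "Chinese Bulbul": 1, "Pacific Swallow": 1}

print(unseen_probability(SIGHTINGS))
```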
But I’ve seen a moon and I’m blue
• Simple linear interpolation
• PLi(wn|wn-2, wn-1) = λ1P1(wn) + λ2P2(wn|wn-1) + λ3P3(wn|wn-2, wn-1)
• 0 ≤ λi ≤ 1, Σiλi = 1
• Katz’s backing-off
• Back-off through progressively shorter histories.
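• Pbo(wi|wi-(n-1)…wi-1) =
• d · C(wi-(n-1)…wi) / C(wi-(n-1)…wi-1), if C(wi-(n-1)…wi) > k
• α · Pbo(wi|wi-(n-2)…wi-1), otherwise
• (d: discount, α: back-off weight, k: count threshold)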
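Simple linear interpolation is easy to sketch once the three component estimates are available. The default weights below are arbitrary placeholders; in practice they would be tuned on held-out data. The function name and signature are assumptions of this example.

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights that sum to 1."""
    if any(weight < 0 or weight > 1 for weight in lambdas):
        raise ValueError("each lambda must lie in [0, 1]")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# An unseen trigram (probability 0) still gets mass from the shorter histories.
print(interpolate(p_unigram=0.01, p_bigram=0.2, p_trigram=0.0))
```

Unlike back-off, interpolation always consults the lower-order estimates, even when the higher-order n-gram has been seen.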
Good Luck!
• Place a bet remotely on a race among 8 horses by passing encoded messages.
• Past bet distribution
• horse 1: 1/2
• horse 2: 1/4
• horse 3: 1/8
• horse 4: 1/16
• the rest: 1/64 each
Foreversoul: http://flickr.com/photos/foreversouls/
CC: BY-NC-ND
3 bits? No, only 2!
0, 10, 110, 1110, 111100, 111101, 111110, 111111
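The "only 2 bits" claim can be checked by computing the entropy of the bet distribution and the expected length of the codewords listed above. The variable and function names are assumptions of this sketch.

```python
import math

# Past bet distribution from the slide (the last four horses get 1/64 each).
BET_DISTRIBUTION = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
CODEWORDS = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]


def entropy_bits(distribution):
    """Shannon entropy in bits: -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def expected_code_length(distribution, codewords):
    """Average number of bits sent per bet under the given code."""
    return sum(p * len(code) for p, code in zip(distribution, codewords))


print(entropy_bits(BET_DISTRIBUTION))
print(expected_code_length(BET_DISTRIBUTION, CODEWORDS))
print(entropy_bits([1/8] * 8))
```

The variable-length code reaches the entropy of the biased distribution (2 bits on average), whereas eight equally likely horses would need 3 bits each.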
Alright, let’s ELUTE
Bi-gram MLE Flow Chart
[Flow chart: decision nodes and score boxes as recoverable from the slide]
• Permute candidates
• right_gram in LM? No: temp_score = LogProb(Unknown)
• has left_gram? No: temp_score = LogProb(right_gram)
• bi_gram in LM? Yes: temp_score = LogProb(bi_gram)
• left_gram in LM? Yes: temp_score = LogProb(left_gram) + BackOff(right_gram)
• left_gram in LM? No: temp_score = LogProb(Unknown) + BackOff(right_gram)
• have 2 grams? Yes: temp_score += previous_score
• Update scores
INPUT input_syllables; len = Length(input_syllables); Load(language_model);
scores[len + 1]; tracks[len + 1]; words[len + 1];
FOR i = 0 TO len
    scores[i] = 0.0; tracks[i] = -1; words[i] = "";
FOR index = 1 TO len
    best_score = 0.0; best_prefix = -1; best_word = "";
    FOR prefix = index - 1 DOWNTO 0
        right_grams[] = Homophones(Substring(input_syllables, prefix, index - prefix));
        FOR EACH right_gram IN right_grams[]
            IF right_gram IN language_model
                left = tracks[prefix];
                IF left >= 0 AND left != prefix
                    left_grams[] = Homophones(Substring(input_syllables, left, prefix - left));
                    FOR EACH left_gram IN left_grams[]
                        temp_score = 0.0;
                        bigram = left_gram + " " + right_gram;
                        IF bigram IN language_model
                            bigram_score = LogProb(bigram);
                            temp_score += bigram_score;
                        ELSE IF left_gram IN language_model
                            bigram_backoff = LogProb(left_gram) + BackOff(right_gram);
                            temp_score += bigram_backoff;
                        ELSE
                            temp_score += LogProb(Unknown) + BackOff(right_gram);
                        temp_score += scores[prefix];
                        Scoring
                ELSE
                    temp_score = LogProb(right_gram);
                    Scoring
            ELSE
                temp_score = LogProb(Unknown) + scores[prefix];
                Scoring
    scores[index] = best_score; tracks[index] = best_prefix; words[index] = best_word;
    IF tracks[index] == -1
        tracks[index] = index - 1;
boundary = len; output_words = "";
WHILE boundary > 0
    output_words = words[boundary] + output_words;
    boundary = tracks[boundary];
RETURN output_words;

SUBROUTINE Scoring
    IF best_score == 0.0 OR temp_score > best_score
        best_score = temp_score;
        best_prefix = prefix;
        best_word = right_gram;
Bi-gram Syllable-to-Word
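The dynamic-programming search in the pseudocode can be tried out with a small runnable sketch. Everything here is an assumption made for illustration: the Pinyin syllables, the toy `HOMOPHONES`, `LOG_PROB`, and `BACK_OFF` tables, and the scores in them. The sketch also simplifies the pseudocode in two ways: it uses the word already chosen at the left boundary instead of re-enumerating all left homophones, and it uses an explicit unknown-word fallback instead of the `0.0` sentinel. The back-off line mirrors the formula as written on the slide.

```python
# Hypothetical toy model, for illustration only (log10-style scores).
HOMOPHONES = {
    ("shen", "me"): ["什麼"],
    ("dong", "xi"): ["東西"],
    ("shen", "me", "dong", "xi"): ["什麼東西"],
}
LOG_PROB = {"什麼": -2.0, "東西": -2.2, "什麼東西": -3.5, "什麼 東西": -0.5}
BACK_OFF = {"東西": -0.3}
UNKNOWN = -10.0


def syllables_to_words(syllables):
    n = len(syllables)
    scores = [0.0] * (n + 1)  # best score of a path ending at each boundary
    tracks = [-1] * (n + 1)   # start boundary of the last word on that path
    words = [""] * (n + 1)    # last word on that path
    for index in range(1, n + 1):
        # Fallback: treat the single syllable as an unknown word.
        best = (UNKNOWN + scores[index - 1], index - 1, syllables[index - 1])
        for prefix in range(index - 1, -1, -1):
            for right in HOMOPHONES.get(tuple(syllables[prefix:index]), []):
                if right not in LOG_PROB:
                    score = UNKNOWN + scores[prefix]
                elif prefix == 0:
                    score = LOG_PROB[right]  # no left word: unigram score
                else:
                    left = words[prefix]
                    bigram = left + " " + right
                    if bigram in LOG_PROB:
                        score = LOG_PROB[bigram]
                    elif left in LOG_PROB:
                        # Mirrors the slide: LogProb(left_gram) + BackOff(right_gram).
                        score = LOG_PROB[left] + BACK_OFF.get(right, 0.0)
                    else:
                        score = UNKNOWN + BACK_OFF.get(right, 0.0)
                    score += scores[prefix]
                if score > best[0]:
                    best = (score, prefix, right)
        scores[index], tracks[index], words[index] = best
    # Follow the back-pointers from the end of the input.
    output = []
    boundary = n
    while boundary > 0:
        output.append(words[boundary])
        boundary = tracks[boundary]
    return output[::-1]


print(" ".join(syllables_to_words(["shen", "me", "dong", "xi"])))
```

With these made-up scores, the bigram path 什麼 東西 (-2.0 + -0.5 = -2.5) beats the single long word 什麼東西 (-3.5), so 什麼 is not crowded out.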
Show me the…
William’s Requests
And My Suggestions
• Convenient API
• Plain text I/O (in UTF-8)
• More linguistic information
• Algorithm: CRF
• Corpus: we need YOU!
• Flexible to different applications
• Composite, Iterator, and Adapter Patterns
• IDL support
• SWIG
• Open Source
• Open Corpus, too
Thank YOU


Editor's Notes

  1. Maximum matching can also run “backward.” Consider what happens if we diff and merge the forward and backward maximum matching results...
  2. Since we are not native speakers of English, it’s also a problem for us. Oh, we got a problem, again!
  3. Shannon’s noisy channel was modeled for a real-world problem at Bell Labs. It cares not only about the error rate of “decoding” but also about the efficiency of “encoding.” This matches Zipf’s Law naturally. Zipf’s Law, however, is EMPIRICAL, not theoretical.
  4. On average, cross-entropy represents the bit rate of encoding for the noisy channel, and perplexity represents the number of branches (candidates). When the 8 horses are equally likely: 000, 001, 010, 011, 100, 101, 110, 111. When the 8 horses are biased: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
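The "branch number" reading of perplexity in note 4 can be made concrete: perplexity is 2 raised to the entropy in bits, assuming the model matches the true distribution so that cross-entropy equals entropy. The function name is an assumption of this sketch.

```python
import math


def perplexity(distribution):
    """Perplexity = 2 ** entropy (entropy in bits) of a probability distribution."""
    entropy = -sum(p * math.log2(p) for p in distribution if p > 0)
    return 2 ** entropy


UNIFORM_HORSES = [1/8] * 8
BIASED_HORSES = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(perplexity(UNIFORM_HORSES))
print(perplexity(BIASED_HORSES))
```

The uniform race behaves like a choice among 8 candidates, while the biased race behaves like a choice among only 4 equally likely ones.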