SlideShare a Scribd company logo
Query Word Labeling and Back Transliteration for Indian Languages:
MSRI Shared task system description
Spandana Gella1,2 , Jatin Sharma1 , Kalika Bali1
1

Microsoft Research, India

2

University of Melbourne, Australia

December 9, 2013

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
SubTask1: Query Word Labeling

Many Indian languages esp. in social media is written using romanized script

Table: Shared Task description in two separate steps of query labeling and back
transliteration

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Our Methodology

Word level language identification
based on character n-gram features learned from wordlists extracted from
monolingual corpus (King and "Abney, 2013)
Adding context switch probability to indirectly learn the language sequence
patterns
Frequency based filtering

Back-Transliteration
Hash based mapping between source and target languages (Kumar and
Udupa, 2011)
Use indic character mapping to create training data in poor-resource
languages

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Terminology, Datasets and Tools

Character n-gram features: hello :’h’,’e’,..,’o’,’he’,el’..,’hel’..,’hell’,’ello’,’hello’
Training resources: Word lists (from Leipzig Corpus, Anandbazar Patrika),
word frequencies and transliterated pairs given as part of shared task
Training size from 100 - 5000 words (Always <=546 for Gujarati)
(McCallum, 2002) for learning classifiers, MSRI Name Search Tool for
Transliteration

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
100

200

500

1000

2000

3000

5000

1.00
accuracy %

0.80

0.85

0.90

0.95

1.00
0.95
0.90

accuracy %

NaiveBayes
Max−Ent
DTree

0.75

0.75

0.80

0.85

0.90
0.85
0.75

0.80

accuracy %

0.95

1.00

Word label prediction based on n-gram features

100

(a) Hindi−English sampled size

(a) Hindi

200

500

1000

2000

3000

(b) Gujarati−English sampled size

(b) Gujarati

5000

100

200

500 1000 2000 3000 5000

(c) Bangla−English sampled size

(c) Bangla

Figure: Learning curves for maximum entropy, naive Bayes and decision tree on word
labeling for Hindi, Gujarati and Bangla language on development data

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
100

200

500

1000

2000

3000

(a) Hindi - Maxent

5000

s,
−
<
#
*
+

s
,
<
−

−
#
*
+

#
+
*

100

200

500

1000

2000

3000

(b) Gujarati - Maxent

5000

,
s

,
s
<

0.87

<
−
s,
#
*
+

0.86

,
−
<
s
#
+
*

0.85

s

s
,
<

,
−
<
#
s
+
*

accuracy %

0.95
0.90

s

*
+
<
−
#
,
s

#
−
,
<
s
+
*

0.84

,
s

+
−
<
#
*,
s

accuracy %

<

+
#
<
−
*,
s

0.85

0.92
0.90
0.88

+
−
#
*
<
,

+
−
*
#

−
<
#
+
*,
s

0.86

accuracy %

0.94

#
<
−
+
*,

0.88

Adding context-switch probability

s
<
,
−

s
,
<
−

,
s
*
<
#
−
,
s

,
<
−
s

+
*
#

−
<
#
+
*
<
+
#
*
−

+
#
*
+
#
*
−

#
*
+

+

100

200

+ 0.6
* 0.65
# 0.7
− 0.75
< 0.8
, 0.85
s 0.9
None

500 1000 2000 3000 5000

(c) Bangla - Naive

Figure: Learning curves with varying context switch probabilities

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Language Identification Errors

Type

Romanized

Predicted

Reference

Short Words
Ambiguous Words
Erroneous Words
Mixed Numerals Words

i; ve
the; ate
emosal
zara2; duwan2

H; E
E; E
H
E; E

E; H
H; H
E
H; H

Table: Annotation Errors

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Back Transliteration

MSRI Name Search Tool, built based on n-gram based feature hashing
Used indic character mapping between Hindi-Bangla and Hindi-Gujarati
All 3 systems for Gujarati and Bangla uses indic character mapping

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Test set Results

System

MSRI-1
MSRI-2
MSRI-3
Maximum
Median

LA
0.9823
0.9848
0.9826
0.9848
0.9540

Hindi
TF
0.8127
0.8130
0.8101
0.8130
0.4160

TQM
0.1940
0.1980
0.1860
0.1980
0.0290

LA
0.9614
0.9755
0.9661
0.9755
0.9661

Gujarati
TF
TQM
0.4711 0.0800
0.4803 0.0733
0.4748 0.0667
0.4803 0.0800
0.4748 0.0733

LA
0.9259
0.9499
0.9459
0.9499
0.9359

Bangla
TF
0.4914
0.5033
0.5137
0.5137
0.4973

TQM
0.0100
0.0100
0.0100
0.0100
0.0100

Table: Language labeling analysis on submitted runs in all three languages, along with
maximum and median scores. Our runs which had maximum scores are presented in
bold. LA - Labeling Accuracy, TF- Transliteration F-score, TQM - % of queries that
had exact labeling and transliteration

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Transliteration Error Analysis

Table: Transliteration Errors

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Summary

Contributions:
Using context switch probability increases the performance of language
labeling in code-mixed language.
Cross-language character mapping to increase transliteration accuracy promising direction for resource-poor languages

Future Work:
Extending it to text with spelling variations (covering text normalization)
Working on multiple languages esp. poor resource languages by exploiting
resources from related languages

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Questions?

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Bibliography I

King, B. and "Abney, S. (2013). Labeling the languages of words in
mixed-language documents using weakly supervised methods. In
Proceedings of NAACL-HLT, pages 1110–1119.
Kumar, S. and Udupa, R. (2011). Learning hash functions for cross-view
similarity search. In Proceedings of the Twenty-Second international joint
conference on Artificial Intelligence-Volume Volume Two, pages 1360–1365.
AAAI Press.
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit.

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages

More Related Content

What's hot

Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Waqas Tariq
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET Journal
 
Aruna kumar K R resume
Aruna kumar K R  resumeAruna kumar K R  resume
Aruna kumar K R resume
Arun Kasaraguppe
 
Ijetcas14 575
Ijetcas14 575Ijetcas14 575
Ijetcas14 575
Iasir Journals
 
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Waqas Tariq
 
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONIjnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
ijnlc
 
Development of morphological analyzer for hindi
Development of morphological analyzer for hindiDevelopment of morphological analyzer for hindi
Development of morphological analyzer for hindi
Made Artha
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
IJARTES
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
ijnlc
 
Language Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple ApproachLanguage Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple Approach
IJERA Editor
 
HMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDIHMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDI
cscpconf
 
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATIONHANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
ijnlc
 
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiA Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
IJERA Editor
 
An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity Removal
Waqas Tariq
 
Anaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodAnaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer method
ijcsa
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET Journal
 
Pronominal anaphora resolution in
Pronominal anaphora resolution inPronominal anaphora resolution in
Pronominal anaphora resolution in
ijfcstjournal
 
Spell checker for Kannada OCR
Spell checker for Kannada OCRSpell checker for Kannada OCR
Spell checker for Kannada OCR
dbpublications
 
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
kevig
 
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
kevig
 

What's hot (20)

Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
 
Aruna kumar K R resume
Aruna kumar K R  resumeAruna kumar K R  resume
Aruna kumar K R resume
 
Ijetcas14 575
Ijetcas14 575Ijetcas14 575
Ijetcas14 575
 
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
 
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONIjnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
 
Development of morphological analyzer for hindi
Development of morphological analyzer for hindiDevelopment of morphological analyzer for hindi
Development of morphological analyzer for hindi
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
 
Language Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple ApproachLanguage Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple Approach
 
HMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDIHMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDI
 
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATIONHANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
 
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiA Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
 
An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity Removal
 
Anaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodAnaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer method
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
Pronominal anaphora resolution in
Pronominal anaphora resolution inPronominal anaphora resolution in
Pronominal anaphora resolution in
 
Spell checker for Kannada OCR
Spell checker for Kannada OCRSpell checker for Kannada OCR
Spell checker for Kannada OCR
 
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
 
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
 

Similar to Query Word Labeling and Back Transliteration for Indian Languages

Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabi
kevig
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian Languages
Algoscale Technologies Inc.
 
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Nischal Lal Shrestha
 
Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
kevig
 
Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
kevig
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
Shubham Kumar
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
Waqas Tariq
 
Intrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word EmbeddingsIntrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word Embeddings
Jinho Choi
 

Similar to Query Word Labeling and Back Transliteration for Indian Languages (8)

Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabi
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian Languages
 
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
 
Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
 
Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 
Intrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word EmbeddingsIntrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word Embeddings
 

Recently uploaded

Top Google Tools for SEO: Enhance Your Website's Performance
Top Google Tools for SEO: Enhance Your Website's PerformanceTop Google Tools for SEO: Enhance Your Website's Performance
Top Google Tools for SEO: Enhance Your Website's Performance
Elysian Digital Services Pvt. Ltd.
 
UR BHATTI ACADEMY AND ONLINE COURSES.pdf
UR BHATTI ACADEMY AND ONLINE COURSES.pdfUR BHATTI ACADEMY AND ONLINE COURSES.pdf
UR BHATTI ACADEMY AND ONLINE COURSES.pdf
urbhattiacademy
 
ChatGPT 4o for social media step by step Guide.pdf
ChatGPT 4o for social media step by step Guide.pdfChatGPT 4o for social media step by step Guide.pdf
ChatGPT 4o for social media step by step Guide.pdf
almutabbil
 
CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...
CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...
CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...
AJHSSR Journal
 
快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样
快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样
快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样
9u4xjk4w
 
一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理
一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理
一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理
exqfuhe
 
一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理
一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理
一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理
wozek1
 
On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...
On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...
On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...
AJHSSR Journal
 
一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理
一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理
一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理
anubug
 
Using Playlists to Increase YouTube Watch Time
Using Playlists to Increase YouTube Watch TimeUsing Playlists to Increase YouTube Watch Time
Using Playlists to Increase YouTube Watch Time
SocioCosmos
 
SOCIAL MEDIA MARKETING agency and service
SOCIAL MEDIA MARKETING agency and serviceSOCIAL MEDIA MARKETING agency and service
SOCIAL MEDIA MARKETING agency and service
viralbusinessmarketi
 

Recently uploaded (11)

Top Google Tools for SEO: Enhance Your Website's Performance
Top Google Tools for SEO: Enhance Your Website's PerformanceTop Google Tools for SEO: Enhance Your Website's Performance
Top Google Tools for SEO: Enhance Your Website's Performance
 
UR BHATTI ACADEMY AND ONLINE COURSES.pdf
UR BHATTI ACADEMY AND ONLINE COURSES.pdfUR BHATTI ACADEMY AND ONLINE COURSES.pdf
UR BHATTI ACADEMY AND ONLINE COURSES.pdf
 
ChatGPT 4o for social media step by step Guide.pdf
ChatGPT 4o for social media step by step Guide.pdfChatGPT 4o for social media step by step Guide.pdf
ChatGPT 4o for social media step by step Guide.pdf
 
CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...
CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...
CYBER SECURITY ENHANCEMENT IN NIGERIA. A CASE STUDY OF SIX STATES IN THE NORT...
 
快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样
快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样
快速办理(worcester毕业证书)伍斯特大学毕业证PDF成绩单一模一样
 
一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理
一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理
一比一原版(CSULB毕业证)加州州立大学长滩分校毕业证如何办理
 
一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理
一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理
一比一原版(UCSD毕业证)加利福尼亚大学圣迭戈分校毕业证如何办理
 
On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...
On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...
On Storytelling & Magic Realism in Rushdie’s Midnight’s Children, Shame, and ...
 
一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理
一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理
一比一原版(AU毕业证)英国阿伯丁大学毕业证如何办理
 
Using Playlists to Increase YouTube Watch Time
Using Playlists to Increase YouTube Watch TimeUsing Playlists to Increase YouTube Watch Time
Using Playlists to Increase YouTube Watch Time
 
SOCIAL MEDIA MARKETING agency and service
SOCIAL MEDIA MARKETING agency and serviceSOCIAL MEDIA MARKETING agency and service
SOCIAL MEDIA MARKETING agency and service
 

Query Word Labeling and Back Transliteration for Indian Languages

  • 1. Query Word Labeling and Back Transliteration for Indian Languages: MSRI Shared task system description Spandana Gella1,2 , Jatin Sharma1 , Kalika Bali1 1 Microsoft Research, India 2 University of Melbourne, Australia December 9, 2013 Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 2. SubTask1: Query Word Labeling Many Indian languages esp. in social media is written using romanized script Table: Shared Task description in two separate steps of query labeling and back transliteration Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 3. Our Methodology Word level language identification based on character n-gram features learned from wordlists extracted from monolingual corpus (King and "Abney, 2013) Adding context switch probability to indirectly learn the language sequence patterns Frequency based filtering Back-Transliteration Hash based mapping between source and target languages (Kumar and Udupa, 2011) Use indic character mapping to create training data in poor-resource languages Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 4. Terminology, Datasets and Tools Character n-gram features: hello :’h’,’e’,..,’o’,’he’,el’..,’hel’..,’hell’,’ello’,’hello’ Training resources: Word lists (from Leipzig Corpus, Anandbazar Patrika), word frequencies and transliterated pairs given as part of shared task Training size from 100 - 5000 words (Always <=546 for Gujarati) (McCallum, 2002) for learning classifiers, MSRI Name Search Tool for Transliteration Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 5. 100 200 500 1000 2000 3000 5000 1.00 accuracy % 0.80 0.85 0.90 0.95 1.00 0.95 0.90 accuracy % NaiveBayes Max−Ent DTree 0.75 0.75 0.80 0.85 0.90 0.85 0.75 0.80 accuracy % 0.95 1.00 Word label prediction based on n-gram features 100 (a) Hindi−English sampled size (a) Hindi 200 500 1000 2000 3000 (b) Gujarati−English sampled size (b) Gujarati 5000 100 200 500 1000 2000 3000 5000 (c) Bangla−English sampled size (c) Bangla Figure: Learning curves for maximum entropy, naive Bayes and decision tree on word labeling for Hindi, Gujarati and Bangla language on development data Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 6. 100 200 500 1000 2000 3000 (a) Hindi - Maxent 5000 s, − < # * + s , < − − # * + # + * 100 200 500 1000 2000 3000 (b) Gujarati - Maxent 5000 , s , s < 0.87 < − s, # * + 0.86 , − < s # + * 0.85 s s , < , − < # s + * accuracy % 0.95 0.90 s * + < − # , s # − , < s + * 0.84 , s + − < # *, s accuracy % < + # < − *, s 0.85 0.92 0.90 0.88 + − # * < , + − * # − < # + *, s 0.86 accuracy % 0.94 # < − + *, 0.88 Adding context-switch probability s < , − s , < − , s * < # − , s , < − s + * # − < # + * < + # * − + # * + # * − # * + + 100 200 + 0.6 * 0.65 # 0.7 − 0.75 < 0.8 , 0.85 s 0.9 None 500 1000 2000 3000 5000 (c) Bangla - Naive Figure: Learning curves with varying context switch probabilities Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 7. Language Identification Errors Type Romanized Predicted Reference Short Words Ambiguous Words Erroneous Words Mixed Numerals Words i; ve the; ate emosal zara2; duwan2 H; E E; E H E; E E; H H; H E H; H Table: Annotation Errors Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 8. Back Transliteration MSRI Name Search Tool, built based on n-gram based feature hashing Used indic character mapping between Hindi-Bangla and Hindi-Gujarati All 3 systems for Gujarati and Bangla uses indic character mapping Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 9. Test set Results System MSRI-1 MSRI-2 MSRI-3 Maximum Median LA 0.9823 0.9848 0.9826 0.9848 0.9540 Hindi TF 0.8127 0.8130 0.8101 0.8130 0.4160 TQM 0.1940 0.1980 0.1860 0.1980 0.0290 LA 0.9614 0.9755 0.9661 0.9755 0.9661 Gujarati TF TQM 0.4711 0.0800 0.4803 0.0733 0.4748 0.0667 0.4803 0.0800 0.4748 0.0733 LA 0.9259 0.9499 0.9459 0.9499 0.9359 Bangla TF 0.4914 0.5033 0.5137 0.5137 0.4973 TQM 0.0100 0.0100 0.0100 0.0100 0.0100 Table: Language labeling analysis on submitted runs in all three languages, along with maximum and median scores. Our runs which had maximum scores are presented in bold. LA - Labeling Accuracy, TF- Transliteration F-score, TQM - % of queries that had exact labeling and transliteration Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 10. Transliteration Error Analysis Table: Transliteration Errors Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 11. Summary Contributions: Using context switch probability increases the performance of language labeling in code-mixed language. Cross-language character mapping to increase transliteration accuracy promising direction for resource-poor languages Future Work: Extending it to text with spelling variations (covering text normalization) Working on multiple languages esp. poor resource languages by exploiting resources from related languages Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 12. Questions? Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 13. Bibliography I King, B. and "Abney, S. (2013). Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of NAACL-HLT, pages 1110–1119. Kumar, S. and Udupa, R. (2011). Learning hash functions for cross-view similarity search. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume Two, pages 1360–1365. AAAI Press. McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages