SlideShare a Scribd company logo
1 of 13
Download to read offline
Query Word Labeling and Back Transliteration for Indian Languages:
MSRI Shared task system description
Spandana Gella1,2 , Jatin Sharma1 , Kalika Bali1
1

Microsoft Research, India

2

University of Melbourne, Australia

December 9, 2013

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
SubTask1: Query Word Labeling

Many Indian languages esp. in social media is written using romanized script

Table: Shared Task description in two separate steps of query labeling and back
transliteration

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Our Methodology

Word level language identification
based on character n-gram features learned from wordlists extracted from
monolingual corpus (King and "Abney, 2013)
Adding context switch probability to indirectly learn the language sequence
patterns
Frequency based filtering

Back-Transliteration
Hash based mapping between source and target languages (Kumar and
Udupa, 2011)
Use indic character mapping to create training data in poor-resource
languages

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Terminology, Datasets and Tools

Character n-gram features: hello :’h’,’e’,..,’o’,’he’,el’..,’hel’..,’hell’,’ello’,’hello’
Training resources: Word lists (from Leipzig Corpus, Anandbazar Patrika),
word frequencies and transliterated pairs given as part of shared task
Training size from 100 - 5000 words (Always <=546 for Gujarati)
(McCallum, 2002) for learning classifiers, MSRI Name Search Tool for
Transliteration

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
100

200

500

1000

2000

3000

5000

1.00
accuracy %

0.80

0.85

0.90

0.95

1.00
0.95
0.90

accuracy %

NaiveBayes
Max−Ent
DTree

0.75

0.75

0.80

0.85

0.90
0.85
0.75

0.80

accuracy %

0.95

1.00

Word label prediction based on n-gram features

100

(a) Hindi−English sampled size

(a) Hindi

200

500

1000

2000

3000

(b) Gujarati−English sampled size

(b) Gujarati

5000

100

200

500 1000 2000 3000 5000

(c) Bangla−English sampled size

(c) Bangla

Figure: Learning curves for maximum entropy, naive Bayes and decision tree on word
labeling for Hindi, Gujarati and Bangla language on development data

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
100

200

500

1000

2000

3000

(a) Hindi - Maxent

5000

s,
−
<
#
*
+

s
,
<
−

−
#
*
+

#
+
*

100

200

500

1000

2000

3000

(b) Gujarati - Maxent

5000

,
s

,
s
<

0.87

<
−
s,
#
*
+

0.86

,
−
<
s
#
+
*

0.85

s

s
,
<

,
−
<
#
s
+
*

accuracy %

0.95
0.90

s

*
+
<
−
#
,
s

#
−
,
<
s
+
*

0.84

,
s

+
−
<
#
*,
s

accuracy %

<

+
#
<
−
*,
s

0.85

0.92
0.90
0.88

+
−
#
*
<
,

+
−
*
#

−
<
#
+
*,
s

0.86

accuracy %

0.94

#
<
−
+
*,

0.88

Adding context-switch probability

s
<
,
−

s
,
<
−

,
s
*
<
#
−
,
s

,
<
−
s

+
*
#

−
<
#
+
*
<
+
#
*
−

+
#
*
+
#
*
−

#
*
+

+

100

200

+ 0.6
* 0.65
# 0.7
− 0.75
< 0.8
, 0.85
s 0.9
None

500 1000 2000 3000 5000

(c) Bangla - Naive

Figure: Learning curves with varying context switch probabilities

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Language Identification Errors

Type

Romanized

Predicted

Reference

Short Words
Ambiguous Words
Erroneous Words
Mixed Numerals Words

i; ve
the; ate
emosal
zara2; duwan2

H; E
E; E
H
E; E

E; H
H; H
E
H; H

Table: Annotation Errors

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Back Transliteration

MSRI Name Search Tool, built based on n-gram based feature hashing
Used indic character mapping between Hindi-Bangla and Hindi-Gujarati
All 3 systems for Gujarati and Bangla uses indic character mapping

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Test set Results

System

MSRI-1
MSRI-2
MSRI-3
Maximum
Median

LA
0.9823
0.9848
0.9826
0.9848
0.9540

Hindi
TF
0.8127
0.8130
0.8101
0.8130
0.4160

TQM
0.1940
0.1980
0.1860
0.1980
0.0290

LA
0.9614
0.9755
0.9661
0.9755
0.9661

Gujarati
TF
TQM
0.4711 0.0800
0.4803 0.0733
0.4748 0.0667
0.4803 0.0800
0.4748 0.0733

LA
0.9259
0.9499
0.9459
0.9499
0.9359

Bangla
TF
0.4914
0.5033
0.5137
0.5137
0.4973

TQM
0.0100
0.0100
0.0100
0.0100
0.0100

Table: Language labeling analysis on submitted runs in all three languages, along with
maximum and median scores. Our runs which had maximum scores are presented in
bold. LA - Labeling Accuracy, TF- Transliteration F-score, TQM - % of queries that
had exact labeling and transliteration

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Transliteration Error Analysis

Table: Transliteration Errors

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Summary

Contributions:
Using context switch probability increases the performance of language
labeling in code-mixed language.
Cross-language character mapping to increase transliteration accuracy promising direction for resource-poor languages

Future Work:
Extending it to text with spelling variations (covering text normalization)
Working on multiple languages esp. poor resource languages by exploiting
resources from related languages

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Questions?

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages
Bibliography I

King, B. and "Abney, S. (2013). Labeling the languages of words in
mixed-language documents using weakly supervised methods. In
Proceedings of NAACL-HLT, pages 1110–1119.
Kumar, S. and Udupa, R. (2011). Learning hash functions for cross-view
similarity search. In Proceedings of the Twenty-Second international joint
conference on Artificial Intelligence-Volume Volume Two, pages 1360–1365.
AAAI Press.
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit.

Authors: Spandana Gella, Jatin Sharma, Kalika Bali

Query Word Labeling and Back Transliteration for Indian Languages

More Related Content

What's hot

Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET Journal
 
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Waqas Tariq
 
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONIjnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONijnlc
 
Development of morphological analyzer for hindi
Development of morphological analyzer for hindiDevelopment of morphological analyzer for hindi
Development of morphological analyzer for hindiMade Artha
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002IJARTES
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELijnlc
 
Language Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple ApproachLanguage Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple ApproachIJERA Editor
 
HMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDIHMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDIcscpconf
 
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATIONHANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATIONijnlc
 
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiA Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiIJERA Editor
 
An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
 
Anaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodAnaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodijcsa
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal
 
Pronominal anaphora resolution in
Pronominal anaphora resolution inPronominal anaphora resolution in
Pronominal anaphora resolution inijfcstjournal
 
Spell checker for Kannada OCR
Spell checker for Kannada OCRSpell checker for Kannada OCR
Spell checker for Kannada OCRdbpublications
 
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
 
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...kevig
 

What's hot (20)

Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
IRJET -Survey on Named Entity Recognition using Syntactic Parsing for Hindi L...
 
Aruna kumar K R resume
Aruna kumar K R  resumeAruna kumar K R  resume
Aruna kumar K R resume
 
Ijetcas14 575
Ijetcas14 575Ijetcas14 575
Ijetcas14 575
 
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
 
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONIjnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
 
Development of morphological analyzer for hindi
Development of morphological analyzer for hindiDevelopment of morphological analyzer for hindi
Development of morphological analyzer for hindi
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
 
Language Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple ApproachLanguage Identification from a Tri-lingual Printed Document: A Simple Approach
Language Identification from a Tri-lingual Printed Document: A Simple Approach
 
HMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDIHMM BASED POS TAGGER FOR HINDI
HMM BASED POS TAGGER FOR HINDI
 
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATIONHANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION
 
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiA Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi
 
An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity Removal
 
Anaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodAnaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer method
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
Pronominal anaphora resolution in
Pronominal anaphora resolution inPronominal anaphora resolution in
Pronominal anaphora resolution in
 
Spell checker for Kannada OCR
Spell checker for Kannada OCRSpell checker for Kannada OCR
Spell checker for Kannada OCR
 
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
 
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
 

Similar to Query Word Labeling and Back Transliteration for Indian Languages

Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabikevig
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesAlgoscale Technologies Inc.
 
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Nischal Lal Shrestha
 
Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languageskevig
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
 
Intrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word EmbeddingsIntrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word EmbeddingsJinho Choi
 

Similar to Query Word Labeling and Back Transliteration for Indian Languages (7)

Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabi
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian Languages
 
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...
 
Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 
Intrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word EmbeddingsIntrinsic and Extrinsic Evaluations of Word Embeddings
Intrinsic and Extrinsic Evaluations of Word Embeddings
 

Recently uploaded

Capstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdfCapstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdfeliklein8
 
Film the city investagation powerpoint :)
Film the city investagation powerpoint :)Film the city investagation powerpoint :)
Film the city investagation powerpoint :)AshtonCains
 
International Airport Call Girls 🥰 8617370543 Service Offer VIP Hot Model
International Airport Call Girls 🥰 8617370543 Service Offer VIP Hot ModelInternational Airport Call Girls 🥰 8617370543 Service Offer VIP Hot Model
International Airport Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort ServiceVellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort ServiceDamini Dixit
 
Capstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdfCapstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdfeliklein8
 
Production diary Film the city powerpoint
Production diary Film the city powerpointProduction diary Film the city powerpoint
Production diary Film the city powerpointAshtonCains
 
Film show production powerpoint for site
Film show production powerpoint for siteFilm show production powerpoint for site
Film show production powerpoint for siteAshtonCains
 
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANGCara Menggugurkan Kandungan 087776558899
 
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...Delhi Call girls
 
Panjim Goa Escort Girls ✿✸ 9971646499 ₢♚ Call Girls Panjim Goa Direct Cash O...
Panjim Goa Escort Girls ✿✸ 9971646499  ₢♚ Call Girls Panjim Goa Direct Cash O...Panjim Goa Escort Girls ✿✸ 9971646499  ₢♚ Call Girls Panjim Goa Direct Cash O...
Panjim Goa Escort Girls ✿✸ 9971646499 ₢♚ Call Girls Panjim Goa Direct Cash O...ritikaroy0888
 
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...ZurliaSoop
 
Karol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verified
Karol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verifiedKarol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verified
Karol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<Health
 
College & House wife Call Girls in Paharganj 9634446618 -Best Escort call gi...
College & House wife  Call Girls in Paharganj 9634446618 -Best Escort call gi...College & House wife  Call Girls in Paharganj 9634446618 -Best Escort call gi...
College & House wife Call Girls in Paharganj 9634446618 -Best Escort call gi...Heena Escort Service
 
Unlock the power of Instagram with SocioCosmos. Start your journey towards so...
Unlock the power of Instagram with SocioCosmos. Start your journey towards so...Unlock the power of Instagram with SocioCosmos. Start your journey towards so...
Unlock the power of Instagram with SocioCosmos. Start your journey towards so...SocioCosmos
 
Sociocosmos empowers you to go trendy on social media with a few clicks..pdf
Sociocosmos empowers you to go trendy on social media with a few clicks..pdfSociocosmos empowers you to go trendy on social media with a few clicks..pdf
Sociocosmos empowers you to go trendy on social media with a few clicks..pdfSocioCosmos
 
Enhancing Consumer Trust Through Strategic Content Marketing
Enhancing Consumer Trust Through Strategic Content MarketingEnhancing Consumer Trust Through Strategic Content Marketing
Enhancing Consumer Trust Through Strategic Content MarketingDigital Marketing Lab
 
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 

Recently uploaded (20)

Capstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdfCapstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdf
 
Film the city investagation powerpoint :)
Film the city investagation powerpoint :)Film the city investagation powerpoint :)
Film the city investagation powerpoint :)
 
International Airport Call Girls 🥰 8617370543 Service Offer VIP Hot Model
International Airport Call Girls 🥰 8617370543 Service Offer VIP Hot ModelInternational Airport Call Girls 🥰 8617370543 Service Offer VIP Hot Model
International Airport Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort ServiceVellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
 
Capstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdfCapstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdf
 
Production diary Film the city powerpoint
Production diary Film the city powerpointProduction diary Film the city powerpoint
Production diary Film the city powerpoint
 
Film show production powerpoint for site
Film show production powerpoint for siteFilm show production powerpoint for site
Film show production powerpoint for site
 
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
 
The Butterfly Effect
The Butterfly EffectThe Butterfly Effect
The Butterfly Effect
 
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
 
Panjim Goa Escort Girls ✿✸ 9971646499 ₢♚ Call Girls Panjim Goa Direct Cash O...
Panjim Goa Escort Girls ✿✸ 9971646499  ₢♚ Call Girls Panjim Goa Direct Cash O...Panjim Goa Escort Girls ✿✸ 9971646499  ₢♚ Call Girls Panjim Goa Direct Cash O...
Panjim Goa Escort Girls ✿✸ 9971646499 ₢♚ Call Girls Panjim Goa Direct Cash O...
 
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
 
Karol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verified
Karol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verifiedKarol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verified
Karol Bagh, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
 
College & House wife Call Girls in Paharganj 9634446618 -Best Escort call gi...
College & House wife  Call Girls in Paharganj 9634446618 -Best Escort call gi...College & House wife  Call Girls in Paharganj 9634446618 -Best Escort call gi...
College & House wife Call Girls in Paharganj 9634446618 -Best Escort call gi...
 
Unlock the power of Instagram with SocioCosmos. Start your journey towards so...
Unlock the power of Instagram with SocioCosmos. Start your journey towards so...Unlock the power of Instagram with SocioCosmos. Start your journey towards so...
Unlock the power of Instagram with SocioCosmos. Start your journey towards so...
 
Sociocosmos empowers you to go trendy on social media with a few clicks..pdf
Sociocosmos empowers you to go trendy on social media with a few clicks..pdfSociocosmos empowers you to go trendy on social media with a few clicks..pdf
Sociocosmos empowers you to go trendy on social media with a few clicks..pdf
 
Enhancing Consumer Trust Through Strategic Content Marketing
Enhancing Consumer Trust Through Strategic Content MarketingEnhancing Consumer Trust Through Strategic Content Marketing
Enhancing Consumer Trust Through Strategic Content Marketing
 
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
 
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
 

Query Word Labeling and Back Transliteration for Indian Languages

  • 1. Query Word Labeling and Back Transliteration for Indian Languages: MSRI Shared task system description Spandana Gella1,2 , Jatin Sharma1 , Kalika Bali1 1 Microsoft Research, India 2 University of Melbourne, Australia December 9, 2013 Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 2. SubTask1: Query Word Labeling Many Indian languages esp. in social media is written using romanized script Table: Shared Task description in two separate steps of query labeling and back transliteration Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 3. Our Methodology Word level language identification based on character n-gram features learned from wordlists extracted from monolingual corpus (King and "Abney, 2013) Adding context switch probability to indirectly learn the language sequence patterns Frequency based filtering Back-Transliteration Hash based mapping between source and target languages (Kumar and Udupa, 2011) Use indic character mapping to create training data in poor-resource languages Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 4. Terminology, Datasets and Tools Character n-gram features: hello :’h’,’e’,..,’o’,’he’,el’..,’hel’..,’hell’,’ello’,’hello’ Training resources: Word lists (from Leipzig Corpus, Anandbazar Patrika), word frequencies and transliterated pairs given as part of shared task Training size from 100 - 5000 words (Always <=546 for Gujarati) (McCallum, 2002) for learning classifiers, MSRI Name Search Tool for Transliteration Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 5. 100 200 500 1000 2000 3000 5000 1.00 accuracy % 0.80 0.85 0.90 0.95 1.00 0.95 0.90 accuracy % NaiveBayes Max−Ent DTree 0.75 0.75 0.80 0.85 0.90 0.85 0.75 0.80 accuracy % 0.95 1.00 Word label prediction based on n-gram features 100 (a) Hindi−English sampled size (a) Hindi 200 500 1000 2000 3000 (b) Gujarati−English sampled size (b) Gujarati 5000 100 200 500 1000 2000 3000 5000 (c) Bangla−English sampled size (c) Bangla Figure: Learning curves for maximum entropy, naive Bayes and decision tree on word labeling for Hindi, Gujarati and Bangla language on development data Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 6. 100 200 500 1000 2000 3000 (a) Hindi - Maxent 5000 s, − < # * + s , < − − # * + # + * 100 200 500 1000 2000 3000 (b) Gujarati - Maxent 5000 , s , s < 0.87 < − s, # * + 0.86 , − < s # + * 0.85 s s , < , − < # s + * accuracy % 0.95 0.90 s * + < − # , s # − , < s + * 0.84 , s + − < # *, s accuracy % < + # < − *, s 0.85 0.92 0.90 0.88 + − # * < , + − * # − < # + *, s 0.86 accuracy % 0.94 # < − + *, 0.88 Adding context-switch probability s < , − s , < − , s * < # − , s , < − s + * # − < # + * < + # * − + # * + # * − # * + + 100 200 + 0.6 * 0.65 # 0.7 − 0.75 < 0.8 , 0.85 s 0.9 None 500 1000 2000 3000 5000 (c) Bangla - Naive Figure: Learning curves with varying context switch probabilities Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 7. Language Identification Errors Type Romanized Predicted Reference Short Words Ambiguous Words Erroneous Words Mixed Numerals Words i; ve the; ate emosal zara2; duwan2 H; E E; E H E; E E; H H; H E H; H Table: Annotation Errors Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 8. Back Transliteration MSRI Name Search Tool, built based on n-gram based feature hashing Used indic character mapping between Hindi-Bangla and Hindi-Gujarati All 3 systems for Gujarati and Bangla uses indic character mapping Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 9. Test set Results System MSRI-1 MSRI-2 MSRI-3 Maximum Median LA 0.9823 0.9848 0.9826 0.9848 0.9540 Hindi TF 0.8127 0.8130 0.8101 0.8130 0.4160 TQM 0.1940 0.1980 0.1860 0.1980 0.0290 LA 0.9614 0.9755 0.9661 0.9755 0.9661 Gujarati TF TQM 0.4711 0.0800 0.4803 0.0733 0.4748 0.0667 0.4803 0.0800 0.4748 0.0733 LA 0.9259 0.9499 0.9459 0.9499 0.9359 Bangla TF 0.4914 0.5033 0.5137 0.5137 0.4973 TQM 0.0100 0.0100 0.0100 0.0100 0.0100 Table: Language labeling analysis on submitted runs in all three languages, along with maximum and median scores. Our runs which had maximum scores are presented in bold. LA - Labeling Accuracy, TF- Transliteration F-score, TQM - % of queries that had exact labeling and transliteration Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 10. Transliteration Error Analysis Table: Transliteration Errors Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 11. Summary Contributions: Using context switch probability increases the performance of language labeling in code-mixed language. Cross-language character mapping to increase transliteration accuracy promising direction for resource-poor languages Future Work: Extending it to text with spelling variations (covering text normalization) Working on multiple languages esp. poor resource languages by exploiting resources from related languages Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 12. Questions? Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages
  • 13. Bibliography I King, B. and "Abney, S. (2013). Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of NAACL-HLT, pages 1110–1119. Kumar, S. and Udupa, R. (2011). Learning hash functions for cross-view similarity search. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume Two, pages 1360–1365. AAAI Press. McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. Authors: Spandana Gella, Jatin Sharma, Kalika Bali Query Word Labeling and Back Transliteration for Indian Languages