The document describes a system for query word labeling and back transliteration for Indian languages. It uses character n-gram features and context switch probability to label words in code-mixed languages. For back transliteration, it uses a hash-based mapping approach between languages. The system achieved high accuracy scores for language labeling and moderate scores for transliteration on Hindi, Gujarati, and Bangla test sets. Areas for future work include handling spelling variations and working on more under-resourced languages.
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
Query Word Labeling and Back Transliteration for Indian Languages
1. Query Word Labeling and Back Transliteration for Indian Languages:
MSRI Shared task system description
Spandana Gella1,2 , Jatin Sharma1 , Kalika Bali1
1
Microsoft Research, India
2
University of Melbourne, Australia
December 9, 2013
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
2. SubTask1: Query Word Labeling
Many Indian languages esp. in social media is written using romanized script
Table: Shared Task description in two separate steps of query labeling and back
transliteration
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
3. Our Methodology
Word level language identification
based on character n-gram features learned from wordlists extracted from
monolingual corpus (King and "Abney, 2013)
Adding context switch probability to indirectly learn the language sequence
patterns
Frequency based filtering
Back-Transliteration
Hash based mapping between source and target languages (Kumar and
Udupa, 2011)
Use indic character mapping to create training data in poor-resource
languages
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
4. Terminology, Datasets and Tools
Character n-gram features: hello :’h’,’e’,..,’o’,’he’,el’..,’hel’..,’hell’,’ello’,’hello’
Training resources: Word lists (from Leipzig Corpus, Anandbazar Patrika),
word frequencies and transliterated pairs given as part of shared task
Training size from 100 - 5000 words (Always <=546 for Gujarati)
(McCallum, 2002) for learning classifiers, MSRI Name Search Tool for
Transliteration
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
5. 100
200
500
1000
2000
3000
5000
1.00
accuracy %
0.80
0.85
0.90
0.95
1.00
0.95
0.90
accuracy %
NaiveBayes
Max−Ent
DTree
0.75
0.75
0.80
0.85
0.90
0.85
0.75
0.80
accuracy %
0.95
1.00
Word label prediction based on n-gram features
100
(a) Hindi−English sampled size
(a) Hindi
200
500
1000
2000
3000
(b) Gujarati−English sampled size
(b) Gujarati
5000
100
200
500 1000 2000 3000 5000
(c) Bangla−English sampled size
(c) Bangla
Figure: Learning curves for maximum entropy, naive Bayes and decision tree on word
labeling for Hindi, Gujarati and Bangla language on development data
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
6. 100
200
500
1000
2000
3000
(a) Hindi - Maxent
5000
s,
−
<
#
*
+
s
,
<
−
−
#
*
+
#
+
*
100
200
500
1000
2000
3000
(b) Gujarati - Maxent
5000
,
s
,
s
<
0.87
<
−
s,
#
*
+
0.86
,
−
<
s
#
+
*
0.85
s
s
,
<
,
−
<
#
s
+
*
accuracy %
0.95
0.90
s
*
+
<
−
#
,
s
#
−
,
<
s
+
*
0.84
,
s
+
−
<
#
*,
s
accuracy %
<
+
#
<
−
*,
s
0.85
0.92
0.90
0.88
+
−
#
*
<
,
+
−
*
#
−
<
#
+
*,
s
0.86
accuracy %
0.94
#
<
−
+
*,
0.88
Adding context-switch probability
s
<
,
−
s
,
<
−
,
s
*
<
#
−
,
s
,
<
−
s
+
*
#
−
<
#
+
*
<
+
#
*
−
+
#
*
+
#
*
−
#
*
+
+
100
200
+ 0.6
* 0.65
# 0.7
− 0.75
< 0.8
, 0.85
s 0.9
None
500 1000 2000 3000 5000
(c) Bangla - Naive
Figure: Learning curves with varying context switch probabilities
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
7. Language Identification Errors
Type
Romanized
Predicted
Reference
Short Words
Ambiguous Words
Erroneous Words
Mixed Numerals Words
i; ve
the; ate
emosal
zara2; duwan2
H; E
E; E
H
E; E
E; H
H; H
E
H; H
Table: Annotation Errors
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
8. Back Transliteration
MSRI Name Search Tool, built based on n-gram based feature hashing
Used indic character mapping between Hindi-Bangla and Hindi-Gujarati
All 3 systems for Gujarati and Bangla uses indic character mapping
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
10. Transliteration Error Analysis
Table: Transliteration Errors
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
11. Summary
Contributions:
Using context switch probability increases the performance of language
labeling in code-mixed language.
Cross-language character mapping to increase transliteration accuracy promising direction for resource-poor languages
Future Work:
Extending it to text with spelling variations (covering text normalization)
Working on multiple languages esp. poor resource languages by exploiting
resources from related languages
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages
13. Bibliography I
King, B. and "Abney, S. (2013). Labeling the languages of words in
mixed-language documents using weakly supervised methods. In
Proceedings of NAACL-HLT, pages 1110–1119.
Kumar, S. and Udupa, R. (2011). Learning hash functions for cross-view
similarity search. In Proceedings of the Twenty-Second international joint
conference on Artificial Intelligence-Volume Volume Two, pages 1360–1365.
AAAI Press.
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit.
Authors: Spandana Gella, Jatin Sharma, Kalika Bali
Query Word Labeling and Back Transliteration for Indian Languages