Machine Learning Approach for Language Identification &
Transliteration: Shared Task Report of IITP-TS
Deepak Kumar Gupta
Comp. Sc. & Engg. Deptt.
IIT Patna, India
deepak.mtmc13@iitp.ac.in
Shubham Kumar
Comp. Sc. & Engg. Deptt.
IIT Patna, India
shubham.ee12@iitp.ac.in
Asif Ekbal
Comp. Sc. & Engg. Deptt.
IIT Patna, India
asif@iitp.ac.in
ABSTRACT
In this paper, we describe the system that we developed as
part of our participation in the FIRE-2014 Shared Task on
Transliterated Search. We participated only in Subtask 1,
which focused on labeling the query words. The entire
process consists of the following subtasks: language identification
of each word in the text, named entity recognition and
classification (NERC), and transliteration of the Indian-language
words written in non-native scripts into the corresponding
native Indic scripts. The proposed methods for language
identification and NERC are supervised, making use of several
machine learning algorithms. We develop a transliteration
framework based on the modified joint source-channel model.
Experiments on the benchmark setup show that we achieve
quite encouraging performance for both pairs of languages.
It is also to be noted that we did not make use of any deep
domain-specific resources and/or tools, and therefore the
system can be easily adapted to other domains and/or languages.
Keywords
Language Identification, NERC, Transliteration, Ensemble,
Modified Joint-Source Channel Model
1. INTRODUCTION
The recent decade has seen an upsurge in the social networking
and e-commerce sectors, with an enormous growth in the
volume of data flowing out of media networks; this data can
be used by public and private organizations alike to gain
valuable insights. New forms of communication, such as
micro-blogging, tweets, status updates, reviews and text
messaging, have emerged and become ubiquitous. These messages
are often written in the Roman script for various socio-cultural
and technological reasons [4]. Many languages, such as the
South and South-East Asian languages, Arabic and Russian,
use indigenous scripts in their written forms. The process of
phonetically representing the words of a language in a
non-native script is called transliteration. Transliteration,
especially into the Roman script, is used abundantly on the
Web, not only for documents but also for user queries that
intend to search for those documents. These problems were
addressed in the FIRE-2013 Shared Task on Transliteration [5].
More recent studies show that building computational models
for social media content is more challenging because of the
nature of the code-mixing, the presence of non-standard
variations in spelling and grammar, and transliteration [1].
The work that we present here is part of the shared task
conducted as a continuation of the previous year's edition.
1.1 Task Description
This year, two subtasks on Transliterated Search were
conducted: the first is query labeling, and the second is an
ad-hoc retrieval task for Hindi film lyrics. We participated
in the first task, which is briefly described as follows:
Subtask 1: Query Word Labeling
Suppose that a query q: w1 w2 w3 . . . wn is written in the
Roman script. The words w1, w2, etc. could be standard English
words or transliterated from another language L. The
task is to label each word as E or L depending on whether
it is an English word or a transliterated L-language word,
and then, for each transliterated word, to provide the correct
transliteration in the native script (i.e., the script used
for writing L). The task also required identifying and
classifying named entities of types person, location,
organization and abbreviation.
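As a concrete illustration of the labeling format above, consider a short mixed Hindi-English query; the tokens and labels below are invented for illustration and are not taken from the task data:

```python
# Hypothetical Subtask 1 example: a Roman-script query whose tokens are
# labeled E (standard English word) or H (transliterated Hindi word).
query = ["yeh", "song", "bahut", "achha", "hai"]
labels = ["H", "E", "H", "H", "H"]

# Pair each token with its language label, as the task output requires.
labeled = list(zip(query, labels))
assert labeled[1] == ("song", "E")  # "song" is a standard English word
```

Each H-labeled token would additionally be back-transliterated into Devanagari, which this small sketch omits.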
2. METHODOLOGY
The overall task of query labeling consists of three major
components, viz. language identification, NERC and
transliteration. It is to be noted that we did not use any
domain-specific resources and/or tools, for the sake of
domain independence. Below we describe the methodologies
that we followed for each of these individual modules.
2.1 Language Identification
The problem of language identification concerns determining
the language of a given word. The task can be modeled as a
classification problem, where each word has to be labeled
with one of three classes, namely Hindi (or Bengali), English
and Mixed (denoting mixed characters of English and non-Latin
language scripts). Our proposed method for language
identification is supervised in nature. In particular, we
develop systems based on four different classifiers, namely
random forest, random tree, support vector machine and
decision tree. We use the Weka implementations1 of these
classifiers. In order to further improve the performance, we
construct an ensemble by combining the decisions of all the
classifiers using majority voting. We followed a similar
approach for both the Hindi-English and Bangla-English pairs.
The features that we implemented for language identification
are briefly described below:
1. Character n-gram: A character n-gram is a contiguous
sequence of n characters extracted from a given word. We
extract character n-grams of length one (unigram), two
(bigram) and three (trigram), and use these as features of
the classifiers.
2. Context word: Local contexts help to identify the type of
the current word. We use the previous two and next two words
as contextual features.
3. Word normalization: Words are normalized in order to
capture the similarity between two different words that
share some common properties. Each capital letter is
replaced by 'A', each lower-case letter by 'a' and each
digit by '0'.
4. Gazetteer-based feature: We compile lists of Hindi,
Bengali and English words from the training datasets. A
feature vector of length two is defined, one bit for each
gazetteer of the language pair (Hindi-English or
Bangla-English). For each token, a bit is set to '1' if the
token is present in the respective gazetteer and to '0'
otherwise. Hence, for words that appear in both gazetteers,
the feature vector takes the value 1 in both bit positions.
Recent studies also suggest that gazetteer-based features
can be used effectively for language identification [3].
5. InitCap: This feature checks whether the current token
starts with a capital letter.
6. InitPunDigit: We define a binary-valued feature that
checks whether the current token starts with a punctuation
symbol or a digit.
7. DigitAlpha: We define this feature to check whether the
current token is alphanumeric.
8. Contains # symbol: We define a feature that checks
whether the word contains the symbol #.
The last three features help to recognize tokens that are
mixed in nature (i.e., do not belong to Hindi, English or
Bangla). Some examples are: 2mar, #lol, (rozana, etc.
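The features above can be sketched in a few lines; this is our own minimal illustration under assumed straightforward implementations, not the paper's code (the function names are ours):

```python
# Sketch of the word-level features for language identification.

def char_ngrams(word, n_max=3):
    """Character n-grams of length 1 (unigram) up to n_max (here trigram)."""
    return [word[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(word) - n + 1)]

def normalize(word):
    """Replace capitals with 'A', lower-case letters with 'a', digits with '0'."""
    return "".join("A" if c.isupper() else
                   "a" if c.islower() else
                   "0" if c.isdigit() else c
                   for c in word)

def gazetteer_bits(word, native_gaz, english_gaz):
    """Length-two binary vector: presence in the native and English gazetteers."""
    return [int(word in native_gaz), int(word in english_gaz)]

def is_digit_alpha(word):
    """DigitAlpha-style check: token contains both digits and letters."""
    return any(c.isdigit() for c in word) and any(c.isalpha() for c in word)
```

For example, `normalize("2mar")` yields `"0aaa"`, and `is_digit_alpha("2mar")` is true, which is how mixed tokens such as 2mar get flagged.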
2.2 Named Entity Recognition and Classification
Named entity recognition and classification (NERC) in
unstructured texts such as Facebook posts, blogs, etc. is
more challenging than in the traditional newswire domains.
1
http://www.cs.waikato.ac.nz/ml/weka/
Here the task was to identify named entities (NEs) and
classify them into the following categories: Person,
Organization, Location and Abbreviation. We use machine
learning models to recognize the first three NE types, and
for the last one we use heuristics. It is to be noted that
there were many inconsistencies in the annotation, and hence
we pre-processed the datasets to maintain uniformity. In
order to denote the boundaries of NEs we use the BIO
encoding scheme2. We implement the following features for
NERC.
1. Local context: Local contexts spanning the preceding and
following few tokens of the current word are used as
features; here we use the previous two and next two tokens.
2. Character n-gram: Similar to language identification, we
use n-grams of length up to 5 as features.
3. Prefix and Suffix: Prefixes and suffixes of fixed-length
character sequences (here, 3) are stripped from each token
and used as features of the classifier.
4. Word normalization: This feature is defined exactly
in the same way as we did for language identification.
5. WordClassFeature: This feature is defined to ensure that
words having similar structures belong to the same class. In
the first step we normalize all the words following the
process mentioned above. Thereafter, runs of the same
character are squeezed into a single character. For example,
the normalized word AAAaaa is converted to Aa. We found this
feature to be effective in the biomedical domain, and we
adapted it directly without any modification. A detailed
sensitivity analysis might be useful to study its
effectiveness in the current domain.
6. Typographic features: We define a set of features
depending upon the typographic construction of the words. We
implement the following four features: AllCaps (whether the
current word is made up of all capital letters), AllSmall
(the word consists only of lower-case characters), InitCap
(the word starts with a capital letter) and DigitAlpha (the
word contains both digits and letters).
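The two-step WordClassFeature described above (normalize, then squeeze repeated characters) can be sketched as follows; this is our illustration of the stated transformation, not the authors' code:

```python
# WordClassFeature sketch: normalize the word, then collapse runs of the
# same character into one (e.g., the normalized form "AAAaaa" becomes "Aa").
from itertools import groupby

def normalize(word):
    return "".join("A" if c.isupper() else
                   "a" if c.islower() else
                   "0" if c.isdigit() else c
                   for c in word)

def word_class(word):
    # groupby yields one group per run of equal characters; keep one each.
    return "".join(ch for ch, _ in groupby(normalize(word)))

assert word_class("IITP") == "A"       # "AAAA" -> "A"
assert word_class("Patna") == "Aa"     # "Aaaaa" -> "Aa"
assert word_class("FIRE2014") == "A0"  # "AAAA0000" -> "A0"
```

This maps all similarly shaped tokens (e.g., capitalized words, all-caps abbreviations) onto a small set of class strings.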
2.3 Transliteration
A transliteration system takes as input a character string in
the source language and generates a character string in the
target language as output. The transliteration algorithm [2]
that we use here is conceptualized as two levels of decoding:
segmenting the source and target language strings into
transliteration units (TUs), and defining an appropriate
mapping between the source and target TUs by resolving
different combinations of alignments and unit mappings. A TU
is defined by means of a regular expression.
For K aligned TUs (X: source TUs and T: target TUs), we have

P(X, T) = P(x_1, x_2, . . . , x_K, t_1, t_2, . . . , t_K)
        = P(<x,t>_1, <x,t>_2, . . . , <x,t>_K)
        = ∏_{k=1}^{K} P(<x,t>_k | <x,t>_1^{k-1})

2
B, I and O denote the beginning, intermediate and outside of
the NEs.
We implement a number of transliteration models that can
generate the original Indian-language word (i.e., in the
Indic script) from a given English transliteration written
in the Roman script. The Indic word is divided into TUs that
follow the pattern C+M, where C represents a vowel, a
consonant or a conjunct and M represents the vowel modifier
or matra. An English word is divided into TUs that follow
the pattern C*V*, where C represents a consonant and V
represents a vowel [2]. The process considers contextual
information on both the source and target sides in the form
of collocated TUs; it computes the probability of
transliteration from each source TU to the various target
candidate TUs and chooses the one with maximum probability.
The most appropriate mappings between the source and target
TUs are learned automatically from the bilingual training
corpus. The training process yields a kind of decision list
that, for each source TU, outputs a set of possible target
TUs along with the probability of each decision obtained
from the training corpus. The transliteration of an input
word is obtained by direct orthographic mapping: identifying
the equivalent target TU for each source TU in the input and
then placing the target TUs in order. We implemented all six
models proposed in [2]. Based on experiments on held-out
datasets, we selected the following three models for our
submitted runs:
Model-I: This is a kind of monogram model where no context
is considered, i.e.
P(X, T) = ∏_{k=1}^{K} P(<x,t>_k)
Model-II: This model considers the next source TU as
context:
P(X, T) = ∏_{k=1}^{K} P(<x,t>_k | x_{k+1})
Model-III: This model incorporates the previous and next TUs
on the source side and the previous target TU as context:
P(X, T) = ∏_{k=1}^{K} P(<x,t>_k | <x,t>_{k-1}, x_{k+1})
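As a rough illustration of the English-side TU segmentation described above (pattern C*V*: a run of consonants followed by a run of vowels), a sketch follows. The paper does not give its exact regular expression, so this variant is an assumption (e.g., it treats 'y' as a consonant):

```python
# Segment a Roman-script word into transliteration units of shape C*V*.
import re

TU_PATTERN = re.compile(r"[^aeiou]*[aeiou]*", re.IGNORECASE)

def english_tus(word):
    # re.findall can yield empty matches at the end of the string; drop them.
    return [tu for tu in TU_PATTERN.findall(word) if tu]

assert english_tus("rozana") == ["ro", "za", "na"]
assert english_tus("achha") == ["a", "chha"]
```

Each such source TU is then mapped to its most probable target TU under the chosen model.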
The overall transliteration process attempts to produce the
best output for the given input word using Model-III. If no
transliteration is obtained, we consult Model-II and then
Model-I, in sequence. If none of these models produces an
output, we fall back on a literal transliteration model
built from a dictionary. This process is shown below:
Input: token t, labeled L3 in language identification
Output: transliteration T of the given token
• Step 1: T <- Model-III(t)4
    1.1 If T contains a null value
        1.1.1 T <- Model-III-withAlignment(t)5
• Step 2: If T contains a null value
    2.1 T <- Model-II(t)
    2.2 If T contains a null value
        2.2.1 T <- Model-II-withAlignment(t)
• Step 3: If T contains a null value
    3.1 T <- Model-I(t)
    3.2 If T contains a null value
        3.2.1 T <- Model-I-withAlignment(t)
• Step 4: return T
3
Denotes Hindi (H) or Bengali (B).
4
Each model takes an input token, divides it into non-native
TUs and returns the native TU for each of them.
5
Each withAlignment model takes an input token and aligns the
source TUs with the target TUs.
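The back-off sequence above can be sketched as follows; the model functions here are placeholders standing in for the trained models (which the paper does not spell out in code), and the names are ours:

```python
# Back-off transliteration: try each model and its withAlignment variant
# from highest order (Model-III) to lowest (Model-I), then fall back on a
# literal dictionary. A model returns a string or None when it has no output.

def transliterate(token, models, literal_dict):
    """models: list of (model, model_with_alignment) pairs, highest order first."""
    for model, model_with_alignment in models:
        result = model(token)
        if result is None:
            result = model_with_alignment(token)
        if result is not None:
            return result
    # No model produced output: consult the literal transliteration dictionary.
    return literal_dict.get(token)

# Placeholder models for illustration: only Model-I "knows" this token.
none_model = lambda t: None
models = [(none_model, none_model),                          # Model-III
          (none_model, none_model),                          # Model-II
          (lambda t: {"bahut": "बहुत"}.get(t), none_model)]  # Model-I
assert transliterate("bahut", models, {}) == "बहुत"
```

The real models would of course segment the token into TUs and compose target TUs, per the equations above.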
3. EXPERIMENTS AND DISCUSSIONS
3.1 Datasets
We submitted runs for two pairs of languages, namely
Hindi-English and Bangla-English. For language
identification, the FIRE 2014 organizers provided three
documents for the Hindi-English pair and two documents for
the Bangla-English pair. For each language pair, the
individual documents were merged into a single file for
training. The training sets consist of 1,004 sentences
(20,658 tokens) and 800 sentences (27,969 tokens) for the
Hindi-English and Bangla-English language pairs,
respectively. The test sets consist of 32,270 and 25,698
tokens for Hindi-English and Bangla-English, respectively.
For training the transliteration algorithm we use 54,791
Hindi-English and 19,582 Bangla-English parallel examples.
Details of these datasets can be found in [6].
3.2 Results and Analysis
In this section we report the results that we obtained for
query word labeling. We submitted three runs, defined as
follows:
Run-1: For language identification and NERC, we construct
ensembles using majority voting. If a token is labeled as
the native language (Hindi or Bangla), we transliterate it.
Run-2: In this run we perform language identification with
the majority-voting ensemble and NERC with SMO. Words
labeled as the native language are transliterated
accordingly.
Run-3: In this run both language identification and NERC
are carried out using SMO. Transliteration is done in the
same way.
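The majority-voting ensemble used in Run-1 can be sketched in a few lines; the tie-breaking policy is our assumption (Counter keeps the first-seen label among equally frequent ones):

```python
# Majority-voting ensemble: each classifier predicts a label for a token,
# and the most common label across classifiers wins.
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per classifier for a single token."""
    return Counter(predictions).most_common(1)[0][0]

# Four classifiers (e.g., random forest, random tree, SVM, decision tree)
# vote on the language label of one token.
assert majority_vote(["H", "H", "E", "H"]) == "H"
```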
Run ID LP LR LF EP ER EF LA
Run-1 0.920 0.843 0.880 0.883 0.932 0.907 0.886
Run-2 0.922 0.843 0.881 0.884 0.931 0.907 0.886
Run-3 0.882 0.841 0.861 0.88 0.896 0.888 0.870
Table 1: Results for language identification of
Bangla-English. Here, LP-Language precision, LR-
Language recall, LF-Language FScore, EP-English
precision, ER-English recall, EF-English FScore,
LA-Labelling accuracy
Run ID LP LR LF EP ER EF LA
Run-1 0.921 0.895 0.908 0.89 0.908 0.899 0.879
Run-2 0.921 0.893 0.907 0.89 0.908 0.899 0.878
Run-3 0.905 0.865 0.885 0.86 0.886 0.873 0.857
Table 2: Results for language identification of Hindi-
English
Run ID EQMF-ALL TP TR TF ETPM
Run-1 0.005 0.039 0.574 0.073 228/337
Run-2 0.005 0.039 0.574 0.073 228/337
Run-3 0.005 0.038 0.582 0.071 231/344
Table 3: Results of transliteration for Bangla-
English. Here, EQMF-ALL-Exact query match
fraction, TP-Transliteration precision, TR-
Transliteration recall, TF-Transliteration F-Score,
ETPM-Exact transliteration pair match
The systems were evaluated using the metrics defined in [5].
Overall results for language identification are reported in
Table 1 and Table 2 for the Bangla-English and Hindi-English
pairs, respectively. Evaluation shows that the first two
runs achieve nearly the same F-scores. Experimental results
for transliteration are reported in Table 3 and Table 4 for
Bangla-English and Hindi-English, respectively. Comparisons
with the other submitted systems show that our performance
is at the upper end. For Hindi we obtain precision, recall
and F-score of 92.10%, 89.50% and 90.80%, respectively,
which is very close to the best performing system (inferior
by less than one F-score point). For English, our system
attains precision, recall and F-score values of 89.00%,
90.80% and 89.90%, respectively; this is lower than the best
system by only 0.2 F-score points. For the Bangla-English
pair our system also achieves good F-score values. In a
separate experiment we evaluated the NERC model. We obtain
precision, recall and F-score values of 61.00%, 44.25% and
51.25%, respectively, for the Hindi-English pair using the
ensemble framework. For the Bangla-English pair it yields
precision, recall and F-score values of 54.25%, 43.25% and
48.25%, respectively.
A close investigation of the results shows that our language
identification model suffers from very short word forms,
ambiguities, erroneous words and mixed word forms. Short
words refer to tokens written in their shortened forms.
Ambiguities arise from words that have meanings in both the
native and non-native scripts. Errors are caused by wrong
spellings. The alphanumeric (i.e., mixed) word forms also
contribute to the overall errors. Examples of these errors
are shown in Table 5. The transliteration model makes most
of its errors because of spelling variation (e.g., Pahalaa
[ vs. ]). Inconsistent annotation is another potential
source of errors.
Run ID EQMF-ALL TP TR TF ETPM
Run-1 0.005 0.146 0.76 0.244 1933/2306
Run-2 0.004 0.146 0.76 0.244 1931/2301
Run-3 0.004 0.143 0.736 0.24 1871/2226
Table 4: Results of transliteration for Hindi-English
Type Words Predicted Reference
Short words thrgh H E
Ambiguous words the;ate E;E H;H
Erroneous words implemnt H E
Mixed numeral words 2mar O B
Table 5: Language labeling errors. Here, H-Hindi,
E-English, O-others
4. CONCLUSION
In this paper we presented a brief overview of the system
that we developed as part of our participation in the query
labeling subtask of transliterated search. The proposed
method classifies the input query words by their native
language and back-transliterates the non-English words into
their original script. We used several classification
techniques to solve the problems of language identification
and NERC. For transliteration we used a modified joint
source-channel model. We submitted three runs. Comparisons
with the other systems show that our system achieves quite
encouraging performance for both pairs, Hindi-English and
Bangla-English.
Our detailed analysis suggests that the language
identification module suffers most from the presence of very
short word forms, ambiguities and alphanumeric words. Errors
encountered in the transliteration model could be reduced
considerably by developing a method to handle spelling
variations.
5. REFERENCES
[1] K. Bali, J. Sharma, M. Choudhury, and Y. Vyas. "I am
borrowing ya mixing?" An analysis of English-Hindi code
mixing in Facebook. In First Workshop on Computational
Approaches to Code Switching, EMNLP 2014, page 116, 2014.
[2] A. Ekbal, S. K. Naskar, and S. Bandyopadhyay. A modified
joint source-channel model for transliteration. In
Proceedings of the COLING/ACL, pages 191–198, 2006.
[3] C.-C. Lin, W. Ammar, L. Levin, and C. Dyer. The CMU
submission for the shared task on language identification
in code-switched data. page 80, 2014.
[4] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal.
Challenges in designing input method editors for Indian
languages: The role of word-origin and context. In Advances
in Text Input Methods (WTIM 2011), pages 1–9, 2011.
[5] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal.
Overview and datasets of FIRE 2013 track on transliterated
search. In FIRE-13 Workshop, 2013.
[6] S. V.B., M. Choudhury, K. Bali, T. Dasgupta, and
A. Basu. Resource creation for training and testing of
transliteration systems for Indian languages. pages
2902–2905.

Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...ijrap
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...gerogepatton
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
 
An expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicAn expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicijnlc
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...IJECEIAES
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...kevig
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ijnlc
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
 

Similar to FIRE2014_IIT-P (20)

Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGESSCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdf
 
nlp (1).pptx
nlp (1).pptxnlp (1).pptx
nlp (1).pptx
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian Languages
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
 
An expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabicAn expert system for automatic reading of a text written in standard arabic
An expert system for automatic reading of a text written in standard arabic
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 

FIRE2014_IIT-P

These messages are often written in Roman script for various socio-cultural and technological reasons [4]. Many languages, such as the South and South-East Asian languages, Arabic and Russian, use indigenous scripts in written form. The process of phonetically representing the words of a language in a non-native script is called transliteration. Transliteration, especially into Roman script, is used abundantly on the Web, not only for documents but also for user queries that intend to search for those documents. These problems were addressed in the FIRE-2013 Shared Task on Transliteration [5]. More recent studies show that building computational models for social media content is more challenging because of the nature of the mixing, the presence of non-standard variations in spelling and grammar, and transliteration [1]. The work that we present here relates to the shared task conducted as a continuation of the previous year's task.

1.1 Task Description
This year, two subtasks on Transliterated Search were conducted: the first is query labeling, and the second is an ad-hoc retrieval task for Hindi film lyrics. We participated in the first task, which is briefly described as follows:

Subtask 1: Query Word Labeling. Suppose that a query q: w1 w2 w3 . . . wn is written in the Roman script. The words w1, w2, etc. could be standard English words or transliterated from another language L. The task is to label each word as E or L depending on whether it is an English word or a transliterated L-language word, and then, for each transliterated word, to provide the correct transliteration in the native script (i.e., the script used for writing L). The task also required identifying and classifying named entities of the types person, location, organization and abbreviation.

2. METHODOLOGY
The overall task for query labeling consists of three major components, viz.
Language Identification, NERC and Transliteration. It is to be noted that we did not make use of any domain-specific resources and/or tools, for the sake of domain independence. Below we describe the methodologies that we followed for each of these individual modules.

2.1 Language Identification
The problem of language identification is to determine the language of a given word. The task can be modeled as a classification problem, where each word has to be labeled with one of three classes, namely Hindi (or Bengali), English, and Mixed (denoting mixed characters of English and non-Latin scripts). Our proposed method for language identification is supervised in nature. In particular, we develop systems based on four
different classifiers, namely random forest, random tree, support vector machine and decision tree. We use the Weka implementations (http://www.cs.waikato.ac.nz/ml/weka/) of these classifiers. To further improve performance, we construct an ensemble that combines the decisions of all the classifiers by majority voting. We followed the same approach for both the Hindi-English and Bangla-English pairs. The features that we implemented for language identification are briefly described below:

1. Character n-gram: A character n-gram is a contiguous sequence of n characters extracted from a given word. We extract character n-grams of length one (unigram), two (bigram) and three (trigram), and use these as classifier features.

2. Context word: Local context helps to identify the type of the current word. We use the previous two and next two words as features.

3. Word normalization: Words are normalized in order to capture the similarity between different words that share common properties. Each capital letter is replaced by 'A', each lowercase letter by 'a' and each digit by '0'.

4. Gazetteer-based feature: We compile lists of Hindi, Bengali and English words from the training datasets. A feature vector of length two represents the two gazetteers for the language pair (Hindi-English or Bangla-English). For each token, the respective feature value is set to '1' if the token is present in that gazetteer, and '0' otherwise. Hence, for words that appear in both gazetteers, the feature vector takes the value 1 in both bit positions. Recent studies also suggest that gazetteer-based features can be used effectively for language identification [3].

5. InitCap: This feature checks whether the current token starts with a capital letter.

6. InitPunDigit: A binary-valued feature that checks whether the current token starts with a punctuation mark or a digit.

7.
DigitAlpha: This feature checks whether the current token is alphanumeric.

8. Contains # symbol: This feature checks whether the token contains the symbol #.

The last three features help to recognize tokens that are mixed in nature (i.e., belong to neither Hindi, English nor Bangla). Some examples are: 2mar, #lol, (rozana, etc.

2.2 Named Entity Recognition and Classification
Named Entity Recognition and Classification (NERC) in unstructured texts such as Facebook posts and blogs is more challenging than in the traditional newswire domain. Here the task was to identify named entities (NEs) and classify them into the following categories: Person, Organization, Location and Abbreviation. We use a machine learning model to recognize the first three NE types, and heuristics for the last. It is to be noted that there were many inconsistencies in annotation, and hence we pre-processed the datasets to maintain uniformity. To denote NE boundaries we use the BIO encoding scheme, where B, I and O denote the beginning of, inside of and outside an NE, respectively. We implement the following features for NERC.

1. Local context: The tokens preceding and following the current word are used as features; here we use the previous two and next two tokens.

2. Character n-gram: As in language identification, we use character n-grams of length up to 5 as features.

3. Prefix and suffix: Fixed-length character sequences (here, of length 3) are stripped from the beginning and end of each token and used as classifier features.

4. Word normalization: This feature is defined exactly as for language identification.

5. WordClassFeature: This feature was defined to ensure that words with similar structures belong to the same class. In the first step we normalize all the words following the process mentioned above.
Thereafter, consecutive identical characters are squeezed into a single character. For example, the normalized word AAAaaa is converted to Aa. We found this feature to be effective in the biomedical domain and adapted it directly without any modification. A detailed sensitivity analysis might be useful to study its effectiveness in the current domain.

6. Typographic features: We define a set of features based on the typographic construction of words. We implement the following four: AllCaps (the word consists only of capital letters), AllSmall (the word consists only of lowercase letters), InitCap (the word starts with a capital letter) and DigitAlpha (the word contains both digits and letters).

2.3 Transliteration
A transliteration system takes as input a character string in the source language and generates a character string in the target language as output. The transliteration algorithm [2] that we use is conceptualized as two levels of decoding: segmenting the source and target language strings into transliteration units (TUs), and defining an appropriate mapping between source and target TUs by resolving different combinations of alignments and unit mappings. The TUs are defined by regular expressions. For K aligned TUs (X: source TUs, T: target TUs), we have

P(X, T) = P(x1, x2, ..., xK, t1, t2, ..., tK)
        = P(<x,t>1, <x,t>2, ..., <x,t>K)
        = Π_{k=1..K} P(<x,t>k | <x,t>1 ... <x,t>k−1)

We implement a number of transliteration models that can generate the original Indian-language word (i.e., in an Indic script) from a given English transliteration written in Roman script. The Indic word is divided into TUs that follow the pattern C+M, where C represents a vowel, a consonant or a conjunct and M represents the vowel modifier or matra. An English word is divided into TUs that follow the pattern C*V*, where C represents a consonant and V represents a vowel [2]. The process considers contextual information on both the source and target sides in the form of collocated TUs, computes the probability of transliteration from each source TU to the various candidate target TUs, and chooses the one with maximum probability. The most appropriate mappings between source and target TUs are learned automatically from the bilingual training corpus. The training process yields a decision-tree-like list that outputs, for each source TU, a set of possible target TUs along with the probability of each decision obtained from the training corpus. The transliteration of an input word is obtained by direct orthographic mapping: identifying the equivalent target TU for each source TU in the input and then placing the target TUs in order. We implemented all six models proposed in [2]. Based on experiments on held-out datasets, we selected the following three models for our submitted runs:

Model-I: A monogram model where no context is considered, i.e.,
P(X,T) = Π_{k=1..K} P(<x,t>k)

Model-II: This model considers the next source TU as context:
P(X,T) = Π_{k=1..K} P(<x,t>k | x_{k+1})

Model-III: This model incorporates the previous and next TUs on the source side and the previous target TU as context:
P(X,T) = Π_{k=1..K} P(<x,t>k | <x,t>_{k−1}, x_{k+1})

The overall transliteration process attempts to produce the best output for a given input word using Model-III.
If no transliteration is obtained, we consult Model-II and then Model-I, in that order. If none of these models produces an output, we use a literal transliteration model developed using a dictionary. The process is as follows:

Input: token t labeled as L (Hindi or Bengali) by language identification
Output: transliteration T of the given token

• Step 1: T <- Model-III(t). (Each model takes the input token, divides it into non-native TUs and outputs the native TU for each of them.)
  1.1 If T contains a null value:
    1.1.1 T <- Model-III-withAlignment(t). (Each withAlignment model takes the input token and aligns the source TUs with the target TUs.)
• Step 2: If T contains a null value:
  2.1 T <- Model-II(t)
  2.2 If T contains a null value:
    2.2.1 T <- Model-II-withAlignment(t)
• Step 3: If T contains a null value:
  3.1 T <- Model-I(t)
  3.2 If T contains a null value:
    3.2.1 T <- Model-I-withAlignment(t)
• Step 4: Return T.

3. EXPERIMENTS AND DISCUSSIONS

3.1 Datasets
We submitted runs for two pairs of languages, namely Hindi-English and Bangla-English. For language identification, the FIRE 2014 organizers provided three documents for the Hindi-English pair and two for the Bangla-English pair. For each language pair, the individual documents were merged into a single training file. The training sets consist of 1,004 sentences (20,658 tokens) for Hindi-English and 800 sentences (27,969 tokens) for Bangla-English. The test sets consist of 32,270 and 25,698 tokens for Hindi-English and Bangla-English, respectively. For training the transliteration algorithm we use 54,791 Hindi-English and 19,582 Bangla-English parallel examples. Details of these datasets can be found in [6].

3.2 Results and Analysis
In this section we report the results that we obtained for query word labeling.
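As an illustration, the C*V* segmentation of English words into TUs and the Model-III → Model-II → Model-I backoff just described can be sketched as below. The TU tables here are hypothetical stand-ins for the trained models, and the literal/dictionary fallback is omitted:

```python
import re

def segment_en(word):
    """Split an English word into TUs matching the pattern C*V*:
    a run of consonants followed by a run of vowels."""
    return [tu for tu in re.findall(r'[^aeiou]*[aeiou]*', word.lower()) if tu]

def transliterate(token, models):
    """Back off through an ordered list of models (e.g. Model-III,
    Model-II, Model-I), each mapping a source TU to a native TU.
    The first model that covers every TU of the token wins."""
    tus = segment_en(token)
    for model in models:
        out = [model.get(tu) for tu in tus]
        if None not in out:
            return ''.join(out)
    return None  # the real system falls back to a dictionary model here

# Hypothetical TU tables standing in for trained models
model_iii = {'pa': 'पा'}                 # deliberately incomplete
model_i   = {'pa': 'प', 'tna': 'टना'}    # context-free monogram model

print(segment_en('patna'))                        # ['pa', 'tna']
print(transliterate('patna', [model_iii, model_i]))
```

Model-III fails to cover the TU 'tna', so the cascade falls through to Model-I, which covers both TUs and produces the Devanagari output.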
We submitted three runs, defined as follows:

Run-1: For both language identification and NERC, we construct ensembles using majority voting. If a token is labeled as a native language (Hindi or Bangla), we transliterate it.

Run-2: Language identification is performed by the majority-voting ensemble, and NERC by SMO. Words labeled as native-language are transliterated as before.

Run-3: Both language identification and NERC are carried out using SMO. Transliteration follows the same method.

Run ID | LP    | LR    | LF    | EP    | ER    | EF    | LA
Run-1  | 0.920 | 0.843 | 0.880 | 0.883 | 0.932 | 0.907 | 0.886
Run-2  | 0.922 | 0.843 | 0.881 | 0.884 | 0.931 | 0.907 | 0.886
Run-3  | 0.882 | 0.841 | 0.861 | 0.880 | 0.896 | 0.888 | 0.870

Table 1: Results for language identification of Bangla-English. Here, LP: Language precision, LR: Language recall, LF: Language F-score, EP: English precision, ER: English recall, EF: English F-score, LA: Labeling accuracy.
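The per-label precision, recall and F-score in Table 1 follow the usual token-level definitions (the actual scorer is the one defined in [5], which may differ in details); a minimal sketch on a toy gold/predicted example of our own:

```python
def label_prf(gold, pred, label):
    """Precision, recall and F-score for one label over aligned
    token-label sequences, using the standard definitions."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy sequences, not shared-task data
gold = ['H', 'E', 'H', 'H', 'E']
pred = ['H', 'E', 'E', 'H', 'E']
print(label_prf(gold, pred, 'H'))
```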
Run ID | LP    | LR    | LF    | EP    | ER    | EF    | LA
Run-1  | 0.921 | 0.895 | 0.908 | 0.890 | 0.908 | 0.899 | 0.879
Run-2  | 0.921 | 0.893 | 0.907 | 0.890 | 0.908 | 0.899 | 0.878
Run-3  | 0.905 | 0.865 | 0.885 | 0.860 | 0.886 | 0.873 | 0.857

Table 2: Results for language identification of Hindi-English.

Run ID | EQMF-ALL | TP    | TR    | TF    | ETPM
Run-1  | 0.005    | 0.039 | 0.574 | 0.073 | 228/337
Run-2  | 0.005    | 0.039 | 0.574 | 0.073 | 228/337
Run-3  | 0.005    | 0.038 | 0.582 | 0.071 | 231/344

Table 3: Results of transliteration for Bangla-English. Here, EQMF-ALL: Exact query match fraction, TP: Transliteration precision, TR: Transliteration recall, TF: Transliteration F-score, ETPM: Exact transliteration pair match.

The systems were evaluated using the metrics defined in [5]. Overall results for language identification are reported in Table 1 and Table 2 for the Bangla-English and Hindi-English pairs, respectively. Evaluation shows that the first two runs achieve nearly identical F-scores. Experimental results for transliteration are reported in Table 3 and Table 4 for Bangla-English and Hindi-English, respectively. Comparison with the other submitted systems shows that our performance is towards the upper end. For Hindi we obtain precision, recall and F-score of 92.10%, 89.50% and 90.80%, very close to the best-performing system (lower by less than one F-score point). For English, our system attains precision, recall and F-score of 89.00%, 90.80% and 89.90%, respectively, lower than the best system by only 0.2 F-score points. For the Bangla-English pair our system also achieves impressive F-scores. In a separate experiment we evaluated the NERC model. Using the ensemble framework we obtain precision, recall and F-score of 61.00%, 44.25% and 51.25%, respectively, for the Hindi-English pair. For the Bangla-English pair it yields precision, recall and F-score of 54.25%, 43.25% and 48.25%, respectively.
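The majority-voting ensemble used for Runs 1 and 2 can be sketched as follows; the individual classifier outputs and the tie-break rule are illustrative assumptions (the paper does not specify how ties are resolved):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token label predictions from several classifiers.
    predictions: list of label sequences, one per classifier.
    Ties are broken in favour of the earliest-listed classifier
    (an assumption, not specified in the paper)."""
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = counts.most_common(1)[0][1]
        # among the most frequent labels, take the first classifier's choice
        winner = next(l for l in labels if counts[l] == top)
        combined.append(winner)
    return combined

# Illustrative outputs from four classifiers on a five-token query
rf  = ['H', 'E', 'H', 'E', 'O']
rt  = ['H', 'E', 'E', 'E', 'O']
svm = ['H', 'H', 'H', 'E', 'E']
dt  = ['H', 'E', 'H', 'E', 'O']
print(majority_vote([rf, rt, svm, dt]))  # ['H', 'E', 'H', 'E', 'O']
```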
Our close investigation of the results shows that the language identification model suffers from very short word forms, ambiguities, erroneous words and mixed word forms. Short words refer to tokens written in shortened form. Ambiguities arise from words that have meanings in both the native and non-native scripts. Errors are also caused by wrong spellings, and the alphanumeric (i.e., mixed) word forms often contribute to the overall errors. These errors are illustrated in Table 5. The transliteration model makes most of its errors because of spelling variation (e.g., alternative Devanagari spellings of Pahalaa). Inconsistent annotation is another potential source of errors.

Run ID | EQMF-ALL | TP    | TR    | TF    | ETPM
Run-1  | 0.005    | 0.146 | 0.760 | 0.244 | 1933/2306
Run-2  | 0.004    | 0.146 | 0.760 | 0.244 | 1931/2301
Run-3  | 0.004    | 0.143 | 0.736 | 0.240 | 1871/2226

Table 4: Results of transliteration for Hindi-English.

Type                 | Words    | Predicted | Reference
Short words          | thrgh    | H         | E
Ambiguous words      | the; ate | E; E      | H; H
Erroneous words      | implemnt | H         | E
Mixed numeral words  | 2mar     | O         | B

Table 5: Language labeling errors. Here, H: Hindi, E: English, O: Others.

4. CONCLUSION
In this paper we presented a brief overview of the system that we developed as part of our participation in the query labeling subtask on transliterated search. The proposed method classifies input query words by their native language and back-transliterates non-English words to their original script. We used several classification techniques to solve the problems of language identification and NERC. For transliteration we used a modified joint source-channel model. We submitted three runs. Comparison with the other systems shows that our system achieves quite encouraging performance for both pairs, Hindi-English and Bangla-English. Our detailed analysis suggests that the language identification module suffers most from the presence of very short word forms, ambiguities and alphanumeric words.
The errors encountered by the transliteration model could be substantially reduced by developing a method to handle spelling variations.

5. REFERENCES
[1] K. Bali, J. Sharma, M. Choudhury, and Y. Vyas. "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In First Workshop on Computational Approaches to Code Switching, EMNLP 2014, page 116, 2014.
[2] A. Ekbal, S. K. Naskar, and S. Bandyopadhyay. A modified joint source-channel model for transliteration. In Proceedings of the COLING/ACL, pages 191–198, 2006.
[3] C.-C. Lin, W. Ammar, L. Levin, and C. Dyer. The CMU submission for the shared task on language identification in code-switched data. page 80, 2014.
[4] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Challenges in designing input method editors for Indian languages: The role of word-origin and context. In Advances in Text Input Methods (WTIM 2011), pages 1–9, 2011.
[5] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Overview and datasets of FIRE 2013 track on transliterated search. In FIRE-13 Workshop, 2013.
[6] S. V.B., M. Choudhury, K. Bali, T. Dasgupta, and A. Basu. Resource creation for training and testing of
transliteration systems for Indian languages, pages 2902–2905.