Word level language identification in
bilingual code-switched texts
PACLIC 2014 (12-14 December)
Harsh Jhamtani – IIT Roorkee / Adobe Research India
Suleep Kumar Bhogi - IIT Roorkee / Samsung Research
Dr. Vaskar Raychoudhury - IIT Roorkee
Contact: harshjhamtani@gmail.com
1
Introduction
Spanish-English
Tamil-French
Hopi-Tewa
Marta: Ana, if I leave her here would you send her upstairs when you leave?
Zentella: I’ll tell you exactly when I have to leave, at ten o’clock. Y son las nueve y
cuarto. ("And it’s nine fifteen.")
Selvamani: Parce que n’importe quand quand j’enregistre ma voix ça l’aire d’un garçon.
([in French] "Because whenever I record my voice I sound like a guy.")
Alors, TSÉ, je me ferrai pas poigné ("So, you know, I’m not going to be had.")
[laughter]
Selvamani: ennatā, ennatā, enna romba ciritā? ([in Tamil] "What, what, what's so
funny?")
Speaker A: Tututqaykit qanaanawakna. ([in Hopi] "Schools were not wanted.")
Speaker B: Wédít’ókánk’egena’adi imbí akhonidi. ([in Tewa] "They didn’t want a school on their land.")
2
Hinglish
An instance of Linguistic Code Switching: HINGLISH
Why has Hinglish become popular?
• Mingling of Western culture into Indian culture.
• Youth are the most influenced group.
• It is a more comfortable mode of expression.
3
Main main temple ke pass hoon
मैं main temple के पास हूँ
मैं (H)
Means: "I am near the main temple"
4
Problem statement
Word-level identification of language
- For Hindi words, also identify the authentic (Devanagari) script
5
Main main temple ke pass hoon
H    E    E      H  H    H
Our system also recovers the Devanagari forms of the Hindi words: मैं, के, पास, हूँ
Why solve this problem?
How can analyzing such texts help?
• Sentiment analysis
• Opinion mining
• Information retrieval: analyzing search queries
• Machine translation of such texts
6
• 125 million Indians know English in addition to their mother tongue
• More than 50 million Indians know both English and Hindi
• 43 million Indians are on Facebook
Issue: modeling and analyzing such language is tough
Main Challenges faced
• Inconsistent spelling usage
• Ambiguous word usage
• Current technology is not good enough
• A simple English dictionary lookup cannot help
7
Related work: Language Identification (taxonomy)
• Spoken Language / Written Texts
• Written texts: Document Level / Word Level
• Word level: Foreign-Inclusion / Code-Switched
• Approaches: identifying LCS points; transliteration similarity metric
8
King et al.
• Character n-grams
• Full word
• Non-word characters between words
• Conditional Random Fields trained with Generalized Expectation Criteria (Druck et al., 2008)
• Semi- and weakly-supervised training method for CRFs
Example word: "horse"
Unigrams: "h", "o", "r", "s", "e"
Bigrams: "_h", "ho", "or", "rs", "se", "e_"
Trigrams: "_ho", "hor", "ors", "rse", "se_"
4-grams: "_hor", "hors", "orse", "rse_"
5-grams: "_hors", "horse", "orse_"
Context example: "the horse, ‘94 bred"
Before the word: "space_present"
After the word: "comma_present", "space_present", "apostrophe_present", "4_present"
Full word: "horse"
9
Source: King & Abney
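As a concrete sketch of these character n-gram features (my reconstruction of the enumeration above, not King & Abney's actual code), a word can be padded with "_" boundary markers and scanned with a sliding window:

```python
def char_ngrams(word, n):
    """Character n-grams of `word`; sizes > 1 include '_' boundary markers,
    matching the enumeration on this slide (unigrams are taken unpadded)."""
    if n == 1:
        return list(word)
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# e.g. char_ngrams("horse", 3) yields ["_ho", "hor", "ors", "rse", "se_"]
```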
Data sets
Data set 1: Hinglish sentences: a data set of 500 Hinglish sentences containing a total of 3,287 words, e.g. rupee(E) ki(H)=की value(E) giri(H)=गिरी. We evaluated on an additional 1,000 Hinglish sentences.
Data set 2: Transliteration pairs: multiple transliterated forms of the same Hindi word, e.g. tera, thera, teraa, teraaa : तेरा
Data set 3: Hindi dictionary: a Hindi word-frequency list, e.g. लेने 2226
Data set 4: Transliterated Hindi dictionary: standard transliterated forms of all words in Data set 3, generated using ITRANS rules.
Data set 5: English dictionary: a standard dictionary of 207,824 English words along with their frequencies, computed from a large news corpus.
10
Proposed solution (functional diagram)
• Classifier 1 takes Hinglish sentences; its features are the Modified Edit Distance, English Score, Hindi Score, and N-grams; it outputs P(E, w).
• POS taggers for English and Hindi produce POS-tagged Hinglish sentences.
• Classifier 2 combines P(E, w) with the POS-tagged sentences and outputs the identified language of each word.
11
Common spelling substitutions
Edit Distance("khushbu", "khushboo"): dissimilarity score = 2 (same word)
Edit Distance("khushbu", "khushi"): dissimilarity score = 2 (different words)
Observation: people often substitute 'u' in place of 'oo' when writing transliterated forms.
The Edit Distance algorithm does not capture this fact.
We call this type of substitution a common spelling substitution.
12
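The limitation can be checked with a plain Levenshtein implementation (a standard textbook version, not the paper's code): both pairs come out at distance 2, even though only one pair is the same word.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

# Same word, different spellings: edit_distance("khushbu", "khushboo") == 2
# Different words entirely:       edit_distance("khushbu", "khushi")   == 2
```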
Classifier 1
• Modified Edit Distance
• English Score
• Hindi Score
• N-grams
13
Method to generate substitution pairs
Data set 2 (multiple transliteration forms) → enumerate substitution pairs → assign a cost for using each substitution pair
14
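A hypothetical sketch of the enumeration step, using difflib to align variant spellings pairwise and collect the differing substrings as substitution pairs (the paper's exact alignment procedure is not specified here):

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def substitution_pairs(variants):
    """Count the substring pairs that differ between each pair of
    transliteration variants of the same Hindi word (Data set 2 style)."""
    counts = Counter()
    for a, b in combinations(variants, 2):
        for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
            if op != "equal":            # 'replace', 'insert' or 'delete'
                counts[(a[i1:i2], b[j1:j2])] += 1
    return counts

# e.g. substitution_pairs(["tera", "thera"]) finds the pair ("", "h"):
# writers sometimes insert an 'h' after 't'.
```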
String dissimilarity feature
Cost for substitution s = c1(s); normal cost = c2; cut-off threshold = subst_th
1. Get a list of possible spelling substitutions along with their frequency counts.
2. Extract those substitutions which occur more than subst_th times.
3. For each available Hindi transliteration ht, calculate the cost of matching word w with ht. The feature is the minimum cost over all ht. Also store the corresponding transliteration.
The substitution cost decreases with frequency: c1(s) = k / log(freq(s))
15
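A sketch of a modified edit distance that honors such substitution pairs; the pair table and its cost below are illustrative stand-ins for c1(s) = k / log(freq(s)), not the paper's learned values.

```python
def modified_edit_distance(a, b, subst_cost, normal_cost=1.0):
    """Edit distance where listed substring substitutions (in either
    direction) are cheaper than `normal_cost` single-character edits."""
    table = dict(subst_cost)
    table.update({(y, x): c for (x, y), c in subst_cost.items()})
    INF = float("inf")
    d = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if d[i][j] == INF:
                continue
            if i < len(a):   # delete a[i]
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + normal_cost)
            if j < len(b):   # insert b[j]
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + normal_cost)
            if i < len(a) and j < len(b):   # match or substitute one character
                step = 0.0 if a[i] == b[j] else normal_cost
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + step)
            for (x, y), c in table.items():  # discounted spelling substitution
                if a.startswith(x, i) and b.startswith(y, j):
                    d[i + len(x)][j + len(y)] = min(d[i + len(x)][j + len(y)],
                                                    d[i][j] + c)
    return d[len(a)][len(b)]

# With a cheap ("oo", "u") pair, "khushbu" is now much closer to "khushboo"
# than to "khushi", which plain edit distance could not tell apart.
```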
Classifier 1
• Modified Edit Distance
• English Score
• Hindi Score
• N-grams
16
English and Hindi scores
This feature addresses the ambiguous-word-usage problem.
Example:
• दूर (Hindi) is transliterated as "door"
• door (English) is also written "door"
If we know that door(E) is a frequent word and दूर is not a frequent word, then "door" corresponds to door(E) with greater probability.
We also tried to leverage an estimated language distribution of words in the test data.
17
Frequency of words as a feature
For a given word w in the test data:
English score:
eng_score(w) = score(w), if w ∈ English dictionary; 0 otherwise
score(w) = log(freq(w)) / M, where M = max(log(freq(v))) over all v ∈ English dictionary
Hindi score: first, the modified Edit Distance algorithm identifies the closest-matching Hindi word hw for the given word w in the test data; then
hindi_score(w) = score(hw)
score(hw) = log(freq(hw)) / M, where M = max(log(freq(q))) over all q ∈ Hindi dictionary
18
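A toy illustration of these two scores with made-up mini-dictionaries (the real system uses Data sets 3 and 5, and the Hindi match comes from the modified edit distance):

```python
import math

# Invented frequencies, purely for illustration.
ENGLISH_FREQ = {"door": 120000, "temple": 8000, "main": 500000}
HINDI_FREQ = {"दूर": 900, "मैं": 250000}

def _score(freq_dict, word):
    m = max(math.log(f) for f in freq_dict.values())   # M in the slide
    return math.log(freq_dict[word]) / m

def eng_score(w):
    return _score(ENGLISH_FREQ, w) if w in ENGLISH_FREQ else 0.0

def hindi_score(closest_hindi_word):
    # `closest_hindi_word` is the hw found by the modified edit distance
    return _score(HINDI_FREQ, closest_hindi_word)
```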
KDE graphs
19
Classifier 1
• Modified Edit Distance
• English Score
• Hindi Score
• N-grams
20
N-grams (sequences of characters)
Sequences of letters that occur more frequently are a characteristic feature of a language.
Learn the most frequent bigrams, trigrams, and 4-grams from the training data set.
For any term t (an N-gram) in word w, the Delta TF-IDF score V is computed as
V = n(t, w) * log2(Ht / Et)
where n(t, w) is the frequency count of term t in word w, and Ht and Et are the numbers of occurrences of term t in the Hindi and English dictionaries, respectively.
21
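The Delta TF-IDF formula can be sketched directly; the Ht/Et counts below are invented for illustration, not taken from the dictionaries:

```python
import math

def delta_tfidf(word, term, h_count, e_count):
    """V = n(t, w) * log2(Ht / Et), per the slide; positive values lean
    Hindi, negative values lean English."""
    n_tw = sum(1 for i in range(len(word) - len(term) + 1)
               if word[i:i + len(term)] == term)       # n(t, w)
    return n_tw * math.log2(h_count / e_count)

# "oo" occurring 4x in Hindi transliterations vs. 1x in English (made up):
# delta_tfidf("hoon", "oo", 4, 1) gives a Hindi-leaning positive score.
```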
Classifier 2
(The functional diagram repeated: Classifier 1, with features Modified Edit Distance, English Score, Hindi Score, and N-grams, produces P(E, w) from the Hinglish sentences; the English and Hindi POS taggers produce POS-tagged Hinglish sentences; Classifier 2 combines both to output the identified language of each word.)
22
Parts of speech (POS)
All the features discussed so far focus on individual words of a code-switched sentence.
Here we observe that the language usage of a word follows certain POS-based patterns.
We analyzed context-based hypotheses on code-switched sentences to show that POS helps identify the language of words.
23
A few hypotheses
1. Words on the two sides of conjunctions such as 'and', 'aur', etc. are usually of the same language as the conjunction.
ready hai? editing and all done?
hello world ya kuch aur?
padmasana yoga technique and posture
2. Direct objects are switched very often.
.. court ne election-commission...
3. A determiner and the corresponding noun/verb are of the same language.
You would rather say "that chair" instead of "that kursi".
4. Adjectives/adverbs are usually of the same language as the noun/verb they modify.
"Beta good boy banke dikha" and not ".. good ladka .."
(Chart: 80% of instances follow the hypotheses; 20% do not.)
24
24
How to leverage this?
Identifier = Language_POS
e.g. English noun => E_NN
e.g. Hindi adjective => H_JJ
From the POS-tagged training data set, learn the most frequent bigrams and trigrams of identifiers.
Results from the data set:
Most frequent bigram: H_JJ; E_NN -> 222
Most frequent trigram: H_NN; H_PSP; E_NN -> 61
25
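Counting identifier n-grams over a POS-tagged training corpus can be sketched as follows (the two tagged sentences are toy stand-ins for the annotated data):

```python
from collections import Counter

def identifier_ngrams(tagged_sentences, n):
    """Frequency of Language_POS identifier n-grams across sentences."""
    counts = Counter()
    for sent in tagged_sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

corpus = [["H_PRP", "H_JJ", "E_NN"],     # toy tagged Hinglish sentence
          ["H_NN", "H_PSP", "E_NN"]]     # toy tagged Hinglish sentence
bigrams = identifier_ngrams(corpus, 2)
trigrams = identifier_ngrams(corpus, 3)
```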
An Illustration
"Main(H) main(E) temple ke pass hoon"
Consider the following cases:
Case 1: Main(H_PRP) main(E_JJ) temple(E_NN)
Case 2: Main(E_JJ) main(E_JJ) temple(E_NN)
Case 3: Main(H_PRP) main(H_PRP) temple(E_NN)
Case 4: Main(E_JJ) main(H_PRP) temple(E_NN)
*H_PRP = Hindi pronoun, E_JJ = English adjective, E_NN = English noun
(Chart: relative counts of the identifier bigrams E_JJ;H_PRP, E_JJ;E_JJ, H_PRP;H_PRP, E_JJ;E_NN, H_PRP;E_NN)
26
26
Classifier 2
Each word wi may be Hindi (wiH) or English (wiE). For a sentence S = w1 w2 w3, consider a candidate assignment such as S1 = w1H w2E w3H:
POS(S1) = POS_w1H POS_w2E POS_w3H
P(S1) = P(w1 = w1H) * P(w2 = w2E) * P(w3 = w3H)
CS(S1) = P(POS_w2E | POS_w1H) * P(POS_w3H | POS_w2E)
Score(S1) = P(S1) * CS(S1)
Example: for the assignment main(H) main(E) temple(E),
CS(S1) = P(E_JJ | H_PRP) * P(E_NN | E_JJ)
Score(S1) = P(S1) * CS(S1)
27
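A hedged sketch of this decoding step: enumerate every H/E labeling of the sentence, score each by the product of per-word probabilities and POS-bigram transition probabilities, and keep the best. All probability tables below are invented for illustration, not the paper's learned values.

```python
from itertools import product

# Toy tables: WORD_PROBS stands in for Classifier 1's P(E, w) and 1 - P(E, w);
# POS_OF and TRANSITION stand in for the POS taggers and the
# identifier-bigram statistics. All numbers are made up.
WORD_PROBS = {"main": {"H": 0.5, "E": 0.5},
              "temple": {"H": 0.05, "E": 0.95}}
POS_OF = {("main", "H"): "H_PRP", ("main", "E"): "E_JJ",
          ("temple", "H"): "H_NN", ("temple", "E"): "E_NN"}
TRANSITION = {("H_PRP", "E_JJ"): 0.4, ("E_JJ", "E_NN"): 0.6,
              ("E_JJ", "E_JJ"): 0.05, ("H_PRP", "H_PRP"): 0.2,
              ("E_JJ", "H_PRP"): 0.1, ("H_PRP", "E_NN"): 0.3}

def best_assignment(words):
    """Enumerate every H/E labeling, score it by P(S) * CS(S), keep the best."""
    best, best_score = None, -1.0
    for langs in product("HE", repeat=len(words)):
        p = 1.0
        for w, lang in zip(words, langs):
            p *= WORD_PROBS[w][lang]                  # P(wi = wiL) factors
        tags = [POS_OF[(w, lang)] for w, lang in zip(words, langs)]
        for prev, cur in zip(tags, tags[1:]):
            p *= TRANSITION.get((prev, cur), 1e-6)    # CS(S) factors
        if p > best_score:
            best, best_score = langs, p
    return best, best_score
```

With these toy numbers, "main main temple" is labeled H, E, E, matching Case 1 on the illustration slide.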
RESULTS
28
Baseline: Presence in English dictionary
No. of words (K) HPRE HRE HF1 EPRE ERE EF1
Top 100 0.74 0.98 0.84 0.31 0.02 0.04
Top 500 0.76 0.96 0.85 0.59 0.15 0.24
Top 1000 0.79 0.95 0.86 0.69 0.28 0.40
Top 5000 0.85 0.93 0.89 0.75 0.55 0.64
Top 10000 0.87 0.86 0.87 0.63 0.64 0.63
ALL 0.92 0.39 0.55 0.35 0.91 0.50
HPRE, HRE, HF1 = precision, recall, and F1 score for the Hindi class; EPRE, ERE, EF1 = precision, recall, and F1 score for the English class
29
King & Abney’s performance on the data set
HPRE HRE HF1 EPRE ERE EF1
Decision Tree 0.598 0.783 0.678 0.243 0.117 0.158
Naïve Bayes 0.663 0.832 0.738 0.390 0.202 0.266
Logistic Regression 0.422 0.681 0.521 0.097 0.035 0.052
Random Forest
(n_estimators=10)
0.636 0.793 0.706 0.243 0.128 0.168
Technique HPRE HRE HF1 EPRE ERE EF1
HMM 0.752 0.915 0.826 0.589 0.287 0.386
CRF 0.759 0.963 0.849 0.762 0.278 0.408
30
Classifier 1: using character n-grams only (first table) and all features (second table)
HPRE HRE HF1 EPRE ERE EF1
Decision Tree 0.848 0.799 0.823 0.466 0.552 0.505
Naïve Bayes 0.892 0.484 0.627 0.335 0.816 0.475
Logistic Regression 0.941 0.582 0.719 0.403 0.885 0.554
Random Forest 0.896 0.791 0.840 0.521 0.713 0.602
HPRE HRE HF1 EPRE ERE EF1
Decision Tree 0.959 0.934 0.946 0.809 0.874 0.840
Naïve Bayes 0.976 0.744 0.844 0.539 0.943 0.686
Logistic Regression 0.919 1.000 0.958 1.000 0.724 0.840
Random Forest 0.954 0.982 0.968 0.9379 0.851 0.892
31
Performance of Classifier 2
f1-score (Hindi class) f1-score (English class)
Identifier bi-grams 0.966 0.897
Identifier tri-grams 0.974 0.919
32
Error Analysis
Model | NE* (% of total misclassified words)
C_F1 = n-grams | 10.76%
C_F2 = n-grams + eng_score + hindi_score + dissimilarity_score | 11.49%
C_F3 = C_F1 + C_F2 | 27.00%
33
*NE -> Named entities
Performance of Modified Edit Distance
(Chart: percentage of words correctly matched; y-axis ranges from 40% to 56%.)
34
Comparing four methods to create substitution pairs
k | Method 1 (N, P) | Method 2 (N, P) | Method 3 (N, P) | Method 4 (N, P)
0.1 | 1298, 53.64 | 1305, 53.93 | 1235, 51.03 | 1050, 43.39
0.5 | 1298, 53.64 | 1305, 53.93 | 1206, 49.83 | 1073, 44.34
1 | 1296, 53.55 | 1307, 54.01 | 1269, 52.44 | 1139, 47.07
1.5 | 1296, 53.55 | 1309, 54.09 | 1290, 53.31 | 1197, 49.46
(Chart: P vs. k for Methods 1-4.)
35
Conclusions
Word-level language identification is an important problem to solve for further analysis of such texts.
We devised efficient techniques for word-level language identification and for finding the authentic Hindi script.
We plan to test the work on other language pairs, and to extend it further to cases with more than two languages switched in the same text.
We plan to extend the work to incorporate sentence-level contextual features.
36
Editor's Notes

• #2 Code switching vs. code mixing
• #4 Because the computer can't understand my language.
• #13 But we know "khushbu" is much closer to "khushboo" than it is to "khushi".
• #19 Using formula (1), we compute the value corresponding to this feature and call it the English score (eng_score). First, we apply logarithmic scaling to the frequencies of English words to reduce the skew toward large values. Then we normalize the word-frequency values using the largest frequency registered.
• #26 We annotated the data set with POS taggers: for English, the Stanford NLP POS tagger (taking the most frequent POS usage); for Hindi, the POS tagger by Siva Reddy.