In which languages are these texts?
Malgranda Sablodezerto estas
dezerto de Okcidenta Aŭstralio
Esperanto
Po nepavykusių pirmųjų
bandymų su kukurūzais
Lithuanian

俄罗斯眼下不具备航母建造、
停泊和维护所需的基础设施和条件
Simplified Chinese
임금체계 개편은 기본적으로
노사 합의 또는
Korean

‫جلوگیری‬ .‫کردند‬ ‫گروه‬ ‫دوم‬ ‫هم‬ ‫به‬
Persian
আেবদনকারীেদর পক্েষ শুনািন কেরন িফদা
Bengali

ဦးသိန္းစိန္အစိုးရရဲ
ဝန္ကီးအမ်ားစုဟာ စစ္ဗုိလ္နဲ
စစ္ဗိုလ္လူထြက္ေတြ
Burmese
આ રસ મ લ િનચોડી સારી
રી િમકસ કરો અ લાસમ
Gujarati

Approaches
Using a dictionary of words for each language:
Problem: the sheer number of word forms!
Using language features:
compute unigrams, bigrams, trigrams, …;
compute short words;
compute word beginnings or terminations;
Then use language models:
Naïve Bayes;
Hidden Markov Models (HMM);
Support Vector Machines (SVM);
Neural Networks (NN);

Motivation for a new tool
lack of a decent identification tool for Perl;
use of the Chrome Language Detection library is limited:
how to add new languages?
how to restrict results to specific languages?
there are tools for other programming languages, but:
language interoperability can be a hassle;
it is not clear how to add new languages;

Why use a Neural Network?
to learn how Neural Networks work!
an approach where:
training is tedious and slow;
identification is easy to implement;
identification is efficient when BLAS is available;
therefore:
it is possible to use the trained data in different programming languages;
it is easy to restrict the analysis to a set of languages;
identification probabilities are comparable;
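As an illustration of the "easy to restrict" point, here is a minimal sketch that assumes the network yields one score per language. The function name `identify`, the dictionary layout and the score values are hypothetical and do not come from the tool itself.

```python
def identify(scores, allowed=None):
    """Return (language, score) for the highest-scoring language.

    scores:  dict mapping a language code to a network output (hypothetical values).
    allowed: optional iterable of language codes to restrict the answer to.
    """
    if allowed is not None:
        allowed = set(allowed)
        scores = {lang: s for lang, s in scores.items() if lang in allowed}
    if not scores:
        raise ValueError("no candidate languages left after restriction")
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical output activations for one text.
outputs = {"pt": 0.48, "pt-br": 0.51, "es": 0.07, "ru": 0.01}
print(identify(outputs))                        # unrestricted answer
print(identify(outputs, allowed=["pt", "es"]))  # answer restricted to two languages
```

Restriction is just a filter over the output layer, so no retraining is needed to narrow the candidate set.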

Neural Network Architecture
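[Figure: feed-forward network. Input layer (features) x1, x2, x3, ..., xn; hidden layer a(2)1, a(2)2, a(2)3, ..., a(2)s2; output layer y1, y2, ..., yK. Θ(1) holds the weights between the input and hidden layers, Θ(2) those between the hidden and output layers.]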
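A forward pass through a network of this shape (n inputs, s2 hidden units, K outputs, weight matrices Θ(1) and Θ(2)) could look like the sketch below. The sigmoid activation and the bias weight stored in column 0 of each matrix are assumptions, since the slides do not state them, and the random weights are placeholders rather than trained values.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, activations):
    """Apply one layer. theta has one row per output unit; column 0 is the
    bias weight (an assumption, not stated in the slides)."""
    with_bias = [1.0] + list(activations)
    return [sigmoid(sum(w * a for w, a in zip(row, with_bias))) for row in theta]


def forward(theta1, theta2, x):
    hidden = layer(theta1, x)       # a(2): s2 hidden activations
    return layer(theta2, hidden)    # y: K output activations


random.seed(0)
n, s2, K = 4, 3, 2                  # toy sizes, for illustration only
theta1 = [[random.uniform(-0.5, 0.5) for _ in range(n + 1)] for _ in range(s2)]
theta2 = [[random.uniform(-0.5, 0.5) for _ in range(s2 + 1)] for _ in range(K)]
y = forward(theta1, theta2, [1.0, 0.0, 0.0041, 0.0038])
print(y)
```

Identification only needs these two matrix-vector products, which is why it is cheap compared with training.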

Preparing Training Data
texts from the TED website;
more than 105 languages available!
English texts were matched against an English dictionary;
out-of-vocabulary (OOV) items are removed from the English texts and from the
texts in other languages (an attempt to remove named entities written in their
English form from the other texts).
Example
…began spoken word poet Sarah Kay, in a talk that inspired two
standing ovations at TED2011. She tells the story of her
metamorphosis — from a wide-eyed teenager soaking in verse at
New York's Bowery Poetry Club to a teacher connecting kids with
the power of self-expression through Project V.O.I.C.E. — and
gives two breathtaking performances of "B" and "Hiroshima."
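The OOV filtering step could be sketched as follows. The toy dictionary, the sample sentences, the `\w+` tokenization and the case-sensitive matching are all assumptions made for illustration; the slides do not describe the actual implementation.

```python
import re

WORD = re.compile(r"\w+")


def oov_items(english_text, dictionary):
    """Tokens of the English text that are not in the (lowercased) dictionary."""
    return {tok for tok in WORD.findall(english_text) if tok.lower() not in dictionary}


def remove_items(text, items):
    """Drop every token of `text` that appears in `items`; keep the rest verbatim."""
    return WORD.sub(lambda m: "" if m.group(0) in items else m.group(0), text)


dictionary = {"began", "spoken", "word", "poet", "in", "a", "talk"}  # toy dictionary
english = "began spoken word poet Sarah Kay, in a talk"
portuguese = "começou a poeta Sarah Kay, numa palestra"              # made-up sentence

oov = oov_items(english, dictionary)
print(sorted(oov))
print(remove_items(portuguese, oov))
```

The names found to be OOV in the English text are removed from the parallel text as well, so they do not leak English trigrams into the other language's training data.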

Two Kinds of Features
Alphabet Used
Which characters are used in the text?
Are they typically used in Asian, Arabic or Latin text?
Character Sequences Used
Which unigrams, bigrams or trigrams are used?
Which are the most common for each language?

Alphabet Features
Count the number of Unicode characters in the following classes:
C1 Latin characters, only a-z, without diacritics;
C2 Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
C3 Hiragana and Katakana characters (0x3040-0x30FF);
C4 Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF,
0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
C5 Kanji characters (0x4E00-0x9FAF);
C6 Simplified Chinese characters (2877 hand-defined characters);
C7 Traditional Chinese characters (2663 hand-defined characters);
C8 Arabic characters (0x0600-0x06FF);
C9 Thai characters (0x0E00-0x0E7F);
C10 Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
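Counting characters per class from the ranges listed above could be sketched like this. The hand-defined Simplified and Traditional Chinese character lists are not reproduced in the slides, so they are passed in as parameters (empty by default). Lowercasing the text before counting is also an assumption.

```python
CLASS_NAMES = ["C%d" % i for i in range(1, 11)]

# Code point ranges copied from the class definitions above.
CLASS_RANGES = {
    "C1": [(0x61, 0x7A)],                                   # Latin a-z
    "C2": [(0x0410, 0x042F), (0x0430, 0x044F)],             # Cyrillic
    "C3": [(0x3040, 0x30FF)],                               # Hiragana, Katakana
    "C4": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
           (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],             # Hangul
    "C5": [(0x4E00, 0x9FAF)],                               # Kanji
    "C8": [(0x0600, 0x06FF)],                               # Arabic
    "C9": [(0x0E00, 0x0E7F)],                               # Thai
    "C10": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],            # Greek
}


def count_classes(text, simplified=frozenset(), traditional=frozenset()):
    """Count characters per class. `simplified` and `traditional` stand in for
    the hand-defined character lists behind C6 and C7."""
    counts = {name: 0 for name in CLASS_NAMES}
    for ch in text.lower():                 # lowercasing is an assumption
        cp = ord(ch)
        for name, ranges in CLASS_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
        if ch in simplified:
            counts["C6"] += 1
        if ch in traditional:
            counts["C7"] += 1
    return counts


print(count_classes("abc Привет"))
```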

Binarization of Alphabet Features
To reduce entropy in the NN, alphabet features are binarized using a set of rules:
set C1 ⇐ C1 > 0.20
set C2 ⇐ C2 > 0.20
set C3 ⇐ C3 > 0.20
set C4 ⇐ C4 > 0.20
set C6 ⇐ C5 > 0.30 ∧ C6 > C7
set C7 ⇐ C5 > 0.30 ∧ C6 < C7
set C8 ⇐ C8 > 0.20
set C9 ⇐ C9 > 0.20
set C10 ⇐ C10 > 0.20
where
set Ci ⇔ Ci ← 1 ∧ ∀ j ≠ i: Cj ← 0
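A sketch of these rules, under several assumptions that the slides leave open: the comparisons are read as "greater than" the threshold (and C6 versus C7 as strict inequalities), the Ci are relative frequencies, rules are applied in the listed order so that a later matching rule overrides an earlier one, and every class stays at 0 when no rule fires.

```python
def binarize(c, threshold=0.20, han_threshold=0.30):
    """Map relative frequencies for 'C1'..'C10' to a 0/1 value per class.
    At most one class ends up set to 1 (the 'set Ci' semantics)."""
    names = ["C%d" % i for i in range(1, 11)]
    winner = None
    for name in ("C1", "C2", "C3", "C4"):
        if c.get(name, 0.0) > threshold:
            winner = name
    if c.get("C5", 0.0) > han_threshold:
        c6, c7 = c.get("C6", 0.0), c.get("C7", 0.0)
        if c6 > c7:
            winner = "C6"       # Simplified Chinese
        elif c6 < c7:
            winner = "C7"       # Traditional Chinese
    for name in ("C8", "C9", "C10"):
        if c.get(name, 0.0) > threshold:
            winner = name
    return {name: int(name == winner) for name in names}


print(binarize({"C1": 0.93}))
print(binarize({"C5": 0.80, "C6": 0.50, "C7": 0.10}))
```

Note that C5 never becomes 1 on its own here: it only gates the choice between C6 and C7.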

Trigram Features
Why Trigrams?
bigrams would be too small to tell apart very close
languages such as Portuguese and Spanish;
tetragrams would be too big for some languages (such as Asian ones),
where some glyphs represent words or morphemes;
as punctuation and numbers were removed and spaces
normalized, trigrams also capture the beginnings and ends
of words, as well as single-character
words surrounded by spaces.
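The normalization and trigram extraction described above could be sketched as below. Lowercasing, the exact regular expression and the absence of padding at the text boundaries are assumptions. The regular expression is also a simplification that would drop combining marks used in some scripts.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, then turn every run of punctuation, digits and whitespace
    into a single space."""
    text = text.lower()                      # assumption: case is folded
    text = re.sub(r"[\W\d_]+", " ", text)
    return text.strip()


def trigram_frequencies(text):
    """Relative frequency of each character trigram in the normalized text."""
    clean = normalize(text)
    grams = Counter(clean[i:i + 3] for i in range(len(clean) - 2))
    total = sum(grams.values())
    return {g: n / total for g, n in grams.items()} if total else {}


freqs = trigram_frequencies("Wir haben Künstler, aber leider haben wir sie noch nicht entdeckt.")
for gram, freq in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(gram), round(freq, 5))
```

Because spaces are kept, trigrams such as "en " or " wi" mark word endings and beginnings.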

Trigram Features: example
Für mich war das eine neue Erkenntnis. Und ich denke, mit der
Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider
haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur
eine Form kultureller Integration. Wir haben erkannt, dass seit
kurzem immer mehr Leutea
Top occurring trigrams
en␣ 0.02299 er␣ 0.02682 ␣de 0.01533 abe 0.01533 der 0.01149
hab 0.01149 ich 0.01149 ir␣ 0.01149 it␣ 0.01149 r␣h 0.01149
␣wi 0.01149 ben 0.01149 ch␣ 0.01149 den 0.01149 wir 0.01149
␣ha 0.01149 ine 0.00766 ler 0.00766 lle 0.00766 n␣k 0.00766
mme 0.00766 ne␣ 0.00766 nnt 0.00766 r␣l 0.00766 r␣m 0.00766

Trigram Features: Merging
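features ← {};
for L ∈ Languages do
    trigrams ← ∅;
    for file ∈ Files_L do
        T ← computeTrigrams(file);    // Str → ℕ (trigram counts)
        T ← mostOccurring(T);         // top 30 trigrams of this file
        for t ∈ keys(T) do
            trigrams[t] ← trigrams[t] + 1;    // number of files in which t is a top trigram
    trigrams ← mostOccurring(trigrams);    // most widespread trigrams for language L
    features ← features ∪ keys(trigrams);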
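A runnable reading of the merging procedure is sketched below. It interprets the final `mostOccurring` call as applying to the per-language counter, and the per-language cut-off `top_per_language` is a hypothetical parameter, since the slides only give the per-file limit of 30. `compute_trigrams` here is a bare stand-in without any text normalization.

```python
from collections import Counter


def compute_trigrams(text):
    """Trigram counts for one file (stand-in for computeTrigrams)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def most_occurring(counts, n):
    """Keep the n entries with the highest counts."""
    return dict(Counter(counts).most_common(n))


def merge_features(files_by_language, top_per_file=30, top_per_language=30):
    features = set()
    for language, texts in files_by_language.items():
        in_how_many_files = Counter()
        for text in texts:
            top = most_occurring(compute_trigrams(text), top_per_file)
            in_how_many_files.update(top.keys())    # +1 per file where t is a top trigram
        features |= set(most_occurring(in_how_many_files, top_per_language))
    return features


toy = {"xx": ["abcabc", "abcd"], "yy": ["zzzz"]}    # made-up "files"
print(sorted(merge_features(toy, top_per_file=2, top_per_language=1)))
```

The union over languages is what fixes the trigram columns of the training matrix.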

Training Data Matrix (excerpt)
      Alphabet Features      Trigram Features
      Latin  Greek  Cyril.   ␣pa     ới␣     par     nia     ест     ати.    ата
PT    1      0      0        0.0041  0       0.0038  0.0001  0       0       0
PT    1      0      0        0.0039  0       0.0036  0       0       0       0
RU    0      0      1        0       0       0       0       0.0020  0.0004  0.0003
RU    0      0      1        0       0       0       0       0.0026  0.0005  0.0002
UK    0      0      1        0       0       0       0       0.0003  0.0034  0.0001
UK    0      0      1        0       0       0       0       0.0003  0.0026  0.0001
VI    1      0      0        0       0.0028  0       0       0       0       0
VI    1      0      0        0       0.0029  0       0.0001  0       0       0
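One way to assemble a row of such a matrix is sketched here: the binarized alphabet features come first, followed by the relative frequency of each selected trigram (0 when the trigram does not occur in the text). The function name, the column subset and the dictionary layout are illustrative assumptions.

```python
def feature_row(alphabet_bits, trigram_freqs, alphabet_columns, trigram_columns):
    """Build one input vector: alphabet bits followed by trigram frequencies."""
    row = [alphabet_bits.get(col, 0) for col in alphabet_columns]
    row += [trigram_freqs.get(tri, 0.0) for tri in trigram_columns]
    return row


alphabet_columns = ["Latin", "Greek", "Cyril."]
trigram_columns = [" pa", "ới ", "par", "nia", "ест"]    # subset of the columns above

# Values taken from the first PT row of the excerpt.
row = feature_row({"Latin": 1},
                  {" pa": 0.0041, "par": 0.0038, "nia": 0.0001},
                  alphabet_columns, trigram_columns)
print(row)
```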

Experiment 1: 25 languages
Arabic (AR)
Bulgarian (BG)
German (DE)
Modern Greek (EL)
Spanish (ES)
Persian (FA)
French (FR)
Hebrew (HE)
Hungarian (HU)
Italian (IT)
Japanese (JA)
Korean (KO)
Dutch (NL)
Polish (PL)
Portuguese (PT)
Brazilian Portuguese (PT-BR)
Romanian (RO)
Russian (RU)
Serbian (SR)
Thai (TH)
Turkish (TR)
Ukrainian (UK)
Vietnamese (VI)
Traditional Chinese (ZH-TW)
Simplified Chinese (ZH-CN)

Exp 1: Training and Test Sets
Training Set (30 files/lang) Test Set (21 files/lang)
Lang. Smallest Largest x̄ σ Smallest Largest x̄ σ
ar 871921 969387 907562 21392 863 4618 2366 1210
bg 988450 1087435 1027581 23663 660 2099 1091 378
de 588200 653508 618463 16475 677 3890 1554 842
el 773265 885770 841203 22653 550 3297 1590 705
es 578806 651240 617341 17637 897 3850 2342 935
fa 651807 766206 697212 28994 600 5221 1338 967
fr 639582 705675 673414 15377 936 4088 1879 689
he 806098 877218 836222 20545 559 3649 1586 878
hu 406271 454506 431797 13131 729 6045 2175 1356
it 588147 643252 616391 14348 1260 6607 2991 1370
ja 538033 606053 569956 18871 323 785 495 133
ko 737118 817651 773168 20550 530 1603 780 233
nl 533497 580313 557724 14033 552 1949 1115 381
pl 521184 591299 551259 17938 435 3092 1605 694
pt-br 596158 643215 617734 14028 920 3189 1953 589
pt 338272 378872 355800 10605 486 5875 2031 1169
ro 592714 650375 616051 15442 718 3254 1438 695
ru 1019789 1144200 1069884 31232 662 2470 1444 526
sr 349389 433221 379344 20560 834 6493 1813 1263
th 529484 601244 565082 18551 334 3242 1396 734
tr 494191 549998 524271 12774 332 5390 1559 1121
uk 370785 434683 395312 16641 299 15435 2430 3553
vi 470057 541930 510409 17246 680 6237 1555 1359
zh-cn 536438 595027 562728 14457 495 6331 1695 1559
zh-tw 514993 588860 542879 16000 270 1721 925 428

Exp1: Accuracy
Language 1500 iters. 4000 iters.
ar, bg, de 100% 100%
el, es, fa 100% 100%
fr, he, hu 100% 100%
it, ja, ko 100% 100%
nl, pl 100% 100%
pt 5% 52% (errors are wrongly classified as pt-br)
pt-br 100% 76% (errors are wrongly classified as pt)
ro, ru, sr 100% 100%
th, tr, uk 100% 100%
vi, zh-cn, zh-tw 100% 100%

Exp1: Comparison of PT variants
[Figure: two panels comparing the Portuguese variants, PT and PT-BR]

Experiment 2: 55 languages
Afrikaans
Albanian
Arabic
Bulgarian
Bengali
Catalan
Czech
Danish
German
Modern Greek
English
Esperanto
Spanish
Estonian
Persian
Finnish
French
Galician
Gujarati
Hebrew
Hindi
Hungarian
Armenian
Indonesian
Italian
Japanese
Georgian
Kannada
Korean
Kurdish
Lithuanian
Latvian
Macedonian
Malayalam
Marathi
Burmese
Nepali
Dutch
Polish
Portuguese
Romanian
Russian
Slovak
Slovenian
Somali
Serbian
Swedish
Tamil
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Chinese (simplified)
Chinese (traditional)

Exp 2: Results
55 languages;
1126 features;
the Θ(l) matrices take 11 MB on disk (binary format);
7500 iterations of the learning algorithm,
taking 6574 minutes and 50.386 seconds (more than 4.5 days);
still 21 test files per language;
46 seconds to run over the 1155 test files;
accuracy of 99.740%;
misidentifications:
2 Bulgarian texts detected as Macedonian,
1 Danish text detected as Dutch.
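The reported figures are consistent with each other, as this short check of the arithmetic shows (all numbers are taken from the list above):

```python
languages = 55
files_per_language = 21
misidentified = 2 + 1        # 2 Bulgarian as Macedonian, 1 Danish as Dutch

test_files = languages * files_per_language
accuracy = (test_files - misidentified) / test_files
training_days = (6574 * 60 + 50.386) / 86400    # training time in days

print(test_files)                  # 1155
print(f"{accuracy:.3%}")           # 99.740%
print(f"{training_days:.2f}")      # 4.57
```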

Conclusions
Up to 96% accuracy when testing a few languages,
including two Portuguese variants;
Over 99.7% accuracy for 55 languages;
NNs are able to grow, but training time grows steeply;
The choice of features matters
(if we know that a specific detail helps to distinguish a
language, add it to the network!);
The results obtained are not "deterministic": although the same
proportion of correct results is expected, the random initialization of
the network may lead to somewhat different results after a different
number of iterations.

Future Work
Reduce the number of trigrams per language and include unigrams;
Compute distribution differences between closely related languages;
Experiment with training a separate neural network for
each alphabet;
Include a regularization coefficient (λ ≠ 0);
Experiment with Deep Neural Networks;
Test language identification on short texts (namely
tweets).

Language Identification:
a Neural Network approach
Alberto Simões (1), José João Almeida (2), Simon D. Byers (3)
(1) CEHUM, University of Minho
ambs@ilch.uminho.pt
(2) CCTC, University of Minho
jj@di.uminho.pt
(3) AT&T Labs, Bedminster NJ
headers@gmail.com
SLATE 2014, 19-20 June 2014
