Language Identification: A neural network approach
A presentation on some experiments in language identification with Perl and neural networks.

Presentation Transcript

  • Language Identification: a Neural Network approach
      Alberto Simões (1), José João Almeida (2), Simon D. Byers (3)
      (1) CEHUM, Minho's University, ambs@ilch.uminho.pt
      (2) CCTC, Minho's University, jj@di.uminho.pt
      (3) AT&T Labs, Bedminster NJ, headers@gmail.com
      SLATE2014, 19-20 June 2014
  • In which languages are these texts?
      Malgranda Sablodezerto estas dezerto de Okcidenta Aŭstralio (Esperanto)
      Po nepavykusių pirmųjų bandymų su kukurūzais (Lithuanian)
  • In which languages are these texts?
      俄罗斯眼下不具备航母建造、 停泊和维护所需的基础设施和条件 (Simplified Chinese)
      임금체계 개편은 기본적으로 노사 합의 또는 (Korean)
  • In which languages are these texts?
      ‫جلوگیری‬ .‫کردند‬ ‫گروه‬ ‫دوم‬ ‫هم‬ ‫به‬ (Persian)
      আেবদনকারীেদর পক্েষ শুনািন কেরন িফদা (Bengali)
  • In which languages are these texts?
      ဦးသိန္းစိန္အစိုးရရဲ￿ ဝန္￿ကီးအမ်ားစုဟာ စစ္ဗုိလ္နဲ￿ စစ္ဗိုလ္လူထြက္ေတြ (Burmese)
      આ રસ મ લ િનચોડી સારી રી િમકસ કરો અ લાસમ (Gujarati)
  • Approaches
      • Using a dictionary of words for each language:
          • Problem: amount of word forms!
      • Using language features:
          • compute unigrams, bigrams, trigrams, …;
          • compute short words;
          • compute word beginnings or terminations;
      • Then use language models:
          • Naïve Bayes;
          • Hidden Markov Models (HMM);
          • Support Vector Machines (SVM);
          • Neural Networks (NN);
  • Motivation for a new tool
      • lack of a decent identification tool for Perl;
      • use of the Chrome Language Detection library is limited:
          • how to add new languages?
          • how to restrict results to specific languages?
      • there are tools for other programming languages:
          • language interoperability can be a hassle;
          • not clear how to add new languages;
  • Why use a Neural Network?
      • learn how Neural Networks work!
      • an approach where:
          • training is tedious and slow;
          • identification is easy to implement;
          • identification is efficient when BLAS is available;
      • therefore:
          • possible to use trained data in different programming languages;
          • easy to restrict analysis to a set of languages;
          • identification probabilities are comparable;
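The point about restricting analysis to a set of languages can be illustrated with a minimal Python sketch. It assumes the network's output is available as a mapping from language code to score; the function name `identify` and the example scores are hypothetical and do not come from the slides.

```python
def identify(scores, allowed=None):
    """Return the best-scoring language, optionally restricted to `allowed`.

    `scores` maps a language code to the network's output for that language.
    Both the name and the data layout are assumptions made for illustration.
    """
    if allowed is None:
        candidates = dict(scores)
    else:
        candidates = {lang: s for lang, s in scores.items() if lang in allowed}
    if not candidates:
        raise ValueError("no candidate languages left after restriction")
    return max(candidates, key=candidates.get)


# Hypothetical output scores for one text.
example_scores = {"pt": 0.48, "es": 0.51, "fr": 0.01}
best_overall = identify(example_scores)
best_restricted = identify(example_scores, allowed={"pt", "fr"})
```

Because every output unit is scored on the same scale, restricting the candidates only requires filtering before taking the maximum; no retraining is involved in this sketch.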
  • Neural Network Architecture
      (diagram) input layer (features): x1, x2, x3, …, xn; second layer: a(2)_1, a(2)_2, a(2)_3, …, a(2)_s2; output layer: y1, y2, …, yK; weight matrices: Θ(1) between the input layer and the second layer, Θ(2) between the second layer and the output layer.
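A forward pass through a three-layer network of this shape can be sketched in Python as follows. The sigmoid activation and the bias unit prepended to each layer are assumptions (the slide does not state them), and the tiny weight matrices are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, activations):
    """Apply one weight matrix; column 0 of each row is the bias weight (assumed)."""
    with_bias = [1.0] + list(activations)
    return [sigmoid(sum(w * a for w, a in zip(row, with_bias))) for row in theta]


def forward(theta1, theta2, x):
    """x -> a(2) via Theta(1), then a(2) -> y via Theta(2)."""
    hidden = layer(theta1, x)
    return layer(theta2, hidden)


# Illustrative sizes: n = 2 features, s2 = 2 hidden units, K = 2 languages.
theta1 = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
theta2 = [[0.0, 2.0, -2.0], [0.0, -2.0, 2.0]]
y = forward(theta1, theta2, [1.0, 0.0])
```

Identification then amounts to picking the output unit with the largest value, which is why the trained Θ matrices are all another programming language needs in order to reuse the model.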
  • Preparing Training Data
      • texts from TED website;
      • more than 105 languages available!
      • English texts were matched against an English dictionary;
      • OOV (out-of-vocabulary) items are removed from the English texts and from other language texts (trying to remove named entities written in their English form from other texts).
      Example: …began spoken word poet Sarah Kay, in a talk that inspired two standing ovations at TED2011. She tells the story of her metamorphosis — from a wide-eyed teenager soaking in verse at New York's Bowery Poetry Club to a teacher connecting kids with the power of self-expression through Project V.O.I.C.E. — and gives two breathtaking performances of "B" and "Hiroshima."
  • Two Kinds of Features
      • Used Alphabet: Which are the computer characters used in the text? Are they usually used in Asiatic, Arabic or Latin text?
      • Used Sequences of Characters: Which unigrams, bigrams or trigrams are used? Which are most common for each language?
  • Alphabet Features
      Count the number of Unicode characters in the following classes:
      C1   Latin characters, only a-z, without diacritics;
      C2   Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
      C3   Hiragana and Katakana characters (0x3040-0x30FF);
      C4   Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF, 0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
      C5   Kanji characters (0x4E00-0x9FAF);
      C6   Simplified Chinese characters (2877 hand defined characters);
      C7   Traditional Chinese characters (2663 hand defined characters);
      C8   Arabic characters (0x0600-0x06FF);
      C9   Thai characters (0x0E00-0x0E7F);
      C10  Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
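Counting characters per class could look like the Python sketch below. The code point ranges are the ones listed on the slide. The hand-defined Simplified and Traditional Chinese sets are not given in the slides, so they are passed in as parameters and default to empty; lowercasing before the a-z test is likewise an assumption.

```python
# Code point ranges as listed on the slide (inclusive).
RANGES = {
    "C2": [(0x0410, 0x042F), (0x0430, 0x044F)],
    "C3": [(0x3040, 0x30FF)],
    "C4": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
           (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],
    "C5": [(0x4E00, 0x9FAF)],
    "C8": [(0x0600, 0x06FF)],
    "C9": [(0x0E00, 0x0E7F)],
    "C10": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
}


def alphabet_counts(text, simplified=frozenset(), traditional=frozenset()):
    """Count characters per class C1..C10.

    `simplified` and `traditional` stand in for the hand-defined character
    lists (C6, C7), which the slides do not reproduce.
    """
    counts = {f"C{i}": 0 for i in range(1, 11)}
    for ch in text.lower():  # lowercasing is an assumption
        cp = ord(ch)
        if "a" <= ch <= "z":
            counts["C1"] += 1
        for name, ranges in RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
        if ch in simplified:
            counts["C6"] += 1
        if ch in traditional:
            counts["C7"] += 1
    return counts


mixed = alphabet_counts("Привет abc")
korean = alphabet_counts("한국어")
greek = alphabet_counts("αβγ")
```

Dividing each count by the total number of counted characters would give relative values, which is presumably what the 0.20 and 0.30 thresholds of the binarization step refer to.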
  • Binarization of Alphabet Features
      In order to reduce entropy in the NN, alphabet features are binarized using a set of rules:
      set C1  ⇐ C1 > 0.20
      set C2  ⇐ C2 > 0.20
      set C3  ⇐ C3 > 0.20
      set C4  ⇐ C4 > 0.20
      set C6  ⇐ C5 > 0.30 ∧ C6 > C7
      set C7  ⇐ C5 > 0.30 ∧ C6 < C7
      set C8  ⇐ C8 > 0.20
      set C9  ⇐ C9 > 0.20
      set C10 ⇐ C10 > 0.20
      where set Ci ⇔ Ci ← 1 ∧ ∀ j ≠ i: Cj ← 0
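The binarization rules can be sketched in Python as below. Two things are assumptions rather than statements from the slide: the rules are tried in the order listed with the first match winning, and a text that matches no rule gets all zeros. The input is assumed to hold relative frequencies in [0, 1].

```python
def binarize(c):
    """Apply the 'set Ci' rules to a dict of relative frequencies C1..C10."""
    def set_only(name):
        # set Ci: Ci <- 1 and every other Cj <- 0
        return {key: (1 if key == name else 0) for key in c}

    for name in ("C1", "C2", "C3", "C4"):
        if c[name] > 0.20:
            return set_only(name)
    if c["C5"] > 0.30 and c["C6"] > c["C7"]:
        return set_only("C6")
    if c["C5"] > 0.30 and c["C6"] < c["C7"]:
        return set_only("C7")
    for name in ("C8", "C9", "C10"):
        if c[name] > 0.20:
            return set_only(name)
    return {key: 0 for key in c}  # fallback when no rule fires (assumed)


zeros = {f"C{i}": 0.0 for i in range(1, 11)}
latin = binarize({**zeros, "C1": 0.9})
chinese = binarize({**zeros, "C5": 0.5, "C6": 0.4, "C7": 0.1})
```

Note that C5 never ends up set: it only gates the choice between C6 and C7.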
  • Trigram Features
      Why trigrams?
      • bigrams would be too small when comparing very close languages, like Portuguese and Spanish;
      • tetragrams would be too big for some languages (like Asiatic ones), where some glyphs represent words or morphemes;
      • as punctuation and numbers were removed, and spaces normalized, trigrams are able to capture the end or beginning of words, as well as single-character words that appear surrounded by spaces.
  • Trigram Features: example
      Für mich war das eine neue Erkenntnis. Und ich denke, mit der Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur eine Form kultureller Integration. Wir haben erkannt, dass seit kurzem immer mehr Leutea
      Top occurring trigrams:
      en␣ 0.02299   er␣ 0.02682   ␣de 0.01533   abe 0.01533   der 0.01149
      hab 0.01149   ich 0.01149   ir␣ 0.01149   it␣ 0.01149   r␣h 0.01149
      ␣wi 0.01149   ben 0.01149   ch␣ 0.01149   den 0.01149   wir 0.01149
      ␣ha 0.01149   ine 0.00766   ler 0.00766   lle 0.00766   n␣k 0.00766
      mme 0.00766   ne␣ 0.00766   nnt 0.00766   r␣l 0.00766   r␣m 0.00766
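Relative trigram frequencies of this kind can be computed with a short Python sketch. The normalization follows what the slides describe (punctuation and numbers removed, spaces normalized); lowercasing and padding the text with one space on each side are assumptions, as are the function names.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Keep letters and combining marks, turn everything else into single spaces."""
    kept = [ch if unicodedata.category(ch)[0] in "LM" else " "
            for ch in text.lower()]  # lowercasing is an assumption
    return " ".join("".join(kept).split())


def trigram_frequencies(text):
    """Map each character trigram to its share of all trigrams in the text."""
    padded = " " + normalize(text) + " "  # padding is an assumption
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


freqs = trigram_frequencies("Wir haben, wir haben 2!")
top = sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:5]
```

In this sketch a word boundary shows up as an ordinary space inside the trigram, which is how entries such as `en␣` or `␣de` arise.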
  • Trigram Features: Merging
      features ← {};
      for L ∈ ℒ do
          trigrams ← ∅;
          for file ∈ Files_L do
              T ← computeTrigrams(file);    // Str → ℕ
              T ← mostOccurring(T);         // Top 30 trigrams
              for t ∈ keys(T) do
                  trigrams[t] ← trigrams[t] + 1;
          T ← mostOccurring(T);
          features ← features ∪ keys(trigrams);
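A runnable Python reading of the merging procedure might look like this. It is a sketch under stated assumptions: inputs are in-memory strings rather than files, the helper names mirror the pseudocode but are otherwise made up, and the second `mostOccurring` call from the slide is not reproduced because the transcript does not make clear what it filters.

```python
from collections import Counter


def compute_trigrams(text):
    """Str -> N: count every character trigram in the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def most_occurring(counter, n=30):
    """Keep the n most frequent trigrams (30 on the slide)."""
    return dict(counter.most_common(n))


def merge_features(texts_by_language, top=30):
    """Union, over all languages, of the trigrams that are top-ranked in any file."""
    features = set()
    for language, texts in texts_by_language.items():
        trigrams = Counter()
        for text in texts:
            for t in most_occurring(compute_trigrams(text), top):
                trigrams[t] += 1  # number of files in which t is top-ranked
        features |= set(trigrams)
    return features


corpus = {"en": ["the the the", "then the"], "pt": ["para para"]}
feats = merge_features(corpus, top=2)
```

The resulting set fixes the trigram columns of the training matrix, so every language contributes columns even for trigrams that never occur in the others.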
  • Training Data Matrix (excerpt)
              Alphabet Features       Trigram Features
      Lang.   Latin  Greek  Cyril.    ␣pa     ới␣     par     nia     ест     ати.    ата
      PT      1      0      0         0.0041  0       0.0038  0.0001  0       0       0
      PT      1      0      0         0.0039  0       0.0036  0       0       0       0
      RU      0      0      1         0       0       0       0       0.0020  0.0004  0.0003
      RU      0      0      1         0       0       0       0       0.0026  0.0005  0.0002
      UK      0      0      1         0       0       0       0       0.0003  0.0034  0.0001
      UK      0      0      1         0       0       0       0       0.0003  0.0026  0.0001
      VI      1      0      0         0       0.0028  0       0       0       0       0
      VI      1      0      0         0       0.0029  0       0.0001  0       0       0
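One row of such a matrix can be assembled from the binarized alphabet features and the trigram frequencies, as in the Python sketch below. The column lists cover only the excerpt shown on the slide, the trigram keys use a plain space for ␣, and the function name is hypothetical.

```python
# Columns of the excerpt only; the real matrix has many more trigram columns.
ALPHABET_COLUMNS = ["Latin", "Greek", "Cyrillic"]
TRIGRAM_COLUMNS = [" pa", "ới ", "par", "nia", "ест", "ати", "ата"]


def feature_row(alphabet_bits, trigram_freqs):
    """Lay out one training example in fixed column order; missing entries are 0."""
    row = [alphabet_bits.get(name, 0) for name in ALPHABET_COLUMNS]
    row += [trigram_freqs.get(t, 0.0) for t in TRIGRAM_COLUMNS]
    return row


# Values taken from the first PT row of the excerpt.
pt_row = feature_row({"Latin": 1}, {" pa": 0.0041, "par": 0.0038, "nia": 0.0001})
```

Trigrams that a text does not contain simply contribute zeros, which is why most cells in the excerpt are 0.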
  • Experiment 1: 25 languages
      Arabic (AR), Bulgarian (BG), German (DE), Modern Greek (EL), Spanish (ES), Persian (FA), French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Japanese (JA), Korean (KO), Dutch (NL), Polish (PL), Portuguese (PT), Brazilian Portuguese (PT-BR), Romanian (RO), Russian (RU), Serbian (SR), Thai (TH), Turkish (TR), Ukrainian (UK), Vietnamese (VI), Traditional Chinese (ZH-TW), Simplified Chinese (ZH-CN)
  • Exp 1: Training and Test Sets
               Training Set (30 files/lang)             Test Set (21 files/lang)
      Lang.    Smaller   Larger    x̄         σ          Smaller  Larger  x̄      σ
      ar       871921    969387    907562    21392      863      4618    2366   1210
      bg       988450    1087435   1027581   23663      660      2099    1091   378
      de       588200    653508    618463    16475      677      3890    1554   842
      el       773265    885770    841203    22653      550      3297    1590   705
      es       578806    651240    617341    17637      897      3850    2342   935
      fa       651807    766206    697212    28994      600      5221    1338   967
      fr       639582    705675    673414    15377      936      4088    1879   689
      he       806098    877218    836222    20545      559      3649    1586   878
      hu       406271    454506    431797    13131      729      6045    2175   1356
      it       588147    643252    616391    14348      1260     6607    2991   1370
      ja       538033    606053    569956    18871      323      785     495    133
      ko       737118    817651    773168    20550      530      1603    780    233
      nl       533497    580313    557724    14033      552      1949    1115   381
      pl       521184    591299    551259    17938      435      3092    1605   694
      pt-br    596158    643215    617734    14028      920      3189    1953   589
      pt       338272    378872    355800    10605      486      5875    2031   1169
      ro       592714    650375    616051    15442      718      3254    1438   695
      ru       1019789   1144200   1069884   31232      662      2470    1444   526
      sr       349389    433221    379344    20560      834      6493    1813   1263
      th       529484    601244    565082    18551      334      3242    1396   734
      tr       494191    549998    524271    12774      332      5390    1559   1121
      uk       370785    434683    395312    16641      299      15435   2430   3553
      vi       470057    541930    510409    17246      680      6237    1555   1359
      zh-cn    536438    595027    562728    14457      495      6331    1695   1559
      zh-tw    514993    588860    542879    16000      270      1721    925    428
  • Exp1: Accuracy
      Language             1500 iters.   4000 iters.   Note
      ar, bg, de           100%          100%
      el, es, fa           100%          100%
      fr, he, hu           100%          100%
      it, ja, ko           100%          100%
      nl, pl               100%          100%
      pt                   5%            52%           wrongly classifies as pt-br
      pt-br                100%          76%           wrongly classifies as pt
      ro, ru, sr           100%          100%
      th, tr, uk           100%          100%
      vi, zh-cn, zh-tw     100%          100%
  • Exp1: Comparison of PT variants
      (figure: PT compared with PT-BR)
  • Experiment 2: 55 languages
      Afrikaans, Albanian, Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Persian, Finnish, French, Galician, Gujarati, Hebrew, Hindi, Hungarian, Armenian, Indonesian, Italian, Japanese, Georgian, Kannada, Korean, Kurdish, Lithuanian, Latvian, Macedonian, Malayalam, Marathi, Burmese, Nepali, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Serbian, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Chinese (simplified), Chinese (traditional)
  • Exp 2: Results
      • 55 languages, 1,126 features;
      • the Θ(l) matrices take 11 MB on disk (binary format);
      • 7500 iterations of the learning algorithm, taking 6574 minutes and 50.386 seconds (more than 4.5 days);
      • still 21 test files per language;
      • 46 seconds to run over the 1155 test files;
      • accuracy of 99.740%;
      • mis-identifications:
          • 2 Bulgarian texts detected as Macedonian;
          • 1 Danish text detected as Dutch.
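The reported figures are mutually consistent, as the following Python check shows. It uses only numbers from the slide; the variable names are chosen for illustration.

```python
languages = 55
files_per_language = 21
test_files = languages * files_per_language  # 1155

misidentified = 2 + 1  # 2 Bulgarian as Macedonian, 1 Danish as Dutch
accuracy = 100 * (test_files - misidentified) / test_files

# Training time: 6574 minutes and 50.386 seconds, expressed in days.
training_days = (6574 + 50.386 / 60) / 60 / 24

# Average identification time per test file, in seconds.
seconds_per_file = 46 / test_files

print(f"test files: {test_files}")
print(f"accuracy: {accuracy:.3f}%")
print(f"training time: {training_days:.2f} days")
print(f"per-file time: {seconds_per_file * 1000:.1f} ms")
```

Three errors in 1155 files give the stated 99.740%, and the training time works out to a little over 4.5 days.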
  • Conclusions
      • Up to 96% accuracy when testing few languages, including two Portuguese variants;
      • Over 99.7% accuracy for 55 languages;
      • NNs are able to grow, but training time grows excessively;
      • The choice of features is relevant (if we know that a specific detail is good for distinguishing a language, add it to the network!);
      • The results obtained are not "deterministic". Although the same proportion of results is expected, the random initialization of the network may lead to somewhat different results for different numbers of iterations.
  • Future Work
      • Reduce the number of trigrams per language and include unigrams;
      • Compute distribution differences between near languages;
      • Experiment with training different neural networks for each alphabet;
      • Include a regularization coefficient (λ ≠ 0);
      • Experiment with Deep Neural Networks;
      • Test language identification on short texts (namely Twitter tweets).