Short Text Language Detection with Infinity-Gram
  • 'Our Goal is 99%+ accurate detection for 'sentence with more than 3 words''

    Hope they define sentence somewhere!

    More interesting is their use of Tries and ESAs.

    I suspect a more efficient (in practise and asymptotic space+time) implementation of ESAs could be created by forming them using a special kind of Suffix Tree (rather than Array); which uses hash maps and sibling lists (made of skew binary lists or even regular linked lists).

    The normalisation stuff looked fairly generic, but the Tweet sampling was irritating because they're just using Twitter's algorithm; which they don't reference :(.

    A 98% accuracy before even reducing bias is fairly impressive!

    'Double array' was mentioned in reference to https://github.com/shuyo/ldig on slide 54. Not sure what to make of that (is it ESA?!); might take a look through the code later…

    Some need for improvement (especially related to lack of compression and use of overly suboptimal data-structures).

    However there were some very nice accuracy results at the end there :)
Presentation Transcript

  • Short Text Language Detection with Infinity-Gram
    2012/05/14 NAIST Seminar
    Nakatani Shuyo @ Cybozu Labs Inc
  • Agenda
    • Language Detection
    • Proposed Method: Maximal Substring
    • Corpus
    • Implementation and Evaluation
    • Conclusions
  • Language Detection
  • In What Language?
    • Ik kan er nooit tegen als mensen me negeren.
    • Aha ich seh angeblich süß aus
    • Czy mógłbym zasnąć w przedmieściach Twoich myśli?
    • Ah. Tak. Så skal jeg bare finde ud af *hvordan*!
    • Det er ikke så digg nei å vi som har finale til helga....Skrekk og gru! Takk :)
    • tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen!
    • Çok doğru. En büyük hatayı yaptım.
    • Încântat de cunoștință.
    • Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung...
  • Hints
    • Dutch if there is "ik"
    • German if there is "ich" or the letter ß
    • Polish if there is "czy" or the letters ł, ń, ś or ź
    • Scandinavian if there is the letter å
      – Danish if there is "af". "Tak" means thanks.
      – Norwegian if there is "nei". "Takk" means thanks.
      – Swedish if there is "och". "Tack" means thanks.
    • Turkish if there is the letter ı (i without a dot) or ğ
    • Romanian if there is the letter ă, ș or ț
      – Although ă is also used in Vietnamese, the two are easy to distinguish.
      – Although ş is also used in Turkish, the two are easy to distinguish.
    • Vietnamese if there are many unreadable letters on WinXP :P
  • In What Language? (Solution)
    • Ik kan er nooit tegen als mensen me negeren. → Dutch
    • Aha ich seh angeblich süß aus → German
    • Czy mógłbym zasnąć w przedmieściach Twoich myśli? → Polish
    • Ah. Tak. Så skal jeg bare finde ud af *hvordan*! → Danish
    • Det er ikke så digg nei å vi som har finale til helga....Skrekk og gru! Takk :) → Norwegian
    • tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen! → Swedish
    • Çok doğru. En büyük hatayı yaptım. → Turkish
    • Încântat de cunoștință. → Romanian
    • Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung... → Vietnamese
  • What's Language Detection
    • Detecting which language the input text is written in
      – Time flies like an arrow → English
      – Buona sera! → Italian
    • It is a prerequisite for many language processing tasks
      – A language model is built for each language
      – Text search, classification, extraction, translation, ...
    • For sufficiently long and noiseless text, detection with more than 99% accuracy is possible [Cavnar+ 94]
      – A 3-gram model is used in many methods
  • SPAM or not?
    • It is necessary to know that it is written in Polish.
  • Document Categorization with Naive Bayes Classifier
    • Categorize a document $X = (X_i)$ into a category $C_k$
      – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
    • Word probabilities are assumed to be conditionally independent given the category
      – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
      – where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
    • Estimate the category $C_k$ that maximizes the posterior
      – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
      – where $p(C_k)$ is the prior for category $C_k$
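The classifier on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not the langdetect implementation: the function names `train` and `predict`, the add-one smoothing, and the tiny training set are all assumptions made for the example.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (label, list_of_features). Returns counts, priors, vocabulary."""
    counts, prior, vocab = {}, Counter(), set()
    for label, feats in docs:
        prior[label] += 1
        counts.setdefault(label, Counter()).update(feats)
        vocab.update(feats)
    return counts, prior, vocab


def predict(model, feats):
    """Return the label maximizing log p(C_k) + sum_i log p(X_i | C_k)."""
    counts, prior, vocab = model
    total_docs = sum(prior.values())
    best_label, best_score = None, -math.inf
    for label in prior:
        total = sum(counts[label].values())
        score = math.log(prior[label] / total_docs)
        for f in feats:
            # Add-one smoothing (an assumption here) avoids log(0) for unseen features.
            score += math.log((counts[label][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([("en", ["the", "cat"]), ("de", ["die", "katze"])])
print(predict(model, ["the"]))
```

Working in log space keeps the product over features from underflowing when a document has many features.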
  • Language Detection with Naive Bayes Classifier
    • Document categorization with language labels
      – Categorize documents into English, Japanese and so on
    • Use character n-grams as features
      – "Unicode code point n-grams", strictly speaking
      – Assume the character encoding of the document is already known
        • Most applications know the encoding of their internal text data
  • Why Use n-Gram to Detect Language
    • Each language has its own characters and spelling rules
      – "é" is often used in Spanish, Italian and so on, but in principle not in English
      – There are many words which start with "Z" in German, but not in English
      – There are many words which start with "C" in English, but not in German
      – The spelling "Th" is often used in English, but not in the other languages

               □C     □L     □Z     Th
      English  0.75   0.47   0.02   0.74
      German   0.10   0.37   0.53   0.03
      French   0.38   0.69   0.01   0.01

    • n-grams of "□This□" (□ marks a word boundary)
      – 1-gram: T h i s
      – 2-gram: □T Th hi is s□
      – 3-gram: □Th Thi his is□
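The n-gram breakdown of "This" on the slide can be reproduced with a short helper. This is a minimal sketch; the function name `char_ngrams` and the choice to skip boundary marks for 1-grams (to match the slide) are assumptions of this example.

```python
def char_ngrams(word, n, boundary="□"):
    """Return character n-grams of word, padded with a word-boundary mark."""
    if n == 1:
        # The slide lists 1-grams without boundary marks.
        return list(word)
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```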
  • language-detection (langdetect) (Nakatani 2010)
    • Language detection library for Java
      – http://code.google.com/p/language-detection/
      – Apache License 2.0
      – Character 3-gram + Bayesian filter
      – Various normalizations + feature sampling
    • Over 99% precision for 53 languages
      – Trained on Wikipedia abstracts
      – Broad language support, including Asian languages
      – Adopted by Apache Solr
  • Evaluation with News Text
    • Test on news text crawled from the web in 49 languages

      code   language              size  correct (accuracy)
      af     Afrikaans              200   199 (99.50%)
      ar     Arabic                 200   200 (100.00%)
      bg     Bulgarian              200   200 (100.00%)
      bn     Bengali                200   200 (100.00%)
      cs     Czech                  200   200 (100.00%)
      da     Danish                 200   179 (89.50%)
      de     German                 200   200 (100.00%)
      el     Greek                  200   200 (100.00%)
      en     English                200   200 (100.00%)
      es     Spanish                200   200 (100.00%)
      fa     Persian                200   200 (100.00%)
      fi     Finnish                200   200 (100.00%)
      fr     French                 200   200 (100.00%)
      gu     Gujarati               200   200 (100.00%)
      he     Hebrew                 200   200 (100.00%)
      hi     Hindi                  200   200 (100.00%)
      hr     Croatian               200   200 (100.00%)
      hu     Hungarian              200   200 (100.00%)
      id     Indonesian             200   200 (100.00%)
      it     Italian                200   200 (100.00%)
      ja     Japanese               200   200 (100.00%)
      kn     Kannada                200   200 (100.00%)
      ko     Korean                 200   200 (100.00%)
      mk     Macedonian             200   200 (100.00%)
      ml     Malayalam              200   200 (100.00%)
      mr     Marathi                200   200 (100.00%)
      ne     Nepali                 200   200 (100.00%)
      nl     Dutch                  200   200 (100.00%)
      no     Norwegian              200   199 (99.50%)
      pa     Punjabi                200   200 (100.00%)
      pl     Polish                 200   200 (100.00%)
      pt     Portuguese             200   200 (100.00%)
      ro     Romanian               200   200 (100.00%)
      ru     Russian                200   200 (100.00%)
      sk     Slovak                 200   200 (100.00%)
      so     Somali                 200   200 (100.00%)
      sq     Albanian               200   200 (100.00%)
      sv     Swedish                200   200 (100.00%)
      sw     Swahili                200   200 (100.00%)
      ta     Tamil                  200   200 (100.00%)
      te     Telugu                 200   200 (100.00%)
      th     Thai                   200   200 (100.00%)
      tl     Tagalog                200   200 (100.00%)
      tr     Turkish                200   200 (100.00%)
      uk     Ukrainian              200   200 (100.00%)
      ur     Urdu                   200   200 (100.00%)
      vi     Vietnamese             200   200 (100.00%)
      zh-cn  Simplified Chinese     200   200 (100.00%)
      zh-tw  Traditional Chinese    200   200 (100.00%)
      total                        9800  9777 (99.77%)
  • Evaluation with Europarl Datasets
    • Test on 1000 samples for each language from the Europarl Parallel Corpus
      – from the proceedings of the European Parliament
      – http://www.statmt.org/europarl/
    • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

      code   language     size   correct  accuracy
      bg     Bulgarian    1000     988     98.8%
      cs     Czech        1000     994     99.4%
      da     Danish       1000     968     96.8%
      de     German       1000     998     99.8%
      el     Greek        1000    1000    100.0%
      en     English      1000     996     99.6%
      es     Spanish      1000     996     99.6%
      et     Estonian     1000     996     99.6%
      fi     Finnish      1000     998     99.8%
      fr     French       1000     999     99.9%
      hu     Hungarian    1000     999     99.9%
      it     Italian      1000     999     99.9%
      lt     Lithuanian   1000     997     99.7%
      lv     Latvian      1000     999     99.9%
      nl     Dutch        1000     974     97.4%
      pl     Polish       1000     999     99.9%
      pt     Portuguese   1000     996     99.6%
      ro     Romanian     1000     999     99.9%
      sk     Slovak       1000     988     98.8%
      sl     Slovene      1000     976     97.6%
      sv     Swedish      1000     991     99.1%
      total              21000   20850     99.3%
  • Language detection is a solved problem, isn't it?
  • We still have an ENEMY to beat!
  • Twitter Language Detection with the Existing Methods
    • Only 90-95% accuracy on the tweet corpus
    • LD = language-detection
    • CLD = Chromium Compact Language Detection
      – http://code.google.com/p/chromium-compact-language-detector/
      – ms (Malay) is regarded as id (Indonesian)
    • Tika = Apache Tika
      – http://tika.apache.org/
      – Evaluated on the 15 languages in our tweet corpus that Tika supports

      code   language      LD     CLD    Tika
      ca     Catalan      95.3   93.0   83.8
      cs     Czech        96.3   96.6   ----
      da     Danish       94.5   90.7   58.7
      de     German       86.6   96.8   73.1
      en     English      88.3   97.4   54.7
      es     Spanish      91.5   90.5   44.4
      fi     Finnish      98.9   99.4   94.8
      fr     French       95.0   94.5   67.4
      hu     Hungarian    85.8   89.0   76.2
      id     Indonesian   89.7   92.8   ----
      it     Italian      96.2   93.8   87.1
      nl     Dutch        69.5   93.2   65.0
      no     Norwegian    96.0   74.9   68.6
      pl     Polish       98.0   97.8   88.8
      pt     Portuguese   88.0   88.6   47.4
      ro     Romanian     92.8   96.1   82.6
      sv     Swedish      96.0   96.4   75.6
      tr     Turkish      97.6   97.4   ----
      vi     Vietnamese   98.7   98.9   ----
      total               92.2   93.8   70.0
  • Chromium Compact Language Detection (CLD)
    • A port of the language detector from Google Chromium
      – http://code.google.com/p/chromium-compact-language-detector/
      – Implemented in C++, with a Python binding
      – Number of supported languages: CLD = 76, langdetect = 53
      – Accuracy: CLD = 98.82%, langdetect = 99.22%
        • for 17 languages on the Europarl datasets
        • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
  • Is Twitter Language Detection Difficult? (1)
    • A tweet is too short to extract enough 3-gram features
      – At most 140 characters on twitter
      – URLs, mentions and hashtags are not useful for detection
    • LIGA [Tromp+ 11]
      – Graph features based on 3-grams
        • Adds long-distance features
        • 95-98% accuracy for twitter language detection
        • 6 languages (de, en, es, fr, it, nl)
  • Is Twitter Language Detection Difficult? (2)
    • Tweets are too noisy
      – Spellings that violate the language's orthography often appear
      – Acronyms, abbreviations, lengthened words (like Cooooolll)
    • The likelihood of a tweet tends to be small under a normal language model

      OMG    Oh My God
      LOL    Laughing Out Loud
      LMAO   Laughing My Ass Off
      F4F    Follow for Follow
      MDR    Mort de Rire (French)
      TKT    Ne t'Inquiète Pas (French)
      u      you
      ur     your
      4      for
      i0u    I love you
      k      che (Italian; the letter k isn't used in Italian)
      anke   anche (Italian)
  • Motivation to Detect Short Text Language
    • There are many small chunks of text besides twitter
      – Schedules, search queries, bulletin boards and so on
      – There are many questions about short text detection on the issue board of the langdetect project
        • http://code.google.com/p/language-detection/issues/detail?id=10
    • Detection for text that mixes multiple languages
      – Cut the target document into paragraphs or lines
      – Detect the language of each short text
  • Our Goal
    • Over 99% accuracy
      – However, it is too difficult to detect a "one-word sentence"...
      – Our goal is 99%+ accurate detection for "sentences with more than 3 words"
  • We Need
    • A model that can extract rich features from short text,
      – Maximal substring model (∞-gram logistic regression)
    • and a twitter-specific language model, or a corpus to construct it.
      – A corpus of about 700K tweets with language labels
  • Proposed Method
  • How to Increase Features from 3-grams
    • The larger n is, the more features there are
    • The maximum is at n=∞, that is, all substrings
      – But the number of substrings is of order $O(T^2)$

      Number of n-grams
      n    freq≧1    freq≧2   freq≧10
      1        79        72        57
      2      1896      1533       902
      3     15970     10369      4525
      4     64966     33941     10534
      5    167543     69719     15538
      6    323749    107861     18970
      7    524634    142954     21093
      8    760719    171995     22159
      9    921361    193995     22696
      :         :         :         :

    ※ Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
  • Text Categorization with All Substring Features [Okanohara+ 09]
    • Multiclass logistic regression using all substrings as features
      – Using maximal substrings yields an equivalent model that can be constructed in linear time
      – Features are stored in a trie for fast prediction
  • Maximal Substring (1)
    • Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. "abracadabra"
      – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
      – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
    ※ Strictly, the definition also takes into account the position within the containing substring.
  • Maximal Substring (2)
    via http://d.hatena.ne.jp/nokuno/20120203/1328237067
    • Each equivalence class formed by the containment relation has a unique maximal element, which is called a "maximal substring".
    • The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra".
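The "abracadabra" example can be checked with a brute-force sketch. This is not the linear-time construction used in the talk; it simply treats a substring as maximal when no one-character extension to the left or right occurs equally often. The function names are assumptions of this example.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text):
    """Brute force: keep substrings that cannot be extended without losing occurrences."""
    alphabet = set(text)
    result = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            sub = text[i:j]
            freq = count_occurrences(text, sub)
            extendable = any(
                count_occurrences(text, c + sub) == freq
                or count_occurrences(text, sub + c) == freq
                for c in alphabet
            )
            if not extendable:
                result.add(sub)
    return result


print(sorted(maximal_substrings("abracadabra"), key=len))
```

For instance, "bra" is not maximal because "abra" occurs exactly as often (twice), while "a" is maximal because every extension such as "ab" or "ra" occurs fewer than five times.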
  • Maximal Substring and Infinity-Gram
    • Substrings in a containment relationship always have equal frequencies.
    • In a model that is a linear combination of features, features with identical values can be merged into one.
    • Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams.
    ※ Although the equivalence does not hold exactly on the test set, we assume it is well approximated given a sufficiently large training set.
  • Extended Suffix Array (via [Okanohara+ 09])
    • An Extended Suffix Array consists of
      – SA = suffix array,
      – L = longest common prefixes,
      – B = Burrows-Wheeler transformed text.
    • A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type.
      – These can be calculated in linear time.
    • esaxx: Okanohara's implementation of ESA
      – http://code.google.com/p/esaxx/
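To make the three arrays concrete, here is a deliberately naive sketch that builds SA, L and B by sorting suffixes. It is not how esaxx works (esaxx runs in linear time); the function name, the `$` sentinel, and the quadratic construction are assumptions chosen for readability.

```python
def extended_suffix_array(text, sentinel="$"):
    """Naively build the suffix array, LCP array and BWT of text + sentinel.

    Assumes the sentinel sorts before every character in text.
    """
    s = text + sentinel
    # SA: start positions of all suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])

    def common_prefix_len(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    # L: length of the common prefix between each suffix and its predecessor in SA.
    lcp = [0] + [common_prefix_len(s[sa[k - 1]:], s[sa[k]:]) for k in range(1, len(sa))]
    # B: the character preceding each suffix (wrapping around for position 0).
    bwt = "".join(s[i - 1] for i in sa)
    return sa, lcp, bwt


sa, lcp, bwt = extended_suffix_array("banana")
print(sa, lcp, bwt)
```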
  • Corpus and Normalization
  • Target Languages
    • Limit the character type to detect
      – In short text detection, mixed text can be divided by character type
    • Latin alphabet languages
      – The most difficult alphabet type to detect
      – More than 25 languages have over 5 million speakers.
  • What's the Latin Alphabet?
    • Latin alphabet ≠ ASCII alphabet
      – å, ą, æ, ð, Ħ, ŋ and so on...
    • They are assigned to 9 code blocks in Unicode

      Range          Name                         Note
      U+0000-007F    Basic Latin                  ASCII
      U+0080-00FF    Latin-1 Supplement           Most languages are covered with these.
      U+0100-017F    Latin Extended-A
      U+0180-024F    Latin Extended-B             Romanian
      U+0250-02AF    IPA Extensions
      U+0300-036F    Combining Diacritical Marks  for tone symbol composition
      U+1E00-1EFF    Latin Extended Additional    Vietnamese
      U+2C60-2C7F    Latin Extended-C             These are used by almost no present-day languages.
      U+A720-A7FF    Latin Extended-D
  • Latin Alphabets in Unicode Codepoint Chart
    [Figure: code chart; legend: used often, used sometimes, for Vietnamese only]
  • How to Create Corpus
    • Collect tweets with the sample method of the twitter Streaming API
      – Samples 1% of all tweets (about 2 million tweets).
      – Tweets in Latin alphabet languages account for 60% of them.
    • All that remains is to annotate these tweets with language labels
  • Language Label Annotation
    • Group tweets by their timezone
      – French tweets account for about 1% of all tweets
      – But they account for 50% of the tweets in the Paris timezone alone
    • Annotate tweets with tentative labels using langdetect
      – Remove non-French tweets from those labeled 'fr'
      – Recover French tweets from those not labeled 'fr'
    (※ 20% of all tweets have no timezone)
  • How to Annotate
    • Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn
  • Created Corpus
    • Noiseless tweets for training data
    • Noisy tweets with more than 3 words as test data
    • Catalan corpus created together with Raúl Velaz and Hiroshi Manabe

      code   language     training      test
      ca     Catalan         9,089     5,082
      cs     Czech           9,082     7,682
      da     Danish          7,388     5,524
      de     German         44,448    10,065
      en     English        44,520    10,168
      es     Spanish        44,118    10,265
      fi     Finnish         8,087     7,050
      fr     French         44,339    10,098
      hu     Hungarian      10,030     4,904
      id     Indonesian     44,722    10,181
      it     Italian        43,366    10,152
      nl     Dutch          44,682    10,007
      no     Norwegian      10,124     8,496
      pl     Polish         16,771    10,152
      pt     Portuguese     44,215    10,208
      ro     Romanian       10,021     5,911
      sv     Swedish        44,054    10,032
      tr     Turkish        44,703    10,308
      vi     Vietnamese     15,030    10,488
      total                538,789   166,773
  • Simple Language Detection
    • A language detector can be constructed from the maximal substring model and the twitter corpus
      – It still reaches at most 98% accuracy.
    • We guess it is necessary to reduce bias.
      – data size bias
      – language-specific bias
      – twitter-specific bias
  • Bias by Data Size
    • The number of tweets per language is heavily skewed.
    • Level them out by sampling with replacement from each language, up to the size of the largest one
      – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
    [Figure: chart of tweet share by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
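The approximation described on this slide (whole copies plus a remainder sampled without replacement) might look like the following. The function name `level_by_oversampling`, the dictionary layout and the fixed seed are assumptions of this sketch, not part of ldig.

```python
import random


def level_by_oversampling(corpus, seed=0):
    """corpus: dict mapping language code to a list of tweets.

    Every language is grown to the size of the largest one by copying its data
    an integer number of times and sampling the remainder without replacement.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_by_oversampling(corpus)
print({lang: len(tweets) for lang, tweets in leveled.items()})
```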
  • Convert to Lowercase on Multiple Languages
    • Converting to lowercase reduces the amount of corpus needed and compresses the model.
    • But the lowercase of I (U+0049) in Turkish differs from other languages.
    • Convert to lowercase, excluding 'I'

                               Upper case    Lower case
      Turkish / Azerbaijani    I (U+0049)    ı (U+0131)
                               İ (U+0130)    i (U+0069)
      Others                   I (U+0049)    i (U+0069)
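A minimal sketch of "convert to lowercase excluding 'I'", assuming that only the ASCII capital I (U+0049) is left untouched and that every other character goes through Python's default `str.lower()`. The function name is hypothetical, and how ldig treats İ (U+0130) is not stated on the slide.

```python
def lower_except_capital_i(text):
    """Lowercase every character except the ASCII capital I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```

Leaving 'I' as it is avoids having to decide between i and ı before the language is known.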
  • Normalization for Romanian
    • Romanian uses â, ă, î, ș, ț in addition to a-z
    • There are 2 kinds of s/t with a "beard"
      – U+015E-F, U+0162-3: s/t with cedilla
      – U+0218-B: s/t with comma below
        • 's/t with cedilla' is more popular in news, on twitter and on Wikipedia.
    • The 2 kinds look identical in some fonts...
      – Indistinguishable!!

      ș U+0219    ş U+015F    ț U+021B    ţ U+0163
  • Romanian Character Affairs on PC
    • Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently.
      – 1989: Democratization of Romania
      – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
      – 2007: Romania joined the EU
      – 2007: Windows Vista supported 's/t with comma' (available for everyone!)
    • 's/t with cedilla' is used on an advertisement board in Bucharest
  • Normalization for Substitute Characters
    • 's/t with cedilla' are substitute characters
      – But they are more popular than the correct ones
      – with cedilla : with comma = 2 : 1
      – The "Romanian IME" outputs the substitutes too :D
    • Regard 's/t with comma' as 's/t with cedilla'
      – ț (U+021B) → ţ (U+0163)
    • I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ)!!
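The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry character mapping. A minimal sketch, assuming a plain translation table (the names `COMMA_TO_CEDILLA` and `normalize_romanian` are made up for this example):

```python
# Map s/t with comma below (U+0218-021B) onto s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cunoștință."))
```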
  • Arabic Character Normalization (on language-detection)
    • Arabic and Persian have a similar problem.
    • The character 'yeh' in Farsi corresponds to 2 code points.
      – Wikipedia uses only ی (U+06CC, Farsi yeh)
      – News uses only ي (U+064A, Arabic yeh)
    • U+064A is a substitute in Farsi
      – The popular Arabic charset CP-1256 has no character mapped to U+06CC
      – As 'yeh' is used very often in both languages, almost all Persian text detection fails
    • Regard U+06CC as U+064A
  • Normalization for Vietnamese (1)
    • Vietnamese has 12 vowels
      – a, ă, â, e, ê, i, y, o, ô, ơ, u, ư
    • Vietnamese has 6 tones
      – a, ả, à, ã, á, ạ
      – These tone symbols are also used in general documents like news.
    • The tone symbols can be added to all vowels
      – 12 * 6 = 72
  • Normalization for Vietnamese (2)
    • Two representations of vowels with tones
      1. Use U+1EA0 - U+1EF9
         • ẵ = U+1EB5
      2. Combine with Diacritical Marks
         • ẵ = U+0103 U+0303
      – The two are used about half and half in news and tweets
    • Normalize representation 2 into representation 1
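One way to turn the combining form into the precomposed form is Unicode NFC normalization from Python's standard library. Whether ldig uses NFC or its own mapping table is not stated in the slides, so treat this as an assumption; the function name is hypothetical.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # ă followed by a combining tilde
composed = normalize_vietnamese(combined)
print(len(combined), len(composed), hex(ord(composed)))
```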
  • CJK-Kanji Normalization (1) (on language-detection)
    • CJK-Kanji has too many characters (more than 20K)
      – Other character types have only 30-50 characters.
    • The character space is very sparse.
      – Characters that don't occur in the training corpus have no probabilities.
        • e.g. "谢谢", Kanji for person names
      – Common frequent characters are too strong.
        • e.g. a text containing "的" tends to be detected as Traditional Chinese
        • Since kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese.
  • CJK-Kanji Normalization (2) (on language-detection)
    • Group Kanji by frequency and normalize each group to a representative character
      – (1) K-means clustering
        • Use tf-idf on Wikipedia and Google News
        • K=50 (size of the ASCII alphabet = 52)
      – (2) "Commonly Used Kanji" lists provided for Japanese and Chinese
        • Simplified Chinese: 现代汉语常用字表 (3500)
        • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
        • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
          – 常用漢字 doesn't contain many Kanji for person names and place names
    • Generate 130 clusters from the product of (1) and (2)
  • Normalization for twitter
    • Simply remove
      – URLs
      – mentions
      – hashtags
      – RT
      – emoticons made of letters, like XD, :p
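A rough sketch of these removals with regular expressions. The patterns below are illustrative assumptions, not the ones ldig ships with; in particular the emoticon pattern covers only a handful of forms such as XD, :p and :D.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as :p, :D, ;P, XD, xDDD (a small, assumed subset).
EMOTICON = re.compile(r"(?<!\w)(?:[:;=]-?[dpDP)(]+|[xX]D+)(?!\w)")


def normalize_tweet(text):
    """Strip URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(normalize_tweet("RT @xxx tack kompis! XD http://t.co/abc #fb :p"))
```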
  • Normalization for twitter-Specific Representation
    • How to handle 'coooooooollllll'
    • Case 1: Make a normalization dictionary using [Brody+ 2011]
      – Unsupervised normalization like coooollll → cool
      – It can't handle words that are not in the dictionary
    • Case 2: If the same character is repeated 3 or more times in a row, shrink it to 2
      – No language has 3 or more consecutive identical Latin letters in its orthography.
        • Japanese, by contrast, has "かたたたき", "かわいいいぬ", "あわててて" and so on.
        • Acronyms (like WWW, СССР) are not useful for language detection
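Case 2 can be expressed as a single regular-expression substitution. A minimal sketch, assuming "3 or more repeats of the same letter become 2" and restricting the rule to letters; the function name is hypothetical.

```python
import re

# A letter (not a digit or underscore) followed by two or more copies of itself.
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")


def shrink_repeats(text):
    """Shrink runs of 3 or more identical letters down to 2."""
    return REPEATED_LETTER.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll!!!"))
```

Note that this yields "cooll" rather than "cool"; recovering the dictionary form is what Case 1 is for.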
  • Laugh Normalization
    • Each language has its own ways of writing laughter
      – HOW MUCH DO YOU LOVE COACH BEISTE??? HHAHAHAHAHAH
      – Hihihihi. :) Habe ich regulär 2x die Woche!
      – Tafil con eso...!!! Jajajajajajaja
      – Malo?? Jejejeje XP
      – kekeke chỗ đó làm áo được ko em?
    • Shrink them to two repetitions
      – hahahha ⇒ haha
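One possible way to shrink laughs, including irregular ones like "hahahha", is a regular expression over a consonant/vowel pair. The pattern below is an assumption made for this sketch (ldig's actual rule may differ); it expects lowercased input, only knows the consonants h, j, k, and requires at least three syllables so that ordinary words are left alone.

```python
import re

# One consonant (h, j or k) and one vowel, repeated as 3 or more syllables,
# tolerating doubled letters as in "hahahha".
LAUGH = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+){2,}\1*\b")


def shrink_laughs(text):
    """Replace a laugh such as 'jajajaja' with two syllables ('jaja')."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "tafil con eso...!!! jajajajajajaja", "kekeke chỗ đó"):
    print(shrink_laughs(sample))
```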
  • Implementation and Evaluation
  • Language Detection with Infinity-Gram (ldig)
    • Tweet language detection for Latin alphabet languages
      – https://github.com/shuyo/ldig
        • MIT license
        • The trained model is also distributed here
      – ∞-gram LR (maximal substring) [Okanohara+ 09]
      – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
      – Double Array
  • Usage (1) Model Initialization
    • ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
      – Extract features from the corpus and initialize the model
      – -m : model directory
      – -x : path of the maximal substring extractor (executed as an external process)
      – --ff : ignore features whose frequency is less than the specified value
  • Maximal String Extractor
    • maxsubst [input file] [output file]
      – Input is multi-line text
        • TABs in it are replaced with " " and line feeds with U+0001
      – Output lines have the form "[features]\t[frequency]"
  • Usage (2) Learn
    • ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
      – Train the model on the corpus with 1 cycle of SGD
      – -e : learning rate of SGD
      – -r : regularizer of L1 regularization
      – --wr : how often to regularize all parameters
        • There are too many parameters to regularize all of them at every step
  • Usage (3) Shrink Model
    • ldig.py -m [model] --shrink
      – Remove ineffective features (those whose parameters are all 0) from the model
  • Usage (4) Detect Language
    • ldig.py -m [model] [test data]
      – Detect the languages of the test data and output the results and a summary
  • Data Format
    • Training and test data
      – [correct label]\t[meta data]\t[text]

      en u should just enjoy ur vacation sadly
      en :D im online but you arent RT that much
      en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
      ca [status ID] [datetime] [userID] [language of UI] @xxx xDDD no mextranya... Tal volta haguera segut millor per a la humanitat que no lhaguera vist... you know..xDD
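Reading such a line could look like the sketch below. It assumes the fields are TAB-separated, that the label comes first and the text last, and that anything in between is metadata (possibly several columns, as in the Catalan example). The function name and the sample values are made up for illustration.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata_fields, text)."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: first field is the label, last field is the tweet text.
    return fields[0], fields[1:-1], fields[-1]


sample = "ca\t123\t2012-05-14\tuser\tes\t@xxx xDDD no\n"  # hypothetical values
print(parse_line(sample))
```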
  • Usage (5) Evaluation Tool
    • server.py -m [model] -p [port number]
      – Open http://localhost:[port] after it starts
      – For text entered in the text area, it outputs the language probabilities, the contained features and their parameters
  • Evaluation

      code   language       size    detect   correct  precision  recall   LD53   LDsm
      ca     Catalan        5,093    4,923     4,857     98.66    95.37   95.3   97.0
      cs     Czech          7,681    7,668     7,663     99.93    99.77   96.3   99.7
      da     Danish         5,516    5,472     5,310     97.04    96.27   94.5   92.4
      de     German        10,060   10,069    10,006     99.37    99.46   86.6   93.8
      en     English       10,162   10,133    10,029     98.97    98.69   88.3   95.0
      es     Spanish       10,244   10,284    10,120     98.41    98.79   91.5   96.0
      fi     Finnish        7,051    7,038     7,024     99.80    99.62   98.9   99.6
      fr     French        10,074   10,134    10,051     99.18    99.77   95.0   98.1
      hu     Hungarian      4,904    4,892     4,858     99.30    99.06   85.8   95.5
      id     Indonesian    10,178   10,225    10,160     99.36    99.82   89.7   98.9
      it     Italian       10,143   10,205    10,103     99.00    99.61   96.2   98.0
      nl     Dutch         10,005    9,916     9,858     99.42    98.53   69.5   97.4
      no     Norwegian      8,504    8,432     8,201     97.26    96.44   96.0   96.3
      pl     Polish        10,151   10,149    10,130     99.81    99.79   98.0   99.7
      pt     Portuguese    10,212   10,201    10,119     99.20    99.09   88.0   96.9
      ro     Romanian       5,913    5,867     5,850     99.71    98.93   92.8   97.4
      sv     Swedish       10,025   10,093     9,942     98.50    99.17   96.0   97.9
      tr     Turkish       10,308   10,317    10,298     99.82    99.90   97.6   99.5
      vi     Vietnamese    10,487   10,480    10,474     99.94    99.88   98.7   99.2
      total               166,711            165,053              99.01   92.2   97.4

    • LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on the twitter corpus
    • Because a text whose maximum probability is below 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
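The precision and recall columns are consistent with precision = correct / detect and recall = correct / size, which the sketch below recomputes for the Catalan row. The 0.6 threshold comes from the slide; the function names and the probability dictionaries are hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent for one language row."""
    return 100.0 * correct / detect, 100.0 * correct / size


def detect_language(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    lang, p = max(probabilities.items(), key=lambda kv: kv[1])
    return lang if p >= threshold else None


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(round(precision, 2), round(recall, 2))
print(detect_language({"da": 0.55, "no": 0.45}))
```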
  • Evaluation on the LIGA Dataset
    • Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
      – http://www.win.tue.nl/~mpechen/projects/smm/

      code   language   size   detect  correct  precision  recall
      de     German     1479    1476     1469      99.5     99.3
      en     English    1505    1502     1490      99.2     99.0
      es     Spanish    1562    1548     1541      99.6     98.7
      fr     French     1551    1549     1540      99.4     99.3
      it     Italian    1539    1531     1528      99.8     99.3
      nl     Dutch      1430    1429     1424      99.7     99.6
      total             9066             8992               99.2

    ※ The 19-language model is used
  • Evaluation on the Europarl Dataset

                                      ldig             langdetect           CLD
      code   language     size   correct   rate    correct   rate    correct   rate
      bg     Bulgarian    1000                        988    98.8%      991    99.1%
      cs     Czech        1000     1000   100.0%      994    99.4%      995    99.5%
      da     Danish       1000      976    97.6%      968    96.8%      932    93.2%
      de     German       1000      999    99.9%      998    99.8%     1000   100.0%
      el     Greek        1000                       1000   100.0%     1000   100.0%
      en     English      1000      999    99.9%      996    99.6%     1000   100.0%
      es     Spanish      1000     1000   100.0%      996    99.6%      989    98.9%
      et     Estonian     1000                        996    99.6%      998    99.8%
      fi     Finnish      1000      997    99.7%      998    99.8%     1000   100.0%
      fr     French       1000      999    99.9%      999    99.9%      992    99.2%
      hu     Hungarian    1000     1000   100.0%      999    99.9%      999    99.9%
      it     Italian      1000      999    99.9%      999    99.9%      996    99.6%
      lt     Lithuanian   1000                        997    99.7%      999    99.9%
      lv     Latvian      1000                        999    99.9%      998    99.8%
      nl     Dutch        1000     1000   100.0%      974    97.4%      995    99.5%
      pl     Polish       1000      998    99.8%      999    99.9%      997    99.7%
      pt     Portuguese   1000      995    99.5%      996    99.6%      989    98.9%
      ro     Romanian     1000     1000   100.0%      999    99.9%      998    99.8%
      sk     Slovak       1000                        988    98.8%      990    99.0%
      sl     Slovene      1000                        976    97.6%      963    96.3%
      sv     Swedish      1000      995    99.5%      991    99.1%      993    99.3%
      total              21000    13957    99.7%    20850    99.3%    20814    99.1%

    ※ The ldig total covers only the languages ldig supports
  • Conclusions
    • Language detector using the maximal substring model
      – Detects 19 languages with over 99% accuracy.
      – Even langdetect reaches 97% accuracy with the tweet corpus.
    • If the corpus is maintained, precision should improve further.
      – There are still many mistakes (in particular da and no)
    • If metadata is added to the features, precision should improve further.
      – How can metadata be added and trained at low cost?
    • We want to shrink the model without loss of precision.
      – Too large for applications (>100MB)
  • References
    • [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
    • [Okanohara+ 09] Text Categorization with All Substring Features
    • [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
    • [Cavnar+ 94] N-Gram-Based Text Categorization
    • [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty