This document describes a neural network approach for language identification. It discusses extracting features from text such as alphabet characters and character sequences (unigrams, bigrams, trigrams) that are common in different languages. Training data is prepared from texts of over 105 languages on the TED website, with out-of-vocabulary words removed. The neural network architecture has an input layer for features, hidden layers, and an output layer for language predictions. Alphabet features count Unicode character classes and are binarized. Trigrams are used as sequence features to aid comparisons between languages.
Language Identification: A neural network approach
1. Language Identification:
a Neural Network approach
Alberto Simões1 José João Almeida2 Simon D. Byers3
1CEHUM, Minho's University
ambs@ilch.uminho.pt
2CCTC, Minho's University
jj@di.uminho.pt
3AT&T Labs, Bedminster NJ
headers@gmail.com
SLATE2014, 19--20th June 2014
Alberto Simões, José João Almeida, Simon D. Byers Language Identification: a Neural Network approach
2. In which languages are these texts?
Malgranda Sablodezerto estas
dezerto de Okcidenta Aŭstralio
Esperanto
Po nepavykusių pirmųjų
bandymų su kukurūzais
Lithuanian
5. In which languages are these texts?
俄罗斯眼下不具备航母建造、
停泊和维护所需的基础设施和条件
Simplified Chinese
임금체계 개편은 기본적으로
노사 합의 또는
Korean
8. In which languages are these texts?
جلوگیری .کردند گروه دوم هم به
Persian
আেবদনকারীেদর পক্েষ শুনািন কেরন িফদা
Bengali
11. In which languages are these texts?
ဦးသိန္းစိန္အစိုးရရဲ
ဝန္ကီးအမ်ားစုဟာ စစ္ဗုိလ္နဲ
စစ္ဗိုလ္လူထြက္ေတြ
Burmese
આ રસ મ લ િનચોડી સારી
રી િમકસ કરો અ લાસમ
Gujarati
14. Approaches
Using a dictionary of words for each language:
Problem: amount of word forms!
Using language features:
compute unigrams, bigrams, trigrams, …;
compute short words;
compute word beginnings or terminations;
Then use language models:
Naïve Bayes;
Hidden Markov Models (HMM);
Support Vector Machines (SVM);
Neural Networks (NN);
17. Motivation for a new tool
lack of a decent identification tool for Perl;
use of Chrome Language Detection library is limited:
how to add new languages?
how to restrict results to specific languages?
there are tools for other programming languages:
language interoperability can be a hassle;
not clear how to add new languages;
18. Why use a Neural Network?
learn how Neural Networks work!
an approach where:
training is tedious and slow;
identification is easy to implement;
identification is efficient when BLAS is available;
therefore:
possible to use trained data in different programming languages;
easy to restrict analysis to a set of languages;
identification probabilities are comparable;
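Restricting the analysis to a set of languages can be sketched as a filter over the network outputs. This is only an illustration: the function name `identify` and the score values are hypothetical, and the scores are assumed to be one output value per language.

```python
def identify(scores, allowed=None):
    """Return the best-scoring language, optionally restricted to `allowed`.

    `scores` maps a language code to the network output for that language.
    Both the name and the data layout are assumptions made for this sketch.
    """
    if allowed is None:
        candidates = dict(scores)
    else:
        candidates = {lang: s for lang, s in scores.items() if lang in allowed}
    if not candidates:
        raise ValueError("no candidate languages left after restriction")
    # Outputs are comparable across languages, so the maximum is the answer.
    return max(candidates, key=candidates.get)


# Hypothetical output values, for illustration only.
example_scores = {"pt": 0.61, "es": 0.58, "it": 0.12}
best_overall = identify(example_scores)
best_restricted = identify(example_scores, allowed={"es", "it"})
```

Because the restriction happens after the forward pass, the same trained weights serve any subset of languages.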
19. Neural Network Architecture
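[Figure: fully connected network with three layers]
input layer (features): x1, x2, x3, ..., xn
hidden layer: a(2)_1, a(2)_2, a(2)_3, ..., a(2)_s2
output layer: y1, y2, ..., yK
weights: Θ(1) (input layer to hidden layer), Θ(2) (hidden layer to output layer)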
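The identification step for such an architecture is a plain forward pass. The sketch below assumes a sigmoid activation and a bias unit prepended to each layer; neither is stated on the slide, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function (the activation is an assumption)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def layer(theta, inputs):
    """Apply one layer: prepend a bias unit of 1, then sigmoid(theta x inputs)."""
    with_bias = [1.0] + list(inputs)
    return [sigmoid(sum(w * v for w, v in zip(row, with_bias))) for row in theta]


def forward(theta1, theta2, features):
    """Map the feature vector x to the outputs y through one hidden layer."""
    hidden = layer(theta1, features)   # a(2), computed with Theta(1)
    return layer(theta2, hidden)       # y, computed with Theta(2)


# Tiny all-zero weights: 2 inputs, 2 hidden units, 1 output.
theta1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
theta2 = [[0.0, 0.0, 0.0]]
outputs = forward(theta1, theta2, [0.3, 0.7])
```

Each row of a weight matrix holds the bias weight first, followed by one weight per unit of the previous layer. Only these two matrices are needed at identification time.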
20. Preparing Training Data
texts from TED website;
more than 105 languages available!
English texts were matched against an English dictionary;
out-of-vocabulary (OOV) items are removed from the English texts and from
the other language texts (an attempt to remove named entities written in
their English form from the other texts).
Example
…began spoken word poet Sarah Kay, in a talk that inspired two
standing ovations at TED2011. She tells the story of her
metamorphosis — from a wide-eyed teenager soaking in verse at
New York's Bowery Poetry Club to a teacher connecting kids with
the power of self-expression through Project V.O.I.C.E. — and
gives two breathtaking performances of "B" and "Hiroshima."
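The OOV filtering step can be sketched as follows. The tokenizer (a simple `\w+` regular expression), the lowercased dictionary lookup, and the function names are assumptions for illustration, not the authors' implementation.

```python
import re


def find_oov(english_text, dictionary):
    """Collect tokens of the English text that are not in the dictionary."""
    tokens = re.findall(r"\w+", english_text)
    return {tok for tok in tokens if tok.lower() not in dictionary}


def remove_tokens(text, tokens_to_drop):
    """Drop the given tokens from a text (punctuation is lost as a side effect)."""
    kept = [tok for tok in re.findall(r"\w+", text) if tok not in tokens_to_drop]
    return " ".join(kept)


# Toy dictionary and texts, for illustration only.
dictionary = {"poet", "in", "a", "talk"}
oov = find_oov("poet Sarah Kay in a talk", dictionary)
cleaned_english = remove_tokens("poet Sarah Kay in a talk", oov)
cleaned_other = remove_tokens("a poeta Sarah Kay fala", oov)
```

The OOV set is computed once from the English text and then applied to the texts in the other languages, which is what removes names left in their English form.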
23. Two Kinds of Features
Used Alphabet
Which are the computer characters used in the text?
Are they usually used in Asiatic, Arabic or Latin text?
Used Sequences of Characters
Which unigrams, bigrams or trigrams are used?
Which are most common for each language?
25. Alphabet Features
Count number of Unicode characters in the following classes:
C1 Latin characters, only a-z, without diacritics;
C2 Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
C3 Hiragana and Katakana characters (0x3040-0x30FF);
C4 Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF,
0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
C5 Kanji characters (0x4E00-0x9FAF);
C6 Simplified Chinese characters (2877 hand-defined characters);
C7 Traditional Chinese characters (2663 hand-defined characters);
C8 Arabic characters (0x0600-0x06FF);
C9 Thai characters (0x0E00-0x0E7F);
C10 Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
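Counting characters per class could look like the sketch below. The code point ranges come from the list above; the two Chinese character sets are tiny placeholders standing in for the hand-defined lists, and lowercasing the text before counting a-z is an assumption.

```python
RANGES = {
    "C1": [(0x0061, 0x007A)],                      # a-z only
    "C2": [(0x0410, 0x042F), (0x0430, 0x044F)],    # Cyrillic
    "C3": [(0x3040, 0x30FF)],                      # Hiragana and Katakana
    "C4": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
           (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],    # Hangul
    "C5": [(0x4E00, 0x9FAF)],                      # Kanji
    "C8": [(0x0600, 0x06FF)],                      # Arabic
    "C9": [(0x0E00, 0x0E7F)],                      # Thai
    "C10": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],   # Greek
}

# Placeholders only: the real C6/C7 lists hold 2877 and 2663 characters.
SIMPLIFIED_SAMPLE = set("\u8fd9\u4eec\u56fd")
TRADITIONAL_SAMPLE = set("\u9019\u5011\u570b")


def count_alphabet_classes(text):
    """Count how many characters of `text` fall in each class C1..C10."""
    counts = {f"C{i}": 0 for i in range(1, 11)}
    for ch in text.lower():
        cp = ord(ch)
        for name, ranges in RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
        if ch in SIMPLIFIED_SAMPLE:
            counts["C6"] += 1
        if ch in TRADITIONAL_SAMPLE:
            counts["C7"] += 1
    return counts


mixed = count_alphabet_classes("abc \u041f\u0440\u0438\u0432\u0435\u0442")
chinese = count_alphabet_classes("\u8fd9\u662f")
```

A character can count for more than one class: Chinese characters fall in the Kanji range (C5) and may also be in one of the hand-defined sets.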
26. Binarization of Alphabet Features
In order to reduce entropy in the NN:
Alphabet features are binarized using a set of rules:
set C1 ⇐ C1 > 0.20
set C2 ⇐ C2 > 0.20
set C3 ⇐ C3 > 0.20
set C4 ⇐ C4 > 0.20
set C6 ⇐ C5 > 0.30 ∧ C6 > C7
set C7 ⇐ C5 > 0.30 ∧ C6 < C7
set C8 ⇐ C8 > 0.20
set C9 ⇐ C9 > 0.20
set C10 ⇐ C10 > 0.20
where
set Ci ⇔ Ci ← 1 ∧ ∀ j ≠ i : Cj ← 0
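A possible reading of these rules in code follows. Several points are assumptions: the inputs are taken to be per-class ratios of the characters in a text, the thresholds are treated as strict "greater than" comparisons, the first matching rule wins, and all features stay at 0 when no rule matches.

```python
CLASSES = [f"C{i}" for i in range(1, 11)]

# (class to set, condition on the ratios), in the order of the rule list.
RULES = [
    ("C1", lambda r: r["C1"] > 0.20),
    ("C2", lambda r: r["C2"] > 0.20),
    ("C3", lambda r: r["C3"] > 0.20),
    ("C4", lambda r: r["C4"] > 0.20),
    ("C6", lambda r: r["C5"] > 0.30 and r["C6"] > r["C7"]),
    ("C7", lambda r: r["C5"] > 0.30 and r["C6"] < r["C7"]),
    ("C8", lambda r: r["C8"] > 0.20),
    ("C9", lambda r: r["C9"] > 0.20),
    ("C10", lambda r: r["C10"] > 0.20),
]


def binarize(ratios):
    """Turn per-class ratios into 0/1 features with at most one class set."""
    r = {c: ratios.get(c, 0.0) for c in CLASSES}
    bits = {c: 0 for c in CLASSES}
    for target, condition in RULES:
        if condition(r):
            bits[target] = 1   # "set Ci": Ci becomes 1, every other class stays 0
            break
    return bits


latin = binarize({"C1": 0.9})
simplified = binarize({"C5": 0.5, "C6": 0.3, "C7": 0.1})
nothing = binarize({"C1": 0.05})
```

Note that C5 (Kanji) never becomes 1 itself; it only gates the choice between the simplified and traditional Chinese classes.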
27. Trigram Features
Why Trigrams?
bigrams would be too small when comparing very close
languages like Portuguese and Spanish;
tetragrams would be too big for some languages (like Asiatic ones),
where some glyphs represent words or morphemes;
as punctuation and numbers were removed and spaces normalized,
trigrams can also capture the beginning or end of words, as well as
single-character words that appear surrounded by spaces.
28. Trigram Features: example
Für mich war das eine neue Erkenntnis. Und ich denke, mit der
Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider
haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur
eine Form kultureller Integration. Wir haben erkannt, dass seit
kurzem immer mehr Leutea
Top occurring trigrams
en␣ 0.02299 er␣ 0.02682 ␣de 0.01533 abe 0.01533 der 0.01149
hab 0.01149 ich 0.01149 ir␣ 0.01149 it␣ 0.01149 r␣h 0.01149
␣wi 0.01149 ben 0.01149 ch␣ 0.01149 den 0.01149 wir 0.01149
␣ha 0.01149 ine 0.00766 ler 0.00766 lle 0.00766 n␣k 0.00766
mme 0.00766 ne␣ 0.00766 nnt 0.00766 r␣l 0.00766 r␣m 0.00766
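Relative trigram frequencies like these can be computed with a short sketch. The normalization shown here (lowercasing, replacing punctuation and digits with spaces, collapsing whitespace) is an assumption based on the description of the trigram features, and the function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, digits and underscores, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def trigram_frequencies(text):
    """Return each character trigram with its relative frequency."""
    clean = normalize(text)
    grams = [clean[i:i + 3] for i in range(len(clean) - 2)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


freqs = trigram_frequencies("Wir haben, wir!")
top = sorted(freqs, key=freqs.get, reverse=True)[:5]
```

Spaces are kept inside trigrams, so word boundaries such as "en " or " wi" become features of their own.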
30. Trigram Features: Merging
features ← {};
for L ∈ ℒ do
    trigrams ← ∅;
    for file ∈ Files_L do
        T ← computeTrigrams(file);    // Str → ℕ
        T ← mostOccurring(T);         // top 30 trigrams
        for t ∈ keys(T) do
            trigrams[t] ← trigrams[t] + 1;
    trigrams ← mostOccurring(trigrams);
    features ← features ∪ keys(trigrams);
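The merging procedure can be sketched in Python as below. It assumes the last `mostOccurring` call of each language iteration applies to the per-language tally, takes texts in memory instead of files, and uses hypothetical function names.

```python
from collections import Counter


def compute_trigrams(text):
    """Count the character trigrams of a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def most_occurring(counts, n=30):
    """Keep only the n most frequent entries of a Counter."""
    return Counter(dict(counts.most_common(n)))


def merge_trigram_features(texts_by_language, n=30):
    """Union, over all languages, of the trigrams most often in a per-file top n."""
    features = set()
    for language, texts in texts_by_language.items():
        tally = Counter()
        for text in texts:
            top = most_occurring(compute_trigrams(text), n)
            for trigram in top:
                tally[trigram] += 1    # number of files where it made the top n
        features |= set(most_occurring(tally, n))
    return sorted(features)


merged = merge_trigram_features({"pt": ["para"], "en": ["part"]})
```

Counting in how many files a trigram reaches the top n, instead of summing raw counts, keeps one long file from dominating a language's feature set.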
31. Training Data Matrix (excerpt)
      Alphabet Features        Trigram Features
Lang  Latin  Greek  Cyril.     ␣pa     ới␣     par     nia     ест     ати.    ата
PT    1      0      0          0.0041  0       0.0038  0.0001  0       0       0
PT    1      0      0          0.0039  0       0.0036  0       0       0       0
RU    0      0      1          0       0       0       0       0.0020  0.0004  0.0003
RU    0      0      1          0       0       0       0       0.0026  0.0005  0.0002
UK    0      0      1          0       0       0       0       0.0003  0.0034  0.0001
UK    0      0      1          0       0       0       0       0.0003  0.0026  0.0001
VI    1      0      0          0       0.0028  0       0       0       0       0
VI    1      0      0          0       0.0029  0       0.0001  0       0       0
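One row of such a matrix can be assembled from the two kinds of features. The sketch below is illustrative: the function name and the dictionary-based inputs are assumptions, and the column order is whatever fixed order the feature lists define.

```python
def build_row(alphabet_bits, trigram_freqs, alphabet_order, trigram_order):
    """Concatenate binarized alphabet features and trigram frequencies.

    Missing entries default to 0, so every row has the same length.
    """
    alphabet_part = [alphabet_bits.get(name, 0) for name in alphabet_order]
    trigram_part = [trigram_freqs.get(gram, 0.0) for gram in trigram_order]
    return alphabet_part + trigram_part


# Column order and values loosely follow the Portuguese rows of the excerpt.
row = build_row(
    {"Latin": 1},
    {" pa": 0.0041, "par": 0.0038},
    ["Latin", "Greek", "Cyril."],
    [" pa", "\u1edbi ", "par"],
)
```

Trigrams outside the merged feature list are ignored, and feature trigrams absent from a text contribute 0.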
32. Experiment 1: 25 languages
Arabic (AR)
Bulgarian (BG)
German (DE)
Modern Greek (EL)
Spanish (ES)
Persian (FA)
French (FR)
Hebrew (HE)
Hungarian (HU)
Italian (IT)
Japanese (JA)
Korean (KO)
Dutch (NL)
Polish (PL)
Portuguese (PT)
Brazilian Portuguese (PT-BR)
Romanian (RO)
Russian (RU)
Serbian (SR)
Thai (TH)
Turkish (TR)
Ukrainian (UK)
Vietnamese (VI)
Traditional Chinese (ZH-TW)
Simplified Chinese (ZH-CN)
33. Exp 1: Training and Test Sets
Training Set (30 files/lang): columns 2-5; Test Set (21 files/lang): columns 6-9
Lang. Smallest Largest Mean (x̄) σ Smallest Largest Mean (x̄) σ
ar 871921 969387 907562 21392 863 4618 2366 1210
bg 988450 1087435 1027581 23663 660 2099 1091 378
de 588200 653508 618463 16475 677 3890 1554 842
el 773265 885770 841203 22653 550 3297 1590 705
es 578806 651240 617341 17637 897 3850 2342 935
fa 651807 766206 697212 28994 600 5221 1338 967
fr 639582 705675 673414 15377 936 4088 1879 689
he 806098 877218 836222 20545 559 3649 1586 878
hu 406271 454506 431797 13131 729 6045 2175 1356
it 588147 643252 616391 14348 1260 6607 2991 1370
ja 538033 606053 569956 18871 323 785 495 133
ko 737118 817651 773168 20550 530 1603 780 233
nl 533497 580313 557724 14033 552 1949 1115 381
pl 521184 591299 551259 17938 435 3092 1605 694
pt-br 596158 643215 617734 14028 920 3189 1953 589
pt 338272 378872 355800 10605 486 5875 2031 1169
ro 592714 650375 616051 15442 718 3254 1438 695
ru 1019789 1144200 1069884 31232 662 2470 1444 526
sr 349389 433221 379344 20560 834 6493 1813 1263
th 529484 601244 565082 18551 334 3242 1396 734
tr 494191 549998 524271 12774 332 5390 1559 1121
uk 370785 434683 395312 16641 299 15435 2430 3553
vi 470057 541930 510409 17246 680 6237 1555 1359
zh-cn 536438 595027 562728 14457 495 6331 1695 1559
zh-tw 514993 588860 542879 16000 270 1721 925 428
34. Exp1: Accuracy
Language 1500 iters. 4000 iters.
ar, bg, de 100% 100%
el, es, fa 100% 100%
fr, he, hu 100% 100%
it, ja, ko 100% 100%
nl, pl 100% 100%
pt 5% 52% (errors are classified as pt-br)
pt-br 100% 76% (errors are classified as pt)
ro, ru, sr 100% 100%
th, tr, uk 100% 100%
vi, zh-cn, zh-tw 100% 100%
35. Exp1: Comparison of PT variants
[Figure: two panels, PT and PT-BR]
36. Experiment 2: 55 languages
Afrikaans
Albanian
Arabic
Bulgarian
Bengali
Catalan
Czech
Danish
German
Modern Greek
English
Esperanto
Spanish
Estonian
Persian
Finnish
French
Galician
Gujarati
Hebrew
Hindi
Hungarian
Armenian
Indonesian
Italian
Japanese
Georgian
Kannada
Korean
Kurdish
Lithuanian
Latvian
Macedonian
Malayalam
Marathi
Burmese
Nepali
Dutch
Polish
Portuguese
Romanian
Russian
Slovak
Slovenian
Somali
Serbian
Swedish
Tamil
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Chinese (simplified)
Chinese (traditional)
37. Exp 2: Results
55 languages,
1126 features,
the Θ(l) matrices take 11 MB on disk (binary format),
7500 iterations of the learning algorithm,
taking 6574 minutes and 50.386 seconds (more than 4.5 days),
still 21 test files per language,
46 seconds to run over the 1155 test files,
accuracy of 99.740%,
mis-identifications:
2 Bulgarian texts detected as Macedonian,
1 Danish text detected as Dutch.
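The reported figures are consistent with each other, as a quick arithmetic check shows. The variable names are only for illustration; every number comes from the list above.

```python
languages = 55
test_files_per_language = 21
test_files = languages * test_files_per_language      # 1155 test files

misidentified = 2 + 1                                  # Bulgarian and Danish cases
accuracy = (test_files - misidentified) / test_files

training_minutes = 6574 + 50.386 / 60
training_days = training_minutes / (60 * 24)

print(f"{test_files} test files, accuracy {accuracy:.3%}")
print(f"training time: {training_days:.2f} days")
```

Three errors in 1155 files give the stated 99.740%, and 6574 minutes is a little over four and a half days.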
38. Conclusions
Up to 96% accuracy when testing a few languages,
including two Portuguese variants;
Over 99.7% accuracy for 55 languages;
NNs can grow to more languages, but training time grows excessively;
The choice of features is relevant;
(if we know a specific detail will be good to distinguish a
language, add it to the network!)
Obtained results are not "deterministic". Although the same
proportion of results is expected, the random initialization of
the network may lead to somewhat different results for different
numbers of iterations.
39. Future Work
Reduce the number of trigrams per language and include unigrams;
Compute distribution differences between close languages;
Experiment with training a different neural network for
each alphabet;
Include a regularization coefficient (λ ≠ 0);
Experiment with Deep Neural Networks;
Test language identification on short texts (namely Twitter
tweets).