IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 42
LANGUAGE IDENTIFICATION USING G-LDA
Shubham Saini1, Bhavesh Kasliwal2, Shraey Bhatia3
1, 2, 3 Student, School of Computing Science and Engineering, Vellore Institute of Technology, India,
shubham.saini2010@vit.ac.in, bhavesh.kasliwal2010@vit.ac.in, shraey.bhatia2010@vit.ac.in
Abstract
Language identification plays an important role in natural language processing applications as a pre-processing step. Several
mechanisms in use today achieve this task with high recognition rates.
Recent years have seen rapid growth in international communication, which has led to a need for systems capable of
correctly identifying the language of a document. Possible applications of language identification include information retrieval, web
crawlers, text mining and email filtering.
This paper uses a process called G-LDA [1], which combines concepts from Latent Dirichlet Allocation (LDA) and genetic evolution
techniques. The process builds a set of words that occur with high frequency in any given document of a language. The method was
tested on the Leipzig Corpora. The phrases evolved over successive generations showed a significant improvement in fitness.
Keywords: Language Identification, Latent Dirichlet Allocation, Gibbs Sampling, Genetic Algorithm, Topic Modeling,
Breeding, Fitness, Roulette Wheel.
---------------------------------------------------------------------***-------------------------------------------------------------------------
1. INTRODUCTION
Language identification is the problem of automatically
determining, by computer, the language in which a document is
written. It plays an important role in natural language
processing applications as a pre-processing step.
Recent years have seen rapid growth in international
communication, which has led to a need for systems capable of
correctly identifying the language of a document. Also, with
the spread of world-wide Internet access, text is available in
a great number of languages other than English. Automatic
processing of these texts for purposes such as web crawling,
reading aids and indexing needs a preliminary identification
of the language used. Likewise, any system involving
dictionary access must identify the language before applying
any language-specific operations.
This paper uses a process called G-LDA, which combines
concepts from Latent Dirichlet Allocation [2] (LDA) and
genetic evolution techniques. The process builds a set of
words that occur with high frequency in any given document of
a language. The method was tested on the Leipzig Corpora. The
phrases evolved over successive generations showed a
significant improvement in fitness.
LDA is a way of automatically discovering topics within a
document collection. Each word in the collection is assigned
to a particular topic. When the technique is applied to a
language identification corpus, each topic is expected to
correspond to a language, and the result is a set of words
having a high probability of occurrence within any given
document of that language.
The sets of words generated in this way are then subjected to
genetic evolution techniques. This yields an improved set of
phrases with a higher probability of occurrence within a
document.
It was found that the method worked well without any
pre-processing such as tokenization, unlike the N-gram based
approach [7], and can handle, for example, Arabic documents
and documents encoded in Unicode.
The method is also very easy to understand and implement,
unlike the PPM approach [8], and achieves comparable
detection rates.
2. SET GENERATION
A subset of sentences from the Leipzig Corpora was taken. The
languages chosen were Arabic, English, Gujarati, Hindi and
Italian. All the sentences from these five languages were
combined into a single document, and sets of words were
generated from this document using the JGibbLDA [3] package.
JGibbLDA is a Java implementation of LDA using the Gibbs
sampling technique. A total of 5 sets with 200 words each
were generated, each set representing a single language.
Next, for each word in each language set, the number of lines
in that language's training data that contain the word (Ne)
was found. Genetic evolution techniques were applied to the
words with Ne > 0.
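The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name count_matching_lines, the toy sentences and the candidate words are all hypothetical, and a "match" is assumed to mean that the word occurs in the line as a plain substring (consistent with the paper's remark that no tokenization is needed).

```python
def count_matching_lines(word, lines):
    """Return Ne: how many lines contain `word` as a substring (assumed match rule)."""
    return sum(1 for line in lines if word in line)


# Toy stand-in for one language's training sentences (not Leipzig data).
training_lines = [
    "the cat sat on the mat",
    "a dog and a cat",
    "rain is expected today",
]
candidate_words = ["cat", "and", "zebra"]

# Ne for every candidate word.
ne = {w: count_matching_lines(w, training_lines) for w in candidate_words}

# Only words with Ne > 0 go on to the genetic evolution step.
survivors = [w for w in candidate_words if ne[w] > 0]
```

Here "zebra" matches no line and is dropped before breeding starts.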
3. GENETIC EVOLUTION ALGORITHM
The genetic algorithm was applied as follows:
• Loop for n generations
   o Compute the fitness function for each phrase
   o Breed using roulette wheel selection
   o Filter to produce a new generation
3.1 Fitness Function
Based on the number of lines matched, a fitness value is
assigned to each phrase:
$$\text{Fitness} = (\text{No. of sentences containing the phrase}) \times \left(1 + \frac{N - N_w}{N}\right)$$
Where
N = Threshold value
Nw = No. of words comprising the phrase
The threshold value is used to penalize long phrases. Any
phrase containing more than N words receives a fitness
penalty, phrases with fewer than N words receive a bonus, and
a phrase with exactly N words is left unchanged.
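As a worked illustration of the fitness formula, the sketch below evaluates it for a short, a threshold-length and a long phrase. The function name fitness and the sentence count of 50 are illustrative assumptions; the default threshold of 5 is the value the paper reports choosing in Section 4.

```python
def fitness(sentence_count, num_words, threshold=5):
    """Fitness = sentence_count * (1 + (N - Nw) / N), with N = threshold."""
    return sentence_count * (1 + (threshold - num_words) / threshold)


# Three phrases that each match 50 sentences but differ in length.
bonus = fitness(50, 1)      # 1 word, below the threshold: weight 1.8
neutral = fitness(50, 5)    # exactly N words: weight 1.0
penalized = fitness(50, 7)  # 7 words, above the threshold: weight 0.6
```

With equal sentence counts, the one-word phrase scores about 90, the five-word phrase 50 and the seven-word phrase about 30, so shorter phrases are preferred unless a longer one matches many more sentences.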
3.2 Breeding
The process creates 1 child from 2 parents using the OR
Method. Each survivor from the previous generation breeds with
a 2nd parent selected using Roulette Wheel Selection.
The OR Method: Take 2 parents, such as (Lottery|Cash) and
(Cash|Win), and breed them to produce the child
(Lottery|Cash|Win). Here, Nw = 3 as the phrase comprises 3
words.
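The OR Method can be sketched as an order-preserving union of the parents' words. Representing a phrase as a tuple of words and the name breed_or are assumptions made for this illustration only.

```python
def breed_or(parent_a, parent_b):
    """Combine two phrases into one child, keeping each word only once."""
    child = list(parent_a)
    for word in parent_b:
        if word not in child:  # a word shared by both parents appears once
            child.append(word)
    return tuple(child)


# The example from the text: (Lottery|Cash) bred with (Cash|Win).
child = breed_or(("Lottery", "Cash"), ("Cash", "Win"))
num_words = len(child)  # this is Nw for the child phrase
```

The shared word "Cash" is kept once, so the child has Nw = 3 rather than 4.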
Roulette Wheel Selection: A method for choosing a phrase from
the breeding pool. Phrases with better fitness are more likely
to be chosen, because each phrase is selected with probability
equal to its share of the total fitness of the pool.
[Chart 1: pie chart of fitness shares with slices 8/20, 5/20, 4/20 and 3/20; legend: Keyword 1, Keyword 2, Keyword 3, Keyword 4]
Chart -1: Roulette Wheel Selection
In Chart 1, the breeding pool consists of 4 phrases. The
probability of Phrase 1 being selected is 8/20, the highest
among all the phrases. This is based on the fact that Phrase 1
has the highest fitness of 8.
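A minimal sketch of roulette wheel selection follows. The fitness values 8, 5, 4 and 3 are assumed from the slice sizes in Chart 1; the function names are hypothetical, and the code assumes non-negative fitness values with a positive total.

```python
import random


def selection_probabilities(fitnesses):
    """Share of the wheel owned by each phrase."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_select(phrases, fitnesses, rng=random):
    """Pick one phrase with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)  # where the wheel stops
    running = 0.0
    for phrase, fit in zip(phrases, fitnesses):
        running += fit
        if pick <= running:
            return phrase
    return phrases[-1]  # guard against floating-point round-off


phrases = ["Keyword 1", "Keyword 2", "Keyword 3", "Keyword 4"]
fitnesses = [8, 5, 4, 3]  # assumed from the slices of Chart 1
probabilities = selection_probabilities(fitnesses)
chosen = roulette_select(phrases, fitnesses, random.Random(42))
```

Any phrase can be chosen on a given spin, but over many spins the first phrase is selected about 40% of the time.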
Breeding is carried out as follows to produce new children:
• Mark all phrases as ‘new’
• Loop until no phrase is marked ‘new’
   o Take the phrase (P1) with the highest fitness among those marked ‘new’
   o Mark P1 as ‘old’
   o Take another phrase (P2) using Roulette Wheel Selection
   o Mark P2 as ‘old’
   o Breed P1 and P2 to produce a child C = (P1|P2)
   o Mark C as ‘old’
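The breeding procedure above might be sketched as follows. This is one possible reading rather than the authors' implementation: the name breed_generation is hypothetical, phrases are tuples of words, P2 is assumed to be drawn from every phrase other than P1, and the standard-library call random.Random.choices stands in for the roulette wheel.

```python
import random


def breed_generation(pool, rng):
    """pool maps phrase (tuple of words) -> fitness; returns the list of children."""
    is_new = {phrase: True for phrase in pool}  # mark all phrases as 'new'
    children = []
    while any(is_new.values()):
        # P1: the fittest phrase that is still marked 'new'.
        p1 = max((p for p in pool if is_new[p]), key=pool.get)
        is_new[p1] = False
        # P2: fitness-proportional draw from the other phrases (assumption).
        mates = [p for p in pool if p != p1]
        if not mates:
            break  # a pool of one phrase cannot breed
        p2 = rng.choices(mates, weights=[pool[p] for p in mates], k=1)[0]
        is_new[p2] = False
        # OR Method: union of the parents' words, duplicates dropped.
        child = p1 + tuple(w for w in p2 if w not in p1)
        children.append(child)  # children never enter is_new, so they count as 'old'
    return children


toy_pool = {("w1",): 8.0, ("w2",): 5.0, ("w3",): 4.0, ("w4",): 3.0}
children = breed_generation(toy_pool, random.Random(0))
```

Because both parents are marked 'old' after each mating, the loop ends after at most one pass per phrase, and a pool of k phrases yields between k/2 and k children.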
3.3 Filtering
Once breeding is done, the fitness of the newly generated
children is calculated, and filtering is then performed to
remove low-fitness phrases. The phrases remaining after
filtering constitute the new generation.
The filtering criteria may vary with need and situation. For
language identification, only the top 200 phrases were kept:
after each round of breeding, the 200 phrases with the highest
fitness were sent to the next generation and the rest were
discarded.
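Filtering reduces to a top-k selection by fitness, sketched below. The name filter_top_k and the toy scores are illustrative. The sketch also assumes that parents and children compete in the same pool, which is suggested by the Generation 1 tables (they still contain single-word Generation 0 phrases) but is not stated explicitly.

```python
def filter_top_k(scored_phrases, k=200):
    """Keep the k phrases with the highest fitness."""
    ranked = sorted(scored_phrases.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Toy scores: two parents and two children competing together (assumption).
parents = {("w1",): 8.0, ("w2",): 5.0}
offspring = {("w1", "w2"): 9.5, ("w2", "w3"): 4.0}
next_generation = filter_top_k({**parents, **offspring}, k=3)
```

With k = 3 the weakest child is discarded and the remaining three phrases form the next generation.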
4. PROOF OF CONCEPT
In the fitness function, as the threshold value (N) increases,
phrases with more words tend to survive, so the time required
to find the number of matching sentences for each phrase grows
with each generation. After several tests, the threshold value
was set to 5. This value was found to balance the time
required to count the matching sentences for each phrase
against the number of generations required. Breeding is
stopped when the increase in the fitness of the phrases
becomes insignificant.
The top 10 phrases of the first two generations for the five
languages, with their fitness scores, are given below:
Table -1: Top 10 Arabic language phrases with fitness values
Generation 0 (phrase, fitness)    Generation 1 (phrase, fitness)
‫و‬ 124.7579956 ‫و‬ 124.758
‫في‬ 101.0699997 ‫و‬;‫فقد‬ 113.152
‫من‬ 87.17399597 ‫في‬ 101.07
‫على‬ 54.66600037 ‫في‬;‫المتحدة‬ 96.81601
‫أن‬ 47.12400055 ‫على‬;‫أن‬ 90.48
‫ان‬ 38.14199829 ‫من‬ 87.174
‫عن‬ 33.28199768 ‫من‬;‫دول‬ 83.888
‫مع‬ 30.23999977 ‫على‬ 54.666
‫التي‬ 27.395998 ‫عن‬;‫إلى‬ 52.512
‫إلى‬ 25.79399872 ‫مع‬;‫ال‬ 49.504
Table -2: Top 10 English language phrases with fitness values
Generation 0 (phrase, fitness)    Generation 1 (phrase, fitness)
an 91.27799988 year;named 145.474
of 85.03200531 a;had 140.576
and 80.85599518 However;Last 140.182
to 73.9980011 the;two 117.312
be 39.94200134 and;with 94.496
is 36.45000076 an 91.278
for 33.13799667 an;do 90.35201
was 31.12199974 of 85.03201
de 27.59399986 to;so 82.96
with 25.45199966 and 80.856
Table -3: Top 10 Gujarati language phrases with fitness values
Generation 0 (phrase, fitness)    Generation 1 (phrase, fitness)
છે 80.7480011 છે;અને 232.33
જ 80.40599823 કે;હતી 180.068
આ 77.45400238 આ;છે. 126.192
ન 64.80000305 છે;એ 112.64
છે. 64.51199341 જ;કરવા 84.81601
એ 45.97199631 છે 80.748
તે 40.12199783 છે 80.748
અને 37.96199799 જ 80.406
કે 35.7840004 આ 77.454
પણ 26.51399994 ન 64.8
Table -4: Top 10 Hindi language phrases with fitness values
Generation 0 (phrase, fitness)    Generation 1 (phrase, fitness)
के 98.9280014 है;प्रति 216.79
न 97.91999817 हैं;उस 170.534
है 93.31200409 कहा;उन्हें 159.642
में 87.01200104 में;की 140.56
की 71.11799622 न;हो 116.704
व 67.98599243 को;का 108.96
को 62.22599792 के 98.928
का 60.35400009 न 97.92
कक 59.84999847 है 93.312
Table -5: Top 10 Italian language phrases with fitness values
Generation 0 (phrase, fitness)    Generation 1 (phrase, fitness)
di 105.1920013 che;a 161.056
in 73.56599426 anni;uno 147.448
del 72.46799469 di;ha 124.768
che 51.02999878 di 105.192
da 44.51399994 in;da 104.96
su 36.18000031 in 73.56599
o 36.16199875 del;I 73.296
ha 35.17200089 del 72.46799
e 34.0019989 su;della 57.168
è 34.0019989 che 51.03
Chart -2: Average Fitness of Top 10 phrases of each language
It is evident from Chart 2 that the average fitness of the
top phrases increases after applying the genetic algorithm.
To check the effectiveness of the evolved phrases, the
following method was adopted. The algorithm returns the
language of a document D:
• For each language Li
   o For each phrase Ki ∈ Li
      - C(Ki) = No. of lines of D containing Ki
   o S(Li) = ∑ C(Ki)
• Return L, where S(L) = Max(S(Li))
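The scoring procedure above can be sketched as follows. The function names, the toy phrase sets and the two-line document are hypothetical, and a line is assumed to contain a phrase when any of the phrase's words occurs in it as a substring (the OR reading of a bred phrase).

```python
def score_language(doc_lines, phrases):
    """S(Li): sum over phrases of C(Ki), the number of lines of D containing Ki."""
    return sum(
        sum(1 for line in doc_lines if any(word in line for word in phrase))
        for phrase in phrases
    )


def identify_language(doc_lines, phrase_sets):
    """Return the language with the highest score, plus all scores."""
    scores = {lang: score_language(doc_lines, phrases)
              for lang, phrases in phrase_sets.items()}
    return max(scores, key=scores.get), scores


# Toy phrase sets; the paper's evolved sets hold 200 phrases per language.
phrase_sets = {
    "English": [("the",), ("and", "with")],
    "Italian": [("che",), ("della", "del")],
}
document = ["the cat and the dog", "walking with the cat"]
language, scores = identify_language(document, phrase_sets)
```

In this toy case every English phrase matches both lines and no Italian phrase matches any, so the document is labeled English.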
This technique was applied to 15 web pages in the 5 training
languages (Arabic, English, Gujarati, Hindi and Italian). The
system identified the language of every web page correctly
(a 100% recognition rate).
It may be noted that the technique was applied to
moderate-sized web pages (≥1000 words). The identification
rate may decrease for very small documents (<100 words).
However, this can be improved by using more training data and
retaining more phrases after each breeding stage.
CONCLUSIONS
By combining genetic evolution techniques, which simulate
processes of nature, with a topic modeling method, the
approach generates a set of phrases that effectively
identifies the language of a document.
This technique, when applied to 15 web pages in the 5
training languages (Arabic, English, Gujarati, Hindi and
Italian), identified the language of every web page correctly
(a 100% recognition rate).
REFERENCES
[1]. Shubham Saini, Bhavesh Kasliwal and Shraey Bhatia.
"Spam Detection using G-LDA" International Journal of
Advanced Research in Computer Science and Software
Engineering 3.10 (2013).
[2]. Blei, David M., Andrew Y. Ng, and Michael I. Jordan.
"Latent Dirichlet allocation." Journal of Machine Learning
Research 3 (2003): 993-1022.
[3]. Phan, Xuan-Hieu, and Cam-Tu Nguyen. "JGibbLDA: A Java
implementation of latent Dirichlet allocation (LDA) using Gibbs
sampling for parameter estimation and inference." (2006).
[4]. Ramage, D., and E. Rosen. "Stanford topic modeling
toolbox." (2011).
[5]. Wall, Matthew. "GAlib: A C++ library of genetic
algorithm components." Mechanical Engineering Department,
Massachusetts Institute of Technology 87 (1996): 54.
[6]. Kranig, Simon. "Evaluation of language identification
methods." Bachelor's thesis, Universität Tübingen, Germany
(2005).
[7]. Cavnar, William B., and John M. Trenkle. "N-gram-based
text categorization." Proceedings of SDAIR-94, 3rd Annual
Symposium on Document Analysis and Information Retrieval
(1994): 161-175.
[8]. Teahan, William J., and David J. Harper. "Using
compression-based language models for text categorization."
Language Modeling for Information Retrieval. Springer
Netherlands, 2003. 141-165.
BIOGRAPHIES
Shubham Saini is a senior year undergraduate
student majoring in Computer Science and
Engineering. His research interests are
Information Retrieval, Semantic Web and
Knowledge Discovery in Databases.
Bhavesh Kasliwal is a senior year
undergraduate student majoring in Computer
Science and Engineering. His research interests
are Algorithm Design and Analysis, and Data
Analytics.
Shraey Bhatia is a senior year undergraduate
student majoring in Computer Science and
Engineering. His research interests are
Information Security and Cryptography.
More Related Content

What's hot

Driving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means methodDriving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means method
IJECEIAES
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
ijnlc
 

What's hot (14)

Static dictionary for pronunciation modeling
Static dictionary for pronunciation modelingStatic dictionary for pronunciation modeling
Static dictionary for pronunciation modeling
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 
Driving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means methodDriving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means method
 
33 9765 development paper id 0034 (edit a) (1)
33 9765 development paper id 0034 (edit a) (1)33 9765 development paper id 0034 (edit a) (1)
33 9765 development paper id 0034 (edit a) (1)
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languages
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...
 
Comparison of stemming algorithms on Indonesian text processing
Comparison of stemming algorithms on Indonesian text processingComparison of stemming algorithms on Indonesian text processing
Comparison of stemming algorithms on Indonesian text processing
 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
 
A decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageA decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri language
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
Significance of Speech Intelligibility Assessors in Medium Classroom Using An...
 
IRJET- Spoken Language Identification System using MFCC Features and Gaus...
IRJET-  	  Spoken Language Identification System using MFCC Features and Gaus...IRJET-  	  Spoken Language Identification System using MFCC Features and Gaus...
IRJET- Spoken Language Identification System using MFCC Features and Gaus...
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
 

Viewers also liked

English Language:Speech recognition
English Language:Speech recognitionEnglish Language:Speech recognition
English Language:Speech recognition
Adz Basic Educare
 
Figurative language identification and analysis
Figurative language identification and analysisFigurative language identification and analysis
Figurative language identification and analysis
Nicholas Green
 
Evaluation of language identification methods
Evaluation of language identification methodsEvaluation of language identification methods
Evaluation of language identification methods
edma2
 
Language, dialect, and varieties
Language, dialect, and varietiesLanguage, dialect, and varieties
Language, dialect, and varieties
Sari Kusumaningrum
 

Viewers also liked (13)

Language identification
Language identificationLanguage identification
Language identification
 
English Language:Speech recognition
English Language:Speech recognitionEnglish Language:Speech recognition
English Language:Speech recognition
 
Figurative language identification and analysis
Figurative language identification and analysisFigurative language identification and analysis
Figurative language identification and analysis
 
Evaluation of language identification methods
Evaluation of language identification methodsEvaluation of language identification methods
Evaluation of language identification methods
 
A survey on Script and Language identification for Handwritten document images
A survey on Script and Language identification for Handwritten document imagesA survey on Script and Language identification for Handwritten document images
A survey on Script and Language identification for Handwritten document images
 
Native Language Identification - Brief review to the state of the art
Native Language Identification - Brief review to the state of the artNative Language Identification - Brief review to the state of the art
Native Language Identification - Brief review to the state of the art
 
speech processing basics
speech processing basicsspeech processing basics
speech processing basics
 
Phonetics
PhoneticsPhonetics
Phonetics
 
Automatic Language Identification
Automatic Language IdentificationAutomatic Language Identification
Automatic Language Identification
 
Speech processing
Speech processingSpeech processing
Speech processing
 
Oral Language : Nature and Characteristics
Oral Language : Nature and CharacteristicsOral Language : Nature and Characteristics
Oral Language : Nature and Characteristics
 
Language Identification: A neural network approach
Language Identification: A neural network approachLanguage Identification: A neural network approach
Language Identification: A neural network approach
 
Language, dialect, and varieties
Language, dialect, and varietiesLanguage, dialect, and varieties
Language, dialect, and varieties
 

Similar to Language identification using g lda

An unsupervised approach to develop ir system the case of urdu
An unsupervised approach to develop ir system  the case of urduAn unsupervised approach to develop ir system  the case of urdu
An unsupervised approach to develop ir system the case of urdu
ijaia
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognition
eSAT Journals
 

Similar to Language identification using g lda (20)

Static dictionary for pronunciation modeling
Static dictionary for pronunciation modelingStatic dictionary for pronunciation modeling
Static dictionary for pronunciation modeling
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phones
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phones
 
A lexicon based algorithm for noisy text normalization as pre processing for ...
A lexicon based algorithm for noisy text normalization as pre processing for ...A lexicon based algorithm for noisy text normalization as pre processing for ...
A lexicon based algorithm for noisy text normalization as pre processing for ...
 
GLOVE BASED GESTURE RECOGNITION USING IR SENSOR
GLOVE BASED GESTURE RECOGNITION USING IR SENSORGLOVE BASED GESTURE RECOGNITION USING IR SENSOR
GLOVE BASED GESTURE RECOGNITION USING IR SENSOR
 
A review of factors that impact the design of a glove based wearable devices
A review of factors that impact the design of a glove based wearable devicesA review of factors that impact the design of a glove based wearable devices
A review of factors that impact the design of a glove based wearable devices
 
An unsupervised approach to develop ir system the case of urdu
An unsupervised approach to develop ir system  the case of urduAn unsupervised approach to develop ir system  the case of urdu
An unsupervised approach to develop ir system the case of urdu
 
Semantic analyzer for marathi text
Semantic analyzer for marathi textSemantic analyzer for marathi text
Semantic analyzer for marathi text
 
Semantic analyzer for marathi text
Semantic analyzer for marathi textSemantic analyzer for marathi text
Semantic analyzer for marathi text
 
Live Sign Language Translation: A Survey
Live Sign Language Translation: A SurveyLive Sign Language Translation: A Survey
Live Sign Language Translation: A Survey
 
SignReco: Sign Language Translator
SignReco: Sign Language TranslatorSignReco: Sign Language Translator
SignReco: Sign Language Translator
 
Exploring Data Augmentation Strategies for Labeled Data Enrichment in Low-Res...
Exploring Data Augmentation Strategies for Labeled Data Enrichment in Low-Res...Exploring Data Augmentation Strategies for Labeled Data Enrichment in Low-Res...
Exploring Data Augmentation Strategies for Labeled Data Enrichment in Low-Res...
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
 
Gesture Acquisition and Recognition of Sign Language
Gesture Acquisition and Recognition of Sign LanguageGesture Acquisition and Recognition of Sign Language
Gesture Acquisition and Recognition of Sign Language
 
Cl35491494
Cl35491494Cl35491494
Cl35491494
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognition
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognition
 
Speech To Speech Translation
Speech To Speech TranslationSpeech To Speech Translation
Speech To Speech Translation
 
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
IRJET-  	  Modeling Student’s Vocabulary Knowledge with NaturalIRJET-  	  Modeling Student’s Vocabulary Knowledge with Natural
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
 
A Survey on Speech Recognition with Language Specification
A Survey on Speech Recognition with Language SpecificationA Survey on Speech Recognition with Language Specification
A Survey on Speech Recognition with Language Specification
 

More from eSAT Journals

Material management in construction – a case study
Material management in construction – a case studyMaterial management in construction – a case study
Material management in construction – a case study
eSAT Journals
 
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materialsLaboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
eSAT Journals
 
Geographical information system (gis) for water resources management
Geographical information system (gis) for water resources managementGeographical information system (gis) for water resources management
Geographical information system (gis) for water resources management
eSAT Journals
 
Estimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of surface runoff in nallur amanikere watershed using scs cn methodEstimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of surface runoff in nallur amanikere watershed using scs cn method
eSAT Journals
 
Effect of variation of plastic hinge length on the results of non linear anal...
Effect of variation of plastic hinge length on the results of non linear anal...Effect of variation of plastic hinge length on the results of non linear anal...
Effect of variation of plastic hinge length on the results of non linear anal...
eSAT Journals
 

More from eSAT Journals (20)

Mechanical properties of hybrid fiber reinforced concrete for pavements
Mechanical properties of hybrid fiber reinforced concrete for pavementsMechanical properties of hybrid fiber reinforced concrete for pavements
Mechanical properties of hybrid fiber reinforced concrete for pavements
 
Material management in construction – a case study
Material management in construction – a case studyMaterial management in construction – a case study
Material management in construction – a case study
 
Managing drought short term strategies in semi arid regions a case study
Managing drought    short term strategies in semi arid regions  a case studyManaging drought    short term strategies in semi arid regions  a case study
Managing drought short term strategies in semi arid regions a case study
 
Life cycle cost analysis of overlay for an urban road in bangalore
Life cycle cost analysis of overlay for an urban road in bangaloreLife cycle cost analysis of overlay for an urban road in bangalore
Life cycle cost analysis of overlay for an urban road in bangalore
 
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materialsLaboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
 
Laboratory investigation of expansive soil stabilized with natural inorganic ...
Laboratory investigation of expansive soil stabilized with natural inorganic ...Laboratory investigation of expansive soil stabilized with natural inorganic ...
Laboratory investigation of expansive soil stabilized with natural inorganic ...
 
Influence of reinforcement on the behavior of hollow concrete block masonry p...
Influence of reinforcement on the behavior of hollow concrete block masonry p...Influence of reinforcement on the behavior of hollow concrete block masonry p...
Influence of reinforcement on the behavior of hollow concrete block masonry p...
 
Influence of compaction energy on soil stabilized with chemical stabilizer
Influence of compaction energy on soil stabilized with chemical stabilizerInfluence of compaction energy on soil stabilized with chemical stabilizer
Influence of compaction energy on soil stabilized with chemical stabilizer
 
Geographical information system (gis) for water resources management
Geographical information system (gis) for water resources managementGeographical information system (gis) for water resources management
Geographical information system (gis) for water resources management
 
Language identification using g lda

IJRET: International Journal of Research in Engineering and Technology, eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org

LANGUAGE IDENTIFICATION USING G-LDA

Shubham Saini, Bhavesh Kasliwal, Shraey Bhatia
Students, School of Computing Science and Engineering, Vellore Institute of Technology, India
shubham.saini2010@vit.ac.in, bhavesh.kasliwal2010@vit.ac.in, shraey.bhatia2010@vit.ac.in

Abstract

Language Identification has an important role in Natural Language Processing applications as one of the pre-processing steps. There are various mechanisms in use today that achieve this task with high recognition rates.

Recent years have seen rapid growth in international communication, which has led to the requirement of systems capable of correctly identifying the languages of documents. Possible applications of language identification include information retrieval, web crawlers, text mining and email filtering.

The paper uses a process called G-LDA [1], which takes concepts from Latent Dirichlet Allocation (LDA) and Genetic Evolution techniques. This involves framing a set of words having a high frequency of occurrence in any given document. The method was tested on the Leipzig Corpora. The phrases that were evolved through the generations reflected significant improvement.

Keywords: Language Identification, Latent Dirichlet Allocation, Gibbs Sampling, Genetic Algorithm, Topic Modeling, Breeding, Fitness, Roulette Wheel.

1. INTRODUCTION

Language Identification is the problem of automatically identifying the language a document is written in using a computer.
Language Identification has an important role in Natural Language Processing applications as one of the pre-processing steps. Recent years have seen rapid growth in international communication, which has led to the requirement of systems capable of correctly identifying the languages of documents. Also, with the spread of world-wide Internet access, text is available in a great number of languages other than English. Automatic processing of these texts for purposes like web crawling, reading aids and indexing needs a preliminary identification of the language used. Likewise, any system involving dictionary access must identify the language before applying any language-specific operations.

The paper uses a process called G-LDA, which takes concepts from Latent Dirichlet Allocation [2] (LDA) and Genetic Evolution techniques. This involves framing a set of words having a high frequency of occurrence in any given document. The method was tested on the Leipzig Corpora. The phrases that were evolved through the generations reflected significant improvement.

LDA is a way of automatically discovering topics within a document collection. Each word in the document collection belongs to a particular topic. The technique can be applied to a language identification corpus. The result is a set of words having a high probability of occurrence within any given document.

Further, the sets of words generated earlier are subjected to genetic evolution techniques. This is done to get an improved set of words with a higher probability of occurrence within a document.

It was found that the method worked well without any pre-processing like tokenization, unlike the N-Gram based approach [7], and can handle, for example, Arabic documents and documents encoded in Unicode. Also, the method is very easy to understand and implement, unlike the PPM approach [8], with comparable detection rates.

2. SET GENERATION

A subset of sentences from the Leipzig Corpora was taken.
The languages chosen were Arabic, English, Gujarati, Hindi and Italian. All the sentences from these five languages were combined into a single document, and sets of words were generated from this document using the JGibbLDA [3] package. JGibbLDA is a Java implementation of LDA using the Gibbs Sampling technique. A total of 5 sets with 200 words each were generated, each set representing a single language.
Next, for each word in each language set, the number of lines it matched in the training data for that language (Ne) was found. Genetic evolution techniques were applied to the words with Ne > 0.

3. GENETIC EVOLUTION ALGORITHM

The genetic algorithm was applied as follows:

• Loop for n generations
  o Find the fitness of each phrase
  o Breed using roulette wheel selection
  o Filter to produce a new generation

3.1 Fitness Function

Based on the number of lines matched, a fitness value is assigned to each phrase:

$$\text{Fitness} = (\text{No. of sentences containing the phrase}) \times \left(1 + \frac{N - N_w}{N}\right)$$

where
N = threshold value
Nw = number of words comprising the phrase

The threshold value introduces a length-dependent weight: any phrase containing more than N words receives a fitness penalty, and any phrase with fewer than N words receives a bonus.

3.2 Breeding

The process creates 1 child from 2 parents using the OR Method. Each survivor from the previous generation breeds with a second parent selected using Roulette Wheel Selection.

The OR Method: take 2 parents, such as (Lottery|Cash) and (Cash|Win), and breed them to produce the child (Lottery|Cash|Win). Here, Nw = 3, as the phrase comprises 3 words.

Roulette Wheel Selection: a method for choosing a phrase from the breeding pool. Phrases with better fitness are more likely to be chosen.

[Chart 1: Roulette Wheel Selection. Fitness shares of Keyword 1 to Keyword 4: 8/20, 5/20, 4/20, 3/20]

In Chart 1, the breeding pool consists of 4 phrases. The probability of Phrase 1 being selected is 8/20, the highest among all the phrases, because Phrase 1 has the highest fitness, 8.
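The fitness function, the OR Method and Roulette Wheel Selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fitness`, `or_breed`, `roulette_select`), the representation of a phrase as a tuple of words, and the default threshold of 5 are choices made for this example, and the sentence count is passed in rather than computed, since the text does not spell out how a multi-word phrase is matched against a line.

```python
import random


def fitness(sentence_count, n_words, threshold=5):
    """Fitness = sentence_count * (1 + (N - Nw) / N), with N = threshold."""
    return sentence_count * (1 + (threshold - n_words) / threshold)


def or_breed(parent1, parent2):
    """OR Method: the child holds every distinct word of both parents."""
    child = list(parent1)
    for word in parent2:
        if word not in child:
            child.append(word)
    return tuple(child)


def roulette_select(pool, rng=random):
    """Pick one phrase with probability proportional to its fitness.

    `pool` maps phrase -> fitness. Fitness values are assumed to be
    non-negative with a positive sum (an assumption of this sketch).
    """
    phrases = list(pool)
    weights = [pool[p] for p in phrases]
    return rng.choices(phrases, weights=weights, k=1)[0]


if __name__ == "__main__":
    # The breeding example from the text: (Lottery|Cash) and (Cash|Win).
    child = or_breed(("Lottery", "Cash"), ("Cash", "Win"))
    print(child, "Nw =", len(child))

    # The breeding pool of Chart 1: fitness values 8, 5, 4 and 3.
    pool = {("k1",): 8, ("k2",): 5, ("k3",): 4, ("k4",): 3}
    print(roulette_select(pool))
```

With a threshold of 5, a single-word phrase has its sentence count multiplied by 1.8, a five-word phrase by 1.0, and longer phrases by less than 1, which is the bonus and penalty behaviour described in Section 3.1.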
Breeding is carried out as follows to produce new children:

• Mark all phrases as 'new'
• Loop until no phrase is marked 'new'
  o Take the phrase (P1) with the highest fitness that is marked 'new'
  o Mark P1 as 'old'
  o Take another phrase (P2) using Roulette Wheel Selection
  o Mark P2 as 'old'
  o Breed P1 and P2 to produce a child C = (P1|P2)
  o Mark C as 'old'

3.3 Filtering

Once breeding is done, the fitness of the newly generated children is calculated, and filtering is then performed to remove low-fitness phrases. The phrases remaining after filtering constitute the new generation.

Filtering criteria may vary with need and situation. For the purpose of language identification, the intention was to keep only the top 200 phrases. Hence, after each round of breeding, the top 200 phrases were sent to the next generation and the rest were discarded.

4. PROOF OF CONCEPT

In the fitness function, as the threshold value (N) increases, the time required to find the number of sentences for each phrase increases with each generation, because phrases with more words tend to survive.

After performing several tests, the threshold value was chosen to be 5. This value balances the time required to find the number of sentences for each phrase against the number of new generations required. Breeding is stopped when the increase in the fitness of the phrases becomes insignificant.

The top 10 phrases of the first two generations for the five languages, with their fitness scores, are given below.

Table 1: Top 10 Arabic language phrases with fitness values

Generation 0 phrase | Fitness | Generation 1 phrase | Fitness
و | 124.7579956 | و | 124.758
في | 101.0699997 | و;فقد | 113.152
من | 87.17399597 | في | 101.07
على | 54.66600037 | في;المتحدة | 96.81601
أن | 47.12400055 | على;أن | 90.48
ان | 38.14199829 | من | 87.174
عن | 33.28199768 | من;دول | 83.888
مع | 30.23999977 | على | 54.666
التي | 27.395998 | عن;إلى | 52.512
إلى | 25.79399872 | مع;ال | 49.504

Table 2: Top 10 English language phrases with fitness values

Generation 0 phrase | Fitness | Generation 1 phrase | Fitness
an | 91.27799988 | year;named | 145.474
of | 85.03200531 | a;had | 140.576
and | 80.85599518 | However;Last | 140.182
to | 73.9980011 | the;two | 117.312
be | 39.94200134 | and;with | 94.496
is | 36.45000076 | an | 91.278
for | 33.13799667 | an;do | 90.35201
was | 31.12199974 | of | 85.03201
de | 27.59399986 | to;so | 82.96
with | 25.45199966 | and | 80.856

Table 3: Top 10 Gujarati language phrases with fitness values

Generation 0 phrase | Fitness | Generation 1 phrase | Fitness
છે | 80.7480011 | છે;અને | 232.33
જ | 80.40599823 | કે;હતી | 180.068
આ | 77.45400238 | આ;છે. | 126.192
ન | 64.80000305 | છે;એ | 112.64
છે. | 64.51199341 | જ;કરવા | 84.81601
એ | 45.97199631 | છે | 80.748
તે | 40.12199783 | છે | 80.748
અને | 37.96199799 | જ | 80.406
કે | 35.7840004 | આ | 77.454
પણ | 26.51399994 | ન | 64.8

Table 4: Top 10 Hindi language phrases with fitness values

Generation 0 phrase | Fitness | Generation 1 phrase | Fitness
के | 98.9280014 | है;प्रति | 216.79
न | 97.91999817 | हैं;उस | 170.534
है | 93.31200409 | कहा;उन्हें | 159.642
में | 87.01200104 | में;की | 140.56
की | 71.11799622 | न;हो | 116.704
व | 67.98599243 | को;का | 108.96
को | 62.22599792 | के | 98.928
का | 60.35400009 | न | 97.92
कक | 59.84999847 | है | 93.312

Table 5: Top 10 Italian language phrases with fitness values

Generation 0 phrase | Fitness | Generation 1 phrase | Fitness
di | 105.1920013 | che;a | 161.056
in | 73.56599426 | anni;uno | 147.448
del | 72.46799469 | di;ha | 124.768
che | 51.02999878 | di | 105.192
da | 44.51399994 | in;da | 104.96
su | 36.18000031 | in | 73.56599
o | 36.16199875 | del;I | 73.296
ha | 35.17200089 | del | 72.46799
e | 34.0019989 | su;della | 57.168
è | 34.0019989 | che | 51.03

[Chart 2: Average Fitness of Top 10 phrases of each language]

It is evident from Chart 2 that the average fitness of the phrases increases after applying the genetic algorithm.

To check the effectiveness of the evolved phrases, the following method was adopted. The algorithm returns the language of a document D:

• For each language Li
  • For each phrase Ki ∈ Li
    o C(Ki) = No. of lines of D containing Ki
  • S(Li) = ∑ C(Ki)
• Return L, where S(L) = Max(S(Li))

This technique was applied to 15 web pages in the 5 training languages (Arabic, English, Gujarati, Hindi and Italian). The system identified the language of every web page correctly, giving a 100% recognition rate.

It may be noted that the technique was applied to moderately sized web pages (≥1000 words). The identification rate may decrease when applied to very small documents (<100 words). However, this can be improved by using a larger training set and retaining more words after each breeding stage.

CONCLUSIONS

By combining the genetic evolution technique, which simulates processes of nature, with a topic modeling method, the set of phrases generated effectively identifies the language of a document. This technique, when applied to 15 web pages in the 5 training languages (Arabic, English, Gujarati, Hindi and Italian), identified the language of every web page correctly, giving a 100% recognition rate.

REFERENCES

[1]. Shubham Saini, Bhavesh Kasliwal and Shraey Bhatia. "Spam Detection using G-LDA." International Journal of Advanced Research in Computer Science and Software Engineering 3.10 (2013).
[2]. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
[3]. Phan, Xuan-Hieu, and Cam-Tu Nguyen. "JGibbLDA: A Java implementation of latent Dirichlet allocation (LDA) using Gibbs sampling for parameter estimation and inference." (2006).
[4]. Ramage, D., and E. Rosen. "Stanford topic modeling toolbox." (2011).
[5]. Wall, Matthew. "GAlib: A C++ library of genetic algorithm components." Mechanical Engineering Department, Massachusetts Institute of Technology 87 (1996): 54.
[6]. Kranig, Simon. "Evaluation of language identification methods." Bachelor's thesis, Universität Tübingen, Germany (2005).
[7]. Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Ann Arbor MI 48113.2 (1994): 161-175.
[8]. Teahan, William J., and David J. Harper. "Using compression-based language models for text categorization." Language Modeling for Information Retrieval. Springer Netherlands, 2003. 141-165.

BIOGRAPHIES

Shubham Saini is a senior year undergraduate student majoring in Computer Science and Engineering. His research interests are Information Retrieval, Semantic Web and Knowledge Discovery in Databases.

Bhavesh Kasliwal is a senior year undergraduate student majoring in Computer Science and Engineering. His research interests are Algorithm Design and Analysis, and Data Analytics.

Shraey Bhatia is a senior year undergraduate student majoring in Computer Science and Engineering. His research interests are Information Security and Cryptography.