SlideShare a Scribd company logo
1 of 4
Download to read offline
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 42
LANGUAGE IDENTIFICATION USING G-LDA
Shubham Saini1
, Bhavesh Kasliwal2
, Shraey Bhatia3
1, 2, 3
Student, School of Computing Science and Engineering, Vellore Institute of Technology, India,
shubham.saini2010@vit.ac.in, bhavesh.kasliwal2010@vit.ac.in, shraey.bhatia2010@vit.ac.in
Abstract
Language Identification has an important role in Natural Language processing applications as one of the pre-processing steps. There
are various mechanisms in use today to achieve this task with brilliant recognition rates.
Recent years have seen rapid growth in international communication which has lead to the requirement of systems capable of
correctly identifying languages of documents. Possible applications of language identification include information retrieval, web
crawlers, text mining and email filtering.
The paper uses a process called G-LDA [1], which takes concepts from Latent Dirichlet Allocation (LDA) and Genetic Evolution
techniques. This involves framing a set of words having a high frequency of occurrence in any given document. The method was tested
on Leipzig Corpora. The phrases that were evolved through the generations reflected significant improvement.
Keywords: Language Identification, Latent Dirichlet Allocation, Gibbs Sampling, Genetic Algorithm, Topic Modeling,
Breeding, Fitness, Roulette Wheel.
---------------------------------------------------------------------***-------------------------------------------------------------------------
1. INTRODUCTION
Language Identification is the problem of automatically
identifying the language a document is written in through the
use of computer. Language Identification has an important
role in Natural Language processing applications as one of the
pre-processing steps.
Recent years have seen rapid growth in international
communication which has lead to the requirement of systems
capable of correctly identifying languages of documents. Also,
with the spread of world-wide Internet access, text is available
in a great number of languages other than English. Automatic
processing of these texts for purposes like web crawling,
reading aids, indexing etc. need a preliminary identification of
the language used. Likewise, any system involving dictionary
access must identify language before applying any language-
specific operations.
The paper uses a process called G-LDA, which takes concepts
from Latent Dirichlet Allocation [2] (LDA) and Genetic
Evolution techniques. This involves framing a set of words
having a high frequency of occurrence in any given document.
The method was tested on Leipzig Corpora. The phrases that
were evolved through the generations reflected significant
improvement.
LDA is a way of automatically discovering topics within a
documents collection. Each word in the documents collection
will belong to a particular topic. The technique can be applied
to a language identification corpus. The result will be a set of
words having a high probability of occurrence within any
given document.
Further, set of words generated earlier are subjected to genetic
evolution techniques. This is done to get an improved set of
words with a higher probability of occurrence within a
document.
It was found that the method worked well without any pre-
processing like tokenization, unlike the N-Gram based
approach [7], and can handle e.g. Arabic documents and
documents encoded in Unicode.
Also, the method is very easy to understand and implement,
unlike the PPM approach [8], with almost similar detection
rates.
2. SET GENERATION
A sub-set of sentences from Leipzig Corpora was taken. The
languages chosen were Arabic, English, Guajarati, Hindi and
Italian. All the sentences from these five languages were
combined into a single document and sets of words from these
documents were generated using the JGibbLDA [3] package.
JGibbLDA is a Java implementation of LDA using Gibbs
Sampling technique. A total of 5 sets with 200 words each
were generated, each set representing a single language.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 43
Now, for each word in each language set, the number of lines
in the training data for that language matched, (Ne) was found.
Genetic evolution techniques were applied on the words with
Ne > 0.
3. GENETIC EVOLUTION ALGORITHM
Genetic algorithm was applied as follows:
• Loop for n generations
o Find Fitness Function for each phrase
o Breeding using roulette wheel selection
o Filtering to produce a new generation
3.1 Fitness Function
Based on the number of lines matched, a fitness value is
assigned to each phrase:
Fitness = No. sentences containing the phrase*(1+
N-Nw
N
)
Where
N = Threshold value
Nw = No. of words comprising the phrase
Threshold value is taken in order to provide a negative weight.
Any phrase containing more than N words will receive fitness
penalty, and phrases with less than N words receive a bonus.
3.2 Breeding
The process used created 1 child from 2 parents using the OR
Method. Each survivor from previous generation breeds with a
2nd parent selected using Roulette Wheel Selection.
The OR Method – Take 2 parents, such as (Lottery|Cash) and
(Cash|Win), breed them to produce the child
(Lottery|Cash|Win). Here, Nw = 3 as the phrase comprises of 3
words.
Roulette Wheel Selection – It is a method for choosing a
phrase from the breeding pool. Phrases with better fitness are
more likely to be chosen.
8/20
5/20
4/20
3/20
Fitness
Keyword 1
Keyword 2
Keyword 3
Keyword 4
Chart -1: Roulette Wheel Selection
In Chart 1, the breeding pool consists of 4 phrases. The
probability of Phrase 1 being selected is 8/20, the highest
among all the phrases. This is based on the fact that Phrase 1
has the highest fitness of 8.
Breading is carried out as follows to produce new children
• Mark all Phrases as ‘new’
• Loop till no phrase is marked ‘new’
o Take a phrase (P1) with highest fitness
which is marked ‘new’.
o Mark P1 as ‘old’
o Take another phrase (P2) using Roulette
Wheel Selection
o Mark P2 as ‘old’
o Breed P1 and P2 to produce a child C =
(P1|P2).
o Mark C as ‘old’
3.3 Filtering
Once breeding is done, fitness of the newly generated children
is calculated and then filtering is performed in order to remove
low fitness words. The words remaining after filtering
constitute the new generation.
Filtering criteria may vary with need and situation. For the
purpose of language identification, it was intended to keep
only top 200 phrases. Hence, after each round of breeding, top
200 phrases were sent to the next generation and the rest of the
words were discarded.
4. PROOF OF CONCEPT
In the Fitness function, as the threshold value (N) increases,
the time required to find the number of sentences for each
phrase increases with each generation as phrases with more
number of words tend to survive. After performing several
tests, the threshold value was chosen to be 5. This is the
optimum value which balances the time required for finding
the number of sentences for each phrase and the number of
new generations required. The breeding is stopped when the
increase in the fitness of the phrases become insignificant.
The top 10 phrases of the first two generations for the five
languages, with their fitness score is given below:
Table -1: Top 10 Arabic language phrases with fitness values
Generation 0 Generation 1
‫ﻭ‬ 124.7579956 ‫ﻭ‬ 124.758
‫ﻓﻲ‬ 101.0699997 ‫ﻭ‬;‫ﻓﻘﺩ‬ 113.152
‫ﻣﻥ‬ 87.17399597 ‫ﻓﻲ‬ 101.07
‫ﻋﻠﻰ‬ 54.66600037 ‫ﻓﻲ‬;‫ﻟﻣﺗﺣﺩﺓ‬‫ﺍ‬ 96.81601
‫ﺃﻥ‬ 47.12400055 ‫ﻋﻠﻰ‬;‫ﺃﻥ‬ 90.48
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 44
‫ﺍﻥ‬ 38.14199829 ‫ﻣﻥ‬ 87.174
‫ﻋﻥ‬ 33.28199768 ‫ﻣﻥ‬;‫ﺩﻭﻝ‬ 83.888
‫ﻣﻊ‬ 30.23999977 ‫ﻋﻠﻰ‬ 54.666
‫ﻟﺗﻲ‬‫ﺍ‬ 27.395998 ‫ﻋﻥ‬;‫ﻟﻰ‬‫ﺇ‬ 52.512
‫ﻟﻰ‬‫ﺇ‬ 25.79399872 ‫ﻣﻊ‬;‫ﻻ‬ 49.504
Table -2: Top 10 English language phrases with fitness values
Generation 0 Generation 1
an 91.27799988 year;named 145.474
of 85.03200531 a;had 140.576
and 80.85599518 However;Last 140.182
to 73.9980011 the;two 117.312
be 39.94200134 and;with 94.496
is 36.45000076 an 91.278
for 33.13799667 an;do 90.35201
was 31.12199974 of 85.03201
de 27.59399986 to;so 82.96
with 25.45199966 and 80.856
Table -3: Top 10 Guajarati language phrases with fitness
values
Generation 0 Generation 1
છે 80.7480011 છે;અને 232.33
જ 80.40599823 ક;હતી 180.068
આ 77.45400238 આ;છે. 126.192
ન 64.80000305 છે;એ 112.64
છે. 64.51199341 જ;કરવા 84.81601
એ 45.97199631 છે 80.748
તે 40.12199783 છે 80.748
અને 37.96199799 જ 80.406
ક 35.7840004 આ 77.454
પણ 26.51399994 ન 64.8
Table -4: Top 10 Hindi language phrases with fitness values
Generation 0 Generation 1
के 98.9280014 है; त 216.79
न 97.91999817 ह;उस 170.534
है 93.31200409 कहा;उ ह 159.642
म 87.01200104 म;क 140.56
क 71.11799622 न;हो 116.704
व 67.98599243 को;का 108.96
को 62.22599792 के 98.928
का 60.35400009 न 97.92
क 59.84999847 है 93.312
Table -5: Top 10 Italian language phrases with fitness values
Generation 0 Generation 1
di 105.1920013 che;a 161.056
in 73.56599426 anni;uno 147.448
del 72.46799469 di;ha 124.768
che 51.02999878 di 105.192
da 44.51399994 in;da 104.96
su 36.18000031 in 73.56599
o 36.16199875 del;I 73.296
ha 35.17200089 del 72.46799
e 34.0019989 su;della 57.168
è 34.0019989 che 51.03
Chart -2: Average Fitness of Top 10 phrases of each language
It is evident from Chart 2 that average fitness of the phrases
increases after applying the genetic algorithm.
To check the effectiveness of the evolved phrases, the
following method was adopted. The algorithm returns the
language of a document D:
• For Each Language Li
• For each phrase Ki ∈ Li
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 45
o C(Ki) = No. of lines of D containing Ki
• S(Li)=∑C(Ki)
• Return L, where S(L) = Max(S(Li))
This technique was applied to 15 web-pages of the 5 training
languages (Arabic, English, Gujarati, Hindi and Italian). The
system was able to identify the language for every web page
correctly (giving 100% recognition rate).
It maybe be noted that the technique was applied to moderate
(≥1000 words) sized web-pages. The identification rate may
decrease when applied to very small sized documents (<100
words). However, this can be improved by taking a bigger
training data, and retaining more words after each breeding
stage.
CONCLUSIONS
Combining the Genetic evolution technique, which simulates
processes of nature, with topic modeling method, the set of
phrases generated effectively identify the language of a
document.
This technique, when applied to 15 web-pages of the 5
training languages (Arabic, English, Gujarati, Hindi and
Italian), was able to identify the language for every web page
correctly (giving 100% recognition rate).
REFERENCES
[1]. Shubham Saini, Bhavesh Kasliwal and Shraey Bhatia.
"Spam Detection using G-LDA" International Journal of
Advanced Research in Computer Science and Software
Engineering 3.10 (2013).
[2]. Blei, David M., Andrew Y. Ng, and Michael I. Jordan.
"Latent dirichlet allocation." the Journal of machine Learning
research 3 (2003): 993-1022.J. Breckling, Ed., The Analysis of
Directional Time Series: Applications to Wind Speed and
Direction, ser. Lecture Notes in Statistics. Berlin, Germany:
Springer, 1989, vol. 61.
[3]. Phan, Xuan-Hieu, and Cam-Tu Nguyen. "Jgibblda: A java
implementation of latent dirichlet allocation (lda) using gibbs
sampling for parameter estimation and inference." (2006).
[4]. Ramage, D., and E. Rosen. "Stanford topic modeling
toolbox." (2011).
[5]. Wall, Matthew. "GAlib: A C++ library of genetic
algorithm components."Mechanical Engineering Department,
Massachusetts Institute of Technology 87 (1996): 54.
[6]. Kranig, Simon. "Evaluation of language identification
methods." Bakalárska práca, Universität Tübingen, Nemecko
(2005)
[7]. Cavnar, William B., and John M. Trenkle. "N-gram-based
text categorization." Ann Arbor MI 48113.2 (1994): 161-175.
[8]. Teahan, William J., and David J. Harper. "Using
compression-based language models for text categorization."
Language Modeling for Information Retrieval. Springer
Netherlands, 2003. 141-165
BIOGRAPHIES
Shubham Saini is a senior year undergraduate
student majoring in Computer Science and
Engineering. His research interests are
Information Retrieval, Semantic Web and
Knowledge Discovery in Databases.
Bhavesh Kasliwal is a senior year
undergraduate student majoring in Computer
Science and Engineering. His research interests
are Algorithm Design and Analysis, Data
Analytics.
Shraey Bhatia is a senior year undergraduate
student majoring in Computer Science and
Engineering. His research interests are
Information Security and Cryptography.

More Related Content

What's hot

Driving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means methodDriving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means methodIJECEIAES
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesijnlc
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...journalBEEI
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
 

What's hot (6)

Driving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means methodDriving cycle development for Kuala Terengganu city using k-means method
Driving cycle development for Kuala Terengganu city using k-means method
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ...
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 

Viewers also liked

Investigations on the performance of diesel in an air
Investigations on the performance of diesel in an airInvestigations on the performance of diesel in an air
Investigations on the performance of diesel in an aireSAT Publishing House
 
Optimization and implementation of parallel squarer
Optimization and implementation of parallel squarerOptimization and implementation of parallel squarer
Optimization and implementation of parallel squarereSAT Publishing House
 
Effect of mobile phone and bts radiation on heart rate
Effect of mobile phone and bts radiation on heart rateEffect of mobile phone and bts radiation on heart rate
Effect of mobile phone and bts radiation on heart rateeSAT Publishing House
 
Geographical information system (gis) for water
Geographical information system (gis) for waterGeographical information system (gis) for water
Geographical information system (gis) for watereSAT Publishing House
 
Design and implementation of an ancrchitecture of embedded web server for wir...
Design and implementation of an ancrchitecture of embedded web server for wir...Design and implementation of an ancrchitecture of embedded web server for wir...
Design and implementation of an ancrchitecture of embedded web server for wir...eSAT Publishing House
 
Analysis and mitigation of unbalance due to load in
Analysis and mitigation of unbalance due to load inAnalysis and mitigation of unbalance due to load in
Analysis and mitigation of unbalance due to load ineSAT Publishing House
 
Reminiscing cloud computing technology
Reminiscing cloud computing technologyReminiscing cloud computing technology
Reminiscing cloud computing technologyeSAT Publishing House
 
Three dimensional finite element modeling of pervious
Three dimensional finite element modeling of perviousThree dimensional finite element modeling of pervious
Three dimensional finite element modeling of perviouseSAT Publishing House
 
Multisensor data fusion based autonomous mobile
Multisensor data fusion based autonomous mobileMultisensor data fusion based autonomous mobile
Multisensor data fusion based autonomous mobileeSAT Publishing House
 
Testing the flexural fatigue behavior of e glass epoxy laminates
Testing the flexural fatigue behavior of e glass epoxy laminatesTesting the flexural fatigue behavior of e glass epoxy laminates
Testing the flexural fatigue behavior of e glass epoxy laminateseSAT Publishing House
 
Microwave dehydrator an environmental friendly step toward improving microwav...
Microwave dehydrator an environmental friendly step toward improving microwav...Microwave dehydrator an environmental friendly step toward improving microwav...
Microwave dehydrator an environmental friendly step toward improving microwav...eSAT Publishing House
 
Video copy detection using segmentation method and
Video copy detection using segmentation method andVideo copy detection using segmentation method and
Video copy detection using segmentation method andeSAT Publishing House
 
Offline signature identification using high intensity variations and cross ov...
Offline signature identification using high intensity variations and cross ov...Offline signature identification using high intensity variations and cross ov...
Offline signature identification using high intensity variations and cross ov...eSAT Publishing House
 
Gsm based automatic lpg ordering system with leakage alert
Gsm based automatic lpg ordering system with leakage alertGsm based automatic lpg ordering system with leakage alert
Gsm based automatic lpg ordering system with leakage alerteSAT Publishing House
 
Formulating a trip production prediction model for
Formulating a trip production prediction model forFormulating a trip production prediction model for
Formulating a trip production prediction model foreSAT Publishing House
 
Survey on traditional and evolutionary clustering
Survey on traditional and evolutionary clusteringSurvey on traditional and evolutionary clustering
Survey on traditional and evolutionary clusteringeSAT Publishing House
 
Design & development of embedded application
Design & development of embedded applicationDesign & development of embedded application
Design & development of embedded applicationeSAT Publishing House
 
An improved color image encryption algorithm with
An improved color image encryption algorithm withAn improved color image encryption algorithm with
An improved color image encryption algorithm witheSAT Publishing House
 
Cancer cell segmentation and detection using nc ratio
Cancer cell segmentation and detection using nc ratioCancer cell segmentation and detection using nc ratio
Cancer cell segmentation and detection using nc ratioeSAT Publishing House
 
Passive control of structures using sliding isolators at intermediate floor l...
Passive control of structures using sliding isolators at intermediate floor l...Passive control of structures using sliding isolators at intermediate floor l...
Passive control of structures using sliding isolators at intermediate floor l...eSAT Publishing House
 

Viewers also liked (20)

Investigations on the performance of diesel in an air
Investigations on the performance of diesel in an airInvestigations on the performance of diesel in an air
Investigations on the performance of diesel in an air
 
Optimization and implementation of parallel squarer
Optimization and implementation of parallel squarerOptimization and implementation of parallel squarer
Optimization and implementation of parallel squarer
 
Effect of mobile phone and bts radiation on heart rate
Effect of mobile phone and bts radiation on heart rateEffect of mobile phone and bts radiation on heart rate
Effect of mobile phone and bts radiation on heart rate
 
Geographical information system (gis) for water
Geographical information system (gis) for waterGeographical information system (gis) for water
Geographical information system (gis) for water
 
Design and implementation of an ancrchitecture of embedded web server for wir...
Design and implementation of an ancrchitecture of embedded web server for wir...Design and implementation of an ancrchitecture of embedded web server for wir...
Design and implementation of an ancrchitecture of embedded web server for wir...
 
Analysis and mitigation of unbalance due to load in
Analysis and mitigation of unbalance due to load inAnalysis and mitigation of unbalance due to load in
Analysis and mitigation of unbalance due to load in
 
Reminiscing cloud computing technology
Reminiscing cloud computing technologyReminiscing cloud computing technology
Reminiscing cloud computing technology
 
Three dimensional finite element modeling of pervious
Three dimensional finite element modeling of perviousThree dimensional finite element modeling of pervious
Three dimensional finite element modeling of pervious
 
Multisensor data fusion based autonomous mobile
Multisensor data fusion based autonomous mobileMultisensor data fusion based autonomous mobile
Multisensor data fusion based autonomous mobile
 
Testing the flexural fatigue behavior of e glass epoxy laminates
Testing the flexural fatigue behavior of e glass epoxy laminatesTesting the flexural fatigue behavior of e glass epoxy laminates
Testing the flexural fatigue behavior of e glass epoxy laminates
 
Microwave dehydrator an environmental friendly step toward improving microwav...
Microwave dehydrator an environmental friendly step toward improving microwav...Microwave dehydrator an environmental friendly step toward improving microwav...
Microwave dehydrator an environmental friendly step toward improving microwav...
 
Video copy detection using segmentation method and
Video copy detection using segmentation method andVideo copy detection using segmentation method and
Video copy detection using segmentation method and
 
Offline signature identification using high intensity variations and cross ov...
Offline signature identification using high intensity variations and cross ov...Offline signature identification using high intensity variations and cross ov...
Offline signature identification using high intensity variations and cross ov...
 
Gsm based automatic lpg ordering system with leakage alert
Gsm based automatic lpg ordering system with leakage alertGsm based automatic lpg ordering system with leakage alert
Gsm based automatic lpg ordering system with leakage alert
 
Formulating a trip production prediction model for
Formulating a trip production prediction model forFormulating a trip production prediction model for
Formulating a trip production prediction model for
 
Survey on traditional and evolutionary clustering
Survey on traditional and evolutionary clusteringSurvey on traditional and evolutionary clustering
Survey on traditional and evolutionary clustering
 
Design & development of embedded application
Design & development of embedded applicationDesign & development of embedded application
Design & development of embedded application
 
An improved color image encryption algorithm with
An improved color image encryption algorithm withAn improved color image encryption algorithm with
An improved color image encryption algorithm with
 
Cancer cell segmentation and detection using nc ratio
Cancer cell segmentation and detection using nc ratioCancer cell segmentation and detection using nc ratio
Cancer cell segmentation and detection using nc ratio
 
Passive control of structures using sliding isolators at intermediate floor l...
Passive control of structures using sliding isolators at intermediate floor l...Passive control of structures using sliding isolators at intermediate floor l...
Passive control of structures using sliding isolators at intermediate floor l...
 

Similar to Language identification using g lda

Static dictionary for pronunciation modeling
Static dictionary for pronunciation modelingStatic dictionary for pronunciation modeling
Static dictionary for pronunciation modelingeSAT Journals
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phoneseSAT Journals
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phoneseSAT Publishing House
 
A lexicon based algorithm for noisy text normalization as pre processing for ...
A lexicon based algorithm for noisy text normalization as pre processing for ...A lexicon based algorithm for noisy text normalization as pre processing for ...
A lexicon based algorithm for noisy text normalization as pre processing for ...eSAT Publishing House
 
GLOVE BASED GESTURE RECOGNITION USING IR SENSOR
GLOVE BASED GESTURE RECOGNITION USING IR SENSORGLOVE BASED GESTURE RECOGNITION USING IR SENSOR
GLOVE BASED GESTURE RECOGNITION USING IR SENSORIRJET Journal
 
A review of factors that impact the design of a glove based wearable devices
A review of factors that impact the design of a glove based wearable devicesA review of factors that impact the design of a glove based wearable devices
A review of factors that impact the design of a glove based wearable devicesIAESIJAI
 
Live Sign Language Translation: A Survey
Live Sign Language Translation: A SurveyLive Sign Language Translation: A Survey
Live Sign Language Translation: A SurveyIRJET Journal
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationeSAT Publishing House
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationeSAT Publishing House
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...eSAT Journals
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationeSAT Journals
 
SignReco: Sign Language Translator
SignReco: Sign Language TranslatorSignReco: Sign Language Translator
SignReco: Sign Language TranslatorIRJET Journal
 
An unsupervised approach to develop ir system the case of urdu
An unsupervised approach to develop ir system  the case of urduAn unsupervised approach to develop ir system  the case of urdu
An unsupervised approach to develop ir system the case of urduijaia
 
Semantic analyzer for marathi text
Semantic analyzer for marathi textSemantic analyzer for marathi text
Semantic analyzer for marathi texteSAT Journals
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognitioneSAT Journals
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognitioneSAT Publishing House
 
Speech To Speech Translation
Speech To Speech TranslationSpeech To Speech Translation
Speech To Speech TranslationIRJET Journal
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
 
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
IRJET-  	  Modeling Student’s Vocabulary Knowledge with NaturalIRJET-  	  Modeling Student’s Vocabulary Knowledge with Natural
IRJET- Modeling Student’s Vocabulary Knowledge with NaturalIRJET Journal
 

Similar to Language identification using g lda (20)

Static dictionary for pronunciation modeling
Static dictionary for pronunciation modelingStatic dictionary for pronunciation modeling
Static dictionary for pronunciation modeling
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phones
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phones
 
A lexicon based algorithm for noisy text normalization as pre processing for ...
A lexicon based algorithm for noisy text normalization as pre processing for ...A lexicon based algorithm for noisy text normalization as pre processing for ...
A lexicon based algorithm for noisy text normalization as pre processing for ...
 
GLOVE BASED GESTURE RECOGNITION USING IR SENSOR
GLOVE BASED GESTURE RECOGNITION USING IR SENSORGLOVE BASED GESTURE RECOGNITION USING IR SENSOR
GLOVE BASED GESTURE RECOGNITION USING IR SENSOR
 
A review of factors that impact the design of a glove based wearable devices
A review of factors that impact the design of a glove based wearable devicesA review of factors that impact the design of a glove based wearable devices
A review of factors that impact the design of a glove based wearable devices
 
Live Sign Language Translation: A Survey
Live Sign Language Translation: A SurveyLive Sign Language Translation: A Survey
Live Sign Language Translation: A Survey
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentation
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentation
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentation
 
SignReco: Sign Language Translator
SignReco: Sign Language TranslatorSignReco: Sign Language Translator
SignReco: Sign Language Translator
 
An unsupervised approach to develop ir system the case of urdu
An unsupervised approach to develop ir system  the case of urduAn unsupervised approach to develop ir system  the case of urdu
An unsupervised approach to develop ir system the case of urdu
 
Semantic analyzer for marathi text
Semantic analyzer for marathi textSemantic analyzer for marathi text
Semantic analyzer for marathi text
 
Semantic analyzer for marathi text
Semantic analyzer for marathi textSemantic analyzer for marathi text
Semantic analyzer for marathi text
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognition
 
Performance of different classifiers in speech recognition
Performance of different classifiers in speech recognitionPerformance of different classifiers in speech recognition
Performance of different classifiers in speech recognition
 
Speech To Speech Translation
Speech To Speech TranslationSpeech To Speech Translation
Speech To Speech Translation
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
 
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
IRJET-  	  Modeling Student’s Vocabulary Knowledge with NaturalIRJET-  	  Modeling Student’s Vocabulary Knowledge with Natural
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
 

More from eSAT Publishing House

Likely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnamLikely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnameSAT Publishing House
 
Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...eSAT Publishing House
 
Hudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnamHudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnameSAT Publishing House
 
Groundwater investigation using geophysical methods a case study of pydibhim...
Groundwater investigation using geophysical methods  a case study of pydibhim...Groundwater investigation using geophysical methods  a case study of pydibhim...
Groundwater investigation using geophysical methods a case study of pydibhim...eSAT Publishing House
 
Flood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, indiaFlood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, indiaeSAT Publishing House
 
Enhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity buildingEnhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity buildingeSAT Publishing House
 
Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...eSAT Publishing House
 
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...eSAT Publishing House
 
Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...eSAT Publishing House
 
Shear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a reviewShear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a revieweSAT Publishing House
 
Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...eSAT Publishing House
 
Risk analysis and environmental hazard management
Risk analysis and environmental hazard managementRisk analysis and environmental hazard management
Risk analysis and environmental hazard managementeSAT Publishing House
 
Review study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear wallsReview study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear wallseSAT Publishing House
 
Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...eSAT Publishing House
 
Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...eSAT Publishing House
 
Coastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of indiaCoastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of indiaeSAT Publishing House
 
Can fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structuresCan fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structureseSAT Publishing House
 
Assessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildingsAssessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildingseSAT Publishing House
 
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...eSAT Publishing House
 
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...eSAT Publishing House
 

More from eSAT Publishing House (20)

Likely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnamLikely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnam
 
Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...
 
Hudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnamHudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnam
 
Groundwater investigation using geophysical methods a case study of pydibhim...
Groundwater investigation using geophysical methods  a case study of pydibhim...Groundwater investigation using geophysical methods  a case study of pydibhim...
Groundwater investigation using geophysical methods a case study of pydibhim...
 
Flood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, indiaFlood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, india
 
Enhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity buildingEnhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity building
 
Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...
 
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
 
Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...
 
Shear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a reviewShear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a review
 
Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...
 
Risk analysis and environmental hazard management
Risk analysis and environmental hazard managementRisk analysis and environmental hazard management
Risk analysis and environmental hazard management
 
Review study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear wallsReview study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear walls
 
Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...
 
Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...
 
Coastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of indiaCoastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of india
 
Can fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structuresCan fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structures
 
Assessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildingsAssessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildings
 
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
 
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
 

Recently uploaded

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 

Recently uploaded (20)

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 

Language identification using g lda

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 42 LANGUAGE IDENTIFICATION USING G-LDA Shubham Saini1 , Bhavesh Kasliwal2 , Shraey Bhatia3 1, 2, 3 Student, School of Computing Science and Engineering, Vellore Institute of Technology, India, shubham.saini2010@vit.ac.in, bhavesh.kasliwal2010@vit.ac.in, shraey.bhatia2010@vit.ac.in Abstract Language Identification has an important role in Natural Language processing applications as one of the pre-processing steps. There are various mechanisms in use today to achieve this task with brilliant recognition rates. Recent years have seen rapid growth in international communication which has lead to the requirement of systems capable of correctly identifying languages of documents. Possible applications of language identification include information retrieval, web crawlers, text mining and email filtering. The paper uses a process called G-LDA [1], which takes concepts from Latent Dirichlet Allocation (LDA) and Genetic Evolution techniques. This involves framing a set of words having a high frequency of occurrence in any given document. The method was tested on Leipzig Corpora. The phrases that were evolved through the generations reflected significant improvement. Keywords: Language Identification, Latent Dirichlet Allocation, Gibbs Sampling, Genetic Algorithm, Topic Modeling, Breeding, Fitness, Roulette Wheel. ---------------------------------------------------------------------***------------------------------------------------------------------------- 1. INTRODUCTION Language Identification is the problem of automatically identifying the language a document is written in through the use of computer. Language Identification has an important role in Natural Language processing applications as one of the pre-processing steps. Recent years have seen rapid growth in international communication which has lead to the requirement of systems capable of correctly identifying languages of documents. Also, with the spread of world-wide Internet access, text is available in a great number of languages other than English. Automatic processing of these texts for purposes like web crawling, reading aids, indexing etc. need a preliminary identification of the language used. Likewise, any system involving dictionary access must identify language before applying any language- specific operations. The paper uses a process called G-LDA, which takes concepts from Latent Dirichlet Allocation [2] (LDA) and Genetic Evolution techniques. This involves framing a set of words having a high frequency of occurrence in any given document. The method was tested on Leipzig Corpora. The phrases that were evolved through the generations reflected significant improvement. LDA is a way of automatically discovering topics within a documents collection. Each word in the documents collection will belong to a particular topic. The technique can be applied to a language identification corpus. The result will be a set of words having a high probability of occurrence within any given document. Further, set of words generated earlier are subjected to genetic evolution techniques. This is done to get an improved set of words with a higher probability of occurrence within a document. It was found that the method worked well without any pre- processing like tokenization, unlike the N-Gram based approach [7], and can handle e.g. Arabic documents and documents encoded in Unicode. Also, the method is very easy to understand and implement, unlike the PPM approach [8], with almost similar detection rates. 2. SET GENERATION A sub-set of sentences from Leipzig Corpora was taken. The languages chosen were Arabic, English, Guajarati, Hindi and Italian. All the sentences from these five languages were combined into a single document and sets of words from these documents were generated using the JGibbLDA [3] package. JGibbLDA is a Java implementation of LDA using Gibbs Sampling technique. A total of 5 sets with 200 words each were generated, each set representing a single language.
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 43 Now, for each word in each language set, the number of lines in the training data for that language matched, (Ne) was found. Genetic evolution techniques were applied on the words with Ne > 0. 3. GENETIC EVOLUTION ALGORITHM Genetic algorithm was applied as follows: • Loop for n generations o Find Fitness Function for each phrase o Breeding using roulette wheel selection o Filtering to produce a new generation 3.1 Fitness Function Based on the number of lines matched, a fitness value is assigned to each phrase: Fitness = No. sentences containing the phrase*(1+ N-Nw N ) Where N = Threshold value Nw = No. of words comprising the phrase Threshold value is taken in order to provide a negative weight. Any phrase containing more than N words will receive fitness penalty, and phrases with less than N words receive a bonus. 3.2 Breeding The process used created 1 child from 2 parents using the OR Method. Each survivor from previous generation breeds with a 2nd parent selected using Roulette Wheel Selection. The OR Method – Take 2 parents, such as (Lottery|Cash) and (Cash|Win), breed them to produce the child (Lottery|Cash|Win). Here, Nw = 3 as the phrase comprises of 3 words. Roulette Wheel Selection – It is a method for choosing a phrase from the breeding pool. Phrases with better fitness are more likely to be chosen. 8/20 5/20 4/20 3/20 Fitness Keyword 1 Keyword 2 Keyword 3 Keyword 4 Chart -1: Roulette Wheel Selection In Chart 1, the breeding pool consists of 4 phrases. The probability of Phrase 1 being selected is 8/20, the highest among all the phrases. This is based on the fact that Phrase 1 has the highest fitness of 8. Breading is carried out as follows to produce new children • Mark all Phrases as ‘new’ • Loop till no phrase is marked ‘new’ o Take a phrase (P1) with highest fitness which is marked ‘new’. o Mark P1 as ‘old’ o Take another phrase (P2) using Roulette Wheel Selection o Mark P2 as ‘old’ o Breed P1 and P2 to produce a child C = (P1|P2). o Mark C as ‘old’ 3.3 Filtering Once breeding is done, fitness of the newly generated children is calculated and then filtering is performed in order to remove low fitness words. The words remaining after filtering constitute the new generation. Filtering criteria may vary with need and situation. For the purpose of language identification, it was intended to keep only top 200 phrases. Hence, after each round of breeding, top 200 phrases were sent to the next generation and the rest of the words were discarded. 4. PROOF OF CONCEPT In the Fitness function, as the threshold value (N) increases, the time required to find the number of sentences for each phrase increases with each generation as phrases with more number of words tend to survive. After performing several tests, the threshold value was chosen to be 5. This is the optimum value which balances the time required for finding the number of sentences for each phrase and the number of new generations required. The breeding is stopped when the increase in the fitness of the phrases become insignificant. The top 10 phrases of the first two generations for the five languages, with their fitness score is given below: Table -1: Top 10 Arabic language phrases with fitness values Generation 0 Generation 1 ‫ﻭ‬ 124.7579956 ‫ﻭ‬ 124.758 ‫ﻓﻲ‬ 101.0699997 ‫ﻭ‬;‫ﻓﻘﺩ‬ 113.152 ‫ﻣﻥ‬ 87.17399597 ‫ﻓﻲ‬ 101.07 ‫ﻋﻠﻰ‬ 54.66600037 ‫ﻓﻲ‬;‫ﻟﻣﺗﺣﺩﺓ‬‫ﺍ‬ 96.81601 ‫ﺃﻥ‬ 47.12400055 ‫ﻋﻠﻰ‬;‫ﺃﻥ‬ 90.48
  • 3. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 44 ‫ﺍﻥ‬ 38.14199829 ‫ﻣﻥ‬ 87.174 ‫ﻋﻥ‬ 33.28199768 ‫ﻣﻥ‬;‫ﺩﻭﻝ‬ 83.888 ‫ﻣﻊ‬ 30.23999977 ‫ﻋﻠﻰ‬ 54.666 ‫ﻟﺗﻲ‬‫ﺍ‬ 27.395998 ‫ﻋﻥ‬;‫ﻟﻰ‬‫ﺇ‬ 52.512 ‫ﻟﻰ‬‫ﺇ‬ 25.79399872 ‫ﻣﻊ‬;‫ﻻ‬ 49.504 Table -2: Top 10 English language phrases with fitness values Generation 0 Generation 1 an 91.27799988 year;named 145.474 of 85.03200531 a;had 140.576 and 80.85599518 However;Last 140.182 to 73.9980011 the;two 117.312 be 39.94200134 and;with 94.496 is 36.45000076 an 91.278 for 33.13799667 an;do 90.35201 was 31.12199974 of 85.03201 de 27.59399986 to;so 82.96 with 25.45199966 and 80.856 Table -3: Top 10 Guajarati language phrases with fitness values Generation 0 Generation 1 છે 80.7480011 છે;અને 232.33 જ 80.40599823 ક;હતી 180.068 આ 77.45400238 આ;છે. 126.192 ન 64.80000305 છે;એ 112.64 છે. 64.51199341 જ;કરવા 84.81601 એ 45.97199631 છે 80.748 તે 40.12199783 છે 80.748 અને 37.96199799 જ 80.406 ક 35.7840004 આ 77.454 પણ 26.51399994 ન 64.8 Table -4: Top 10 Hindi language phrases with fitness values Generation 0 Generation 1 के 98.9280014 है; त 216.79 न 97.91999817 ह;उस 170.534 है 93.31200409 कहा;उ ह 159.642 म 87.01200104 म;क 140.56 क 71.11799622 न;हो 116.704 व 67.98599243 को;का 108.96 को 62.22599792 के 98.928 का 60.35400009 न 97.92 क 59.84999847 है 93.312 Table -5: Top 10 Italian language phrases with fitness values Generation 0 Generation 1 di 105.1920013 che;a 161.056 in 73.56599426 anni;uno 147.448 del 72.46799469 di;ha 124.768 che 51.02999878 di 105.192 da 44.51399994 in;da 104.96 su 36.18000031 in 73.56599 o 36.16199875 del;I 73.296 ha 35.17200089 del 72.46799 e 34.0019989 su;della 57.168 è 34.0019989 che 51.03 Chart -2: Average Fitness of Top 10 phrases of each language It is evident from Chart 2 that average fitness of the phrases increases after applying the genetic algorithm. To check the effectiveness of the evolved phrases, the following method was adopted. The algorithm returns the language of a document D: • For Each Language Li • For each phrase Ki ∈ Li
  • 4. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 11 | Nov-2013, Available @ http://www.ijret.org 45 o C(Ki) = No. of lines of D containing Ki • S(Li)=∑C(Ki) • Return L, where S(L) = Max(S(Li)) This technique was applied to 15 web-pages of the 5 training languages (Arabic, English, Gujarati, Hindi and Italian). The system was able to identify the language for every web page correctly (giving 100% recognition rate). It maybe be noted that the technique was applied to moderate (≥1000 words) sized web-pages. The identification rate may decrease when applied to very small sized documents (<100 words). However, this can be improved by taking a bigger training data, and retaining more words after each breeding stage. CONCLUSIONS Combining the Genetic evolution technique, which simulates processes of nature, with topic modeling method, the set of phrases generated effectively identify the language of a document. This technique, when applied to 15 web-pages of the 5 training languages (Arabic, English, Gujarati, Hindi and Italian), was able to identify the language for every web page correctly (giving 100% recognition rate). REFERENCES [1]. Shubham Saini, Bhavesh Kasliwal and Shraey Bhatia. "Spam Detection using G-LDA" International Journal of Advanced Research in Computer Science and Software Engineering 3.10 (2013). [2]. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal of machine Learning research 3 (2003): 993-1022.J. Breckling, Ed., The Analysis of Directional Time Series: Applications to Wind Speed and Direction, ser. Lecture Notes in Statistics. Berlin, Germany: Springer, 1989, vol. 61. [3]. Phan, Xuan-Hieu, and Cam-Tu Nguyen. "Jgibblda: A java implementation of latent dirichlet allocation (lda) using gibbs sampling for parameter estimation and inference." (2006). [4]. Ramage, D., and E. Rosen. "Stanford topic modeling toolbox." (2011). [5]. Wall, Matthew. "GAlib: A C++ library of genetic algorithm components."Mechanical Engineering Department, Massachusetts Institute of Technology 87 (1996): 54. [6]. Kranig, Simon. "Evaluation of language identification methods." Bakalárska práca, Universität Tübingen, Nemecko (2005) [7]. Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Ann Arbor MI 48113.2 (1994): 161-175. [8]. Teahan, William J., and David J. Harper. "Using compression-based language models for text categorization." Language Modeling for Information Retrieval. Springer Netherlands, 2003. 141-165 BIOGRAPHIES Shubham Saini is a senior year undergraduate student majoring in Computer Science and Engineering. His research interests are Information Retrieval, Semantic Web and Knowledge Discovery in Databases. Bhavesh Kasliwal is a senior year undergraduate student majoring in Computer Science and Engineering. His research interests are Algorithm Design and Analysis, Data Analytics. Shraey Bhatia is a senior year undergraduate student majoring in Computer Science and Engineering. His research interests are Information Security and Cryptography.