SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3178
Automatic Language Identification using Hybrid Approach and
Classification Algorithms
Ajinkya Gadgil1, Swaraj Joshi2, Pranit Katwe3, Prasad Kshatriya4
1,2,3,4 Students, Department of Computer Engineering, K.K.Wagh Institute of Engineering Education and Research,
Nashik-422003, Maharashtra, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - In this paper, we implement the working of an
uninterrupted classification algorithm for automatic
language identification. The method introduced in this
paper identify the language without user interaction during
processing. In this process, we implement an efficient
algorithm which contains Naive Bayesian classification and
n-gram text processing algorithm. This method does not
require any prior information (number of classes, initial
partition) and is able to quickly process a large amount of
data and results can be visualized. We address the problem
of detecting documents that contain text from more than
one language. This project also includes language
preprocessing like special characters removal, suffix
removal and token generation. Naive Bayesian classification
is used for classification of texts in multiple languages. In
this project, we use languages like Hindi, English, Gujarati,
Sanskrit and other foreign languages. We can extend the
project by including language translation, document
summarization and sentiment analysis.
Keywords: Language identification, N-Gram,
Multilingual text, Classification, Natural Language
Processing (NLP).
1. INTRODUCTION
Currently, large number of methods in Natural Language
Processing (NLP) projects are getting the advantage of text
processing algorithm to improve their performance in text
processing application to help stop the search space, for
example, work of identifying several demographic
properties of documents from a written text, we can
identify the structure of the language. The method of
identifying the native language depends on the manner of
speaking and writing of the second language and is
borrowed from Second Language Acquisition (SLA), which
is known as language transfer. The method of language
transfer says that the first language (L1) influences the
way that a second language (L2) is learned (Ahn, 2011;
Tsur and Rappoport, 2007). According to this theory, if we
learn to identify what is being transferred from one
language to another, then it is possible to identify the
native language of an author given a text written in L2. For
instance, a Korean native speaker can be identified by the
errors in the use of articles 'a', 'an' and 'the' in his English
writings due to the lack of similar function words in
Korean. As we see, error identification is very common in
automatic approaches; however, a previous analysis and
understanding of linguistic markers are often required in
such approaches.
R and D Department in recent years has given a lot of
interest to text data processing and especially to
multilingual textual data[1], this is for several reasons: a
growing collection of networked and universally
distributed data, the development of communication
infrastructure and the Internet, the increase in the number
of people connected to the global network and whose
mother tongue is not English[2]. This has created a need to
organize and process huge volumes of data. The manual
processing of these data (expert, or knowledge-based
systems) is very costly in time and personnel, they are
inflexible and generalization to other areas are virtually
impossible. So we try to develop automatic methods as
Automatic Language Identification using Hybrid Approach
and Classification Algorithms.
2. METHODOLOGICAL APPROACH
In this part, we describe our methodology with the use of
approach based on n-grams[3][4] to group similar
documents together. This combination will be examined in
several experiments using the Naive Bayesian
classification[7], Support Vector Machines (SVM)[8] and
Logistic Regression[9] as similarity measures for several
values of n.
The language detection of a written text is
probably one of the most basic tasks in natural language
processing (NLP). For any language-dependent processing
of an unknown text, the first thing is to know in which
language the text is written. The approach we have chosen
to implement is pretty straightforward. The idea is that
any language has a unique set of character (co-
)occurrences. Our method depends on the three attributes
of language that include character set, N-gram[5] and
word list[3].
2.1. Character Set
Some languages have a very specific Character set (e.g.
Marathi, Sanskrit); while for others, some characters give
a good hint of what languages come in question (e.g. Hindi,
Gujarati).
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3179
2.2. N-gram Algorithm
The term "n-gram"[6] was introduced in 1948. An n-gram
may designate both: n-tuple of characters (n-gram
character) or an n-tuple of words (n-gram words). This
model does not represent documents by a vector of term’s
frequencies but by a vector of n-gram frequencies in the
documents. An n-gram character is a sequence of n
consecutive characters. For any document, all n-grams
that can be generated are the result obtained by moving a
window of n boxes in the text. This movement is made in
stages; one stage corresponds to a character for n-grams
of characters and a word for n-grams of words. Then we
count the frequencies of n-grams found. In scientific
literature, this term sometimes refers to sequences that
are neither ordered nor straight, for example, a bigram can
be composed of the first letter and second letter of a word;
consider an n-gram as a set of unordered n words after
performing the stemming and the removing of special
characters. In our experiments, n-grams of characters are
used. Thus an n-gram refers to a string of n consecutive
characters. In this approach, we do not need to conduct
any linguistic processing of the corpus. For a given
document, as we already said, extracting all n-grams
(usually n = (2, 3, 4, 5)) is the result obtained by moving a
window of n boxes in the main text. This movement is
made by steps of one character at a time, every step we
take a "snapshot" and all these 'shots' constitute the set of
all n-grams of the document. We cut the texts of the corpus
based on the value of n chosen. We took n = 2, 3, 4 and 5.
2.3. Word List
The last source of disambiguation is the actually used
words. Some languages (Marathi and Hindi) are almost
identical in character set and also the occurrences of the
specific n-grams. Still, different words are used in different
frequencies.
2.4. Naive Bayes Classification
Naive Bayes classifier is an efficient and powerful
algorithm for the classification task. Even if we are
working on a data set with millions of records with some
attributes, it is suggested to try Naive Bayes approach.
Naive Bayes classifier gives good results when we use it
for textual data processing such as Natural Language
Processing. To learn the Naive Bayes classifier we need to
understand the Bayes theorem. It works on
conditional probability. Conditional probability is the
probability that something will happen, given that
something else has already occurred. Using the conditional
probability, we can calculate the probability of an event
using its prior knowledge.
Naive Bayes is a kind of classifier which uses the Bayes
Theorem. It predicts membership probabilities for each
class such as the probability that given record or data
point belongs to a particular class. The class with
the highest probability is considered as the most likely
class. This is also known as Maximum A Posteriori (MAP).
Naive Bayes classifier assumes that all the features
are unrelated to each other. Presence or absence of a
feature does not influence the presence or absence of any
other feature. We can use the example of Wikipedia for
explaining the use of Naive Bayes classifier i.e.
A fruit may be considered to be an apple if it is red, round,
and about 4″ in diameter. Even if these features depend
on each other or upon the existence of the other features, a
Naive Bayes classifier considers all of these properties to
independently contribute to the probability that this fruit
is an apple.
In real datasets, we test a hypothesis given multiple
evidence (feature). So, calculations become complicated.
To simplify the work, the feature independence approach
is used to ‘uncouple’ multiple evidence and treat each as
an independent one.
2.5. Language Identification
In this step, we classify input text document that matches
the model of dominant language.
2.6. Equations
These are the equations for the n-gram model used in our
methodology
● unigram: p(wi) (i.i.d. process)
● bigram: p(wi |wi−1) (Markov process)
● trigram: p(wi |wi−2, wi−1)
We can estimate n-gram probabilities by counting relative
frequency on a training corpus. (This is maximum
likelihood estimation.)
where N is the total number of words in the training set
and c(·) denotes count of the word or word sequence in
the training data.
Following equation is used for the Naive Bayes Classifier
to calculate the conditional probability.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3180
● P(C) is the probability of hypothesis H being true.
This is known as the prior probability.
● P(A) is the probability of the evidence (regardless
of the hypothesis).
● P(A|C) is the probability of the evidence given that
hypothesis is true.
● P(C|A) is the probability of the hypothesis given
that the evidence is there
2.7. Working of Classification Algorithms
A Path Lab is performing a Test of disease say “D” with
two results “Positive” & “Negative.” They guarantee that
their test result is 99% accurate: if you have the disease,
they will give test positive 99% of the time. If you don’t
have the disease, they will test negative 99% of the time. If
3% of all the people have this disease and test gives
“positive” result, what is the probability that you actually
have the disease?
For solving the above problem, we will have to use
conditional probability. Probability of people suffering
from Disease D, P(D) = 0.03 = 3%. Probability that test
gives “positive” result and patient have the disease, P(Pos |
D) = 0.99 =99%.
Probability of people not suffering from Disease D, P(~D)
= 0.97 = 97%. The probability that test gives “positive”
result and patient does have the disease, P(Pos | ~D) =
0.01 =1%.
For calculating the probability that the patient actually has
the disease i.e, P( D | Pos) we will use Bayes theorem:
We have all the values of numerator but we need to
calculate P(Pos):
P(Pos) = P(D | pos) + P( ~D | pos)
= P(pos|D)*P(D) + P(pos|~D)*P(~D)
= 0.99 * 0.03 + 0.01 * 0.97
= 0.0297 + 0.0097
= 0.0394
Let’s calculate,
P( D | Pos) = (P(Pos | D) * P(D)) / P(Pos)
= (0.99 * 0.03) / 0.0394
= 0.753807107
So, Approximately 75% chances are there that the patient
is actually suffering from the disease.
2.8. Results and Evaluation
In this section, we present and evaluate the obtained
results. We evaluate our method using three classification
algorithms.
● Naive Bayesian classification[7]
● Support Vector Machines (SVM)[8]
● Logistic Regression[9]
Following diagram show the comparison between this
three methods.
Distance NBC SVM LR
n-gram 2 3 4 5 2 3 4 5 2 3 4 5
Time for
extracting
244 204 204 152 244 204 204 152 244 204 20
4
152
n-grams
(second)
Time
required
to
Classify
49 272 403 400 54 130 285 263 28 65 99 61
(second)
F-
measure
(%)
51.7
1
44.35 47.4
5
51.8
1
48.6
5
44.0
9
40.6
3
49.58 44.3
9
40.28 40.4 46.95
3. CONCLUSIONS
The work presented in this paper shows that it is possible
to identify the language automatically in an unsupervised
manner. It aims to enhance the unsupervised methods and
techniques applied for classification of a text document to
identify the language in a multilingual corpus.
Furthermore, we can extend our project to translate the
data from one language to another language.
REFERENCES
[1] Marco Lui, Jey Han Lau and Timothy Baldwin,
“Automatic Detection and Language Identification of
Multilingual Documents”, Department of Computing
and Information Systems, The University of
Melbourne, NICTA Victoria Research Laboratory,
Department of Philosophy, King’s College London,
2014
[2] Pavol ZAVARSKY, Yoshiki MIKAMI, Shota WADA,
“Language and encoding scheme identification of
extremely large sets of multilingual text documents”,
Department of Management and Information Sciences,
Nagaoka University of Technology, 1603-1
Kamitomioka, 940-2188 Nagaoka, Japan
[3] IOANNIS KANARIS, KONSTANTINOS KANARIS,
IOANNIS HOUVARDAS, and EFSTATHIOS
STAMATATOS, “WORDS VS. CHARACTER N-GRAMS
FOR ANTI-SPAM FILTERING”, Dept. of Information
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3181
and Communication Systems Eng., University of the
Aegean, Karlovassi, Samos - 83200, Greece, 2006
[4] Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert,
“Language Identification from Text Using N-gram
Based Cumulative Frequency Addition”, Proceedings
of Student/Faculty Research Day, CSIS, Pace
University, May 7th, 2004
[5] Rahmoun A, Elberrichi Z. Experimenting N-Grams in
Text Categorization. International Arab Journal of
Information Technology, 2007, Vol 4, N°.4, pp. 377-
385.
[6] ABDELMALEK AMINE, MICHEL SIMONET, ZAKARIA
ELBERRICHI, “On Automatic Language Identification
International Journal of Computer Science and
Applications, ©Technomathematics Research
Foundation Vol. 7, No. 1, pp. 94 – 107, 2010.
(references)
[7] Yuguang Huang, Lei Li, “NAIVE BAYES
CLASSIFICATION ALGORITHM BASED ON SMALL
SAMPLE SET”, Beijing University of Posts and
Telecommunications, Beijing, China, 2011
[8] Thorsten Joachims, “Text Categorization with Support
Vector Machines: Learning with Many Relevant
Features”, Universit• at Dortmund, Informatik LS8,
Baroper Str. 301, 44221 Dortmund, Germany
[9] Daniel Jurafsky & James H. Martin, “Speech and
Language Processing, Chapter 7 - Logistic Regression”,
Copyright c 2016, Draft of August 7, 2017.

More Related Content

What's hot

Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
kevig
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
Shubham Kumar
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
Lifeng (Aaron) Han
 
Document Summarization
Document SummarizationDocument Summarization
Document Summarization
Pratik Kumar
 
Text summarization
Text summarizationText summarization
Text summarization
Akash Karwande
 
Implementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserImplementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic Parser
Waqas Tariq
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
Sudarsun Santhiappan
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
Lidia Pivovarova
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
Innovation Engineering
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
Karol Grzegorczyk
 
Clustering Algorithm for Gujarati Language
Clustering Algorithm for Gujarati LanguageClustering Algorithm for Gujarati Language
Clustering Algorithm for Gujarati Language
ijsrd.com
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
NYC Predictive Analytics
 
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ijnlc
 
Myanmar Named Entity Recognition with Hidden Markov Model
Myanmar Named Entity Recognition with Hidden Markov ModelMyanmar Named Entity Recognition with Hidden Markov Model
Myanmar Named Entity Recognition with Hidden Markov Model
ijtsrd
 
The Duet model
The Duet modelThe Duet model
The Duet model
Bhaskar Mitra
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
Ali Kabbadj
 
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...
csandit
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
KozoChikai
 

What's hot (19)

Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
 
Document Summarization
Document SummarizationDocument Summarization
Document Summarization
 
Text summarization
Text summarizationText summarization
Text summarization
 
Implementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserImplementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic Parser
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Clustering Algorithm for Gujarati Language
Clustering Algorithm for Gujarati LanguageClustering Algorithm for Gujarati Language
Clustering Algorithm for Gujarati Language
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
 
Myanmar Named Entity Recognition with Hidden Markov Model
Myanmar Named Entity Recognition with Hidden Markov ModelMyanmar Named Entity Recognition with Hidden Markov Model
Myanmar Named Entity Recognition with Hidden Markov Model
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
 

Similar to IRJET- Automatic Language Identification using Hybrid Approach and Classification Algorithms

DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
AbdurrahimDerric
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 
Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysisIndexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
kevig
 
Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
kevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
kevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
kevig
 
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET Journal
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
ijaia
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introduction
ThennarasuSakkan
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
kevig
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
kevig
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
kevig
 
228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
Boshra Albayaty
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
cscpconf
 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineering
Nakul Sharma
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
kevig
 
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...
cscpconf
 
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONSUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
ijaia
 
Automated Essay Scoring Using Generalized Latent Semantic Analysis
Automated Essay Scoring Using Generalized Latent Semantic AnalysisAutomated Essay Scoring Using Generalized Latent Semantic Analysis
Automated Essay Scoring Using Generalized Latent Semantic Analysis
Gina Rizzo
 
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGESYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
kevig
 

Similar to IRJET- Automatic Language Identification using Hybrid Approach and Classification Algorithms (20)

DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysisIndexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
 
Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introduction
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
 
228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineering
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen...
 
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONSUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
 
Automated Essay Scoring Using Generalized Latent Semantic Analysis
Automated Essay Scoring Using Generalized Latent Semantic AnalysisAutomated Essay Scoring Using Generalized Latent Semantic Analysis
Automated Essay Scoring Using Generalized Latent Semantic Analysis
 
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGESYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
 

More from IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
IRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
IRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
IRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
IRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
IRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
IRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
IRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
IRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
Madan Karki
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
NazakatAliKhoso2
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
KrishnaveniKrishnara1
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
Aditya Rajan Patra
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
zubairahmad848137
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
RadiNasr
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 

Recently uploaded (20)

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 

IRJET- Automatic Language Identification using Hybrid Approach and Classification Algorithms

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3178 Automatic Language Identification using Hybrid Approach and Classification Algorithms Ajinkya Gadgil1, Swaraj Joshi2, Pranit Katwe3, Prasad Kshatriya4 1,2,3,4 Students, Department of Computer Engineering, K.K.Wagh Institute of Engineering Education and Research, Nashik-422003, Maharashtra, India. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - In this paper, we implement the working of an uninterrupted classification algorithm for automatic language identification. The method introduced in this paper identify the language without user interaction during processing. In this process, we implement an efficient algorithm which contains Naive Bayesian classification and n-gram text processing algorithm. This method does not require any prior information (number of classes, initial partition) and is able to quickly process a large amount of data and results can be visualized. We address the problem of detecting documents that contain text from more than one language. This project also includes language preprocessing like special characters removal, suffix removal and token generation. Naive Bayesian classification is used for classification of texts in multiple languages. In this project, we use languages like Hindi, English, Gujarati, Sanskrit and other foreign languages. We can extend the project by including language translation, document summarization and sentiment analysis. Keywords: Language identification, N-Gram, Multilingual text, Classification, Natural Language Processing (NLP). 1. INTRODUCTION Currently, large number of methods in Natural Language Processing (NLP) projects are getting the advantage of text processing algorithm to improve their performance in text processing application to help stop the search space, for example, work of identifying several demographic properties of documents from a written text, we can identify the structure of the language. The method of identifying the native language depends on the manner of speaking and writing of the second language and is borrowed from Second Language Acquisition (SLA), which is known as language transfer. The method of language transfer says that the first language (L1) influences the way that a second language (L2) is learned (Ahn, 2011; Tsur and Rappoport, 2007). According to this theory, if we learn to identify what is being transferred from one language to another, then it is possible to identify the native language of an author given a text written in L2. For instance, a Korean native speaker can be identified by the errors in the use of articles 'a', 'an' and 'the' in his English writings due to the lack of similar function words in Korean. As we see, error identification is very common in automatic approaches; however, a previous analysis and understanding of linguistic markers are often required in such approaches. R and D Department in recent years has given a lot of interest to text data processing and especially to multilingual textual data[1], this is for several reasons: a growing collection of networked and universally distributed data, the development of communication infrastructure and the Internet, the increase in the number of people connected to the global network and whose mother tongue is not English[2]. This has created a need to organize and process huge volumes of data. The manual processing of these data (expert, or knowledge-based systems) is very costly in time and personnel, they are inflexible and generalization to other areas are virtually impossible. So we try to develop automatic methods as Automatic Language Identification using Hybrid Approach and Classification Algorithms. 2. METHODOLOGICAL APPROACH In this part, we describe our methodology with the use of approach based on n-grams[3][4] to group similar documents together. This combination will be examined in several experiments using the Naive Bayesian classification[7], Support Vector Machines (SVM)[8] and Logistic Regression[9] as similarity measures for several values of n. The language detection of a written text is probably one of the most basic tasks in natural language processing (NLP). For any language-dependent processing of an unknown text, the first thing is to know in which language the text is written. The approach we have chosen to implement is pretty straightforward. The idea is that any language has a unique set of character (co- )occurrences. Our method depends on the three attributes of language that include character set, N-gram[5] and word list[3]. 2.1. Character Set Some languages have a very specific Character set (e.g. Marathi, Sanskrit); while for others, some characters give a good hint of what languages come in question (e.g. Hindi, Gujarati).
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3179 2.2. N-gram Algorithm The term "n-gram"[6] was introduced in 1948. An n-gram may designate both: n-tuple of characters (n-gram character) or an n-tuple of words (n-gram words). This model does not represent documents by a vector of term’s frequencies but by a vector of n-gram frequencies in the documents. An n-gram character is a sequence of n consecutive characters. For any document, all n-grams that can be generated are the result obtained by moving a window of n boxes in the text. This movement is made in stages; one stage corresponds to a character for n-grams of characters and a word for n-grams of words. Then we count the frequencies of n-grams found. In scientific literature, this term sometimes refers to sequences that are neither ordered nor straight, for example, a bigram can be composed of the first letter and second letter of a word; consider an n-gram as a set of unordered n words after performing the stemming and the removing of special characters. In our experiments, n-grams of characters are used. Thus an n-gram refers to a string of n consecutive characters. In this approach, we do not need to conduct any linguistic processing of the corpus. For a given document, as we already said, extracting all n-grams (usually n = (2, 3, 4, 5)) is the result obtained by moving a window of n boxes in the main text. This movement is made by steps of one character at a time, every step we take a "snapshot" and all these 'shots' constitute the set of all n-grams of the document. We cut the texts of the corpus based on the value of n chosen. We took n = 2, 3, 4 and 5. 2.3. Word List The last source of disambiguation is the actually used words. Some languages (Marathi and Hindi) are almost identical in character set and also the occurrences of the specific n-grams. Still, different words are used in different frequencies. 2.4. Naive Bayes Classification Naive Bayes classifier is an efficient and powerful algorithm for the classification task. Even if we are working on a data set with millions of records with some attributes, it is suggested to try Naive Bayes approach. Naive Bayes classifier gives good results when we use it for textual data processing such as Natural Language Processing. To learn the Naive Bayes classifier we need to understand the Bayes theorem. It works on conditional probability. Conditional probability is the probability that something will happen, given that something else has already occurred. Using the conditional probability, we can calculate the probability of an event using its prior knowledge. Naive Bayes is a kind of classifier which uses the Bayes Theorem. It predicts membership probabilities for each class such as the probability that given record or data point belongs to a particular class. The class with the highest probability is considered as the most likely class. This is also known as Maximum A Posteriori (MAP). Naive Bayes classifier assumes that all the features are unrelated to each other. Presence or absence of a feature does not influence the presence or absence of any other feature. We can use the example of Wikipedia for explaining the use of Naive Bayes classifier i.e. A fruit may be considered to be an apple if it is red, round, and about 4″ in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple. In real datasets, we test a hypothesis given multiple evidence (feature). So, calculations become complicated. To simplify the work, the feature independence approach is used to ‘uncouple’ multiple evidence and treat each as an independent one. 2.5. Language Identification In this step, we classify input text document that matches the model of dominant language. 2.6. Equations These are the equations for the n-gram model used in our methodology ● unigram: p(wi) (i.i.d. process) ● bigram: p(wi |wi−1) (Markov process) ● trigram: p(wi |wi−2, wi−1) We can estimate n-gram probabilities by counting relative frequency on a training corpus. (This is maximum likelihood estimation.) where N is the total number of words in the training set and c(·) denotes count of the word or word sequence in the training data. Following equation is used for the Naive Bayes Classifier to calculate the conditional probability.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3180 ● P(C) is the probability of hypothesis H being true. This is known as the prior probability. ● P(A) is the probability of the evidence (regardless of the hypothesis). ● P(A|C) is the probability of the evidence given that hypothesis is true. ● P(C|A) is the probability of the hypothesis given that the evidence is there 2.7. Working of Classification Algorithms A Path Lab is performing a Test of disease say “D” with two results “Positive” & “Negative.” They guarantee that their test result is 99% accurate: if you have the disease, they will give test positive 99% of the time. If you don’t have the disease, they will test negative 99% of the time. If 3% of all the people have this disease and test gives “positive” result, what is the probability that you actually have the disease? For solving the above problem, we will have to use conditional probability. Probability of people suffering from Disease D, P(D) = 0.03 = 3%. Probability that test gives “positive” result and patient have the disease, P(Pos | D) = 0.99 =99%. Probability of people not suffering from Disease D, P(~D) = 0.97 = 97%. The probability that test gives “positive” result and patient does have the disease, P(Pos | ~D) = 0.01 =1%. For calculating the probability that the patient actually has the disease i.e, P( D | Pos) we will use Bayes theorem: We have all the values of numerator but we need to calculate P(Pos): P(Pos) = P(D | pos) + P( ~D | pos) = P(pos|D)*P(D) + P(pos|~D)*P(~D) = 0.99 * 0.03 + 0.01 * 0.97 = 0.0297 + 0.0097 = 0.0394 Let’s calculate, P( D | Pos) = (P(Pos | D) * P(D)) / P(Pos) = (0.99 * 0.03) / 0.0394 = 0.753807107 So, Approximately 75% chances are there that the patient is actually suffering from the disease. 2.8. Results and Evaluation In this section, we present and evaluate the obtained results. We evaluate our method using three classification algorithms. ● Naive Bayesian classification[7] ● Support Vector Machines (SVM)[8] ● Logistic Regression[9] Following diagram show the comparison between this three methods. Distance NBC SVM LR n-gram 2 3 4 5 2 3 4 5 2 3 4 5 Time for extracting 244 204 204 152 244 204 204 152 244 204 20 4 152 n-grams (second) Time required to Classify 49 272 403 400 54 130 285 263 28 65 99 61 (second) F- measure (%) 51.7 1 44.35 47.4 5 51.8 1 48.6 5 44.0 9 40.6 3 49.58 44.3 9 40.28 40.4 46.95 3. CONCLUSIONS The work presented in this paper shows that it is possible to identify the language automatically in an unsupervised manner. It aims to enhance the unsupervised methods and techniques applied for classification of a text document to identify the language in a multilingual corpus. Furthermore, we can extend our project to translate the data from one language to another language. REFERENCES [1] Marco Lui, Jey Han Lau and Timothy Baldwin, “Automatic Detection and Language Identification of Multilingual Documents”, Department of Computing and Information Systems, The University of Melbourne, NICTA Victoria Research Laboratory, Department of Philosophy, King’s College London, 2014 [2] Pavol ZAVARSKY, Yoshiki MIKAMI, Shota WADA, “Language and encoding scheme identification of extremely large sets of multilingual text documents”, Department of Management and Information Sciences, Nagaoka University of Technology, 1603-1 Kamitomioka, 940-2188 Nagaoka, Japan [3] IOANNIS KANARIS, KONSTANTINOS KANARIS, IOANNIS HOUVARDAS, and EFSTATHIOS STAMATATOS, “WORDS VS. CHARACTER N-GRAMS FOR ANTI-SPAM FILTERING”, Dept. of Information
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 03 | Mar-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3181 and Communication Systems Eng., University of the Aegean, Karlovassi, Samos - 83200, Greece, 2006 [4] Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert, “Language Identification from Text Using N-gram Based Cumulative Frequency Addition”, Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 [5] Rahmoun A, Elberrichi Z. Experimenting N-Grams in Text Categorization. International Arab Journal of Information Technology, 2007, Vol 4, N°.4, pp. 377- 385. [6] ABDELMALEK AMINE, MICHEL SIMONET, ZAKARIA ELBERRICHI, “On Automatic Language Identification International Journal of Computer Science and Applications, ©Technomathematics Research Foundation Vol. 7, No. 1, pp. 94 – 107, 2010. (references) [7] Yuguang Huang, Lei Li, “NAIVE BAYES CLASSIFICATION ALGORITHM BASED ON SMALL SAMPLE SET”, Beijing University of Posts and Telecommunications, Beijing, China, 2011 [8] Thorsten Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”, Universit• at Dortmund, Informatik LS8, Baroper Str. 301, 44221 Dortmund, Germany [9] Daniel Jurafsky & James H. Martin, “Speech and Language Processing, Chapter 7 - Logistic Regression”, Copyright c 2016, Draft of August 7, 2017.