This study proposes a method for developing neural morphological analyzers for Japanese Hiragana sentences using a Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because Japanese does not place delimiters between words. Hiragana is a Japanese phonogramic script used in texts for children or for readers who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because less information is available for word segmentation. For morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning a model trained on ordinary Japanese text and examine the influence of training data drawn from texts of various genres.
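The abstract gives no implementation details, but a character-level Bi-LSTM with a CRF output layer is the standard way to frame Hiragana segmentation and tagging as sequence labeling. The sketch below is a minimal illustration of that architecture, assuming the third-party pytorch-crf package and a hypothetical BIO-plus-POS tag set; it is not the authors' actual model. Fine-tuning, as described in the abstract, would correspond to initializing such a model from weights trained on ordinary Japanese text and continuing training on Hiragana-only data.

```python
# Minimal character-level Bi-LSTM + CRF tagger (sketch, not the paper's model).
# Assumes: pip install torch pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party CRF layer


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(hidden_dim, num_tags)   # per-character emission scores
        self.crf = CRF(num_tags, batch_first=True)    # learned tag-transition scores

    def loss(self, chars, tags, mask):
        emissions = self.proj(self.lstm(self.emb(chars))[0])
        return -self.crf(emissions, tags, mask=mask, reduction='mean')

    def decode(self, chars, mask):
        emissions = self.proj(self.lstm(self.emb(chars))[0])
        return self.crf.decode(emissions, mask=mask)   # best tag sequence per sentence


# Toy usage: a batch of 2 Hiragana sentences padded to length 6, with
# hypothetical tags such as B-NOUN / I-NOUN / B-PART encoded as integers.
model = BiLSTMCRFTagger(vocab_size=90, num_tags=12)
chars = torch.randint(1, 90, (2, 6))
tags = torch.randint(0, 12, (2, 6))
mask = torch.ones(2, 6, dtype=torch.bool)
print(model.loss(chars, tags, mask))
print(model.decode(chars, mask))
```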
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis (IJERA Editor)
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the Marathi TTS system. Since Festival does not provide complete language processing support specific to various languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages. Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has been used for phonetising large text corpora, which in turn are used in designing an inventory of phonetically rich sentences. The sentences ensured good coverage of the phonetically valid di-phones using only 1.3% of the complete text corpora.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG... (ijnlc)
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical source. Our method leverages a resource available in another language, such as SentiWordNet in English, to supplement the limited language resources of Arabic. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using machine translation and the English SentiWordNet. The scores for each text are added to the model as three values for objective, positive, and negative. The last step of our method involves training the ML classifier on that model to predict the sentiment of the Arabic text. Our method increases performance compared with the BOW baseline model in most cases. In addition, it appears to be a viable approach to sentiment analysis of Arabic text where available resources are limited.
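As a rough illustration of the pipeline described above (build a BOW model, compute objective/positive/negative scores from a translated lexicon, append them as features, train a classifier), the following scikit-learn sketch appends three lexicon-derived scores to a bag-of-words matrix. The lexicon_scores function is a hypothetical stand-in for the translation-plus-SentiWordNet step, which is not reproduced here, and the toy documents and labels are placeholders.

```python
# Sketch: inject lexicon polarity scores into a BOW model (not the paper's exact code).
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC


def lexicon_scores(text):
    """Hypothetical stand-in for translating tokens to English and summing
    SentiWordNet objective/positive/negative scores for the document."""
    return [0.5, 0.3, 0.2]   # (objective, positive, negative)


train_texts = ["النص الأول", "النص الثاني"]   # toy Arabic documents
train_labels = [1, 0]                          # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(train_texts)                              # step 1: BOW
extra = csr_matrix(np.array([lexicon_scores(t) for t in train_texts]))   # step 2: scores
features = hstack([bow, extra])                                          # inject into the feature model

clf = LinearSVC().fit(features, train_labels)                            # step 3: train classifier
```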
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH (ijnlc)
Machine translation has been pursued by researchers for quite a long time. However, a flawless machine translator is still a distant goal, and only a small number of researchers have focused on translating Marathi text to English. Perfect machine translation systems have not yet been fully built, owing to the fact that languages differ syntactically as well as morphologically. The majority of researchers have opted for statistical machine translation, whereas in this paper we address the challenges of rule-based machine translation. The paper describes the major divergences observed between Marathi and English, the many challenges encountered while attempting to build a machine translation system from Marathi to English using a rule-based approach, and the rules to handle these challenges. As there are exceptions to the rules and limits to the feasibility of maintaining a knowledge base, practical machine translation from Marathi to English is a complex task.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach (IJERA Editor)
Language is an effective medium of communication that conveys the ideas and expressions of the human mind. There are more than 5000 languages in the world used for communication, and knowing all of them is not a solution to the problems caused by the language barrier. In this multilingual world, with a huge amount of information exchanged between various regions and in different languages in digitized format, it has become necessary to find an automated process to convert from one language to another. Natural Language Processing (NLP) is an active area of research that explores how computers can be utilized to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to transliterate proper nouns from Punjabi to Hindi is developed. The hybrid approach is a combination of direct mapping, a rule-based approach, and the Statistical Machine Translation (SMT) approach. The proposed system is tested on various proper nouns from different domains, and its accuracy is very good.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES (ijnlc)
Stemming is the process of term conflation. It conflates all the variants of a word to a common form called the stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
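The survey does not fix a single algorithm, but the simplest family it covers is lightweight, rule-based suffix stripping. Below is a minimal sketch of such a stemmer with a purely illustrative suffix list; the suffixes shown are not taken from any particular Indic stemmer.

```python
# Toy longest-match suffix stripper; the suffix list is illustrative only.
SUFFIXES = sorted(["ions", "ing", "ed", "es", "s"], key=len, reverse=True)


def stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


# "translations", "translated" and "translating" conflate to the common stem "translat".
print([stem(w) for w in ["translations", "translated", "translating"]])
```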
Quality estimation of machine translation outputs through stemming (ijcsa)
Machine translation is a challenging problem for Indian languages. New machine translators are being developed every day, but high-quality automatic translation is still a very distant goal, and a correctly translated sentence for Hindi is rarely obtained. In this paper, we focus on the English-Hindi language pair, and in order to select the correct MT output we present a ranking system that employs machine learning techniques and morphological features. The ranking requires no human intervention. We have also validated our results by comparing them with human rankings.
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native languages. The machine translation based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE (csandit)
Arabic multiword terms are relevant strings of words in text documents. Once they are automatically extracted, they can be used to increase the performance of text mining applications such as categorisation, clustering, information retrieval systems, machine translation, and summarization. This paper introduces our proposed multiword term extraction system based on contextual information. In fact, we propose a new method based on a hybrid approach for Arabic multiword term (AMWT) extraction. Like other hybrid methods, our method is composed of two main steps: a linguistic step and a statistical one. In the first step, the linguistic approach uses a Part Of Speech (POS) tagger (Taani's tagger) and a sequence identifier as patterns in order to extract candidate AMWTs. The second step, which includes our main contribution, is the statistical approach, which incorporates contextual information by using a newly proposed association measure based on termhood and unithood for AMWT extraction. To evaluate the efficiency of our proposed method for AMWT extraction, the latter has been tested and compared using three different association measures: the proposed one, named NTC-Value, as well as NC-Value and C-Value. Experimental results on Arabic texts taken from the environment domain show that our hybrid method outperforms the others in terms of precision; in addition, it can deal correctly with tri-gram Arabic multiword terms.
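The paper's NTC-Value measure is not restated here, but one of its baselines, the classic C-Value, can be written down directly: for a candidate term a of length |a| (in words) and frequency f(a), C-Value(a) = log2|a| * f(a) when a is not nested in longer candidates, and log2|a| * (f(a) - (1/|T_a|) * sum of f(b) over b in T_a) otherwise, where T_a is the set of longer extracted candidates containing a. A small sketch under those definitions, with invented counts:

```python
# Sketch of the classic C-Value termhood measure (a baseline mentioned in the paper).
import math


def c_value(term, freq, longer_terms):
    """freq: dict candidate -> frequency; longer_terms: dict candidate -> list of
    longer candidates that contain it (both assumed to be precomputed)."""
    words = term.split()
    base = math.log2(len(words)) if len(words) > 1 else 0.0
    nests = longer_terms.get(term, [])
    if not nests:
        return base * freq[term]
    return base * (freq[term] - sum(freq[b] for b in nests) / len(nests))


freq = {"information retrieval": 20, "information retrieval system": 8}
longer = {"information retrieval": ["information retrieval system"]}
print(c_value("information retrieval", freq, longer))          # nested candidate
print(c_value("information retrieval system", freq, longer))   # non-nested candidate
```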
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS (ijnlc)
Globalization and the growth of Internet users demand that almost all Internet-based applications support local languages. Support for local languages can be provided in all Internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that the linguistic approach provides better results for closely related languages, while probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
A New Approach to Parts of Speech Tagging in Malayalam (ijcsit)
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually, these tags mark syntactic classification such as noun or verb, and sometimes include additional information such as case markers (number, gender, etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches usually followed in parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; this is the oldest approach, and it uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It uses a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity for both. The stochastic approach is the widely used one nowadays because of its accuracy.
Malayalam belongs to the Dravidian family of languages and is inflectional, with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they are not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent-tag information, and the algorithm uses multithreading. Tagging is done using the probability of occurrence of the sentence structure along with the dictionary entry.
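The abstract describes combining dictionary entries with adjacent-tag information. A minimal way to realize that idea is a greedy tagger that restricts each word to its dictionary tags and picks the one with the highest transition probability from the previous tag. The sketch below uses made-up dictionary entries, transliterated example words, and transition probabilities, not the author's resources.

```python
# Greedy dictionary + adjacent-tag tagger (illustrative data, not the paper's resources).
DICTIONARY = {            # word -> tags permitted by the lexicon
    "avan": ["PRON"],
    "pusthakam": ["NOUN"],
    "vaayichu": ["VERB", "NOUN"],
}
TRANSITION = {            # P(tag | previous tag), from a hypothetical tagged corpus
    ("<S>", "PRON"): 0.4, ("<S>", "NOUN"): 0.3,
    ("PRON", "NOUN"): 0.5, ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.2,
}


def tag(sentence):
    prev, tags = "<S>", []
    for word in sentence:
        candidates = DICTIONARY.get(word, ["NOUN"])   # unknown words default to NOUN
        best = max(candidates, key=lambda t: TRANSITION.get((prev, t), 0.0))
        tags.append(best)
        prev = best
    return tags


print(tag(["avan", "pusthakam", "vaayichu"]))   # -> ['PRON', 'NOUN', 'VERB']
```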
Design and Development of a Malayalam to English Translator - A Transfer Based... (Waqas Tariq)
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, to English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic structure transfer module, and a bilingual dictionary. All the modules are morpheme based to reduce dictionary size. The system does not rely on a stochastic approach; it is based on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. The system uses two sets of rules: rules for Malayalam morphology and rules for syntactic structure transfer from Malayalam to English. The system is designed using artificial intelligence techniques.
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S... (ijnlc)
Machine translation for Indian languages is an emerging research area. Transliteration is one of the modules designed when building a translation system; it means mapping source language text into the target language. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if part-of-speech tagging and stemming assist the transliteration. We show that much of the content in Gujarati gets transliterated while being processed for translation to Hindi.
Transliteration by orthography or phonology for Hindi and Marathi to English... (ijnlc)
e-Governance and web-based online commercial multilingual applications have given utmost importance to the tasks of translation and transliteration. Named entities and technical terms that occur in the source language of translation are called out-of-vocabulary words, as they are not available in the multilingual corpus or dictionary used to support the translation process. These named entities and technical terms need to be transliterated from the source language to the target language without losing their phonetic properties. The fundamental problem in India is that there is no set of rules for writing English spellings for Indian languages according to their linguistics; people write different spellings for the same name at different places. This fact certainly affects the Top-1 accuracy of transliteration and, in turn, the translation process. The major issue we noticed is the transliteration of named entities consisting of three syllables or three phonetic units in Hindi and Marathi, where people use a mixed approach to write the spelling, either orthographically or phonologically. In this paper, the authors provide, through experimentation, their assessment of the appropriateness of either approach.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING (kevig)
This paper describes the use of Naive Bayes to address the task of assigning function tags, and of a context-free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use a functionally annotated corpus and tag Myanmar sentences that have correct segmentation, POS (part-of-speech) tagging, and chunking information. We propose Myanmar grammar rules and apply a context-free grammar (CFG) to find the parse tree of function-tagged Myanmar sentences. Experiments show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
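The abstract describes parsing function-tagged sentences with hand-written CFG rules. A small sketch of that step using NLTK's chart parser is shown below; the grammar and the function-tagged tokens are invented stand-ins for the Myanmar rules and corpus, which are not reproduced here.

```python
# Sketch: parse a function-tagged token sequence with a toy CFG (not the paper's grammar).
import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP_SUBJ NP_OBJ VP
  NP_SUBJ -> 'N.subj'
  NP_OBJ -> 'N.obj'
  VP -> 'V.pred'
""")
parser = nltk.ChartParser(grammar)

# Tokens here are the function tags produced by the (separate) Naive Bayes tagging step.
tagged_sentence = ['N.subj', 'N.obj', 'V.pred']   # SOV order, as in Myanmar
for tree in parser.parse(tagged_sentence):
    tree.pretty_print()
```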
Detection of slang words in e-data using semi-supervised learning (ijaia)
The proposed algorithmic approach deals with finding the sense of a word in electronic data. Nowadays, in different communication media such as the Internet and mobile services, people use words that are slang in nature. This approach detects such abusive words using a supervised learning procedure. In real-life scenarios, however, slang words are not always used in their complete word forms; most of the time, they appear in abbreviated forms such as sounds-alike forms, taboo morphemes, etc. The proposed approach can also detect those abbreviated forms using a semi-supervised learning procedure. Using synset and concept analysis of the text, the probability of a suspicious word being a slang word is also evaluated.
Cross lingual similarity discrimination with translation characteristics (ijaia)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model, trained on a bilingual corpus, that divides a set of sentences in the target language into two classes according to their similarity to a given sentence in the source language. Positive outputs of the discriminative model are then ranked according to the similarity probabilities, and the translation candidates of the given sentence are finally selected from the top-n positive results. One of the problems in model building is the extremely imbalanced training data, in which positive samples are the translations of the target sentences while negative samples, or non-translations, are numerous or unknown. We train models on four kinds of sampling sets with the same translation characteristics and compare their performance. Experiments on an open dataset of 1500 pairs of English-Chinese sentences are evaluated with three metrics and show satisfactory performance, much higher than the baseline system.
Tamil-English Document Translation Using Statistical Machine Translation Appr... (baskaran_md)
The paper presents a new method for translating a text document from Tamil to English. Our method is based on the statistical machine translation approach combined with morphological analysis, since Tamil is a highly inflected language. The paper presents a slight modification of SMT to make the approach more efficient and effective, and the experimental results show the method to be fast and accurate in the translation process.
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ... (kevig)
Neural machine translation is a new approach to machine translation that has shown effective results for high-resource languages. Recently, attention-based neural machine translation with large-scale parallel corpora has played an important role in achieving high-quality translation results. In this research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural machine translation models are introduced at the word-to-word, character-to-word, and syllable-to-word levels. We run experiments with the proposed models to translate long sentences and to address morphological problems. To mitigate the low-resource problem, source-side monolingual data are also used. Thus, this work investigates how to improve a Myanmar-to-English neural machine translation system. The experimental results show that the syllable-to-word level neural machine translation model obtains an improvement over the baseline systems.
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Verb phrase identification within a sentence is very useful for a variety of NLP applications, and one of the core enabling technologies required in NLP applications is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and very few lexicons have been created or researched for it compared with other languages such as English, French, and Czech. Therefore, this paper proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results show that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE (ijnlc)
Manipuri is both a minority and a morphologically rich language with genetic features similar to Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology, and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a useful research field of computer science that deals with processing large amounts of natural language corpora. NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD), etc. In this paper, we present a study on the advancements in NLP applications for the Manipuri language, presenting a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ... (kevig)
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime or game characters using subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and each anime character, and show that they are linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed the one using the conventional method.
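The specific subword segmenter used in the study is not restated here; as a stand-in, the sketch below uses character n-grams as crude "subword units" and weights them with TF-IDF per character group, which is the general shape of the analysis described. A real setup would more likely use BPE or SentencePiece units, and the documents shown are placeholders.

```python
# Sketch: weight subword-like units by TF-IDF per speaker group.
# Character n-grams stand in for learned subword units (BPE/SentencePiece).
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per character/group: all of that character's lines concatenated.
docs = {
    "character_A": "line one of character A ... line two of character A ...",
    "character_B": "line one of character B ... line two of character B ...",
}

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vectorizer.fit_transform(docs.values())
vocab = vectorizer.get_feature_names_out()

# The top-weighted units for each group approximate its characteristic speech patterns.
for row, name in zip(matrix.toarray(), docs):
    top = sorted(zip(row, vocab), reverse=True)[:5]
    print(name, [unit for _, unit in top])
```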
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as an important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words, and proper transliteration of named entities plays a very significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of extracting or separating syllables from words. We calculate probabilities for named entities (proper names and locations). For words that do not fall under the category of named entities, separate probabilities are calculated using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities, we transliterate the input text from English to Punjabi.
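As an illustration of the two components the abstract mentions (rule-based syllabification of the English input and relative-frequency probabilities for syllable-to-Punjabi mappings), here is a toy sketch. The syllabification rule and the aligned training pairs are invented and much simpler than the actual system, and the MOSES toolkit itself is not invoked.

```python
# Toy syllabification + relative-frequency transliteration probabilities (illustrative only).
from collections import Counter, defaultdict
import re


def syllabify(word):
    """Very crude rule: split after each vowel group."""
    return re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]+$)?", word.lower()) or [word]


# Hypothetical aligned (English syllable, Punjabi syllable) pairs from training data.
aligned = [("mo", "ਮੋ"), ("han", "ਹਨ"), ("mo", "ਮੋ"), ("ti", "ਤੀ"), ("han", "ਹਾਨ")]

counts = defaultdict(Counter)
for en, pa in aligned:
    counts[en][pa] += 1


def p(pa, en):
    """Relative frequency P(target syllable | source syllable)."""
    return counts[en][pa] / sum(counts[en].values())


print(syllabify("mohan"))                # -> ['mo', 'han']
print(p("ਹਨ", "han"), p("ਹਾਨ", "han"))    # -> 0.5 0.5
```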
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen... (cscpconf)
Sentiment Analysis (SA) and machine learning techniques are combined to understand the attitude of a text's writer as implied in that text. Although SA is challenging in itself, it is especially challenging for the Arabic language. In this paper, we enhance sentiment analysis for Arabic. Our approach begins with special pre-processing steps. We then adopt the sentiment keyword co-occurrence measure (SKCM) as a sentiment-based feature selection method. This feature selection method is applied to three sentiment corpora using an SVM classifier. We compared our approach with some traditional methods followed by most SA works. The experimental results were very promising for enhancing SA accuracy.
ANNOTATED GUIDELINES AND BUILDING REFERENCE CORPUS FOR MYANMAR-ENGLISH WORD A... (ijnlc)
A reference corpus for word alignment is an important resource for developing and evaluating word alignment methods. For the Myanmar-English language pair, there is no reference corpus for evaluating word alignment tasks. Therefore, we created guidelines for Myanmar-English word alignment annotation between the two languages through contrastive learning and built a Myanmar-English reference corpus consisting of verified alignments from the Myanmar portion of the Asian Language Treebank (ALT). This reference corpus contains the confidence labels sure (S) and possible (P) for word alignments, which are used as a test set for evaluating word alignment tasks. We discuss the most common linking ambiguities in order to define consistent and systematic instructions for manual word alignment. We evaluated annotator agreement using our reference corpus in terms of alignment error rate (AER) on word alignment tasks, and we discuss word relationships in terms of BLEU scores.
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM (ijnlc)
This article describes morphological analysis for standard Arabic natural language processing as part of an electronic dictionary construction phase. A full model of 3-letter inflected verbs is formalized based on a linguistic classification, using the NooJ platform. The classification yields certain representative verbs that are considered as lemmas; these verbs form our dictionary entries and are conjugated according to our inflection paradigms, relying on certain specific morphological properties. This dictionary will serve as an Arabic resource that will help NLP applications and the NooJ platform to analyse sophisticated Arabic corpora.
High-quality end-to-end speech translation models rely on large-scale speech-to-text training data, which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we propose a target-side data augmentation method for low-resource speech translation. In particular, we first generate large-scale target-side paraphrases based on a paraphrase generation model that incorporates several statistical machine translation (SMT) features and the commonly used recurrent neural network (RNN) feature. Then, a filtering model consisting of semantic similarity and word-pair/speech co-occurrence is proposed to select the highest-scoring source-paraphrase pairs from the candidates. Experimental results on English, Arabic, German, Latvian, Estonian, Slovenian, and Swedish paraphrase generation show that the proposed method achieves significant and consistent improvements over several strong baseline models on the PPDB datasets (http://paraphrase.org/). To introduce the paraphrase generation results into low-resource speech translation, two strategies are proposed: recombination of audio-text pairs and multi-reference training. Experimental results show that speech translation models trained on new audio-text datasets that incorporate the paraphrase generation results lead to substantial improvements over the baselines, especially for low-resource languages.
SENTIMENT ANALYSIS OF BANGLA-ENGLISH CODE-MIXED AND TRANSLITERATED SOCIAL MED... (indexPub)
In this era of technology, online communication has expanded to incorporate different language combinations and expressive styles. Code-mixing, or combining languages such as Bengali and English, is frequent in multilingual environments. In addition, the use of emojis, which are small graphical icons, has increased the complexity of online communication by allowing for more emotional expression. Sentiment analysis of social media comments that mix emojis with Bangla-English text is the main topic of this work. This research aims to understand and classify the sentiments expressed in code-mixed comments, including the nuanced role of emojis. The data was preprocessed first, and then feature extraction was performed using the TF-IDF Vectorizer and CountVectorizer algorithms. Nine different machine learning algorithms were used for the analysis. With a remarkable accuracy of 85.7% and an F1 score of 85.0%, the Support Vector Classifier stood out as the most successful model. This highlights the utility of including emoji-based features for complex sentiment analysis, particularly when dealing with code-mixed data. The dataset contained 2055 comments from Facebook pages, including Bangla-English comments with and without emojis and comments that only contained emojis. To prepare the dataset for analysis, preprocessing included removing irrelevant data and converting emojis into Unicode short names.
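The described preprocessing and classification chain (convert emojis to Unicode short names, extract TF-IDF features, train a support vector classifier) can be sketched as follows. The two toy comments and their labels are placeholders rather than samples from the dataset, and the third-party emoji package's demojize function stands in for the emoji-to-short-name step.

```python
# Sketch of the described pipeline: demojize -> TF-IDF -> Support Vector Classifier.
# Assumes: pip install emoji scikit-learn
import emoji
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy code-mixed comments with emojis (placeholders, not from the actual dataset).
comments = ["khub bhalo laglo 😍", "ekdom baje service 😡"]
labels = [1, 0]   # 1 = positive, 0 = negative

texts = [emoji.demojize(c) for c in comments]   # emojis become :short_name: tokens

model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)
print(model.predict([emoji.demojize("darun hoyeche 😍")]))
```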
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
Similar to MORPHOLOGICAL ANALYZER USING THE BI-LSTM MODEL ONLY FOR JAPANESE HIRAGANA SENTENCES (20)
Effect of Query Formation on Web Search Engine Resultskevig
Query in a search engine is generally based on natural language. A query can be expressed in more than
one way without changing its meaning, since the wording depends on how a person thinks at a particular moment.
The searcher's aim is to get the most relevant results regardless of how the query has been expressed. In the
present paper, we have examined the results of search engine for change in coverage and similarity of first
few results when a query is entered in two semantically same but in different formats. Searching has been
made through Google search engine. Fifteen pairs of queries have been chosen for the study. The t-test has
been used for the purpose and the results have been checked on the basis of total documents found,
similarity of first five and first ten documents found in the results of a query entered in two different
formats. It has been found that the total coverage is same but first few results are significantly different.
Investigations of the Distributions of Phonemic Durations in Hindi and Dogrikevig
Speech generation is one of the most important areas of research in speech signal processing, and it is now gaining serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human-like voice are expected to contribute to the development of a more natural man-machine interface. However, in order to provide functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate more natural-sounding systems. This field of study, also called speech synthesis and more prominently acknowledged as text-to-speech synthesis, originated in the mid-eighties because of the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field of speech, it is necessary to understand the basic theory of speech production. Every language has different phonetic alphabets and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we have carried out recordings of five speakers in Dogri (3 male and 5 females) and eight speakers in the Hindi language (4 male and 4 female). For estimating the durational distributions, the mean of means of ten instances of vowels of each speaker in both languages has been calculated. Investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of a phoneme is speaker dependent. The whole investigation can be concluded with the end result that almost all the Dogri phonemes have a shorter duration in comparison to Hindi phonemes. The durations in milliseconds of the same phonemes when uttered in Hindi were found to be longer than when they were spoken by a person with Dogri as his mother tongue. There are many applications which are directly or indirectly related to the research being carried out. For instance, the main application may be transforming Dogri speech into Hindi and vice versa; further utilizing this application, we can develop a speech aid to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large vocabulary speech recognition systems.
May 2024 - Top10 Cited Articles in Natural Language Computingkevig
Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and build software that will analyze, understand, and generate languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perceptionkevig
Speech is an important biological signal for primary mode of communication among human being and also the most natural and efficient form of exchanging information among human in speech. Speech processing is the most important aspect in signal processing. In this paper the theory of linear algebra called singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving important parameters of a signal. The parameters derived using SVD may further be reduced by perceptual evaluation of the synthesized speech using only perceptually important parameters, where the speech signal can be compressed so that the information can be transformed into compressed form without losing its quality. This technique finds wide applications in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD based feature selection of the input speech on the perception of the processed speech signal. The speech signal which is in the form of vowels \a\, \e\, \u\ were recorded from each of the six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD based processing and the effect of the reduction in singular values was investigated on the perception of the resynthesized vowels using reduced singular values. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Modelskevig
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4,
demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to
identify which specific terms in prompts positively or negatively impact relevance
evaluation with LLMs. We employed two types of prompts: those used in previous
research and those generated automatically by LLMs. By comparing the performance of
these prompts in both few-shot and zero-shot settings, we analyze the influence of
specific terms in the prompts. We have observed two main findings from our study.
First, we discovered that prompts using the term ‘answer’ lead to more effective
relevance evaluations than those using ‘relevant.’ This indicates that a more direct
approach, focusing on answering the query, tends to enhance performance. Second,
we noted the importance of appropriately balancing the scope of ‘relevance.’ While
the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments.
The inclusion of few-shot examples helps in more precisely defining this balance.
By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute
to refining relevance criteria. In conclusion, our study highlights the significance of
carefully selecting terms in prompts for relevance evaluation with LLMs.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether same words contained in different
documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Genetic Approach For Arabic Part Of Speech Taggingkevig
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Improving Dialogue Management Through Data Optimizationkevig
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification using Parsed Language Structurekevig
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
Rag-Fusion: A New Take on Retrieval Augmented Generationkevig
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...kevig
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLM and generative models), consuming much less energy and requiring fewer resources. These findings suggest that companies should consider additional considerations when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Languagekevig
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithmkevig
Information Retrieval (IR) is a very important and vast area. When searching for a context, the web returns all
the results related to the query. Identifying the relevant result is the most tedious task for a user. Word Sense
Disambiguation (WSD) is the process of identifying the senses of word in textual context, when word has
multiple meanings. We have used the approaches of WSD. This paper presents a Proposed Dynamic Page
Rank algorithm that is improved version of Page Rank Algorithm. The Proposed Dynamic Page Rank
algorithm gives much better results than existing Google’s Page Rank algorithm. To prove this we have
calculated Reciprocal Rank for both the algorithms and presented comparative results.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
MORPHOLOGICAL ANALYZER USING THE BI-LSTM MODEL ONLY FOR JAPANESE HIRAGANA SENTENCES
Jun Izutsu1 and Kanako Komiya2
1 Ibaraki University
2 Tokyo University of Agriculture and Technology
ABSTRACT
This study proposes a method to develop neural models of the morphological analyzer for Japanese
Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that
divides text data into words and assigns information such as parts of speech. In Japanese natural
language processing systems, this technique plays an essential role in downstream applications
because the Japanese language does not have word delimiters between words. Hiragana is a type
of Japanese phonogramic characters, which is used for texts for children or people who cannot read
Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of
ordinary Japanese sentences because there is less information for dividing. For morphological
analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model
based on ordinary Japanese text and examined the influence of training data on texts of various
genres.
KEYWORDS
Morphological analysis, Hiragana texts, Bi-LSTM CRF model, Fine-tuning, Domain adaptation
1. INTRODUCTION
1.1. Components of the Japanese Language and the Acquisition Process
Japanese sentences contain various kinds of characters, such as Kanji (Chinese characters),
Hiragana, Katakana, numbers, and the alphabet, which makes the language difficult to learn. Japanese
speakers usually learn Hiragana first in their school days because the number of characters
is much smaller than that of Kanji; Hiragana has 46 characters, whereas Japanese uses thousands
of Kanji. Most Japanese sentences are composed of all these kinds of characters and are called
Kanji-Kana mixed sentences.
However, it is difficult for many non-Japanese speakers to learn thousands of Kanji, so
children and new Japanese language learners use Hiragana.
DOI: 10.5121/ijnlc.2022.11103
In particular, elementary school students learn Hiragana and Katakana first.
After mastering these two, they learn simple Kanji. Similarly, foreigners who are new to
Japanese also learn Hiragana and Katakana first. The first sentences they read are almost
entirely composed of Hiragana.
1.2. Morphological Analysis of Hiragana Sentences
Morphological analysis is a technique that divides natural language text data into words
and assigns information such as parts of speech. In Japanese, morphological analysis is one
of the core technologies for natural language processing because the Japanese language
does not have word delimiters between words. Morphological analyzers like MeCab and
ChaSen are now commonly used for morphological analysis. However, since the above
systems target Kanji-Kana mixed sentences, it is challenging to apply them to
sentences written only in Hiragana.
Morphological analysis of Hiragana-only sentences is more challenging than
morphological analysis of Kanji-Kana mixed sentences. If the text is a mixture of Kanji
and Kana (Hiragana and Katakana), it can often be divided at the boundaries between Kanji,
Hiragana, and Katakana. However, if the text consists only of Hiragana, there is less information
for dividing it. Therefore, we propose fine-tuning a model trained on Kanji-Kana mixed
sentences and investigate whether the accuracy of morphological analysis of Hiragana
sentences can be improved by inheriting the information used to divide text into words.
The Bi-LSTM CRF model was used to develop a morphological analyzer for Hiragana
sentences in this paper. We used two types of training data, Wikipedia and Yahoo!
Answers from the Balanced Corpus of Contemporary Written Japanese (BCCWJ), to
investigate the influence of the genre of the training data. We also fine-tuned models on
both datasets to examine the effect of texts of various genres. [7]
The following are the four contributions of this paper.
- Developed a morphological analyzer for Hiragana sentences using the Bi-LSTM CRF model,
- Demonstrated the effectiveness of fine-tuning using a model based on Kanji-Kana mixed text,
- Examined the influence of training data of morphological analysis on texts of various genres, and
- Demonstrated the effectiveness of fine-tuning using data from Wikipedia and Yahoo! Answers.
In this paper, we report the results of these experiments. This paper is an extended
version of “Morphological Analysis of Japanese Hiragana Sentences Using the Bi-LSTM
CRF Model”, published in the proceedings of the 10th International Conference on Natural
Language Processing (NLP 2021).
2. RELATED WORK
Izutsu et al. (2020) [3] converted MeCab's IPAdic dictionary into Hiragana and performed
morphological analysis on Hiragana sentences using a corpus consisting only of Hiragana.
Moriyama et al. (2018) [8] developed a morphological analyzer for Kana sentences using a
Recurrent Neural Network Language Model. We also develop a morphological analyzer
for Hiragana-only sentences, but we try to achieve this goal without any dictionary.
Some studies have focused on morphological analyzers for sentences with a high proportion
of Hiragana, and most of them treated Hiragana words as noise or broken Japanese.
For example, Kudo et al. (2012) [4] used a generative model to model the process of
generating Hiragana noise-mixed sentences. They proposed using a large-scale web
corpus and the EM algorithm to estimate the model's parameters to improve the analysis of
Hiragana noise-mixed sentences. Osaki et al. (2016) [9] constructed a corpus for
morphological analysis of broken Japanese. They defined new parts of speech and used them
for broken expressions. Fujita et al. (2014) [1] proposed an unsupervised domain
adaptation technique that uses existing dictionaries and labeled data to build a
morphological analyzer by automatically transforming them to match the features of the target
domain. Hayashi and Yamamura (2017) [2] reported that adding Hiragana words to the
dictionary can improve the accuracy of morphological analysis.
We used the Bi-LSTM model to develop a morphological analyzer. Ma et al. (2018) [5]
developed a word segmentation model for Chinese using the Bi-LSTM model. They
reported that the Bi-LSTM model achieved better word segmentation accuracy on public
datasets than models based on more complex neural network architectures.
Also, Thattianaphanich and Prom-on (2019) [10] developed a Bi-LSTM CRF model
and performed named entity recognition in Thai. Thai poses linguistic
problems such as a lack of linguistic resources and of boundary markers between words,
phrases, and sentences. Therefore, they prepared word representations and learned text
sequences using Bi-LSTM and CRF to address these problems.
3. METHODS
In this research, we used the Bi-LSTM CRF model to generate a morphological analyzer
for Hiragana sentences. We trained and compared the following five models for Hiragana
sentences in our studies.
1. the Hiragana Wiki model
2. the Hiragana Yahoo! model
3. the Kanji-Kana Wiki+ Hiragana Wiki model
4. the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model
5. the Hiragana Wiki+ Hiragana Yahoo! model
The generation processes of the five models for Hiragana sentences are shown in Figure 1.
3.1. Hiragana Wiki Model
The Hiragana Wiki model is the model generated in training B (Figure 1). We used the
data from Wikipedia, where Kanji-Kana mixed sentences were converted into Hiragana, as
training data. Hiragana characters are phonograms, and Kanji can be converted into Hiragana
according to their pronunciation. We used MeCab's reading output as a pseudo-correct answer
for the conversion because Wikipedia does not have Hiragana-only data. Here, UniDic
was used as the dictionary of MeCab. [6]
By default, MeCab uses IPAdic as its dictionary. However, the BCCWJ corpus used as test
data uses UniDic, so we used UniDic as MeCab's dictionary.
3.2. Hiragana Yahoo! Model
The model generated in training D (Figure 1) is the Hiragana Yahoo! model. We used
reading data from Yahoo! Answers as training data. We compared its accuracy to a
Hiragana Wiki + Hiragana Yahoo! model (training F in Figure 1) and Kanji-Kana Wiki+
Hiragana Wiki+ Hiragana Yahoo! model (training E in Figure 1).
3.3. Kanji-Kana Wiki+ Hiragana Wiki Model
In this paper, we propose fine-tuning from a model trained on Kanji-Kana mixed sentences.
The Kanji-Kana Wiki+ Hiragana Wiki model is generated using the original Wikipedia
data and the Hiragana Wikipedia data, i.e., the same data converted into Hiragana only.
We fine-tuned the Kanji-Kana Wiki model, the model trained on the original Wikipedia data
(training A in Figure 1), with the Hiragana Wikipedia data, which is Wikipedia data
automatically converted into Hiragana sentences (training C in Figure 1). Kanji-Kana
mixed sentences contain many clues for morphological analysis, such as the boundaries
between Kanji and Hiragana and the information carried by the Kanji themselves. Therefore,
this fine-tuning is expected to improve accuracy. By comparing the Hiragana Wiki model with
this model, we can assess the effectiveness of fine-tuning based on Kanji-Kana mixed
Wikipedia data when we only have Wikipedia data.
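As a rough illustration of this two-stage scheme (training A followed by training C in Figure 1), the sketch below continues training a pretrained tagger on the Hiragana-converted data with the optimizer settings of Table 1. The names BiLSTMCRFTagger-style model, kanji_kana_loader, and hiragana_wiki_loader are hypothetical, and a shared character vocabulary covering both corpora is assumed; the paper does not publish its training script.

```python
# Minimal fine-tuning sketch (hypothetical names, not the authors' actual script).
# Assumes `model` exposes a neg_log_likelihood(chars, tags) method and that the
# character vocabulary covers both the Kanji-Kana and the Hiragana corpora.
from torch import optim

def pretrain_then_finetune(model, kanji_kana_loader, hiragana_wiki_loader,
                           epochs=15, lr=0.01, weight_decay=1e-4):
    optimizer = optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Stage 1: train on Kanji-Kana mixed sentences (training A).
    # Stage 2: continue training on Hiragana-converted sentences (training C).
    for loader in (kanji_kana_loader, hiragana_wiki_loader):
        for _ in range(epochs):
            for chars, tags in loader:  # batches of character ids and tag ids
                optimizer.zero_grad()
                loss = model.neg_log_likelihood(chars, tags)
                loss.backward()
                optimizer.step()
    return model
```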
3.4. Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! Model
Training E in Figure 1 generates the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo!
model. To improve accuracy, we used both Wikipedia and Yahoo! Answers as training
data.
We fine-tuned the Kanji-Kana Wiki+ Hiragana Wiki model with Hiragana Yahoo!
Answers data for this model.
3.5. Hiragana Wiki+ Hiragana Yahoo! Model
The Hiragana Wiki+ Hiragana Yahoo! model is the model generated in training F in
Figure 1. As training data, we used the Hiragana Wikipedia data and the Hiragana Yahoo!
Answers data. We fine-tuned the Hiragana Wiki model with Hiragana Yahoo! Answers.
By comparing the accuracy of this model with that of the Kanji-Kana Wiki+ Hiragana Wiki+
Hiragana Yahoo! model, we can see the effectiveness of the Kanji-Kana Wiki+ Hiragana Wiki
model, i.e., how much the Kanji-Kana mixed model contributes, when we have both
Wikipedia data and Yahoo! Answers.
Figure 1. Model generation processes of the five models for Hiragana sentences
4. EXPERIMENT
4.1. Network Architecture
We developed Bi-LSTM CRF models using a Bi-LSTM CRF module for PyTorch.
Table 1 summarizes the hyperparameters used in this experiment. The dimensions of
the network architecture were determined according to preliminary experiments.
Table 1. Hyperparameters used in the experiment
EMBEDDING_DIM: 5
HIDDEN_DIM: 4
Learning rate: 0.01
Weight decay: 1e-4
Optimizer: SGD
Epoch number: 15
The tag size is 38 for all models, and the vocabulary size varies across models
because it is the total number of character types in the training corpus. The vocabulary size
of the models that use only Hiragana sentences is 192, and that of the model that uses
the Kanji-Kana Wiki data is 2,742.
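One possible realization of such a network with the Table 1 settings is sketched below. The paper only states that a Bi-LSTM CRF module for PyTorch was used; here the third-party pytorch-crf package (torchcrf.CRF) stands in for the CRF layer, so the class and method names are assumptions rather than the authors' implementation.

```python
# Minimal Bi-LSTM CRF character tagger sketch with the Table 1 hyperparameters.
# Assumes the third-party `pytorch-crf` package for the CRF layer.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size=38, embedding_dim=5, hidden_dim=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        # Bidirectional LSTM over the character sequence; each direction
        # contributes hidden_dim // 2 units so the output size is hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.crf = CRF(tagset_size, batch_first=True)

    def _emissions(self, chars):              # chars: (batch, seq_len) char ids
        feats, _ = self.lstm(self.embed(chars))
        return self.hidden2tag(feats)         # (batch, seq_len, tagset_size)

    def neg_log_likelihood(self, chars, tags):
        return -self.crf(self._emissions(chars), tags)

    def decode(self, chars):
        return self.crf.decode(self._emissions(chars))

# Vocabulary size is 192 for the Hiragana-only models, 2,742 for the Kanji-Kana Wiki model.
model = BiLSTMCRFTagger(vocab_size=192)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```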
4.2. Training Data
We used two kinds of training data in this experiment: Japanese Wikipedia data and the
“Yahoo Answers Data” from the BCCWJ of the National Institute for Japanese Language and
Linguistics. For this experiment, the data from Wikipedia were extracted from
jawiki-latest-pages-articles.xml.bz2,
which is published on the Wikipedia dump website.
We preprocessed the Wikipedia data before training to obtain Hiragana-only sentences.
The preprocessing procedure is as follows. First, we conducted morphological analysis
on the Wikipedia data using MeCab. The output of MeCab has the following features: word
segmentation, part of speech, part-of-speech subdivisions 1 to 3, conjugation type,
conjugated form, basic form, reading, and pronunciation.
Table 2. MeCab's analysis result for “僕は君と遊ぶ”

word                         | 僕 (boku)   | は (ha)           | 君 (kimi)   | と (to)        | 遊ぶ (asobu)
part of speech               | noun        | particle          | noun        | particle       | verb
part-of-speech subdivision 1 | pronoun     | binding particle  | pronoun     | case particle  | non-bound
part-of-speech subdivision 2 | general     | *                 | general     | general        | *
part-of-speech subdivision 3 | *           | *                 | *           | *              | *
conjugation type             | *           | *                 | *           | *              | godan_verb_ba_column
conjugated form              | *           | *                 | *           | *              | basic form
basic form                   | 僕 (boku)   | は (ha)           | 君 (kimi)   | と (to)        | 遊ぶ (asobu)
reading                      | ボク (boku) | ハ (ha)           | キミ (kimi) | ト (to)        | アソブ (asobu)
pronunciation                | ボク (boku) | ワ (wa)           | キミ (kimi) | ト (to)        | アソブ (asobu)
Let us take the example of the sentence, “僕は君と遊ぶ" (boku wa kimi to asobu). Table
2 shows the output result of MeCab when we input this example sentence. This is a Kanji-
Kana mixed sentence, which means “I play with you." If this sentence is expressed entirely
in Hiragana, that would be “ぼくはきみとあそぶ. "
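A minimal sketch of this conversion step is shown below. It assumes the mecab-python3 bindings and an IPAdic-style feature layout in which the reading is the eighth comma-separated field; with UniDic, which the authors actually used, the field layout differs and the indices would need to be adapted. The Katakana readings returned by MeCab are mapped to Hiragana by a fixed Unicode offset.

```python
# Sketch: obtain (reading, POS) pairs with MeCab and convert the Katakana
# readings to Hiragana. Assumes mecab-python3 and an IPAdic-style feature
# layout (POS, ..., base form, reading, pronunciation).
import MeCab

def kata_to_hira(text):
    # Katakana U+30A1..U+30F6 maps to Hiragana by subtracting 0x60.
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)

def readings_with_pos(sentence, dic_args=""):
    tagger = MeCab.Tagger(dic_args)        # e.g. "-d /path/to/unidic"
    pairs = []
    for line in tagger.parse(sentence).splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, features = line.split("\t")
        fields = features.split(",")
        pos = fields[0]
        reading = fields[7] if len(fields) > 7 else surface  # fall back for unknown words
        pairs.append((kata_to_hira(reading), pos))
    return pairs

print(readings_with_pos("僕は君と遊ぶ"))
# e.g. [('ぼく', '名詞'), ('は', '助詞'), ('きみ', '名詞'), ('と', '助詞'), ('あそぶ', '動詞')]
```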
We obtained Hiragana data by replacing the surface forms of words with their readings.
Next, we split the Hiragana data into individual characters and assigned a part-of-speech
tag to each of them. Here, B-{Part-of-Speech} is assigned to the first Hiragana character
of a word, and I-{Part-of-Speech} is assigned to the following characters when the word's
reading consists of two or more characters.
By formatting the Hiragana sentence as described above, we obtain character data and tags
with a one-to-one correspondence (Table 3).
Table 3. Example of splitting “ぼくはきみとあそぶ”

character | ぼ (bo) | く (ku) | は (ha)    | き (ki) | み (mi) | と (to)    | あ (a) | そ (so) | ぶ (bu)
tag       | B-Noun  | I-Noun  | B-Particle | B-Noun  | I-Noun  | B-Particle | B-Verb | I-Verb  | I-Verb
Please note that Japanese Kanji characters often have more than one reading. For example,
“君" could be Kimi or Kun in Japanese. Therefore, sometimes this conversion makes some
errors. The number of characters in the training data is 1,183,624.
We used the original Japanese Wikipedia data for the model that uses Kanji-Kana mixed
sentences. Table 4 shows the characters and their tags of the Kanji-Kana mixed sentence,
“僕は君と遊ぶ."
Table 4. Example of splitting “僕は君と遊ぶ”

character | 僕 (boku) | は (ha)    | 君 (kimi) | と (to)    | 遊 (aso) | ぶ (bu)
tag       | B-Noun    | B-Particle | B-Noun    | B-Particle | B-Verb   | I-Verb
For Yahoo! Answers data, we extracted the reading data from the corpus.
Native Japanese speakers manually annotated them, but the annotations can sometimes differ
from the authors' intent. For example, “日本” (Japan in Japanese) has two readings,
“Nihon” and “Nippon,” and both of them are correct in most sentences. Therefore, in
these cases, only the author can guarantee which one is correct. In BCCWJ, some words
had a predefined reading to reduce the burden on annotators.
4.3. Test Data
We also used the BCCWJ for the test data.
The BCCWJ provides sub-corpora and we used 12 of them.
The number of characters used in this experiment for each dataset is shown in Table 5.
Table 5. Number of characters for each sub-corpus in BCCWJ
Dataset Name Number of characters
Books (Library sub-corpus) 1,374,216
Bestsellers 1,093,860
Yahoo! Answers 830,960
Legal Documents 2,316,374
National Diet Minutes 2,050,400
PR Documents 2,151,126
Textbooks 956,927
Poems 466,878
Reports 2,546,307
Yahoo! Blogs 1,305,660
Books 1,281,251
Newspapers 1,301,728
The test data are tagged Hiragana characters with a one-to-one correspondence between
each Hiragana character and its tag, based on reading and part-of-speech information,
and were created in the same way as the training data.
The data from “Yahoo! Answers” are also used as training data for creating the Hiragana
Yahoo! model, the Hiragana Wiki + Hiragana Yahoo! model, and the Kanji-Kana Wiki+
Hiragana Wiki+ Hiragana Yahoo! model; however, we used different parts for training
and testing.
4.4. BCCWJ Data
The BCCWJ data we used is shown in Table 5. Twelve datasets exist, each with a different
genre of data. The following section describes from which media and when each dataset
was collected. [11]
1. Books (Library sub-corpus)
Books (Library sub-corpus) is a random sample of books published during the 20 years
from 1986 that are held in public libraries in Tokyo.
2. Bestsellers
Bestsellers are a random sample of books that became bestsellers during a 30-year period
from 1976.
3. Yahoo! Answers
Yahoo! Answers is a randomly selected sample of data posted on the Q&A-style
knowledge community service Yahoo! Answers. The original data of Yahoo! Answers
includes 3,120,839 questions posted between October 2004 and October 2005, and
multiple answers to them.
4. Legal Documents
Legal Documents is a random sample of all laws promulgated during the 30 years from
1976 that were in effect in 2009, the year the corpus was compiled.
5. National Diet Minutes
National Diet Minutes uses meeting-minutes data from 32,986 meetings held between
the 77th and 163rd sessions of the Diet.
6. PR Documents
The population for PR Documents is defined as the PR documents published in 2008 by
100 local governments selected from all over Japan. The 100 municipalities were selected
taking region and population composition into account.
7. Textbooks
Textbooks is randomly sampled from textbooks used by elementary, junior high, and high
school students. However, some specialized subjects in high schools like agriculture and
commerce are excluded.
8. Poems
Poems was sampled from substitute works for three types of poems: Tanka, Haiku, and
poem.
9. Reports
Reports were randomly sampled from reports published during the 30 years from 1976.
10. Yahoo! Blogs
The Yahoo! Blogs is a random sample of article data from the Yahoo! Blogs. The original
data from Yahoo! Blogs contained a total of 3,463,413 articles.
11. Books
Books is a random sample of all books published in Japan during a five-year period
starting in 2001.
12. Newspapers
Newspapers were randomly sampled from all newspapers published in Japan during a
five-year period starting in 2001.
5. RESULT
According to the genres of the test data, Tables 6 and 7 summarize the accuracies of the
Hiragana Wiki model, Kanji-Kana Wiki+ Hiragana Wiki model, Kanji-Kana Wiki+
Hiragana Wiki+ Hiragana Yahoo! model, Hiragana Yahoo! model, and Hiragana Wiki+
Hiragana Yahoo! model. For the evaluation of each test dataset, we used macro and micro-
averages of accuracy. Macro represents the macro-averaged accuracy, and micro
represents the micro-averaged accuracy in Tables 6 and 7. In Table 6, blue numbers
indicate that the accuracy of the Kanji-Kana Wiki+ Hiragana Wiki model or the Kanji-
Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model was lower than that of the Hiragana
Wiki model, and an asterisk indicates that this difference was significant according to a
chi-square test for accuracy at the 5% significance level. In Table 7, magenta numbers
indicate that they are lower than the accuracy of the Kanji-Kana Wiki + Hiragana Wiki+
Hiragana Yahoo! model, and italics indicate that they are lower than the accuracy of the
Hiragana Yahoo! model. An asterisk indicates that the difference was significant
according to a chi-square test for accuracy at the 5% significance level. A plus indicates
that the Hiragana Wiki+ Hiragana Yahoo! model was different from the Hiragana Yahoo!
model according to a chi-square test for accuracy at the 5% significance level. The bold
numbers are the best results of the models in Tables 6 and 7.
For the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model, the Hiragana Yahoo!
model, and the Hiragana Wiki+ Hiragana Yahoo! model, the accuracies on Yahoo!
Answers are results of a closed test, which is why they are written in parentheses.
Also, the Yahoo! Answers evaluation data are excluded from the macro and micro averages
of all models. Therefore, the averages are computed over the 11 sub-corpora other than
Yahoo! Answers.
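For reference, the evaluation described here can be sketched as follows: macro- and micro-averaged accuracy over per-genre character counts, and a 2x2 chi-square test between two models for the significance marks. The use of scipy.stats.chi2_contingency is an assumption for illustration; the paper does not state which tool was used for the test.

```python
# Sketch: macro/micro accuracy over sub-corpora and a chi-square test between
# two models, as used for Tables 6 and 7. `counts` holds per-genre
# (n_correct, n_total) character counts, with Yahoo! Answers excluded.
from scipy.stats import chi2_contingency

def macro_micro(counts):
    accs = [c / t for c, t in counts.values()]
    macro = sum(accs) / len(accs)                       # average of per-genre accuracies
    micro = (sum(c for c, _ in counts.values())
             / sum(t for _, t in counts.values()))      # pooled character accuracy
    return macro, micro

def significantly_different(correct_a, total_a, correct_b, total_b, alpha=0.05):
    """2x2 chi-square test on correct/incorrect counts of two models."""
    table = [[correct_a, total_a - correct_a],
             [correct_b, total_b - correct_b]]
    _, p, _, _ = chi2_contingency(table)
    return p < alpha
```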
Table 6. Accuracy of Hiragana Wiki model, Kanji-Kana Wiki+ Hiragana Wiki model,
and Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model according to each text
genre. Blue numbers mean they are lower than the accuracy of the Hiragana Wiki model
and an asterisk indicates that the model was different from the Hiragana Wiki model
according to a chi-square test at the 5% level of significance. The bold numbers are the
best results of the models.
Genre                      | Hiragana Wiki | Kanji-Kana Wiki+ Hiragana Wiki | Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo!
Books (Library sub-corpus) | 56.98 | *57.15 | *62.76
Bestsellers                | 50.14 | *50.71 | *60.48
Yahoo! Answers             | 48.32 | *50.23 | (*65.50)
Legal Documents            | 63.39 | 63.32  | *60.55
National Diet Minutes      | 48.49 | *49.59 | *60.63
PR Documents               | 64.27 | *63.93 | *64.55
Textbooks                  | 57.43 | *58.01 | *62.04
Poems                      | 41.96 | *42.45 | *48.82
Reports                    | 65.83 | *65.57 | *65.15
Yahoo! Blogs               | 57.12 | 57.10  | *62.97
Books                      | 55.73 | 55.77  | *62.83
Newspapers                 | 62.58 | *62.30 | *64.50
Macro                      | 56.72 | 56.90  | 61.39
Micro                      | 58.67 | 58.80  | 62.44
Table 7. Accuracy of Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model,
Hiragana Yahoo! model, and Hiragana Wiki+ Hiragana Yahoo! model according to each
text genre. Magenta numbers indicate that they are lower than the accuracy of the
Kanji-Kana Wiki + Hiragana Wiki+ Hiragana Yahoo! model, and italics indicate that
they are lower than the accuracy of the Hiragana Yahoo! model. An asterisk indicates
that the model was different from the Kanji-Kana Wiki + Hiragana Wiki+ Hiragana
Yahoo! model according to a chi-square test at the 5% significance level. A plus
indicates that the model is different from Hiragana Yahoo! model at the 5% significance
level. The bold numbers are the best results of the models.
Genre                      | Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! | Hiragana Yahoo! | Hiragana Wiki+ Hiragana Yahoo!
Books (Library sub-corpus) | 62.76   | *63.11   | *+63.52
Bestsellers                | 60.48   | *60.16   | *+61.32
Yahoo! Answers             | (65.50) | (*65.85) | *+66.38
Legal Documents            | 60.55   | *60.17   | *+62.63
National Diet Minutes      | 60.63   | *60.49   | *+61.97
PR Documents               | 64.55   | *65.49   | *+63.91
Textbooks                  | 62.04   | 61.97    | *+62.66
Poems                      | 48.82   | *47.61   | *+49.55
Reports                    | 65.15   | *66.89   | *+64.89
Yahoo! Blogs               | 62.97   | *64.02   | +62.90
Books                      | 62.83   | *63.28   | *63.36
Newspapers                 | 64.50   | *65.54   | +64.51
Macro                      | 61.39   | 61.70    | 61.93
Micro                      | 62.44   | 62.93    | 63.01
We also evaluated the training data using MeCab. For the dictionary of MeCab, we used
ipadic with the conversion of Kanji into Hiragana. The macro-averaged accuracy was
79.71%, and the micro-averaged accuracy was 80.10%. Please note that this result cannot
simply be compared with results in Tables 6 and 7 because our system did not use a
dictionary for the morphological analysis itself.
6. DISCUSSION
Table 6 shows that the Kanji-Kana Wiki+ Hiragana Wiki model's macro and micro-
averaged accuracies (56.90% and 58.80%) are higher than those of the Hiragana Wiki
model (56.72% and 58.67%). The macro-averaged accuracy of the Kanji-Kana Wiki+
Hiragana Wiki model improved by 0.18 points, while the micro-averaged accuracy improved
by 0.13 points. This result indicates that fine-tuning using the Kanji-Kana Wiki model is
somewhat effective.
Furthermore, the macro and micro-averaged accuracies of the Kanji-Kana Wiki+
Hiragana Wiki+ Hiragana Yahoo! model (61.39% and 62.44%) are superior to those of
the Kanji-Kana Wiki+ Hiragana Wiki model (56.90% and 58.80%). The Kanji-Kana
Wiki+ Hiragana Wiki+ Hiragana Yahoo! model improved the accuracy by 4.49 points on
the macro-averaged accuracy and by 3.64 points on the micro-averaged accuracy, indicating
that further fine-tuning using the Hiragana Yahoo! data considerably improves the performance.
Additionally, from Tables 6 and 7, we can confirm that the macro- and micro-averaged
accuracies of the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo!, Hiragana Yahoo!,
and Hiragana Wiki+ Hiragana Yahoo! models, the models using Yahoo! Answers as
training data, are higher than those of Wikipedia and Kanji-Kana Wiki+ Hiragana Wiki
models, the models using only Wikipedia as training data. In other words, the
performance of the model is better when using Yahoo! Answers. We believe there could
be two reasons for this result: the quality of the corpus and the similarity of the training
and test data. The quality of the Hiragana Yahoo! data could be better than that of the
Hiragana Wiki data because the Hiragana Yahoo! data are manually annotated, whereas the Hiragana Wiki data
are automatically generated. Moreover, because the training and test data are both sub-
corpora of BCCWJ, they can be more similar than when the training and test data are
Wikipedia and BCCWJ.
Additionally, comparing the macro- and micro-averaged accuracies of the Kanji-Kana
Wiki+ Hiragana Wiki+ Hiragana Yahoo! model with those of the Hiragana Yahoo! model
and the Hiragana Wiki+ Hiragana Yahoo! model in Table 7, we confirmed that the
Hiragana Wiki+ Hiragana Yahoo! model is the best, and the Hiragana Yahoo! model is
the second best, and the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model is
the last. This result indicates that when the Hiragana Yahoo! data are available, the fine-
tuning using both the Kanji-Kana Wiki model and the Hiragana Wiki model is not
effective, although the fine-tuning using only the Hiragana Wiki model is useful. Notably,
the accuracy improved for some types of test data while it decreased for others. The Kanji-
Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model was higher than the Hiragana
Yahoo! model for the “Bestsellers," “Legal Documents," “National Diet Minutes,"
“Textbook," and “Poems", and it was higher than the Hiragana Wiki+ Hiragana Yahoo!
model for “PR Documents", “Reports", and “Yahoo! Blogs" but there was no genre where
the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model was the best in these three
models. Therefore, we think that the reason why the Kanji-Kana Wiki model was useful
when only the Hiragana Wiki data were available could be that the Hiragana Wiki data
were automatically generated.
Now, let us discuss the differences among the genres of the texts. There were four genres
where the accuracy of the Hiragana Wiki model was more than 60%: “Legal Documents,”
“PR Documents,” “Reports,” and “Newspapers.”
We believe this is because the writing style of these test data is close to that of Wikipedia.
Therefore, we marked these genres with underline in Tables 6 and 7. Additionally, we can
see that, as for these genres, the fine-tuning is not useful. The accuracies of the Kanji-
Kana Wiki+ Hiragana Wiki model decreased for these four genres, and those of the Kanji-
Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! model also decreased for two genres. As
for the remaining two genres, the improvements are less than two points.
Figure 2 shows the effect of the Hiragana Yahoo! data, that is, the differences in accuracy according to the text genre. The blue line shows its effect when the base model is the Kanji-Kana Wiki+ Hiragana Wiki model, and the orange line shows its effect when the base model is the Hiragana Wiki model. This figure shows that the effects are almost the same even though the base models are different.
Figure 2. Effect of Hiragana Yahoo! data
The Hiragana Yahoo! data were considerably effective for “Yahoo! Answers,” “Books,” and “National Diet Minutes.” The Hiragana Yahoo! data are text from a question-answering site; therefore, their writing style is close to spoken language. We believe they are effective for “Books” and “National Diet Minutes” because the Hiragana “Books” data include dialogues and the National Diet Minutes are transcriptions of Diet discussions. In particular, Wikipedia data rarely include question sentences, whereas question-answering sites contain many. Therefore, the Hiragana Yahoo! data are effective for “National Diet Minutes,” which include many questions.
Figure 3 shows the effect of the Kanji-Kana Wiki data, that is, the differences in accuracy according to the text genre. The blue line shows its effect when the base model is the Hiragana Wiki model, and the orange line shows its effect when the base model is the Hiragana Wiki+ Hiragana Yahoo! model. In contrast to Figure 2, the effects of the data are almost opposite depending on the base model. The blue line is similar to the lines in Figure 2, but the orange line is very different from them. The orange line, which shows the effect of the Kanji-Kana Wiki data when added to the Hiragana Wiki+ Hiragana Yahoo! model, indicates that the data are effective for “PR Documents,” “Reports,” and “Yahoo! Blogs.” The Kanji-Kana Wiki data tend to be effective when fine-tuning with the Hiragana Yahoo! data is not effective. These facts indicate that fine-tuning is effective when the original accuracy is poor, regardless of whether the Kanji-Kana Wiki text or the Hiragana Yahoo! data are used.
Figure 3. Effect of Kanji-Kana Wiki data
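The "effect" plotted in Figures 2 and 3 is simply the per-genre difference in accuracy between a fine-tuned model and its base model. The following minimal sketch shows this computation, assuming hypothetical per-genre accuracy values rather than the paper's results.

```python
# Minimal sketch: per-genre "effect" of adding a dataset, as in Figures 2 and 3.
# The accuracy values are hypothetical placeholders, not the paper's results.
base_acc = {"Books": 0.55, "National Diet Minutes": 0.52, "Newspapers": 0.68}
tuned_acc = {"Books": 0.63, "National Diet Minutes": 0.60, "Newspapers": 0.67}

# Effect = accuracy after further fine-tuning minus accuracy of the base model.
effect = {genre: tuned_acc[genre] - base_acc[genre] for genre in base_acc}

for genre, delta in sorted(effect.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genre}: {delta:+.3f}")
```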
Finally, the accuracy of the Bi-LSTM CRF model still has room for improvement compared to existing morphological analyzers for Kanji-Kana mixed text. We think that the use of a dictionary and much more training data should be attempted in the future.
7. CONCLUSIONS
This study developed a morphological analysis model for Japanese Hiragana sentences
using the Bi-LSTM CRF model. Through experiments using Hiragana data from Wikipedia and Yahoo! Answers, we showed that the performance of morphological analysis of Hiragana sentences improves when manually annotated in-domain data are used for fine-tuning. We also showed that the performance of Hiragana morphological analysis improves when the model for Hiragana sentences is fine-tuned from a model trained on Kanji-Kana mixed sentences. This fine-tuning is effective when the original accuracy is low. Additionally, we showed that the performance of morphological analysis varies according to the text genre. In the future, we would like to increase the amount of training data to improve the accuracy of the model for analyzing Hiragana sentences. We would also like to make the system capable of outputting word readings, pronunciations, and part-of-speech classifications, which the current system cannot output. Because the input consists entirely of Hiragana, word sense ambiguity is very likely to occur, so solving this problem will be important for continuing this research.
In this experiment, we used Yahoo! Answers as training data for the Kanji-Kana Wiki+ Hiragana Wiki+ Hiragana Yahoo! and Hiragana Yahoo! models to measure accuracy, but we would like to train the models using other data to analyze the effect of genre differences. Furthermore, we would like to investigate effective domain adaptation methods for cases where the original accuracy is not poor.