ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
Source and target word segmentation and alignment is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for handling out-of-vocabulary words in English-Chinese transliteration in the presence of multiple target dialects, we asked whether the same would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages): our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
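The syllable-like units described above can be illustrated with a minimal sketch. This is not the paper's segmentation algorithm (which is more elaborate and language-aware); it is just a greedy grouping of each maximal consonant cluster with the vowel run that follows it, to show what "sub-syllable-like units" of an English word might look like:

```python
def syllable_units(word):
    """Greedily split a word into syllable-like consonant-vowel units.

    Illustrative only: each unit is a maximal consonant cluster plus
    the vowel run that follows it; 'y' is treated as a consonant.
    """
    vowels = set("aeiou")
    units, current = [], ""
    seen_vowel = False
    for ch in word.lower():
        is_vowel = ch in vowels
        if seen_vowel and not is_vowel:
            # a consonant after a vowel run starts the next unit
            units.append(current)
            current, seen_vowel = "", False
        current += ch
        seen_vowel = seen_vowel or is_vowel
    if current:
        units.append(current)
    return units

print(syllable_units("hyderabad"))  # ['hyde', 'ra', 'ba', 'd']
```

A real system would align such source units with target-script units in the training pairs before fitting the statistical model.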
Improvement in Quality of Speech associated with Braille codes - A Review
J. Anurag, P. Nupur and Agrawal, S.S.
School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India
Centre for Development of Advanced Computing, Noida, India
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning
Natural Language Generation (NLG) is one of the major fields of Natural Language Processing (NLP); NLG produces natural language from a machine representation. Generating suggestions for a sentence is especially difficult for Indian languages: they are morphologically rich, and their word order is roughly the reverse of English. Using a deep learning approach with Long Short-Term Memory (LSTM) layers, we can generate a set of candidate corrections for the erroneous part of a sentence. To generate a set of sentences with meaning equivalent to the original sentence using a deep learning approach, a model must be trained on this task, i.e. we need thousands of input-output example pairs. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabilistic Methods - Rommel Carvalho
Presentation given by Saminda Abeyruwan at the 6th Uncertainty Reasoning for the Semantic Web Workshop at the 9th International Semantic Web Conference on November 7, 2010.
Paper: PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabilistic Methods
Abstract: Formalizing an ontology for a domain manually is well known to be a tedious and cumbersome process, constrained by the knowledge acquisition bottleneck. Therefore, researchers have developed algorithms and systems that help to automate the process, among them systems that use text corpora for the acquisition. Our idea is also based on a vast amount of text corpora. Here, we provide a novel unsupervised bottom-up ontology generation method, based on lexico-semantic structures and Bayesian reasoning, to expedite the ontology generation process. We provide one quantitative and two qualitative results illustrating our approach, using a high-throughput screening assay corpus and two custom text corpora. This process could also provide evidence for domain experts building ontologies with top-down approaches.
Phonetic Recognition In Words For Persian Text To Speech Systems
Abstract: Interest in text-to-speech synthesis has increased worldwide. Text-to-speech systems have been developed for many popular languages such as English, Spanish and French, and much research and development has been applied to those languages. Persian, on the other hand, has been given little attention compared to other languages of similar importance, and research on Persian is still in its infancy. The Persian language has many difficulties and exceptions that increase the complexity of text-to-speech systems: for example, short vowels are absent in written text, and homograph words exist. In this paper we propose a new method for Persian text-to-phonetic conversion based on pronunciation by analogy in words, semantic relations and grammatical rules for finding the proper phonetic form.
Keywords: PbA, text to speech, Persian language, phonetic recognition.
Title:Phonetic Recognition In Words For Persian Text To Speech Systems
Author:Ahmad Musavi Nasab, Ali Joharpour
International Journal of Recent Research in Mathematics, Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Abstract: In text-to-speech synthesis, words are usually broken into parts, and the recorded sound of each part is used to play the word. This paper uses silence in a word's pronunciation to improve the quality of the synthesized speech. Most algorithms divide words into syllables, and some divide words into phonemes; this paper instead exploits silences in the intonation, dividing words at silent regions and then assigning the equivalent sound of each part, so that joining the parts is reliable and the speech quality is smoother. The paper concerns the Persian language, but the method is extendable to other languages. The method has been tested with a MOS test, and intelligibility, naturalness and fluidity are improved.
Keywords: TTS, SBS, Syllable, Diphone.
Taking into account communities of practice’s specific vocabularies in inform...
L. Damas and C. Million-Rousseau
Condillac Group, LISTIC, Université de Savoie, 73370 Le Bourget du Lac, France
Ontologos Corp. 6, route de Nanfray, 74000 Cran-Gevrier, France
French machine reading for question answering - Ali Kabbadj
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural language texts. This opens the way for a machine to find, for a given question, a precise answer buried in a mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now these techniques could not actually be used for French-text question answering (Q&A) applications, since there was no large Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, together with GloVe French word and character embedding vectors built from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with an F1 score of around 70%.
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval, modifying our previous TF-ISF method, which uses local context information, to take into account only document-level information. This is quite a new approach to sentence retrieval, presented for the first time in this paper, and it is compared to existing methods that use information from the whole document collection. Using this approach and the developed methods for sentence retrieval at the document level, it is possible to assess the relevance of a sentence using only the information from the retrieved sentence’s document, and to define a document-level OWL representation for sentence retrieval that can be automatically populated. In this way the idea of the Semantic Web is supported through automatic and semi-automatic extraction of additional information from existing web resources; the additional information is formatted in an OWL document containing document-sentence relevance for sentence retrieval.
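The document-level scoring idea can be sketched as follows. This is an illustrative reading of TF-ISF (term frequency times inverse sentence frequency), with sentence frequency counted only within one document, as in the document-level approach; the tokenisation and exact weighting formula are assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def tf_isf_scores(sentences, query):
    """Score each sentence of ONE document against a query with TF-ISF.

    Sentence frequency is counted only within this document, so no
    collection-level statistics are needed.  Illustrative sketch.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # number of sentences in this document containing each term
    sent_freq = Counter()
    for toks in tokenized:
        for term in set(toks):
            sent_freq[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(
            tf[t] * math.log(1 + n / sent_freq[t])
            for t in query.lower().split() if t in tf
        )
        scores.append(score)
    return scores

doc = ["ontology learning builds ontologies from text",
       "sentence retrieval finds relevant sentences",
       "we combine ontology learning with sentence retrieval"]
print(tf_isf_scores(doc, "sentence retrieval"))
```

Sentences sharing query terms score above zero while the first sentence scores zero, which is the behaviour a retrieval ranking needs; the per-sentence relevance values are what the OWL representation would record.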
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance in capturing word similarities with existing benchmark datasets of word-pair similarities. We also conduct a correlation analysis between ground-truth word similarities and the similarities obtained by the different word embedding methods.
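The correlation analysis described above typically works as follows: for each benchmark word pair, compare the human similarity rating with the cosine similarity of the two embedding vectors, then take a rank correlation over all pairs. A minimal sketch with invented 3-dimensional "embeddings" and hypothetical ratings (real evaluations use benchmarks such as WordSim-353 and embeddings of hundreds of dimensions):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# toy embeddings and benchmark ratings -- both made up
emb = {"cat": [1.0, 0.2, 0.0], "dog": [0.9, 0.3, 0.1],
       "car": [0.0, 1.0, 0.8], "truck": [0.1, 0.9, 0.9]}
pairs = [("cat", "dog"), ("car", "truck"), ("cat", "car"), ("dog", "truck")]
human = [9.0, 8.5, 1.5, 2.0]          # hypothetical ground-truth similarities
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(round(spearman(human, model), 3))  # 0.8
```

A higher rank correlation means the embedding space orders word pairs more like the human judgements, which is the intrinsic quality signal the paper measures.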
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it improves the overall performance of the application. In this paper, deep learning techniques are used to perform NER on Hindi text, since Hindi NER has received far less attention than English NER; this is a barrier for resource-scarce languages, for which many resources are not readily available. Researchers have used various techniques, such as rule-based, machine-learning-based and hybrid approaches, to solve this problem; deep-learning-based algorithms are now being developed at large scale as an innovative approach to advanced NER models. Here we devise a novel architecture based on a residual network over a Bidirectional Long Short-Term Memory (BiLSTM) with fastText word embedding layers, using pre-trained word embeddings to represent the words in the corpus, with NER tags taken from the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. We ran experiments comparing NER results with normal embedding layers and with fastText embedding layers, and analysed the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results, measured by F1 score, with the proposed approach.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
This paper describes the use of Naive Bayes to address the task of assigning function tags, and a context-free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for Myanmar comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use a functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply the context-free grammar (CFG) to find the parse tree of a function-tagged Myanmar sentence. Experiments show that our analysis achieves a good result in parsing simple sentences and three types of complex sentences.
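The Naive Bayes function-tag disambiguation step can be sketched on a toy annotated corpus. The data, tag names and the two features (word form and POS) used here are invented for illustration; the paper's corpus, features and smoothing may differ:

```python
from collections import Counter, defaultdict

def train(annotated):
    """annotated: list of (word, pos, function_tag) triples from a
    functionally annotated corpus (toy English stand-in data here)."""
    tag_count = Counter()
    feat_count = defaultdict(Counter)   # feat_count[tag][(kind, value)]
    for word, pos, tag in annotated:
        tag_count[tag] += 1
        feat_count[tag][("word", word)] += 1
        feat_count[tag][("pos", pos)] += 1
    return tag_count, feat_count

def best_tag(word, pos, tag_count, feat_count):
    """Naive Bayes: argmax over P(tag) * P(word|tag) * P(pos|tag),
    with add-one smoothing on the feature likelihoods."""
    total = sum(tag_count.values())
    best, best_p = None, -1.0
    for tag, c in tag_count.items():
        p = c / total
        for feat in (("word", word), ("pos", pos)):
            p *= (feat_count[tag][feat] + 1) / (c + len(feat_count[tag]))
        if p > best_p:
            best, best_p = tag, p
    return best

corpus = [("school", "noun", "subject"), ("school", "noun", "object"),
          ("goes", "verb", "predicate"), ("he", "pron", "subject"),
          ("reads", "verb", "predicate")]
tc, fc = train(corpus)
print(best_tag("goes", "verb", tc, fc))  # predicate
```

Once every word carries its most probable function tag, the CFG rules can be applied over the tag sequence to build the parse tree.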
Duration for Classification and Regression Tree for Marathi Text-to-Speech Syn...
This research paper reports preliminary results of data-driven modeling of segmental phoneme duration for Marathi. Classification-and-regression-tree-based data-driven duration modeling for segmental duration prediction is presented. A number of features are considered, and their usefulness and relative contribution to segmental duration prediction are assessed. Objective evaluation of the duration model is performed using the root mean squared prediction error and the correlation between actual and predicted durations.
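The two objective measures named above (RMSE and the correlation between actual and predicted durations) can be sketched as follows; the duration values are hypothetical and the regression tree itself is omitted:

```python
import math

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson(xs, ys):
    """Pearson correlation between actual and predicted values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical segmental phoneme durations in milliseconds
actual    = [80.0, 120.0, 65.0, 140.0, 95.0]
predicted = [85.0, 110.0, 70.0, 150.0, 90.0]
print(round(rmse(actual, predicted), 2), round(pearson(actual, predicted), 3))
```

A lower RMSE and a correlation closer to 1 both indicate that the duration model tracks the actual segment durations more closely.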
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
This paper describes a context-free grammar (CFG) based analysis of grammatical relations for Myanmar sentences, combined with a corpus-based function tagging system. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for showing the grammatical relations of Myanmar sentences. In the function tagging task, which tags the functions of a Myanmar sentence given correct segmentation, POS (part-of-speech) tagging and chunking information, we use Naive Bayesian theory to disambiguate the possible function tags of a word. We apply the context-free grammar (CFG) to find the grammatical relations of the function tags. We also create a functionally annotated tagged corpus for Myanmar and propose grammar rules for Myanmar sentences. Experiments show that our analysis achieves a good result with simple sentences and complex sentences.
Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar - Normunds Grūzītis
We present a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generating the grammar based on semantico-syntactic valence patterns extracted from FrameNet-annotated corpora. Second, it provides a proof of concept for two use cases illustrating how the acquired multilingual grammar can be exploited in different CNL applications in the domains of arts and tourism.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
Text analysis has been attracting increasing attention in this data era. Selecting effective features from
datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant
features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy
and performance of identification. Many feature selection methods have been proposed; however,
it remains unclear which method is the most effective in practice. This article evaluates and compares
the available feature selection methods for general versatility on authorship attribution problems and
tries to identify which method is the most effective. We discuss the general versatility of feature selection
methods and their connection to selecting appropriate features for varying data. In addition, different
languages, different types of features, different systems for calculating the accuracy of an SVM (support
vector machine), and different criteria for ranking feature selection methods were used to measure the
general versatility of these methods. The results indicate that the best feature selection method differs
for each dataset; however, some methods can always extract useful information to discriminate the
classes, and chi-square proved to be the best method overall.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are among the most widely used techniques in the field of computer applications, and the area has grown vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and everything is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was tested with the Turing test to check similarity between data, though that was limited to small datasets. Later, various algorithms were developed, along with concepts from AI (Artificial Intelligence), for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of these techniques on different parameters.
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
These are the slides for the technical briefing at ICSE 2021, given by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
The quality of end-to-end speech translation models depends on large-scale speech-to-text training data,
which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we
propose a target-side data augmentation method for low-resource speech translation. In particular,
we first generate large-scale target-side paraphrases based on a paraphrase generation model
that incorporates several statistical machine translation (SMT) features and a commonly used
recurrent neural network (RNN). Then, a filtering model based on semantic similarity and
word-pair co-occurrence is proposed to select the highest-scoring paraphrase pairs from the
candidates. Experimental results on English, Arabic, German, Latvian, Estonian,
Slovenian, and Swedish paraphrase generation show that the proposed method achieves significant
and consistent improvements over several strong baseline models on the PPDB (http://paraphrase.
org/) datasets. To introduce the paraphrase generation results into low-resource speech translation,
we propose two strategies: audio-text pair recombination and multi-reference training. Experimental
results show that speech translation models trained on new audio-text datasets that incorporate
the paraphrase generation results achieve substantial improvements over the baselines, especially for
low-resource languages.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
The internet has caused a humongous growth in the amount of data available to the common
man. Summaries of documents can help find the right information and are particularly effective
when the document base is very large. Keywords are closely associated to a document as they
reflect the document's content and act as indexes for the given document. In this work, we
present a method to produce extractive summaries of documents in the Kannada language. The
algorithm extracts key words from pre-categorized Kannada documents collected from online
resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse
Document Frequency) methods along with TF (Term Frequency) for extracting key words and
later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
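As a rough illustration of the scoring the abstract describes, the sketch below combines TF, IDF, and the GSS coefficient into a single keyword score. The exact weighting and smoothing used in the paper are not given, so the multiplicative combination, the function name, and the toy document sets are assumptions:

```python
import math
from collections import Counter

def keyword_scores(docs_in_category, docs_other, doc_tokens):
    """Score each term of a document by TF * IDF * GSS.

    docs_in_category / docs_other: lists of token sets for documents
    inside and outside the target category; doc_tokens: token list of
    the document to summarize.
    """
    all_docs = docs_in_category + docs_other
    n = len(all_docs)
    tf = Counter(doc_tokens)
    scores = {}
    for t in tf:
        df = max(sum(t in d for d in all_docs), 1)
        idf = math.log(n / df)  # inverse document frequency
        # GSS coefficient: P(t,c)P(~t,~c) - P(t,~c)P(~t,c)
        p_tc = sum(t in d for d in docs_in_category) / n
        p_tnc = sum(t in d for d in docs_other) / n
        p_ntc = sum(t not in d for d in docs_in_category) / n
        p_ntnc = sum(t not in d for d in docs_other) / n
        gss = p_tc * p_ntnc - p_tnc * p_ntc
        scores[t] = tf[t] * idf * gss
    return scores

# Toy example: "power" is frequent in the document and discriminative
# for the category, so it outranks "grid".
cat_docs = [{"grid", "power"}, {"power", "flow"}]
other_docs = [{"cat", "dog"}, {"dog", "fish"}]
scores = keyword_scores(cat_docs, other_docs, ["power", "power", "grid"])
assert scores["power"] > scores["grid"] > 0
```

The top-scoring terms would then serve as keywords, and sentences containing them would be ranked for the extractive summary.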
Similar to Word Segmentation and Lexical Normalization for Unsegmented Languages (20)
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I wondered, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure-operations view. Is it possible to apply our lovely cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into approaches I have already gotten working for real.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Word Segmentation and Lexical Normalization for Unsegmented Languages
1. Word Segmentation and
Lexical Normalization for
Unsegmented Languages
Doctoral Defense
December 16, 2021.
Shohei Higashiyama
NLP Lab, Division of Information Science, NAIST
2. These slides are a slightly modified version of those used for the author's
doctoral defense at NAIST on December 16, 2021.
The major contents were taken from the following papers.
• [Study 1] Higashiyama et al., “Incorporating Word Attention into Character-Based Word
Segmentation”, NAACL-HLT, 2019
https://www.aclweb.org/anthology/N19-1276
• [Study 1] Higashiyama et al., “Character-to-Word Attention for Word Segmentation”,
Journal of Natural Language Processing, 2020 (Paper Award)
https://www.jstage.jst.go.jp/article/jnlp/27/3/27_499/_article/-char/en
• [Study 2] Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain Word
Segmentation”, Journal of Natural Language Processing, 2020
https://www.jstage.jst.go.jp/article/jnlp/27/3/27_573/_article/-char/en
• [Study 3] Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese
Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021
https://www.aclweb.org/anthology/2021.naacl-main.438/
• [Study 4] Higashiyama et al., “A Text Editing Approach to Joint Japanese Word
Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021 (Best Paper Award)
https://aclanthology.org/2021.wnut-1.9/
3. Overview
◆Research theme
- Word Segmentation (WS) and
Lexical Normalization (LN) for Unsegmented Languages
◆Studies in this dissertation
[Study 1] Japanese/Chinese WS for general domains
[Study 2] Japanese/Chinese WS for specialized domains
[Study 3] Construction of Japanese user-generated text (UGT)
corpus for WS and LN
[Study 4] Japanese WS and LN for UGT domains
◆Structure of this presentation
- Background → Detail on each study → Conclusion
4. Background (1/4)
◆Segmentation/Tokenization
- The (almost always) necessary first step of NLP, which segments a sentence into tokens
◆Word
- Human-understandable unit
- Processing unit of traditional NLP
- Mandatory unit for linguistic analysis (e.g., parsing and predicate-argument structure (PAS) analysis)
- Useful as a feature, or as an intermediate unit for subwords, in application-oriented tasks (e.g., NER and MT)
Example: ニューラルネットワークによる自然言語処理 'Natural language processing based on neural networks'
- Char: ニ,ュ,ー,ラ,ル,ネ,ッ,ト,ワ,ー,ク,に,よ,る,自,然,言,語,処,理
- Subword: ニュー,ラル,ネット,ワーク,による,自然,言語,処理
- Word: ニューラル,ネットワーク,に,よる,自然,言語,処理
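The three granularities can be made concrete with a short Python sketch. The word and subword splits below are the slide's own examples, hand-coded rather than produced by a trained tokenizer:

```python
# Segmentation granularities for the unsegmented Japanese sentence
# used on the slide (word/subword splits copied from the slide).
sentence = "ニューラルネットワークによる自然言語処理"

chars = list(sentence)  # character units
words = ["ニューラル", "ネットワーク", "に", "よる", "自然", "言語", "処理"]
subwords = ["ニュー", "ラル", "ネット", "ワーク", "による", "自然", "言語", "処理"]

# All three tokenizations cover the same surface string.
assert "".join(chars) == sentence
assert "".join(words) == sentence
assert "".join(subwords) == sentence
assert len(chars) == 20 and len(words) == 7 and len(subwords) == 8
```

Word segmentation is the task of recovering the word-level split from the raw character string.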
5. Background (2/4)
◆Word Segmentation (WS) in unsegmented languages
- Task to segment sentences into words
using annotated data based on a segmentation standard
- Nontrivial task because of the ambiguity problem
- Segmentation accuracy degrades in domains w/o sufficient
labeled data mainly due to the unknown word problem.
Research issue 1
- How to achieve high accuracy in various text domains,
including those w/o labeled data
Example: 彼は日本人だ → 彼 | は | 日本 | 人 | だ 'He is Japanese.'
The ambiguity arises because overlapping lexicon entries such as 日本 'Japan', 本 'book', and 本人 'the person' license competing analyses.
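A brute-force enumeration makes the ambiguity concrete. With a small hand-made lexicon (an assumption for illustration), the sentence admits several analyses besides the gold one:

```python
def segmentations(s, lexicon, max_len=3):
    """Enumerate every way to split s into lexicon words (up to max_len chars)."""
    if not s:
        return [[]]
    results = []
    for k in range(1, min(max_len, len(s)) + 1):
        head = s[:k]
        if head in lexicon:
            results += [[head] + rest for rest in segmentations(s[k:], lexicon, max_len)]
    return results

# Hypothetical lexicon containing the slide's overlapping entries.
lex = {"彼", "は", "日", "本", "人", "だ", "日本", "本人", "日本人"}
segs = segmentations("彼は日本人だ", lex)
assert ["彼", "は", "日本", "人", "だ"] in segs  # the gold analysis
assert len(segs) == 4                            # three competing analyses also exist
```

A segmenter must pick the correct analysis among these, which is what makes the task nontrivial.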
6. Background (3/4)
◆Effective WS approaches for different domain types
- [Study 1] General domains:
Use of labeled data (and other resources)
- [Study 2] Specialized domains:
Use of general domain labeled data and target domain resources
- [Study 3&4] User-generated text (UGT):
Handling nonstandard words → Lexical normalization
Domain type      | Example              | Labeled data | Unlabeled data | Lexicon | Other characteristics
General dom.     | News                 | ✓            | ✓              | ✓       |
Specialized dom. | Scientific documents | ✕            | ✓              | △       |
UGT dom.         | Social media         | ✕            | ✓              | △       | Nonstandard words
(✓: available, △: sometimes available, ✕: almost unavailable)
7. Background (4/4)
◆The frequent use of nonstandard words in UGT
- Examples: オハヨー ohayoo ‘good morning’ (おはよう)
すっっげええ suggee ‘awesome’ (すごい)
- Achieving accurate WS and downstream processing is difficult.
◆Lexical Normalization (LN)
- Task to transform nonstandard words into standard forms
- Main problem: the lack of public labeled data for evaluating
and training Japanese LN models
Research issue 2
- How to train/evaluate WS and LN models for Japanese UGT
under the low-resource situation
Example: raw input 日本語 まぢ ムズカシイ vs. normalized 日本 語 まじ 難しい/むずかしい
Online translators A and B render the raw input roughly as 'Japanese Majimu Zukashii' / 'Japanese Majimuzukashii', while the normalized input yields 'Japanese is really difficult' / 'Japanese is difficult'.
8. Contributions of This Dissertation (1/2)
1. How to achieve accurate WS in various text domains
- We proposed an effective approach for each of the three domain types.
➢Our methods can be effective options to achieve
accurate WS and downstream tasks in these domains.
- General domains: [Study 1] Neural model combining character and word features
- Specialized domains: [Study 2] Auxiliary prediction task based on unlabeled data and a lexicon
- UGT domains: [Study 4] Joint prediction of WS and LN
9. Contributions of This Dissertation (2/2)
2. How to train/evaluate WS and LN models for Japanese UGT
- We constructed manually and automatically annotated corpora.
➢Our evaluation corpus can be a useful benchmark to compare
and analyze existing and future systems.
➢Our LN method can be a good baseline for developing more
practical Japanese LN methods in the future.
- UGT domains: [Study 3] Evaluation corpus annotation; [Study 4] Pseudo-training data generation
10. Overview of Studies in This Dissertation
◆I focused on improvements of WS and LN accuracy for each domain type.
Two axes organize the studies: a prerequisite axis (corpus annotation for fair evaluation) and a performance axis (development of more accurate models; development of faster models).
- Corpus annotation for fair evaluation: Study 3
- Development of more accurate models: Studies 1, 2, and 4, covering general, specialized, and UGT domains
11. Study 1: Word Segmentation for
General Domains
Higashiyama et al., “Incorporating Word Attention into Character-Based
Word Segmentation”, NAACL-HLT, 2019
Higashiyama et al., “Character-to-Word Attention for Word Segmentation”,
Journal of Natural Language Processing, 2020 (Paper Award)
12. Study 1: WS for General Domains
◆Goal: Achieve more accurate WS in general domains
◆Background
- Limited effort has been devoted to leveraging
complementary char and word information for neural WS.
◆Our contributions
- Proposed a char-based model incorporating word information
- Achieved performance better than or competitive to existing
SOTA models on Japanese and Chinese datasets
Example: テキストの分割 → テキスト | の | 分割
- Char-based: assigns a segmentation label to each character (テ キ ス ト の 分 割); enables efficient prediction via first-order sequence labeling
- Word-based: predicts words directly (テキスト, の, 分割); enables easy use of word-level information
13. [Study 1] Proposed Model Architecture
◆Char-based model with char-to-word attention
to learn the importance of candidate words
[Architecture] Input sentence 彼 は 日 本 人 → character embedding lookup → BiLSTM → char context vectors h_i → attention over candidate word embeddings e^w_j looked up from the word vocabulary (e.g., 日本 'Japan', 本 'book', 本人 'the person') → word summary vectors a_i → BiLSTM → CRF → segmentation labels (S S B E S).
14. [Study 1] Character-to-Word Attention
Input sentence: 彼 は 日 本 人 だ 。
Candidate words (max word length = 4) looked up from the word vocabulary: 本, 日本, 本人, は日本, 日本人, 本人だ, …
Each char context vector h_i attends to the candidate word embeddings e^w_j:
  α_ij = exp(h_i^T W e^w_j) / Σ_k exp(h_i^T W e^w_k)
The attended embeddings are aggregated into a word summary vector a_i by either WAVG (weighted average) or WCON (weighted concatenation).
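The attention step can be sketched in a few lines of NumPy. Toy dimensions and random stand-ins are used for the trained parameter W and the embeddings, and only the WAVG aggregation is shown:

```python
import numpy as np

def char_to_word_attention(h, E, W):
    """alpha_j = softmax_j(h^T W e_j); WAVG summary a = sum_j alpha_j e_j.

    h: char context vector (d_c,), E: candidate word embeddings (J, d_w),
    W: bilinear parameter (d_c, d_w).
    """
    scores = E @ (h @ W)   # (J,) raw scores h^T W e_j
    scores -= scores.max()  # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    a = alpha @ E           # weighted average of candidate word embeddings
    return alpha, a

rng = np.random.default_rng(0)
d_c, d_w, J = 8, 6, 3  # toy sizes: char ctx dim, word emb dim, #candidates
alpha, a = char_to_word_attention(rng.normal(size=d_c),
                                  rng.normal(size=(J, d_w)),
                                  rng.normal(size=(d_c, d_w)))
assert np.isclose(alpha.sum(), 1.0)
assert a.shape == (d_w,)
```

WCON would instead concatenate the weighted candidate embeddings, preserving word-length and character-position information.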
16. [Study 1] Experimental Datasets
◆Training/Test data
- Chinese: 2 source domains
- Japanese: 4 source domains and 7 target domains
◆Unlabeled text for pre-training word embeddings
- Chinese: 48M sentences in Chinese Gigaword 5
- Japanese: 5.9M sentences in BCCWJ non-core data
17. [Study 1] Experimental Settings
◆Hyperparameters
- num_BiLSTM_layers=2 or 3, num_BiLSTM_units=600,
char/word_emb_dim=300, min_word_freq=5, max_word_length=4, etc.
◆Evaluation
1. Comparison of baseline and proposed model variants
(and analysis on model size)
2. Comparison with existing methods on in-domain
and cross-domain datasets
3. Effect of semi-supervised learning
4. Effect of word frequency and length
5. Effect of attention for segmentation performance
6. Effect of additional word embeddings from target domains
7. Analysis of segmentation examples
18. [Study 1] Exp 1. Comparison of Model Variants
◆F1 on development sets (mean of three runs)
- Word-integrated models outperformed BASE by up to 1.0
(significant in 20 of 24 cases).
- Attention-based models outperformed non-attention
counterparts in 10 of 12 cases (significant in 4 cases).
- WCON achieved the best performance,
which may be because of word length and char position info.
† significant at the 0.01 level over the baseline
‡ significant at the 0.01 level over the variant w/o attention
(Results table: BiLSTM-CRF baseline vs. attention-based variants.)
19. [Study 1] Exp 2. Comparison with Existing Methods
◆F1 on test sets (mean of three runs)
- WCON achieved performance better than or competitive with
existing methods.
(More recent work achieved further improvements on Chinese datasets.)
20. [Study 1] Exp 5. Effect of Attention for Segmentation
[Diagram: simulated attention over candidate words (本, 日本, 本人); with p ~ Uniform(0,1), a weight of 0.8 is placed on the correct candidate if p ≥ p_t and on an incorrect one if p < p_t, with 0.1 on the others.]
◆Character-level accuracy on BCCWJ-dev
(most frequent cases where both correct and incorrect candidate words
exist for a character)
- Segmentation label accuracy: 99.54%
- Attention accuracy for proper words: 93.25%
◆Segmentation accuracy of the trained model increased
for larger "correct attention probability" p_t
21. [Study 1] Conclusion
- We proposed a neural word segmenter with attention,
which incorporates word information into
a character-level sequence labeling framework.
- Our experiments showed that
• the proposed method, WCON, achieved performance better than or
competitive with existing methods, and
• learning appropriate attention weights contributed to accurate
segmentation.
22. Study 2: Word Segmentation for
Specialized Domains
Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain Word
Segmentation”, Journal of Natural Language Processing, 2020
23. Study 2: WS for Specialized Domains
◆Goal
- Improve WS performance for specialized domains
where labeled data is unavailable
➢Our focus: how to use linguistic resources in target domain
◆Our contributions
- Proposed a WS method to learn signals of word occurrences
from unlabeled sentences and a lexicon (in target domain)
- Our method improved performance for various Chinese and
Japanese target domains.
Domain type         | Labeled data | Unlabeled data | Lexicon
Specialized domains | ✕            | ✓              | △
(✓: available, △: sometimes available, ✕: almost unavailable)
24. [Study 2] Cross-Domain WS with Linguistic Resources
◆Methods for cross-domain WS
- Modeling lexical information: lexicon features; neural representation learning ((Liu+ 2019), (Gan+ 2019), Ours)
- Modeling statistical information: generating pseudo-labeled data ((Neubig+ 2011), (Zhang+ 2018))
◆Our model
- To overcome the limitation of lexicon features,
we model lexical information via an auxiliary task for neural models.
➢Assumed resource setting:
  Domain | Labeled data | Unlabeled data | Lexicon
  Source | ✓            | ✓              | ✓
  Target | ✕            | ✓              | ✓
26. [Study 2] Our Lexicon Word Prediction
- We introduce auxiliary tasks to predict whether each character
corresponds to specific positions in lexical words.
- The model learns parameters also from target unlabeled sentences.
Source sentence example: 週末の外出自粛要請 'self-restraint request on the weekend'
- Segmentation labels (predicted/learned): B E S B E B E B E
- Lexicon words matched: {週, 末, の, 外, 出, 自, 粛, 要, 請, 週末, 外出, 出自, 自粛, 要請, …}
- Generated auxiliary labels (predicted/learned), one binary sequence per position:
  [B] 1 0 0 1 1 1 0 1 0
  [I] 0 0 0 0 0 0 0 0 0
  [E] 0 1 0 0 1 1 1 0 1
  [S] 1 1 1 1 1 1 1 1 1
Target sentence example: 長短期記憶ネットワーク 'long short-term memory (network)'
- Lexicon words matched: {長, 短, 期, 記, 憶, 長短, 短期, 記憶, ネット, ワーク, ネットワーク, …}
- Generated auxiliary labels (predicted/learned):
  [B] 1 1 0 1 0 1 0 0 1 0 0
  [I] 0 0 0 0 0 0 1 1 1 1 0
  [E] 0 1 1 0 1 0 0 1 0 0 1
  [S] 1 1 1 1 1 0 0 0 0 0 0
Segmentation labels are predicted/learned only for source sentences; auxiliary labels are predicted/learned for both source and target sentences.
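Both label types in the example above can be generated by simple substring matching, sketched below (details such as the maximum match length are assumptions; the actual system may differ):

```python
def seg_labels(words):
    """Per-character segmentation labels: B(egin)/I(nside)/E(nd)/S(ingle)."""
    out = []
    for w in words:
        out.append("S" if len(w) == 1 else "B" + "I" * (len(w) - 2) + "E")
    return "".join(out)

def lexicon_labels(sent, lexicon, max_len=6):
    """Binary B/I/E/S indicators: does each character occupy that
    position in some lexicon word occurring in the sentence?"""
    lab = {k: [0] * len(sent) for k in "BIES"}
    for i in range(len(sent)):
        for j in range(i + 1, min(i + max_len, len(sent)) + 1):
            if sent[i:j] not in lexicon:
                continue
            if j - i == 1:
                lab["S"][i] = 1
            else:
                lab["B"][i] = 1
                lab["E"][j - 1] = 1
                for k in range(i + 1, j - 1):
                    lab["I"][k] = 1
    return {k: "".join(map(str, v)) for k, v in lab.items()}

# Reproduce the slide's source-sentence example.
assert seg_labels(["週末", "の", "外出", "自粛", "要請"]) == "BESBEBEBE"

lex = {"週", "末", "の", "外", "出", "自", "粛", "要", "請",
       "週末", "外出", "出自", "自粛", "要請"}
lab = lexicon_labels("週末の外出自粛要請", lex)
assert (lab["B"], lab["I"], lab["E"], lab["S"]) == (
    "100111010", "000000000", "010011101", "111111111")
```

Since the auxiliary labels require only a lexicon and raw text, they can be generated for unlabeled target-domain sentences as well.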
27. [Study 2] Methods and Experimental Data
◆Linguistic resources for training
- Source domain labeled data
- General and domain-specific unlabeled data
- Lexicon: UniDic (JA) or Jieba (ZH) and
semi-automatically constructed domain-specific lexicons
(390K-570K source words & 0-134K target words)
◆Methods
- Baselines: BiLSTM (BASE), BASE + self-training (ST), and
BASE + lexicon feature (LF)
- Proposed: BASE + MLPs for Segmentation and auxiliary LWP tasks
JNL: CS Journal; JPT, CPT: Patent; RCP: Recipe; C-ZX, P-ZX, FR, DL: Novel; DM: Medical
28. [Study 2] Experimental Settings
◆Hyperparameters
- num_BiLSTM_layers=2, num_BiLSTM_units=600, char_emb_dim=300,
num_MLP_units=300, min_word_len=1, max_word_len=4, etc.
◆Evaluation
1. In-domain results
2. Cross-domain results
3. Comparison with SOTA methods
4. Influence of weight for auxiliary loss
5. Results for non-adapted domains
6. Performance of unknown words
29. [Study 2] Exp 2. Cross-Domain Results
◆F1 on test sets (mean of three runs)
- LWP-S (source) outperformed BASE and ST.
- LWP-T (target) significantly outperformed the three baselines.
(+3.2 over BASE, +3.0 over ST, +1.2 over LF on average)
- Results of LWP-O (oracle) using gold test words
indicates more improvements by higher-coverage lexicons.
(Results tables shown separately for Japanese and Chinese.)
★ significant at the 0.001 level over BASE
† significant at the 0.001 level over ST
‡ significant at the 0.001 level over LF
30. [Study 2] Exp 3. Comparison with SOTA Methods
◆F1 on test sets
- Our method achieved better or competitive performance on
Japanese and Chinese datasets, compared to SOTA methods,
including Higashiyama+’19 (our method in the first study).
(Results tables shown separately for Japanese and Chinese.)
31. [Study 2] Exp 6. Performance for Unknown Words
◆Recall of top 10 frequent OOTV words
- For out-of-training-vocabulary (OOTV) words in test sets,
our method achieved better recall for words in the lexicon (Ls∪Lt),
but worse recall for words not in it.
(OOTV recall charts: JPT (Patent, Japanese) and FR (Novel, Chinese).)
32. [Study 2] Conclusion
- We proposed a cross-domain WS method to incorporate lexical
knowledge via an auxiliary prediction task.
- Our method achieved better performance for various target
domains than the lexicon feature baseline and existing methods
(while preventing performance degradation for source domains).
33. Study 3: Construction of a Japanese
UGT corpus for WS and LN
Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese
Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021
34. Study 3: UGT Corpus Construction
◆Background
- The lack of a public evaluation corpus for Japanese WS and LN
◆Goal
- Construct a public evaluation corpus for development and
fair comparison of Japanese WS and LN systems
◆Our contributions
- Constructed a corpus of blog and Q&A forum text annotated
with morphological and normalization information
- Conducted a detailed evaluation of UGT-specific problems of
existing methods
Example: 日本語まぢムズカシイ → segmented as 日本語|まぢ|ムズカシイ,
normalized to まじ and 難しい/むずかしい ('Japanese is really difficult.')
35. [Study 3] Corpus Construction Policies
1. Available and restorable
- Use blog and Chiebukuro (Yahoo! Answers) sentences in
the BCCWJ non-core data and publish annotation information
2. Compatible with existing segmentation standard
- Follow NINJAL's SUW (short unit word, 短単位) standard and
extend the specification to cover non-standard words
3. Enabling a detailed evaluation on UGT-specific
phenomena
- Organize linguistic phenomena frequently observed in UGT
into several categories and annotate every token with its category
36. [Study 3] Example Sentence in Our Corpus
Raw sentence: イイ歌ですねェ (ii uta desu nee, 'It's a good song, isn't it?')
Word boundaries: イイ | 歌 | です | ねェ
- イイ (ii 'good'): part-of-speech 形容詞 (adjective); standard forms 良い,よい,いい; category: char type variant
- 歌 (uta 'song'): part-of-speech 名詞 (noun)
- です (desu, copula): part-of-speech 助動詞 (auxiliary verb)
- ねェ (nee, emphasis marker): part-of-speech 助詞 (particle); standard form ね; category: sound change variant
37. [Study 3] Corpus Details
◆Word categories
- 11 categories were defined for non-general or nonstandard words that
may often cause segmentation errors:
新語/スラング (neologisms/slang), 固有名 (proper names), オノマトペ (onomatopoeia),
感動詞 (interjections), 方言 (dialect), 外国語 (foreign words),
顔文字/AA (emoticons/ASCII art), 異文字種 (character type variants),
代用表記 (alternative representations), 音変化 (sound changes), 誤表記 (misspellings)
- Most of our categories overlap with (Kaji+ 2015)'s classification.
◆Corpus statistics
[Table: corpus statistics]
38. [Study 3] Experiments
Using our corpus, we evaluated two existing systems
trained only on corpora annotated for WS and POS tagging.
• MeCab (Kudo+ 2004) with UniDic v2.3.0
- A popular Japanese morphological analyzer based on CRFs
• MeCab+ER (Expansion Rules)
- Our MeCab-based implementation of (Sasano+ 2013)’s
rule-based lattice expansion method
[Figure, cited from (Sasano+ 2014): human-crafted rules dynamically add nodes to the analysis lattice ('It was delicious.')]
39. [Study 3] Experiments
◆Evaluation
1. Overall results
2. Results for each category
3. Analysis of segmentation results
4. Analysis of normalization results
40. Study 3. Exp 1. Overall Performance
◆Results
- MeCab+ER achieved better Seg and POS performance by
2.5-2.9 F1 points, but poor Norm recall.
41. Study 3. Exp 2. Recall for Each Category
◆Results
- Both systems achieved high Seg and POS performance for general and
standard words, but lower performance for UGT-characteristic words.
- MeCab+ER correctly normalized 30-40% of SCV and AR nonstandard
words, but none of those in the other two categories.
[Table: Seg/POS/Norm recall for each category]
42. [Study 3] Conclusion
- We constructed a public Japanese UGT corpus
annotated with morphological and normalization information.
(https://github.com/shigashiyama/jlexnorm)
- Experiments on the corpus demonstrated the limited performance
of the existing systems for non-general and non-standard words.
43. Study 4: WS and LN for Japanese
UGT
Higashiyama et al., “A Text Editing Approach to Joint Japanese Word
Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021
(Best Paper Award)
44. Study 4: WS and LN for Japanese UGT
◆Goal
- Develop a Japanese WS and LN model that performs better than
existing systems, under the condition that labeled normalization
data for LN is unavailable
◆Our contributions
- Proposed methods for generating pseudo-labeled data, and
a text editing-based method for Japanese WS, POS tagging,
and LN
- Achieved better normalization performance than
an existing method
44
45. [Study 4] Background and Motivation
◆Frameworks for text generation
- Text editing method for English lexical normalization (Chrupała 2014)
- Encoder-decoder model for Japanese sentence normalization (Ikeda+ 2016)
◆Our approach
- Generate pseudo-labeled data for LN using lexical knowledge
- Use a text editing-based model to learn efficiently from
a small amount of (high-quality) training data
46. [Study 4] Task Formulation
◆Formulation as multiple sequence labeling tasks
Example (sentence x = 日本語まぢムズカシー):
  x  (chars): 日 本 語 ま ぢ ム ズ カ シ ー
  ys (Seg):   B E S B E B I I I E
  yp (POS):   Noun Noun Noun Adv Adv Adj Adj Adj Adj Adj
  ye (SEdit): KEEP KEEP KEEP KEEP REP(じ) KEEP KEEP KEEP KEEP REP(い)
  yc (CConv): KEEP KEEP KEEP KEEP KEEP HIRA HIRA HIRA HIRA KEEP
  ⇒ まぢ → まじ, ムズカシー → むずかしい
◆Normalization tags for Japanese character sets
- String edit operations (SEdit):
  {KEEP, DEL, INS_L(c), INS_R(c), REP(c)} (c: hiragana or katakana)
- Character type conversion (CConv): {KEEP, HIRA, KATA, KANJI}
◆Kana-kanji conversion
- Characters tagged KANJI are converted by a kana-kanji converter (n-gram LM).
  Example: も う あ き だ ('It's already autumn.') with CConv tags
  KEEP KEEP KANJI KANJI KEEP; あき → candidates 秋 'autumn',
  空き 'vacancy', 飽き 'bored', …
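A minimal sketch of how the per-character SEdit and CConv tags could be applied at decoding time (my own illustration: hiragana/katakana conversion uses the fixed Unicode offset between the two scripts, and KANJI spans are left to the separate kana-kanji converter, as on the slide):

```python
def kata_to_hira(ch):
    # Katakana and hiragana blocks differ by a fixed Unicode offset of 0x60
    return chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch

def hira_to_kata(ch):
    return chr(ord(ch) + 0x60) if "ぁ" <= ch <= "ゖ" else ch

def apply_tags(x, sedit, cconv):
    """Apply per-character string-edit (SEdit) and char-type-conversion
    (CConv) tags to sentence x and return the normalized string."""
    out = []
    for ch, e, c in zip(x, sedit, cconv):
        if e == "DEL":
            continue                       # drop the character
        if e.startswith("REP("):
            ch = e[4:-1]                   # replace with the tag's argument
        elif e.startswith("INS_L("):
            out.append(e[6:-1])            # insert to the left
        if c == "HIRA":
            ch = kata_to_hira(ch)
        elif c == "KATA":
            ch = hira_to_kata(ch)
        elif c == "KANJI":
            pass                           # handled by the kana-kanji converter
        out.append(ch)
        if e.startswith("INS_R("):
            out.append(e[6:-1])            # insert to the right
    return "".join(out)
```

On the slide's example, まぢ with REP(じ) becomes まじ, and ムズカシー with HIRA tags plus REP(い) becomes むずかしい.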
47. [Study 4] Variant Pair Acquisition
◆Standard and nonstandard word variant pairs for
pseudo-labeled data generation
A) Dictionary-based: extract variant pairs from UniDic, whose
hierarchical lemma definition groups word variants ⇒ 404K pairs
B) Rule-based: apply hand-crafted rules to transform standard forms
into nonstandard forms ⇒ 47K pairs
- 6 out of 10 rules are similar to those in (Sasano+ 2013) and (Ikeda+ 2016).
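The rule set itself is not shown on the slide. As an illustration of the mechanism only, the sketch below applies two toy rules (hypothetical examples, not the dissertation's actual ten rules) to standard forms to produce variant pairs:

```python
def lengthen(word):
    # Toy rule: replace a trailing う with the prolonged sound mark (そう → そー)
    return word[:-1] + "ー" if word.endswith("う") else None

def geminate(word):
    # Toy rule: insert a small っ before the final character (ほんと → ほんっと)
    return word[:-1] + "っ" + word[-1] if len(word) >= 2 else None

def generate_pairs(standard_words, rules):
    """Apply every rule to every standard form; each successful
    transformation yields a (standard, nonstandard) variant pair."""
    pairs = []
    for w in standard_words:
        for rule in rules:
            v = rule(w)
            if v is not None and v != w:
                pairs.append((w, v))
    return pairs

pairs = generate_pairs(["そう", "ほんと"], [lengthen, geminate])
```

Real rules would be more constrained (e.g., conditioned on character type and position) to avoid implausible variants.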
48. [Study 4] Pseudo-labeled Data Generation
◆Input
- (Auto-)segmented sentence x and
- Pair v of source (nonstandard) and target (standard) word variants
◆Target-side distant supervision (DStgt)
- Label an actual sentence containing the nonstandard form and
  synthesize its target (normalized) side.
  Example: x = スゴく|気|に|なる "(I'm) very curious.", v = (スゴく, すごく)
  ye = K K K K K K K; yc = H H K K K K K (over ス ゴ く 気 に な る)
  ⇒ synthetic target sentence すごく 気になる
- Pro: actual sentences can be used.
◆Source-side distant supervision (DSsrc)
- Replace the standard form in a clean sentence with the nonstandard
  variant to synthesize the source side.
  Example: x = ほんとう|に|心配 "(I'm) really worried.", v = (ほんっと, ほんとう)
  Synthetic source sentence: ほんっとに心配;
  ye = K K D IR(う) K K K; yc = K K K K K K K (over ほ ん っ と に 心 配)
  ⇒ ほんとう に心配
- Pro: any number of synthetic sentences can be generated.
(K=KEEP, H=HIRA, D=DEL, IR=INS_R)
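Deriving the SEdit tags for a synthetic source sentence amounts to character-aligning the nonstandard and standard forms. A simplified sketch using difflib for the alignment (the dissertation's actual derivation procedure may differ; this version only covers KEEP/DEL/REP and a single-character INS_R after a kept character):

```python
import difflib

def edit_tags(src, tgt):
    """Per-character SEdit tags that rewrite src into tgt (simplified)."""
    tags = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(
            a=src, b=tgt, autojunk=False).get_opcodes():
        if op == "equal":
            tags += ["KEEP"] * (i2 - i1)
        elif op == "delete":
            tags += ["DEL"] * (i2 - i1)
        elif op == "replace" and (i2 - i1) == (j2 - j1):
            tags += ["REP(%s)" % tgt[j1 + k] for k in range(i2 - i1)]
        elif op == "insert" and (j2 - j1) == 1 and tags and tags[-1] == "KEEP":
            tags[-1] = "INS_R(%s)" % tgt[j1]   # attach insert to previous char
        else:
            raise ValueError("alignment not covered by this sketch")
    return tags

def make_dssrc_example(words, standard, nonstandard):
    """Swap the standard word for its nonstandard variant (synthetic source
    sentence) and derive the SEdit tag sequence for the whole sentence."""
    x, ye = [], []
    for w in words:
        if w == standard:
            x.append(nonstandard)
            ye += edit_tags(nonstandard, standard)
        else:
            x.append(w)
            ye += ["KEEP"] * len(w)
    return "".join(x), ye
```

On the slide's example this reproduces ほんっと → ほんとう as K K D IR(う).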
49. [Study 4] Experimental Data
◆Pseudo-labeled data for training (and development)
- Dict/Rule-derived variant pairs: Vd and Vr
  (top np=20K frequent pairs used; at most ns=10 sentences
  were extracted for each pair)
- BCCWJ: a mixed-domain corpus of news, blog, Q&A forum, etc.
  • core data Dt with manual Seg&POS tags (57K sent.)
  • non-core data Du with auto Seg&POS tags (3.5M sent.)
- Generated pseudo-labeled sets: At via DSsrc, and Ad and Ar via DStgt
  (57K-173K synthetic sentences each)
◆Test data: BQNC
- 929 manually annotated sentences constructed in our third study
50. [Study 4] Experimental Settings
◆Our model
- BiLSTM + task-specific softmax layers
- Character embedding, pronunciation embedding, and
nonstandard word lexicon binary features
- Hyperparameter
• num_BiLSTM_layers=2, num_BiLSTM_units=1,000, char_emb_d=200, pron_emb_d=30, etc.
◆Baseline methods
- MeCab and MeCab+ER (Sasano+ 2013)
◆Evaluation
1. Main results
2. Effect of dataset size
3. Detailed results of normalization
4. Performance for known and unknown normalization instances
5. Error analysis
51. [Study 4] Exp 1. Main Results
◆Results
- Our method achieved better Norm performance
when trained on more types of pseudo-labeled data
- MeCab+ER achieved the best performance on Seg and POS
[Table: Seg/POS/Norm F1 of MeCab, MeCab+ER, and our BiLSTM model (with postprocessing), trained on combinations of At: DSsrc(Vdic), Ad: DStgt(Vdic), Ar: DStgt(Vrule)]
52. [Study 4] Exp 5. Error Analysis
◆Detailed normalization performance
- Our method outperformed MeCab+ER for all categories.
- Major errors by our model were mis-detection and
invalid tag prediction.
- Kanji conversion accuracy was 97% (67/70).
Examples of TPs (correctly normalized):
ほんと (に少人数で) → ほんとう 'actually'; すげぇ → すごい 'great';
フツー (の話をして) → 普通 'ordinary'; そーゆー → そう|いう 'such';
な~に (言ってんの) → なに 'what'; まぁるい → まるい 'round'
Examples of FPs (erroneous normalizations):
ガコンッ → ガコン 'thud'; ゴホゴホ → ごほごホ 'coughing sound';
はぁぁ → はああ 'sighing sound'; おお~~ → 王 'king';
ケータイ → ケイタイ 'cell phone'; ダルい → だるい 'dull'
53. [Study 4] Conclusion
- We proposed a text editing-based method for Japanese WS,
POS tagging, and LN.
- We proposed effective methods for generating pseudo-labeled data
for Japanese LN.
- The proposed method outperformed an existing method
on the joint segmentation and normalization task.
55. Summary of This Dissertation
1. How to achieve accurate WS in various text domains
- We proposed approaches for three domain types,
which can be effective options to achieve accurate
WS and downstream tasks in these domains.
2. How to train/evaluate WS and LN models for Japanese UGT
- We constructed a public evaluation corpus, which can be
a useful benchmark to compare existing and future systems.
- We proposed a joint WS&LN method trained on pseudo-labeled
data, which can be a good baseline for developing more practical
Japanese LN methods in the future.
56. Directions for Future Work
◆Model size and inference speed
- Knowledge distillation is a prospective approach to train a fast and
lightweight student model from an accurate teacher model.
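As a concrete illustration of the distillation idea, the sketch below shows a generic soft-target loss in the style of Hinton et al. (2015): cross-entropy between the teacher's temperature-softened distribution and the student's. This is a general recipe, not a method from this dissertation:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax; higher T gives a flatter distribution
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation term: cross-entropy H(p_teacher, q_student)
    computed over temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

By Gibbs' inequality the loss is minimized when the student matches the teacher's distribution; in practice it is mixed with the hard-label loss.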
◆Investigation of optimal segmentation unit
- Optimal units and effective combinations of different units
(char/subword/word) for downstream tasks remain to be explored.
◆Performance improvement on UGT processing
- Incorporating knowledge in large pretrained LMs may be effective.
◆Evaluation on broader UGT domains and phenomena
- Constructing evaluation data in various UGT domains would help
evaluate system performance on phenomena that occur frequently in
other UGT domains, such as proper names and neologisms.