Tomoyuki Kajiwara, Kazuhide Yamamoto.
Noun Paraphrasing Based on a Variety of Contexts.
In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computation (PACLIC 28), pp.644-649. Phuket, Thailand, December 2014.
1. Noun Paraphrasing Based on a Variety of Contexts
Tomoyuki Kajiwara and Kazuhide Yamamoto
Nagaoka University of Technology, Japan
2. Abstract
We propose a method for paraphrasing nouns in consideration of context.
The Characteristics of Our Proposed Method
– It paraphrases robustly, without relying on word frequency.
• Our Number-of-Differences (NoD) based method is better
than the Co-occurrence-Frequency (CoF) based method.
– It paraphrases depending on the context.
• e.g. Reduce the burdens on the back.
• NoD: load, stress, damage, exhaustion, tense, etc.
• CoF: cost, expense, actual cost, etc. (money-related)
4. Applications of Lexical Paraphrasing
• For Reading Assistance (Lexical Simplification)
– Never judge people by external appearance.
– Never judge people by outside appearance. ✔
• For Machine Translation (pre-editing)
– その本なら書類の下にある
→ It is under the papers if it is the book.
– その本は書類の下にある
→ The book is under the papers. ✔
5. Difficulty of Lexical Paraphrasing
• Force someone to shoulder a huge increase in his financial burdens.
– Force someone to shoulder a huge increase in his financial costs. ✔
– Force someone to shoulder a huge increase in his financial loads.
• Reduce the burdens on the back.
– Reduce the costs on the back.
– Reduce the loads on the back. ✔
Whether a paraphrase is possible or impossible changes depending on the context.
6. Approach
Input: Look for the access to the airport.
Output: Look for the way to the airport.
[Figure: the slot in "look for the ***" / "*** to the airport" and words that can fill it: restaurant, market, purpose, transfer, fee, way, bus, transportation, delivery]
Sorted by context similarity: 1. way 2. transfer 3. fee
7. Approach (continued)
Input: Look for the access to the airport.
Output: Look for the way to the airport.
[Same figure as slide 6, annotated with the two sub-goals:]
– To generate a proper sentence
– To select a suitable paraphrase
8. Proposed Method
We propose a method for paraphrasing nouns in consideration of context.
1. Extract candidate words used in the same context as the input sentence.
2. Calculate the similarity between the original and each candidate word, using:
• the number of distinct contexts shared by the original and the candidate word;
• the number of distinct contexts in which the candidate word appears.
3. Select the candidate word with the maximum similarity as the paraphrase:
original → paraphrase
9. Proposed Method (same slide repeated; step 1 is detailed next)
10. Extracting Candidate Words
• We want to extract candidate words used in the same context as the input.
• But words used in exactly the same full context are hardly ever found.
↓
• So, on the basis of the target word (access), the input sentence is divided into a pre-context and a post-context.
Look for the access to the airport.
pre-context: look for the *** / post-context: *** to the airport
[Figure: words observed in each half-context.
pre-context: restaurant, market, purpose, transfer, fee, way;
post-context: transfer, fee, way, bus, transportation, delivery]
11. Extracting Candidate Words
[Same figure as slide 10.]
• Words appearing in both half-contexts (transfer, fee, way) may be usable in the input sentence.
• With them, we can generate a proper sentence, as the toy check below shows.
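The extraction step amounts to a set intersection. A toy check in Python, with the word lists read off the figure (everything here is illustrative, not the authors' code):

```python
# Candidate fillers observed for each half-context (from the figure above).
pre_context_words = {"restaurant", "market", "purpose", "transfer", "fee", "way"}
post_context_words = {"transfer", "fee", "way", "bus", "transportation", "delivery"}

# Words attested in BOTH half-contexts can fill the slot and still
# yield a natural sentence.
print(sorted(pre_context_words & post_context_words))  # ['fee', 'transfer', 'way']
```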
12. Proposed Method (same slide repeated; step 2 is detailed next)
13. Calculating the Similarity Between Words
(1) The more distinct contexts the original and the candidate word share, the higher the paraphrasability.
(2) The more distinct contexts the candidate word appears in overall, the lower the paraphrasability.
common(A, B): the number of distinct contexts shared by A and B
difference(A): the number of distinct contexts of A
TNC: the total number of distinct contexts

similarity(original, candidate) = common(original, candidate) × log(TNC / difference(candidate))

The first factor realizes (1); the log factor realizes (2).
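A worked toy example of the formula, with invented counts chosen only to show how the log factor penalizes candidates that occur in many contexts:

```python
import math

# Invented counts, purely illustrative.
TNC = 1000  # total number of distinct contexts in the resource

# "load": shares 40 contexts with the original, occurs in 80 contexts overall.
sim_load = 40 * math.log(TNC / 80)   # = 40 * log(12.5) ~ 101.0
# "cost": shares 50 contexts, but occurs in 500 contexts overall.
sim_cost = 50 * math.log(TNC / 500)  # = 50 * log(2)    ~  34.7
```

Although "cost" shares more contexts with the original, it occurs in so many contexts overall that the more specific "load" is ranked higher, mirroring the burdens-on-the-back example.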
14. Analogy with TF-IDF
tf(w): the number of occurrences of the word w
df(w): the number of documents containing w
TND: the total number of documents
common(A, B): the number of distinct contexts shared by A and B
difference(A): the number of distinct contexts of A
TNC: the total number of distinct contexts

TF-IDF:      tf(word) × log(TND / df(word))
Our measure: common(original, candidate) × log(TNC / difference(candidate))

New statistic: count distinct contexts (Number of Differences) instead of occurrences.
15. Proposed Method (same slide repeated; step 3 selects the candidate with the maximum similarity; see the end-to-end sketch below)
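To tie the three steps together, here is a minimal end-to-end sketch in Python. `corpus` (triples of pre-context, word, post-context) and `contexts` (word → set of distinct contexts) are hypothetical stand-ins for the real resources described on slide 20; this is an illustration under those assumptions, not the authors' implementation.

```python
import math

def extract_candidates(pre, post, corpus):
    """Step 1: words attested with both the given pre- and post-context."""
    pre_words = {w for p, w, q in corpus if p == pre}
    post_words = {w for p, w, q in corpus if q == post}
    return pre_words & post_words

def similarity(original, candidate, contexts, tnc):
    """Step 2: common(original, candidate) * log(TNC / difference(candidate))."""
    common = len(contexts[original] & contexts[candidate])
    return common * math.log(tnc / len(contexts[candidate]))

def paraphrase(original, pre, post, corpus, contexts):
    """Step 3: return the candidate with the maximum similarity, or None."""
    tnc = len(set().union(*contexts.values()))  # TNC: total distinct contexts
    candidates = [c for c in extract_candidates(pre, post, corpus)
                  if c != original and contexts.get(c)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: similarity(original, c, contexts, tnc))
```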
16. Characteristics of Our Proposed Method
• Extraction
– We can generate a proper sentence based on the common contexts.
• Selection
– We can select a suitable paraphrase based on the number of distinct contexts.
Next, we compare experimentally against co-occurrence frequency and pointwise mutual information.
17. Comparative Methods
• Marton et al. (2009): Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases.
• Bhagat and Ravichandran (2008): Large Scale Acquisition of Paraphrases for Learning Surface Patterns.
1. Both methods generate a feature vector from the contexts of the target word (original).
2. They calculate the cosine similarity between the feature vectors.
3. They select the word with the maximum similarity as the paraphrase.
18. Comparative Methods
• [Marton 09]: co-occurrence frequency based method
• [Bhagat 08]: pointwise mutual information based method
(Both follow steps 1–3 on slide 17; a toy sketch follows.)
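For contrast, a minimal sketch of the baseline scheme: each word gets a context feature vector (co-occurrence counts for [Marton 09], PMI values for [Bhagat 08]) and candidates are ranked by cosine similarity. The vectors below are invented toy data, not figures from either paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented toy vectors: feature = context word, value = count or PMI.
vectors = {
    "burden": {"reduce": 5.0, "shoulder": 3.0, "financial": 2.0},
    "load":   {"reduce": 4.0, "carry": 2.0},
    "cost":   {"financial": 6.0, "reduce": 1.0},
}

original = "burden"
best = max((w for w in vectors if w != original),
           key=lambda w: cosine(vectors[original], vectors[w]))
```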
19. Experimental Setup
• Language: Japanese
– In this experiment we paraphrase Japanese nouns.
– The approach itself is language-independent.
• Definition of a context
– We define the context of a noun as the content words in the phrases that are in a dependency relation with that noun.
e.g. Look for the access to the airport.
20. Experimental Setup
• Web Japanese N-gram: used to extract candidate words.
– Japanese word N-grams (N = 1–7); we use the 7-grams as sentences.
– Each N-gram appears more than 20 times on the Web.
– We use 200 sentences from the 1.3M sentences of the form:
Noun … Noun(paraphrase target) … Verb(original form)
(Japanese is an SOV language.)
• Kyoto University case frames: used to calculate similarity.
– Japanese predicate–noun pairs collected from the Web.
– It contains 34k predicates and 824k nouns (we use all of them).
– We define these predicates as contexts and calculate the similarity between the nouns.
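A hedged sketch of the 7-gram filter described above, assuming tokens arrive already POS-tagged; the real pipeline would first run a Japanese morphological analyzer, which is out of scope here, and this helper is an illustrative assumption rather than the authors' code.

```python
def matches_pattern(seven_gram):
    """Keep 7-grams shaped like: Noun ... Noun(target) ... Verb.

    seven_gram: a list of (surface, pos) pairs, already tagged;
    the tag names and this helper are illustrative assumptions.
    """
    pos = [p for _, p in seven_gram]
    return (len(pos) == 7
            and pos[0] == "NOUN"
            and pos[-1] == "VERB"
            and "NOUN" in pos[1:-1])
```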
22. Number of Paraphrasable Nouns at Rank 1 of Similarity
• Highly frequent words (e.g. こと "thing") have a bad influence.
• Suffix-like words (e.g. counters that describe the number of items) have a bad influence.
The proposed method is robust because it does not depend on word frequency.
24. Relationship Between Similarity Rank and the Number of Paraphrasable Nouns
• There are few differences across ranks.
• Many paraphrases appear at rank 1.
25. Examples of Paraphrasing in Consideration of Context
• Assign a maximum penalty of N$.
– Comparative method: imprisonment, pecuniary penalty, etc.
– Our method: paying penalty, administrative penalty, etc.
• imprisonment does not appear as a candidate.
• Reduce the burdens on the back.
– Comparative method: cost, expenses, actual cost, etc.
• All of these are money-related.
• None of the words in the top 10 are appropriate.
– Our method: load, stress, damage, exhaustion, tense, etc.
• All of these are appropriate paraphrases in this context.
26. Conclusion
We proposed a method for paraphrasing nouns in consideration of context.
The Characteristics of Our Proposed Method
– It paraphrases robustly, without relying on word frequency.
• Our Number-of-Differences (NoD) based method is better than the Co-occurrence-Frequency (CoF) based method.
– It paraphrases depending on the context.
• e.g. Reduce the burdens on the back.
• NoD: load, stress, damage, exhaustion, tense, etc.
• CoF: cost, expense, actual cost, etc. (money-related)