In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether identical words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter out homonyms from the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
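The abstract does not spell out the three strategies, but the filtering idea itself is simple to picture. The sketch below is purely illustrative: the homonym set is assumed to come from some upstream detection step (hypothetical here), and the similarity measure is a plain cosine over term counts, not the paper's actual computation.

```python
from collections import Counter
import math

def cosine_similarity(doc_a, doc_b, homonyms=frozenset()):
    """Cosine similarity over term counts, ignoring terms flagged as homonyms.

    `homonyms` is assumed to be produced by an upstream detection strategy
    (e.g. one of the paper's three filters); here it is just a given set.
    """
    ta = Counter(t for t in doc_a if t not in homonyms)
    tb = Counter(t for t in doc_b if t not in homonyms)
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = math.sqrt(sum(v * v for v in ta.values())) * \
           math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0

# "bank" is used with different meanings in the two documents,
# so it is excluded from the similarity estimate.
a = "the bank raised interest rates".split()
b = "we walked along the river bank".split()
print(cosine_similarity(a, b, homonyms={"bank"}))
```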
This research describes an attempt to establish a pedagogically useful list of the most frequent semantically non-compositional multi-word combinations for English for Journalism learners in an EFL context, who need to read English news in their field of study. The list was compiled from the NOW (News on the Web) Corpus, by far the largest English news database. To capture opaque multi-word combinations that are in widespread use and of pedagogical value, the researcher applied a set of selection criteria when using the corpus. Based on frequency, meaningfulness, and semantic non-compositionality, a total of 318 non-compositional multi-word combinations of 2 to 5 words, excluding phrasal verbs, were selected; they accounted for approximately 2% of the total words in the corpus. The list, not highly technical in nature, contains the most commonly used multi-word units traversing various topic areas, and news readers may encounter these phrasal expressions very often. As with other individual word lists, it is hoped that this list of opaque expressions may serve as a reference for English for Journalism teaching.
International Journal of Engineering and Science Invention, IJESI (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, covering new teaching methods, assessment, validation, and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Word sense disambiguation using a WSD-specific WordNet of polysemous words (ijnlc)
This paper presents a new model of WordNet that is used to disambiguate the correct sense of a polysemous word based on clue words. The related words for each sense of a polysemous word, as well as for a single-sense word, are referred to as clue words. The conventional WordNet organizes nouns, verbs, adjectives and adverbs together into sets of synonyms called synsets, each expressing a different concept. In contrast to the structure of WordNet, we developed a new model of WordNet that organizes the different senses of polysemous words, as well as single-sense words, based on clue words. These clue words for each sense of a polysemous word, as well as for each single-sense word, are used to disambiguate the correct meaning of the polysemous word in a given context using knowledge-based Word Sense Disambiguation (WSD) algorithms. The clue word can be a noun, verb, adjective or adverb.
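A minimal sketch of how clue words could drive sense selection: the sense whose clue words overlap most with the context wins. The clue-word inventory below is invented for illustration; the paper's actual WordNet model is not reproduced.

```python
def disambiguate(word, context, clue_words):
    """Pick the sense of `word` whose clue words overlap most with the context.

    `clue_words` maps each sense label to its set of clue words; in the
    paper these come from the proposed WordNet model, here they are given.
    """
    context = set(context)
    return max(clue_words, key=lambda sense: len(clue_words[sense] & context))

# Hypothetical clue-word inventory for the polysemous word "bat".
clues = {
    "bat.animal": {"cave", "wings", "nocturnal", "fly"},
    "bat.sports": {"ball", "cricket", "baseball", "swing"},
}
print(disambiguate("bat", "the bat flew out of the cave".split(), clues))
# -> bat.animal
```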
Improvement of a WSD dictionary using an annotated corpus and testing it with simplif... (csandit)
WSD is a task with a long history in computational linguistics, and it remains an open problem in NLP. This research focuses on increasing the accuracy of the Lesk algorithm with the assistance of an annotated corpus, the Narodowy Korpus Jezyka Polskiego (NKJP, "Polish National Corpus"). The NKJP_WSI (NKJP Word Sense Inventory) is used as the sense inventory. The Lesk algorithm is first run on the whole corpus (training and test) to obtain results, with the assistance of a special dictionary that contains all possible senses for each ambiguous word. In this implementation, the similarity equation applies the tf-idf measure from information retrieval, with a small modification to meet the requirements. Experimental results show accuracies of 82.016% and 84.063% without and with stop-word removal, respectively. Moreover, this paper practically addresses the challenge of execution time: we propose a special structure for building another dictionary from the corpus in order to reduce the time complexity of the training process. The new dictionary contains all the possible words (only those which help in solving WSD) with their tf-idf values, derived from the existing dictionary with the assistance of the annotated corpus. Experimental results show that the two tests are identical in accuracy, while the execution time of the second test dropped by a factor of 20 compared to the first.
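The combination of Lesk-style gloss overlap with tf-idf weighting can be sketched as below. This is a simplified stand-in, assuming invented glosses and idf weights; the paper's exact similarity equation and Polish resources are not reproduced.

```python
def lesk_tfidf(context, sense_glosses, idf):
    """Simplified Lesk: score each sense by the tf-idf-weighted overlap
    between the context and the sense gloss; the highest score wins."""
    context = set(context)

    def score(gloss):
        # Sum idf weights of gloss words that also appear in the context.
        return sum(idf.get(t, 0.0) for t in set(gloss) & context)

    return max(sense_glosses, key=lambda s: score(sense_glosses[s]))

# Hypothetical sense glosses and idf weights.
glosses = {
    "bank.finance": "institution money deposits loans".split(),
    "bank.river": "sloping land beside river water".split(),
}
idf = {"money": 2.3, "river": 2.1, "land": 1.2, "deposits": 2.8}
print(lesk_tfidf("she kept her money in the bank".split(), glosses, idf))
# -> bank.finance
```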
Corpus-based part-of-speech disambiguation of Persian (IDES Editor)
In this paper we introduce a method for part-of-speech disambiguation of Persian texts, which uses word-class probabilities in a relatively small training corpus in order to automatically tag unrestricted Persian texts. The experiment has been carried out at two levels, unigram and bigram disambiguation. Comparing the results obtained at the two levels, we show that using the immediate right context of a given word can increase the accuracy of the system to a high degree.
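The unigram/bigram distinction can be pictured with a toy tagger. The probability tables below are invented English stand-ins; the paper's Persian corpus and exact model are not reproduced.

```python
def unigram_tag(word, p_tag_given_word):
    """Unigram level: pick the most probable tag for the word in isolation."""
    return max(p_tag_given_word[word], key=p_tag_given_word[word].get)

def bigram_tag(word, right_word, p_tag_given_word, p_tag_given_pair):
    """Bigram level: also condition on the immediate right context."""
    pair = (word, right_word)
    if pair in p_tag_given_pair:
        return max(p_tag_given_pair[pair], key=p_tag_given_pair[pair].get)
    return unigram_tag(word, p_tag_given_word)  # back off to unigram

# Toy probability tables.
p_unigram = {"record": {"NOUN": 0.6, "VERB": 0.4}}
p_bigram = {("record", "player"): {"NOUN": 0.9, "VERB": 0.1},
            ("record", "everything"): {"NOUN": 0.2, "VERB": 0.8}}
print(unigram_tag("record", p_unigram))                         # NOUN
print(bigram_tag("record", "everything", p_unigram, p_bigram))  # VERB
```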
Lecture 2: From Semantics To Semantic-Oriented Applications (Marina Santini)
From the "Natural Language Processing" LinkedIn group:
John Kontos, Professor of Artificial Intelligence
"I wonder whether translating into formal logic is nothing more than transliteration, which simply isolates the part of the text that can be reasoned upon using the simple inference mechanism of formal logic. The real problem, I think, lies with the part of the text that CANNOT be translated, on the one hand, and the part that changes its meaning due to advances in civilization, on the other. My own proposal is to leave NL text alone and try building inference mechanisms for the UNTRANSLATED text depending on the task requirements.
All the best,
John"
Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web... (Svetlin Nakov)
Scientific paper: False friends are pairs of words in two languages that are perceived as similar but have different meanings; e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs used as "bridges". Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously proposed algorithms.
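One of the intuitions here, words that look alike but rarely translate each other, can be sketched as follows. The form measure, the co-occurrence statistic, and the threshold are simplified placeholders for the paper's measures, and the two-sentence bi-text is invented.

```python
from difflib import SequenceMatcher

def false_friend_score(w1, w2, bitext, form_threshold=0.7):
    """Toy false-friend detector over a sentence-aligned bi-text.

    Words with similar surface form that rarely occur in aligned
    sentence pairs are likely false friends.
    """
    form_sim = SequenceMatcher(None, w1, w2).ratio()
    if form_sim < form_threshold:
        return 0.0  # not perceived as similar: not a candidate at all
    n1 = sum(w1 in src for src, tgt in bitext)
    n2 = sum(w2 in tgt for src, tgt in bitext)
    both = sum(w1 in src and w2 in tgt for src, tgt in bitext)
    cooc = both / max(min(n1, n2), 1)   # how often they translate each other
    return form_sim * (1.0 - cooc)      # similar form, unrelated usage

# German "gift" (poison) vs English "gift": identical form, no co-occurrence.
bitext = [(["das", "gift", "wirkt"], ["the", "poison", "works"]),
          (["ein", "geschenk"], ["a", "gift"])]
print(false_friend_score("gift", "gift", bitext))  # high score: false friend
```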
The spread and abundance of electronic documents requires automatic techniques for extracting useful information from the text they contain. The availability of conceptual taxonomies can be of great help, but manually building them is a complex and costly task. Building on previous work, we propose a technique to automatically extract conceptual graphs from text and reason with them. Since automated learning of taxonomies needs to be robust with respect to missing or partial knowledge and flexible with respect to noise, this work proposes a way to deal with these problems. The case of poor data/sparse concepts is tackled by finding generalizations among disjoint pieces of knowledge. Noise is handled by introducing soft relationships among concepts rather than hard ones, and by applying a probabilistic inferential setting. In particular, we propose to reason on the extracted graph using different kinds of relationships among concepts, where each arc/relationship is associated with a number that represents its likelihood among all possible worlds, and to face the problem of sparse knowledge by using generalizations among distant concepts as bridges between disjoint portions of knowledge.
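The soft-relationship idea can be pictured as a graph whose arcs carry likelihoods rather than being hard edges. The concepts, relations, and numbers below are invented, and the independence assumption is a simplification of the paper's probabilistic inferential setting.

```python
# Each arc (concept, relation, concept) carries a likelihood.
soft_graph = {
    ("dog", "is_a", "animal"): 0.98,
    ("dog", "related_to", "leash"): 0.62,
    ("animal", "is_a", "organism"): 0.95,
}

def chain_likelihood(path, graph):
    """Naive likelihood of a reasoning chain: product of arc likelihoods
    (assumes arc independence, a simplification for illustration)."""
    p = 1.0
    for arc in path:
        p *= graph.get(arc, 0.0)
    return p

print(chain_likelihood([("dog", "is_a", "animal"),
                        ("animal", "is_a", "organism")], soft_graph))  # ~0.93
```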
A comparative analysis of particle swarm optimization and k-means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Clustering of documents is the grouping of documents such that the documents within each group are similar to each other and not to the documents of other groups. The quality of a clustering result depends greatly on the representation of the text and the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms. The bag-of-terms representation is often unsatisfactory, as it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to a word; the bag-of-terms representation of text is thus enriched with synonyms from WordNet. K-means, Particle Swarm Optimization (PSO), and hybrid PSO+K-means algorithms are applied to clustering of text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
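The synset-enrichment step can be sketched with NLTK's English WordNet and plain k-means (the paper works on Nepali text and also evaluates PSO variants, which scikit-learn does not provide; both are omitted here). Requires nltk and scikit-learn installed, plus nltk.download("wordnet").

```python
from nltk.corpus import wordnet as wn
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def enrich(tokens):
    """Add WordNet synonyms to the token bag so that documents using
    different words for the same concept become more similar."""
    enriched = list(tokens)
    for t in tokens:
        for syn in wn.synsets(t):
            enriched.extend(l.name().lower() for l in syn.lemmas())
    return " ".join(enriched)

docs = ["the car engine roared", "the automobile motor failed",
        "the cat chased the mouse"]
X = TfidfVectorizer().fit_transform(enrich(d.split()) for d in docs)
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))
# the two vehicle documents land in the same cluster
```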
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren... (ijaia)
Chinese discourse coherence modeling remains a challenging task in the Natural Language Processing field. Existing approaches mostly focus on feature engineering, adopting sophisticated features to capture the logical, syntactic, or semantic relationships across sentences within a text. In this paper, we present an entity-driven recursive deep model for Chinese discourse coherence evaluation, based on a current English discourse coherence neural network model. Specifically, to overcome the shortcoming in identifying entity (noun) overlap across sentences in the current model, our combined model successfully incorporates entity information into the recursive neural network framework. Evaluation results on both a sentence ordering task and a machine translation coherence rating task show the effectiveness of the proposed model, which significantly outperforms a strong existing baseline.
LEARNING CROSS-LINGUAL WORD EMBEDDINGS WITH UNIVERSAL CONCEPTS (ijwscjournal)
Recent advances in generating monolingual word embeddings based on word co-occurrence for universal languages have inspired new efforts to extend the model to support diversified languages. State-of-the-art methods for learning cross-lingual word embeddings rely on the alignment of monolingual word embedding spaces. Our goal is to implement word co-occurrence across languages with the universal concepts method. Such concepts are notions that are fundamental to humankind and are thus persistent across languages, e.g., man or woman, war or peace, etc. Given bilingual lexicons, we built universal concepts as undirected graphs of connected nodes and then replaced the words belonging to the same graph with a unique graph ID. This intuitive design makes use of universal concepts in monolingual corpora, which helps generate meaningful word embeddings across languages via the word co-occurrence concept. Standardized benchmarks demonstrate how this underutilized approach competes with SOTA on bilingual word semantic similarity and word relatedness tasks.
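The graph construction step is concrete enough to sketch: build an undirected graph from bilingual lexicon entries, take connected components as the universal concepts, and replace words with their component ID. The tiny lexicon below is invented, and networkx is used purely for convenience.

```python
import networkx as nx

# Invented bilingual lexicon entries (word, translation).
lexicon = [("man", "homme"), ("man", "mann"), ("war", "guerre"),
           ("woman", "femme"), ("homme", "mann")]

g = nx.Graph(lexicon)                       # undirected translation graph
concept_id = {}
for cid, component in enumerate(nx.connected_components(g)):
    for word in component:
        concept_id[word] = f"CONCEPT_{cid}"

def replace_with_concepts(tokens):
    """Map every word in a monolingual corpus to its universal concept ID,
    so co-occurrence statistics are shared across languages."""
    return [concept_id.get(t, t) for t in tokens]

print(replace_with_concepts("the man saw the war".split()))
# -> ['the', 'CONCEPT_0', 'saw', 'the', 'CONCEPT_1']  (IDs may vary)
```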
Automatic Identification of False Friends in Parallel Corpora: Statistical an... (Svetlin Nakov)
Scientific article: False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus by analyzing the words' local contexts extracted from the text snippets returned by searching in Google. The statistical and semantic measures are further combined into an improved algorithm for the identification of false friends that achieves almost twice the performance of previously known algorithms. The evaluation is performed for identifying cognates between Bulgarian and Russian, but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available.
A Natural Logic for Artificial Intelligence, and its Risks and Benefits (gerogepatton)
This paper is a multidisciplinary project proposal, submitted in the hopes that it may garner enough interest to launch it with members of the AI research community, along with linguists and philosophers of mind and language interested in constructing a semantics for a natural logic for AI. The paper outlines some of the major hurdles in the way of "semantics-driven" natural language processing based on standard predicate logic, and sketches out the steps to be taken toward a "natural logic": a semantic system explicitly defined on a well-regimented (but indefinitely expandable) fragment of a natural language that can, therefore, be "intelligently" processed by computers, using the semantic representations of the phrases of the fragment.
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a... (Svetlin Nakov)
Nakov P., Nakov S., Paskaleva E., Improved Word Alignments Using the Web as a Corpus, Proceedings of the International Conference RANLP 2007 (Recent Advances in Natural Language Processing), pp. 400-405, ISBN 978-954-91743-7-3, Borovets, Bulgaria, 27-29 September 2007
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
In this paper, we present a survey on Word Sense Disambiguation (WSD). Research in WSD has been conducted to varying extents in nearly all major languages around the world. We survey the different approaches adopted in different research works, the state of the art in performance in this domain, recent work in different Indian languages, and finally work in the Bengali language. We also survey the different competitions in this field and the benchmark results obtained from those competitions.
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES (csandit)
Natural Language Processing is an interdisciplinary branch of linguistics and computer science, studied under Artificial Intelligence (AI), that gave birth to an allied area called ‘Computational Linguistics’, which focuses on the processing of natural languages on computational devices. A natural language consists of a large number of sentences, which are linguistic units involving one or more words linked together in accordance with a set of predefined rules called a grammar. Grammar checking is the task of validating sentences syntactically and is a prominent tool within language engineering. Our review draws on the recent development of various grammar checkers to look at the past, present and future in a new light. It covers grammar checkers for many languages, with the aim of examining their approaches and methodologies for developing new tools and systems as a whole. The survey concludes with a discussion of the various features included in existing grammar checkers for foreign languages as well as a few Indian languages.
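To make the task definition concrete, here is a toy rule-based check in the spirit the survey describes: each rule is a pattern plus a diagnostic. Real checkers reviewed in the survey use far richer rule sets, parsers, or statistical models; the two rules below are invented examples.

```python
import re

# Each rule: a compiled pattern and a diagnostic message template.
RULES = [
    (re.compile(r"\ba\s+([aeiou]\w*)", re.I), 'use "an" before "{0}"'),
    (re.compile(r"\b(\w+)\s+\1\b", re.I), 'repeated word "{0}"'),
]

def check(sentence):
    """Return a list of diagnostics for rule violations in the sentence."""
    return [msg.format(m.group(1))
            for pattern, msg in RULES
            for m in pattern.finditer(sentence)]

print(check("She ate a apple and and left."))
# ['use "an" before "apple"', 'repeated word "and"']
```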
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance on capturing word similarities with existing benchmark datasets for word-pair similarities. The research in this paper conducts a correlation analysis between ground-truth word similarities and similarities obtained by different word embedding methods.
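The correlation analysis described here is standard enough to sketch: rank-correlate human similarity ratings with cosine similarities from an embedding model. The three-pair "benchmark" and the embeddings below are invented stand-ins; real evaluations use datasets such as WordSim-353 or SimLex-999.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in embeddings: near-synonyms get nearby vectors on purpose.
rng = np.random.default_rng(0)
base = {w: rng.normal(size=50) for w in ["cup", "car", "cat", "piano"]}
embeddings = dict(base)
embeddings["mug"] = base["cup"] + 0.1 * rng.normal(size=50)
embeddings["automobile"] = base["car"] + 0.1 * rng.normal(size=50)

# Invented (word, word, human rating) benchmark triples.
benchmark = [("cup", "mug", 9.1), ("car", "automobile", 9.6),
             ("cat", "piano", 0.8)]

model_sims = [cosine(embeddings[a], embeddings[b]) for a, b, _ in benchmark]
human_sims = [score for _, _, score in benchmark]
rho, _ = spearmanr(model_sims, human_sims)
print(f"Spearman correlation: {rho:.2f}")
```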
Automatic classification of Bengali sentences based on sense definitions pres... (ijctcm)
Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to classify Bengali sentences automatically into different groups in accordance with their underlying senses. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while information about the different senses of a particular ambiguous lexical item is collected from the Bengali WordNet. On an experimental basis we have used the Naive Bayes probabilistic model as a classifier of sentences. We have applied the algorithm over 1747 sentences that contain a particular Bengali lexical item which, because of its ambiguous nature, is able to trigger different senses that render the sentences with different meanings. In our experiment we have achieved around 84% accuracy in sense classification over the total input sentences. We have analyzed the residual sentences that did not comply with our experiment and affected the results, and note that in many cases wrong syntactic structures and sparse semantic information are the main hurdles in the semantic classification of sentences. The practical relevance of this study is attested in automatic text classification, machine learning, information extraction, and word sense disambiguation.
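A Naive Bayes sense classifier over sentences, in the spirit of the experiment above, can be sketched with scikit-learn. The sentences and sense labels are invented English stand-ins for the Bengali corpus data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training sentences containing the ambiguous item "bank", each labeled
# with the sense it triggers.
train_sentences = [
    "he deposited cash at the bank",
    "the bank approved the loan",
    "they fished from the river bank",
    "the bank of the stream was muddy",
]
train_senses = ["finance", "finance", "river", "river"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_sentences, train_senses)
print(clf.predict(["she opened an account at the bank"]))  # ['finance']
```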
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING (cscpconf)
In the last decade, ontologies have played a key technology role for information sharing and agent interoperability in different application domains. In the semantic web domain, ontologies are efficiently used to face the great challenge of representing the semantics of data, in order to bring the actual web to its full power and hence achieve its objective. However, using ontologies as common and shared vocabularies requires a certain degree of interoperability between them. To meet this requirement, mapping ontologies is a solution that cannot be avoided. Indeed, ontology mapping builds a meta layer that allows different applications and information systems to access and share their information, of course after resolving the different forms of syntactic, semantic and lexical mismatches. In the contribution presented in this paper, we have integrated the semantic aspect based on an external lexical resource, WordNet, to design a new algorithm for fully automatic ontology mapping. This fully automatic character is the main difference between our contribution and most of the existing semi-automatic algorithms for ontology mapping, such as Chimaera, Prompt, Onion, Glue, etc. To further enhance the performance of our algorithm, the mapping discovery stage is based on the combination of two sub-modules: the former analyses the concepts' names and the latter analyses their properties. Each of these two sub-modules is itself based on the combination of lexical and semantic similarity measures.
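Combining a lexical (name-based) and a semantic (WordNet-based) similarity, as the two sub-modules suggest, could look like the sketch below. The weights and threshold are invented, property analysis is omitted, and nltk with the WordNet corpus downloaded is assumed.

```python
from difflib import SequenceMatcher
from nltk.corpus import wordnet as wn

def lexical_sim(a, b):
    """String similarity between the two concept names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def semantic_sim(a, b):
    """Best WordNet path similarity over all synset pairs."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(a) for s2 in wn.synsets(b)]
    return max(scores, default=0.0)

def concept_match(a, b, w_lex=0.4, w_sem=0.6, threshold=0.5):
    """Map two concept names if the weighted combination of lexical and
    semantic similarity clears a threshold (all numbers illustrative)."""
    return w_lex * lexical_sim(a, b) + w_sem * semantic_sim(a, b) >= threshold

print(concept_match("car", "automobile"))  # True: synonyms in WordNet
```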
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA (ijistjournal)
Ontologies have been applied to many applications in recent years, especially the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, as well as their definitions and interrelationships. This paper describes some algorithms for identifying semantic relations and constructing an Information Technology ontology, extracting the concepts and objects from different sources. The ontology is constructed based on three main resources: ACM, Wikipedia and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use Natural Language Processing tools, such as OpenNLP and the Stanford lexical dependency parser, to explore sentences. We then extract these sentences based on English patterns in order to build a training set. We use a random sample among 245 categories of ACM to evaluate our results. The results generated show that our system yields superior performance.
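A flavor of pattern-based relation extraction can be given with a Hearst-style surface pattern for "is-a" relations. This is a much-simplified stand-in for the dependency-based extraction the paper performs with OpenNLP and the Stanford parser; the pattern and sentences are illustrative.

```python
import re

# Surface pattern for "X is a/an Y".
ISA = re.compile(r"(\w[\w ]*?) is (?:a|an) (\w[\w ]*)", re.I)

def extract_isa(sentences):
    """Yield (hyponym, hypernym) pairs matched by the surface pattern."""
    for s in sentences:
        for m in ISA.finditer(s):
            yield m.group(1).strip().lower(), m.group(2).strip().lower()

sents = ["Java is a programming language.",
         "An ontology is a formal specification of a conceptualization."]
print(list(extract_isa(sents)))
```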
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH (cscpconf)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far, there have been many studies concerning concept mining in English, but this area of study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but taking into account that dictionary entries have synonyms, hypernyms, hyponyms and other relationships in their meaning texts, the success rate has been high for determining concepts. This concept extraction method is applied to documents collected from different corpora.
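A rough sketch of the dictionary-based idea: score candidate concepts by how strongly document words overlap with their dictionary meaning texts. The miniature English dictionary below is invented; the actual method works on a Turkish dictionary with richer relations.

```python
# Invented two-entry dictionary: concept -> meaning text.
dictionary = {
    "economy": "system of production trade and money of a country",
    "weather": "state of the atmosphere rain wind temperature",
}

def extract_concepts(doc_tokens, dictionary, top_n=1):
    """Rank concepts by overlap between the document and meaning texts."""
    doc = set(doc_tokens)
    scores = {concept: len(doc & set(meaning.split()))
              for concept, meaning in dictionary.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

doc = "trade slowed and money lost value across the country".split()
print(extract_concepts(doc, dictionary))  # ['economy']
```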
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
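Bisecting k-means itself is easy to sketch in miniature: repeatedly split the largest cluster with 2-means until k clusters remain. This single-machine sketch omits the MapReduce distribution and the WordNet enrichment step, and uses random synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Split the largest cluster in two until k clusters exist."""
    clusters = [np.arange(X.shape[0])]           # start with one cluster
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()                 # take the largest cluster
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[biggest])
        clusters.append(biggest[labels == 0])
        clusters.append(biggest[labels == 1])
    return clusters

# Three synthetic document groups in a 5-dimensional feature space.
X = np.vstack([np.random.randn(20, 5) + c for c in (0, 5, 10)])
for i, idx in enumerate(bisecting_kmeans(X, 3)):
    print(f"cluster {i}: {len(idx)} documents")
```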
Continuous bag of words (CBOW) word2vec word embedding work .pdf (devangmittal4)
The way the continuous bag-of-words (CBOW) word2vec embedding works is that it predicts the probability of a word given a context. A context may be a single word or a group of words. For simplicity, I will take a single context word and try to predict a single target word.
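The single-context case can be written out directly: project the one-hot context word through an input matrix W1, score every vocabulary word through an output matrix W2, and apply a softmax. The tiny vocabulary and random weights below are illustrative; training (backpropagation over these matrices) is what word2vec actually learns.

```python
import numpy as np

vocab = ["word", "embeddings", "semantics", "language"]
V, N = len(vocab), 3                 # vocabulary size, embedding dimension
rng = np.random.default_rng(42)
W1 = rng.normal(size=(V, N))         # input -> hidden: rows are embeddings
W2 = rng.normal(size=(N, V))         # hidden -> output scores

def predict_target(context_word):
    """P(target word | single context word) under the CBOW architecture."""
    h = W1[vocab.index(context_word)]               # hidden layer = embedding
    scores = h @ W2
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax
    return dict(zip(vocab, probs.round(3)))

print(predict_target("language"))
```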
The purpose of this question is to be able to create a word embedding for the given data set.
data set text:
In linguistics word embeddings were discussed in the research area of distributional semantics. It
aims to quantify and categorize semantic similarities between linguistic items based on their
distributional properties in large samples of language data. The underlying idea that "a word is
characterized by the company it keeps" was popularized by Firth.
The technique of representing words as vectors has roots in the 1960s with the development of
the vector space model for information retrieval. Reducing the number of dimensions using
singular value decomposition then led to the introduction of latent semantic analysis in the late
1980s. In 2000, Bengio et al. provided in a series of papers the "Neural probabilistic language
models" to reduce the high dimensionality of words representations in contexts by "learning a
distributed representation for words". (Bengio et al, 2003). Word embeddings come in two different
styles, one in which words are expressed as vectors of co-occurring words, and another in which
words are expressed as vectors of linguistic contexts in which the words occur; these different
styles are studied in (Lavelli et al, 2004). Roweis and Saul published in Science how to use
"locally linear embedding" (LLE) to discover representations of high dimensional data structures.
The area developed gradually and really took off after 2010, partly because important advances
had been made since then on the quality of vectors and the training speed of the model.
There are many branches and many research groups working on word embeddings. In 2013, a
team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit which can train
vector space models faster than the previous approaches. Most new word embedding techniques
rely on a neural network architecture instead of more traditional n-gram models and unsupervised
learning.
Limitations
One of the main limitations of word embeddings (word vector space models in general) is that
possible meanings of a word are conflated into a single representation (a single vector in the
semantic space). Sense embeddings are a solution to this problem: individual meanings of words
are represented as distinct vectors in the space.
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and Proteins) for
bioinformatics applications have been proposed by Asgari and Mofrad. Named bio-vectors (BioVec) to refer to biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics.
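To actually create a word embedding for the data set text above, one option is gensim's Word2Vec with CBOW selected (sg=0). This assumes gensim 4.x is installed (pip install gensim); the corpus string is elided here and stands in for the full passage.

```python
from gensim.models import Word2Vec

corpus_text = """In linguistics word embeddings were discussed in the
research area of distributional semantics ..."""  # the data set text above

# One token list per line; a real run would use proper sentence splitting.
sentences = [line.lower().split() for line in corpus_text.splitlines() if line]
model = Word2Vec(sentences, vector_size=50, window=5,
                 min_count=1, sg=0, epochs=50)   # sg=0 selects CBOW

print(model.wv["embeddings"][:5])                # a learned word vector
print(model.wv.most_similar("embeddings", topn=3))
```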
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... (ijnlc)
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, a tricky task to perform in Natural Language Processing, which is used in applications like machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems, by reason of its semantic perceptiveness. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been developed, but in vain, as the attempt to create the training corpus, a tagged sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence which is based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
Exploiting rules for resolving ambiguity in Marathi language text (eSAT Journals)
Abstract: Natural language ambiguity is a situation in which some words have multiple meanings/senses. This paper discusses natural language ambiguity and its types. Further, we propose a knowledge-based solution to resolve various types of ambiguity occurring in Marathi-language text. The task of resolving semantic and lexical ambiguity occurring in words to obtain the actual sense is referred to as Word Sense Disambiguation (WSD). Marathi is the official and commonly spoken language of the state of Maharashtra in India. Plenty of words in Marathi are spelled the same and uttered the same but are semantically (meaning-wise/sense-wise) different. During automatic translation, these words lead to ambiguity. Our method successfully removes the ambiguity by identifying the correct sense of the given text from the predefined possible senses available in the Marathi Wordnet, using word and sentence rules. The method is applicable only to word-level ambiguity; structural ambiguity is not handled by this system. This system may be successfully used as a subsystem in other Natural Language Processing (NLP) applications.
Key Words: Word Sense Disambiguation, Natural Language Processing, Marathi, Marathi Wordnet, ambiguity, knowledge based
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
AMBIGUITY-AWARE DOCUMENT SIMILARITY
International Journal on Natural Language Computing (IJNLC) Vol. 2, No.3, June 2013
DOI : 10.5121/ijnlc.2013.2302
Fabrizio Caruso, Neodata Group, Fabrizio_Caruso@hotmail.com
Giovanni Giuffrida, Dept. of Social Science, University of Catania, giovanni.giuffrida@dmi.unict.it
Diego Reforgiato, Neodata Group, diego.reforgiato@neodatagroup.com
Giuseppe Tribulato, gtsystem@gmail.com
Calogero Zarba, Neodata Intelligence, calogero.zarba@neodatagroup.com
ABSTRACT
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether the same words contained in different
documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
KEYWORDS
Disambiguation, Recommendation Systems, Term-Frequency Inverse-Document-Frequency, Online
Newspapers, Text Mining
1. INTRODUCTION
When reading newspapers or documents, people are usually unaware of the ambiguities in the
language they use: by using document context and their knowledge of the world they can resolve
them very quickly. When we think about how computers may tackle this problem we realize how
hard it is to solve it algorithmically.
First of all, let us define what ambiguity is. Ambiguity is the property of being ambiguous (as
with homonyms); a word, term, notation, sign, symbol, phrase, sentence, or any other form used
for communication is called ambiguous if it can be interpreted in more than one way. Ambiguity is
different from vagueness, which arises when the boundaries of meaning are indistinct. Ambiguity
is context-dependent: the same linguistic item (be it a word, phrase, or sentence) may be
ambiguous in one context and unambiguous in another context. In general, something is
ambiguous when it has more than one meaning. When a single word is associated with multiple
senses we have lexical ambiguity. Almost any word has more than one meaning. For example,
words like "wound", "produce", "lead", and "desert" have multiple meanings.
In this paper we are interested in resolving this form of ambiguity. Other forms of ambiguity are
syntactic ambiguity, in which a sentence can be parsed in different ways (e.g., "He ate the cookies
on the couch"), and semantic ambiguity, in which a word may have different meanings when used
in an informal or idiomatic expression.
For various applications, such as information retrieval or machine translation, it is important to be
able to distinguish between the different senses of a word. We address the problem of deciding
whether two documents possibly containing ambiguous words talk about the same topic or not.
Documents are represented using the bag-of-words model and the term frequency-inverse
document frequency metric. In this model, one can determine whether two documents treat the
same topic by computing the cosine similarity of the vectors representing the documents.
We present three disambiguation algorithms (which have been included in an article
recommendation system) that aim at discerning whether a word shared by two documents is used
with the same meaning or with different meanings in the two documents. We remark that our
focus is not the investigation of the meaning of the words. Our goal is to improve the estimate of
the similarity between documents by heuristically taking out from the similarity computation
most of the words that are probably used with different meanings in different documents. To the
best of our knowledge, ours is the first application in the literature of ambiguity-detection
techniques to article recommendation.
This paper is organized as follows: Section 2 presents an overview and a literature survey of the
problem of word sense disambiguation. Section 3 introduces the term frequency-inverse
document frequency metric (tfidf) used throughout the paper. Section 4 presents some sense
disambiguation issues related to the tfidf that we had to face. Section 5 describes the multi-language
algorithms we propose. Section 6 describes the article recommendation system we have used and
some results from real data. Finally, Section 7 contains some final remarks on the presented
findings.
2. RELATED WORK
Semantic relations (e.g., "has-part", "is-a", etc.) have been used to measure the semantic distance
between words, which can be used to disambiguate the meaning of words in texts. Semantic
fields (sets of words grouped by topic) have also proved very useful for the
disambiguation task. Therefore, with tools like WordNet [13] and MultiWordNet [5] we can
work on several problems of computational linguistics in many different languages.
An algorithm created for one language is usually generalizable to the other languages.
WordNet has been widely used to solve a large number of text mining problems such as
document clustering [8]. Below we report some past work where WordNet and MultiWordNet
have been used in approaching the problem of word sense disambiguation (WSD).
The earliest attempts to use WordNet in word sense disambiguation were in the field of
information retrieval. Fragos et al. [4] used a dictionary to disambiguate a specific word
appearing in a context. Sense definitions of the specific word, "synset" definitions, the "is-a"
relation, and definitions of the context features (words in the same sentence) are retrieved from
the WordNet database and used as an input for a disambiguation algorithm. In [11] the authors
used WordNet for WSD. Unlike many other approaches in that area, it exploits the structure of
WordNet in an indirect manner. To disambiguate the words it measures the semantic similarity of
the words' glosses. The similarity is calculated using the SynPath algorithm. Its essence is the
replacement of each word by a sequence of WordNet synset identifiers that describe related
concepts. To measure the similarity of such sequences the standard tfidf formula is used. In [6]
the authors deal with the problem of providing users with cross-language recommendations by
comparing two different content-based techniques: the first relies on a knowledge-based word
sense disambiguation algorithm that uses MultiWordNet as sense inventory, while the second is
based on the so-called distributional hypothesis and exploits a dimensionality reduction technique
called Random Indexing in order to build language-independent user profiles. In [12] the authors
propose a method to find Near-Synonyms and Similar-Looking (NSSL) words and design three
experiments to investigate whether NSSL matching exercises could increase Chinese EFL
learners' awareness of NSSL words.
Chen et al. [3] proposed combining WordNet and ConceptNet for WSD. First, they
present a novel method to automatically disambiguate the concepts in ConceptNet; then they
enrich WordNet with large amounts of semantic relations from the disambiguated ConceptNet for
WSD.
One of the word sense disambiguation algorithms described in this paper (the one which uses
MultiWordNet) builds upon a recent approach [1, 2, 6] in which a method for solving the
semantic ambiguity of all the words contained in a text is presented. The authors propose a hybrid
WSD algorithm that combines a knowledge-based WSD algorithm, called JIGSAW, designed to
work by exploiting WordNet-like dictionaries as a sense repository, with a supervised machine
learning algorithm (a K-Nearest Neighbor classifier). WordNet-like dictionaries combine a
dictionary with a structured semantic network, supplying definitions for the different senses and
defining groups of synonymous words by means of synsets, which represent distinct lexical
concepts.
MultiWordNet
MultiWordNet [5, 7] is a multilingual lexical database developed at "Fondazione Bruno Kessler"
in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6 [13]. The current
version includes around 44,400 Italian lemmas organized into 35,400 synsets which are aligned,
whenever possible, with their corresponding English Princeton synsets. The MultiWordNet
database can be freely browsed through its online interface, and is distributed both for research
and commercial use [5]. The Italian synsets are created in correspondence with the Princeton
WordNet synsets, whenever possible, and semantic relations are imported from the corresponding
English synsets; i.e., it is assumed that if there are two synsets in the Princeton WordNet and a
relation holding between them, the same relation holds between the corresponding Italian synsets.
While the project stresses the usefulness of a strict alignment between wordnets of different
languages, the multilingual hierarchy implemented is able to represent true lexical idiosyncrasies
between languages, such as lexical gaps and denotation differences.
The information contained in the database can be browsed through the MultiWordNet online
browser, which facilitates the comparison of the lexica of the aligned languages.
Synsets are the most important units in MultiWordNet. Here is an example of an Italian synset for
the word "computer": “elaboratore”, “computer”, “cervello elettronico”, “calcolatore”. Important
information is attached to synsets, such as semantic fields and semantic relations. The semantic
field describes the topic of a synset. For example, the above synset belongs to the "Computer
Science" semantic field.
3. TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY
We compute the similarity between two documents using the well-known bag-of-words model
[10]. This model does not consider the ordering of the words in a document. For example, the two
sentences "John is quicker than Mary" and "Mary is quicker than John" have the same
representation.
In the following, W is the set of all words, and A is the set of documents. We assume that both
W and A are finite. In an actual implementation, the set of all words usually does not contain the
so-called stop words, which are frequent words with too general a meaning, such as "and", "is",
"the", etc. Moreover, words may optionally be stemmed using, for instance, Porter's stemming
algorithm [9].
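As an illustration, here is a minimal Python sketch of this preprocessing step. The use of NLTK's Porter stemmer and the tiny stop-word list are our choices for the example, not prescribed by the paper:

    # Minimal preprocessing sketch: tokenize, drop stop words, stem.
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"and", "is", "the", "of", "a", "in"}  # tiny illustrative list

    def preprocess(text):
        stemmer = PorterStemmer()
        # Lowercase alphabetic tokens only; real systems need better tokenization.
        tokens = [t.lower() for t in text.split() if t.isalpha()]
        return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

    print(preprocess("The virus infected the computer"))
    # e.g. ['virus', 'infect', 'comput']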
Let a be a document, and let w be a word. The term frequency tf(w, a) of w in a is the number of
occurrences of w in a. The raw term frequency does not give enough information to compare two
documents. The key point is to consider rare terms.
Let w be a word. The inverse document frequency idf(w, A) of w is given by

    idf(w, A) = log( |A| / (|Aw| + 1) ),

where Aw is the set of documents containing the word w. For the sake of simplicity we will
consider A an implicit argument in the subsequent definitions. Let w be a word, and let a be a
document. The term frequency-inverse document frequency tfidf(w, a) of w in a is given by

    tfidf(w, a) = tf(w, a) · idf(w).
Given a document a, we represent its content with a function fa : W → R. Let

    fa(w) = f′a(w) / |f′a|,

where

    f′a(w) = tfidf(w, a)   and   |f′a| = √( ∑w′∈W [f′a(w′)]² ).
Note that this representation does not consider the ordering of words in documents. Moreover,
this representation gives more importance to words that occur frequently within a document (the
term frequency part), and to rare words that occur in few documents (the inverse document
frequency part). We can also think of fa and f′a as vectors whose i-th component corresponds to
the i-th word w in W, and of |f′a| as the Euclidean norm of f′a.
Given two documents a and b, the cosine similarity of a and b is defined as the scalar product
of fa and fb:

    σ(a, b) = fa · fb = ∑w∈W fa(w) · fb(w).
Note that two documents are similar when they have many words in common. In particular, the
cosine similarity is higher when the words that the two documents have in common have a high
term-frequency and a high inverse-document frequency. Other similarity measures may be
considered such as the Pearson correlation index (equivalent to the cosine similarity if the average
value is assumed to be zero).
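The following Python sketch puts the definitions of this section together. The corpus representation (lists of pre-tokenized words) and all names are illustrative; the +1 smoothing in the idf denominator follows the formula as given above:

    import math
    from collections import Counter

    def idf(word, corpus):
        # idf(w, A) = log(|A| / (|Aw| + 1)), Aw = documents containing w.
        n_containing = sum(1 for doc in corpus if word in doc)
        return math.log(len(corpus) / (n_containing + 1))

    def tfidf_vector(doc, corpus):
        # f'a(w) = tf(w, a) * idf(w), normalized to unit Euclidean norm.
        tf = Counter(doc)
        raw = {w: count * idf(w, corpus) for w, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        return {w: v / norm for w, v in raw.items()} if norm else raw

    def cosine_similarity(fa, fb):
        # sigma(a, b): only words shared by both documents contribute.
        return sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())

    corpus = [["virus", "infected", "computer"],
              ["hiv", "virus", "cause", "aids"],
              ["stock", "market", "crash"],
              ["weather", "sunny", "today"]]
    fa, fb = tfidf_vector(corpus[0], corpus), tfidf_vector(corpus[1], corpus)
    print(cosine_similarity(fa, fb))  # small positive value, driven by "virus"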
4. RECOMMENDING NEWS DOCUMENTS USING THE TFIDF METRIC IN AN ARTICLE RECOMMENDATION SYSTEM
In this section we describe how the article recommendation system (see Section 6 for more
details) uses the cosine similarity with the tfidf metric to compare documents.
Each time a new document d is processed, its similarity with respect to all other documents is
computed. For each document d, we keep in memory its k-nearest-neighbor document list
according to the cosine similarity: the first entry contains the most similar document to d whereas
the last entry contains the least similar. However, it may happen that, for a given document d,
some of the k document entries are not related to d at all. This may happen when a word
with a high tfidf value has different meanings in different documents. In this case we should
declare the word ambiguous and take it out of the similarity computation.
Therefore, we need to perform a word disambiguation task between a given document d and each
document entry in its k-nearest-neighbor document list. Formally, we need to make sure that in
the pairs (d, d1), (d, d2), …, (d, dk) no ambiguous words with a high tfidf value are used for the
computation of the similarity. Ambiguous words with a low tfidf value do not significantly affect
the similarity of a given document pair (d, di); hence we do not need to take them into account.
Given a document pair (d, di), we first multiply the tfidf values of words occurring in both
documents and sort the resulting values in decreasing order. Then, we consider only the resulting
terms with a value greater than a fixed threshold imp_weight. We will refer to such terms as the
imp_words set. For the documents we have taken under consideration, the cardinality of
imp_words ranged between five and ten.
As discussed above, the intuition behind this is that words with a high tfidf value highly
contribute to the overall score of the similarity between the two documents, and, consequently,
highly contribute to the creation of the k-nearest-neighbor list.
If a word in imp_words is declared ambiguous for a pair of documents (d, di), we remove it from
further computation of the similarity of that pair. If the value associated with the ambiguous word
is high enough, a new computation of the k-nearest documents of d will shift the document di to a
more distant position in the nearest-neighbor list; that is, di will be declared less similar to d
than it was previously.
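A minimal sketch of the imp_words selection step, assuming the tfidf vectors of Section 3. The paper does not report the imp_weight value it uses, so 0.05 below is only a placeholder:

    def imp_words(fa, fb, imp_weight=0.05):
        # Multiply the tfidf values of words shared by both documents,
        # sort the products in decreasing order, and keep those above
        # the imp_weight threshold.
        products = {w: fa[w] * fb[w] for w in fa.keys() & fb.keys()}
        ranked = sorted(products.items(), key=lambda kv: kv[1], reverse=True)
        return [w for w, value in ranked if value > imp_weight]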
5. THE DISAMBIGUATION STRATEGIES
Given two documents d1 and d2, and a word w in the imp_words set, we want to decide whether
the word w is ambiguous for d1 and d2. To accomplish this task, we have developed three
strategies: (i) the uppercase words strategy, (ii) the word pair frequency strategy, and (iii) the
MultiWordNet knowledge strategy. Each strategy consists of an algorithm returning one of three
possible values: ambiguous, unambiguous, or undecided. The last value is returned when the
algorithm is not able to decide whether the word w is ambiguous or unambiguous for the two
documents d1 and d2.
Uppercase words strategy (UWS)
The uppercase words strategy handles words that are in uppercase, that is, those words denoting
named entities such as persons, companies or locations. The idea is that there is ambiguity if an
uppercase word w denotes two different entities in the two documents. This can be checked by
inspecting the other uppercase words near w in the two documents.
Let d1 and d2 be two documents both containing a word w. The word w may be in uppercase in at
least one of the two documents. Algorithm 1 attempts to decide whether w is ambiguous with
respect to d1 and d2 using the uppercase word strategy.
function UWS(d1, d2, w)
  if w is uppercase in one document and lowercase in the other then
    return ambiguous
  else if w is lowercase in both documents then
    return unambiguous
  else // at this point, w is uppercase in both documents
    if there are uppercase words near w in both documents then
      if these uppercase words are the same then
        return unambiguous
      else
        return ambiguous
    else
      return undecided
Algorithm 1: Uppercase word strategy.
Examples
1. Let d1 be a document about an Italian company whose name is "Guru". Let also d2 be a
document where the common noun "guru" is used. Since the word "guru" is used in
uppercase in d1 and in lowercase in d2, the uppercase word strategy declares the word
"guru" as ambiguous for d1 and d2.
2. Let d1 be a document about "Carlos Santana", a famous Mexican guitarist. Let also d2 be
a document about "Mario Alberto Santana", a famous Italian soccer player. In this case,
the uppercase word strategy declares the word "Santana" as ambiguous for d1 and d2.
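A possible Python rendering of Algorithm 1. The window size and the use of set intersection to test whether the nearby uppercase words "are the same" are our interpretation, not spelled out by the paper:

    def occurs_uppercase(tokens, w):
        # True if some occurrence of w in the document is capitalized.
        return any(t.lower() == w.lower() and t[0].isupper() for t in tokens)

    def nearby_uppercase(tokens, w, window=3):
        # Uppercase tokens within `window` positions of an occurrence of w.
        near = set()
        for i, t in enumerate(tokens):
            if t.lower() == w.lower():
                for n in tokens[max(0, i - window):i + window + 1]:
                    if n[0].isupper() and n.lower() != w.lower():
                        near.add(n)
        return near

    def uws(d1_tokens, d2_tokens, w):
        up1, up2 = occurs_uppercase(d1_tokens, w), occurs_uppercase(d2_tokens, w)
        if up1 != up2:
            return "ambiguous"      # uppercase in one document, lowercase in the other
        if not up1:
            return "unambiguous"    # lowercase in both documents
        near1 = nearby_uppercase(d1_tokens, w)
        near2 = nearby_uppercase(d2_tokens, w)
        if near1 and near2:
            return "unambiguous" if near1 & near2 else "ambiguous"
        return "undecided"

On the second example above, the nearby uppercase words are {"Carlos"} and {"Mario", "Alberto"}; the intersection is empty, so the sketch returns ambiguous.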
Word pair frequency strategy (WPFS)
The word pair frequency strategy uses two-word expressions, or word pairs, to try to understand
when a word can be classified as ambiguous. The idea behind this strategy is that if a word is used
in the same expression within two different documents, then the two documents refer to the same
concept.
Given our corpus of documents, we create a set WP of commonly occurring word pairs. A word
pair, denoted as w1|w2, is an expression of two words separated by a space. A word pair w1|w2
belongs to the set WP if and only if (i) the word wi is not a stop word, for i = 1, 2, (ii) the word wi
occurs in the corpus at least a predefined minimum number of times supp, for i = 1,2, and (iii) the
expression w1|w2 occurs in the corpus at least a predefined minimum number of times
min_support.
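A sketch of how the WP set might be built; adjacent-token pairs stand in for the paper's two-word expressions, and the supp and min_support values below are placeholders:

    from collections import Counter

    def build_word_pairs(corpus_tokens, stop_words, supp=50, min_support=10):
        # Count individual words and adjacent word pairs over the corpus,
        # then keep the pairs satisfying conditions (i)-(iii) above.
        word_counts, pair_counts = Counter(), Counter()
        for tokens in corpus_tokens:
            word_counts.update(tokens)
            pair_counts.update(zip(tokens, tokens[1:]))
        return {pair for pair, c in pair_counts.items()
                if c >= min_support
                and all(w not in stop_words and word_counts[w] >= supp
                        for w in pair)}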
Let d1 and d2 be two documents both containing a word w. Algorithm 2 attempts to decide
whether w is ambiguous for d1 and d2 using the word pair frequency strategy.
function WPFS(d1, d2, w)
  if w|w′ ∈ WP occurs in both d1 and d2, for some word w′ then
    return unambiguous
  else if w′|w ∈ WP occurs in both d1 and d2, for some word w′ then
    return unambiguous
  else if w|w′ ∈ WP occurs in d1 and w|w′′ ∈ WP occurs in d2, for some distinct words w′, w′′ then
    return ambiguous
  else if w′|w ∈ WP occurs in d1 and w′′|w ∈ WP occurs in d2, for some distinct words w′, w′′ then
    return ambiguous
  else
    return undecided
Algorithm 2: Word pair frequency strategy.
Example
Let d1 be a document containing the expression “lodo Alfano”, and let d2 be a document
containing the expression “lodo Schifani”. Assume that both expressions are present in our
dictionary of commonly recurrent expressions.
In this case, the word pair frequency strategy declares that the word "lodo" is ambiguous for d1
and d2.
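Algorithm 2 can be rendered in Python along these lines; pairs1 and pairs2 are the sets of adjacent word pairs found in d1 and d2, WP is the set built above, and all names are illustrative:

    def wpfs(pairs1, pairs2, WP, w):
        # Known expressions in which w occurs, on either side, per document.
        left1 = {p for p in pairs1 & WP if p[0] == w}   # w|w' in d1
        left2 = {p for p in pairs2 & WP if p[0] == w}   # w|w' in d2
        right1 = {p for p in pairs1 & WP if p[1] == w}  # w'|w in d1
        right2 = {p for p in pairs2 & WP if p[1] == w}  # w'|w in d2
        if left1 & left2 or right1 & right2:
            return "unambiguous"   # the same expression occurs in both documents
        if (left1 and left2) or (right1 and right2):
            return "ambiguous"     # w pairs with different partner words
        return "undecided"

On the "lodo" example, left1 = {("lodo", "alfano")} and left2 = {("lodo", "schifani")}: the sets are disjoint but both non-empty, so the sketch returns ambiguous.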
MultiWordNet knowledge strategy (MKS)
The third strategy exploits the MultiWordNet semantic network. The idea is that, in order to
disambiguate a word w with respect to two documents d1 and d2, we look at the semantic fields
associated with the senses of the words near w in the two documents. If the word w is used in one
document in a context entailing a set of semantic fields which is different from the one entailed
by the other document, the word w is ambiguous.
Let d1 and d2 be two documents both containing a word w. Algorithm 3 attempts to decide
whether w is ambiguous for d1 and d2 using the MultiWordNet knowledge strategy.
function MKS(d1, d2, w)
  v1 = BuildContextVector(d1, w)
  v2 = BuildContextVector(d2, w)
  s = σ(v1, v2) // σ denotes the cosine similarity function
  if s < min_mw then
    return ambiguous
  else if s < max_mw then
    return undecided
  else
    return unambiguous

function BuildContextVector(d, w)
  let v be an empty vector
  for each word w′ in a sentence of d containing w do
    for each sense s associated with the word w′ do
      let f be the semantic field of the sense s
      if f is the Factotum semantic field then
        v[f] = v[f] + 0.1 · tfidf(w′, d)
      else
        v[f] = v[f] + tfidf(w′, d)
  normalize v so that its norm is 1
  return v
Algorithm 3: MultiWordNet knowledge strategy.
Example
Let d1 be a document containing the following sentence, where the numbers are the tfidf values
of the words in d1 (no value is associated with the stop word "the"):

    The virus infected the computer
        0.15  0.1           0.2

Let also d2 be a document containing the following sentence:

    The HIV  virus is the cause of AIDS
        0.05 0.08         0.07    0.1

Using the information from MultiWordNet, we obtain the following vectors of semantic fields
for d1 and d2:

    d1: Factotum 0.196, Computer Science 0.981
    d2: Factotum 0.044, Biology 0.443, Law 0.089, Sociology 0.089, Medicine 0.886

The only semantic field shared by the two vectors is Factotum, so their cosine similarity is
0.196 · 0.044 ≈ 0.009. Assuming a minimum threshold min_mw of 0.2, the MultiWordNet
knowledge strategy declares that the word "virus" is ambiguous for d1 and d2.
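A sketch of Algorithm 3 in Python. Here senses_of (returning the semantic fields of each sense of a word, e.g. from MultiWordNet) and tfidf_of are hypothetical callbacks; the context window and the max_mw value are illustrative, while min_mw = 0.2 follows the example above:

    import math

    def build_context_vector(tokens, w, tfidf_of, senses_of, window=5):
        # Accumulate, per semantic field, the tfidf weight of the words near w;
        # the catch-all Factotum field is down-weighted by a factor of 0.1.
        v = {}
        for i, t in enumerate(tokens):
            if t != w:
                continue
            for wp in tokens[max(0, i - window):i + window + 1]:
                if wp == w:
                    continue
                for field in senses_of(wp):
                    factor = 0.1 if field == "Factotum" else 1.0
                    v[field] = v.get(field, 0.0) + factor * tfidf_of(wp)
        norm = math.sqrt(sum(x * x for x in v.values()))
        return {f: x / norm for f, x in v.items()} if norm else v

    def mks(v1, v2, min_mw=0.2, max_mw=0.5):
        s = sum(v1[f] * v2[f] for f in v1.keys() & v2.keys())  # cosine similarity
        if s < min_mw:
            return "ambiguous"
        if s < max_mw:
            return "undecided"
        return "unambiguous"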
6. ARTICLE RECOMMENDATION SYSTEM
We have developed an article recommendation system, which works on Italian documents and
includes the disambiguation strategies described above. The considered text documents were the
Italian news published daily by one of the most important Italian newspapers, which covers
news about politics, environment, weather, technology, world events, finance, sport, travel, etc. A
dedicated crawler continuously extracts the information about each new document (title, text, url,
section, date, etc.). The extracted documents are first cleaned of all HTML entities, spell-checked
and, finally, stored in a highly optimized database that we have designed on top of the
PostgreSQL system. The database consists of different tables containing extensive information
about the extracted documents (document id, title, content, url, date, category); moreover, for
each word there is a table containing its frequency in the dataset. On average, 50 new documents
per day are published, and, consequently, extracted by our crawler and stored into the database.
At the time of the current study, the collected documents numbered about 150,000. Given a
document d, our article recommendation system returns a list of the documents most similar to d,
filtering out any potential ambiguities.
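Putting the pieces together, a hedged sketch of how the filtering could sit in the k-nearest-neighbor computation. It reuses the imp_words sketch from Section 4, and is_ambiguous stands for a hypothetical combination of the three strategies:

    def recommend(d, vectors, is_ambiguous, k=10):
        # Rank all other documents by cosine similarity, dropping from each
        # pairwise computation the high-tfidf words declared ambiguous.
        fd = vectors[d]
        scored = []
        for other, fo in vectors.items():
            if other == d:
                continue
            shared = fd.keys() & fo.keys()
            dropped = {w for w in imp_words(fd, fo) if is_ambiguous(d, other, w)}
            score = sum(fd[w] * fo[w] for w in shared - dropped)
            scored.append((score, other))
        scored.sort(key=lambda sc: sc[0], reverse=True)
        return [doc for _, doc in scored[:k]]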
Dates Impressions Clicks CTR
2011-09-16 96142 2640 2.745%
2011-09-17 58918 1949 3.307%
2011-09-18 58365 2193 3.757%
2011-09-19 80596 2483 3.080%
2011-09-20 86306 2271 2.631%
2011-09-21 80190 2043 2.547%
2011-09-22 98253 2307 2.348%
2011-09-23 79597 1984 2.492%
2011-09-24 49353 1821 3.689%
2011-09-25 46284 1798 3.884%
Table 1: Impressions, Click and CTR for recommended articles between 2011-09-16 and 2011-09-25.
Results on real data
In Table 1 we have collected some results on real data. Once a user reads an article, a list of
suggestions is shown at the end of the article. We have counted the number of impressions and
clicks on recommended articles in one of the sites that use our article recommendation system in
the period between 2011-09-16 and 2011-09-25. There is an impression for an article a whenever
the article a appears in the list of suggestions. There is a click for an article a whenever a
suggested article a is clicked by the user. The data in Table 1 show a high click-through rate,
i.e., the ratio between clicks and impressions, which is about 3%. This means that a large number
of articles are read as a consequence of the suggestions produced by our article recommendation
system.
CONCLUSION
This paper presents a novel way to calculate document similarity using the term frequency-
inverse document frequency metric improved by three different disambiguation strategies. In
particular, we have proposed a strategy which exploits information about word case, a strategy
which uses information about the frequency of multi-word expressions, and a third strategy which
uses MultiWordNet semantic information. These disambiguation strategies can be embedded in a
system for any language (in particular, the Italian language for the results and experiments
presented in this paper). They have improved the precision of the article recommendation system
in which they have been embedded.
REFERENCES
[1] P. Basile, M. de Gemmis, A. L. Gentile, P. Lops, G. Semeraro, The jigsaw algorithm for word sense
disambiguation and semantic indexing of documents, in: Proceedings of AI*IA07, 10th Congress of
Italian Association of Artificial Intelligence, Roma, Italy, pp. 314-325.
[2] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Combining knowledge-based methods and
supervised learning for effective Italian word sense disambiguation, in: Symposium on Semantics in
Systems for Text Processing, STEP 2008, Venice, Italy, volume 1.
[3] J. Chen, J. Liu, Combining conceptnet and wordnet for word sense disambiguation, in: IJCNLP 2011.
[4] K. Fragos, I. Maistros, C. Skourlas, Word sense disambiguation using wordnet relations, in:
Proceedings of 1st Balkan Conference in Informatics. Thessaloniki, Greece.
[5] MultiWordNet, http://multiwordnet.itc.it/english/home.php.
[6] C. Musto, F. Narducci, P. Basile, P. Lops, M. de Gemmis, G. Semeraro, Comparing word sense
disambiguation and distributional models for cross-language information filtering, in: Proceedings of
the 3rd Italian Information Retrieval Workshop (IIR 2012), Bari, Italy, pp. 117-120.
[7] E. Pianta, L. Bentivogli, C. Girardi, Multiwordnet: developing an aligned multilingual database, in:
Proceedings of the First International Conference on Global WordNet, Mysore, India, pp. 293-302.
[8] D. Reforgiato, A new unsupervised method for document clustering by using WordNet lexical and
conceptual relations, Journal of Information Retrieval, Springer Netherlands, pp. 563-579.
[9] M. F. Porter, An algorithm for suffix stripping, Program 14 (1980) 130-137.
[10] G. Salton, J. McGill, Introduction to modern information retrieval, McGraw-Hill, ISBN 0070544840.
[11] A. Sieminski, Wordnet based word sense disambiguation, in: Proceedings of the Third international
conference on Computational collective intelligence: technologies and applications - Volume Part II,
ICCCI'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 405-414.
[12] K.-T. Sun, Y.-M. Huang, M.-C. Liu, A WordNet-based near-synonyms and similar-looking word
learning system, Educational Technology and Society 14 (2011) 121-134.
[13] WordNet, http://wordnet.princeton.edu/, 1996.