Reverse dictionaries are widely used as reference works organized by concepts, phrases, or the definitions of words. This paper describes the many challenges inherent in building a reverse lexicon and maps its drawbacks to the well-known conceptual similarity problem. Web search engines are basic versions of such a system; they take advantage of huge scale, which permits inferring general interest in documents from link information. The paper presents a basic study of a database-driven reverse dictionary using three large-scale datasets, namely person names, general English words, and biomedical concepts, and analyzes difficulties arising in the use of documents produced by a reverse dictionary.
This document summarizes and compares two computerized methods for analyzing political text documents: Wordscores and Wordfish. Wordscores relies on selecting reference texts to anchor its analysis, while Wordfish treats each text as its own data point without reference texts. Studies have found Wordscores performs better when positions are less polarized or data is limited, while Wordfish performs equally well or better when positions are more polarized. Both methods have limitations, such as Wordscores' inability to reliably analyze changes in word meaning over time and Wordfish's treatment of all words equally. Overall, the document evaluates the strengths and weaknesses of these two automated content analysis techniques.
Information retrieval systems use indexes and inverted indexes to quickly search large document collections by mapping terms to their locations. Boolean retrieval uses an inverted index to process Boolean queries by intersecting postings lists to find documents that contain sets of terms. Key aspects of information retrieval systems include precision, recall, and ranking search results by relevance.
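As a minimal illustration of that Boolean AND step (the three-document corpus and the query terms here are invented, not taken from the summarized material):

```python
# Build an inverted index: term -> sorted list of document IDs (postings).
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)  # doc IDs are appended in sorted order

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Boolean AND query: documents containing both "home" and "july".
print(intersect(index["home"], index["july"]))  # -> [2, 3]
```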
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document introduces inverted files, which are a core data structure for text search engines. It describes inverted files and how they allow for efficient indexing, construction, and querying. The document then outlines some common extensions to inverted file indexes, such as compression, phrase querying, and distribution. It concludes by providing context on text search and information retrieval.
The document discusses text mining, including defining it as the extraction of information from unstructured text using computational methods. It covers topics such as structured vs unstructured data, common text mining practice areas like information retrieval and document clustering, and challenges in text mining including ambiguity in language. Pre-processing techniques for text mining are also outlined, such as normalization, tokenization, stemming and removing stop words to clean and prepare text for analysis.
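As a rough sketch of such a pre-processing pipeline (the stop-word list and suffix rules below are tiny stand-ins, not real linguistic resources):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "from"}  # illustrative subset

def preprocess(text):
    # Normalization: lowercase, then tokenize on letter runs (strips punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    # Crude stemming: strip a few common English suffixes.
    stems = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 3:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("Mining is the extraction of information from unstructured texts."))
# -> ['mining', 'extraction', 'information', 'unstructur', 'text']
```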
What can corpus software do? (Routledge, chapter 11) - RajpootBhatti5
Corpus software can perform several functions to analyze text data:
1. It can generate concordances to locate words or phrases within texts and show surrounding context. Concordances are generated either by processing texts on-the-fly or building an index of word locations.
2. It can create word lists by identifying words as alphanumeric strings separated by non-alphanumeric characters like spaces.
3. It can identify key words that occur unusually frequently in a given text by comparing word frequencies to a reference corpus. This helps find important or distinguishing terms.
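A hedged sketch of that keyword comparison using a simple per-million frequency ratio (real corpus tools use log-likelihood or similar statistics; all counts below are invented):

```python
# Compare per-million frequencies in a study corpus against a reference corpus.
study = {"tokens": 50_000, "counts": {"corpus": 120, "the": 3_100, "concordance": 45}}
reference = {"tokens": 1_000_000, "counts": {"corpus": 80, "the": 61_000, "concordance": 2}}

def per_million(count, total):
    return count / total * 1_000_000

for word in study["counts"]:
    s = per_million(study["counts"][word], study["tokens"])
    r = per_million(reference["counts"].get(word, 0) + 1, reference["tokens"])  # +1 smoothing
    print(f"{word}: keyness ratio {s / r:.1f}")  # high ratio = candidate keyword
```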
Ontologies have been applied in many applications in recent years, especially the Semantic Web, information retrieval, information extraction, and question answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, together with their definitions and interrelationships. This paper describes some algorithms for identifying semantic relations and constructing an Information Technology ontology while extracting concepts and objects from different sources. The ontology is constructed from three main resources: ACM, Wikipedia, and unstructured files from the ACM Digital Library. Our algorithms combine natural language processing and machine learning. We use natural language processing tools such as OpenNLP and the Stanford lexical dependency parser to explore sentences, then extract sentences matching English patterns to build a training set. We use a random sample from 245 ACM categories to evaluate our results, and the results show that our system yields superior performance.
1) The paper proposes an efficient Tamil text compaction system that reduces Tamil text to around 40% of the original by identifying word categories and mapping words to compact forms while maintaining meaning.
2) The system handles common Tamil words, abbreviations/acronyms, and numbers by using a morphological analyzer to identify word roots and a generator to re-add suffixes. Compact forms are retrieved from mappings stored in data structures like trees and hashmaps; a toy lookup sketch follows this list.
3) Testing on over 10,000 words showed the final text was reduced to 40% of the original size, providing a more efficient way to communicate in Tamil on platforms with character limits like social media and text messages.
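The paper's mapping tables are not reproduced here; the core lookup step might look like the following sketch, with a purely hypothetical word-to-compact-form table using English placeholders:

```python
# Hypothetical mapping of full forms to compact forms (placeholders, not real Tamil data).
compact_map = {
    "information": "info",
    "television": "tv",
    "department": "dept",
}

def compact(text):
    # Replace each word with its compact form when a mapping exists.
    return " ".join(compact_map.get(w, w) for w in text.split())

print(compact("the television department"))  # -> "the tv dept"
```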
Design of A Spell Corrector For Hausa Language - Waqas Tariq
In this article, a spell corrector is designed for the Hausa language, the second most widely spoken language in Africa, which does not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of Hausa. The corrector operates essentially on Mijinguini's dictionary and the characteristics of the Hausa alphabet. After a brief review of spell-checking and spell-correcting techniques and the state of the art in Hausa language processing, we opted for the trie and hash table data structures to represent the dictionary. Edit distance and the specificities of the Hausa alphabet are used to detect and correct spelling errors. The spell corrector has been implemented in a special editor developed for that purpose (LyTexEditor) and as an extension (add-on) for OpenOffice.org. A comparison was made of the performance of the two data structures used.
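A compact sketch of the dictionary-plus-edit-distance approach described above (the entries in the word list are placeholders, not Mijinguini's dictionary):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

dictionary = {"gida", "ruwa", "sannu", "yara"}  # placeholder entries

def suggest(word, limit=3):
    if word in dictionary:
        return [word]  # spelled correctly, no correction needed
    return sorted(dictionary, key=lambda w: edit_distance(word, w))[:limit]

print(suggest("sanu"))  # nearest dictionary entries by edit distance
```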
14. Michael Oakes (UoW) Natural Language Processing for Translation - RIILP
This document discusses information retrieval and describes its three main phases: 1) asking a question to define an information need, 2) constructing an answer by matching queries to documents, and 3) assessing the relevance of the retrieved answers. It also covers several important information retrieval concepts like keywords, indexing documents, stemming words, calculating TF-IDF weights, and evaluating system performance using recall and precision.
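The recall and precision it mentions reduce to two set ratios; a minimal sketch with invented document IDs:

```python
retrieved = {1, 2, 3, 5, 8}      # documents the system returned
relevant = {2, 3, 4, 8, 9, 10}   # documents judged relevant

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of returned docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were returned
print(f"precision={precision:.2f}, recall={recall:.2f}")  # -> 0.60, 0.50
```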
The document discusses various natural language processing (NLP) techniques including implementing search, document level analysis, sentence level analysis, and concept extraction. It provides details on tokenization, word normalization, stop word removal, stemming, evaluating search results, parsing and part-of-speech tagging, entity extraction, word sense disambiguation, concept extraction, dependency analysis, coreference, question parsing systems, and sentiment analysis. Implementation details and useful tools are mentioned for various techniques.
This document describes a study that analyzed features for a supervised transition-based dependency parser on the Latin Dependency Treebank. It found that using part-of-speech and case features achieved the highest accuracy. The corpus and parsing approach are described, including how dependency graphs are encoded and the transition system used to parse sentences. Projective and non-projective graphs are distinguished, and roughly half the sentences in the corpus exhibited non-projective structures.
Author credits - Maaz Nomani
This paper presents our work on the annotation of intra-chunk dependencies on an English treebank that was previously annotated with inter-chunk dependencies. Given a natural language sentence, intra-chunk dependencies show relationships between the words in a chunk, while inter-chunk dependencies show relationships among the chunks. For example, in the sentence “James cut an apple with a knife”, chunking results in (James) (cut) (an apple) (with a knife); inter-chunk dependencies show dependency relations among these four chunks, and intra-chunk dependencies show dependency relations between the words within each chunk.
This exercise provides fully parsed dependency trees for the English treebank. We also report an analysis of inter-annotator agreement for this chunk expansion task. Further, these fully expanded parallel Hindi and English treebanks were word-aligned, and an analysis of that task is given. Issues related to intra-chunk expansion and alignment for the Hindi-English language pair are discussed, and guidelines for these tasks have been prepared and released.
What are the basics of analysing a corpus? (Routledge, chapter 10) - RajpootBhatti5
This document provides an overview of the basics of analyzing a corpus through various techniques including frequency analysis, normalization, keyword analysis, and concordance analysis. It explains that frequency lists show how often words occur, normalization adjusts for corpus size differences, keyword analysis finds statistically significant words compared to a reference corpus, and concordance analysis displays keywords in context to better understand usage. The document serves as an introduction to basic corpus analysis methods and tools.
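A rough sketch of two of those operations on a toy text: a keyword-in-context (KWIC) concordance and per-million frequency normalization:

```python
text = ("corpus analysis begins with a frequency list and a concordance "
        "so that each use of a word in the corpus can be read in context")
tokens = text.split()

def kwic(tokens, keyword, window=3):
    """Show each hit with `window` words of context on either side."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>30} [{keyword}] {right}")

kwic(tokens, "corpus")

# Normalized frequency: occurrences per million tokens, so corpora of
# different sizes can be compared directly.
freq_pm = tokens.count("corpus") / len(tokens) * 1_000_000
print(f"'corpus': {freq_pm:.0f} per million tokens")
```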
The document describes Agaraadhi, a novel online dictionary framework for the Tamil language. The framework indexes over 3 lakh Tamil words, providing morphological analysis, word usage statistics, translations to English, and more. It consists of online and offline components that together enable features like spelling correction, word suggestions, analyzing word usage in literature and social media, and games to support learning. The framework aims to provide more robust Tamil language reference than existing dictionaries.
This document defines and summarizes key terms in corpus linguistics. It discusses bootstrapping, the Brill tagger, competence-performance dichotomy, computational linguistics, computer assisted language learning, corpus linguistics, extensible markup language, Penn Treebank, Kolhapur Corpus, Hyderabad Corpus, Text Encoding Initiative, Unicode, Linguistic Data Consortium, and alignment.
Natural Language Processing, Techniques, Current Trends and Applications in I... - RajkiranVeluri
The document discusses natural language processing (NLP) techniques, current trends, and applications in industry. It covers common NLP techniques like morphology, syntax, semantics, and pragmatics. It also discusses word embeddings like Word2Vec and contextual embeddings like BERT. Finally, it discusses applications of NLP in healthcare like analyzing clinical notes and brand monitoring through sentiment analysis of user reviews.
16. Anne Schumann (USAAR) Terminology and Ontologies 1 - RIILP
This document provides an overview of terminology and ontologies. It discusses why terminology is important, including for expert communication, knowledge transfer, and management. Terms are defined as linguistic symbols that represent concepts, with the relationship between terms and concepts being one-to-one in terminology. Conceptual relations between concepts are also discussed, including hierarchical relations like "is-a" that define a concept's location within a concept system. The document emphasizes that terminology work should be concept-oriented, structuring concepts into organized concept systems.
Diacritic Oriented Arabic Information Retrieval System - CSCJournals
Arabic language support in search engines and operating systems has improved in recent years. Searching the Internet in Arabic is reliable and comparable to the excellent support for several other languages, including English. For text with diacritics, however, there are limitations; for this reason, most information retrieval (IR) systems remove diacritics from text and ignore them because of their complexity. Searching text with diacritics is important for some kinds of documents, such as religious books, some newspapers, and children's stories. This research presents the design and development of a system that overcomes this problem by taking diacritics into account. The proposed system places the design complexity in the retrieval algorithm rather than in the information repository, which in this study is a database. The study also analyses the results and the performance: results are promising, and the performance analysis suggests ways to enhance the design and increase performance. The proposed system can be integrated into search engines, text editors, and any information retrieval system that handles Arabic text, and the performance analysis shows that it is reliable. It is applied here to a database of Hadeeth, a religious book containing the Prophet's actions and statements, but it can be applied to any kind of data repository.
Presentation given at ISWC2008. It analyzes complex network characteristics of dependence between terms (i.e. classes and properties) on the Semantic Web as well as dependence between Web ontologies.
Detailed presentation on various analytical tools widely used in Corpus Linguistics for corpora analysis including WORDCRUNCHER, LEXA, CWB, TACT, MICROCONCORD etc.
Survey on Indian CLIR and MT systems in Marathi Language - Editor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native language. The machine translation based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
17. Anne Schumann (USAAR) Terminology and Ontologies 2 - RIILP
This document discusses current research topics in terminology and ontologies. It covers trends like term variation, culture-specific semantic differences, definitions, contexts, and knowledge-rich contexts. It also discusses term extraction and mapping. Key areas of research include improving techniques for specialised domains, identifying term variants, providing richer semantic descriptions, and supporting terminological workflows and users.
Reference List Citations - APA 6th Edition - Janice Orcutt
This document provides information on APA citation rules, including how to format reference list citations for different source types such as periodicals, books, and journal articles. It discusses the elements included in citations, such as author, date, title, and publisher. Ordering principles for reference lists are also covered, such as alphabetical ordering and distinguishing works by the same author. Examples are provided to illustrate different citation formats.
This document describes a project to mine named entities from Wikipedia. It discusses using Wikipedia's internal links, redirect links, external links, and categories to identify named entities and their synonyms with high accuracy. It presents an algorithm for generic named entity recognition that classifies Wikipedia entries based on capitalization, title formatting, and other features. The project aims to build a search system that matches queries to candidates using vector space modeling and considers contextual windows around search terms.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents - ijsc
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a number of sentences as a limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents, then combine scores obtained by the GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) to extract keywords, and later use these for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
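A toy sketch of the TF-IDF side of that sentence scoring (the GSS coefficients and the categorization step are omitted, and the sentences are invented):

```python
import math

sentences = [
    "kannada documents are collected from online resources",
    "keywords act as indices for a document",
    "keywords reflect the content of kannada documents",
]
docs = [s.split() for s in sentences]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # sentence-level document frequency
    return tf * math.log(len(docs) / df)

# Score each sentence by the summed TF-IDF of its terms, then rank.
scores = [(sum(tf_idf(t, d, docs) for t in set(d)), s)
          for d, s in zip(docs, sentences)]
for score, s in sorted(scores, reverse=True)[:2]:  # 2-sentence "summary"
    print(f"{score:.2f}  {s}")
```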
The document describes PRECIS (PREserved Context Indexing System), an indexing system developed in the 1970s. It aims to represent meaning in index entries without disturbing user understanding. PRECIS uses role operators and strings of terms to preserve context across permuted index entries. It was used for indexing the British National Bibliography but was replaced by COMPASS in 1990. PRECIS requires analyzing documents, organizing concepts, and assigning role codes to terms to generate automated two-line index entries preserving semantics and syntax.
POPSI (Postulate based permuted subject indexing) is a pre-coordinate indexing system developed by G. Bhattacharyya that uses an analytic-synthetic method and permutation of terms to approach documents from different perspectives. It is based on Ranganathan's postulates and classification principles. POPSI helps formulate subject headings, derive index entries, determine subject queries, and formulate search strategies. The main POPSI table contains notation used in the indexing process. Key steps include analysis, formalization, modulation, standardization, and generating organized and associative classification entries and references.
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON - IJCSEA Journal
The massive growth of modern information retrieval systems (IRS), especially for natural languages, makes search increasingly difficult, and search in Arabic, as a natural language, is not yet good enough. This paper builds a similar thesaurus based on the Arabic language using two mechanisms, the first using full words and the other using stemmed words, and then compares them. The comparison made in this study shows that the similar thesaurus using the stemmed mechanism obtains better results than the traditional approach under the same mechanism, and that the similar thesaurus improves recall and precision over a traditional information retrieval system at all recall and precision levels.
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH - cscpconf
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far there have been many studies of concept mining in English, but this area of study for Turkish, an agglutinative language, is still immature. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but because dictionary entries carry synonyms, hypernyms, hyponyms, and other relationships in their definition texts, the success rate in determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
A comparative analysis of particle swarm optimization and k means algorithm f... - ijnlc
The volume of digitized text documents on the web has been increasing rapidly. With such a huge collection of data on the web, there is a need to group (cluster) documents for speedy information retrieval. Document clustering collects documents into groups such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, which is often unsatisfactory because it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. K-means, PSO, and hybrid PSO+K-means algorithms are applied to cluster text in the Nepali language. Experimental evaluation is performed using intra-cluster and inter-cluster similarity.
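A bare-bones sketch of the K-means stage alone (the PSO hybrid and the WordNet synset enrichment are omitted; the toy term-count vectors are invented):

```python
import numpy as np

# Toy bag-of-terms vectors: six "documents" (rows) over five terms (columns).
X = np.array([[3, 0, 1, 0, 0],
              [2, 1, 0, 0, 0],
              [0, 0, 0, 2, 3],
              [0, 1, 0, 3, 2],
              [4, 0, 2, 0, 1],
              [0, 0, 1, 2, 2]], dtype=float)

def kmeans(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Recompute each centroid as the mean of its cluster's members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

print(kmeans(X))  # cluster label per document
```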
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION - cscpconf
The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indexes for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) to extract keywords and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
Classification of News and Research Articles Using Text Pattern Mining - IOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
1) This document discusses stemming algorithms that have been used for the Odia language. Stemming is the process of reducing inflected words to their root or stem for purposes like information retrieval.
2) It reviews different stemming algorithms that have been applied to Odia text, including suffix stripping, affix removal, and stochastic algorithms; a minimal suffix-stripping sketch follows this list. It also discusses common errors in stemming like over-stemming and under-stemming.
3) Applications of stemming discussed include information retrieval, text summarization, machine translation, indexing, and question answering systems. The document concludes by surveying prior work on stemming algorithms for Odia.
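A minimal suffix-stripping sketch (the suffix list is English and purely illustrative; it is not an Odia stemmer), showing how rule choice produces the over- and under-stemming errors mentioned above:

```python
SUFFIXES = ("ation", "tion", "ing", "ed", "es", "s")  # illustrative English endings

def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least `min_stem` chars."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

# Note that "connection" and "connects" land on different stems below:
# exactly the under-/over-stemming trade-off the survey discusses.
for w in ("connected", "connection", "connects"):
    print(w, "->", stem(w))  # connect, connec, connect
```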
Literature Based Framework for Semantic Descriptions of e-Science resources - Hammad Afzal
Hammad Afzal gave a seminar at the National University of Sciences and Technology in Islamabad about developing a literature-based framework for semantic descriptions of e-Science resources. He discussed using natural language processing techniques to automatically generate semantic profiles of bioinformatics resources by extracting information from relevant scientific literature. His approach involved building a controlled vocabulary from literature and then mining literature to find semantic descriptions of resources.
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH - IJCI JOURNAL
This document presents a study that compiled an abbreviation dictionary to normalize abbreviations used in hate speech tweets in English. The study:
1) Used a Python library and keywords from an annotated hate speech dataset to collect over 24,000 tweets containing abbreviations.
2) Manually reviewed the abbreviations from the tweets according to developed rules to determine their complete forms.
3) Compiled a dictionary of 300 abbreviations and their full forms to help normalize abbreviations in Twitter hate speech detection.
The study compared its methodology and results to previous work on abbreviation dictionaries and found it extracted more tweets and abbreviations than similar prior studies. The dictionary will be used to normalize hate speech data in future research.
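A minimal sketch of the normalization step such a dictionary enables (the entries below are generic SMS shorthand chosen for illustration, not the study's 300-entry dictionary):

```python
import re

abbrev = {"u": "you", "r": "are", "gr8": "great", "b4": "before", "idk": "i do not know"}

def normalize(tweet):
    # Replace each token with its full form if it appears in the dictionary.
    tokens = re.findall(r"\w+|\S", tweet.lower())
    return " ".join(abbrev.get(t, t) for t in tokens)

print(normalize("idk why u r late"))  # -> "i do not know why you are late"
```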
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether identical words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter homonyms out of the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
This document provides an overview of information retrieval models, including vector space models, TF-IDF, Doc2Vec, and latent semantic analysis. It begins with basic concepts in information retrieval like document indexing and relevance scoring. Then it discusses vector space models and how documents and queries are represented as vectors. TF-IDF weighting is explained as assigning higher weight to rare terms. Doc2Vec is introduced as an extension of word2vec to learn document embeddings. Latent semantic analysis uses singular value decomposition to project documents to a latent semantic space. Implementation details and examples are provided for several models.
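At its core, the latent semantic analysis step it describes is a truncated SVD of the term-document matrix; a small numpy sketch with an invented matrix:

```python
import numpy as np

# Rows = terms, columns = documents (toy counts).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 2],
              [0, 1, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep the top-k latent dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # documents projected to latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare documents 2 and 3 in the reduced space.
print(round(cos(doc_vectors[2], doc_vectors[3]), 3))
```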
The document provides a survey of word sense disambiguation (WSD) research. It discusses the history and applications of WSD, and categorizes the main WSD approaches as knowledge-based, supervised, and unsupervised. For each category, it outlines several common algorithms used, such as Lesk algorithm, decision trees, Naive Bayes, and support vector machines. The document surveys the state-of-the-art in WSD performance and compares different algorithm types. It also provides an overview of WSD research in Indian languages.
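As one concrete example of the knowledge-based family the survey covers, here is a simplified Lesk sketch; the two glosses for "bank" are paraphrased stand-ins, not actual WordNet entries:

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with the context.
glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}

def lesk(context, glosses):
    ctx = set(context.lower().split())
    return max(glosses, key=lambda sense: len(ctx & set(glosses[sense].split())))

print(lesk("she sat on the bank of the river to watch the water", glosses))
# -> "bank/river"
```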
Lexical Analysis to Effectively Detect User's Opinion - dannyijwest
In this paper we present a lexical approach that identifies the opinions of web users popularly expressed using short words or SMS words. These words are quite popular with diverse web users and are used to express opinions on the web. The study of opinion from the web arises from the need to know the diverse opinions of web users. The opinions expressed by web users may be on diverse topics such as politics, sports, products, and movies. These opinions are very useful to others, such as leaders of political parties, selection committees of various sports, business analysts and other stakeholders of products, directors and producers of movies, as well as other concerned web users. We use a semantics-based approach to find users' opinions from short words or SMS words apart from regular opinionated phrases. Our approach efficiently detects opinion from opinionated texts using lexical analysis and is found to be better than other approaches on different datasets.
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation - CSCJournals
This study focuses on the car-following model, an important part of microscopic traffic flow. Unlike Nagel–Schreckenberg's studies, in which the car-following model has neither agent drivers nor diligent ones, agent and diligent drivers are introduced into the car-following part of this work, and lane-changing is also included in the model. The impact of agent and diligent drivers under certain circumstances, such as evacuation, is considered. Based on simulation results, the relations between evacuation time and diligent drivers are obtained using different numbers of agent drivers, and a comparison between the previous (Nagel–Schreckenberg) model and the proposed model is made to find the evacuation time. In addition, the effectiveness of the reduction in evacuation time is presented for various numbers of agent and diligent drivers.
IRJET - Automatic Text Summarization of News Articles - IRJET Journal
The document discusses automatic text summarization of news articles. It proposes a method that focuses on identifying the most important parts of the text and generating coherent summaries using lexical chains, without full linguistic analysis. Key steps include constructing lexical chains based on semantically related nouns, resolving pronouns to their referent nouns, scoring sentences based on lexical chain coverage, and selecting highly scored sentences to include in the summary. The method aims to provide an optimized and efficient algorithm for generating text outlines using lexical chains.
This document evaluates the reliability of using Wikipedia as a source for cross-lingual query translation between English and Portuguese in the medical domain. Experiments were conducted translating single-word, two-word, and three-word queries between the two languages using Wikipedia links. Results showed Wikipedia coverage of single-word queries was around 81% for English and 80% for Portuguese. For query translation, coverage was 60% from English to Portuguese and 88% from Portuguese to English for single-word queries. Coverage decreased as query length increased. The study demonstrates Wikipedia has potential for cross-lingual information retrieval but coverage varies with query complexity.
An Engineering-to-Biology Thesaurus for Engineering Design.pdf - Naomi Hansen
This document presents an engineering-to-biology thesaurus that aims to help engineers leverage biological information during the design process by providing synonymous biological terms mapped to engineering function and flow terminology. The thesaurus integrates terms from research at Oregon State University, the Indian Institute of Science, and the University of Toronto. Biological terms in the thesaurus correspond to terms in the Functional Basis lexicon, an established set of engineering function and flow terms. The thesaurus is intended to ease the use of biological knowledge for engineers without extensive biological backgrounds. An example application of comprehension and functional modeling using the thesaurus is also presented.
Similar to Survey On Building A Database Driven Reverse Dictionary
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP - Editor IJMTER
System-on-chip (SoC) based systems have disadvantages in power dissipation as well as clock rate when data is transferred from one system to another on-chip. At the same time, a system operating at a higher rate does not support a lower-rate bus network for data transfer. An alternative scheme has been proposed for high-speed data transfer, but it is limited to SoCs. Unlike SoC, network-on-chip (NoC) has many advantages for data transfer, including a dedicated on-chip mechanism known as a transitional encoder, whose operation is based on input transitions and which supports systems operating at higher frequencies. In this project, a low-power encoding scheme is proposed. The proposed system yields lower dynamic power dissipation than the existing system, owing to the reduction of switching activity and coupling switching activity. Although many factors contribute to power dissipation, only dynamic power dissipation is considered here, as it gives a reasonable advantage. The proposed system is synthesized using Quartus II 9.1 software. In addition, the proposed system will be extended to inter-PE communication with the help of routers and PEs performing various operations. To implement this system in real NoCs, the proposed encoders and decoders for data transfer under regular traffic scenarios should be considered.
A RESEARCH - DEVELOP AN EFFICIENT ALGORITHM TO RECOGNIZE, SEPARATE AND COUNT ... - Editor IJMTER
Coins are an important part of our lives. We use coins in places like stores, banks, buses, and trains, so there is a basic need for coins to be sorted and counted automatically, which in turn requires that coins be recognized automatically. This work presents an automated coin recognition system for the Indian coins of Rs. 1, 2, 5 and 10 with rotation invariance. Images are taken of both sides of each coin, so the system is capable of recognizing coins from either side. Features are extracted from the images using techniques such as the Hough transform and pattern averaging.
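For illustration, circle detection with the Hough transform, one of the feature techniques the summary mentions, might look roughly like the following OpenCV sketch; the file name and all parameter values are assumptions for the example, not settings taken from the paper.

```python
import cv2

# Sketch: locate circular coin outlines in a grayscale image with the
# Hough circle transform. "coins.jpg" and the parameters are illustrative.
img = cv2.imread("coins.jpg", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 5)  # smooth sensor noise before detection

circles = cv2.HoughCircles(
    img, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
    param1=100, param2=30, minRadius=10, maxRadius=80,
)
if circles is not None:
    print("detected %d coin candidates" % circles.shape[1])
```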
Analysis of VoIP Traffic in WiMAX EnvironmentEditor IJMTER
This document reviews several studies that analyzed the performance of VoIP traffic over WiMAX networks using different VoIP codecs and WiMAX service classes. It summarizes the findings of various papers on how QoS parameters like throughput, delay, jitter compared for codecs like G.711, G.723, G.729 when using the UGS, rtPS, nrtPS and BE service classes. Most studies found that UGS generally performed best for VoIP due to its ability to guarantee bandwidth and minimize jitter and delay, while G.711 typically provided the best voice quality. The document aims to compare the results across different service classes and codecs.
A Hybrid Cloud Approach for Secure Authorized De-DuplicationEditor IJMTER
Cloud backup is used for people's personal storage, reducing the maintenance burden and managing the structure and the storage space. The challenging part is de-duplication, in both local and global backup de-duplication. Prior work provides either local storage de-duplication or, conversely, global storage de-duplication to improve storage capacity and processing time. In this paper, the proposed system is called ALG-Dedupe, an Application-aware Local-Global source de-duplication scheme designed to provide efficient de-duplication with low system load, a shortened backup window, and increased power efficiency in the user's personal storage. In the proposed system, large data is partitioned into smaller parts called chunks; any redundancy the data contains is eliminated before it is stored in the storage area.
Aging protocols that could incapacitate the InternetEditor IJMTER
The biggest threat to the Internet is the fact that it was never really designed; instead, it evolved in fits and starts, thanks to various protocols that were cobbled together to fulfill the needs of the moment. For example, the BGP protocol is used by Internet routers to exchange information about changes to the Internet's network topology, yet it is among the most fundamentally broken, as Internet routing information can be poisoned with bogus routing information. Few of these protocols were designed with security in mind, and those that were sported no more than was needed to keep out a nosy neighbor, not a malicious attacker. The result is a welter of aging protocols susceptible to exploits on an Internet scale. Here are six Internet protocols that could stand to be replaced sooner rather than later, or are (mercifully) on the way out.
A Cloud Computing design with Wireless Sensor Networks For Agricultural Appli...Editor IJMTER
1. The document proposes a design for using wireless sensor networks and cloud computing together for agricultural applications. It describes how sensor nodes can collect environmental data and send it to the cloud for storage, analysis and decision making.
2. The proposed system has three main components - a sensing cluster with various sensors to collect data, a cloud service cluster to process and analyze the data, and a mechanism cluster with actuator nodes that can take actions based on the cloud's decisions.
3. Some potential applications discussed are image processing of unhealthy plants, predicting crop diseases based on sensor readings, and automatically controlling the cultivation environment through actuators. The system is aimed to help farmers optimize resources and increase productivity.
A CAR POOLING MODEL WITH CMGV AND CMGNV STOCHASTIC VEHICLE TRAVEL TIMESEditor IJMTER
Carpooling (also car-sharing, ride-sharing, or lift-sharing) is the sharing of car journeys so that more than one person travels in a car. It helps resolve a variety of problems that continue to plague urban areas, ranging from energy demands and traffic congestion to environmental pollution. Most existing methods model the stochastic disturbances arising from variations in vehicle travel times, but they do not deal with unmet demand when vehicle demand itself is uncertain. To address this, the proposed system uses a chance-constrained programming (CCP) formulation of the problem with stochastic demand and travel time parameters, under mild assumptions on the distribution of the stochastic parameters, and relates it to a robust optimization approach. The resulting stochastic carpooling model, which considers the influence of stochastic travel times, is formulated as an integer multiple-commodity network flow problem. Since real problem sizes can be large, it can be difficult to find optimal solutions within a reasonable period of time; therefore, a solution algorithm based on a tabu-search heuristic is developed to solve the model.
Sustainable Construction With Foam Concrete As A Green Green Building MaterialEditor IJMTER
This document discusses the use of foam concrete as a sustainable building material. Foam concrete is produced using cement, fine sand, water, and aluminium powder, which reacts to produce hydrogen gas bubbles that lighten the concrete. It has benefits like lower carbon dioxide emissions in production than traditional concrete, good thermal and sound insulation, fire resistance, and cost-effectiveness. The document reports on tests showing that foam concrete made with quarry dust has higher compressive strength than that made with sand. Strength generally decreases as aluminium powder content increases. Foam concrete is proposed as a sustainable alternative building material.
USE OF ICT IN EDUCATION ONLINE COMPUTER BASED TESTEditor IJMTER
A good education system is required for the overall prosperity of a nation. Tremendous growth in the education sector has made the administration of education institutions complex. Many studies reveal that the integration of ICT helps reduce this complexity and enhance the overall administration of education. This study was undertaken to identify the various functional areas in which ICT is deployed for information administration in education institutions, and to determine the current extent of ICT usage in these functional areas. The various factors that contribute to these functional areas were identified, and a theoretical model was derived and validated.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition data values by similarity, with similarity measures used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data values with high-dimensional attributes, and document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) as input for the document grouping process, and clustering accuracy degrades drastically under an unsuitable cluster count.
Textual data elements are divided into two types: discriminative words and nondiscriminative words. Only discriminative words are useful for grouping documents; the involvement of nondiscriminative words confuses the clustering process and leads to a poor clustering solution. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; it exploits both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with Ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations for the dimensionality reduction process.
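This is not the DPMFP model itself, but the core idea, that a Dirichlet process mixture infers the number of clusters from the data rather than taking K as input, can be sketched with scikit-learn's truncated Dirichlet-process approximation on toy documents (all data below is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import BayesianGaussianMixture

docs = ["cats purr and meow", "kittens meow softly", "dogs bark loudly",
        "puppies bark and play", "dogs chase cats", "cats nap all day"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# Truncated Dirichlet-process mixture: n_components is only an upper bound;
# superfluous components get near-zero weight, so K is effectively inferred.
dpm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
print(dpm.predict(X))  # cluster labels; the number of distinct labels <= 5
```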
Testing of Matrices Multiplication Methods on Different ProcessorsEditor IJMTER
There are many known algorithms for matrix multiplication. The classical algorithm has complexity O(n^3), though further research has shown that this complexity can be decreased. This paper focuses on matrix multiplication algorithms and their complexity, and on testing them on different processors.
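For reference, the classical O(n^3) algorithm the summary refers to is the familiar triple loop below (a plain Python sketch; asymptotically faster methods such as Strassen's lower the exponent to about 2.81):

```python
def matmul(A, B):
    """Classical O(n^3) multiplication of square matrices (lists of lists)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += A[i][k] * B[k][j]  # dot product of row i and column j
            C[i][j] = s
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```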
Malware is a worldwide pandemic. It is designed to damage computer systems without the knowledge of the system's owner. Even software from reputable vendors can contain malicious code that affects the system or leaks information to remote servers. Malware includes computer viruses, spyware, dishonest adware, rootkits, Trojans, dialers, etc. Malware detectors are the primary tools of defense against malware, and the quality of such a detector is determined by the techniques it uses. It is therefore imperative to study malware detection techniques and understand their strengths and limitations. This survey examines different types of malware and malware detection methods.
SURVEY OF TRUST BASED BLUETOOTH AUTHENTICATION FOR MOBILE DEVICEEditor IJMTER
Practical requirements for securely demonstrating identities between two handheld devices are an important concern, since an adversary can mount a man-in-the-middle (MITM) attack to intrude on the protocol. Protocols that employ secret keys require the devices to share private information in advance, which is not feasible in this scenario. Apart from insecurely typing passwords into handheld devices or comparing long hexadecimal keys displayed on the devices' screens, many other human-verifiable protocols have been proposed in the literature to solve the problem. Unfortunately, most of these schemes do not scale to more users: even when only three entities attempt to agree on a session key, these protocols need to be rerun three times. In the existing method, bipartite and tripartite authentication protocols are presented using a temporary confidential channel, and the system is further extended into a transitive authentication protocol that allows multiple handheld devices to establish a conference key securely and efficiently. However, this method detects only outsider attacks and does not consider insider attacks. The proposed method therefore introduces a trust-score-based scheme that computes trust values for the nodes and provides security. The computed trust score has a positive influence on the confidence with which an entity conducts transactions with a node. The behavior of each node in the network is monitored periodically and its trust value updated, so that, depending on a node's behavior in the network, a trust relation is established between two nodes.
Glaucoma is a chronic eye disease that can damage the optic nerve. According to the WHO, it is the second leading cause of blindness and is predicted to affect around 80 million people by 2020. Progression of the disease leads to loss of vision, which occurs gradually over a long period of time. Because symptoms appear only when the disease is quite advanced, glaucoma is called the silent thief of sight. Glaucoma cannot be cured, but its progression can be slowed by treatment, so detecting it in time is critical; however, many glaucoma patients are unaware of the disease until it has reached an advanced stage. In this paper, some manual and automatic methods for detecting glaucoma are discussed. Manual analysis of the eye is time consuming, and the accuracy of the parameter measurements varies between clinicians. To overcome these problems, the objective of this survey is to introduce a method to automatically analyze ultrasound images of the eye; automatic analysis of this disease is much more effective than manual analysis.
Survey: Multipath routing for Wireless Sensor NetworkEditor IJMTER
Reliability plays a vital role in some applications of Wireless Sensor Networks, and multipath routing is one way to increase the probability of reliable delivery; moreover, energy consumption is a constraint. In this paper, we provide a survey of the state of the art in proposed multipath routing algorithms for Wireless Sensor Networks. We study each design, analyze its trade-offs, and give an overview of several existing algorithms.
Step up DC-DC Impedance source network based PMDC Motor DriveEditor IJMTER
This paper is devoted to a quasi-Z-source network based DC drive. The cascaded (two-stage) quasi-Z-source network is derived by adding one diode, one inductor, and two capacitors to the traditional quasi-Z-source inverter (qZSI). The proposed cascaded qZSI inherits all the advantages of the traditional solution (voltage boost and buck functions in a single stage, continuous input current, and improved reliability). Moreover, compared to the conventional qZSI, the proposed solution reduces the shoot-through duty cycle by over 30% at the same voltage boost factor. A theoretical analysis of the two-stage qZSI in the shoot-through and non-shoot-through operating modes is described, and the proposed and traditional qZSI networks are compared. A prototype of the quasi-Z-source network based DC drive was built to verify the theoretical assumptions, and the experimental results are presented and analyzed.
SPIRITUAL PERSPECTIVE OF AUROBINDO GHOSH’S PHILOSOPHY IN TODAY’S EDUCATIONEditor IJMTER
The paper reflects on the spiritual philosophy of Aurobindo Ghosh, which is helpful in today's education. He wrote about spirituality in the 19th century, and it remains a core and vital part of today's education, very much essential for today's children. Here I present an overview of that philosophy. The regeneration of those values in today's generation is a great challenge for the education system, and developing values and spiritual education in the young is my great motto. In this materialistic world, redefining values among them is a hard task, but not an impossible one.
Software Quality Analysis Using Mutation Testing SchemeEditor IJMTER
Software test coverage is used to measure safety assurance; here, safety-critical analysis is carried out for source code written in Java. Testing provides a primary means for assuring software in safety-critical systems. To demonstrate, particularly to a certification authority, that sufficient testing has been performed, it is necessary to achieve the test coverage levels recommended or mandated by safety standards and industry guidelines. Mutation testing provides an alternative or complementary method of measuring test sufficiency, but it has not been widely adopted in the safety-critical industry. The system provides an empirical evaluation of the application of mutation testing to airborne software systems that have already satisfied the coverage requirements for certification.
The system applies mutation testing to safety-critical software developed using high-integrity subsets of C and Ada, identifies the most effective mutant types, and analyzes the root causes of failures in test cases. Mutation testing can be effective where traditional structural coverage analysis and manual peer review have failed. The results also show that several testing issues have origins beyond the test activity itself, suggesting improvements to the requirements definition and coding process. The system further examines the relationship between program characteristics and mutant survival, and considers how program size can provide a means for targeting the test areas most likely to harbor dormant faults. Industry feedback is also provided, particularly on how mutation testing can be integrated into a typical verification life cycle of airborne software. The system also covers the safety and criticality levels of Java source code.
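To make the idea of "killing" mutants concrete, here is a toy Python sketch of the mechanism (the system above targets Java, C, and Ada; this is only an illustration of the principle):

```python
# Toy mutation testing: mutate one operator in a function's source and
# check whether the test suite detects ("kills") the mutant.
SOURCE = "def add(a, b):\n    return a + b\n"

def tests_pass(namespace):
    return namespace["add"](2, 3) == 5  # a minimal test suite

for label, src in [("original", SOURCE),
                   ("mutant (+ -> -)", SOURCE.replace("a + b", "a - b"))]:
    ns = {}
    exec(src, ns)  # compile the original, then the mutated version
    print(label, "passes" if tests_pass(ns) else "killed")
```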
Software Defect Prediction Using Local and Global AnalysisEditor IJMTER
Software defect factors are used to measure the quality of software, and software effort estimation is used to measure the effort required for the development process. The defect factor has an impact on the software development effort, and software development cost factors are also decided with reference to the defect and effort factors. Software defects are predicted with reference to module information, and module link information is used in the effort estimation process.
Data mining techniques are used in the software analysis process: clustering techniques for grouping properties, and rule mining methods for learning rules from the clustered data values. The "WHERE" clustering scheme and the "WHICH" rule mining scheme are used in the defect prediction and effort estimation process, with module information serving as input to both.
The proposed system is designed to improve the defect prediction and effort estimation process. A Single Objective Genetic Algorithm (SOGA) is used in the clustering process, and the rule learning operations are carried out using the Apriori algorithm. The system improves cluster accuracy, and the defect prediction and effort estimation accuracy are also improved. The system is developed using the Java language and an Oracle relational database environment.
Software Cost Estimation Using Clustering and Ranking SchemeEditor IJMTER
Software cost estimation is an important task in the software design and development process; planning and budgeting tasks are carried out with reference to the estimated cost values. A variety of software properties are used in the cost estimation process, including hardware, product, technology, and methodology factors, and the quality of a cost estimate is measured by its accuracy.
Software cost estimation is carried out using three types of techniques: regression-based models, analogy-based models, and machine learning models, each comprising a set of techniques for the cost estimation process. Eleven cost estimation techniques across these three categories are used in the system. The Attribute Relational File Format (ARFF) is used to maintain the software product property values, and the ARFF file serves as the main input to the system.
The proposed system is designed to perform clustering and ranking of software cost estimation methods. A non-overlapping clustering technique is enhanced with an optimal centroid estimation mechanism. The system improves the accuracy of the clustering and ranking process and produces efficient ranking results for software cost estimation methods.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet has pushed the United Nations and governments to promote green energy and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained strong momentum due to their numerous advantages over fossil fuel alternatives, advantages that go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid PV and EV system to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present, and presents the proposed design diagram, which sets the priorities and requirements of the system. The proposed approach allows plants to improve their power stability, especially during power outages. The information presented supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight the benefits to existing plants. The short return on investment of the proposed approach underlines the paper's contribution to sustainable electrical systems; in addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper functioning of integrated circuits (ICs) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from discrete devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, confronts design issues such as proneness to electromagnetic interference (EMI): electronic control devices calculate incorrect outputs because of EMI, and sensors give misleading values, which can prove fatal in automotive applications. In this paper, the authors review, non-exhaustively, research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been referred to as the "New Great Game." This research centres on that power struggle, considering geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil politics, and traditional and nontraditional security are explored and explained. Using Mackinder's Heartland theory, Spykman's Rimland theory, and Hegemonic Stability theory, the study examines China's role in Central Asia. The study adheres to an empirical epistemological method and takes care to maintain objectivity, critically analyzing primary and secondary research documents to elaborate the role of China's geo-economic outreach in Central Asian countries and its future prospects. According to this study, China is seeing significant success in trade, pipeline politics, and gaining influence over other governments, success that can be attributed to the effective use of key instruments such as the Shanghai Cooperation Organisation and the Belt and Road Economic Initiative.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
Survey On Building A Database Driven Reverse Dictionary
Scientific Journal Impact Factor (SJIF): 1.711
International Journal of Modern Trends in Engineering and Research
www.ijmter.com
e-ISSN: 2349-9745, p-ISSN: 2393-8161
Survey On Building A Database Driven Reverse Dictionary
Akanksha Tiwari, Prof. Rekha P. Jadhav
G.H. Raisoni Institute of Engineering and Technology (GHRIET), Pune
Abstract: Reverse dictionaries are widely used as reference works organized by concepts, phrases, or the definitions of words. This paper describes the many challenges inherent in building a reverse lexicon and maps the problem to the well-known conceptual similarity problem. Mainstream web search engines are basic versions of such a system; they take advantage of their huge scale, which permits inferring general interest in documents from link information. This paper presents a basic study of a database-driven reverse dictionary using three large-scale datasets, namely person names, general English words, and biomedical concepts, and analyzes the difficulties arising in the use of documents produced by a reverse dictionary.
Keywords: Reverse Dictionary (RD), Phrase, Lexicon, Database, WordNet.
I. INTRODUCTION
For a long time, people have used dictionaries for two well-defined purposes. The first is to find the meaning of a specific word, or its equivalent in another language. The second is to find words listed alphabetically in a specific language, together with their usage information, definitions, phonetics, pronunciations, and other linguistic features. When these ideas come together, we understand why this resource has not lost importance and continues to be widely used around the world.
With the evolution of technology over recent years, dictionaries are now also available in electronic format. An online dictionary [1] is a dictionary that is accessible via the Internet through a web browser. There are basically two types of online dictionary:
1. Forward dictionary: maps from a word to its definition. Example: 'chef' : a person who is a highly skilled professional cook, proficient in all aspects of food preparation.
2. Reverse dictionary: used when we already have the meaning or the idea but are not sure of the appropriate word. Example: a person who is a highly skilled professional cook, proficient in all aspects of food preparation : 'chef'.
In order to build a reverse dictionary, a forward dictionary is needed first. WordNet is the forward dictionary used in this work. WordNet [2] is a lexical database, available online, that provides a large repository of English lexical items. The approach discussed in [5] deals with the pre-creation of a context vector for each word in WordNet during the learning phase.
Three large datasets are used to build the reverse dictionary: person names, biomedical concept names, and general English words. These datasets are in turn organized into five databases: the synonym DB, which gives the relevant meanings for an important word; the RMS DB, which stores the parse trees created for the dictionary definitions; the hyponym DB, which holds words more specific (or more general) than a given input word; the antonym DB, which gives words opposite in meaning to a given word; and the definitions DB, which describes each word briefly [8].
The rest of this paper is organized as follows. Section II presents a literature survey of reverse dictionary approaches and then discusses the problems and constraints of existing systems. Section III presents future enhancements, and Section IV concludes the survey.
II. LITERATURE SURVEY
In recent years, much research has been done on database-driven reverse dictionaries, but the idea of arranging the vocabulary of a language in reverse order is not new. Reverse dictionaries have been published since the 19th century as simple collections of words; the first reverse dictionary of a modern language was compiled in 1915 for Russian, for the purpose of decoding military news, and in the fifties and later many reverse dictionaries were published for different languages. In the traditional model of dictionary use, the forward concept is implemented: a lookup results in a set of definitions, which may comprise whole phrases. To complement the forward concept, a reverse dictionary is provided to the user, in which, for any phrase or word, the appropriate single-word meaning is given; the system will provide a relevant meaning even if the exact phrase is not available in the database. Virtually all attempts to study the similarity of concepts model concepts as single words [4]. Work in text classification, for instance, surveyed in detail in [3], attempts to cluster documents as similar to one another if they contain co-occurring words (not phrases or sentences). Current word sense disambiguation approaches, it appears, still consider a single word at a time [5]-[7]. There is also some work on multiword units that addresses the problem of finding the similarity of multiword phrases across a set of documents in Wikipedia [19].
Several studies have addressed different paradigms for approximate dictionary matching. Bocek et al. (2007) presented Fast Similarity Search (FastSS), an enhancement of neighborhood generation algorithms in which multiple variants of each string record are stored in a database [10]. Wang et al. (2009) further improved the technique of neighborhood generation by introducing partitioning and prefix pruning. Huynh et al. (2006) developed a solution to the k-mismatch problem in compressed suffix arrays. Liu et al. (2008) stored string records in a trie and proposed a framework called TITAN [11].
These studies are specialized. Several researchers have presented refined similarity measures for strings (Winkler, 1999; Cohen et al., 2003; Bergsma and Kondrak, 2007; Davis et al., 2007). Although such studies are sometimes regarded as research on approximate dictionary matching, they assume that the two strings whose similarity is to be computed are given; in other words, finding the strings in a large collection that are similar to a given string is out of their scope. Thus, a reasonable approach to approximate dictionary matching is to quickly collect candidate strings with a loose similarity threshold, and then have a refined similarity measure scrutinize each candidate string for the target application.
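As a minimal sketch of this two-stage idea, the following uses letter trigrams with a loose Jaccard threshold to collect candidates, then a more refined measure to scrutinize them; the toy dictionary and both thresholds are assumptions for the example.

```python
from difflib import SequenceMatcher

def trigrams(s):
    s = "##" + s.lower() + "##"  # pad so short strings still yield trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

DICTIONARY = ["reverse", "reserve", "revere", "rehearse"]  # toy collection

def approximate_match(query, loose=0.3, strict=0.8):
    q = trigrams(query)
    # Stage 1: cheap trigram-overlap (Jaccard) filter with a loose threshold.
    candidates = [w for w in DICTIONARY
                  if len(q & trigrams(w)) / len(q | trigrams(w)) >= loose]
    # Stage 2: refined (more expensive) similarity on the survivors only.
    return [w for w in candidates
            if SequenceMatcher(None, query, w).ratio() >= strict]

print(approximate_match("reversee"))  # ['reverse', 'revere']
```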
A. Dataset
A reverse dictionary application is a software element that captures a user phrase as input and returns conceptually related words as output. It requires a large amount of data to find the accurate meaning of a word. Three large datasets are used, together with databases for synonyms, hyponyms, and antonyms.
1. Person names: This dataset comprises actor names extracted from the IMDB database. We used all actor names (1,098,022 strings; 18 MB) from the file actors.list.gz. The average number of letter trigrams in the strings is 17.2, and the total number of trigrams is 42,180. The system generated index files of 83 MB in 56.6 s.
Table 1: Literature survey on reverse dictionaries

Author/Year: Yuhua Li, David McLean, Zuhair A. Bandar, 2006 (IEEE) [13]
Method: Sentence similarity
Dataset: Lexical dataset
Remark: Varied sentence-pair dataset with human ratings, and an improvement to the algorithm that disambiguates word sense using the surrounding words to give a little contextual information.

Author/Year: Naoaki Okazaki and Jun'ichi Tsujii, 2010 [8]
Method: CPMerge algorithm and an n-gram feature approach
Dataset: Personal names, general English words, and biomedical names
Remark: Solved τ-overlap joins by checking approximately half of the inverted lists, with cosine similarity and threshold α = 0.7.

Author/Year: Anindya Datta and Kaushik Dutta, March 2013 (IEEE) [1]
Method: Concept similarity problem (CSP)
Dataset: Dictionary containing synonyms, antonyms, and hyponyms
Remark: Proposes a set of methods for building and querying a reverse dictionary, and describes a set of experiments that show the quality of the results.

Author/Year: Oscar Méndez, Marco A. Moreno-Armendáriz, 2013 (IEEE) [14]
Method: Semantic approach
Dataset: WordNet, semantic dataset
Remark: Applies algebraic analysis to the dataset, then a filtering process and a ranking phase; finally, a predefined number of output target words are displayed.
2. Google Web1T unigrams: This dataset consists of the English word unigrams included in the Google Web1T corpus (LDC2006T13). We used all word unigrams (13,588,391 strings; 121 MB) in the corpus after removing the frequency information. The average number of letter trigrams in the strings is 10.3, and the total number of trigrams is 301,459. The system generated index files of 601 MB in 551.7 s [15].
3. UMLS: This dataset consists of English names and descriptions of biomedical concepts included in the Unified Medical Language System (UMLS). We extracted all English concept names (5,216,323 strings; 212 MB) from MRCONSO.RRF.aa.gz and MRCONSO.RRF.ab.gz in UMLS Release 2009AA. The average number of letter trigrams in the strings is 43.6, and the total number of trigrams is 171,596 [14].
4. Synonym set: a set of words with similar meaning; for example, talk: {"speak", "utter", "mouth"}.
5. Antonym set: a set of conceptually opposite or negated terms for a term t; for example, the antonym set of "pleasant" might consist of {"unpleasant", "unhappy"}.
6. Hypernym set: a set of conceptually more general terms describing t; for example, the hypernym set of "red" might consist of {"color"}.
7. Hyponym set: a set of conceptually more specific terms describing t; for example, the hyponym set of "red" might consist of {"maroon", "crimson"}.
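Assuming an NLTK installation with the WordNet corpus downloaded, these four sets can be pulled out roughly as follows; this is a sketch of the idea, not the extraction code used by the systems surveyed here.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

syn = wn.synsets("red")[0]  # first sense of "red" in WordNet

synonyms  = {l.name() for l in syn.lemmas()}
antonyms  = {a.name() for l in syn.lemmas() for a in l.antonyms()}
hypernyms = {l.name() for h in syn.hypernyms() for l in h.lemmas()}
hyponyms  = {l.name() for h in syn.hyponyms() for l in h.lemmas()}

print(synonyms, antonyms, hypernyms, hyponyms, sep="\n")
```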
B. Building the Reverse Mapping Set
An existing dictionary receives an input phrase and outputs many candidate words, so it can be tedious for the user to pick one from the result. The basic architecture of the database-driven reverse dictionary is shown in Figure 1.
Building an RMS means finding, for a word w, the set R(w) of words in whose definitions w appears. Example: suppose the word "clever" is found in 4 definitions belonging to 4 words; then R(clever) will be {"intelligent", "bright", "smart", "brilliant"}. These mappings can be entered manually for each word, or the RMS of the words can be found from the WordNet [2][6] dictionary. Stop words like "am", "are", "however", "where", etc. need to be discarded, as they do not form an important part of the process. Antonyms, on the other hand, need to be addressed explicitly. Example: when the word "clever" is negated by "not", the antonym of "clever", which is "stupid", should be considered for the search process.
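Assuming a simple word-to-definition forward dictionary, the reverse mapping sets described above might be built as in the following sketch; the stop-word list and the toy entries are illustrative only.

```python
STOP_WORDS = {"a", "am", "an", "and", "are", "is", "or", "the", "however", "where"}

# Toy forward dictionary (illustrative entries, not WordNet data).
FORWARD = {
    "intelligent": "a clever quick witted person",
    "bright":      "clever and quick to learn",
    "smart":       "clever or intelligent",
    "brilliant":   "exceptionally clever or talented",
}

def build_rms(forward):
    """R(w) = set of headwords in whose definition the word w appears."""
    rms = {}
    for headword, definition in forward.items():
        for term in definition.lower().split():
            if term in STOP_WORDS:
                continue  # stop words add little meaning, so skip them
            rms.setdefault(term, set()).add(headword)
    return rms

R = build_rms(FORWARD)
print(R["clever"])  # {'intelligent', 'bright', 'smart', 'brilliant'}
```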
C. Querying the Reverse Mapping Sets
This step describes the use of the R indexes to respond to user input phrases. When a user input phrase U is received, the core terms are first extracted from U. The next step is to apply stemming, which converts a term to its base form. For example, given the input phrase "hopping animal", stemming converts "hopping" to its base form "hop". Stemming is done through a standard stemming algorithm.
Figure 1: Database Driven Reverse Dictionary
After stemming, the appropriate R indexes (i.e., the RMSs) of the terms extracted from the user input phrase are consulted to find candidate words. Given the input phrase "a small town", the core terms "small" and "town" are extracted (the term "a" is a stop word and hence ignored). Consulting the appropriate R indexes, R(small) and R(town), returns words in whose definitions "small" and "town" occur simultaneously; each such word becomes a candidate word. A tunable input parameter α is defined, representing the minimum number of candidate words needed to stop processing and return output. If the first step does not generate a sufficient number of candidate words W according to α, the query Q is expanded to include synonyms, hypernyms, and hyponyms of the terms in Q. Once the threshold number of candidate words has been found, the results are sorted by similarity to U, and the top β words are returned, where β is an input parameter representing the maximum number of words to be returned as output.
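Putting the steps together, the query path (stemming, stop-word removal, RMS intersection, expansion when fewer than α candidates are found, truncation to β results) might be sketched as below; the similarity ranking is left as a placeholder, and all names and default values are assumptions.

```python
from nltk.stem import PorterStemmer  # a standard stemming algorithm

stemmer = PorterStemmer()
STOP_WORDS = {"a", "an", "the"}

def query(phrase, rms, synonyms, alpha=4, beta=10):
    """Return up to beta candidate words for a user input phrase."""
    terms = [stemmer.stem(t) for t in phrase.lower().split()
             if t not in STOP_WORDS]
    if not terms:
        return []
    # Words in whose definitions every core term occurs simultaneously.
    candidates = set.intersection(*(rms.get(t, set()) for t in terms))
    if len(candidates) < alpha:  # too few candidates: expand the query
        for t in terms:
            for s in synonyms.get(t, ()):  # hypernyms/hyponyms analogous
                candidates |= rms.get(s, set())
    ranked = sorted(candidates)  # placeholder for similarity-based ranking
    return ranked[:beta]

# e.g. query("hopping animal", R, {"hop": ["jump"]})
```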
D. Ranking Candidate Words
Here, the semantic similarity between the definition S of each candidate word and the user input phrase U is computed, and on that basis the set of output words is sorted in order of decreasing similarity to U. This requires assigning a similarity measure to each (S, U) pair, where U is the user input phrase and S is the definition of a candidate word [8].
[Figure 1 depicts the pipeline: an input phrase is matched against the IMDB actors, Google Web1T unigram, and UMLS datasets to build the reverse mapping set; the reverse mapping sets are then queried, with expansion over synonyms, antonyms, and hyponyms; finally, candidate words are ranked to produce the output phrase.]
Here it is necessary to compute both term similarity and term importance. Term similarity between two terms is computed based on their locations in the WordNet hierarchy, which organizes the words of the English language from the most general at the root to the most specific at the leaf nodes. Consider the LCA (least common ancestor) of the two terms: if the LCA of two terms in this hierarchy is the root, the two terms have little similarity; the deeper the LCA, the greater the similarity. A similarity function between two terms a and b is defined as

ρ(a, b) = 2 · E[A(a, b)] / (E(a) + E(b))    (1)

where b is a term in the user input phrase U and a is a term in the sense phrase S. A(a, b) returns the LCA shared by a and b in the WordNet hierarchy, E[A(a, b)] is the depth of that LCA, and E(a) and E(b) return the depths of the terms a and b, respectively. The value of ρ(a, b) is larger for more similar terms.
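Using NLTK's WordNet interface, the depth-based similarity of Eq. (1) can be sketched as below (it is essentially a Wu-Palmer-style measure); taking only the first sense of each word is a simplification assumed for the example.

```python
from nltk.corpus import wordnet as wn

def rho(word_a, word_b):
    """Similarity via LCA depth: a deeper LCA means more similar terms."""
    a, b = wn.synsets(word_a)[0], wn.synsets(word_b)[0]
    lca = a.lowest_common_hypernyms(b)[0]  # A(a, b): least common ancestor
    return 2.0 * lca.max_depth() / (a.max_depth() + b.max_depth())

print(rho("dog", "cat"))     # LCA is deep (carnivore): high similarity
print(rho("dog", "rocket"))  # LCA is near the root: low similarity
```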
It is also essential to consider the importance of each term in the phrase. For example, consider the two phrases "the fox who bit the man" and "the man who bit the fox": they contain the same words but convey different meanings, so the sequence of words in a phrase matters. To generate the importance of each term, a parser can be used; the OpenNLP [12] parser is used in this work. The parser returns the grammatical structure of a sentence as a parse tree for a given input phrase. In a parse tree, the terms that add most to the meaning of the phrase appear higher than the words that add less to its meaning [18].
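The paper uses the OpenNLP parser to obtain the tree; as a parser-independent sketch of the weighting idea, assume the parse tree is already available as nested tuples (the tree below is hand-made stand-in data):

```python
# Importance decays with depth in the parse tree: terms nearer the root
# contribute more to the meaning of the phrase.
TREE = ("S", [("NP", [("DT", "the"), ("NN", "fox")]),
              ("VP", [("VBD", "bit"),
                      ("NP", [("DT", "the"), ("NN", "man")])])])

def term_importance(node, depth=0, scores=None):
    if scores is None:
        scores = {}
    _label, payload = node
    if isinstance(payload, str):             # leaf node: an actual term
        scores[payload] = 1.0 / (1 + depth)  # shallower => more important
    else:
        for child in payload:
            term_importance(child, depth + 1, scores)
    return scores

print(term_importance(TREE))  # e.g. {'the': ..., 'fox': ..., 'bit': ..., 'man': ...}
```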
E. Problems and Constraints
• The problem of approximate dictionary matching must be solved.
• It is necessary to compute both term similarity and term importance.
• The approach does not scale well: for a dictionary containing more than 100,000 defined words, where each word may have multiple definitions, it could require potentially hundreds of thousands of queries to return a result.
• The efficiency of the algorithm must be demonstrated on three large-scale datasets, with person names, biomedical concept names, and general English words.
• The database-driven approach provides significant improvements in performance at scale without sacrificing solution quality, but it is slow for larger queries.
III. FUTURE ENHANCEMENT
It is natural to extend this study to the compression and decompression of inverted lists, to reduce disk space and improve query performance, and to the use of the K-means clustering algorithm for searching queries. Even though meaningful information regarding the implementation of the existing reverse dictionaries cannot be provided, with the help of some corrections an effective reverse dictionary can be obtained: one that takes a user input phrase and outputs a set of words ordered by priority, from the most conceptually similar to the least. Wildcard characters could also be introduced into the user query, and an emoticon-based dictionary using semantic orientation could be attempted.
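As a sketch of the inverted-list compression mentioned above, the standard combination of delta gaps and variable-byte encoding is shown below; this is a generic technique, not code from any of the surveyed systems.

```python
def vbyte_encode(postings):
    """Compress a sorted postings list by storing gaps as variable-length bytes."""
    out, prev = bytearray(), 0
    for doc_id in postings:
        gap, prev = doc_id - prev, doc_id  # gaps are small, so they pack well
        while gap >= 128:
            out.append(gap & 0x7F)         # low 7 bits; more bytes follow
            gap >>= 7
        out.append(gap | 0x80)             # high bit marks the final byte
    return bytes(out)

def vbyte_decode(data):
    postings, doc_id, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:                    # final byte of this gap
            doc_id += gap
            postings.append(doc_id)
            gap, shift = 0, 0
        else:
            shift += 7
    return postings

ids = [3, 7, 200, 1025]
assert vbyte_decode(vbyte_encode(ids)) == ids
```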
IV. CONCLUSION
A reverse dictionary application is a software element that captures a user phrase as input and returns conceptually related words as output. The database-driven approach can provide significant improvements in performance at scale without sacrificing the quality of the result. In this survey paper, we studied the different approaches needed to construct a database-driven reverse dictionary: we described the significant challenges inherent in building a reverse dictionary, mapped the problem to the well-known conceptual similarity problem, and reviewed the methods for building and querying a reverse dictionary.
V. ACKNOWLEDGMENT
The authors would like to thank all who provided valuable help throughout their research. They also thank the referees for their suggestions for improving this paper.
REFERENCES
[1] Anindya Datta, Ryan Shaw, Debra VanderMeer, and Kaushik Dutta, "Building a Scalable Database-Driven Reverse Dictionary," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 3, pp. 528-540, 2013.
[2] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, Mar. 2003.
[3] J. Carlberger, H. Dalianis, M. Hassel, and O. Knutsson, "Improving Precision in Information Retrieval for Swedish Using Stemming," Technical Report IPLab-194, TRITA-NA-P0116, Interaction and Presentation Laboratory, Royal Inst. of Technology and Stockholm Univ., Aug. 2001.
[4] H. Cui, R. Sun, K. Li, M.-Y. Kan, and T.-S. Chua, "Question Answering Passage Retrieval Using Dependency Relations," Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 400-407, 2005.
[5] T. Dao and T. Simpson, "Measuring Similarity between Sentences," http://opensvn.csie.org/WordNetDotNet/trunk/Projects/Thanh/Paper/WordNetDotNet_Semantic_Similarity.pdf (last accessed 16 Oct. 2009), 2009.
[6] X. Liu and W. Croft, "Passage Retrieval Based on Language Models," Proc. 11th Int'l Conf. Information and Knowledge Management, pp. 375-382, 2002.
[7] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[8] Naoaki Okazaki and Jun'ichi Tsujii, "Simple and Efficient Algorithm for Approximate Dictionary Matching," Coling 2010, pp. 851-859.
[9] Dietterich, "Machine Learning Research," vol. 3, pp. 993-1022, Mar. 2003.
[10] Wang, Wei, Chuan Xiao, Xuemin Lin, and Chengqi Zhang, "Efficient approximate entity extraction with edit distance constraints," SIGMOD '09: Proc. 35th SIGMOD Int'l Conf. Management of Data, pp. 759-770, 2009.
[11] Winkler, William E., "The state of record linkage and current research problems," Technical Report R99/04, Statistics of Income Division, Internal Revenue Service Publication, 1999.
[12] Li, Chen, Bin Wang, and Xiaochun Yang, "VGRAM: improving performance of approximate queries on string collections using variable-length grams," VLDB '07: Proc. 33rd Int'l Conf. Very Large Data Bases, pp. 303-314, 2007.
[13] Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, and Keeley Crockett, "Sentence Similarity Based on Semantic Nets and Corpus Statistics," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 8, Aug. 2006.
[14] Oscar Méndez, Hiram Calvo, and Marco A. Moreno-Armendáriz, "A Reverse Dictionary Based on Semantic Analysis Using WordNet," Advances in Artificial Intelligence and Its Applications, Lecture Notes in Computer Science, vol. 8265, pp. 275-285, Springer, 2013.
[16] M. Porter, "The Porter Stemming Algorithm," http://tartarus.org/martin/PorterStemmer/, 2009.
SITE References
[17] http://dictionary.reference.com/reverse
[18] http://www.onelook.com/