This paper focuses on keyphrase extraction for news articles, since news is one of the most popular document genres on the web and most news articles carry no author-assigned keyphrases. Existing methods for single-document keyphrase extraction usually use only the information contained in the specified document. This paper proposes using a small number of nearest-neighbor documents to provide additional knowledge and improve single-document keyphrase extraction. Experimental results demonstrate the effectiveness and robustness of the proposed approach. According to experiments conducted on several documents, the top 10 extracted keyphrases are the most suitable.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S... (TELKOMNIKA JOURNAL)
Most scientific publishers encourage authors to provide keyphrases for their published articles. Hence, the need to automate keyphrase extraction has increased. However, it is not a trivial task, considering that keyphrase characteristics may overlap with those of non-keyphrases. To date, the accuracy of automatic keyphrase extraction approaches is still considerably low. In response to this gap, this paper proposes two contributions. First, a feature called fact-based sentiment is proposed. It is expected to strengthen keyphrase characteristics since, according to manual observation, most keyphrases are mentioned in neutral-to-positive sentiment. Second, a combination of supervised and unsupervised approaches is proposed to take the benefits of both. It enables automatic hidden-pattern detection while keeping candidate importance comparable across candidates. According to the evaluation, fact-based sentiment is quite effective for representing keyphraseness, and the semi-supervised approach is considerably effective for extracting keyphrases from scientific articles.
The document discusses predictive text analytics, including predicting text completions, disambiguating text, and correcting errors. It also discusses extracting entities, concepts, facts, and sentiments from unstructured text sources for applications like search, knowledge discovery, and predictive analytics. Key challenges include the complexity of human language with features like ambiguity and context.
The growing number of datasets published on the Web as linked data brings opportunities for high data availability. As the data grows, the challenges of querying it also grow: it is very difficult to search linked data using structured query languages, so keyword queries are used instead. In this paper, we propose different approaches to keyword query routing through which the efficiency of keyword search can be greatly improved. By routing keywords to the relevant data sources, the processing cost of keyword search queries can be greatly reduced. We contrast and compare four models: keyword level, element level, set level, and query expansion using semantic and linguistic analysis. These models are used for keyword query routing in keyword search.
International Journal of Computational Engineering Research (IJCER) (ijceronline)
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research that contributes significantly to scientific knowledge in engineering and technology.
This document provides an overview of information retrieval models, including vector space models, TF-IDF, Doc2Vec, and latent semantic analysis. It begins with basic concepts in information retrieval like document indexing and relevance scoring. Then it discusses vector space models and how documents and queries are represented as vectors. TF-IDF weighting is explained as assigning higher weight to rare terms. Doc2Vec is introduced as an extension of word2vec to learn document embeddings. Latent semantic analysis uses singular value decomposition to project documents to a latent semantic space. Implementation details and examples are provided for several models.
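To make the TF-IDF and cosine-scoring ideas in this overview concrete, here is a minimal self-contained sketch (a toy illustration, not any surveyed system's implementation; the corpus and tokenization are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Rare terms (low df) receive a higher idf weight.
        vectors.append({t: (count / len(doc)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [d.lower().split() for d in [
    "keyword extraction from news articles",
    "extraction of keyphrases from scientific articles",
    "clustering web documents by topic",
]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # docs 0 and 1 share "extraction"/"articles"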
An Improved Web Explorer using Explicit Semantic Similarity with ontology and... (INFOGAIN PUBLICATION)
The Improved Web Explorer aims at extracting and selecting the best possible hyperlinks and retrieving more accurate results for the entered search query. Candidate hyperlinks are ranked by weighted frequencies of the search-string words appearing in anchor texts and in the plain text of the title and body tags of the linked pages, so that relevant hyperlinks are retrieved from all available links. Ontology is then used to gain insight into the words of the search string by finding their hypernyms, hyponyms and synsets, reaching the source and context of those words. Explicit Semantic Similarity analysis together with a Naïve Bayes method is used to find the semantic similarity between lexically different terms, using Wikipedia and Google as explicit semantic analysis tools and calculating the probabilities of word occurrence in anchor and body texts. The Vector Space Model is used to calculate term frequency and inverse document frequency values and then compute cosine similarities between the entered query and the extracted hyperlinks, yielding relevance-ranked search results for the entered search string.
Semantic tagging for documents using 'short text' information (csandit)
Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers trying to quickly overview any document. With the ever-increasing volume and variety of documents published on the internet, interest in developing newer and more successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging occurs when the full content of the document is not readily accessible. In such a scenario, techniques which use “short text”, e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only “short text” information from the documents. We employ crowd-sourced knowledge from Wikipedia, Dbpedia, Freebase, Yago and similar open source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in, or “topic drift” from, the main topic of our query document. We used a real-world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full-text information from the document to generate tags. The proposed and baseline approaches were compared using the author-assigned keywords of the documents as the ground truth. We found that the tags generated using the proposed approach are better than the baseline's in terms of overlap with the ground-truth tags measured via Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in terms of the quality of tags generated and computational time.
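The Jaccard index behind the 0.058 vs. 0.044 comparison above is simple to reproduce; a minimal sketch with hypothetical tag sets:

```python
def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical example: predicted tags vs. author-assigned keywords.
predicted = {"semantic tagging", "knowledge base", "wikipedia", "short text"}
ground_truth = {"document tagging", "short text", "wikipedia"}
print(jaccard(predicted, ground_truth))  # 2 shared tags out of 5 -> 0.4
```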
Meta documents and query extension to enhance information retrieval process (eSAT Journals)
This document discusses enhancing information retrieval through the use of meta-documents and query extension. Meta-documents are used to annotate and represent web documents, assigning terms different levels of importance. Query extension involves adding semantically related terms from an ontology to user queries to expand their scope without changing the meaning. The cooperation of meta-documents and query extension aims to improve information retrieval by better matching queries to document representations and retrieving more relevant results. The approach is evaluated using standard information retrieval measures and shown to perform better than without query extension.
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
Using a keyword extraction pipeline to understand concepts in future work sec... (Kai Li)
This document describes a study that uses natural language processing and text mining techniques to identify future work statements in scientific papers and extract keywords from those statements. The researchers developed a multi-step pipeline to first identify the future work section, then select future work sentences within that section. They used rules and algorithms to identify sentences discussing future work. Keywords were then extracted from the selected sentences using the RAKE algorithm. An analysis found that 31.4% of papers contained future work statements, with medical science papers having the highest overlap between future work and title-abstract keywords. The researchers hope this work is a first step toward predicting future research topics.
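For reference, the RAKE scoring used in that pipeline boils down to degree/frequency ratios over candidate phrases. A simplified sketch (real RAKE also splits candidates at punctuation; the stopword list here is a stand-in):

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "in", "for", "and", "to", "we", "is", "this"}

def rake(text, top_k=5):
    """Score candidate phrases by summed word degree/frequency (RAKE)."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # Candidate phrases are maximal runs of non-stopwords.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # co-occurrence within the phrase, incl. itself
    score = lambda phrase: sum(degree[w] / freq[w] for w in phrase)
    ranked = sorted(set(map(tuple, phrases)), key=score, reverse=True)
    return [" ".join(p) for p in ranked[:top_k]]

print(rake("In future work we plan to extend the keyword extraction "
           "pipeline to full text and to evaluate keyword extraction quality."))
```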
Conceptual foundations of text mining and preprocessing steps (El Habib NFAOUI)
This document provides an overview of conceptual foundations and preprocessing steps for text mining. It discusses the differences between syntax and semantics in text, and presents a general framework for text analytics including preprocessing, representation, and knowledge discovery. For text representation, it describes bag-of-words models and vector space models, including frequency vectors, one-hot encoding, and TF-IDF weighting. It also provides an introduction to n-grams for representing sequential data.
The task of keyword extraction is to automatically identify a set of terms that best describe the document. Automatic keyword extraction establishes a foundation for various natural language processing applications: information retrieval, the automatic indexing and classification of documents, automatic summarization and high-level semantic description, etc. Although the keyword extraction applications usually work on single documents (document-oriented task), keyword extraction is also applicable to a more demanding task, i.e. the keyword extraction from a whole collection of documents or from an entire web site, or from tweets from Twitter. In the era of big-data, obtaining an effective and efficient method for automatic keyword extraction from huge amounts of multi-topic textual sources is of high importance.
We proposed a novel Selectivity-Based Keyword Extraction (SBKE) method, which extracts keywords from source text represented as a network. The node selectivity value is calculated from a weighted network as the average weight distributed on the links of a single node, and it is used in the procedure of keyword candidate ranking and extraction. Selectivity slightly outperforms extraction based on standard centrality measures; therefore selectivity and its modification, generalized selectivity, are included in the SBKE method as node centrality measures. Selectivity-based extraction does not require linguistic knowledge, as it is derived purely from statistical and structural information of the network, so it can be easily ported to new languages and used in a multilingual scenario. The true potential of the proposed SBKE method lies in its generality, portability and low computation cost, which positions it as a strong candidate for preparing collections that lack human annotations for keyword extraction. The portability of SBKE was tested on Croatian, Serbian and English texts: the method was developed on Croatian news and then ported to extraction from parallel abstracts of scientific publications in Serbian and English.
The constructed parallel corpus of scientific abstracts with annotated keywords allows a better comparison of the method's performance across languages, since we have a controlled experimental environment and data. The keyword extraction results measured with an F1 score are 49.57% for English and 46.73% for Serbian if we disregard keywords that are not present in the abstracts; evaluated against the whole keyword set, the F1 scores are 40.08% and 45.71% respectively. This work shows that SBKE can be easily ported to a new language, domain and type of text. Still, there is a drawback: the method can extract only words that appear in the text.
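To make the selectivity measure concrete: in a weighted co-occurrence network, a node's selectivity is its strength (sum of link weights) divided by its degree (number of distinct links). The sketch below is our reading of that definition over adjacent-word co-occurrences, not the authors' code:

```python
from collections import defaultdict

def selectivity_rank(tokens, top_k=5):
    """Rank words by node selectivity = strength / degree in a co-occurrence network."""
    weights = defaultdict(int)
    for a, b in zip(tokens, tokens[1:]):  # link adjacent words, weight = count
        if a != b:
            weights[frozenset((a, b))] += 1
    strength, degree = defaultdict(float), defaultdict(int)
    for edge, w in weights.items():
        for node in edge:
            strength[node] += w
            degree[node] += 1
    sel = {n: strength[n] / degree[n] for n in strength}
    return sorted(sel, key=sel.get, reverse=True)[:top_k]

tokens = ("keyword extraction ranks keyword candidates keyword extraction "
          "builds a network the network links words").split()
print(selectivity_rank(tokens))  # repeated collocations raise selectivity
```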
Documents do not always have the same content, yet similarity between documents occurs frequently in scientific writing. Some similarities are coincidental, while others are deliberate. In documents with little content, this can be checked by eye; in documents with thousands of lines and pages it is impossible. Anticipating this requires a way to detect the plagiarism techniques used. Many methods can examine the resemblance of documents, one of them being the Rabin-Karp algorithm. The algorithm works very well since it supports configurable substring lengths (k-grams). It counts how many hash values are the same in both documents, and the plagiarism percentage threshold can be adjusted according to the needs of the document examination. Implementing this algorithm is beneficial for an institution filtering incoming documents, typically at the time a scientific paper is received for publication.
Document Retrieval System, a Case Study (IJERA Editor)
In this work we propose a method for automatic indexing and retrieval that returns the documents most likely to be related to an input query. The technique used is singular value decomposition: a large term-by-document matrix is analyzed and decomposed into 100 factors, and each document is represented by a 100-item vector of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms.
Rabin-Karp is a search algorithm that looks for a substring pattern in a text using hashing, and it is well suited to matching many patterns at once. One practical application of the Rabin-Karp algorithm is plagiarism detection. Invented by Michael O. Rabin and Richard M. Karp, the algorithm performs string search using a hash function, whose values are compared between two documents to determine their level of similarity. Rabin-Karp is not very good for single-pattern search but is well suited to multiple-pattern search. The Levenshtein algorithm can replace the hash comparison in Rabin-Karp: where Rabin-Karp only counts the number of hash values shared by the two documents, computing the Levenshtein distance between them yields better accuracy.
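Both ideas are compact enough to sketch. The first block below is a minimal rolling-hash (Rabin-Karp) k-gram comparison; the base and modulus are arbitrary illustrative choices, and a small modulus like this invites hash collisions that a real implementation would verify against the actual substrings:

```python
def kgram_hashes(text, k, base=256, mod=101):
    """Rolling (Rabin-Karp) hashes of all k-grams in `text`."""
    if len(text) < k:
        return set()
    high = pow(base, k - 1, mod)  # weight of the outgoing character
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = {h}
    for i in range(k, len(text)):
        # Slide the window: drop text[i-k], append text[i].
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.add(h)
    return hashes

def similarity(doc_a, doc_b, k=5):
    """Share of matching k-gram hashes, a rough plagiarism signal."""
    ha, hb = kgram_hashes(doc_a, k), kgram_hashes(doc_b, k)
    return len(ha & hb) / len(ha | hb) if ha | hb else 0.0

print(similarity("the quick brown fox jumps", "the quick brown dog sleeps"))
```

And the Levenshtein distance mentioned as the higher-accuracy alternative, computed row by row:

```python
def levenshtein(a, b):
    """Minimum insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("plagiarism", "plagiarist"))  # one substitution -> 1
```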
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies (Max Irwin)
Presentation as given to the Haystack Conference, which outlines research and techniques for automatic extraction of keywords, concepts, and vocabularies from text corpora.
This document summarizes a study comparing two topic modeling techniques, Correlated Topic Models (CTM) and Latent Dirichlet Allocation (LDA), on Twitter data related to asthma. Two datasets were analyzed - tweets with the keyword "asthma" and tweets with the hashtag "#asthma". Both CTM and LDA were used to identify topics in each dataset for different numbers of clusters. The results show that CTM better captures the relationships between topics as the number of clusters increases, while LDA performs similarly for both small and large cluster sizes. The topics identified included terms like "asthma", "hygiene", "pets", "allergy", and "pollution".
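As a pointer to how such a comparison is set up, here is a minimal LDA sketch using gensim (an illustrative library choice; the toy tweets and topic count are made up):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy "tweets"; a real study would use the collected asthma datasets.
texts = [doc.lower().split() for doc in [
    "asthma inhaler allergy pollen asthma",
    "allergy pets dust hygiene asthma",
    "pollution city air smoke asthma",
    "air pollution traffic smoke city",
]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               random_state=0, passes=20)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)  # e.g. an allergy topic vs. a pollution topic
```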
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
Computing semantic similarity measure between words using web search engine (csandit)
Semantic similarity measures between words play an important role in information retrieval, natural language processing and various tasks on the web. In this paper, we propose a Modified Pattern Extraction Algorithm to compute a supervised semantic similarity measure between words by combining a page count method and a web snippets method. Four association measures are used to find semantic similarity between words from page counts returned by web search engines. We use Sequential Minimal Optimization (SMO) support vector machines (SVM) to find the optimal combination of page-count-based similarity scores and top-ranking patterns from the web snippets method. The SVM is trained to classify synonymous and non-synonymous word pairs. The proposed Modified Pattern Extraction Algorithm outperforms existing methods, achieving a correlation value of 89.8 percent.
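The abstract does not name the four association measures, but measures of this kind (Dice, pointwise mutual information and relatives computed over page counts) look roughly like the following sketch; the page counts and index size are hypothetical stand-ins for real search-engine hit counts:

```python
import math

def web_dice(c_p, c_q, c_pq):
    """Dice coefficient over page counts for P, Q, and 'P AND Q'."""
    return 2 * c_pq / (c_p + c_q) if c_p + c_q else 0.0

def web_pmi(c_p, c_q, c_pq, n):
    """Pointwise mutual information over page counts; n is the index size."""
    if min(c_p, c_q, c_pq) == 0:
        return 0.0
    return math.log2((c_pq / n) / ((c_p / n) * (c_q / n)))

# Hypothetical counts for "car", "automobile", and "car AND automobile".
print(web_dice(3_000_000, 700_000, 250_000))
print(web_pmi(3_000_000, 700_000, 250_000, n=10**10))
```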
Text categorization is a term that has intrigued researchers for quite some time. It is the concept of assigning news articles to specific groups, cutting down the effort of manually categorizing them. A growing number of statistical classification and machine learning techniques have been applied to text categorization. This paper addresses the automatic categorization of news articles via clustering with the k-means algorithm: k-means is used for clustering, and a TF-IDF dictionary is used for term weighting during categorization. This is implemented on Mahout as the platform.
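A minimal version of that pipeline can be sketched with scikit-learn standing in for Mahout (an assumption for illustration only; the paper itself runs on Mahout):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "stock markets rally as tech shares climb",
    "central bank raises interest rates again",
    "local team wins the championship final",
    "injured striker to miss the season opener",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)  # TF-IDF document vectors

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label, text in zip(km.labels_, articles):
    print(label, text)  # finance articles vs. sports articles
```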
Mining academic social networks is becoming increasingly necessary as the amount of data grows, and it is a favorite research topic for many researchers. Data mining techniques are used to mine academic social networks. In this paper, we present an efficient frequent itemset mining technique for academic social networks. The proposed framework first processes the research documents, then applies enhanced frequent itemset mining to find the strength of relationships between researchers. The proposed method is faster than older algorithms and requires less main memory for computation.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document: they reflect the document's content and act as indexes for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (inverse document frequency) along with TF (term frequency) to extract keywords, and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences requested by the user, a summary is generated.
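The GSS (Galavotti, Sebastiani, Simi) coefficient used above can be computed from a term-category contingency table; a minimal sketch with hypothetical counts:

```python
def gss(n_tc, n_tc_bar, n_t_bar_c, n_t_bar_c_bar):
    """GSS coefficient: P(t,c)P(~t,~c) - P(t,~c)P(~t,c)."""
    n = n_tc + n_tc_bar + n_t_bar_c + n_t_bar_c_bar
    p = lambda x: x / n
    return p(n_tc) * p(n_t_bar_c_bar) - p(n_tc_bar) * p(n_t_bar_c)

# Hypothetical counts out of a 1000-document corpus: docs with term t in
# category c, with t outside c, with c but without t, and with neither.
print(gss(45, 5, 55, 895))    # strongly category-indicative term
print(gss(20, 180, 80, 720))  # term independent of the category -> 0.0
```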
This chapter introduces the concept of data and discusses some examples to illustrate key ideas:
1) The same data can be summarized in different ways (e.g. mean vs median) leading to different conclusions.
2) Self-selected samples from online polls cannot be used to make statistical inferences about a population.
3) Randomized experiments are needed to establish cause-and-effect relationships and eliminate bias, but industry-funded research may still be biased in its methods or presentation.
The talk at TYPO3 DevDays 2015 in Nuremberg which explains the deep insights of how search works. TF-IDF algorithm, vector space model and how that is used in Lucene and therefore Solr and Elasticsearch.
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS (IJDKP)
Big Data creates many challenges for data mining experts, in particular in extracting meaning from text data. Text mining benefits from building a bridge between the word embedding process and the capacity of graphs to connect the dots and represent complex correlations between entities. In this study we examine the process of building a semantic graph model to determine word associations and discover document topics. We introduce a novel Word2Vec2Graph model built on top of the Word2Vec word embedding model and demonstrate how it can be used to analyze long documents, surface unexpected word associations and uncover document topics. To validate the topic discovery method, we transform words to vectors and vectors to images, and apply CNN deep learning image classification.
Slides for the class, From Pattern Matching to Knowledge Discovery Using Text Mining and Visualization Techniques, presented June 13, 2010, at the Special Libraries Association 2010 annual meeting.
Topic detection by clustering and text mining (IRJET Journal)
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
This document proposes using Word2Vec and decision trees to extract keywords from textual documents and classify the documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach involves preprocessing text, representing words as vectors with Word2Vec, calculating frequently occurring keywords for each category, and using decision trees to classify documents based on keyword similarity. Experiments using different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
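The Word2Vec representation step can be sketched with gensim (an illustrative library choice; the toy corpus and hyperparameters are made up, and results on a corpus this small are noisy):

```python
from gensim.models import Word2Vec

# Toy corpus; real experiments train on the full document collection.
sentences = [s.split() for s in [
    "keyword extraction finds important words in documents",
    "keyword extraction supports document classification",
    "decision trees classify documents by keyword similarity",
    "word vectors capture similarity between words",
]]

model = Word2Vec(sentences, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=200, seed=0)

# Words seen in similar contexts end up with nearby embeddings
# (shown only to illustrate the API on a tiny corpus).
print(model.wv.most_similar("keyword", topn=3))
```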
A Novel Approach for Keyword extraction in learning objects using text mining (IJSRD)
Keyword extraction and concept finding in learning objects are important subjects in today’s eLearning environment. Keywords are the subset of words that carries useful information about the content of a document, and keyword extraction is the process of obtaining the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process together with the WordNet dictionary. WordNet is a lexical database of English used to measure similarity among the candidate words; the words with the highest similarity are taken as keywords.
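The WordNet similarity lookup described above can be sketched with NLTK (assuming the wordnet corpus has been downloaded; the candidate words are illustrative):

```python
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def max_path_similarity(word_a, word_b):
    """Highest path similarity over all synset pairs of two words."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(word_a)
              for s2 in wn.synsets(word_b)]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

# Hypothetical candidates from a learning object about networks.
for candidate in ["router", "protocol", "banana"]:
    print(candidate, max_path_similarity("network", candidate))
```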
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K... (cscpconf)
The goal of document clustering algorithms is to create clusters that are coherent internally but clearly different from each other. The useful expressions in documents are often accompanied by a large amount of noise caused by unnecessary words, so it is indispensable to eliminate that noise and keep just the useful information. Keyphrase extraction systems for Arabic are a new phenomenon, and a number of text mining applications can use them to improve their results. Keyphrases are defined as phrases that capture the main topics discussed in a document; they offer a brief and precise summary of document content and can therefore be a good way to remove the noise from documents. In this paper, we propose a new method to solve this problem, especially for documents in Arabic, one of the most complex languages, using a new keyphrase extraction algorithm based on the suffix tree data structure (KpST). To evaluate our approach, we conduct an experimental study on Arabic document clustering using the most popular family of hierarchical algorithms, agglomerative hierarchical clustering, with seven linkage techniques and a variety of distance functions and similarity measures. The obtained results show that our approach for extracting keyphrases improves the clustering results.
Construction of Keyword Extraction using Statistical Approaches and Document ... (IJERA Editor)
Organizing the continuing growth of dynamic unstructured documents is a major challenge for field experts, and handling such unorganized documents is expensive. Clustering dynamic documents helps reduce that cost, and clustering documents by analysing their keywords is one of the best ways to organize unstructured dynamic collections. Statistical analysis is a well-suited adaptive method for extracting keywords from documents. In this paper an algorithm is proposed to cluster documents. It has two parts: the first extracts keywords using a statistical method, and the second constructs the clusters from the keywords using an agglomerative method. The proposed algorithm achieves more than 90% accuracy.
Classification of News and Research Articles Using Text Pattern Mining (IOSR Journals)
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
In this paper, we present three techniques for incorporating syntactic metadata in a textual retrieval system. The first technique involves a syntactic analysis of the query alone: it generates a different weight for each query term depending on its grammar category in the query phrase, and these weights are used for each term in the retrieval process. The second technique is a storage optimization of the system's inverted index: the index stores only terms that are subjects or predicates in the documents they appear in. Finally, the third technique builds a full syntactic index, meaning that for each term in the collection the index stores, besides the term frequency and the inverse document frequency, the grammar category of the term for each of its occurrences in a document.
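The third technique can be pictured as an inverted index whose postings carry a grammar category per occurrence. A minimal sketch (the tags are supplied by hand here; a real system would obtain them from a syntactic analyzer):

```python
from collections import defaultdict

# Postings: term -> list of (doc_id, position, grammar_category).
index = defaultdict(list)

def add_document(doc_id, tagged_tokens):
    """Index a document given as (word, grammar_category) pairs."""
    for pos, (word, tag) in enumerate(tagged_tokens):
        index[word.lower()].append((doc_id, pos, tag))

# Hand-tagged toy documents.
add_document(1, [("dogs", "SUBJECT"), ("chase", "PREDICATE"), ("cats", "OBJECT")])
add_document(2, [("cats", "SUBJECT"), ("sleep", "PREDICATE")])

# Query: documents where "cats" occurs as a subject.
print([d for d, _, tag in index["cats"] if tag == "SUBJECT"])  # [2]
```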
The document discusses text classification and summarization techniques for complex domain-specific documents like research papers. It reviews various preprocessing approaches like stopword removal, lemmatizing, tokenization, and stemming. It also compares different machine learning algorithms for text classification, including Naive Bayes, decision trees, SVM, KNN, and neural networks. The document surveys works analyzing domain-specific documents using these techniques, such as biomedical document relation extraction and research paper topic classification.
Technical Whitepaper: A Knowledge Correlation Search Engine (s0P5a41b)
For the technically oriented reader, this brief paper describes the technical foundation of the Knowledge Correlation Search Engine - patented by Make Sence, Inc.
IRJET - Automatic Recapitulation of Text Document (IRJET Journal)
The document describes an approach for automatically generating summaries of text documents. It involves preprocessing the input text by tokenizing, removing stop words, and stemming words. It then extracts features from the preprocessed text, such as term frequency-inverse document frequency (tf-idf) values. Sentences are scored based on these features and sentences with higher scores are selected to form the summary. Keywords from the text are also identified using WordNet to help select the most relevant sentences for the summary. The proposed approach aims to generate concise yet meaningful summaries using natural language processing techniques.
Survey on Key Phrase Extraction using Machine Learning Approaches - YogeshIJTSRD
The automated keyword extraction task is to define a collection of representative terms for the text. Extracting keywords defines a small collection of terms, key phrases and keywords that define the document’s context. Keyword search allows large document collections to be searched effectively. To allocate suitable key phrases to new documents, text categorization techniques can be applied. A predefined collection of key phrases from which all key phrases for new documents are selected is given in the training documents. The training data for each key phrase describes a collection of documents associated with it. Standard machine learning techniques are used for each key phrase to construct a classifier from the training materials, using those relevant to it as positive examples and the rest as negative examples. Provided a new text, it is processed by the classifier of each key phrase. Preeti Sondhi | Aakib Jabbar "Survey on Key Phrase Extraction using Machine Learning Approaches" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3 , April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd39890.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/39890/survey-on-key-phrase-extraction-using-machine-learning-approaches/preeti-sondhi
USING GOOGLE'S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION - IJDKP
The document describes a new method for multi-domain document classification using keyword sequences extracted from documents. It introduces the Word AdHoc Network (WANET) system which uses Google's Keyword Relation and a new similarity measurement called Google Purity to classify documents into domains based on their extracted 4-word keyword sequences, without requiring pre-established keyword repositories. Experimental results showed the classification was accurate and efficient, allowing cross-domain classification and management of knowledge from different sources.
Improving Annotations in Digital Documents using Document Features and Fuzzy ... - IRJET Journal
The document proposes a system to automatically annotate digital documents using document features extracted via natural language processing techniques and fuzzy logic. It aims to improve on existing annotation systems by maintaining semantic accuracy while annotating large amounts of documents. The system first extracts features from documents like titles, sentence length, proper nouns etc. It then uses fuzzy logic to apply the best possible annotations based on weighted feature values. The approach is meant to accurately annotate documents in all conditions while preserving semantic meaning.
An index is a database that stores information collected from documents in a way that allows quick retrieval. It maps words to their locations in documents. Different indexing methods exist, including inverted indexes and latent semantic indexing (LSI). Probabilistic latent semantic indexing (PLSI) is an improvement over LSI as it has a stronger statistical model and can handle issues like synonyms and multiple meanings of words better. Creating accurate indexes is important for search engines to return relevant results but it involves challenges like document formatting, language processing, and updating as new information is added.
Automatize Document Topic And Subtopic Detection With Support Of A Corpus - Richard Hogue
This document discusses previous research on automatic topic and subtopic detection from documents. It provides an overview of various approaches that have been used, including text segmentation, text clustering, frequent word sequences, word clustering, concept-based classification, probabilistic topic modeling, and agglomerative clustering. The document then proposes a new method called paragraph extension, which treats a document as a set of paragraphs and uses a paragraph merging technique and corpus of related words to detect topics and subtopics.
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS - kevig
In this paper a methodology is proposed to mine concepts from documents and use them to generate an objective summary of the claims section of patent documents. The Conceptual Graph (CG) formalism, as proposed by Sowa (Sowa 1984), is used in this work for representing the concepts and their relationships. Automatic identification of concepts and conceptual relations in text documents is a challenging task. The focus here is on the analysis of patent documents, mainly on the claims section, whose writing style presents several complexities because the documents are both technical and legal. It is observed that the general in-depth parsers available in the open domain fail to parse the claims-section sentences of patent documents, which motivated the development of a methodology for extracting CGs using other resources. Thus, the present work uses shallow parsing, NER and machine learning techniques for extracting concepts and conceptual relationships from sentences in the claims section of patent documents. The paper discusses i) generation of a CG, a semantic network, and ii) generation of an abstractive summary of the claims section of the patent; the aim is a summary that is 30% of the whole claims section. Restricted Boltzmann Machines (RBMs), a deep learning technique, are used for automatically extracting CGs. The methodology was tested on a corpus of 5000 patent documents from the electronics domain, and the results are encouraging and comparable with state-of-the-art systems.
Keyphrase Extraction using Neighborhood Knowledge
International Journal for Modern Trends in Science and Technology
ISSN: 2455-3778 | Volume No: 02 | Issue No: 02 | February 2016
I. INTRODUCTION
Keyphrases are words or groups of words that capture the key ideas of a document, and automatic keyphrase extraction is the automatic selection of the important keyphrases in a document. Appropriate keyphrases can serve as a highly condensed summary for a document; they can be used as a label for the document to supplement or replace the title or summary, or they can be highlighted within the body of the document to facilitate users' fast browsing and reading. Keyphrase extraction has a wide range of applications: document keyphrases have been successfully used in IR and NLP tasks such as document indexing, document classification, document clustering and summarization. Keyphrases are usually manually assigned by authors, especially for journal or conference articles, but since document corpora are growing rapidly, automating the identification of keyphrases is necessary. Generated keyphrases do not necessarily appear in the body of the document. Though keyphrases play a vital role in information extraction, only a minority of documents come with keyphrases. Some commercial software products, such as Word 97 and Tetranet, use automatic keyphrase extraction. Most previous works focus on keyphrase extraction for journal or conference articles, while this work focuses on keyphrase extraction for news articles, because the news article is one of the most popular document genres on the web and most news articles have no author-assigned keyphrases.
II. MOTIVATION
Existing methods conduct the keyphrase extraction task using only the information contained in the specified document, including each phrase's TF-IDF, position and other syntactic information in the document. One common assumption of existing methods is that documents are independent of each other, so keyphrase extraction is conducted separately for each document, without interactions. However, some topic-related documents actually have mutual influences and contain useful clues that can help to extract keyphrases from each other. For example, two documents about the same topic "politics" would share a few common phrases, e.g. "government" and "political party", and they can provide additional knowledge for each other to better evaluate and extract salient keyphrases. Therefore, given a specified document, we can retrieve a few documents topically close to it from a large corpus through search engines, and these neighbor documents are deemed beneficial for evaluating and extracting keyphrases from the specified document, because they provide more knowledge and clues for keyphrase extraction.
III. LITERATURE SURVEY
Several papers explore the task of producing a summary of a document by extracting key sentences from it. This task is similar to keyphrase extraction, but it is more difficult. The extracted sentences often lack cohesion because anaphoric references are not resolved. Anaphors are pronouns (e.g., "it", "they"), definite noun phrases (e.g., "the car") and demonstratives (e.g., "this", "these") that refer to previously discussed concepts. When a sentence is extracted out of the context of its neighboring sentences, it may be impossible or very difficult for the reader of the summary to determine the referents of the anaphors. Keyphrase extraction avoids this problem because anaphors, by their nature, are not keyphrases. Also, a list of keyphrases has no structure; unlike a list of sentences, a list of keyphrases can be randomly permuted without significant consequences.
Most of these papers on summarization by sentence extraction describe algorithms based on manually derived heuristics. The heuristics tend to be effective for the intended domain, but they often do not generalize well to a new domain, and extending them to a new domain involves a significant amount of manual work. A few of the papers describe learning algorithms, which can be trained by supplying documents with associated target summaries. Learning algorithms can be extended to new domains with less work than algorithms that use manually derived heuristics. However, there is still some manual work involved, because the training summaries must be composed of sentences that appear in the document, which means that standard author-supplied abstracts are not suitable. An advantage of keyphrase extraction is that standard author-supplied keyphrases are suitable for training a learning algorithm, because the majority of such keyphrases appear in the bodies of the corresponding documents.
Another related line of work addresses the task of information extraction. An information extraction system seeks specific information in a document, according to a predefined template that is specific to a given topic area. For example, if the topic area is news reports of terrorist attacks, the template might specify that the information extraction system should identify (i) the terrorist organization involved in the attack, (ii) the victims of the attack, (iii) the type of attack (kidnapping, murder, etc.), and other information of this type; information extraction systems are evaluated with corpora in various topic areas, including terrorist attacks and corporate mergers. Information extraction and keyphrase extraction are at opposite ends of a continuum that ranges from detailed, specific and domain-dependent (information extraction) to condensed, general and domain-independent (keyphrase extraction). The different ends of this continuum require substantially different algorithms. However, there are intermediate points on this continuum; an example is the task of identifying corporate names in business news.
IV. PROPOSED SYSTEM
In order to extract keyphrases using neighborhood knowledge, we first need to identify the neighborhood documents, using the following techniques.
A. Cosine similarity:
The cosine similarity between two vectors (or two documents in the vector space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation and not magnitude; it can be seen as a comparison between documents on a normalized space. To build the cosine similarity equation we solve the dot product equation for cos(θ); cosine similarity thus produces a metric that says how related two documents are by looking at the angle between them instead of their magnitude.
The cosine of two vectors can be derived from the Euclidean dot product formula:

$a \cdot b = \|a\| \, \|b\| \cos(\theta)$   (1)
Given two vectors of attributes, A and B, the cosine similarity cos(θ) is represented using a dot product and magnitudes as

$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$   (2)
Similar scores: the score vectors point in the same direction; the angle between them is 0 degrees and the cosine is near 1 (100%).
Unrelated scores: the score vectors are nearly orthogonal; the angle is 90 degrees and the cosine is near 0 (0%).
Opposite scores: the score vectors point in opposite directions; the angle is 180 degrees and the cosine is near -1 (-100%).
Fig 1: Cosine similarity values for different documents
Note that even if one vector points to a point far from another vector, the two can still have a small angle, and that is the central point of using cosine similarity: the measurement tends to ignore the higher term counts in documents. Suppose we have a document in which the word "earth" appears 200 times and another in which "earth" appears 50 times; the Euclidean distance between them will be large, but the angle will still be small because they point in the same direction, which is what matters when we are comparing documents.
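To make the computation concrete, the following is a minimal Java sketch of cosine similarity over term-frequency vectors (the class name, method name and dense-array representation are our own illustration, not code from the paper's system):

    public class CosineSimilarity {
        // Computes cos(theta) between two term-frequency vectors of equal length,
        // following Equation (2): dot product divided by the product of magnitudes.
        public static double cosine(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            if (normA == 0.0 || normB == 0.0) return 0.0; // empty document: no similarity
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // Two toy documents over the same three-term vocabulary.
            double[] d1 = {200, 3, 0};
            double[] d2 = {50, 1, 0};
            System.out.println(cosine(d1, d2)); // close to 1: same direction despite different counts
        }
    }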
B. TF-IDF:
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the tf-idf of each query term; many more sophisticated ranking functions are variants of this simple model. Tf-idf can also be used successfully for stopword filtering in various subject fields, including text summarization and classification.
Typically, the tf-idf weight is composed of two terms: the first computes the normalized term frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second term is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
TF: Term frequency, which measures how frequently a term occurs in a document. Since every document is different in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF: Inverse document frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of" and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
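As an illustration, here is a small Java sketch of these two formulas over an in-memory corpus (the class, method names and list-based corpus representation are ours, not part of the paper's system):

    import java.util.*;

    public class TfIdf {
        // TF(t) = count of t in the document / total terms in the document
        public static double tf(List<String> doc, String term) {
            long count = doc.stream().filter(term::equals).count();
            return (double) count / doc.size();
        }

        // IDF(t) = ln(total documents / documents containing t)
        public static double idf(List<List<String>> corpus, String term) {
            long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
            if (docsWithTerm == 0) return 0.0; // term absent from corpus
            return Math.log((double) corpus.size() / docsWithTerm);
        }

        public static void main(String[] args) {
            List<List<String>> corpus = List.of(
                List.of("govern", "tax", "bill"),
                List.of("court", "justic", "tax"),
                List.of("farmer", "fodder", "price"));
            List<String> doc = corpus.get(0);
            System.out.println(tf(doc, "tax") * idf(corpus, "tax")); // tf-idf weight of "tax"
        }
    }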
The use of neighborhood information is worth more discussion. Because neighbor documents might not be sampled from the same generative model as the specified document, we probably do not want to trust them as much as the specified document. Thus a confidence value is associated with every document in the expanded document set, which reflects our belief that the document is sampled from the same underlying model as the specified document. When a document is close to the specified one, the confidence value is high; when it is farther apart, the confidence value is reduced. Heuristically, we use the cosine similarity between a document and the specified document as the confidence value. The confidence values of the neighbor documents are incorporated into the keyphrase extraction algorithm.
V. SYSTEM ARCHITECTURE
Fig 2: System architecture for keyphrase extraction (documents, identified by name and size, are processed into keyphrases)
VI. IMPLEMENTATION
In order to construct the neighborhood documents, we divide our work into the following modules.
1. INDEXING:
A document index improves the speed of data retrieval operations. Indexes are used to quickly locate data without having to search the whole document.
Example: I am the king of cards and she is the queen of roses and we are the rulers of the world. You are the greatest king.
Output: king; card; great; queen; rose; ruler; world
Indexing helps in storing data to facilitate fast and accurate information retrieval. To form the index, the system goes through various phases.
(Overall pipeline: word tokenization → stopword removal → stemming → TF indexing → segmentation → parts-of-speech tagging → chunking → neighborhood construction → neighborhood-level word evaluation → document-level keyphrase extraction.)
TOKENIZATION:
Tokenization means splitting text into meaningful words.
Example: I am the king of cards and she is the queen of roses and we are the rulers of the world. You are the greatest king.
The number of tokens in the above sentence is 26.
Tokenization relies on punctuation and whitespace; punctuation may or may not be included in the resulting list of tokens. Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
In Java, text is divided into tokens using the split method of the String class.
SYNTAX:
public String[] split(String regex, int limit)
public String[] split(String regex)
Parameters: regex, the delimiting regular expression; limit, the result threshold, i.e. how many strings are to be returned.
Return value: the array of strings computed by splitting this string around matches of the given regular expression.
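For example, a minimal tokenization call on the sample sentence used above might look like this (a sketch, not the paper's exact code):

    String text = "I am the king of cards and she is the queen of roses "
                + "and we are the rulers of the world. You are the greatest king.";
    // Split on runs of whitespace; punctuation still clings to adjacent words here
    // and is removed in the next phase (special character removal).
    String[] tokens = text.split("\\s+");
    System.out.println(tokens.length); // 26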
SPECIAL CHARACTER REMOVAL
This phase removes special characters ( ) " [ ] ' ; / : . ! @ # $ % ^ & * { } | ? < > and the digits 0-9.
We use the String.replaceAll() method in Java.
Syntax:
public String replaceAll(String regex, String replacement)
Example:
String str = "hello +$ % ^ & * { } | ? < > world [ ] ' ; / : . !";
str = str.replaceAll("[^a-zA-Z ]", "").replaceAll(" +", " ").trim();
Output: hello world
STOPWORDS:
Given a list of stopwords and a plain text document, the output is the set of words remaining after removing the stopwords from the document. In our work we use a list of 176 stopwords, which is loaded when required.
Syntax:
public static boolean isStopWord(String word, String[] stopwords)
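A straightforward implementation of this signature might look as follows (a sketch; the 176-word list itself is abbreviated here to a stand-in):

    public class StopWords {
        public static boolean isStopWord(String word, String[] stopwords) {
            // Linear scan over the stopword list; case-insensitive match.
            for (String stop : stopwords) {
                if (stop.equalsIgnoreCase(word)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            // Abbreviated stand-in for the paper's 176-word list.
            String[] stopwords = {"i", "am", "the", "of", "and", "she", "is", "we", "are", "you"};
            System.out.println(isStopWord("the", stopwords));  // true
            System.out.println(isStopWord("king", stopwords)); // false
        }
    }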
STEMMING:
Stemming is mainly used to reduce word forms to a common base class. The most widely used algorithm is the Porter stemming algorithm [13] (or "Porter stemmer"), a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalization process that is usually done when setting up information retrieval systems.
Algorithm:
Step 1a: Replace plurals. Example: stresses → stress, gaps → gap, cries → cri.
Step 1b: Delete ed, edly, ing, ingly suffixes. Example: fished → fish, repeatedly → repeat, interestingly → interest.
Step 1c: Replace y by i. Example: happy → happi, sky → sky.
Step 2: Replace ational, tional, iveness, fulness, ization etc. by ate, tion, ive, ful, ize respectively. Example: relational → relate, conditional → condition, effectiveness → effective, hopefulness → hopeful.
Step 3: Replace icate, alize, iciti, ical by ic, al, ic, ic and delete ative, ful, ness. Example: triplicate → triplic, formalize → formal, electricity → electric, electrical → electric, formative → form, hopeful → hope, goodness → good.
Step 4: Delete al, ance, ence, er etc. Example: revival → reviv, allowance → allow, inference → infer, airliner → airlin.
Step 5a: Delete a final e. Example: probate → probat, rate → rate.
Step 5b: Map a double suffix to a single one. Example: controll → control, roll → roll.
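Putting the indexing phases together, a sketch of the preprocessing pipeline might look like the following in Java. The PorterStemmer interface below is a stand-in for any Porter implementation (e.g., the reference Java version distributed with Porter's paper); its stem method is an assumed API, not something defined in this paper:

    import java.util.*;

    public class Indexer {
        // Stand-in for any Porter stemmer implementation (assumed API, not defined here).
        interface PorterStemmer { String stem(String word); }

        // Builds the index word list for one document:
        // tokenize -> strip special characters -> drop stopwords -> stem.
        public static List<String> index(String text, String[] stopwords, PorterStemmer stemmer) {
            List<String> indexed = new ArrayList<>();
            for (String token : text.split("\\s+")) {
                String word = token.replaceAll("[^a-zA-Z]", "").toLowerCase();
                if (word.isEmpty() || StopWords.isStopWord(word, stopwords)) continue;
                indexed.add(stemmer.stem(word));
            }
            return indexed; // term frequencies can then be counted over this list
        }
    }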
Fig 3: Creating an index file (documents → word tokenization using split → removal of special characters → stopword removal → Porter stemmer → index files)
2. SEGMENTATION, CHUNKING, POS TAGGING
For segmentation, chunking and POS tagging we use a tool known as JTextPro, which performs all these operations.
JTEXTPRO:
JTextPro, developed by Xuan-Hieu Phan, is a Java-based text processing toolkit that currently includes the important processing steps for natural language/text processing listed below:
Sentence boundary detection
Sentence boundary detection is done using a maximum entropy classifier trained on the WSJ corpus.
Word tokenization
Tokenization is the process of breaking sentences into tokens.
SYNTAX:
public PENNTokenizer<Word> newPENNTokenizer(Reader r)
Constructs a new PENNTokenizer that treats carriage returns as normal whitespace.
PARAMETERS:
r: the Reader whose contents will be tokenized
Returns: a PENNTokenizer that tokenizes a stream of objects of type Word.
The tokenizer divides sentences into words known as tokens.
PART-OF-SPEECH TAGGING
Part-of-speech (POS) tagging assigns each word in a sentence to the part of speech that it belongs to. The POS tagger uses conditional random fields and follows the University of Pennsylvania (Penn Treebank) tagging guidelines.
Input: a string of words and a tagset
Output: a single best tag for each word
Example: Input: current account
Output: current/JJ account/NN
PHRASE CHUNKING
Text chunking subsumes a range of tasks. The simplest is finding noun groups; more ambitious systems may add additional chunk types, such as verb groups, or may seek a complete partition of sentences into chunks of different types. Chunking also uses conditional random fields. Chunk tags are made up of two parts:
First part:
B: marks the beginning of a chunk
I: marks the continuation of a chunk
E: marks the end of a chunk
Second part:
NP: noun chunk
VP: verb chunk
Example:
Word      POS tag   Chunk tag
He        PRP       B-NP
reckons   VBZ       B-VP
current   JJ        I-NP
account   NN        I-NP
deficit   NN        I-NP
will      MD        B-VP
narrow    VB        I-VP
to        TO        B-PP
only      RB        B-NP
#         #         I-NP
1.8       CD        I-NP
billion   CD        I-NP
Table 1: POS tagging and chunking of the words
Fig 4: Segmentation, POS tagging and chunking (sentence segmentation → word tokenization following Penn conventions → POS tagging using conditional random fields → phrase chunking using conditional random fields)
3. NEIGHBORHOOD CONSTRUCTION
Given a specified document d_0, neighborhood construction aims to find a few nearest neighbors for the document from a text corpus or on the Web. The k neighbor documents d_1, d_2, ..., d_k and the specified document d_0 build the expanded document set D = {d_0, d_1, d_2, ..., d_k} for d_0, which can be considered as the expanded knowledge context
for document d_0. The neighbor documents can be obtained using the technique of document similarity search, which finds documents similar to a query document in a text corpus and returns a ranked list of similar documents to users. The effectiveness of document similarity search relies on the function for evaluating the similarity between two documents. In this study, we use the widely used cosine measure to evaluate document similarity.
Hence the similarity sim_doc(d_i, d_j) between documents d_i and d_j can be defined as the normalized inner product of the two term vectors:

$sim_{doc}(d_i, d_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\|\vec{d}_i\| \, \|\vec{d}_j\|}$   (3)
In the experiments, we simply use the cosine measure to compute the pairwise similarity between the specified document d_0 and the documents in the corpus, and then choose the k documents (different from d_0) with the largest similarity values as the nearest neighbors of d_0. Finally, there are in total k+1 documents in the expanded document set. For the document set D = {d_0, d_1, d_2, ..., d_k}, the pairwise cosine similarity values between documents are calculated and recorded for later use. The efficiency of document similarity search can be significantly improved by adopting an index structure in the implemented system, such as the K-D-B tree, R-tree, SS-tree, SR-tree or X-tree (Böhm & Berchtold [14]).
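A sketch of this selection step in Java, reusing the cosine routine from the earlier sketch (the dense-vector corpus representation and names are our own assumptions):

    import java.util.*;

    public class Neighborhood {
        // Returns the indices of the k corpus documents most similar to the query
        // vector, ranked by cosine similarity; the expanded set is the query plus these k.
        public static List<Integer> nearestNeighbors(double[] query, double[][] corpus, int k) {
            Integer[] ids = new Integer[corpus.length];
            double[] sims = new double[corpus.length];
            for (int i = 0; i < corpus.length; i++) {
                ids[i] = i;
                sims[i] = CosineSimilarity.cosine(query, corpus[i]);
            }
            // Sort document ids by descending similarity and keep the top k.
            Arrays.sort(ids, (a, b) -> Double.compare(sims[b], sims[a]));
            return Arrays.asList(ids).subList(0, Math.min(k, ids.length));
        }
    }

The similarity values computed here double as the confidence values described above, so they would be recorded alongside the neighbor indices for use in Equation (4).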
Fig 5: Construction of the neighborhood (indexing output and document names → cosine similarity → neighborhood construction)
4. KEYPHRASE EXTRACTION
Keyphrase extraction is done in two phases:
Neighborhood-level word evaluation
Document-level keyphrase extraction
NEIGHBORHOOD-LEVEL WORD EVALUATION
Like the PageRank algorithm [15], the graph-based ranking algorithm employed in this study is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of "voting" or "recommendation" between the vertices. A link between two vertices is considered as a vote cast from one vertex to the other. The score associated with a vertex is determined by the votes cast for it and by the scores of the vertices casting those votes. Formally, given the expanded document set D, let G = (V, E) be an undirected graph reflecting the relationships between words in the document set. V is the set of vertices, and each vertex is a candidate word in the document set. Because not all words in the documents are good indicators of keyphrases, the words added to the graph are restricted with syntactic filters, i.e., only words with a certain part of speech are added. As in Mihalcea and Tarau [16], the documents are tagged by a POS tagger, and only the nouns and adjectives are added to the vertex set. E is the set of edges, a subset of V × V. Each edge e_ij in E is associated with an affinity weight aff(v_i, v_j) between words v_i and v_j. The weight is computed based on the co-occurrence relation between the two words, controlled by the distance between word occurrences; the co-occurrence relation can express cohesion relationships between words. Two vertices are connected if the corresponding words co-occur at least once within a window of at most w words, where w can be set anywhere from 2 to 20. The affinity weight aff(v_i, v_j) is simply set to the count of the controlled co-occurrences between the words v_i and v_j in the whole document set, as follows:
$aff(v_i, v_j) = \sum_{d_p \in D} sim_{doc}(d_0, d_p) \cdot count_{d_p}(v_i, v_j)$   (4)

where count_{d_p}(v_i, v_j) is the count of the controlled co-occurrences between words v_i and v_j in document d_p, and sim_doc(d_0, d_p) is the similarity factor reflecting the confidence value for using document d_p (0 ≤ p ≤ k) in the expanded document set. The graph is built based on the whole document set and it can reflect the global information in the neighborhood, which is why it is called the
Global Affinity Graph. We use an affinity matrix M to describe G, with each entry corresponding to the weight of an edge in the graph. M = (M_ij)_{|V|×|V|} is defined as follows:

$M_{ij} = \begin{cases} aff(v_i, v_j), & \text{if } v_i \text{ links with } v_j \text{ and } i \neq j \\ 0, & \text{otherwise} \end{cases}$   (5)

Then M is normalized to $\tilde{M}$ as follows, to make the sum of each row equal to 1:

$\tilde{M}_{ij} = \begin{cases} M_{ij} \big/ \sum_{j'=1}^{|V|} M_{ij'}, & \text{if } \sum_{j'=1}^{|V|} M_{ij'} \neq 0 \\ 0, & \text{otherwise} \end{cases}$   (6)
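As a sketch, building and row-normalizing the affinity matrix from windowed co-occurrences might look like this in Java (the vocabulary-id document representation and names are our own; the window size w and the per-document confidence factor follow Equations (4)-(6)):

    public class AffinityGraph {
        // docs[p] is the filtered word sequence of document p, as vocabulary ids 0..V-1;
        // conf[p] = sim_doc(d0, dp) is the confidence factor for document p.
        public static double[][] build(int[][] docs, double[] conf, int vocabSize, int w) {
            double[][] M = new double[vocabSize][vocabSize];
            for (int p = 0; p < docs.length; p++) {
                int[] d = docs[p];
                for (int i = 0; i < d.length; i++) {
                    // Count co-occurrences within a window of at most w words (Eq. 4).
                    for (int j = i + 1; j < d.length && j - i < w; j++) {
                        if (d[i] == d[j]) continue; // no self-loops (i != j in Eq. 5)
                        M[d[i]][d[j]] += conf[p];
                        M[d[j]][d[i]] += conf[p]; // undirected graph: symmetric weights
                    }
                }
            }
            // Row-normalize M so each non-empty row sums to 1 (Eq. 6).
            for (int i = 0; i < vocabSize; i++) {
                double rowSum = 0;
                for (double x : M[i]) rowSum += x;
                if (rowSum > 0) for (int j = 0; j < vocabSize; j++) M[i][j] /= rowSum;
            }
            return M;
        }
    }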
Based on the global affinity graph G, the saliency score WordScore(v_i) for word v_i can be deduced from the scores of all the other words linked with it, and it can be formulated in a recursive form as in the PageRank algorithm:

$WordScore(v_i) = \mu \sum_{j \neq i} WordScore(v_j) \cdot \tilde{M}_{j,i} + \frac{1-\mu}{|V|}$   (7)

And the matrix form is:

$\vec{\lambda} = \mu \tilde{M}^{T} \vec{\lambda} + \frac{1-\mu}{|V|} \vec{e}$   (8)
where $\vec{\lambda} = [WordScore(v_i)]_{|V| \times 1}$ is the vector of word saliency scores, $\vec{e}$ is a vector with all elements equal to 1, and μ is the damping factor, usually set to 0.85 as in the PageRank algorithm. The above process can be considered as a Markov chain, taking the words as the states; the corresponding transition matrix is given by $\mu \tilde{M}^{T} + \frac{1-\mu}{|V|} \vec{e}\,\vec{e}^{T}$. The stationary probability distribution over the states is obtained as the principal eigenvector of the transition matrix. For the implementation, the initial scores of all words are set to 1 and the iterative algorithm of Equation (7) is used to compute the new scores of the words. Usually convergence is achieved when the difference between the scores computed at two successive iterations falls below a given threshold for every word (0.0001 in this study).
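A compact Java sketch of this iteration, using the normalized matrix from the previous sketch and the damping factor 0.85 and threshold 0.0001 stated above (class and method names are ours):

    public class WordScorer {
        // Iterates Equation (7) until every score changes by less than the threshold.
        public static double[] score(double[][] Mnorm, double mu, double threshold) {
            int V = Mnorm.length;
            double[] score = new double[V];
            java.util.Arrays.fill(score, 1.0); // initial scores set to 1
            while (true) {
                double[] next = new double[V];
                for (int i = 0; i < V; i++) {
                    double sum = 0;
                    for (int j = 0; j < V; j++) {
                        if (j != i) sum += score[j] * Mnorm[j][i]; // votes from linked words
                    }
                    next[i] = mu * sum + (1 - mu) / V;
                }
                double maxDiff = 0;
                for (int i = 0; i < V; i++) maxDiff = Math.max(maxDiff, Math.abs(next[i] - score[i]));
                score = next;
                if (maxDiff < threshold) return score;
            }
        }
        // Typical call: score(M, 0.85, 0.0001)
    }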
DOCUMENT-LEVEL KEYPHRASE EXTRACTION
After the scores of all candidate words in the document set have been computed, candidate phrases (either single-word or multi-word) are selected and evaluated for the specified document d_0. The candidate words (i.e. nouns and adjectives) of d_0, which form a subset of V, are marked in the text of document d_0, and sequences of adjacent candidate words are collapsed into multi-word phrases. Phrases ending with an adjective are not allowed; only phrases ending with a noun are collected as candidate phrases for the document. For instance, in the sentence "Mad/JJ cow/NN disease/NN has/VBZ killed/VBN 10,000/CD cattle/NNS", the candidate phrases are "Mad cow disease" and "cattle". The score of a candidate phrase p_i is computed by summing the neighborhood-level saliency scores of the words contained in the phrase:

$PhraseScore(p_i) = \sum_{v_j \in p_i} WordScore(v_j)$   (9)
All the candidate phrases in document d_0 are ranked in decreasing order of their phrase scores, and the top m phrases are selected as the keyphrases of d_0; m ranges from 1 to 20 in this study.
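The following sketch shows this phrase step under the same assumptions (tokens already POS-tagged; the tag test and score lookup are simplified stand-ins for the paper's system):

    import java.util.*;

    public class PhraseExtractor {
        // Collapses adjacent candidate words (nouns/adjectives) into phrases ending
        // with a noun, and scores each phrase by summing word scores (Equation 9).
        public static Map<String, Double> phrases(String[] words, String[] tags,
                                                  Map<String, Double> wordScore) {
            Map<String, Double> result = new LinkedHashMap<>();
            List<String> curWords = new ArrayList<>();
            List<String> curTags = new ArrayList<>();
            for (int i = 0; i <= words.length; i++) {
                boolean cand = i < words.length
                    && (tags[i].startsWith("NN") || tags[i].equals("JJ"));
                if (cand) {
                    curWords.add(words[i]);
                    curTags.add(tags[i]);
                } else if (!curWords.isEmpty()) {
                    // Drop trailing adjectives: only phrases ending with a noun are kept.
                    int end = curTags.size();
                    while (end > 0 && !curTags.get(end - 1).startsWith("NN")) end--;
                    if (end > 0) {
                        double s = 0;
                        for (String w : curWords.subList(0, end)) s += wordScore.getOrDefault(w, 0.0);
                        result.put(String.join(" ", curWords.subList(0, end)), s);
                    }
                    curWords.clear();
                    curTags.clear();
                }
            }
            return result; // caller ranks entries by score and takes the top m
        }
    }

On the tagged sentence above, this yields "Mad cow disease" and "cattle" as candidate phrases.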
Fig 6: Extraction of keyphrases (index files → JTextPro processing → TF-IDF → extraction of noun phrases → neighborhood construction → build affinity graph → keyphrases)
VII. RESULT ANALYSIS
The system takes plain documents as input, applies the techniques described above to extract keywords, constructs the neighborhood documents for the given document, and then extracts keyphrases first at the neighborhood level and then at the document level.
                                        Document Name   No. of words
Maximum words containing document       Sample18        1060
Minimum words containing document       Sample13        151
Average no. of words in all documents                   345
Table 2: Test Data
DOCUMENT INDEXING
The system takes a document as input, and it undergoes various stages such as tokenization, stopword removal, stemming and term frequency (TF) computation, through which the size of the document is reduced. Table 3 gives the word counts before and after indexing for the first ten documents; we can observe that the number of words reduces after indexing.
Document Name   Words before indexing   Words after indexing   Most frequent word   Frequency
Sample          398                     137                    tax                  11
Sample1         417                     161                    lokpal               8
Sample2         308                     113                    court                7
Sample3         469                     142                    govern               15
Sample4         360                     121                    will                 7
Sample5         439                     162                    fodder               15
Sample6         347                     145                    modi                 8
Sample7         214                     87                     lal                  7
Sample8         109                     47                     justic               5
Sample9         369                     141                    farmer               10
Table 3: Document Indexing
NEIGHBORHOOD CONSTRUCTION
Neighborhood construction provides, for a given document in the text corpus, a ranked list of similar documents. It takes the index files and the JTextPro processing output as input, and the neighborhood is computed using the cosine similarity method. In Table 4 the document Sample is taken as the main document; the remaining documents are its neighbor documents, arranged by their similarity to Sample.
S.No   Document Name   Similarity
1 Sample 1.0
2 Sample10 0.40289894
3 Sample3 0.33210042
4 Sample4 0.29905915
5 Sample9 0.26591668
6 Sample1 0.24555373
7 Sample6 0.2197027
8 Sample5 0.14047898
9 Sample7 0.11404948
10 Sample2 0.11210743
11 Sample8 0.08503368
Table 4: Neighborhood construction
KEYPHRASE EXTRACTION
News article: "The Hindu" newspaper article related to the implementation of GST in India.
Top ten extracted Keyphrases: implementation gst
reduction corporate tax rate, corporate tax rate per
cent, rid exemption corporate tax side, hope goods
services tax bill, parliament approval gst bill
ongoing winter session, goods services tax,
adversarial rent-seeking tax exercise, new indirect
tax regime, revolutionary transformation taxes
india since independence, minister state finance
jayant sinha.
News article: "The Hindu" newspaper article related to allegations made by former Aam Aadmi Party leader Prashant Bhushan concerning the implementation of the Lokpal, New Delhi, India.
Top ten extracted Keyphrases: provisions jan lokpal
draft anna movement, appointment removal
processes lokpal, member activist-lawyer prashant
bhushan Saturday, certain provisions draft bill,
present presser mr. bhushan, strong lokpal body,
lokpal act, delhi chief minister arvind kejriwal,
dissident aap mla pankaj pushkar.
News article: "The Hindu" newspaper article related to a Supreme Court judgment concerning Law Minister Sadananda Gowda.
Top ten extracted Keyphrases: plots hsr layout
bangalore violation lease-cum-sale agreement mr.
gowda, karnataka high court mr. gowda day,
supreme court karnataka high court order,
supremecourt high court, mr. gowda, spite clout
mr. sadananda gowda, clean chit building law
violation case, mr. gowda.
News article: "The Hindu" newspaper article related to comments made by Rajnath Singh on economic growth.
Top ten extracted Keyphrases: several significant
steps result domestic investors foreign investors,
hot favourite destination foreign investors, many
essential commodities prices, control economic
growth double digit, dig economic policies previous
upa government, destination foreign investors,
sense confidence among investors, prices essential
commodities, economic condition country, prices
many goods.
News article: "The Hindu" newspaper article related to the current development scenario in India.
Top ten extracted Keyphrases: rivers across india
national waterways current winter session, rivers
across india national waterways, movement goods
passengers across country, inland waterways,
charge road transport highways, dry ports, large
number national highways, many four-lane roads
highways, ports, shipping minister nitin gadkari.
VIII. CONCLUSION
According to the experiments conducted on several documents and their extracted keyphrases, we consider the top 10 ranked phrases to be the most suitable keyphrases.
REFERENCES
[1] Xiaojun Wan and Jianguo Xiao, "Single Document Keyphrase Extraction Using Neighborhood Knowledge", Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008.
[2] Peter D. Turney, "Learning Algorithms for Keyphrase Extraction", Information Retrieval, INRT 34-99, October 4, 1999.
[3] Xin Jiang, Yunhua Hu and Hang Li, "A Ranking Approach to Keyphrase Extraction", Microsoft Research Technical Report, July 31, 2009.
[4] Anette Hulth, "Improved Automatic Keyword Extraction Given More Linguistic Knowledge", Department of Computer and Systems Sciences, Stockholm University, Sweden.
[5] Feifan Liu, Deana Pennell, Fei Liu and Yang Liu, "Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts", Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, 2009.
[6] Bidyut Das, Subhajit Pal, Suman Kr. Mondal, Dipankar Dalui and Saikat Kumar Shome, "Automatic Keyword Extraction From Any Text Document Using N-gram Rigid Collocation".
[7] Bruce Krulwich and Chad Burkey, "The ContactFinder agent: Answering bulletin board questions with referrals", Center for Strategic Technology Research, Andersen Consulting, Northbrook, IL.
[8] Muñoz, A., "Compound key word generation from document databases using a hierarchical clustering ART model", Intelligent Data Analysis, Amsterdam: Elsevier, 1996.
[9] Steier, A. M. and Belew, R. K., "Exporting phrases: A statistical analysis of topical language", Second Symposium on Document Analysis and Information Retrieval, pp. 179-190, 1993.
[10] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C. and Nevill-Manning, C.G., "Domain-specific keyphrase extraction", Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 668-673, California: Morgan Kaufmann, 1999.
[11] Leung, C.-H. and Kan, W.-K., "A statistical learning approach to automatic indexing of controlled index terms", Journal of the American Society for Information Science, pp. 55-66, 1997.
[12] Nakagawa, H., "Extraction of index words from manuals", RIAO 97 Conference Proceedings: Computer-Assisted Information Searching on Internet, pp. 598-611, Montreal, Canada, 1997.
[13] Porter, M.F., "An algorithm for suffix stripping", 1980.
[14] Böhm, C. and Berchtold, S., "Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases", ACM Computing Surveys, pp. 322-373, 2001.
[15] Page, L., Brin, S., Motwani, R. and Winograd, T., "The PageRank citation ranking: Bringing order to the web", Technical report, Stanford Digital Libraries, 1998.
[16] Mihalcea, R. and Tarau, P., "TextRank: Bringing order into texts", Proceedings of EMNLP 2004.