The internet has caused an enormous growth in the number of documents available online. Summaries of documents can help readers find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document: they reflect its content and act as indices for it. In this work, we present a method for producing extractive summaries of documents in the Kannada language, with the number of sentences given as a constraint. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques to obtain features from the documents, combining the scores produced by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) with TF (Term Frequency) to extract keywords, which are then used to rank sentences for the summary. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION - cscpconf
The document discusses text classification and summarization techniques for complex domain-specific documents like research papers. It reviews preprocessing approaches such as stop-word removal, lemmatization, tokenization, and stemming, and compares machine learning algorithms for text classification, including Naive Bayes, decision trees, SVM, KNN, and neural networks. It also surveys works that analyze domain-specific documents with these techniques, such as biomedical relation extraction and research paper topic classification.
Natural Language Toolkit (NLTK) is a generic platform for processing data in various natural (human) languages, and it also provides resources for Indian languages such as Hindi, Bangla, and Marathi. In the proposed work, the repositories provided by NLTK are used to process Hindi text and then to analyse Multi-Word Expressions (MWEs). MWEs are lexical items that can be decomposed into multiple lexemes and that display lexical, syntactic, semantic, pragmatic, and statistical idiomaticity. The main focus of this paper is the processing and analysis of MWEs in Hindi text. The corpus used for Hindi text processing is taken from the famous Hindi novel KaramaBhumi by Munshi PremChand. The results are analysed against the Hindi corpus provided by the Resource Centre for Indian Language Technology Solutions (CFILT) to justify the accuracy of the proposed work.
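As a rough illustration of such a pipeline, the sketch below uses NLTK's bundled POS-tagged Hindi corpus (hindi.pos) and its collocation finder to surface statistically idiomatic bigrams as MWE candidates; the PMI measure and frequency filter are illustrative choices, not the paper's method, and the paper's own corpus (KaramaBhumi) differs.

    import nltk
    from nltk.corpus import indian
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    nltk.download('indian', quiet=True)   # NLTK's POS-tagged Indian-language corpora

    # Hindi tokens from NLTK's sample corpus (the paper uses KaramaBhumi instead).
    words = indian.words('hindi.pos')

    # Statistical idiomaticity: bigrams whose words co-occur far more often
    # than chance, scored by pointwise mutual information (PMI).
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(2)           # drop hapax pairs
    for mwe in finder.nbest(BigramAssocMeasures.pmi, 10):
        print(' '.join(mwe))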
A template based algorithm for automatic summarization and dialogue managemen... - eSAT Journals
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to develop a methodology that extracts important events such as dates, places, and subjects of interest, and that presents users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
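A minimal sketch of the cosine-similarity step named in the keywords, assuming a hypothetical bag-of-words event template; the paper's actual templates and features are not reproduced here.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity of two bag-of-words Counters."""
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Hypothetical template describing the kind of event to keep.
    template = Counter("meeting held on date at place".split())

    sentences = ["The committee meeting was held on 5 March at the town hall.",
                 "Lunch was served afterwards.",
                 "A follow-up meeting is planned at the same place."]

    # Sentences whose similarity to the template clears a threshold are kept.
    for s in sentences:
        print(round(cosine(template, Counter(s.lower().split())), 2), s)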
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A... - ijctcm
This document summarizes an approach to automatic text summarization using the Simplified Lesk algorithm and WordNet. It analyzes sentences to determine relevance based on semantic information rather than surface features like position or format. Sentences are assigned weights based on the number of overlaps between words' dictionary definitions and the full text. Higher weighted sentences are selected for the summary based on a percentage of the original text length. The approach achieves 80% accuracy on 50% summarizations of diverse texts compared to human summaries.
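A minimal sketch of this overlap-based weighting, using NLTK's WordNet and only each word's first sense for brevity; the article's full Simplified Lesk procedure considers more context than this.

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download('wordnet', quiet=True)

    def sentence_weight(sentence, full_text_vocab):
        """Count overlaps between each word's WordNet gloss (first sense
        only, for brevity) and the vocabulary of the whole text."""
        weight = 0
        for word in sentence.lower().split():
            synsets = wn.synsets(word.strip('.,'))
            if synsets:
                gloss = set(synsets[0].definition().lower().split())
                weight += len(gloss & full_text_vocab)
        return weight

    doc = ["The bank raised interest rates on deposits.",
           "Customers queued outside the branch.",
           "The weather was pleasant."]
    vocab = set(" ".join(doc).lower().replace('.', ' ').split())

    # Higher-weight sentences would be selected for the summary.
    for s in sorted(doc, key=lambda s: -sentence_weight(s, vocab)):
        print(sentence_weight(s, vocab), s)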
A statistical model for gist generation: a case study on Hindi news article - IJDKP
Every day, a huge number of news articles are reported and disseminated on the Internet. A gist of an article lets the reader scan the main topics instead of reading the whole article, which takes much more time. An ideal system would understand the document and generate the appropriate theme(s) directly from the results of that understanding. In the absence of a natural language understanding system, an appropriate alternative must be designed. Gist generation is a difficult task because it requires both maximizing the text content packed into a short summary and maintaining the grammaticality of the text. In this paper we present a statistical approach to generating the gist of a Hindi news article. The experimental results are evaluated using standard measures such as precision, recall, and F1 for different statistical models and their combination, on articles both before and after pre-processing.
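The evaluation measures reduce to a few lines; this sketch assumes sentence-level comparison against a single human reference, which may differ from the paper's exact setup.

    def prf1(system, reference):
        """Precision, recall and F1 of system-selected sentences
        against a human reference."""
        tp = len(system & reference)
        p = tp / len(system) if system else 0.0
        r = tp / len(reference) if reference else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Hypothetical sentence ids chosen by a model vs. by a human.
    print(prf1({1, 2, 5}, {1, 3, 5}))   # approximately (0.667, 0.667, 0.667)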
Single document keywords extraction in Bahasa Indonesia using phrase chunking - TELKOMNIKA JOURNAL
Keywords help readers to grasp the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part-of-speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and several scenarios involving candidate reduction and sorting. We evaluated each scenario using precision, recall, and F-measure. The experimental results show that: i) shorter-phrase keywords with string reduction, extracted from the abstract and sorted by frequency, provide the highest score; ii) in every proposed scenario, extracting keywords from the abstract always gives a better result; iii) using shorter-phrase patterns in keyword extraction gives a better score than using all phrase patterns; iv) sorting scenarios based on multiplying candidate frequencies by the weight of the phrase patterns offer better results.
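A small sketch of the phrase-chunking step, assuming hand-tagged English tokens in place of a real Indonesian POS tagger, with one illustrative shorter-phrase pattern.

    import nltk

    # Hand-tagged tokens stand in for a POS tagger's output (the paper tags
    # Indonesian text; English tags are used only to show the chunking step).
    tagged = [("automatic", "JJ"), ("keyword", "NN"), ("extraction", "NN"),
              ("saves", "VBZ"), ("considerable", "JJ"), ("time", "NN"),
              ("and", "CC"), ("effort", "NN")]

    # One shorter-phrase pattern: optional adjective plus one or more nouns.
    chunker = nltk.RegexpParser("KP: {<JJ>?<NN.*>+}")
    tree = chunker.parse(tagged)

    candidates = [" ".join(w for w, _ in st.leaves())
                  for st in tree.subtrees() if st.label() == "KP"]
    print(candidates)  # ['automatic keyword extraction', 'considerable time', 'effort']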
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES - ijnlc
A large amount of information lies dormant in historical documents and manuscripts, and it will be lost if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text via optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style, and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching word images of the Devanagari script. Shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Semantic Based Model for Text Document Clustering with Idioms - Waqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms on the web, in social networks, and in other information networks. Clustering is a powerful data mining technique for organizing the large amount of information on the web, but traditional document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. The method tags the documents for parsing, replaces idioms with their original meanings, calculates semantic weights for document words, and applies a semantic grammar. A similarity measure is obtained between the documents, which are then clustered using a hierarchical clustering algorithm. The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters is demonstrated.
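A compact sketch of the final clustering step, assuming plain TF-IDF vectors in place of the paper's idiom-resolved, semantically weighted representation; average-linkage clustering on cosine distances stands in for the hierarchical algorithm described.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    docs = ["kick the bucket means to die",
            "the patient died peacefully",
            "stock prices rose sharply",
            "the market rallied today"]

    # Plain TF-IDF stands in for the paper's idiom-resolved, semantically
    # weighted document representation.
    X = TfidfVectorizer().fit_transform(docs).toarray()

    # Average-linkage hierarchical clustering on cosine distances.
    Z = linkage(pdist(X, metric="cosine"), method="average")
    print(fcluster(Z, t=2, criterion="maxclust"))   # e.g. [1 1 2 2]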
Dictionary based concept mining: an application for Turkish - csandit
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far, there have been many studies of concept mining in English, but this area of study is still immature for Turkish, an agglutinative language. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but since dictionary entries carry synonyms, hypernyms, hyponyms, and other relationships in their meaning texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
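A toy sketch of the dictionary-based idea, assuming a tiny hand-made Turkish dictionary and ignoring the morphological analysis an agglutinative language really needs; headwords whose meaning texts overlap with the document are proposed as concepts.

    # Toy machine-readable dictionary: each meaning text mentions related
    # words (synonyms, hypernyms), as the entries described above do.
    dictionary = {
        "kedi":   "evcil hayvan miyavlayan memeli",   # cat
        "kopek":  "evcil hayvan havlayan memeli",     # dog
        "hayvan": "canli varlik",                     # animal
    }

    def concepts(document_words):
        """Score headwords by overlap between their meaning text and the
        document; the top-scoring entries become the concepts."""
        doc = set(document_words)
        scores = {}
        for head, meaning in dictionary.items():
            overlap = len(doc & set(meaning.split())) + (head in doc)
            if overlap:
                scores[head] = overlap
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(concepts("kedi ve kopek evcil hayvan".split()))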
Conceptual framework for abstractive text summarization - ijnlc
As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter, and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input, processes it by building a rich semantic graph, and then reduces this graph to generate the final summary.
An automatic text summarization using lexical cohesion and correlation of sen... - eSAT Publishing House
The Process of Information extraction through Natural Language Processing - Waqas Tariq
Information Retrieval (IR) is the discipline that deals with the retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured (e.g., a sentence or even another document) or structured (e.g., a boolean expression). The need for effective methods of automated IR has grown because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements; indexing (including term weighting) of document collections; methods for computing the similarity of queries and documents; classification and routing of documents in an incoming stream to users on the basis of topic or need statements; clustering of document collections on the basis of language or topic; and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has therefore been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort but also pave the way to discovering hitherto unknown information implicitly conveyed in published text. Work in this area has focused on extracting a wide range of information, such as the chromosomal location of genes, protein functional information, associations between genes by functional relevance, and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, publications, in their unstructured format, pose a greater challenge, addressed by many approaches.
Suitability of naïve bayesian methods for paragraph level text classification... - ijaia
This document discusses using Naive Bayesian methods for paragraph-level text classification in the Kannada language. It evaluates the performance of the Naive Bayesian and Naive Bayesian Multinomial models on a corpus of 1791 paragraphs from four categories (Commerce, Social Sciences, Natural Sciences, Aesthetics). Dimensionality reduction techniques like removing stop words and words with low term frequency are applied before classification. The results show that the Naive Bayesian Multinomial model outperforms the simple Naive Bayesian approach for paragraph classification in Kannada.
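The multinomial model amounts to a few lines in scikit-learn; this sketch uses hypothetical English stand-in paragraphs rather than the Kannada corpus and omits the dimensionality reduction step the study describes.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical stand-in paragraphs; the study used 1791 Kannada
    # paragraphs from four categories.
    paragraphs = ["market trade profit loss", "bank loan interest rate",
                  "poem rhythm beauty", "painting colour canvas"]
    labels = ["Commerce", "Commerce", "Aesthetics", "Aesthetics"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(paragraphs, labels)
    print(model.predict(["colour and rhythm in art"]))   # ['Aesthetics']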
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Specific (pre-)processing methods and algorithms are therefore required to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information in text documents. Text classification (also called text categorization), the task of assigning a text document to one of a set of predefined classes, is one of the important research issues in the field of text mining. This paper covers different text classification techniques and also discusses classifier architecture and text classification applications.
Improvement of Text Summarization using Fuzzy Logic Based Method - IOSR Journals
The document describes a method for improving text summarization using fuzzy logic. It proposes using fuzzy logic to determine the importance of sentences based on calculated feature scores. Eight features are used to score sentences, including title words, length, term frequency, position, and similarity. Sentences are then ranked based on their fuzzy logic-determined scores. The highest scoring sentences are extracted to create a summary. An evaluation of summaries generated using this fuzzy logic method found it performed better than other summarizers in accurately reflecting the content and order of human-generated reference summaries. The method could be expanded to multi-document summarization and automatic selection of fuzzy rules based on input type.
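A toy sketch of the fuzzy inference idea with two illustrative rules and triangular membership functions; the feature values, rule base, and defuzzification below are assumptions for illustration, not the paper's actual system.

    def triangular(x, a, b, c):
        """Membership of x in a triangular fuzzy set peaking at b."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    def sentence_importance(features):
        """Two-rule sketch: features are normalised scores in [0, 1]
        (title words, length, term frequency, position, similarity, ...)."""
        high = min(triangular(f, 0.4, 1.0, 1.6) for f in features)   # all high
        low = max(triangular(f, -0.6, 0.0, 0.6) for f in features)   # any low
        # Weighted defuzzification of 'important' (1.0) vs 'unimportant' (0.0).
        return high / (high + low) if high + low else 0.0

    # Eight normalised feature scores for one sentence (hypothetical values).
    print(sentence_importance([0.9, 0.7, 0.8, 0.5, 0.9, 0.8, 0.7, 0.75]))  # 0.5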
Design of optimal search engine using text summarization through artificial i... - TELKOMNIKA JOURNAL
Natural language processing is a trending research area that allows developers to create human-computer interactions; it integrates artificial intelligence, computer science, and computational linguistics. Research in natural language processing focuses on creating devices and machines that operate on a single human command, enabling bots that carry instructions from mobile devices to control physical devices through speech tagging. In this paper, we design a search engine that not only displays data matching the user's query but also gives a detailed display of the content or topic the user is interested in, using summarization. We find that the designed search engine has an optimal response time for user queries, analysed over varying numbers of transactions as input. The performance analysis also shows that the text summarization method is an efficient way of improving response time in search engine optimization.
Sentiment analysis is an important current research area, and the demand for sentiment analysis and classification is growing day by day. This paper presents a novel method for classifying Urdu documents; no previous work has been recorded on sentiment classification for Urdu text. We consider the problem of determining whether a review or sentence is positive, negative, or neutral. For this purpose we use two machine learning methods, Naïve Bayes and Support Vector Machines (SVM). The documents are first preprocessed and sentiment features are extracted; the polarity is then calculated, judged, and classified using the machine learning methods.
A domain specific automatic text summarization using fuzzy logic - IAEME Publication
This paper proposes a domain-specific automatic text summarization technique using fuzzy logic. It extracts important sentences from documents using both statistical and linguistic features. The technique first preprocesses documents by segmenting sentences, removing stop words, and stemming words. It then extracts eight features from each sentence, including title words, sentence length, position, and similarity. Fuzzy logic is used to assign an importance score to each sentence based on these features. Sentences are ranked and the top sentences are selected for the summary based on the compression rate. The technique was evaluated on news articles and shown to generate good quality summaries compared to other summarizers based on precision, recall, and F-measure.
Rhetorical Sentence Classification for Automatic Title Generation in Scientif... - TELKOMNIKA JOURNAL
In this paper, we propose work on rhetorical corpus construction and a sentence classification model that can be incorporated into an automatic paper-title generation task for scientific articles. Rhetorical classification is treated as sequence labeling, and a rhetorical sentence classification model is useful in tasks that consider a document's discourse structure. We performed experiments using datasets from two domains: computer science (CS dataset) and chemistry (GaN dataset). We evaluated the models using 10-fold cross-validation (0.70-0.79 weighted average F-measure) as well as on-the-run (0.30-0.36 error rate at best). Our models performed best when the imbalanced data was handled with a SMOTE filter.
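The SMOTE step is standard; a minimal sketch with imbalanced-learn on synthetic data (the class ratio and features are placeholders for the rhetorical datasets):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Rhetorical labels are typically imbalanced; SMOTE synthesises
    # minority-class examples to rebalance the training data.
    X, y = make_classification(n_samples=200, n_features=10,
                               weights=[0.9, 0.1], random_state=0)
    print("before:", Counter(y))
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_res))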
Text can be analysed by splitting it and extracting keywords, which may be represented as summaries, tabular representations, graphical forms, or images. The need to cope with the large amount of information present in textual form has led to research on extracting text and transforming it from an unstructured to a structured format. This paper presents the importance of Natural Language Processing (NLP) through two applications in Python: 1. automatic text summarization [domain: newspaper articles]; 2. text-to-graph conversion [domain: stock news]. The main challenge in NLP is natural language understanding, i.e., deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence, and database concepts. The automatic summarization tool converts newspaper articles into a summary on the basis of word frequency in the text. The text-to-graph converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and maps the tokens to a graph. The paper proposes a business solution for effective time management.
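A minimal sketch of the frequency-based summarizer described for newspaper articles; the sentence splitter and the scoring (raw frequency sums, no length normalization) are simplifying assumptions.

    import re
    from collections import Counter

    def summarize(article, n_sentences=2):
        """Score each sentence by the frequency of its words in the whole
        article; keep the top n in their original order."""
        sents = re.split(r'(?<=[.!?])\s+', article.strip())
        freq = Counter(re.findall(r'\w+', article.lower()))
        ranked = sorted(range(len(sents)), key=lambda i: -sum(
            freq[w] for w in re.findall(r'\w+', sents[i].lower())))
        return ' '.join(sents[i] for i in sorted(ranked[:n_sentences]))

    article = ("Shares of the company rose five percent. The company said "
               "profits doubled. Analysts were surprised. The company plans "
               "to expand next year.")
    print(summarize(article))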
Text Optimization/Summarizer using Natural Language Processing - IRJET Journal
1. The document discusses the development of an intelligent system to optimize the English language using natural language processing techniques. The system will perform functions like summarization, spell check, grammar check, and sentence auto-completion.
2. It describes the various algorithms used for each function, including extracting important sentences for summarization, comparing words to dictionaries for spell check, analyzing syntax for grammar check, and completing sentences based on previous user data for auto-completion.
3. The system aims to build a smart tool that can correct errors and summarize text in English to improve communication through optimized language.
A study on the approaches of developing a named entity recognition tool - eSAT Publishing House
Myanmar news summarization using different word representations - IJECEIAES
There is an enormous amount of information available in different forms of sources and genres, and an automatic mechanism is required to extract useful information from such massive data. Text summarization systems assist with content reduction, keeping the important information and filtering out the non-important parts of the text. Good document representation is very important in text summarization for getting relevant information: bag-of-words cannot capture word similarity in syntactic and semantic relationships, whereas word embeddings give a good document representation that captures and encodes the semantic relations between words. Therefore, a centroid method based on word embedding representations is employed in this paper, and Myanmar news summarization based on different word embeddings is proposed. Myanmar local and international news are summarized with a centroid-based word embedding summarizer, exploiting the effectiveness of the word embedding representation. Experiments were done on a Myanmar local and international news dataset using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embeddings performs comprehensively better than centroid summarization using bag-of-words.
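A small sketch of centroid-based embedding summarization, assuming a tiny hand-made embedding table in place of the word2vec-style models trained on Myanmar news: sentences closest to the document's embedding centroid are kept.

    import numpy as np

    # Hypothetical embedding table; the paper trains embedding models
    # on Myanmar news instead.
    emb = {"storm": np.array([0.9, 0.1]), "rain": np.array([0.8, 0.2]),
           "flood": np.array([0.85, 0.15]), "match": np.array([0.1, 0.9]),
           "football": np.array([0.2, 0.8])}

    def vec(words):
        vs = [emb[w] for w in words if w in emb]
        return np.mean(vs, axis=0) if vs else np.zeros(2)

    def cos(a, b):
        n = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / n) if n else 0.0

    doc = ["storm brought heavy rain", "flood warnings were issued",
           "a football match was postponed"]
    centroid = vec(" ".join(doc).split())

    # Sentences closest to the embedding centroid form the summary.
    for s in sorted(doc, key=lambda s: -cos(vec(s.split()), centroid)):
        print(round(cos(vec(s.split()), centroid), 2), s)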
A Survey on Sentiment Categorization of Movie Reviews - Editor IJMTER
Sentiment categorization is the process of mining user-generated text content and determining the users' sentiment towards a particular thing; it is the approach of detecting the author's sentiment regarding some topic. It is also known as sentiment detection, sentiment analysis, and opinion mining. It is very useful for movie production companies interested in knowing how users feel about their movies; for example, the word "excellent" indicates that a review expresses positive emotion about a particular movie. The same applies to movies, songs, cars, holiday destinations, political parties, social network sites, web blogs, discussion forums, and so on. Sentiment categorization can be carried out using three approaches: first, supervised machine learning based text classifiers such as Naïve Bayes, Maximum Entropy, SVM, kNN, and hidden Markov models; second, unsupervised semantic orientation schemes that extract relevant N-grams of the text and then label them; third, the publicly available SentiWordNet library.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE - Journal For Research
Natural Language Processing (NLP) techniques are among the most widely used in the field of computer applications, and the area has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything is dependent on machines and computerized, communication between computers and humans has become a necessity. NLP has emerged to fulfil this necessity, narrowing the gap between machines and humans. It evolved from the study of linguistics, passed through the Turing test to check the similarity between data, and was initially limited to small data sets; later, various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of these techniques on different parameters.
A Survey of Various Methods for Text Summarization - IJERD Editor
Document summarization means retrieving a short and important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but much less effort has been devoted to Hindi. Here, we discuss various techniques covering features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM - IRJET Journal
This document describes a proposed candidate set key document retrieval system. The system would process user queries in English and return relevant documents from a collection. It would use natural language processing techniques like tokenization, stop word removal, stemming, and lemmatization to index the documents and match them with user queries. The proposed system architecture includes components for indexing, processing user queries, and retrieving relevant documents from the collection. The indexing process involves organizing the documents, extracting tokens, removing stop words, and applying stemming/lemmatization to create an inverted index for efficient searching.
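A minimal sketch of the indexing pipeline, with a toy stop-word list and a deliberately crude suffix-stripping stemmer standing in for real stemming/lemmatization:

    from collections import defaultdict

    STOP = {"the", "a", "is", "of", "and", "on", "are"}

    def build_index(docs):
        """Tokenise, drop stop words, apply a crude suffix-stripping
        stemmer, then map each term to the ids of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for tok in text.lower().split():
                if tok in STOP:
                    continue
                term = tok[:-1] if tok.endswith("s") else tok   # toy stemmer
                index[term].add(doc_id)
        return index

    docs = ["The cats sat on the mat", "Dogs and cats are friends",
            "A mat of straw"]
    index = build_index(docs)
    # Query answering: intersect the posting lists of the query terms.
    print(sorted(index["cat"] & index["mat"]))   # [0]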
The document describes a proposed system for automatic text summarization using latent semantic analysis and natural language processing. The system would take in a user-submitted document, preprocess it by removing stop words, generate a singular value decomposition matrix to derive the latent semantic structure, and output a summary by semantically arranging sentences to encompass the original text's concepts. The document provides an implementation example in Python using the LSA method from the scikit-learn machine learning library for natural language processing and matrix decomposition.
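The document mentions a Python implementation with scikit-learn; the sketch below is one plausible reading of that pipeline, using TruncatedSVD on a sentence-level TF-IDF matrix and a simple per-concept selection rule, not the system's exact code.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    sentences = ["The economy grew three percent last quarter.",
                 "Growth was driven by strong exports.",
                 "The football season starts next week.",
                 "Exports rose as the currency weakened."]

    # Latent semantic analysis: SVD of the sentence-term TF-IDF matrix.
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    S = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

    # Simple selection rule: the sentence loading most strongly on each
    # latent concept goes into the summary.
    for k in range(S.shape[1]):
        print(sentences[int(np.argmax(np.abs(S[:, k])))])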
A Review Of Text Mining Techniques And Applications - Lisa Graves
This document provides a review of various text mining techniques and applications. It discusses techniques used for text classification and summarization, including Naive Bayes classification, backpropagation neural networks, keyword matching, and information extraction. It also covers applications of text mining in areas like sentiment analysis of social media posts and hotel reviews. Finally, it discusses the need for organizational text mining to extract useful information and insights from large amounts of unstructured text data.
A comparative analysis of particle swarm optimization and k means algorithm f... - ijnlc
The volume of digitized text documents on the web has been increasing rapidly. Because there is a huge collection of data on the web, documents need to be grouped (clustered) for speedy information retrieval. Document clustering gathers documents into groups such that the documents within each group are similar to each other and dissimilar to documents in other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, which is often unsatisfactory because it does not exploit semantics. In this paper, texts are instead represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. K-means, PSO, and hybrid PSO+K-means are applied to cluster text in the Nepali language. Experimental evaluation is performed using intra-cluster and inter-cluster similarity.
A Novel approach for Document Clustering using Concept Extraction
In this paper we present a novel approach to extract the concept from a document and cluster a set of documents depending on the concept extracted from each of them. We transform the corpus into a vector space using term frequency-inverse document frequency, calculate the cosine distance between each pair of documents, and then cluster them using the K-means algorithm. We also use multidimensional scaling to reduce the dimensionality within the corpus. The result is a grouping of documents that are most similar to each other with respect to their content and genre.
This document describes a method for enriching search results using ontology. It begins with an abstract discussing how keyword searches often return irrelevant documents due to the large amount of information available online. It then introduces the concept of using ontology to allow for more sophisticated semantic searches. The paper presents an architecture that augments keyword search results with additional documents that are semantically relevant based on ontology mappings. Documents in the search results are then ranked based on both keyword frequency and semantic relevance to improve search accuracy.
This document proposes using Word2Vec and decision trees to extract keywords from textual documents and classify the documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach involves preprocessing text, representing words as vectors with Word2Vec, calculating frequently occurring keywords for each category, and using decision trees to classify documents based on keyword similarity. Experiments using different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...
Genetic Algorithm (GA) has been a successful method for extracting keywords. This paper presents
a full method by which keywords can be derived from various corpuses. We have built equations
that exploit the structure of the documents from which the keywords need to be extracted. The
procedure is broken into two distinct profiles: one weighs the words across the whole document
content, and the other explores the possible occurrences of key terms by using a genetic algorithm.
The basic equations of the heuristic mechanism are varied to allow complete exploitation of the
document. The Genetic Algorithm and the enhanced standard deviation method are used to full
potential to enable the generation of the key terms that describe the given text document. The new
technique has enhanced performance and better time complexity.
Information Retrieval System is an effective process that helps a user to trace relevant information by Natural Language Processing (NLP). In this research paper, we have presented an algorithmic Bengali Information Retrieval System (BIRS) that is significant mathematically and statistically. The paper demonstrates two algorithms for finding the lemmatization of Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), compared with Edit Distance for exact lemmatization. We present a Bengali Anaphora resolution system using Hobbs’ algorithm to get the correct expression of information. As the question answering algorithms, TF-IDF and Cosine Similarity are developed to find the accurate answer from the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that ease the implementation of our task. We have also developed corpora of Bengali root words, synonyms and stop words, and gathered 672 articles from the popular Bengali newspaper ‘The Daily Prothom Alo’ as our source information. For testing the system, we created 19335 questions from the gathered information and obtained 97.22% accurate answers.
A Novel Approach for Keyword extraction in learning objects using text mining
Keyword extraction and concept finding in learning objects is a very important subject in today’s eLearning environment. Keywords are a subset of words that contain useful information about the content of a document. Keyword extraction is the process used to get the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process using the WordNet dictionary. WordNet is a lexical database of English which is used to find similarity among the candidate words. The words having the highest similarity are taken as keywords.
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
Document Classification Using KNN with Fuzzy Bags of Word Representations
Abstract — Text classification is used to classify documents depending on words, phrases and word combinations according to declared syntaxes. Many applications use text classification, such as artificial intelligence systems and systems that maintain data according to category. Keywords, called topics, are selected to classify a given document; using these topics, the main idea of the document can be identified. Selecting the topics is an important task in classifying a document into its category. In the proposed system, keywords are extracted from documents using TF-IDF and WordNet. The TF-IDF algorithm is mainly used to select the important words by which a document can be classified, and WordNet is mainly used to find the similarity between these candidate words; the words having the maximum similarity are considered topics (keywords). In this experiment we used the TF-IDF model to find similar words in order to classify the document; the decision tree algorithm gives better accuracy for text classification when compared to other algorithms. A fuzzy system is used to classify text written in natural language according to topic. It is necessary to use a fuzzy classifier for this task, due to the fact that a given text can cover several topics with different degrees; in this context, traditional classifiers are inappropriate, as they attempt to sort each text into a single class in a winner-takes-all fashion. The classifier we propose automatically learns its fuzzy rules from training examples. We have applied it to classify news articles, and the results we obtained are promising. The dimensionality of the vector representation is very important in text classification, and it can be decreased by using clustering based on fuzzy logic. Depending on their similarity, documents can be classified and formed into clusters according to their topics; after the formation of clusters, documents can be accessed and stored easily. In this way we can find the similarity and summarize the words, called topics, which can be used to classify the documents.
IRJET- Automatic Recapitulation of Text Document
The document describes an approach for automatically generating summaries of text documents. It involves preprocessing the input text by tokenizing, removing stop words, and stemming words. It then extracts features from the preprocessed text, such as term frequency-inverse document frequency (tf-idf) values. Sentences are scored based on these features and sentences with higher scores are selected to form the summary. Keywords from the text are also identified using WordNet to help select the most relevant sentences for the summary. The proposed approach aims to generate concise yet meaningful summaries using natural language processing techniques.
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS
In this paper, a methodology to mine concepts from documents and use these concepts to generate an
objective summary of the claims section of patent documents is proposed. The Conceptual Graph (CG)
formalism proposed by Sowa (Sowa 1984) is used in this work for representing concepts and their
relationships. Automatic identification of concepts and conceptual relations from text documents is a
challenging task. The focus of this work is on the analysis of patent documents, mainly on the claims
section of the documents. There are several complexities in the writing style of these documents, as
they are technical as well as legal. It is observed that the general in-depth parsers available in the open
domain fail to parse the claims-section sentences in patent documents. The failure of in-depth parsers
motivated us to develop a methodology to extract CGs using other resources. Thus, in the present work,
shallow parsing, NER and machine learning techniques are used for extracting concepts and conceptual
relationships from sentences in the claims section of patent documents. This paper therefore discusses i)
generation of the CG, a semantic network, and ii) generation of an abstractive summary of the claims section of
the patent. The aim is to generate a summary which is 30% of the whole claims section. Restricted
Boltzmann Machines (RBMs), a deep learning technique, are used for automatically extracting CGs. We
have tested our methodology on a corpus of 5000 patent documents from the electronics domain. The results
obtained are encouraging and comparable with state-of-the-art systems.
1) The document discusses different clustering algorithms for text summarization including hierarchical clustering, query-based summarization, graph theoretic clustering, fuzzy c-means clustering, and DBSCAN clustering.
2) These algorithms are evaluated based on performance parameters like precision, recall, time complexity, space complexity, and summary quality.
3) The algorithm found to perform best based on these evaluations will be suggested as the better algorithm for query-dependent text document summarization.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents
International Journal on Soft Computing (IJSC), Vol. 2, No. 4, November 2011
DOI: 10.5121/ijsc.2011.2408
KEYWORD EXTRACTION BASED
SUMMARIZATION OF CATEGORIZED
KANNADA TEXT DOCUMENTS
Jayashree R.1, Srikanta Murthy K.2 and Sunny K.1
1 Department of Computer Science, PES Institute Of Technology, Bangalore, India
jayashree@pes.edu, sunfinite@gmail.com
2 Department of Computer Science, PES School Of Engineering, Bangalore, India
srikantamurthy@pes.edu
ABSTRACT
The internet has caused a humongous growth in the number of documents available online. Summaries of
documents can help find the right information and are particularly effective when the document base is
very large. Keywords are closely associated to a document as they reflect the document's content and act
as indices for a given document. In this work, we present a method to produce extractive summaries of
documents in the Kannada language, given a limit on the number of sentences. The algorithm extracts key
words from pre-categorized Kannada documents collected from online resources. We use two feature
selection techniques for obtaining features from documents, then we combine scores obtained by GSS
(Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF
(Term Frequency) for extracting key words and later use these for summarization based on rank of the
sentence. In the current implementation, a document from a given category is selected from our database
and depending on the number of sentences given by the user, a summary is generated.
KEYWORDS
Summary, Keywords, GSS coefficient, Term Frequency (TF), IDF (Inverse Document Frequency) and Rank
of sentence
1. Introduction
With the growth of the internet, a large amount of data is available online. There is a demanding
need to make effective use of data available in native languages. Information Retrieval [IR] is
therefore becoming an important need in the Indian context. India is a multilingual country; any
new method developed in IR in this context needs to address multilingual documents. There are
around 50 million Kannada speakers and more than 10000 articles in Kannada Wikipedia. This
warrants us to develop tools that can be used to explore digital information presented in Kannada
and other native languages. A very important task in Natural Language Processing is Text
Summarization. Inderjeet Mani provides the following succinct definition of summarization:
take an information source, extract content from it, and present the most important content to the
user in a condensed form and in a manner sensitive to the user’s application needs [14]. There are
two main techniques for Text Document Summarization: extractive summary and abstractive
summary. While an extractive summary copies information that is very important to the summary,
an abstractive summary condenses the document more strongly than extractive summarization and
requires natural language generation techniques. Summarization is not a deterministic problem:
different people would choose different sentences, and even the same person may choose different
sentences at different times, showing differences between summaries created by humans. Also,
semantic equivalence is another problem, because two sentences can give the same meaning with
different wordings. In this paper, we present an extractive summarization algorithm which
provides generic summaries. The algorithm uses sentences as the compression basis.
Keywords/phrases, which are a very important component of this work, are expressions: single
words or phrases describing the most important aspects of a given document.
The list of keywords/phrases aims to reflect the meaning of the document. Guided by the given
keywords/phrases, we can provide a quick summary, which can help people easily understand
what a document describes, saving a great amount of time and thus money. Consequently,
automatic text document summarization is in high demand. Meanwhile, summarization is also
fundamental to many other natural language processing and data mining applications such as
information retrieval, text clustering and so on [11][2].
2. Literature Survey
Previous work on key phrase extraction by Letian Wang and Fang Li [3] has shown that key
phrase extraction can be achieved using a chunk-based method, in which keywords of a document are used to
select key phrases from candidates. Similarly, another approach, by Mari-Sanna Paukkeri et al. [2],
selects words and phrases that best describe the meaning of the documents by comparing ranks of
frequencies in the documents to those in a reference corpus. The SZTERGAK
system by Gabor Berend [1] is a framework that treats the reproduction of reader-assigned
keywords as a supervised learning task; in this work, a restricted set of token sequences was used
as classification instances. One more method, by You Ouyang [4], extracted the most essential
words and then expanded the identified core words into the target key phrases by a word expansion
approach. The novel approach to key phrase extraction proposed there consists of two stages:
identifying core words and expanding core words to key phrases. The work on automatically
producing key phrases for scientific papers by Su Nam Kim et al. [5] compiled a set of 284
scientific articles with key phrases carefully chosen by both their authors and readers; the task
was to automatically produce key phrases for each paper. Fumiyo Fukumoto et al. [6] present a method
for detecting key sentences from documents that discuss the same event. To eliminate
redundancy, they use spectral clustering and classify each sentence into groups, each of which
consists of semantically related sentences. The work of Michael J. Paul et al. [7] uses an
unsupervised probabilistic approach to model and extract multiple viewpoints in text. The authors
also use LexRank, a novel random walk formulation, to score sentences and pairs of sentences
from opposite viewpoints based on both representativeness of the collections as well as their
contrast with each other. Word position information proves to play a significant role in
document summarization. The work of You Ouyang et al. [8] illustrates the use of word position
information; the idea comes from assigning different importance to multiple occurrences of words in a single
document. Cross-language document summarization is another upcoming trend in the
Natural Language Processing area, wherein the input document is in one language and the
summarizer produces a summary in another language. There was a proposal by Xiaojun Wan et al.
[9] to consider translation from English to Chinese: first the translation quality of each
English sentence in the document set is predicted with the SVM regression method, then the
quality score of each sentence is incorporated into the summarization process, and finally English
sentences with high translation scores are translated to form the Chinese summary. There have
been techniques which use the A* algorithm to find the best extractive summary up to a given length,
which is both optimal and efficient to run; search is typically performed using a greedy technique
which selects each sentence in decreasing order of model score until the desired summary length
is reached [10]. There are two approaches to document summarization: supervised and
unsupervised methods. In a supervised approach, a model is trained to determine whether a candidate
phrase is a key phrase. Among unsupervised methods, graph-based methods are the state of the art; these
methods first build a word graph according to word co-occurrences within the document and then
use random walk techniques to measure the importance of a word [12].
3. Methodology
The methodology we adopted can be described in four major steps:
3.1. Crawling
The first step is creating the Kannada dataset. Wget, a Unix utility, was used to crawl the
data available on http://kannada.webdunia.com. The data was pre-categorized on this website.
3.2. Indexing
Python was the language of choice. The indexing part consisted of removing HTML markup;
English words need not be indexed for our work. Beautiful Soup is a Python HTML/XML parser
which makes it very easy to scrape a screen and is very tolerant of bad markup. We use Beautiful
Soup to build a string out of the text on the page by recursively traversing the parse tree returned
by Beautiful Soup. All HTML and XML entities (e.g. &#3205; : ಅ, &lt; : <) are then converted to
their character equivalents. Normal indexing operations involve extracting words by splitting the
document at non-alphanumeric characters; however, this would not serve our purpose, because
dependent vowels (ಾ, ೆ, etc.) are treated as non-alphanumeric, so splitting at non-alphanumeric
characters would not work for tokenization. Hence a separate module was
written for removing punctuation.
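As an illustration of this step, the sketch below (in Python, the language of choice here) combines Beautiful Soup text extraction, entity conversion, and a tokenizer that keeps maximal runs of characters from the Kannada Unicode block (U+0C80 to U+0CFF). The function name and the regular-expression tokenizer are our illustration, standing in for the separate punctuation-removal module described above, not the authors' exact code.

    import html
    import re
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def extract_kannada_tokens(raw_html):
        # Build one string out of the text on the page by traversing
        # the parse tree returned by Beautiful Soup.
        text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
        # Convert remaining HTML/XML entities to their character equivalents.
        text = html.unescape(text)
        # Splitting at non-alphanumeric characters would break words at
        # dependent vowels, so a token is instead taken to be a maximal
        # run of characters from the Kannada block (illustrative choice).
        return re.findall(r"[\u0C80-\u0CFF]+", text)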
3.3. Keyword Extraction
Documents in five categories were fetched: sports, religion, entertainment, literature, astrology.
The next step is to calculate GSS coefficients and the Inverse Document Frequency (IDF) scores
for every word (in a given category in the former case) in a given document belonging to these
categories. Every word in a given document has a Term Frequency (TF), which gives the number
of occurrences of a term in a given document, defined by:
TF = frequency of a term in the document / number of terms in the document.
Similarly, the IDF of a term is given by the formula
IDF = log10(N / n), where ‘N’ is the total number of documents indexed across all categories and
‘n’ is the number of documents containing the particular term.
Hence TF and IDF are category independent.
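In code, these two category-independent scores follow directly from the formulas above; the following minimal Python sketch is our illustration (the function names are ours), assuming tokenized documents from the indexing step.

    import math
    from collections import Counter

    def term_frequency(term, doc_tokens):
        # TF = frequency of the term in the document / number of terms.
        return Counter(doc_tokens)[term] / len(doc_tokens)

    def inverse_document_frequency(term, indexed_docs):
        # IDF = log10(N / n), where N is the number of documents indexed
        # across all categories and n is the number containing the term.
        n = sum(1 for doc in indexed_docs if term in doc)
        return math.log10(len(indexed_docs) / n) if n else 0.0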
The GSS (Galavotti-Sebastiani-Simi) coefficient [13], which evaluates the importance of a
particular term to a particular category, is then calculated for every word; it is a feature selection
technique used as the relevance measure in our case. Given a word w and a category c, it
is defined as:
f(w, c) = p(w, c) * p(w', c') - p(w', c) * p(w, c')
where p(w, c) is the probability that a document contains word ‘w’ and belongs to category ‘c’,
p(w', c') is the probability that a document does not contain ‘w’ and does not belong to ‘c’,
p(w', c) is the probability that a document does not contain ‘w’ and belongs to ‘c’, and
p(w, c') is the probability that a document contains ‘w’ and does not belong to ‘c’.
GSS coefficients give us words which are most relevant to the category to which the documents
belong, while IDF gives us words which are of importance to the given documents independently. Using
these two parameters together to determine the relevant parts of the document helps in providing a
well-rounded summary.
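The GSS coefficient reduces to a few counts over the categorized collection. Below is a possible Python rendering of the formula, with each probability estimated as a simple fraction of the indexed documents; this is an illustrative sketch, not the authors' implementation.

    def gss_coefficient(word, category, docs):
        # docs: list of (token_set, category) pairs for the whole collection.
        N = len(docs)
        p_wc   = sum(1 for t, c in docs if word in t and c == category) / N
        p_nwnc = sum(1 for t, c in docs if word not in t and c != category) / N
        p_nwc  = sum(1 for t, c in docs if word not in t and c == category) / N
        p_wnc  = sum(1 for t, c in docs if word in t and c != category) / N
        # f(w, c) = p(w, c) * p(w', c') - p(w', c) * p(w, c')
        return p_wc * p_nwnc - p_nwc * p_wnc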
3.4. Summarization
Given a document and a limit on the number of sentences, we have to provide a meaningful
summary. We calculate the GSS coefficients and IDF of all the words in the given document. If
the document is already present in our database, GSS coefficients and IDF values are already
calculated offline. These values are then multiplied by the TF of the individual words to
determine their overall importance in the document. We then extract the top n keywords from each of
the lists (GSS coefficients and IDF). Then sentences are extracted from the given document by
retrieving Kannada sentences ending with full stops. Due care is taken to see that full stops which
do not mark the end of a sentence (abbreviations such as ಡಾ., etc.) are not considered as split points. Each of these
sentences is then evaluated for the number of keywords it contains from the GSS and IDF lists as
follows:
Rank of sentence = (number of keywords contained by the sentence from both lists) / (total number of sentences in the document)
The top ‘m’ sentences are then returned as the document summary, where ‘m’ is the user-specified
limit.
The same concept can be extended to paragraphs to provide a possibly more meaningful context
than individual sentences particularly in the case of large documents.
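A compact sketch of the selection step, implementing the rank formula exactly as stated (since the denominator is constant for a given document, it scales the scores without changing the ordering); the keyword lists are assumed to come from the GSS and IDF steps above, and whitespace splitting stands in for the sentence tokenizer.

    def summarize(sentences, gss_keywords, idf_keywords, m):
        # Rank = keywords the sentence contains from both lists,
        # divided by the total number of sentences in the document.
        keywords = set(gss_keywords) | set(idf_keywords)
        total = len(sentences)

        def rank(sentence):
            return sum(1 for w in sentence.split() if w in keywords) / total

        # Return the top m sentences as the document summary.
        return sorted(sentences, key=rank, reverse=True)[:m]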
3.5. Tests
The following lists were obtained by running our algorithm with a sports article on cricket as
input with n = 20:
5. International Journal on Soft Computing ( IJSC ) Vol.2, No.4, November 2011
85
GSS co-efficient list:
IDF list:
The following lists were obtained with a blog entry about another blog on films (category:
literature) as input with n = 20:
GSS co-efficient list:
IDF list:
As evident, these lists contain stop words and hence stop word (noise) removal is essential.
3.6. Noise Removal
Stop words, which do not give meaning to the document, are considered noise and hence should not be
evaluated as keywords. To remove stop words we have implemented an algorithm which takes a
manually prepared list of stop words as input, finds structurally similar words, and adds them
to the stop word list.
Some of the words in our primary stop word list, which is created and maintained manually, are:
3.6.1 Finding Structurally similar words for Primary list words
Example: Consider the word 'ಯಾಕೆ'.
When split into individual Unicode characters,
it becomes: ಯ (U+0CAF) + ಾ (U+0CBE) + ಕ (U+0C95) + ೆ (U+0CC6). The vowel sound
at the end is not considered an alphanumeric character. So our similarity module does the
following, in order:
1. Fuzzy search for words which contain the unmodified word at the beginning.
2. Strip all the non-alphanumeric characters at the end of the word and then fuzzy search for
words which contain the modified word at the beginning.
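A minimal Python sketch of this two-step matcher follows; prefix matching stands in for the fuzzy search, and stripping trailing Unicode combining marks approximates the removal of non-alphanumeric dependent-vowel signs (both are our assumptions, not the paper's exact module).

    import unicodedata

    def structurally_similar(stop_word, vocabulary):
        # Step 1: search for words that contain the unmodified word
        # at the beginning.
        matches = {w for w in vocabulary if w.startswith(stop_word)}
        # Step 2: strip the dependent-vowel signs (combining marks) at
        # the end of the word and search again with the modified word.
        stripped = stop_word
        while stripped and unicodedata.category(stripped[-1]).startswith("M"):
            stripped = stripped[:-1]
        matches |= {w for w in vocabulary if w.startswith(stripped)}
        return matches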
A sample of stop words that were obtained by our algorithm:
For the personal pronoun 'ನನ್ನ', some of the words obtained are:
For the verb ‘ಬ ’, the similar words were:
As evident, though some words have a semantic relationship to the primary stop word, many
words have no such relationship, and further work needs to be done to find methods which will
prevent such words from being wrongly treated as stop words. Starting with a basic list of stop words,
this program can be used to find structurally similar words, and semantically unrelated words can
be manually removed from the stop word list.
3.7. Results
Evaluation-I gives the result of a manual evaluation of the summarizer against three different human
summaries across various categories. Three different people were asked to create reference
summaries for random documents in each category. The same documents were then fed to the
program with the limit kept at m = 10. The number of sentences common between the two
summaries gives the relevance score; the average of the three scores is shown for each document.
Evaluation-II consisted of a single human reference summary for a data set consisting of 20
documents across different categories. This human reference summary was compared to the
summary generated by the machine. For Evaluation III and Evaluation IV, we considered 20
documents, across categories: Literature, Entertainment, Sports, and Religion. Two independent
reviewers were asked to summarize the documents (extractive, not abstractive); the results are shown in
Table 3 and Table 4.
Evaluation V and Evaluation VI, given in Table 5 and Table 6, show the number of common
sentences between the human summary and the machine summary, where the machine summary is
generated for the same set of documents, across the same categories, considered for Evaluation III and
Evaluation IV. The maximum limit on the number of sentences to be produced by the machine is the
number of sentences obtained by the human summarizers for a given document.
Evaluation VII and Evaluation VIII consist of averages over the human and machine
summaries: the number of sentences common between the human summary and the machine
summary is taken into consideration and the average is taken.
Table 1. Evaluation-I Results.
Table 2. Evaluation-II Results.
Table 3. Evaluation-III
Table 4. Evaluation-IV
Table 5. Evaluation-V
Table 6. Evaluation-VI
Table 7. Evaluation VII
Table 8. Evaluation VIII
Below is a snapshot of the summarizer output and the actual document, in that order.
3.8. Summarization of unclassified documents
The main purpose of classification in our work is to make efficient use of feature selection
techniques such as GSS co-efficient for summarizing documents. The same feature selection
techniques can be used to train classifiers thereby improving the efficiency of classifiers.
When an unclassified document is given as input, the stop word removal method described above
is applied. It is then submitted to the classifier which generates relevance scores depending on the
number of GSS co-efficient keywords the particular document contains. Such a method will
ensure that the classifier does not train itself on, and is not misguided by, frequently occurring
terms that are not relevant to a category. Once a document is classified, the procedure outlined above can
be followed.
If all the values returned by the classifier across all categories fall below the threshold, TF-IDF
can be used as the fallback for summarization.
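The classification step described here might be sketched as follows; the scoring rule and threshold value are illustrative assumptions, and a caller receiving None would fall back to the TF-IDF scores of Section 3.3.

    def classify(doc_tokens, category_keywords, threshold=0.05):
        # Score the document against each category by the fraction of its
        # tokens that appear in that category's GSS keyword list.
        tokens = set(doc_tokens)
        scores = {c: len(tokens & set(kws)) / max(len(tokens), 1)
                  for c, kws in category_keywords.items()}
        best = max(scores, key=scores.get)
        # If every relevance score falls below the threshold, signal the
        # caller to fall back to TF-IDF based summarization.
        return best if scores[best] >= threshold else None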
4. Conclusion
It is evident that evaluating summaries is a difficult task, as it is not deterministic. Different
people may choose different sentences, and even the same person may choose different sentences at
different times. The sentence recall measure is used as the evaluation factor. Although humans can
also cut and paste relevant information from a text, most of the time they rephrase sentences when
necessary, or they join different related pieces of information into one sentence; hence paraphrasing and
rephrasing are issues in summarization. However, in our tests, sentences were selected as a
whole both by the machine and by the human summarizers.
Most evaluation systems in summarization require at least one reference summary, which is the
most difficult and expensive part of the task; much effort is needed in order to build a corpus of texts and their
corresponding summaries.
The classifier helps determine the relevant category; hence a summary can be produced for any
document given as input. There is no standard stop word list for Kannada, nor standard methods for building one.
Hence the procedure given in this work can be used as a stop word removal tool based on the
structurally-similar-word technique. The summarizer can be used as a tool in various organizations
such as the Kannada Development Authority, the Kannada Sahitya Parishath, etc.
References
[1] Gabor Berend, Richárd Farkas, ‘SZTERGAK: Feature Engineering for Keyphrase Extraction’, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 186–189, Uppsala, Sweden, 15-16 July 2010.
[2] Mari-Sanna Paukkeri and Timo Honkela, ‘Likey: Unsupervised Language-independent Keyphrase Extraction’, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 162–165, Uppsala, Sweden, 15-16 July 2010.
[3] Letian Wang, Fang Li, ‘SJTULTLAB: Chunk Based Method for Keyphrase Extraction’, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 158–161, Uppsala, Sweden, 15-16 July 2010.
[4] You Ouyang, Wenjie Li, Renxian Zhang, ‘273. Task 5. Keyphrase Extraction Based on Core Word Identification and Word Expansion’, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 142–145, Uppsala, Sweden, 15-16 July 2010.
[5] Su Nam Kim, Olena Medelyan, Min-Yen Kan and Timothy Baldwin, ‘SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles’, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 21–26, Uppsala, Sweden, 15-16 July 2010.
[6] Fumiyo Fukumoto, Akina Sakai, Yoshimi Suzuki, ‘Eliminating Redundancy by Spectral Relaxation for Multi-Document Summarization’, Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language Processing, ACL 2010, pp. 98–102, Uppsala, Sweden, 16 July 2010.
[7] Michael J. Paul, ChengXiang Zhai, Roxana Girju, ‘Summarizing Contrastive Viewpoints in Opinionated Text’, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 66–76, MIT, Massachusetts, USA, 9-11 October 2010.
[8] You Ouyang, Wenjie Li, Qin Lu, Renxian Zhang, ‘A Study on Position Information in Document Summarization’, Coling 2010: Poster Volume, pp. 919–927, Beijing, August 2010.
[9] Xiaojun Wan, Huiying Li and Jianguo Xiao, ‘Cross-Language Document Summarization Based on Machine Translation Quality Prediction’, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 917–926, Uppsala, Sweden, 11-16 July 2010.
[10] Ahmet Aker, Trevor Cohn, ‘Multi-document summarization using A* search and discriminative training’, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 482–491, MIT, Massachusetts, USA, 9-11 October 2010.
[11] Hal Daumé III, Daniel Marcu, ‘Induction of Word and Phrase Alignments for Automatic Document Summarization’, Computational Linguistics, 31(4), pp. 505–530, December 2005.
[12] Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun, ‘Automatic Keyphrase Extraction via Topic Decomposition’, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 366–376, MIT, Massachusetts, USA, 9-11 October 2010.
[13] L. Galavotti, F. Sebastiani, and M. Simi, ‘Experiments on the use of feature selection and negative evidence in automated text categorization’, Proc. 4th European Conf. on Research and Advanced Technology for Digital Libraries, Springer-Verlag, pp. 59–68, 2000.
[14] Inderjeet Mani, ‘Automatic Summarization’, John Benjamins Publishing Co.
Authors
Jayashree R. is an Assistant Professor in the Department of Computer Science and
Engineering at PES Institute of Technology, Bangalore. She has over 15 years
of teaching experience and has published several papers in international
conferences. Her research areas of interest are Natural Language Processing and
Pattern Recognition.

Dr. K. Srikanta Murthy is a Professor and Head of the Department of Computer
Science and Engineering at PES School of Engineering, Bangalore. He has put
in 24 years of service in teaching and 5 years in research. He has published 8
papers in reputed international journals and 41 papers at various international
conferences. He is currently guiding 5 PhD students. He is also a member of
various Boards of Studies and Boards of Examiners for different universities. His
research areas include Image Processing, Document Image Analysis, Pattern
Recognition, Character Recognition, Data Mining and Artificial Intelligence.

Sunny K. is a final-year Computer Science undergraduate student.