As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter, and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input, processes it by building a rich semantic graph, and then reduces this graph to generate the final summary.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
Dissertation defense slides on "Semantic Analysis for Improved Multi-document... (Quinsulon Israel)
This document outlines Quinsulon Israel's Ph.D. dissertation defense on using semantic analysis to improve multi-document summarization. The dissertation examines using semantic triples clustering and semantic class scoring of sentences to generate summaries. It reviews prior work on statistical, features combination, graph-based, multi-level text relationship, and semantic analysis approaches. The dissertation aims to improve the baseline method and evaluate the effects of semantic analysis on focused multi-document summarization performance.
Improvement of Text Summarization using Fuzzy Logic Based Method (IOSR Journals)
The document describes a method for improving text summarization using fuzzy logic. It proposes using fuzzy logic to determine the importance of sentences based on calculated feature scores. Eight features are used to score sentences, including title words, length, term frequency, position, and similarity. Sentences are then ranked based on their fuzzy logic-determined scores. The highest scoring sentences are extracted to create a summary. An evaluation of summaries generated using this fuzzy logic method found it performed better than other summarizers in accurately reflecting the content and order of human-generated reference summaries. The method could be expanded to multi-document summarization and automatic selection of fuzzy rules based on input type.
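A minimal sketch of the fuzzy scoring stage may make the idea concrete. The Python snippet below fuzzifies two illustrative features (title-word overlap and sentence position) with hand-chosen triangular membership functions and combines two toy rules Sugeno-style; the paper's eight features, membership shapes, and rule base are richer, so everything here (function names, thresholds, rule weights) is an assumption for illustration only.

```python
# A minimal sketch of fuzzy sentence scoring, assuming triangular membership
# functions and a Sugeno-style weighted average over a toy two-rule base.

def tri(x, a, b, c):
    """Triangular membership: rises from a to peak b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_sentence_score(title_overlap, position):
    """title_overlap and position are normalized to [0, 1]."""
    # Fuzzify the two illustrative features.
    overlap_high = tri(title_overlap, 0.3, 1.0, 1.7)   # peaks at 1.0
    overlap_low  = tri(title_overlap, -0.7, 0.0, 0.7)  # peaks at 0.0
    pos_early    = tri(position, -0.7, 0.0, 0.7)       # sentences near the start
    # Toy rule base: IF overlap high AND position early THEN important (1.0);
    # IF overlap low THEN unimportant (0.1). AND is min, rules are weighted.
    rules = [
        (min(overlap_high, pos_early), 1.0),
        (overlap_low, 0.1),
    ]
    firing = sum(w for w, _ in rules)
    return sum(w * out for w, out in rules) / firing if firing else 0.0

# Rank sentences by fuzzy score and keep the top ones for the summary.
scores = [fuzzy_sentence_score(o, p) for o, p in [(0.8, 0.0), (0.1, 0.9)]]
print(scores)  # the first sentence scores markedly higher
```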
Automatic text summarization is the process of reducing the text content of a document while retaining its important points. Generally, there are two approaches to automatic text summarization: extractive and abstractive. The process of extractive text summarization can be divided into two phases: pre-processing and processing. In this paper, we discuss some of the extractive text summarization approaches used by researchers, describe the features used in the extractive summarization process, and present the available linguistic preprocessing tools, with their features, that are used for automatic text summarization. The tools and parameters useful for evaluating the generated summary are also discussed. Moreover, we explain our proposed lexical chain analysis approach for extractive automatic text summarization, with sample generated lexical chains, and provide the evaluation results of our system-generated summary. The proposed lexical chain analysis approach can be used to solve different text mining problems such as topic classification, sentiment analysis, and summarization.
This document describes a method for sentence similarity based text summarization using clusters. It involves preprocessing text, extracting primitives from sentences, linking primitives, computing sentence similarity, merging similarity values, clustering similar sentences, and extracting a representative sentence from each cluster to generate a summary. Key steps include identifying common elements (primitives) between sentences, representing sentences as vectors of primitives, computing similarity based on shared primitives, clustering similar sentences, pruning clusters to remove dissimilar sentences, ranking clusters by importance, and selecting a representative sentence from each cluster for the summary. The goal is to automatically generate a short summary that captures the essential information from a collection of documents or text on the same topic.
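A rough sketch of this pipeline, with word lemmas standing in for the paper's "primitives" and Jaccard overlap as the similarity measure, might look as follows; the 0.2 threshold and the greedy cluster-joining rule are illustrative assumptions, not the paper's method.

```python
# A minimal sketch of cluster-based summarization: sentences become sets of
# "primitives" (here, plain word tokens), similar sentences are grouped, and
# one representative per cluster forms the summary.
import re

def primitives(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_sentences(sentences, threshold=0.2):
    sets = [primitives(s) for s in sentences]
    clusters = []  # each cluster is a list of sentence indices
    for i, s in enumerate(sets):
        for cl in clusters:
            # join the first cluster whose members are all similar enough
            if all(jaccard(s, sets[j]) >= threshold for j in cl):
                cl.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def summarize(sentences):
    # rank clusters by size (a crude importance proxy), then pick the member
    # most similar to the rest of its cluster as the representative
    clusters = sorted(cluster_sentences(sentences), key=len, reverse=True)
    summary = []
    for cl in clusters:
        rep = max(cl, key=lambda i: sum(
            jaccard(primitives(sentences[i]), primitives(sentences[j]))
            for j in cl if j != i))
        summary.append(sentences[rep])
    return summary
```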
Following are the questions I tried to answer in this presentation:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are the business applications?
The first few slides explain extractive summarization with its pros and cons; the next section explains abstractive summarization.
The last section highlights the business applications of each.
This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single document summarization and multiple document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents.
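As one concrete example of the unsupervised route mentioned above, the sketch below scores sentences by their total TF-IDF mass using scikit-learn and keeps the top-k in original order; splitting on periods and the scoring rule are simplifying assumptions.

```python
# A minimal sketch of unsupervised extractive selection: score each sentence
# by its total TF-IDF weight (normalized by vocabulary size) and keep the
# top-k sentences in their original order.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(text, k=3):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # row mean = sum of the sentence's TF-IDF weights / vocabulary size
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-k:])  # keep original sentence order
    return ". ".join(sentences[i] for i in top) + "."
```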
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A... (ijctcm)
This document summarizes an approach to automatic text summarization using the Simplified Lesk algorithm and WordNet. It analyzes sentences to determine relevance based on semantic information rather than surface features like position or format. Sentences are assigned weights based on the number of overlaps between words' dictionary definitions and the full text. Higher weighted sentences are selected for the summary based on a percentage of the original text length. The approach achieves 80% accuracy on 50% summarizations of diverse texts compared to human summaries.
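A hedged sketch of the overlap-weighting idea, using NLTK's WordNet glosses: each word contributes the size of the largest overlap between one of its sense definitions and the full text, and a sentence's weight is the sum over its words. The exact overlap counting and sense-selection rules of the paper may differ.

```python
# A minimal sketch of Lesk-style sentence weighting with WordNet glosses.
# Run nltk.download("wordnet") and nltk.download("punkt") once beforehand.
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def gloss_overlap(word, text_words):
    # best overlap between any sense definition of `word` and the full text
    best = 0
    for synset in wn.synsets(word):
        gloss = set(word_tokenize(synset.definition().lower()))
        best = max(best, len(gloss & text_words))
    return best

def sentence_weight(sentence, full_text):
    text_words = set(word_tokenize(full_text.lower()))
    return sum(gloss_overlap(w, text_words)
               for w in word_tokenize(sentence.lower()))
```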
The document summarizes different techniques for automatic document summarization including extractive and abstractive approaches. It discusses simple techniques like frequency-based methods and cue phrases. Graph-based approaches like TextRank and LexRank that model text as a graph are explained. Linguistic methods involving lexical chains and rhetorical structure are covered. Finally, it summarizes WordNet-based semantic approaches and techniques for evaluating summaries.
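To illustrate the graph-based family, here is a compact TextRank-style sketch: sentences become nodes, cosine similarity over TF-IDF vectors supplies edge weights, and networkx's PageRank ranks the nodes. The similarity measure and the choice of k are assumptions, not a reconstruction of the original TextRank/LexRank formulas.

```python
# A minimal TextRank-style sketch: rank sentences by PageRank over a
# similarity graph, then return the top-k in original order.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, k=3):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.from_numpy_array(sim)          # nodes = sentence indices
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:k])
    return [sentences[i] for i in top]
```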
An automatic text summarization using lexical cohesion and correlation of sen... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that helps extract important events such as dates, places, and subjects of interest. It would also be convenient if the methodology helped present users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
Presentation from my thesis defense on text summarization. It discusses existing state-of-the-art models along with the efficiency of AMR (Abstract Meaning Representation) for text summarization, and shows how AMRs can be used with seq2seq models. We also discuss other techniques such as BPE (Byte Pair Encoding) and its effectiveness for the task, and examine how data augmentation with POS tags and AMRs affects summarization with seq2seq learning.
The project is developed as a part of IRE course work @IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
IRJET - Automatic Recapitulation of Text Document (IRJET Journal)
The document describes an approach for automatically generating summaries of text documents. It involves preprocessing the input text by tokenizing, removing stop words, and stemming words. It then extracts features from the preprocessed text, such as term frequency-inverse document frequency (tf-idf) values. Sentences are scored based on these features and sentences with higher scores are selected to form the summary. Keywords from the text are also identified using WordNet to help select the most relevant sentences for the summary. The proposed approach aims to generate concise yet meaningful summaries using natural language processing techniques.
We investigate one technique to produce a summary of an original text without requiring its full semantic interpretation, relying instead on a model of the topic progression in the text derived from lexical chains. We present a new algorithm to compute lexical chains in a text, merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger, a shallow parser for the identification of nominal groups, and a segmentation algorithm.
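A simplified sketch of chain construction along these lines, using NLTK for tokenization and POS tagging and WordNet for relatedness (nouns join a chain when they share a synset or a direct hypernym with a member): the segmentation step and nominal-group parsing from the abstract are omitted, and the relatedness test is a deliberately narrow stand-in.

```python
# A minimal sketch of lexical chain building over the nouns of a text.
# Run nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# and nltk.download("wordnet") once beforehand.
import nltk
from nltk.corpus import wordnet as wn

def related(w1, w2):
    s1, s2 = set(wn.synsets(w1, "n")), set(wn.synsets(w2, "n"))
    if s1 & s2:                      # shared sense (identity/synonymy)
        return True
    hypernyms = {h for s in s1 for h in s.hypernyms()}
    return bool(hypernyms & s2)      # direct hypernym link

def lexical_chains(text):
    tokens = nltk.pos_tag(nltk.word_tokenize(text))
    nouns = [w.lower() for w, tag in tokens if tag.startswith("NN")]
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains  # longer chains indicate more central topics
```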
Extractive summarization extracts objects from the entire collection without modifying the objects themselves; the goal is to select whole sentences (without modifying them).
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document as they reflect the document's content and act as indexes for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts key words from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting key words and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
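The GSS coefficient named above is a category-conditioned feature-selection score, GSS(t, c) = P(t, c)·P(~t, ~c) - P(t, ~c)·P(~t, c). A small sketch of how it could be estimated from document counts and combined with TF and IDF follows; the multiplicative combination rule is an illustrative assumption, since the summary does not fix it.

```python
# A minimal sketch of GSS + IDF + TF keyword scoring from document counts.
import math

def gss(docs_with_term_in_cat, docs_with_term_other, cat_docs, other_docs):
    n = cat_docs + other_docs
    p_tc   = docs_with_term_in_cat / n                      # P(t, c)
    p_tnc  = docs_with_term_other / n                       # P(t, ~c)
    p_ntc  = (cat_docs - docs_with_term_in_cat) / n         # P(~t, c)
    p_ntnc = (other_docs - docs_with_term_other) / n        # P(~t, ~c)
    return p_tc * p_ntnc - p_tnc * p_ntc

def keyword_score(tf, df, n_docs, g):
    idf = math.log(n_docs / df)
    # multiplicative combination (an assumption); negative GSS means the
    # term is anti-indicative of the category, so it is clipped to zero
    return tf * idf * max(g, 0.0)
```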
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document as they reflect the document's content and act as indices for a given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a number of sentences as the limit. The algorithm extracts key words from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents, then combine scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting key words, and later use these for summarization based on the rank of the sentence. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
Text can be analysed by splitting it and extracting keywords, which may then be represented as summaries, tabular representations, graphical forms, or images. The need to deal with the large amount of information present in textual format has led to research into extracting text and transforming it from unstructured to structured form. This paper presents the importance of Natural Language Processing (NLP) and two interesting applications in the Python language: 1. Automatic text summarization [Domain: Newspaper Articles]; 2. Text to Graph Conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e. deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence, and database concepts. The Automatic Summarization tool converts newspaper articles into summaries on the basis of word frequency in the text. The Text to Graph Converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and then maps the tokens to a graph. This paper proposes a business solution for users for effective time management.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S... (TELKOMNIKA JOURNAL)
Most scientific publishers encourage authors to provide keyphrases for their published articles. Hence, the need to automate keyphrase extraction has increased. However, it is not a trivial task, considering that keyphrase characteristics may overlap with those of non-keyphrases. To date, the accuracy of automatic keyphrase extraction approaches is still considerably low. In response to this gap, this paper proposes two contributions. First, a feature called fact-based sentiment is proposed. It is expected to strengthen keyphrase characteristics since, according to manual observation, most keyphrases are mentioned in neutral-to-positive sentiment. Second, a combination of supervised and unsupervised approaches is proposed to take the benefits of both. It enables automatic hidden pattern detection while keeping candidate importance comparable across candidates. According to evaluation, fact-based sentiment is quite effective for representing keyphraseness, and the semi-supervised approach is considerably effective for extracting keyphrases from scientific articles.
Understanding Natural Language with Corpora-based Generation of Dependency G... (Edmond Lepedus)
This document discusses training a dependency parser using an unparsed corpus rather than a manually parsed training set. It develops an iterative training method that generates training examples using heuristics from past parsing decisions. The method is shown to produce parse trees qualitatively similar to conventionally trained parsers. Three avenues for future research using this corpus-based generation method are proposed.
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural language requirements of the automotive industry. The approach is based on the distributional hypothesis and the special characteristics of domain-specific German compounds. We extract taxonomies by using clustering techniques in combination with general thesauri. Such a taxonomy can be used to support requirements engineering in early stages by providing a common system understanding and an agreed-upon terminology. This work is part of an ontology-driven requirements engineering process, which builds on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common hierarchical clustering techniques.
A domain specific automatic text summarization using fuzzy logic (IAEME Publication)
This paper proposes a domain-specific automatic text summarization technique using fuzzy logic. It extracts important sentences from documents using both statistical and linguistic features. The technique first preprocesses documents by segmenting sentences, removing stop words, and stemming words. It then extracts eight features from each sentence, including title words, sentence length, position, and similarity. Fuzzy logic is used to assign an importance score to each sentence based on these features. Sentences are ranked and the top sentences are selected for the summary based on the compression rate. The technique was evaluated on news articles and shown to generate good quality summaries compared to other summarizers based on precision, recall, and F-measure.
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO... (cseij)
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval, modifying our previous TF-ISF method, which uses local context information, to take into account only document-level information. This is quite a new approach to sentence retrieval, presented for the first time in this paper, and it is compared to existing methods that use information from the whole document collection. Using this approach and the developed methods for sentence retrieval at the document level, it is possible to assess the relevance of a sentence using only information from the retrieved sentence's document, and to define a document-level OWL representation for sentence retrieval that can be automatically populated. In this way the idea of the Semantic Web is supported through automatic and semi-automatic extraction of additional information from existing web resources. The additional information is formatted in an OWL document containing document sentence relevance for sentence retrieval.
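A minimal sketch of document-level TF-ISF as described: the inverse sentence frequency of each term is computed over the sentences of the single document only, with no collection statistics. Tokenization by whitespace is a simplification.

```python
# A minimal sketch of TF-ISF scoring over the sentences of one document.
import math
from collections import Counter

def tf_isf_scores(sentences):
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # sentence frequency: in how many sentences of this document a term occurs
    sf = Counter(t for sent in tokenized for t in set(sent))
    scores = []
    for sent in tokenized:
        tf = Counter(sent)
        scores.append(sum(tf[t] * math.log(n / sf[t]) for t in tf))
    return scores  # higher score = more salient within this document
```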
The document discusses query-based summarization, including defining the task, evaluation criteria, and the different approaches used. Key approaches discussed include using document graphs to identify relevant sections, rhetorical structure theory to create a graph representation, linguistic techniques like Hidden Markov Models for sentence selection, and machine learning methods such as support vector machines for ranking sentences. Different domains like medical and opinion summarization are also outlined.
Complete agglomerative hierarchy document's clustering based on fuzzy luhn's ... (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up method in which the distances between documents can be obtained by extracting feature values using a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn's Idea. These methods can be used to build better document clusters, but little research discusses them. Therefore, in this research, the term weighting calculation uses Luhn's Idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values are used as the distances between documents, which are clustered with single-, complete-, and average-link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little, but topic determination of each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding more relevant documents. The combination of the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better method for clustering documents that is more relevant to its gold standard.
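To make the two framing steps concrete, the sketch below applies Luhn-style upper and lower frequency cut-offs for term selection and scipy's complete-link clustering over plain term-count vectors; the LDA/fuzzy-Sugeno feature extraction at the heart of the paper is deliberately replaced by raw counts, and the cut-off values are arbitrary.

```python
# A minimal sketch: Luhn cut-off term selection + complete-link clustering.
from collections import Counter
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def luhn_terms(docs, lower=2, upper=50):
    """Keep mid-frequency terms: drop rare (< lower) and common (> upper)."""
    freq = Counter(t for d in docs for t in d.lower().split())
    return sorted(t for t, f in freq.items() if lower <= f <= upper)

def complete_link_clusters(docs, n_clusters=3):
    vocab = luhn_terms(docs)
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))      # term-count vectors
    for row, d in enumerate(docs):
        for t in d.lower().split():
            if t in index:
                X[row, index[t]] += 1
    Z = linkage(X, method="complete", metric="cosine")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```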
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS (ijnlc)
In this paper a methodology to mine concepts from documents and use these concepts to generate an objective summary of the claims section of patent documents is proposed. The Conceptual Graph (CG) formalism as proposed by Sowa (Sowa 1984) is used in this work for representing the concepts and their relationships. Automatic identification of concepts and conceptual relations from text documents is a challenging task. In this work the focus is on the analysis of patent documents, mainly the claims section of the documents. There are several complexities in the writing style of these documents, as they are technical as well as legal. It is observed that the general in-depth parsers available in the open domain fail to parse the claims-section sentences in patent documents. The failure of in-depth parsers has motivated us to develop a methodology to extract CGs using other resources. Thus, in the present work, shallow parsing, NER, and machine learning techniques are used for extracting concepts and conceptual relationships from sentences in the claims section of patent documents. This paper therefore discusses i) generation of the CG, a semantic network, and ii) generation of an abstractive summary of the claims section of the patent. The aim is to generate a summary which is 30% of the whole claims section. Here we use Restricted Boltzmann Machines (RBMs), a deep learning technique, for automatically extracting CGs. We have tested our methodology on a corpus of 5000 patent documents from the electronics domain. The results obtained are encouraging and are comparable with state-of-the-art systems.
Evaluation of subjective answers using glsa enhanced with contextual synonymy (ijnlc)
Evaluation of subjective answers submitted in an exam is an essential but one of the most resource-consuming educational activities. This paper details experiments conducted under our project to build software that evaluates subjective answers of an informative nature in a given knowledge domain. The paper first summarizes the techniques, such as Generalized Latent Semantic Analysis (GLSA) and Cosine Similarity, that provide the basis for the proposed model. The following sections point out the areas of improvement in previous work and describe our approach to addressing them. We then discuss the implementation details of the project, followed by findings that show the improvements achieved. Our approach focuses on comprehending the various forms of expressing the same entity and thereby capturing the subjectivity of text in objective parameters. The model is tested by evaluating answers submitted by 61 students of the Third Year B. Tech. CSE class of Walchand College of Engineering, Sangli, in a test on Database Engineering.
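The cosine-similarity core of such a grader is small enough to sketch directly; here it is computed over raw term-frequency vectors, whereas the paper projects answers into a GLSA space enriched with contextual synonymy first.

```python
# A minimal sketch of cosine similarity between a student answer and a
# model answer, over raw term-frequency vectors.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# score = cosine(student_answer, model_answer)  # in [0, 1]
```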
An exhaustive font and size invariant classification scheme for ocr of devana... (ijnlc)
The document presents a classification scheme for recognizing Devanagari characters that is invariant to font and size. It identifies the basic symbols that commonly appear in the middle zone of Devanagari text across different fonts and sizes. Through an analysis of over 465,000 words from various sources, it finds that 345 symbols account for 99.97% of text and aims to classify these into groups based on structural properties like the presence or absence of vertical bars. The proposed classification scheme is validated on 25 fonts and 3 sizes to demonstrate its font and size invariance.
SENTIMENT ANALYSIS FOR MODERN STANDARD ARABIC AND COLLOQUIAL (ijnlc)
The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations; therefore many are now looking to the field of sentiment analysis. In this paper, we present a feature-based sentence-level approach for Arabic sentiment analysis. Our approach uses an Arabic idioms/saying-phrases lexicon as a key resource for improving the detection of sentiment polarity in Arabic sentences, as well as a number of novel and rich linguistically motivated features (contextual intensifiers, contextual shifters and negation handling) and syntactic features for conflicting phrases, which enhance the sentiment classification accuracy. Furthermore, we introduce an automatically expandable, wide-coverage polarity lexicon of Arabic sentiment words. The lexicon is built with gold-standard sentiment words as a seed, which is manually collected and annotated, and it expands and detects the sentiment orientation of new sentiment words automatically using a synset aggregation technique and free online Arabic lexicons and thesauruses. Our data focus on Modern Standard Arabic (MSA) and Egyptian dialectal Arabic tweets and microblogs (hotel reservations, product reviews, etc.). The experimental results using our resources and techniques with an SVM classifier indicate high performance levels, with accuracies of over 95%.
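A sketch of just the classification stage, assuming scikit-learn with word n-gram TF-IDF features; the paper's idiom-lexicon, intensifier, shifter, and negation features would enter as additional feature columns, which are omitted here.

```python
# A minimal sketch of the SVM sentiment classification stage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_sentiment_classifier(texts, labels):
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
        LinearSVC(),
    )
    return model.fit(texts, labels)

# clf = train_sentiment_classifier(train_tweets, train_polarities)
# clf.predict(["..."])  # -> array of predicted polarity labels
```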
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS (ijnlc)
Globalization and the growth of Internet users truly demand that almost all internet-based applications support local languages. Support for local languages can be given in all internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that the linguistic approach provides better results for closely related languages, and probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT... (ijnlc)
The document discusses the implementation of a natural language interface (NLization) framework for Punjabi using the EUGENE system. It focuses on generating Punjabi sentences from UNL representations for verbs, pronouns, and determiners. Rules and dictionaries are added to EUGENE to analyze the UNL syntax and semantics and output corresponding Punjabi sentences without human intervention. Examples are provided to illustrate the NLization process for sentences containing different parts of speech.
An expert system for automatic reading of a text written in standard arabic (ijnlc)
In this work we present our expert system for automatic reading, or speech synthesis, of text written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (Text To Speech, TTS). This transformation is done first by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and second by the generation of the voice signal corresponding to the transcribed chain. We present the different design aspects of the system, as well as the results obtained, compared to other works studied, to realize TTS based on Standard Arabic.
This document describes a verb-based sentiment analysis of Manipuri language documents using conditional random fields for part-of-speech tagging and a manually annotated verb lexicon to determine sentiment polarity. The system was tested on 550 letters to newspaper editors, achieving an average recall of 72.1%, precision of 78.14%, and F-measure of 75% for sentiment classification. The authors conclude the work is an initial effort in sentiment analysis for the highly agglutinative Manipuri language and more methods are needed to improve accuracy.
A MULTI-STREAM HMM APPROACH TO OFFLINE HANDWRITTEN ARABIC WORD RECOGNITION (ijnlc)
This document presents a multi-stream HMM approach for offline handwritten Arabic word recognition. It extracts two sets of features from each word using a sliding window approach and VH2D projection approach. These features are input to separate HMM classifiers, and the outputs are combined in a multi-stream HMM to provide more reliable recognition. The system is evaluated on 200 words, achieving a recognition rate of 83.8% using the multi-stream approach compared to 78.2% and 76.6% for the individual classifiers.
Hybrid part-of-speech tagger for non-vocalized Arabic text (ijnlc)
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
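A toy version of the hybrid idea using NLTK (shown on English data for brevity, since the paper targets Arabic): an HMM tagger trained on a tagged corpus provides the statistical layer, and a rule layer post-corrects its output. The single suffix rule is a placeholder for the paper's rule set.

```python
# A minimal sketch of a hybrid rule + HMM part-of-speech tagger with NLTK.
from nltk.tag import hmm

def train_hmm_tagger(tagged_sentences):
    # tagged_sentences: list of [(word, tag), ...] training sentences
    return hmm.HiddenMarkovModelTrainer().train_supervised(tagged_sentences)

def rule_layer(tagged_tokens):
    # post-correct HMM output with simple rules (illustrative suffix rule)
    fixed = []
    for word, tag in tagged_tokens:
        if word.endswith("ly"):          # e.g. tag -ly words as adverbs
            tag = "RB"
        fixed.append((word, tag))
    return fixed

# import nltk  # requires nltk.download("treebank") once
# tagger = train_hmm_tagger(nltk.corpus.treebank.tagged_sents())
# print(rule_layer(tagger.tag("the model runs quickly".split())))
```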
Smart grammar: a dynamic spoken language understanding grammar for inflective ... (ijnlc)
1. The document proposes SmartGrammar, a new method for developing spoken language understanding grammars for inflectional languages like Italian.
2. SmartGrammar uses a morphological analyzer to convert user utterances into their canonical forms before parsing, allowing the grammar to contain only canonical word forms rather than all possible inflections.
3. This significantly reduces the complexity and size of grammars for inflectional languages by representing many possible inflected forms with a single canonical form entry, making grammar development and management easier.
Building a Vietnamese dialog mechanism for V-DLG~TABL system (ijnlc)
This paper introduces a Vietnamese automatic dialog mechanism which allows the V-DLG~TABL system to automatically communicate with clients. This dialog mechanism is based on the Question-Answering engine of the V-DLG~TABL system, and comprises the following supplementary mechanisms: 1) a mechanism for choosing the suggested question; 2) a mechanism for managing conversations and suggesting information; 3) a mechanism for resolving questions containing anaphora; 4) dialog scenarios (at this stage, only one simple scenario is defined for the system).
CLUSTERING WEB SEARCH RESULTS FOR EFFECTIVE ARABIC LANGUAGE BROWSING (ijnlc)
The process of browsing search results is one of the major problems with traditional Web search engines for English, European, and other languages generally, and for the Arabic language particularly. This process is absolutely time consuming, and the browsing style seems unattractive. Organizing Web search results into clusters facilitates users' quick browsing through search results. Traditional clustering techniques (data-centric clustering algorithms) are inadequate since they don't generate clusters with highly readable names or cluster labels. To solve this problem, description-centric algorithms such as the Suffix Tree Clustering (STC) algorithm have been introduced and used successfully and extensively, with different adapted versions, for English, European, and Chinese languages. However, as of the writing of this paper, to our knowledge, the STC algorithm has never been applied to clustering Arabic Web snippet search results.
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATIONijnlc
The document discusses handling unknown words in named entity recognition using transliteration. It proposes an approach where named entities in training data are transliterated into other languages and stored in transliteration files. During testing, if an unknown entity is encountered, it is checked against the transliteration files and assigned the corresponding tag if found. The approach is shown to achieve 95.8% recall, 96.3% precision and 96.04% F-measure on a multilingual named entity recognition task handling words from English, Hindi, Marathi, Punjabi and Urdu. Performance metrics for named entity recognition systems such as precision, recall and F-measure are also discussed.
This paper deals with the chunking of the Manipuri language, which is highly agglutinative in
nature. The system works in such a way that the Manipuri text is cleaned up to the gold standard. The text is
processed for Part of Speech (POS) tagging using Conditional Random Fields (CRF). The output file is
treated as an input file for the CRF-based chunking system. The final output is a completely chunk-tagged
Manipuri text. The system shows a recall of 71.30%, a precision of 77.36% and an F-measure of 74.21%.
Developemnt and evaluation of a web based question answering system for arabi...ijnlc
Question Answering (QA) systems are gaining great importance due to the increasing amount of web
content and the high demand for digital information that regular information retrieval techniques cannot
satisfy. A question answering system enables users to have a natural language dialog with the machine,
which is required for virtually all emerging online service systems on the Internet. The need for such
systems is higher in the context of the Arabic language. This is because of the scarcity of Arabic QA
systems, which can be attributed to the great challenges they present to the research community, including
the particularities of Arabic, such as short vowels, absence of capital letters, complex morphology, etc. In
this paper, we report the design and implementation of an Arabic web-based question answering
system, which we called “JAWEB”, the Arabic word for the verb “answer”. Unlike all Arabic question-answering
systems, JAWEB is a web-based application, so it can be accessed at any time and from
anywhere. Evaluating JAWEB showed that it gives the correct answer with 100% recall and 80% precision
on average. When compared to ask.com, the well-established web-based QA system, JAWEB provided
15-20% higher recall. These promising results give clear evidence that JAWEB has great potential as a QA
platform and is much needed by Arabic-speaking Internet users across the world.
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELijnlc
This document describes a tool called NERHMM for performing named entity recognition using hidden Markov models. The tool allows users to annotate raw text to create tagged corpora, train hidden Markov models on annotated data to calculate parameters, and test new text data to produce named entity tags. The tool works for multiple languages and can handle diverse tag sets. It provides a simple interface for tasks involved in named entity recognition like corpus development and parameter estimation for hidden Markov models. Evaluation on data shows the tool's performance increases with more training data.
An improved apriori algorithm for association rulesijnlc
There are several mining algorithms for association rules. One of the most popular is Apriori,
which is used to extract frequent itemsets from a large database and obtain association rules for
discovering knowledge. Based on this algorithm, this paper points out the limitation of the original
Apriori algorithm of wasting time scanning the whole database searching for frequent itemsets, and
presents an improvement on Apriori that reduces the wasted time by scanning only some of the
transactions. The paper shows, through experimental results with several groups of transactions and
several values of minimum support applied to the original Apriori and our implemented improved
Apriori, that our improved Apriori reduces the time consumed by 67.38% in comparison with the original
Apriori, making the Apriori algorithm more efficient and less time consuming.
A comparative analysis of particle swarm optimization and k means algorithm f...ijnlc
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection
of data on the web, there is a need for grouping (clustering) documents into clusters for speedy
information retrieval. Clustering of documents is the grouping of documents such that the
documents within each group are similar to each other and not to documents of other groups. Quality of
clustering result depends greatly on the representation of text and the clustering algorithm. This paper
presents a comparative analysis of three algorithms namely K-means, Particle swarm Optimization (PSO)
and hybrid PSO+K-means algorithm for clustering of text documents using WordNet. The common way of
representing a text document is bag of terms. The bag of terms representation is often unsatisfactory as it
does not exploit the semantics. In this paper, texts are represented in terms of synsets corresponding to a
word. Bag of terms data representation of text is thus enriched with synonyms from WordNet. K-means,
Particle Swarm Optimization (PSO) and hybrid PSO+K-means algorithms are applied for clustering of
text in Nepali language. Experimental evaluation is performed by using intra cluster similarity and inter
cluster similarity.
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...ijnlc
The document proposes using part-of-speech tagging and stemming to improve Gujarati to Hindi machine translation through transliteration. It presents a system that applies stemming and POS tagging to Gujarati text before transliterating to resolve ambiguities. An evaluation of the system on 500 sentences found that transliteration and translation matched for 54.48% of Gujarati words, and overall transliteration efficiency was 93.09%. The approach aims to improve over direct transliteration for the highly inflected Gujarati language.
G2 pil a grapheme to-phoneme conversion tool for the italian languageijnlc
This paper presents a knowledge-based approach for the grapheme-to-phoneme (G2P) conversion of isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries. The dictionaries provide the mapping from the orthographic form of a word to its pronunciation, which is useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models, we need an automatic routine that maps the spelling of the training set to a string of phonetic symbols representing the pronunciation.
A Newly Proposed Technique for Summarizing the Abstractive Newspapers’ Articl...mlaij
In this new era, where tremendous information is available on the internet, it is of utmost importance to
provide improved mechanisms to extract information quickly and efficiently. It is very difficult
for human beings to manually extract summaries of large text documents. There is therefore the twofold
problem of searching for relevant documents among the many available, and of absorbing the
relevant information from them. To solve these two problems, automatic text summarization is
very much necessary. Text summarization is the process of identifying the most important meaningful
information in a document or set of related documents and compressing it into a shorter version
while preserving its overall meaning. More specifically, Abstractive Text Summarization (ATS) is the task of
constructing summary sentences by merging facts from different source sentences and condensing them
into a shorter representation while preserving information content and overall meaning. This paper
introduces a newly proposed technique for abstractive summarization of newspaper articles based on
deep learning.
A hybrid approach for text summarization using semantic latent Dirichlet allo...IJECEIAES
Automatic text summarization generates a summary that contains sentences reflecting the essential and relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach for generating text summaries that combine extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As we increase the concepts and number of words in the summary the rouge scores are improved for precision and F1 scores in our proposed model.
Summarization of software artifacts is an ongoing field of research among the software engineering community due to the benefits that summarization provides like saving of time and efforts in various software engineering tasks like code search, duplicate bug reports detection, traceability link recovery, etc. Summarization is to produce short and concise summaries. The paper presents the review of the state of the art of summarization techniques in software engineering context. The paper gives a brief overview to the software artifacts which are mostly used for summarization or have benefits from summarization. The paper briefly describes the general process of summarization. The paper reviews the papers published from 2010 to June 2017 and classifies the works into extractive and abstractive summarization. The paper also reviews the evaluation techniques used for summarizing software artifacts. The paper discusses the open problems and challenges in this field of research. The paper also discusses the future scopes in this area for new researchers.
IRJET- A Survey Paper on Text Summarization MethodsIRJET Journal
This document summarizes various approaches for automatic text summarization, including extractive and abstractive methods. It discusses early surface-level approaches from the 1950s that identified important sentences based on word frequency. It also reviews corpus-based, cohesion-based, rhetoric-based, and graph-based approaches. The document then examines single document summarization techniques like naive Bayes methods, log-linear models, and deep natural language analysis. It concludes with a discussion of multi-document summarization, including abstraction and information fusion as well as graph spreading activation approaches. The goal of the survey is to provide an overview of the major existing methods that have been used for automatic text summarization.
The document describes a proposed system for automatic text summarization using latent semantic analysis and natural language processing. The system would take in a user-submitted document, preprocess it by removing stop words, generate a singular value decomposition matrix to derive the latent semantic structure, and output a summary by semantically arranging sentences to encompass the original text's concepts. The document provides an implementation example in Python using the LSA method from the scikit-learn machine learning library for natural language processing and matrix decomposition.
Automatic Text Summarization Using Natural Language Processing (1)Don Dooley
This document discusses automatic text summarization using natural language processing. It describes two main approaches for summarization - extractive and abstractive. Extractive summarization selects important sentences from the original text, while abstractive summarization generates new sentences to summarize the text. The document presents the objectives, methodologies, organization of the project report, and provides a literature review of papers on text summarization techniques.
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURESijnlc
The document summarizes a technique for generating summaries using sentence compression and statistical measures. It first implements a graph-based technique to achieve sentence compression and information fusion. It then uses hand-crafted syntactic rules to prune compressed sentences. Finally, it uses probabilistic measures and word co-occurrence to obtain the summaries. The system can generate summaries at any user-defined compression rate.
The document describes an automatic text summarization system that uses both extractive and abstractive summarization techniques. It discusses pre-processing steps like stop word removal and sentence filtering. Key features for ranking sentences include title matching, term frequency-inverse document frequency (TF-IDF), and position of the sentence. The proposed system generates a summary by selecting the highest ranked sentences while maintaining overall meaning and readability. Room for future improvements include enhancing the stemming process and exploring additional ranking features.
A statistical model for gist generation a case study on hindi news articleIJDKP
Every day, a huge number of news articles is reported and disseminated on the Internet. By generating a gist
of an article, the reader can go through the main topics instead of reading the whole article, which takes
much time. An ideal system would understand the document
and generate the appropriate theme(s) directly from the results of that understanding. In the absence of
a natural language understanding system, it is required to design an appropriate one. Gist generation is a
difficult task because it requires both maximizing text content in a short summary and maintaining
grammaticality of the text. In this paper we present a statistical approach to generate a gist of a Hindi
news article. The experimental results are evaluated using the standard measures such as precision, recall
and F1 measure for different statistical models and their combination on the article before pre-processing
and after pre-processing.
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHSkevig
In this paper a methodology to mine the concepts from documents and use these concepts to generate an
objective summary of the claims section of the patent documents is proposed. Conceptual Graph (CG)
formalism as proposed by Sowa (Sowa 1984) is used in this work for representing the concepts and their
relationships. Automatic identification of concepts and conceptual relations from text documents is a
challenging task. In this work the focus is on the analysis of the patent documents, mainly on the claim’s
section (Claim) of the documents. There are several complexities in the writing style of these documents as
they are technical as well as legal. It is observed that the general in-depth parsers available in the open
domain fail to parse the ‘claims section’ sentences in patent documents. The failure of in-depth parsers
has motivated us to develop a methodology to extract CGs using other resources. Thus, in the present work,
shallow parsing, NER and machine learning techniques are used for extracting concepts and conceptual
relationships from sentences in the claims section of patent documents. This paper thus discusses i)
generation of the CG, a semantic network, and ii) generation of an abstractive summary of the claims
section of the patent. The aim is to generate a summary which is 30% of the whole claims section. Here we
use Restricted Boltzmann Machines (RBMs), a deep learning technique, for automatically extracting CGs.
We have tested our methodology using a corpus of 5000 patent documents from the electronics domain.
The results obtained are encouraging and comparable with state-of-the-art systems.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieving short and important text from the source document. In this paper, we studied various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but very few efforts have been made for the Hindi language. Here, we discuss various techniques, considering features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
Automatic Text Summarization: A Critical ReviewIRJET Journal
This document provides a literature review and critical analysis of automatic text summarization techniques. It discusses extractive and abstractive summarization approaches and reviews 10 papers published between 2009-2021 on topics like graph-based, keyword-based, and feature-based summarization methods. The document aims to identify strengths and limitations of the approaches discussed and opportunities for future work in automatic text summarization.
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
This paper proposes a multi-document summarization system that uses bisect k-means clustering, an optimal merge function, and a neural network. The system first preprocesses input documents through stemming and removing stop words. It then applies bisect k-means clustering to group similar sentences. The clusters are merged using an optimal merge function to find important keywords. The NEWSUM algorithm is used to generate a primary summary for each keyword. A neural network trained on sentence classifications is then used to classify sentences in the primary summary as positive or negative. Only positively classified sentences are included in the final summary to improve accuracy. The system aims to generate a concise and accurate summary in a short period of time from multiple documents on a given topic.
Design of optimal search engine using text summarization through artificial i...TELKOMNIKA JOURNAL
Natural language processing is a trending research area which allows developers to create human-computer interactions. Natural language processing is an integration of artificial intelligence, computer science and computational linguistics. Research in natural language processing is focused on creating devices or machines that operate on a single human command, allowing various bot creations that relay instructions from mobile devices to control physical devices through speech tagging. In our paper, we design a search engine which not only displays the data according to the user query but also gives a detailed display of the content or topic the user is interested in, using the summarization concept. We find the designed search engine has optimal response time for user queries by analyzing it with numbers of transactions as inputs. Also, the findings of the performance analysis show that the text summarization method is an efficient way to improve response time in search engine optimization.
Automatic Text Summarization using Natural Language ProcessingIRJET Journal
The document discusses automatic text summarization using natural language processing. It describes using the Simplified Lesk calculation and WordNet to assess the importance of sentences in a document and perform word sense disambiguation. The proposed approach assesses sentence weights using Simplified Lesk and orders them by weight. It then selects sentences for the summary based on a given summarization percentage. The approach provides good results for summaries up to 50% of the original text and acceptable results up to 25%.
IRJET- Semantic based Automatic Text Summarization based on Soft ComputingIRJET Journal
This document discusses semantic-based automatic text summarization using soft computing techniques. It begins with an introduction describing how large amounts of data are generated daily and the need for automated summarization. The next sections cover related work on text summarization methods including syntactic parsing, extractive techniques using n-gram language models and A* search, and mathematical reduction techniques like singular value decomposition and non-negative matrix factorization. The document also discusses using part-of-speech tagging, hidden Markov models, and named entity recognition for extractive summarization in Indian languages.
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMIRJET Journal
This document describes a proposed candidate set key document retrieval system. The system would process user queries in English and return relevant documents from a collection. It would use natural language processing techniques like tokenization, stop word removal, stemming, and lemmatization to index the documents and match them with user queries. The proposed system architecture includes components for indexing, processing user queries, and retrieving relevant documents from the collection. The indexing process involves organizing the documents, extracting tokens, removing stop words, and applying stemming/lemmatization to create an inverted index for efficient searching.
Conceptual framework for abstractive text summarization
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.1, February 2015
DOI : 10.5121/ijnlc.2015.4104
CONCEPTUAL FRAMEWORK FOR ABSTRACTIVE
TEXT SUMMARIZATION
Nikita Munot and Sharvari S. Govilkar
Department of Computer Engineering, Mumbai University, PIIT, New Panvel, India
ABSTRACT
As the volume of information available on the Internet increases, there is a growing need for tools helping
users to find, filter and manage these resources. While more and more textual information is available
online, effective retrieval is difficult without proper indexing and summarization of the content. One of the
possible solutions to this problem is abstractive text summarization. The idea is to propose a system that
accepts a single document in English as input, processes it by building a rich semantic graph, and then
reduces this graph to generate the final summary.
KEYWORDS
Part-of-speech (POS) tagging, rich semantic graph, abstractive summary, named entity recognition (NER).
1. INTRODUCTION
Text summarization is one of the most popular research areas today because of the problem of
information overload on the web, which has increased the need for stronger and more powerful
text summarizers. The condensation of information from text is needed, and this can be achieved
by text summarization, which reduces the length of the original text. Text summarization is
commonly classified into two types: extractive and abstractive. Extractive summarization means
extracting a few sentences from the original document based on some statistical factors and
adding them to the summary. Extractive summarization usually tends toward sentence extraction
rather than summarization, whereas abstractive summarization is more powerful because it
generates sentences based on their semantic meaning. Hence this leads to a meaningful summary
which is more accurate than extractive summaries.
Extractive summarization just extracts sentences from the original document and adds
them to the summary. The extractive method is based on statistical features, not on semantic relations
between sentences [2], and is easier to implement. Therefore the summary generated by this method
tends to be inconsistent. Summarization by abstraction needs understanding of the original text and
then generating a summary which is semantically related. It is difficult to compute an abstractive
summary because it needs understanding of complex natural language processing tasks.
There are several issues with extractive summarization. Extracted sentences usually tend to be longer
than average. Due to this, parts of the segments that are not essential for the summary also get
included, consuming space. Important or relevant information is usually spread across sentences,
and extractive summaries cannot capture this (unless the summary is long enough to hold all
those sentences). Conflicting information may not be presented accurately. Pure extraction often
leads to problems in the overall coherence of the summary. These problems become more severe in
the multi-document case, since extracts are drawn from different sources. Therefore abstractive
summarization is more accurate than extractive summarization.
In this paper, an approach is presented to generate an abstractive summary for the input document
using a graph reduction technique. The proposed system accepts a document as input
and processes it by building a rich semantic graph and then reducing this graph to
generate the summary. Related work and the literature survey are discussed in section 2. The proposed
system is discussed in section 3 and the conclusion in section 4.
2. LITERATURE SURVEY
In this section we cite the relevant past literature that uses various abstractive summarization
techniques to summarize a document. Techniques to date have focused on extractive rather than
abstractive summarization; the current state of the art is statistical methods for extractive summarization.
Pushpak Bhattacharyya [3] proposed a WordNet-based approach to text summarization. It extracts
a sub-graph from the WordNet graph for the entire document. Each node of the sub-graph is
assigned a weight with respect to its synset using WordNet. WordNet [11] is an online lexical
database. The proposed algorithm captures global semantic information using WordNet.
Silber G.H. and Kathleen F. McCoy [4][5] present a linear-time algorithm for lexical chain
computation. The lexical chain is used as an intermediate representation for automatic text
summarization. Lexical chains exploit the cohesion among an arbitrary number of related words,
and can be computed in a source document by grouping (chaining) sets of words that
are semantically related. Words must be grouped such that they create the strongest and longest
lexical chains.
J. Leskovec [6] proposed an approach which produces a logical form analysis for each sentence. The
author proposed extracting subject-predicate-object (SPO) triples from individual sentences to create a
semantic graph of the original document. It is difficult to compute SPO semantic triples, as this
requires deep understanding of natural language processing.
Clustering is used to summarize a document by grouping similar data or
sentences. Zhang Pei-yin and LI zcun-he [7] state that the summarization result depends on the sentence
features and on the sentence similarity measure. MultiGen [7] is a multi-document system in the
news domain.
Naresh Kumar and Dr. Shrish Verma [8] proposed a single-document frequent-terms-based text
summarization algorithm. The authors suggest an algorithm based on three steps: first, the
document to be summarized is processed by eliminating stop words. The next
step is to calculate term-frequency data from the document; frequent terms are selected, and
for these selected words semantically equivalent terms are also generated. Finally, in the third step, all
the sentences in the document which contain the frequent terms or their semantic equivalents are
filtered for summarization.
I. Fathy, D. Fadl and M. Aref [9] proposed a new semantic representation called Rich Semantic
Graph (RSG) to be used as an intermediate representation for various applications. A new model
to generate English text from an RSG is proposed. The method accesses a domain ontology which
contains the information needed in the same domain as the RSG.
The author suggested a method [10] for summarizing a document by creating a semantic graph and
identifying the substructure of the graph that can be used to extract sentences for a document summary.
It starts with deep syntactic analysis of the text. For each sentence it extracts logical form triples
in the form of subject-predicate-object.
Many of the approaches discussed above use lexical chains, WordNet and clustering methods to produce
abstractive summaries. Some of the methods provide a graph-based approach to generate
extractive summaries.
3. PROPOSED APPROACH
The idea is to summarize an input document by creating a semantic graph called a rich semantic
graph (RSG) for the original document, reducing the generated semantic graph, and finally
generating the abstractive summary from the reduced graph. The input to the
system is a single text document in English and the output is a reduced summary.
The proposed approach includes three phases: the rich semantic graph (RSG) creation phase, the
RSG reduction phase, and summary generation from the reduced RSG.
Figure 1. Proposed approach
First step is to pre-process the input document: for each word in the document, apply part-of-speech
tagging, named entity recognition and tokenization. Then, for each sentence in the input
document, a graph is created. Finally, the sentence RSG sub-graphs are merged together to
represent the whole document semantically. The final RSG of the entire document is reduced with the
help of some reduction rules, and the summary is generated from the reduced RSG.
Algorithm:
Input: Accepts a single document as input.
Output: Summarized document.
Accept the text document as input in English
for each sentence in the input document
for each word in the sentence
do tokenization
part-of-speech tagging (POS)
named entity recognition (NER)
Generate the graph for each sentence
for the entire document do
merge all sentence graphs to represent the whole document
reduce the graph using reduction rules
generate the summary from the reduced graph
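To make the flow concrete, the following Python sketch mirrors the algorithm above. It is an illustrative reading, not the authors' implementation: every function is a crude, hypothetical stand-in for the corresponding module in sections 3.1-3.3.

from typing import Dict, List

def preprocess(document: str) -> List[str]:
    # Stand-in for tokenization, POS tagging and NER (section 3.1.1):
    # here we only split the document into crude sentences.
    return [s.strip() for s in document.split('.') if s.strip()]

def build_sentence_graph(sentence: str) -> Dict[str, str]:
    # Stand-in for the per-sentence RSG sub-graph (section 3.1.2):
    # a naive subject-predicate-object split on word positions.
    words = sentence.split()
    return {'subject': words[0],
            'predicate': words[1] if len(words) > 1 else '',
            'object': ' '.join(words[2:])}

def reduce_and_generate(graphs: List[Dict[str, str]]) -> str:
    # Stand-in for graph reduction and text generation (sections 3.2-3.3):
    # duplicate subject-predicate pairs are merged, mimicking a reduction rule.
    seen, kept = set(), []
    for g in graphs:
        key = (g['subject'], g['predicate'])
        if key not in seen:
            seen.add(key)
            kept.append(' '.join([g['subject'], g['predicate'], g['object']]).strip())
    return '. '.join(kept) + '.'

doc = "Alice Mathew is a graduate student. Alice lives in Mumbai."
print(reduce_and_generate([build_sentence_graph(s) for s in preprocess(doc)]))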
3.1. Rich Semantic Graph (RSG) Creation Phase
This phase analyses the input text, detects sentences and generates tokens for the entire
document. For each word it generates POS tags and locates words in predefined categories
such as person name, location and organization. Then it builds a graph for each sentence and
later interconnects the rich semantic sub-graphs. Finally the sub-graphs are merged together to
represent the whole document semantically. The RSG creation phase involves the following tasks:
3.1.1. Pre-processing module
This phase accepts an input text document and generates pre-processed sentences. Initially the
entire text document is taken as input. The first step is to perform tokenization of the document. Once
tokens are generated, the next step is to identify the part-of-speech tag for every word or token and
assign parts of speech such as noun, verb, and adjective. These tags are useful for
generating the graph of the entire document. The next task is to perform named entity recognition to
identify the entities in the document. Pre-processing thus consists of three main processes: tokenization,
part-of-speech (POS) tagging and named entity recognition (NER) [1]. Once the tokens, POS tags and
named entities are ready, they are used for generating the graph of each sentence.
Steps for pre-processing module [1]:
1. Tokenization & Filtration: Accept the input document, detect sentences, generate the tokens
for the entire document and filter out the special characters.
2. Named Entity Recognition (NER): It locates atomic elements in predefined categories such as
location, person names, organization, etc. To perform this task we have used the Stanford NER tool
[15], which is freely available.
3. Part-of-speech tagging (POS): It parses the whole sentence to describe each word's syntactic
function and generates the POS tags for each word. To perform this task, the Stanford parser tool [12]
is used for implementation.
Algorithm for pre-processing module:
Input: Original input document.
Output: Tokens, POS tags & NER.
accept the single text document as input
generate tokens for entire document and store in a file
for each sentence, apply POS tagging and generate the POS tags for each word
in the sentence
for each sentence, locate the atomic elements in predefined categories such
as person, organization, etc., and identify the proper nouns
apply a sentence detection algorithm to generate the sentences in proper order.
Figure 2. Pre-processing module
This phase accepts an input document and filters out special characters and unwanted scripts other than
English. Then it generates tokens, named entity (NER) tags and part-of-speech (POS) tags
for all the sentences.
Tokenization & Filtration:
Tokenization is the process of breaking a stream of sentences into tokens. This is done by searching
for a space after each word. All the generated tokens are saved in a separate file for further
processing. In this phase the input text is also filtered so as to remove all special characters such as
*&^%$#@,.';+{}[]. Including special characters in further processing would only
degrade performance. Filtration of Devanagari script (Marathi, Hindi) is also done
to validate that the input document is in English only.
Algorithm:
1. Generate a list of all the possible special characters.
2. Compare each character of the input text file with the given list of special characters.
3. If a match is found, simply ignore the matched character.
4. If no match is found, the character is not a special character; copy it into
another file containing no special characters.
5. Repeat steps 2 to 4 till every character of the input text is processed.
6. Pass the generated file with no special characters to the next step
7. Stop.
Input:
!@$%#&*()&^^%%+_?><": प्रतापगडाची लढाई ही इतिहासािील महत्वाची Alice Mathew is a
graduate student. Alice lives in Mumbai. Bob John is a graduate student. Bob works in Mastek.
Output:
Alice
Mathew
is
a
graduate
student
Alice
lives
in
Mumbai
Bob
John
is
a
graduate
student
After tokenization & filtration, all special characters and non-English words are removed, and the
tokens are generated and saved in a separate file. A small Python sketch of this step follows.
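A minimal sketch of the filtration step, under the assumption that a simple regular-expression filter is acceptable: keeping only ASCII letters, digits and whitespace removes both the listed special characters and non-English scripts such as Devanagari in one pass.

import re

def filter_and_tokenize(text: str) -> list:
    # Keep only ASCII letters, digits and whitespace; everything else
    # (special characters, Devanagari and other scripts) becomes a space.
    cleaned = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    return cleaned.split()

sample = '!@$%#&*()&^^%%+_?><": Alice Mathew is a graduate student. Alice lives in Mumbai.'
print(filter_and_tokenize(sample))
# ['Alice', 'Mathew', 'is', 'a', 'graduate', 'student', 'Alice', 'lives', 'in', 'Mumbai']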
Named Entity Recognition (NER):
Named Entity Recognition (NER) labels sequences of words in a text which are the names of
things, such as person and company names, or organizations and locations. Good named entity
recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, and
LOCATION), are available. There are tools for performing these tasks, such as the Stanford
NER tool and the OpenNLP tool. Consider the following sentence and the expected named entity tags
identified using the Stanford NER tool [15]:
Input:
Alice Mathew is a graduate student. Alice lives in Mumbai. Bob John is a graduate student. Bob
works in Mastek.
Output:
Alice Mathew Person
Mumbai Location
Bob John Person
Mastek Organization
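For illustration, similar three-class tagging can be reproduced with NLTK's bundled chunker in place of the Stanford tool (an assumption of this sketch, not the paper's setup; note that NLTK labels geo-political locations GPE rather than LOCATION).

import nltk
# One-time data downloads: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words

text = "Alice Mathew is a graduate student. Alice lives in Mumbai."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
for subtree in tree.subtrees():
    if subtree.label() != 'S':                      # named-entity chunks only
        name = ' '.join(word for word, tag in subtree.leaves())
        print(name, subtree.label())
# Expected output along the lines of:
#   Alice Mathew PERSON
#   Mumbai GPE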
POS tagging
A Part-of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language
and assigns parts of speech to each word, such as noun, verb, adjective, etc. The OpenNLP POS
tagging tool and the Stanford parser tool [12] are available and can be used as plug-ins. The
OpenNLP POS Tagger uses a probability model to predict the correct POS tag out of the tag set.
The Penn Treebank POS tag set [13][14] is available and is used by many applications. The
proposed method uses the Stanford parser tool [12] for part-of-speech tagging.
Input:
Alice Mathew is a graduate student. Alice lives in Mumbai. Bob John is a graduate student. Bob
works in Mastek.
Output:
Alice_NNP Mathew_NNP is_VBZ a_DT graduate_NN student._NN Alice_NNP lives_VBZ
in_IN Mumbai_NNP. Bob_NNP John_NNP is_VBZ a_DT graduate_NN student._NN.
Bob_NNP works_VBZ in_IN Mastek_NNP.
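As a sketch, the same Penn Treebank tags can be produced with NLTK's tagger standing in for the Stanford parser used in the paper (an assumption for illustration only; the tag set [13][14] is the same).

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

sentence = "Alice Mathew is a graduate student."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(' '.join(f'{word}_{tag}' for word, tag in tagged))
# Expected output along the lines of:
# Alice_NNP Mathew_NNP is_VBZ a_DT graduate_NN student_NN ._.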
3.1.2. Rich Semantic sub-graph generation module
This module accepts the pre-processed sentences as input, generates a graph for each sentence,
and later merges the sub-graphs together to represent the entire document. For every sentence
a graph is generated. The semantic graph is generated with the help of the generated POS tags and
tokens, in the form of subject-predicate-object (SPO) triples, where nouns are coloured orange
and verbs red.
Input:
Alice Mathew is a graduate student. Alice lives in Mumbai.
Output:
Figure 3. Sentence graph
3.1.3. Rich Semantic Graph Generation module
Graph theory [6] can be applied for representing the structure of the text as well as the relationship
between sentences of the document. Sentences in the document are represented as nodes. The
edges between nodes are considered as connections between sentences. These connections are
related by similarity relation. By developing different similarity criteria, the similarity between two
sentences is calculated and each sentence is scored. Whenever a summary is to be produced, the
sentences with the highest scores are chosen for the summary. In graph ranking algorithms, the
importance of a vertex within the graph is iteratively computed from the entire graph. Therefore
the graphs can be generated for the entire document.
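As a sketch of this graph-ranking idea (in the spirit of TextRank, not the authors' RSG construction), sentences can be scored with PageRank over a word-overlap similarity graph; the networkx library is an assumption of this example.

import networkx as nx

sentences = ["Alice Mathew is a graduate student.",
             "Alice lives in Mumbai.",
             "Bob John is a graduate student."]

def overlap(a: str, b: str) -> int:
    # A crude similarity criterion: number of shared lowercase words.
    return len(set(a.lower().split()) & set(b.lower().split()))

G = nx.Graph()
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j and overlap(s1, s2) > 0:
            G.add_edge(i, j, weight=overlap(s1, s2))

scores = nx.pagerank(G, weight='weight')      # iterative vertex importance
best = max(scores, key=scores.get)
print(sentences[best])  # the highest-scored sentence is chosen for the summary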
The rich semantic graph generation module is responsible for generating the final rich semantic
graphs of the whole document. The semantic sub-graphs are merged together to form the final
RSG. The graph can be built by subject-predicate-object (SPO) triples from individual sentences
to create a semantic graph. It uses linguistic properties of the nodes in the triples to build semantic
graphs for both documents and corresponding summaries.
Extracting summary by semantic graph generation [10] is a method which uses subject-predicate-
object (SPO) triples from individual sentences to create a semantic graph of the original
document. Subjects, Objects, and Predicates are the main functional elements of sentences.
Identifying and exploiting links among them could facilitate the extraction of relevant text. This
method creates a semantic graph of a document based on logical form triples (subject-predicate-object,
SPO) and learns a relevant sub-graph that can be used for creating summaries.
The semantic graph is generated in two steps [10]:
1. Syntactic analysis of the text – apply deep syntactic analysis to the document sentences and extract logical form triples.
2. Graph construction – merge the resulting logical form triples into a semantic graph and analyse its properties; the nodes of the graph correspond to subjects, objects and predicates. A sketch of this merging step is given below.
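A minimal sketch of the merging step, again assuming networkx: nodes sharing a label (for example, a noun that occurs in several sentences) are unified when the per-sentence sub-graphs are composed.

import networkx as nx

def build_rsg(sentence_subgraphs):
    # Merge per-sentence SPO sub-graphs into one document-level RSG.
    # compose_all unifies nodes with identical labels, so repeated
    # subjects/objects become shared nodes linking the sentences.
    return nx.compose_all(sentence_subgraphs)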
Input: Pre-processed sentences (POS tags and NER).
Output: Rich Semantic Graph.
Consider the following input and the graph generated for it.
Input:
Rini lives in Mumbai. She works in Infosys. Nikita is pursuing a master's degree in computer engineering. Nikita is specialized in the machine learning field. Rini John is a graduate student who completed engineering in computer science. She is specialized in Web NLP. Rini is also pursuing post-graduation in computer science. Rini is a friend of Nikita. Ashish Mathew is also a friend of Nikita. Nikita Munot published two papers in international conferences under the guidance of Prof. Sharvari Govilkar. Rini John also published two papers in international conferences under the guidance of Prof. Sharvari Govilkar.
Output:
Figure 4. Rich semantic graph
The graphs are generated in subject-predicate-object form, where nouns and proper nouns are coloured orange and verbs red.
3.2. Rich Semantic Graph Reduction Phase
This phase reduces the generated rich semantic graph of the original document. A set of rules is applied to the generated RSG to reduce it by merging, deleting or consolidating graph nodes. Many rules can be derived based on several factors: the semantic relation, the graph node type (noun or verb), and the similarity or dissimilarity between graph nodes.
A few rules that can be applied to the graph nodes of two simple sentences are discussed below:
Sentence1 = [SN1, MV1, ON1]
Sentence2 = [SN2, MV2, ON2]
Each sentence is composed of three nodes: a Subject Noun (SN) node, a Main Verb (MV) node and an Object Noun (ON) node.
Input: Rich Semantic Graph (RSG) of the whole document.
Output: Reduced rich semantic graph (RSG).
Example reduction rules [1] (a Python sketch of Rule 1 follows the list):
Rule 1. IF SN1 is instance of noun N And
SN2 is instance of noun N And
MV1 is similar to MV2 And
ON1 is similar to ON2
THEN
Merge both MV1 and MV2 And
Merge both ON1 and ON2
Rule 2. IF SN1 is instance of subclass of noun N And
SN2 is instance of subclass of noun N And
{[MV11, ON11],..[MV1n, ON1n]} is similar to
{[MV21, ON21],..[MV2n, ON2n]}
THEN
Replace SN1 by N1 (instance N) And
Replace SN2 by N2 (instance N) And
Merge both N1 and N2
Rule 3. IF SN1 and SN2 are instance of noun N And
MV1 is instance of subclass of verb V And
MV2 is instance of subclass of verb V And
ON1 is similar to ON2
THEN
Replace MV1 by V1 (instance V) And
Replace MV2 by V2 (instance V) And
Merge both V1 and V2 And
Merge both ON1 and ON2
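As an illustration, here is a hedged Python sketch of Rule 1. Representing each sentence as an (SN, MV, ON) tuple and testing "is similar to" with WordNet [11] path similarity are assumptions made for the example; the paper does not fix a similarity measure.

from nltk.corpus import wordnet as wn

def similar(a, b, threshold=0.5):
    # Crude stand-in for "is similar to": WordNet path similarity
    # between the first synsets of the two words (an assumption).
    sa, sb = wn.synsets(a), wn.synsets(b)
    if not sa or not sb:
        return a == b
    return (sa[0].path_similarity(sb[0]) or 0.0) >= threshold

def apply_rule1(s1, s2):
    # Rule 1: both subjects are instances of the same noun
    # (approximated here as string equality), the main verbs are
    # similar, and the object nouns are similar -> merge the triples.
    sn1, mv1, on1 = s1
    sn2, mv2, on2 = s2
    if sn1 == sn2 and similar(mv1, mv2) and similar(on1, on2):
        return [(sn1, mv1, on1)]   # merged MV and ON nodes
    return [s1, s2]                # rule does not apply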
With the help of such rules the graph is reduced, and the final summary is then generated from the reduced graph. The system is to be trained, and more such rules are to be added to make it more robust.
3.3. Summarized Text Generation Phase
This phase aims to generate the abstractive summary from the reduced RSG. The sentences are merged with the help of rules, and the final summary is generated.
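A minimal sketch of this final step, under the simplifying assumption that each remaining (SN, MV, ON) triple in the reduced graph is verbalised as one declarative sentence; the actual system would apply richer generation rules.

def generate_summary(reduced_triples):
    # Verbalise each reduced (SN, MV, ON) triple as a sentence.
    return " ".join(f"{sn} {mv} {on}." for sn, mv, on in reduced_triples)

# e.g. a reduced graph containing a single merged triple:
print(generate_summary([("Rini", "lives in", "Mumbai")]))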
4. CONCLUSION
As natural language understanding improves, computers will be able to learn from information available online and apply what they have learned in the real world. Information condensation is therefore needed. Extractive summarization usually amounts to sentence extraction rather than true summarization, so the need is to generate summaries that capture the important text and relate sentences semantically. The work is applicable in open domains.
Abstractive summarization will serve as a tool for generating summaries that are semantically correct and contain fewer sentences. Extractive summarization relies on sentence selection driven by statistical methods, which are not always adequate. This paper proposes an idea: build a semantic graph of the original document, relate its content semantically, reduce the graph by applying several rules, and generate the summary from the reduced graph.
REFERENCES
[1] Ibrahim F. Moawad, Mostafa Aref, "Semantic Graph Reduction Approach for Abstractive Text Summarization", IEEE, 2012.
[2] Saeedeh Gholamrezazadeh, Mohsen Amini Salehi, Bahareh Gholamzadeh, "A Comprehensive Survey on Text Summarization Systems", in Proceedings of the 2nd International Conference on Computer Science and its Applications, 2009.
[3] Kedar Bellare, Anish Das Sharma, Atish Das Sharma, Navneet Loiwal, Pushpak Bhattacharyya, "Generic Text Summarization Using WordNet", Language Resources Engineering Conference (LREC 2004), Barcelona, May 2004.
[4] G. H. Silber, Kathleen F. McCoy, "Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization", Computational Linguistics 28(4): 487-496, 2002.
[5] R. Barzilay, M. Elhadad, "Using Lexical Chains for Text Summarization", in Proc. ACL/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997, pp. 10-17.
[6] J. Leskovec, M. Grobelnik, N. Milic-Frayling, "Extracting Summary Sentences Based on the Document Semantic Graph", Microsoft Research, 2005.
[7] ZHANG Pei-ying, LI Cun-he, "Automatic Text Summarization Based on Sentences Clustering and Extraction", 2009.
[8] Naresh Kumar, Shrish Verma, "A Frequent Term and Semantic Similarity Based Single Document Text Summarization Algorithm", 2011.
[9] I. Fathy, D. Fadl, M. Aref, "Rich Semantic Representation Based Approach for Text Generation", The 8th International Conference on Informatics and Systems (INFOS 2012), Egypt, 2012.
[10] J. Leskovec, M. Grobelnik, N. Milic-Frayling, "Learning Semantic Graph Mapping for Document Summarization", 2000.
[11] C. Fellbaum, "WordNet: An Electronic Lexical Database", MIT Press, 1998.
[12] Stanford Parser, http://nlp.stanford.edu:8080/parser/index.jsp, June 15, 2012.
[13] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, B. Webber, "The Penn Discourse Treebank 2.0", Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Morocco.
[14] https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
[15] http://nlp.stanford.edu/software/CRFNER.html
Authors
Nikita Munot received her B.E. degree in 2012 from Pillai's Institute of Information Technology, New Panvel, University of Mumbai, and is currently pursuing her M.E. from the University of Mumbai. She has three years of teaching experience and is presently working as a lecturer at Pillai's Institute of Information Technology. Her research areas include natural language processing and data mining. She has published one paper in an international journal.
Sharvari Govilkar is an Associate Professor in the Computer Engineering Department at PIIT, New Panvel, University of Mumbai, India. She received her M.E. in Computer Engineering from the University of Mumbai and is currently pursuing her Ph.D. in Information Technology from the University of Mumbai. She has seventeen years of teaching experience. Her areas of interest include text mining, natural language processing, information retrieval and compiler design. She has published many research papers in international and national journals and conferences.