Automatic text summarization is the process of reducing the content of a text while retaining the important points of the document. Generally, there are two approaches to automatic text summarization: extractive and abstractive. The process of extractive text summarization can be divided into two phases: pre-processing and processing. In this paper, we discuss some of the extractive text summarization approaches used by researchers and describe the features used in the extractive summarization process. We also present the available linguistic preprocessing tools, with their features, that are used for automatic text summarization, along with the tools and parameters useful for evaluating a generated summary. Moreover, we explain our proposed lexical chain analysis approach for extractive automatic text summarization, with sample generated lexical chains, and provide the evaluation results of our system-generated summary. The proposed lexical chain analysis approach can be applied to different text mining problems such as topic classification, sentiment analysis, and summarization.
Summarization using ntc approach based on keyword extraction for discussion f... - eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Following are the questions I tried to answer in this presentation:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are the business applications?
The first few slides explain extractive summarization with its pros and cons; the next section explains abstractive summarization. The last section highlights the business applications of each.
Extractive summarization extracts objects from the entire collection without modifying the objects themselves; the goal is to select whole sentences verbatim from the source.
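The selection step described above can be sketched with a simple frequency-based sentence scorer. This is a minimal illustration only, not any surveyed system's actual method; the stopword list and the averaging scheme are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "for", "on", "it", "that", "this", "with"}

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the average corpus frequency of its content
    words, then return the top-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Preserve original document order, as extractive systems typically do.
    return [s for s in sentences if s in top]
```

Because the sentences are copied verbatim, the output is guaranteed to be grammatical at the sentence level, which is the main practical advantage of the extractive approach.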
A Newly Proposed Technique for Summarizing the Abstractive Newspapers’ Articl... - mlaij
In this new era, where tremendous amounts of information are available on the internet, it is most important to provide improved mechanisms to extract information quickly and efficiently. It is very difficult for human beings to manually summarize large text documents. There is therefore a twofold problem: searching for relevant documents among the many available, and absorbing the relevant information from them. To solve these two problems, automatic text summarization is very much necessary. Text summarization is the process of identifying the most important, meaningful information in a document or set of related documents and compressing it into a shorter version that preserves its overall meaning. More specifically, Abstractive Text Summarization (ATS) is the task of constructing summary sentences by merging facts from different source sentences and condensing them into a shorter representation while preserving information content and overall meaning. This paper introduces a newly proposed deep-learning-based technique for abstractive summarization of newspaper articles.
Conceptual framework for abstractive text summarization - ijnlc
As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter, and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input and processes it by building a rich semantic graph, then reducing this graph to generate the final summary.
Text mining seeks to discover new, previously unknown, or hidden information by automatically extracting it from various written resources. Applying knowledge-discovery methods to unstructured text is known as Knowledge Discovery in Text, or text data mining, and is also called text mining. Most techniques used in text mining are founded on the statistical analysis of a term, either a word or a phrase. Different text mining algorithms have been used in previous work. For example, the single-link algorithm and Self-Organizing Maps (SOM) introduce an approach for visualizing high-dimensional data and are very useful tools for processing textual data based on projection methods. Genetic and sequential algorithms provide the capability for multiscale representation of datasets and are fast to compute, with less CPU time, based on the Isolet reduced subsets in unsupervised feature selection. We propose the Vector Space Model and a concept-based analysis algorithm to improve text clustering quality and achieve better clustering results. We believe the proposed algorithm behaves well in terms of robustness and stability with respect to the formation of the neural network.
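The Vector Space Model mentioned above represents each document as a vector of term weights. A minimal tf-idf sketch follows; the specific weighting scheme is an assumption for illustration, and the paper's concept-based analysis is richer than this.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Vector Space Model: map each document to a dict of tf-idf weights.
    Terms that appear in fewer documents get higher idf, so they weigh more."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))             # document frequency per term
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                        for t in tf})
    return vectors
```

Clustering then operates on these vectors, typically with a cosine distance, so that documents sharing rare discriminative terms end up close together.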
Presentation from my thesis defense on text summarization. It discusses existing state-of-the-art models along with the effectiveness of Abstract Meaning Representation (AMR) for text summarization, and shows how AMRs can be used with seq2seq models. We also discuss other techniques, such as Byte Pair Encoding (BPE), and their effectiveness for the task, and examine how data augmentation with POS tags and AMRs affects summarization with seq2seq learning.
Rhetorical Sentence Classification for Automatic Title Generation in Scientif... - TELKOMNIKA JOURNAL
In this paper, we propose work on rhetorical corpus construction and a sentence classification model experiment that could specifically be incorporated into an automatic paper-title generation task for scientific articles. Rhetorical classification is treated as sequence labeling. A rhetorical sentence classification model is useful in tasks that consider a document's discourse structure. We performed experiments using datasets from two domains: computer science (CS dataset) and chemistry (GaN dataset). We evaluated the models using 10-fold cross-validation (0.70-0.79 weighted average F-measure) as well as on-the-run (0.30-0.36 error rate at best). We argue that our models performed best when imbalanced data was handled with the SMOTE filter.
The project was developed as part of the IRE coursework at IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
A template based algorithm for automatic summarization and dialogue managemen... - eSAT Journals
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that helps in extracting important events such as dates, places, and subjects of interest. It would also be convenient if the methodology helped present users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms, developed by us, that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
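The cosine similarity listed among the keywords compares the term-frequency vectors of two texts; a minimal sketch of the measure:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts:
    1.0 for identical bags of words, 0.0 for no shared terms."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

In a template-based summarizer, a score like this can be used to match sentences against the templates or against each other to avoid selecting redundant content.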
We investigate one technique to produce a summary of an original text without requiring its full semantic interpretation, relying instead on a model of the topic progression in the text derived from lexical chains. We present a new algorithm to compute lexical chains in a text, merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger, a shallow parser for the identification of nominal groups, and a segmentation algorithm.
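The chaining idea can be sketched with a greedy algorithm. Here a toy relatedness table stands in for the WordNet relations (an assumption purely for illustration; the described algorithm uses the real thesaurus and disambiguation):

```python
# Toy relatedness table standing in for WordNet relations (an assumption).
RELATED = {("car", "vehicle"), ("vehicle", "truck"),
           ("bank", "money"), ("money", "loan")}

def related(a, b):
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def build_chains(nouns):
    """Greedy lexical chaining: attach each noun to the first chain that
    contains a related word, otherwise start a new chain."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, w) for w in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains
```

Long, dense chains signal the dominant topics of the text, and sentences contributing to them are good candidates for the extractive summary.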
Text mining is a technique that helps users find useful information in a large number of text documents on the web or in a database. Most popular text mining and classification methods have adopted term-based approaches, alongside pattern-based methods for describing user preferences. This review paper analyses how text mining works at three levels: sentence level, document level, and feature level. We review related prior work and demonstrate the problems that arise when text mining is performed at the feature level. The paper presents a technique for text mining on compound sentences.
Mining Opinion Features in Customer Reviews - IJCERT JOURNAL
Nowadays, e-commerce systems have become extremely important. Large numbers of customers choose online shopping for its convenience, reliability, and cost. Customer-generated information, especially product reviews, is a significant source of data for consumers making informed purchase choices and for manufacturers keeping track of customers' opinions. It is difficult for customers to make purchasing decisions based only on pictures and short product descriptions. Meanwhile, mining product reviews has become a hot research topic, and prior research is mostly based on pre-specified product features for analysing opinions. Natural Language Processing (NLP) toolkits such as NLTK for Python can be applied to raw customer reviews to extract keywords. This paper presents a survey of the techniques used for designing software to mine opinion features in reviews. Eleven IEEE papers are selected and compared; they are representative of the significant improvements in opinion mining over the past decade.
Seeds Affinity Propagation Based on Text Clustering - IJRES Journal
The objective is to find, among all partitions of the data set, the best partition according to some quality measure. Affinity propagation is a low-error, high-speed, flexible, and remarkably simple clustering algorithm that may be used in forming teams of participants for business simulations and experiential exercises, and in organizing participants' preferences for the parameters of simulations. This paper proposes an efficient affinity propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of our approach is (1) to prune unnecessary message exchanges during the iterations and (2) to compute the convergence values of pruned messages after the iterations to determine clusters.
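The standard affinity propagation message-passing updates (before any pruning) can be sketched as follows; the damping factor and the median-based preference are conventional choices, assumed here for illustration:

```python
def affinity_propagation(points, damping=0.7, iters=300):
    """Cluster scalar points by exchanging responsibility/availability
    messages; returns each point's exemplar index."""
    n = len(points)
    # Similarity: negative squared distance; shared preference = median.
    s = [[-(points[i] - points[k]) ** 2 for k in range(n)] for i in range(n)]
    off = sorted(s[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        s[i][i] = off[len(off) // 2]
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        for i in range(n):  # responsibility: r(i,k) = s(i,k) - max_{k'!=k}(a+s)
            vals = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                best = max(v for kk, v in enumerate(vals) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best)
        for k in range(n):  # availability: pooled positive responsibilities
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]
                else:
                    new = min(0.0, r[k][k] + total - pos[i] - pos[k])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    # Each point's exemplar maximizes a(i,k) + r(i,k).
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]
```

The pruning proposed in the paper skips message exchanges that cannot change the outcome; the sketch above performs every exchange, which is exactly the cost the paper aims to cut.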
Advantages of Query Biased Summaries in Information Retrieval - Onur Yılmaz
Presentation of the paper:
Advantages of query biased summaries in information retrieval (1998)
Anastasios Tombros and Mark Sanderson.
In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '98).
ACM, New York, NY, USA, 2-10.
DOI=10.1145/290941.290947
A hybrid approach for text summarization using semantic latent Dirichlet allo... - IJECEIAES
Automatic text summarization generates a summary that contains sentences reflecting the essential and relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach for generating text summaries that combines extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As we increase the number of concepts and the number of words in the summary, the ROUGE precision and F1 scores of our proposed model improve.
Current multi-document summarization systems can successfully extract summary sentences, but with many limitations, including low coverage, inaccurate extraction of important sentences, redundancy, and poor coherence among the selected sentences. The present study introduces a new concept of the centroid approach and reports new techniques for extracting summary sentences from multiple documents. In both techniques, keyphrases are used to weight sentences and documents. The first summarization technique (Sen-Rich) prefers sentences of maximum richness, while the second (Doc-Rich) prefers sentences from the centroid document. To demonstrate the new summarization system's application to extracting summaries of Arabic documents, we performed two experiments. First, we applied the ROUGE measure to compare the new techniques with systems presented at TAC 2011; the results show that Sen-Rich outperformed all systems in ROUGE-S. Second, the system was applied to summarize multi-topic documents. Using human evaluators, the results show that Doc-Rich is superior, with summary sentences characterized by greater coverage and more cohesion.
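The ROUGE comparison mentioned above is based on n-gram overlap between a system summary and a reference. A minimal ROUGE-1 sketch (the full toolkit also handles stemming, stopword removal, and multiple references):

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram overlap between candidate and reference summaries,
    returned as (recall, precision, F1)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())          # clipped unigram matches
    recall = overlap / max(sum(r.values()), 1)
    precision = overlap / max(sum(c.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

ROUGE-S, used in the comparison, generalizes this to skip-bigrams (ordered word pairs with gaps), but the recall/precision framing is the same.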
Answer extraction and passage retrieval for... - Waheeb Ahmed
Question Answering systems (QASs) perform the task of retrieving text portions, from a collection of documents, that contain the answer to the user's question. These QASs use a variety of linguistic tools that can deal with small fragments of text. Therefore, to retrieve the documents containing the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purpose of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then, we compute the similarity score between the user's question terms and each passage. The top five passages (those with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2%, and F1-measure of 87%.
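The passage-scoring step can be sketched as a term-overlap score between the question and each passage. This is a deliberate simplification of the similarity measure described, shown only to make the ranking step concrete:

```python
def passage_scores(question, passages):
    """Score each passage by the fraction of its terms that appear in the
    question, then rank passages best-first (a simplified similarity)."""
    q = set(question.lower().split())
    scored = [(sum(1 for t in p.lower().split() if t in q)
               / max(len(p.split()), 1), idx)
              for idx, p in enumerate(passages)]
    return sorted(scored, reverse=True)
```

In the described pipeline, the top five entries of this ranking would be handed to the answer-extraction stage.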
AN OVERVIEW OF EXTRACTIVE BASED AUTOMATIC TEXT SUMMARIZATION SYSTEMS - ijcsit
The availability of online information shows the need for an efficient text summarization system. Text summarization systems follow extractive and abstractive methods. In extractive summarization, the important sentences are selected from the original text on the basis of sentence ranking methods. An abstractive summarization system understands the main concepts of a text and predicts the overall idea of the topic. This paper mainly concentrates on a survey of existing extractive text summarization models. Numerous algorithms are studied and their evaluations explained. The main purpose is to observe the peculiarities of existing extractive summarization models and to find a good approach that helps build a new text summarization system.
An automatic text summarization using lexical cohesion and correlation of sen... - eSAT Publishing House
Efficient multi-document summary generation using neural network - INFOGAIN PUBLICATION
Over the last few years, online information has grown tremendously on the World Wide Web and on users' desktops, and it has therefore gained much more attention in the field of automatic text summarization. Text mining has become a significant research field, as it produces valuable data from large amounts of unstructured text. Summarization systems make it possible to search the important keywords of a text, so the consumer spends less time reading the whole document. The main objective of a summarization system is to generate a new form that expresses the key meaning of the contained text. This paper studies various existing techniques alongside the need for novel multi-document summarization schemes, motivated by the arising need to provide a high-quality summary in a very short period of time. In the proposed system, users can quickly and easily access well-formed summaries that express the key meaning of the contained text. The primary focus lies with the f_β-optimal merge function, a recently presented function that uses the weighted harmonic mean to find a balance between precision and recall. The proposed system utilizes bisecting K-means clustering to improve runtime and neural networks to improve the accuracy of the summary generated by the NEWSUM algorithm.
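The f_β function referenced above is the weighted harmonic mean of precision and recall; a direct sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 weights
    recall more heavily, beta < 1 weights precision more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to the familiar F1 score; an f_β-optimal merge function chooses merges that maximize this trade-off for a given beta.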
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A... - idescitation
The Genetic Algorithm (GA) has been a successful method for extracting keywords. This paper presents a complete method by which keywords can be derived from various corpora. We have built equations that exploit the structure of the documents from which the keywords are to be extracted. The procedure is broken into two distinct profiles: one weighs the words across the whole document content, and the other explores the possible occurrences of key terms using a genetic algorithm. The basic equations of the heuristic mechanism are varied to allow complete exploitation of the document. The genetic algorithm and the enhanced standard-deviation method are used to their full potential to enable generation of the key terms that describe the given text document. The new technique has enhanced performance and better time complexity.
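GA-based keyword selection can be sketched as evolving bit-masks over a weighted vocabulary. The fitness function, penalty weight, and GA parameters below are assumptions for illustration, not the paper's equations:

```python
import random

def ga_keywords(weights, k=3, pop_size=20, generations=60, seed=1):
    """Evolve bit-masks over the vocabulary; fitness rewards the total
    weight of the chosen words and penalises wrong-size selections."""
    random.seed(seed)
    vocab = list(weights)
    n = len(vocab)

    def fitness(mask):
        chosen = [w for w, bit in zip(vocab, mask) if bit]
        return sum(weights[w] for w in chosen) - 5 * abs(len(chosen) - k)

    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, n)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.2:             # bit-flip mutation
                child[random.randrange(n)] ^= 1
            children.append(child)
        pop = survivors + children
    best = max(pop, key=fitness)
    return sorted((w for w, bit in zip(vocab, best) if bit),
                  key=weights.get, reverse=True)
```

In the paper's setting the word weights would come from the document-structure equations and the weighted standard deviation, rather than being supplied directly.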
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... - ijdmtaiir
In this study, a comprehensive evaluation of two supervised feature-selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms like LSI and PCA. The results show that clustering features improves the accuracy of document classification.
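The fuzzy C-means method compared in the study alternates between membership and center updates. A minimal one-dimensional sketch follows; the deterministic initialization strategy is an assumption for illustration:

```python
def fcm(points, c=2, m=2.0, iters=100):
    """Fuzzy C-means on scalar data: alternate soft-membership and
    center updates (assumes c >= 2)."""
    pts = sorted(points)
    # Deterministic init: spread the initial centers across the data range.
    centers = [pts[i * (len(pts) - 1) // (c - 1)] for i in range(c)]
    for _ in range(iters):
        u = []
        for x in points:
            d = [abs(x - cj) + 1e-9 for cj in centers]
            # Membership: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1))
            u.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c)) for j in range(c)])
        # Centers: mean of the points weighted by u^m.
        centers = [sum(u[i][j] ** m * points[i] for i in range(len(points)))
                   / sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centers, u
```

For feature clustering, the "points" would be feature vectors rather than scalars, with a Euclidean distance in place of the absolute difference, but the update equations are unchanged.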
Arabic text categorization algorithm using vector evaluation method - ijcsit
Text categorization is the process of grouping documents into categories based on their contents. This process is important for making information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve classification accuracy. Although Arabic text categorization is a new and promising field, there is little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the tested document's words are then calculated to determine the document's keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
Due to the exponential growth in the generation of textual data, the need for tools and mechanisms for automatic summarization of documents has become critical. Text documents are vital to any organization's day-to-day working, and long documents often hamper routine work. An automatic summarizer is therefore vital for reducing human effort. Text summarization is an important activity in the analysis of high volumes of text documents and is currently a major research topic in Natural Language Processing. It is the process of generating a summary of an input text by extracting the representative sentences from it. In this project, we present a novel technique for generating summaries of domain-specific text using Semantic Analysis, which is a subset of Natural Language Processing.
Automatic summarization is the process by which the information in a source text is expressed in a more concise fashion, with a minimal loss of information.
Abstractive summaries produce generated text from the important parts of the documents; extractive summaries identify important sections of the text and use them in the summary as they are. Single-document summaries represent a single document. Multi-document summaries are produced from multiple documents and must deal with three major problems: recognizing and coping with redundancy; identifying important differences among documents; and ensuring summary coherence. Generic summaries present the main topics of a given text in a concise manner; query-based summaries are constructed as an answer to an information need expressed by a user's query. Indicative summaries point to information in the document, helping the user decide whether the document should be read or not; informative summaries provide all the relevant information needed to represent the original document.
ROUGE is based on MT evaluation. In this approach, human-made summaries are compared with automatic summaries based on n-gram co-occurrence statistics. Gisting: "the choicest or most essential or most vital part of some idea or experience". The product of machine translation is sometimes called a "gisting translation". MT will often produce only a rough translation that at best allows the reader to "get the gist" of the source text, but is unlikely to convey a complete understanding of it. To evaluate system performance, NIST assessors who created the "ideal" written summaries did pairwise comparisons of their summaries to the system-generated summaries, other assessors' summaries, and baseline summaries. They used the Summary Evaluation Environment. DUC evaluation: sets of documents and their human-made summaries are provided, along with sets of unseen documents; the ideal summary is created, followed by pairwise comparison of the summaries. Recall at different compression ratios has been used in summarization research to measure how well an automatic system retains the important content of the original documents. However, the simple sentence recall measure cannot differentiate system performance appropriately. Instead of a pure sentence recall score, a coverage score C is used.
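The n-gram co-occurrence idea behind ROUGE can be sketched in a few lines. This is a minimal illustration of ROUGE-N recall, not the official ROUGE toolkit; the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Fraction of the reference's n-grams also found in the candidate
    (clipped counts), i.e. n-gram co-occurrence recall."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat was on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams match
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams match
```

At the system level, ROUGE averages such scores over many summary pairs; the clipping of counts prevents a candidate from being rewarded for repeating the same n-gram.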
RST - a text has a kind of unity that arbitrary collections of sentences generally lack. RST offers an explanation of the coherence of texts. For every part of a coherent text, there is some function, some plausible reason for its presence, evident to readers. RST is intended to describe texts. It posits a range of possibilities of structure -- various sorts of "building blocks" which can be observed to occur in texts. These "blocks" are at two levels, the principal one dealing with "nuclearity" and "relations" (often called coherence relations in the linguistic literature).
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. An HMM can be considered the simplest dynamic Bayesian network. In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible output tokens.
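A toy illustration of the above: the forward algorithm computes the probability of an output sequence by summing over all hidden state paths. The states, outputs, and probabilities below are invented for the sketch.

```python
# Toy HMM: hidden weather states emit visible activities. All parameters
# are made up for illustration.
states = ["Rainy", "Sunny"]
start = {"Rainy": 0.6, "Sunny": 0.4}                        # initial state probs
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},             # state transitions
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # per-state output
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}  # distributions

def forward_likelihood(observations):
    """P(observations) under the HMM, via the forward dynamic program."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward_likelihood(["walk", "shop"]))
```

Because the state is hidden, the likelihood must marginalize over every possible state sequence; the forward recursion does this in time linear in the sequence length rather than exponential.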
Redundancy removal identifies any information repetition in the source (input) texts, thus minimising any redundant or repetitive content in the final summary. Definition of basic elements: (a) the head of a major syntactic constituent, expressed as a single item; (b) a relation between a head basic element and a single dependent, expressed as a triple (head | modifier | relation). Basic elements can be created by using a parser to produce a syntactic parse tree and a set of 'cutting rules' to extract just the valid basic elements from the tree. With basic elements represented as triples, one can quite easily decide whether any two units match. The query-based basic elements summarizer includes four major stages: (a) query interpretation; (b) identifying important basic elements; (c) identifying important sentences; (d) generating summaries.
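The triple matching described above can be sketched directly: with basic elements as (head | modifier | relation) tuples, deciding whether two units match reduces to set intersection. The triples below are hand-written stand-ins for what a parser plus cutting rules would emit.

```python
# Basic elements as (head, modifier, relation) triples; invented examples,
# not real parser output.
def be_overlap(unit_a, unit_b):
    """Fraction of unit_a's basic elements that also appear in unit_b."""
    a, b = set(unit_a), set(unit_b)
    return len(a & b) / len(a) if a else 0.0

sent1 = [("lion", "african", "mod"), ("hunt", "lion", "subj"), ("hunt", "night", "time")]
sent2 = [("lion", "african", "mod"), ("sleep", "lion", "subj")]
print(be_overlap(sent1, sent2))  # 1 of 3 triples match
```

Such an overlap score could feed both stage (b), scoring basic elements against the query, and redundancy removal, by discarding sentences whose triples are already covered by the summary.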
Capturing Sentence Prior for Query-Based Multi-Document Summarization achieves the generation of a fixed-length multi-document summary that satisfies a specific information need through topic-oriented, informative multi-document summarization. Information retrieval techniques have been explored to improve the relevance scoring of sentences towards the information need. A measure has been identified to capture the notion of importance, or prior, of a sentence. Following the Probability Ranking Principle, the calculated importance/prior is incorporated into the final sentence scoring by a weighted linear combination. The system outperformed all systems at the DUC 2006 challenge in terms of ROUGE scores, with a significant margin over the next best system. The information need, or topic, consists of two main components: the first is the title of the topic; the second is the actual information need, expressed as multiple questions. In this approach, information retrieval techniques have been combined with summarization techniques in producing the extracts. The system consists of the following stages: information need enrichment, content selection, and summary generation. Using conditional sampling, under the assumption that the query words are independent of each other while keeping their dependencies on w intact, the required joint probability is computed. Most current query-based summarization systems concentrate only on features that measure the relevance of sentences towards the query; they do not explicitly attempt to capture the centrality/prior knowledge carried by a sentence pertaining to a domain. The approach defines a new measure which captures sentence importance based on the distribution of its constituent words in the domain corpus. An entropy measure is used to compute the information content of a sentence based on a unigram model learned from the document corpus.
FastSum is based solely on word-frequency features of clusters, documents and topics. Summary sentences are ranked by a regression SVM. The summarizer does not use any expensive NLP techniques. Thanks to a detailed feature analysis using Least Angle Regression, FastSum can rely on a minimal set of features, leading to fast processing times, e.g. 1250 news documents per 60 seconds. The method only involves sentence splitting, filtering candidate sentences, and computing the word frequencies in the documents of a cluster, the topic description, and the topic title. A machine learning technique called regression SVM is used, and for feature selection a model selection technique called Least Angle Regression (LARS) is adopted. The focus is on selecting the minimal set of features that are computationally less expensive than other features. The approach ranks all sentences in the topic cluster for summarizability. Features are mainly based on frequencies of words in the clusters, documents and topics. A cluster contains 25 documents and is associated with a topic. The topic contains a topic title and a topic description: the topic title is a list of key words or phrases describing the topic, while the topic description contains the actual query or queries. The features used are word-based and sentence-based. Word-based features are computed from the probability of words in the different containers. Sentence-based features include the length and position of the sentence in the document.
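A minimal sketch of FastSum-style scoring, assuming invented documents and a hand-picked stopword list: sentences are ranked by the average cluster frequency of their content words plus a small position bonus. The exact features, weights, regression SVM, and LARS selection of FastSum are not reproduced here.

```python
from collections import Counter

# Made-up two-document cluster; tokens are pre-separated for simplicity.
STOP = {"the", "a", "of", "in", "and", "is"}
docs = [
    "the flood damaged homes in the valley . rescue teams arrived quickly .",
    "flood waters rose in the valley overnight . officials urged evacuation .",
]

cluster_counts = Counter(w for d in docs for w in d.split()
                         if w.isalpha() and w not in STOP)
cluster_total = sum(cluster_counts.values())

def score(sentence, position):
    """Average cluster probability of content words, plus a position bonus."""
    words = [w for w in sentence.split() if w.isalpha() and w not in STOP]
    if not words:
        return 0.0
    freq = sum(cluster_counts[w] / cluster_total for w in words) / len(words)
    return freq + 0.05 / (position + 1)   # earlier sentences get a small bonus

sentences = [s.strip() for d in docs for s in d.split(".") if s.strip()]
ranked = sorted(sentences, key=lambda s: -score(s, sentences.index(s)))
print(ranked[0])
```

Even this crude frequency-plus-position scoring tends to surface sentences about the cluster's dominant topic, which is the intuition FastSum exploits while staying cheap to compute.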
A query-based medical information summarization system using ontology knowledge proposes a technique using UMLS (Unified Medical Language System) and an ontology from the National Library of Medicine. The ontology-based approach performs clearly better than the keyword-only approach. A general web search engine tries to serve as an information access agent: it retrieves and ranks information according to a user's query. A document summarization system is presented, specialized for the medical domain, which retrieves and summarizes up-to-date medical information from trustworthy online sources according to users' queries. The summaries a user wants need to be generated on the fly, based on the query keywords, in a web context. The summarization algorithm is term-based, i.e. only terms defined in UMLS are recognized and processed. The summarization procedure is as follows: (a) revise the query with UMLS ontology knowledge; (b) calculate the distance of each sentence in the document to the finalized query, where the distance function is a metric satisfying d(x,x)=0, symmetry, and the triangle inequality, and a sentence whose distance is smaller than the threshold becomes a candidate for inclusion in the summary; (c) calculate pair-wise distances among the candidate sentences, then divide the candidate sentences into groups based on a threshold and select the highest-ranked one from each group. Once it is determined which sentences will be included in the summary, three different "scores" are generated: (a) simply count the number of matched original keywords and select the sentences with many matching keywords; (b) if a sentence contains an original keyword, assign weight 1 to it, and if it contains an expanded keyword, assign weight 0.5 to that keyword; add all the weights together to get the score for each sentence, and select sentences with high scores; (c) after the scores are obtained, normalize each score by the length of the sentence, and select sentences with high normalized scores.
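The three scoring schemes can be sketched directly. The keyword sets and the example sentence below are invented stand-ins, not actual UMLS terms or ontology expansions.

```python
# Hypothetical query keywords: "original" from the user's query, "expanded"
# standing in for ontology-derived expansions.
original = {"diabetes", "insulin"}
expanded = {"glucose", "hyperglycemia"}

def score_count(sentence):
    """(a) Number of matched original keywords."""
    words = set(sentence.lower().split())
    return len(words & original)

def score_weighted(sentence):
    """(b) Weight 1 per original keyword, 0.5 per expanded keyword."""
    words = set(sentence.lower().split())
    return len(words & original) + 0.5 * len(words & expanded)

def score_normalized(sentence):
    """(c) Weighted score divided by sentence length."""
    n = len(sentence.split())
    return score_weighted(sentence) / n if n else 0.0

s = "Insulin therapy lowers glucose levels in diabetes patients"
print(score_count(s), score_weighted(s), score_normalized(s))
```

Length normalization in (c) keeps long sentences from winning merely by mentioning many terms, which matters when summary length is fixed.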
Each summary has to address a set of complex questions about the target, where the question cannot be answered simply with a named entity. The input to the summarization task comprises a target, some opinion-related questions about the target, and a set of documents that contain answers to the questions. The output is a summary for each target that summarizes the answers to the questions. It has been discovered that users have a strong preference for summarizers that model sentiment over non-sentiment baselines. A filtering component identifies sentences that are unlikely to be in a good summary; another filter is concerned with the sentiment of a sentence. The system performs the following steps: A. Preprocessing; B. Question sentiment and target analyzer; C. Filtering (C1 Sentiment tagger, C2 Target overlap); D. Feature extraction; E. Sentence ranker; F. Redundancy removal. Several preprocessing steps take place before web-based blog entries are introduced to the FastSum engine. These include translating the original legal opinion topics into queries and identifying any target entities or concepts within those queries, running the queries through the blog search engine and aggregating the top-ranked results, and passing those results through a "marginal relevance filter" in order to ensure that the entries serving as FastSum input data surpass a minimum relevance criterion.