This document outlines Quinsulon Israel's Ph.D. dissertation defense on using semantic analysis to improve multi-document summarization. The dissertation examines clustering semantic triples and scoring sentences by semantic class to generate summaries. It reviews prior work on statistical, feature-combination, graph-based, multi-level text-relationship, and semantic-analysis approaches. The dissertation aims to improve on the baseline method and to evaluate the effects of semantic analysis on focused multi-document summarization performance.
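Since the defense centers on clustering semantic triples, the sketch below illustrates the general idea under strong simplifying assumptions: triples are assumed to be already extracted, and "clustering" is reduced to grouping by shared subject. The data and function names are illustrative; this is not the dissertation's algorithm.

```python
# Illustrative sketch: cluster (subject, predicate, object) triples by
# shared subject, then score each sentence by the sizes of the clusters
# its triples fall into. Triple extraction itself is assumed done.
from collections import defaultdict

def cluster_triples(triples):
    """Naive clustering: the cluster key is simply the triple's subject."""
    clusters = defaultdict(list)
    for t in triples:
        clusters[t[0]].append(t)
    return clusters

def score_sentences(sentence_triples):
    """sentence_triples: one list of triples per sentence."""
    all_triples = [t for ts in sentence_triples for t in ts]
    clusters = cluster_triples(all_triples)
    return [sum(len(clusters[t[0]]) for t in ts) for ts in sentence_triples]

sentence_triples = [
    [("vaccine", "prevents", "disease")],
    [("vaccine", "requires", "approval")],
    [("weather", "was", "mild")],
]
print(score_sentences(sentence_triples))
```

Sentences whose triples land in larger clusters (here, the two "vaccine" sentences) score higher, which is the intuition behind preferring content that many documents repeat.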
Following are the questions I tried to answer in this PPT:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are its business applications?
The first few slides explain extractive summarization with its pros and cons; the next section explains abstractive summarization. The last section highlights the business applications of each.
Automatic text summarization is the process of reducing the text content while retaining the important points of the document. Generally, there are two approaches to automatic text summarization: extractive and abstractive. The process of extractive text summarization can be divided into two phases: pre-processing and processing. In this paper, we discuss some of the extractive text summarization approaches used by researchers and describe the features used in the extractive summarization process. We also present the available linguistic preprocessing tools, with their features, that are used for automatic text summarization, and discuss the tools and parameters useful for evaluating the generated summary. Moreover, we explain our proposed lexical chain analysis approach for extractive automatic text summarization, with sample generated lexical chains, and provide the evaluation results of our system-generated summary. The proposed lexical chain analysis approach can also be applied to other text mining problems, such as topic classification and sentiment analysis.
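As a rough illustration of the lexical-chain idea described above, the sketch below groups words into chains via a tiny hand-coded relatedness table (standing in for the WordNet lookups a real system would use) and scores sentences by the strength of the chains they touch. All data and names here are illustrative, not the paper's implementation.

```python
from collections import defaultdict

# Toy relatedness table standing in for WordNet relations
# (identity, synonymy, hypernymy) used by real lexical-chain systems.
RELATED = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "engine": "vehicle", "road": "travel", "trip": "travel",
}

def build_chains(sentences):
    """Group content words into lexical chains keyed by a shared concept."""
    chains = defaultdict(list)
    for idx, sent in enumerate(sentences):
        for word in sent.lower().split():
            word = word.strip(".,")
            concept = RELATED.get(word)
            if concept:
                chains[concept].append((idx, word))
    return chains

def summarize(sentences, n=1):
    """Score each sentence by the lengths of the chains its words belong
    to, then return the top-n sentences in original order (extractive)."""
    chains = build_chains(sentences)
    scores = [0] * len(sentences)
    for members in chains.values():
        for idx, _ in members:
            scores[idx] += len(members)  # chain strength = chain length
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(ranked)]

sentences = [
    "The car has a powerful engine.",
    "Weather was pleasant yesterday.",
    "The truck followed the automobile down the road.",
]
print(summarize(sentences, n=1))
```

The sentence touching the most and strongest chains wins, which mirrors the intuition that strong lexical chains mark a document's central topic.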
The project was developed as part of the IRE coursework at IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
A presentation from my thesis defense on text summarization. It discusses existing state-of-the-art models along with the effectiveness of Abstract Meaning Representation (AMR) for text summarization, and shows how AMRs can be used with seq2seq models. We also discuss other techniques, such as Byte Pair Encoding (BPE), and their effectiveness for the task, and examine how data augmentation with POS tags and AMRs affects summarization with seq2seq learning.
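The BPE merge-learning loop mentioned above can be sketched as follows. This is a toy re-implementation of the Sennrich-style procedure on a hypothetical three-word vocabulary, not the thesis code; a production tokenizer also handles symbol-boundary edge cases this sketch ignores.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the (word -> frequency) vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged = " ".join(pair)
    joined = "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Learn `num_merges` merge operations from a character-split vocab."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Words are pre-split into characters with an end-of-word marker `</w>`.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "l o w e s t </w>": 3}
merges, vocab = learn_bpe(vocab, 2)
print(merges)
```

After two merges the frequent stem "low" becomes a single subword unit, which is exactly why BPE shrinks the open vocabulary that seq2seq summarizers must generate over.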
Extractive summarization extracts objects from the entire collection without modifying the objects themselves; the goal is to select whole sentences verbatim.
Abstractive text summarization is nowadays one of the most important research topics in NLP. However, gaining a deep understanding of what it is and how it works requires a series of foundational pieces of knowledge that build on top of each other. For this reason, this presentation gives the audience an overview of sequence-to-sequence models and the various versions of attention developed over the past few years. In addition, natural language generation (NLG) is reviewed with a focus on decoder techniques and their associated problems, as a supporting factor in the success of automatic summarization. Finally, abstractive text summarization itself is presented, along with potential approaches to open issues from the latest research papers.
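The attention mechanism this overview centers on can be illustrated with a dependency-free sketch of scaled dot-product attention for a single decoder step. The vectors are toy values, not a trained model; real implementations operate on batched tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(query, keys, values):
    """One decoder step of scaled dot-product attention: score each
    encoder state against the query, softmax the scores, and return
    the weighted sum of values (the context vector)."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# One decoder query attending over three encoder states.
keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
weights, context = attention(query, keys, values)
print(weights)
```

The weights form a distribution over encoder states, so the decoder can "look back" at the most relevant source positions at each generation step.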
MyAssignmentHelp is a premier service provider for NLP-related assignments and projects. The given PPT describes the processes involved in NLP programming, so whenever you need help with any work related to natural language processing, feel free to get in touch with us.
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
( **Natural Language Processing Using Python: - https://www.edureka.co/python-natural... ** )
This PPT will provide you with detailed and comprehensive knowledge of two important aspects of Natural Language Processing, i.e., Stemming and Lemmatization. It will also explain the differences between the two, with a demo of each. Following are the topics covered in this PPT:
Introduction to Big Data
What is Text Mining?
What is NLP?
Introduction to Stemming
Introduction to Lemmatization
Applications of Stemming & Lemmatization
Difference between Stemming & Lemmatization
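The stemming-versus-lemmatization difference in the topics above can be shown with a dependency-free sketch. The crude suffix-stripper and the tiny lemma table are illustrative stand-ins for Porter's algorithm and a WordNet-backed lemmatizer (e.g., NLTK's PorterStemmer and WordNetLemmatizer), which the tutorial itself uses.

```python
def crude_stem(word):
    """A crude suffix-stripping stemmer: chop common suffixes, with no
    guarantee the result is a dictionary word (real stemmers like
    Porter's algorithm apply ordered, conditioned rules)."""
    for suffix in ("ing", "ies", "ied", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization maps a word to its dictionary form; unlike stemming it
# needs lexical knowledge, modeled here by a tiny lookup table.
LEMMAS = {"studies": "study", "better": "good", "went": "go", "caring": "care"}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("studies"), lemmatize("studies"))
```

On "studies", the stemmer produces the non-word "stud" while the lemmatizer returns the dictionary form "study"; that gap is the core difference the slides explain.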
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Natural Language Processing is a subfield of Artificial Intelligence and linguistics devoted to making computers understand statements or words written by humans.
In this seminar, we discuss its issues, how it works, and more.
A review of Natural Language Processing tasks and examples of why NLP is so hard, followed by a detailed discussion of text categorization, particularly sentiment analysis. A few common approaches to predicting sentiment are discussed, going further to explain the underlying statistical machine learning algorithms.
Introduction to Natural Language ProcessingPranav Gupta
The presentation gives a gist of the major tasks and challenges involved in natural language processing. The second part discusses one technique each for Part-of-Speech Tagging and Automatic Text Summarization.
Methods for Sentiment Analysis: A Literature Studyvivatechijri
Sentiment analysis is a trending topic, as everyone has an opinion on everything. The systematic study of these opinions can lead to information that may prove valuable for many companies and industries in the future. A huge number of users are online and share their opinions and comments regularly; this information can be mined and used efficiently. Companies can review their own products using sentiment analysis and make the necessary changes. The data is huge, and thus efficient processing is required to collect and analyze it to produce the required results.
In this paper, we discuss the various methods used for sentiment analysis. The paper covers various techniques used for sentiment analysis, such as the lexicon-based approach, SVMs [10], convolutional neural networks, the morphological sentence pattern model [1], and the IML algorithm. It presents studies on various data sets, such as the Twitter API, Weibo, movie reviews, IMDb, a Chinese micro-blog database [9], and more, and reports the accuracy results obtained by all the systems.
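Of the techniques surveyed above, the lexicon-based approach is the simplest to illustrate. The sketch below sums word valences from a tiny hand-built lexicon with naive negation flipping; the lexicon and scores are invented for illustration (real systems draw on resources such as SentiWordNet).

```python
# Minimal lexicon-based polarity scorer. The lexicon is a toy;
# production systems use curated resources like SentiWordNet.
LEXICON = {"good": 1, "great": 2, "excellent": 2,
           "bad": -1, "terrible": -2, "poor": -1}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Sum word valences, flipping the word right after a negation."""
    score, negate = 0, False
    for token in text.lower().split():
        token = token.strip(".,!?")
        if token in NEGATIONS:
            negate = True          # flip only the next sentiment word
            continue
        if token in LEXICON:
            value = LEXICON[token]
            score += -value if negate else value
        negate = False
    return score

print(polarity("The movie was not good, the acting was terrible."))
```

The single-token negation window is deliberately crude; handling scope ("not very good") is one of the reasons the surveyed learning-based methods outperform plain lexicons.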
Advantages of Query Biased Summaries in Information RetrievalOnur Yılmaz
Presentation of the paper:
Advantages of query biased summaries in information retrieval (1998)
Anastasios Tombros and Mark Sanderson.
In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '98).
ACM, New York, NY, USA, 2-10.
DOI=10.1145/290941.290947
Query Answering Approach Based on Document SummarizationIJMER
The growth of online information has made thorough research in automatic text summarization necessary within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel, language-independent automatic summarization approach that combines three main approaches: Rhetorical Structure Theory (RST), the query-processing approach, and the Network Representation Approach (NRA). RST, as a theory of the structure of natural text, is used to extract the semantic relations behind the text. The query-processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer which not only responds to the question but also gives the user an opportunity to find additional information related to it. We implemented the proposed approach; as a case study, it was applied to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query. The results have been evaluated using Recall, Precision, and F-score measures.
A new approach based on the detection of opinion by sentiwordnet for automati...csandit
In this paper, we propose a new approach based on the detection of opinion by SentiWordNet for the production of text summaries, using a scoring-extraction technique adapted to opinion detection. The texts are decomposed into sentences, which are then represented by vectors of opinion scores. The summary is produced by eliminating the sentences whose opinion differs from that of the original text, where the difference is expressed by an opinion threshold. The following hypothesis: "textual units that do not share the same opinion as the text are ideas used for development or comparison, and their absence does not affect the semantics of the abstract" has been verified by the Chi-squared statistical measure, which we used to calculate the dependence between a textual unit and the text. Finally, we found an opinion threshold interval that generates the optimal assessments.
A NEW APPROACH BASED ON THE DETECTION OF OPINION BY SENTIWORDNET FOR AUTOMATI...cscpconf
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are among the most widely used techniques in the field of computer applications, and the area has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and passed through the Turing test, which checked similarity between data but was limited to small data sets. Later, various algorithms were developed along with the concepts of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of all those techniques on different parameters.
CLUSTER PRIORITY BASED SENTENCE RANKING FOR EFFICIENT EXTRACTIVE TEXT SUMMARIESecij
This paper presents a cluster-priority-ranking approach to extractive automatic text summarization that aggregates different cluster ranks for final sentence scoring. The approach requires no learning, feature weighting, or semantic processing; combinations of surface-level features are used for individual cluster scoring. The proposed approach produces quality summaries without using the title feature. Experimental results on the DUC 2002 dataset demonstrate the robustness of the proposed approach compared to other surface-level approaches under ROUGE evaluation metrics.
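The rank-aggregation idea in the abstract above can be sketched as follows. Each feature ranks all sentences and a sentence's final score is its summed rank (lower is better). The three features used here (position, length, term frequency) are illustrative surface-level stand-ins, not the paper's actual cluster scores.

```python
from collections import Counter

def ranks(values, reverse=True):
    """Map each index to its 1-based rank under the given feature values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    return {idx: rank for rank, idx in enumerate(order, start=1)}

def aggregate_rank_summary(sentences, n=1):
    """Rank sentences under each surface feature, sum the ranks, and
    return the n sentences with the lowest aggregate rank."""
    words = [s.lower().split() for s in sentences]
    tf = Counter(w for ws in words for w in ws)
    position = [len(sentences) - i for i in range(len(sentences))]  # earlier is better
    length = [len(ws) for ws in words]
    termfreq = [sum(tf[w] for w in ws) for ws in words]
    total = Counter()
    for feature in (position, length, termfreq):
        for idx, rank in ranks(feature).items():
            total[idx] += rank
    best = sorted(range(len(sentences)), key=lambda i: total[i])[:n]
    return [sentences[i] for i in sorted(best)]

docs = ["the cat sat on the mat", "dogs bark", "the mat was warm"]
print(aggregate_rank_summary(docs, n=1))
```

Because only ranks are combined, no feature weights need to be learned, which is the "no learning, no feature weighting" property the abstract claims.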
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
Profile-based Dataset Recommendation for RDF Data Linking Mohamed BEN ELLEFI
With the emergence of the Web of Data, most notably Linked Open Data (LOD), an abundance of data has become available on the web. However, LOD datasets and their inherent subgraphs vary heavily with respect to their size, topic and domain coverage, their schemas, and their data dynamicity (respectively schemas and metadata) over time. To this extent, identifying suitable datasets which meet specific criteria has become an increasingly important, yet challenging task to support issues such as entity retrieval or semantic search and data linking. Particularly with respect to the interlinking issue, the current topology of the LOD cloud underlines the need for practical and efficient means to recommend suitable datasets: currently, only well-known reference graphs such as DBpedia (the most obvious target), YAGO or Freebase show a high amount of in-links, while there exists a long tail of potentially suitable yet under-recognized datasets. This problem is due to the semantic web tradition in dealing with "finding candidate datasets to link to", where data publishers are used to identify target datasets for interlinking.
While an understanding of the nature of the content of specific datasets is a crucial prerequisite for the mentioned issues, we adopt in this dissertation the notion of a "dataset profile": a set of features that describe a dataset and allow the comparison of different datasets with regard to their represented characteristics. Our first research direction was to implement a collaborative filtering-like dataset recommendation approach, which exploits both existing dataset topic profiles, as well as traditional dataset connectivity measures, in order to link LOD datasets into a global dataset-topic-graph. This approach relies on the LOD graph in order to learn the connectivity behaviour between LOD datasets. However, experiments have shown that the current topology of the LOD cloud is far from being complete enough to be considered as a ground truth and consequently as learning data.
Facing the limits of the current topology of LOD (as learning data), our research led us to break away from the topic-profiles representation of the "learn to rank" approach and to adopt a new approach for candidate dataset identification, where the recommendation is based on the intensional profile overlap between different datasets. By intensional profile, we understand the formal representation of a set of schema concept labels that best describe a dataset and can be potentially enriched
Semantic Web: Technologies and Applications for Real-WorldAmit Sheth
Amit Sheth and Susie Stephens, "Semantic Web: Technologies and Applications for Real-World," Tutorial at 2007 World Wide Web Conference, Banff, Canada.
Tutorial discusses technologies and deployed real-world applications through 2007.
Tutorial description at: http://www2007.org/tutorial-T11.php
Due to the exponential growth in the generation of textual data, the need for tools and mechanisms for the automatic summarization of documents has become critical. Text documents are vital to any organization's day-to-day work, and long documents often hamper routine tasks; an automatic summarizer is therefore vital to reducing human effort. Text summarization is an important activity in the analysis of high-volume text documents and is currently a major research topic in Natural Language Processing. It is the process of generating a summary of an input text by extracting representative sentences from it. In this project, we present a novel technique for summarizing domain-specific text using semantic analysis, a subset of Natural Language Processing.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance in capturing word similarities against existing benchmark datasets of word-pair similarities. The paper conducts a correlation analysis between ground-truth word similarities and the similarities obtained by different word embedding methods.
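The intrinsic evaluation described above boils down to two steps: score word pairs with cosine similarity in embedding space, then correlate those scores with human judgments by rank. The sketch below shows this pipeline with invented two-dimensional embeddings and invented human ratings; real evaluations use benchmarks such as WordSim-353 and hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (no ties assumed, as in this toy data)."""
    def rank(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical embeddings and human similarity judgments for word pairs.
emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 2.0, 2.5]  # toy ground-truth ratings
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(spearman(human, model))
```

A correlation near 1 means the embedding orders word pairs the way humans do, which is the quantity the paper's correlation analysis reports per embedding method.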
Dissertation defense slides on "Semantic Analysis for Improved Multi-document Text Summarization"
1. Ph. D. Dissertation Defense
Semantic Analysis for Improved Multi-
Document Summarization
Quinsulon L. Israel
Committee Members:
Dr. Il-Yeol Song (Chair)
Dr. Hyoil Han (Co-chair)
Dr. Jung-Ran Park
Dr. Harry Wang
Dr. Erjia Yan
2. Overview
Motivation
Background
Research Goals
Literature Review
Methodology
› Approach 1 - MDS by Aggregate SDS via Semantic Linear Combination
› Approach 2- MDS by Semantic Triples Clustering with Focus Overlap
Evaluation
Results
Conclusion
Further Work
3. Motivation
▪ What is automatic focused Multi-Document Text Summarization (fMDS)?
‒ Automatic text summarization: creation of a gist of text by
an artificial system
‒ Multi-document summarization: automatic summarization
of multiple, yet related documents
‒ fMDS: multi-document summarization focused on some input
given to an artificial system
4. ▪ Why automatic focused Multi-Document Text Summarization (fMDS)?
‒ Purpose: Information overload reduction of multiple, related
documents according to an inputted focus (i.e. query, topic,
question, etc.)
‒ Use: Quick overviews of news and reports by analysts that
focus on specific details and/or new information
‒ How: Extract subset of sentences from multiple, related text
sources, while maximizing “informativeness” and reducing
redundancy in the new summary
Motivation (cont.)
6. Hypothesis
The use of semantic analysis can improve focused multi-document
summarization (fMDS) beyond the use of
baseline sentence features.
› Semantic analysis here uses light-weight semantic triples to help
both represent and filter sentences.
› Semantic analysis also uses assigned weights given to the
semantic classes (e.g., named entities typed as person,
organization, location, date, etc.)
› Semantic analysis also uses “semantic cues” for identifying
important information
Motivation (cont.)
7. Motivation (cont.)
Research Question & Sub-questions
What effects does semantic analysis of sentences have
on the improvement of focused multi-document
summarization (fMDS)?
› What is the effect on overall system performance of clustering
sentences based on semantic analysis for improving fMDS?
› What is the effect on overall system performance of using the
semantic class scoring of sentences for improving fMDS?
8. Research Goals
› Improve upon the gold standard baseline
› Examine the use of the new “semantic class scoring” with
“semantic cue scoring”
› Examine the use of the new “semantic triples clustering”
methods for extractive fMDS
› Create a portable, light-weight improvement for fMDS that can
be easily modified
9. Background
Human Summarization Activity
› Single document summarization (SDS): 79% “direct
match” to a sentence within the source document
(Kupiec, Pedersen et al. 1995)
› Multiple document summarization (MDS): use 55% of
the “vocabulary” contained within source documents
(Copeck and Szpakowicz 2004)
Man vs. Machine
› “Highly frequent content words” from corpus not
found in automatic summaries during evaluation but
appear in human summaries
(Nenkova, Vanderwende et al. 2006)
› Man and machine have difficulties with generic MDS
(Copeck and Szpakowicz 2004)
10. Human Summarization Agreement
› SDS: 40% unigram overlap
(Lin and Hovy 2002)
− Humans tend to agree on “highly frequent content words”
(Nenkova, Vanderwende et al. 2006)
−Words not agreed upon may not be highly frequent but may
still be useful
› SDS: 30-40 summaries before consensus
› MDS: No such human studies found within literature
Background (cont.)
11. Focused MDS (fMDS) Process
Background (cont.)
Figure 1. Standard summarization system: multi-phase process
• Focus Processing, then Focus-to-Sentences Comparison
• Sentence Processing
• Sentence Compression (optional)
• Sentence Scoring and Ranking (can be an iterative process)
• Redundancy Removal
• Sentence Selection (into summary)
• Summary Truncation
Notes:
• Process initial focus and document sentences
• Score and rank focus-salient sentences
• Add sentences to summary until pre-determined length
• The research system deviates from this standard pipeline in the marked phases
12. Research System
Background (cont.)
Figure 2. Research system pipeline (Text Collection to Text Summary), with the system's novel features marked:
• Sentence Processing (sentence splitting, tokenization, POS, NER, phrase detection) [GATE 7 Toolkit]
• Semantic Annotation (person, location, organization) [MultiPax Plugin]
• Semantic Triples Parsing (subject-verb-object) [developed light semantic component]
• Sentence Scoring (semantic classes, semantic cues into aggregate score, query overlap) [semantic component improvements]
• Conceptual Representations (semantic triples clustering)
• Sentence Selection (STC cluster representative)
13. Overview
Summarization Timeline
Systems Comparison
› Probability/statistical modeling
› Features combination
› Multi-level text relationships
› Graph-based
› Semantic analysis
Literature Review
14. Summarization Timeline
Literature Review (cont.)
Timeline figure (milestones):
• 1958: Lexical occurrence statistics by Luhn
• 1968: Position and cue words by Edmundson
• 1973: Linguistic approaches, "cohesion streamlining" by Mathis
• 1977-1989: Knowledge-based systems: "sketchy scripts" (templates) in the FRUMP system; frames and semantic networks in the TOPIC system; logic rules and generative methods in the SUSY system; hybrid representations and corpus-based methods in the SCISOR system
• 1990-1999: Statistics and corpus training become "state-of-the-art"
• 2000: "Renaissance" era: return of occurrence statistics
• 2004: MDS
• 2010-2011: Deeper semantic and structural analyses
15. Literature Review (cont.)
System Categories
| Author | Year | Statistical Approaches | Features Combination | Graph-based | Multi-level Text Relationships | Semantic Analysis |
| Conroy | 2006 | x | | | | |
| Nenkova | 2006 | x | | | | |
| Arora | 2008 | x | | | | |
| Yih | 2007 | | x | | | |
| Ouyang | 2007 | | x | | | |
| Wan | 2008 | | | x | | |
| Wei | 2008 | | | | x | |
| Wang | 2008 | | | | | x |
| Harabagiu | 2010 | | | | | x |
Because systems report different evaluation measures or use different datasets, normalizing performances across years is not possible.
16. Literature Review (cont.)
Statistical Approaches
| Authors | Year | Compress | Mat Calc | Freq/Prob | LDA | Summary Op |
| Conroy | 2006 | x | x | x | | |
| Nenkova* | 2006 | | | | | |
| Arora | 2008 | | | | x | x |
Legend
Focused: Uses some focusing input to the system
Compress: Uses some form of simplification to add more information
Mat Calc: Uses complex matrix calculations to filter and select sentences
Freq/Prob: Uses a statistical or probability distribution method to score terms
LDA: Uses complex Latent Dirichlet Allocation to model summaries
Summary Op: Uses a method of creating multiple summaries and choosing the optimum summary
* Not focus-based, but still important
| Authors | Year | Advantages | Disadvantages |
| Conroy | 2006 | Uses likely human vocabulary | Uses other collections external to the one observed |
| Nenkova | 2006 | Simple yet relatively effective | Reports a frequency-based indicator of human vocabulary but uses probability instead |
| Arora | 2008 | Captures fixed topics from corpus, optimizes summary | Very complex; relies on sampling over the corpus; a sentence can represent only one topic |
17. Literature Review (cont.)
Features Combination
| Authors | Year | Log Reg | Word Pos | Summary Op | Freq/Prob | Sentence Pos | NER Count | WordNet |
| Yih | 2007 | x | x | x | x | | | |
| Ouyang | 2007 | | | | x | x | x | x |
Legend
Log Reg: Uses logistic regression to tune a scoring estimator
Word Pos: Adds a word position metric to the score
Freq/Prob: Uses a statistical or probability distribution method to score terms
Sentence Pos: Adds a sentence position metric to the score
Summary Op: Uses a method of creating multiple summaries and choosing the optimum summary
NER Count: Counts named entities found and adds to the score
WordNet: Used to determine semantically related words
| Authors | Year | Advantages | Disadvantages |
| Yih | 2007 | Simple estimated scoring, optimizes summary | Uses only two sentence features; no comparison of meaning between words |
| Ouyang | 2007 | Determines most important features | Semantics between words in the focus and the sentence compared arbitrarily |
18. Literature Review (cont.)
Graph-based Approaches
| Authors | Year | Bi-partite | CM Rand Walk |
| Wan | 2008 | x | x |
Legend
Bi-partite: Uses a bi-partite link graph (between clusters and sentences)
CM Rand Walk: Conditional Markov Random Walk performed between nodes in the graph
| Authors | Year | Advantages | Disadvantages |
| Wan | 2008 | Introduces link analysis via modified Google PageRank to MDS | Uses only cosine similarity for all values, thus no comparison of meaning between words |
19. Literature Review (cont.)
Multi-level Text Relationships
| Authors | Year | Mat Calc | Pair-wise | WordNet |
| Wei | 2008 | x | x | x |
Legend
Mat Calc: Uses complex matrix calculations to filter and select sentences
Pair-wise: Compares pairs of text units closely to determine a score
WordNet: Used to determine semantically related words
| Authors | Year | Advantages | Disadvantages |
| Wei | 2008 | Introduces an affinity relationship between text-unit levels; intersecting paired text units with the query reduces noise | Very complex; needs a better formulation for creating vectors for singly observed terms (e.g., too much noise from WordNet without a better constraint) |
20. Literature Review (cont.)
Semantic Analysis
| Authors | Year | Mat Calc | Pair-wise | Structure | Coherence |
| Wang | 2008 | x | x | | |
| Harabagiu | 2010 | | | x | x |
Legend
Mat Calc: Uses complex matrix calculations to filter and select sentences
Pair-wise: Compares pairs of text units closely to determine a score
Structure: Adds ordering and/or proximity of text units to scoring
Coherence: Attempts to improve readability of the summary
| Authors | Year | Advantages | Disadvantages |
| Wang | 2008 | Uses semantic analysis; pairing text units reduces noise | Complex matrix reduction; performance only that of an above-average system |
| Harabagiu | 2010 | Complete semantic parsing; adds coherence for improvement | Heavy corpus training/machine learning; internally created KB; not easily replicable; generic MDS rather than fMDS |
21. Literature Review (cont.)
Similar Research
Wang et al. 2008 vs. Proposed Research
| | Wang | Harabagiu | Research System |
| Approach | Complex matrix factorization | Corpus-trained, hand-crafted KB | Semantic triples clustering |
| Redundancy Removal | Semantic frames overlap, complex SNMF | Heavy, all argument roles | Light-weight, simpler semantic triples |
| Scoring | Sentence-to-sentence semantic relationship more important than focus | Position, ordering | Semantic class scoring, semantic cues scoring |
| Training | None | Extensive | None |
22.
Approach (cont.)
Approach 1 - fMDS by Aggregated SDS via Semantic Linear Combination
− Applies to:
What is the effect on overall system performance of using the semantic
class scoring of sentences for improving fMDS?
• Stage 1 Algorithm: SDS via Semantic Linear Combination (SLC)
› Uses the combination of a feature set to create an aggregate score
› Introduces semantic class and semantic cues scoring to the feature set
• Stage 2 Algorithm: fMDS by SDS via SLC
› Uses all the scored sentences from Stage 1
› Introduces redundancy removal via cosine similarity
Approach 2 - fMDS by Semantic Triples Clustering and Aggregated SDS via SLC
› Uses only the aggregate scores from Approach 1
› Introduces semantic triples clustering for redundancy removal and sentence
selection
− Applies to:
What is the effect on overall system performance of clustering
sentences based on semantic analysis for improving fMDS?
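The redundancy removal via cosine similarity introduced in Stage 2 can be sketched as follows. This is a minimal illustration only: the whitespace tokenizer, the 0.5 threshold, and the greedy selection loop are assumptions for the sketch, not the dissertation's exact settings.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sentences as bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def select_non_redundant(ranked_sentences, threshold=0.5, max_sents=4):
    """Greedily add top-ranked sentences, skipping near-duplicates of
    sentences already in the summary."""
    summary = []
    for sent in ranked_sentences:
        if all(cosine(sent, chosen) < threshold for chosen in summary):
            summary.append(sent)
        if len(summary) == max_sents:
            break
    return summary
```

For example, given two near-identical riot sentences and one unrelated one, the second near-duplicate is skipped and the unrelated sentence is selected instead.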
23. Approach (cont.)
Approach 3 - fMDS by Semantic Triples Clustering with Cluster Connections
› Uses the aggregate scores from Approach 1
› Uses the semantic triples clustering from Approach 2
› Introduces sentence intra- and inter-connectivity for redundancy removal and
sentence selection
− Applies to:
What is the effect on overall system performance of measuring
the conceptual connectivity of sentence triples for improving
fMDS?
24. Approach 1
Stage 1 Algorithm: SDS via Semantic Linear Combination
Input: A Corpus (C) of topically related Documents (D) pre-processed
into Sentences (S) by which shallow semantic analysis
has been performed: the Named Entities (NE) have been labeled
externally by GATE ANNIE.
Output: A summary (SUMM) is a subset of Sentences (S) from the
input corpus documents up to a maxLength (i.e., SUMM = {S1, S2,
... , SN}, where N is the maximum number of sentences that could
be added to the summary). SUMM contains the best sentences
toward a single-document summary.
Approach (cont.)
25. Evaluation
Test Collection: newswire documents; Focus: “Airbus A380”
Input (excerpts):
Document 2: AFP_ENG_20050116.0346
A380 'superjumbo' will be profitable from 2008: Airbus chief
PARIS, Jan 16
The A380 'superjumbo', which will be presented to the world in a lavish ceremony in southern France on Tuesday, will be profitable from 2008, its maker Airbus told the French financial newspaper La Tribune.
"You need to count another three years," Airbus chief Noel Forgeard told Monday's edition of the newspaper when asked when the break-even point of the 10-billion-euro-plus (13-billion-dollar-plus) A380 programme would come.
So far, 13 airlines have placed firm orders for 139 of the new planes, which can seat between 555 and 840 passengers and which have a catalogue price of between 263 and 286 million dollars (200 and 218 million euros).
The break-even point is calculated to arrive when the 250th A380 is sold.
Document 6: AFP_ENG_20050427.0493
Paris airport neighbors complain about noise from giant Airbus A380
TOULOUSE, France, April 27
An association of residents living near Paris's Charles-de-Gaulle airport on Wednesday denounced the noise pollution generated by the giant Airbus A380, after the new airliner's maiden flight.
French acoustics expert Joel Ravenel, a member of the Advocnar group representing those who live near Charles de Gaulle, told AFP he had recorded a maximum sound level of 88 decibels just after the aircraft took off from near the southwestern city of Toulouse.
The figure makes the world's largest commercial jet "one of the loudest planes that will for decades fly over the heads of the four million people living in the area" outside Paris, Advocnar said in a statement.
Ravenel said sound levels near Charles de Gaulle airport normally reached about 40 decibels.
Journalists watching the Airbus A380's first flight at Toulouse airport in southwestern France, however, noted how quiet the take-off and landing had seemed.
Tens of thousands of spectators cheered as the A380 touched down at the airport near Toulouse, home of the European aircraft maker Airbus Industrie, after a test flight of three hours and 54 minutes.
Output of System 100 (MDS on top of Semantic SDS Linear Combination [Semantic MDS]):
Here are some key dates in its development: January 23, 2002: Production starts of Airbus A380 components.
May 7, 2004: French Prime Minister Jean-Pierre Raffarin inaugurates the Toulouse assembly line.
Ravenel said sound levels near Charles de Gaulle airport normally reached about 40 decibels.
According to the source, Wednesday's flight may be at an altitude slightly higher than the some 10,000 feet (3,000 meters) achieved in the first flight, and could climb up to 13,000 feet.
26. Approach (cont.)
Flow of Approach 1 Stage 1 Algorithm: SDS via Semantic Linear Combination
27.
Approach (cont.)
Stage 1 Algorithm: Feature set of Step 10
Aggregated Score formula:
Aggregate Score = Σ_{i ∈ F} ω_i · f_i
where f_i is one of the features described in Section 5.1, i is the index of the feature, ω_i is the weight of f_i, and F is the feature set that this research uses.
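The linear combination above can be sketched in a few lines; the feature names and weight values below are illustrative placeholders, not the dissertation's tuned feature set.

```python
def aggregate_score(features, weights):
    """Aggregate Score = sum over i in F of w_i * f_i."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one sentence (names are assumptions).
features = {"tf_idf": 0.4, "semantic_class": 3.0, "semantic_cue": 1.0, "query_overlap": 2.0}
weights  = {"tf_idf": 1.0, "semantic_class": 0.5, "semantic_cue": 0.3, "query_overlap": 0.8}
score = aggregate_score(features, weights)  # 0.4 + 1.5 + 0.3 + 1.6 = 3.8
```

Sentences are then ranked by this aggregate score before redundancy removal.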
28. Approach (cont.)
Flow of Approach 1 Stage 1 Algorithm: SDS via Semantic Linear Combination
29. Approach 1
Stage 2 Algorithm: MDS by SDS via Semantic Linear Combination
• Input: A corpus (C) of topically related documents (D) pre-processed into the best sentences (S) from the Stage 1 SDS by Shallow Semantic Analysis.
• Output: A summary (SUMM_m) is a subset of sentences (S) from the input documents up to maxLength (i.e., SUMM_m = {S_1^x, S_2^y, ..., S_N^z}, where m refers to multi-document and x, y, and z identify the containing document).
Approach (cont.)
30.
Approach (cont.)
Flow of Approach 1 Stage 2 Algorithm: MDS via SDS Semantic Linear Combination
31. Approach 2
Algorithm: STC Focused MDS by SDS via Semantic Linear Combination
• Input: A corpus (C) of topically related documents (D) pre-processed into the best sentences (S) from the Stage 1 SDS by Shallow Semantic Analysis of Approach 1 and pre-processed for their subject-verb-object triples. Stage 2 MDS of Approach 1 is not used as part of this approach.
• Output: A summary (SUMM_stc) is a subset of sentences (S) from the input corpus documents up to maxLength (i.e., SUMM_stc = {S_1^x, S_2^y, ..., S_N^z}, where x, y, and z identify the containing document).
Approach (cont.)
32.
Approach (cont.)
Approach 2 Algorithm: Example of Step 1
Example sentence:
According to police, the violence erupted after two boys, aged 14 and 16, died when they scaled a wall of an electrical relay station and fell against a transformer.
Annotation:
<Sentence s-v-o=“erupt:violence:*:f;die:boy:*:f;scaled:they:wall:f;”></Sentence>
Semantic triples (in the figure, the circle node is the verb, the first square node the subject, and the last square node the direct object, if found):
triple 1: (erupt, violence, *)
triple 2: (die, boy, *)
triple 3: (scale, they, wall)
This represents examples of sentences transformed into semantic triples.
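The s-v-o attribute above packs each triple into verb:subject:object:flag fields separated by semicolons, with '*' marking an absent direct object. Decoding that string back into triples can be sketched as below; the meaning of the trailing ':f' flag is not specified on the slide, so this sketch simply discards it.

```python
def parse_svo(attr):
    """Split an s-v-o attribute into (verb, subject, object) triples.

    Format per triple: verb:subject:object:flag, triples separated by ';'.
    '*' denotes an absent (direct) object and is mapped to None.
    """
    triples = []
    for chunk in attr.strip(";").split(";"):
        verb, subj, obj, _flag = chunk.split(":")  # flag discarded here
        triples.append((verb, subj, None if obj == "*" else obj))
    return triples

svo = "erupt:violence:*:f;die:boy:*:f;scaled:they:wall:f;"
triples = parse_svo(svo)
# [('erupt', 'violence', None), ('die', 'boy', None), ('scaled', 'they', 'wall')]
```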
33. Flow of Approach 2 Algorithm: STC fMDS By SDS via SLC
Approach (cont.)
34. Approach (cont.)
Approach 2 Algorithm: Example of Step 2
Semantic Triple Cluster Representation (main cluster semantic triple listed first):
Main cluster triple: (spread, riot, *)
• "Rioting spreads to at least 20 Paris-region towns." [1 semantic triple: (spread, riot, *)]
• "The riot has spread to 200 city suburbs and towns, including Marseille, Nice, Toulouse, Lille, Rennes, Rouen, Bordeaux and Montpellier and central Paris, French police said." [2 semantic triples: (say, police, *), (spread, riot, *)]
• "_ Nov. 4 _ Youths torch 750 cars, throw stones at paramedics, as violence spreads to other towns." [2 semantic triples: (spread, riot, *), (throw, youth, stone)]
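One way to read the clustering in this example: sentences whose triples share a normalized (verb, subject) pair fall into the same cluster, with the shared triple acting as the main cluster triple. The sketch below assumes that matching criterion; the dissertation's actual matching may be richer.

```python
from collections import defaultdict

def cluster_triples(sentence_triples):
    """Group sentences whose triples share a (verb, subject) key.

    sentence_triples: list of (sentence, [(verb, subj, obj), ...]).
    Returns {(verb, subj): [sentence, ...]}.
    """
    clusters = defaultdict(list)
    for sentence, triples in sentence_triples:
        for verb, subj, _obj in triples:
            clusters[(verb.lower(), subj.lower())].append(sentence)
    return dict(clusters)

data = [
    ("Rioting spreads to at least 20 Paris-region towns.", [("spread", "riot", None)]),
    ("The riot has spread to 200 suburbs, police said.", [("say", "police", None), ("spread", "riot", None)]),
    ("Youths throw stones as violence spreads.", [("throw", "youth", "stone"), ("spread", "riot", None)]),
]
clusters = cluster_triples(data)
# The ("spread", "riot") cluster holds all three sentences.
```

A cluster representative can then be selected per cluster, which removes redundancy across the collection.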
35. Flow of Approach 2 Algorithm: STC fMDS By SDS via SLC
Approach (cont.)
36. Approach (cont.)
Approach 2 Algorithm: Example of Step 3
Semantic Triple Cluster Representation (main cluster semantic triple listed first):
Main cluster triple: (spread, riot, *)
• "Rioting spreads to at least 20 Paris-region towns." [1 semantic triple: (spread, riot, *)]
• "The riot has spread to 200 city suburbs and towns, including Marseille, Nice, Toulouse, Lille, Rennes, Rouen, Bordeaux and Montpellier and central Paris, French police said." [2 semantic triples: (say, police, *), (spread, riot, *); higher-ranked triple, contained in a sentence with a high triple count]
37. Flow of Approach 2 Algorithm: STC fMDS By SDS via SLC
Approach (cont.)
38. Approach (cont.)
Approach 2 Algorithm: Example of Step 5
Semantic Triple Cluster Representation: Query Overlap with the Semantic Triples
Query: “Paris Riots”
Main cluster triple: (spread, riot, *)
• "Rioting spreads to at least 20 Paris-region towns." [1 semantic triple: (spread, riot, *); semantic triple overlap = 1]
• "The riot has spread to 200 city suburbs and towns, including Marseille, Nice, Toulouse, Lille, Rennes, Rouen, Bordeaux and Montpellier and central Paris, French police said." [2 semantic triples: (say, police, *), (spread, riot, *); semantic triple overlap = 1]
39. Flow of Approach 2 Algorithm: STC fMDS By SDS via SLC
Approach (cont.)
40. Approach 3
Algorithm: STC Focused MDS by SDS via Semantic Linear Combination
• Input: A corpus (C) of topically related documents (D) pre-processed into the best sentences (S) from the Stage 1 SDS by Shallow Semantic Analysis of Approach 1 and pre-processed for their subject-verb-object triples. Stage 2 MDS of Approach 1 is not used as part of this approach. In addition to the Stage 1 SDS processing from Approach 1, Approach 2 Steps 1-6 are used to collect the triples into their proper ordering, and the sentences are later ordered by the connections between these triples.
• Output: A summary (SUMM_conn) is a subset of sentences (S) from the input corpus documents up to maxLength (i.e., SUMM_conn = {S_1^x, S_2^y, ..., S_N^z}, where x, y, and z identify the containing document).
Approach (cont.)
43. Flow of Approach 3 Algorithm: STC fMDS By Connectivity
Approach (cont.)
44.
Goal:
Evaluation
To get a summary with a ROUGE score higher than the gold standard baseline system
To place well against the veteran automatic systems.
For the evaluation, the following methods were used in combination:
› Counting semantic classes and semantic cues to boost informative sentences
› Simpler semantic triples clustering method (including with focus)
Method:
Gather human reference summaries and automated system summaries from the NIST 2008
Text Analysis Conference competition
Use evaluation script from the competition to compare research system summaries against all
other automatic summaries
Compare extrinsic evaluation ROUGE scores with gold standard baseline system and other
automatic systems
45. Evaluation
Data used in evaluation:
› Input for each focus text is a collection of 10 newswire documents
› Each document has approximately 250-500 words for ~20 sentences
› Total input for each collection ranges from ~150-200 sentences
› 46 collections in total, for about 10,000 sentences
Focus: “Paris Riots”
Example Input (truncated):
<DOC id="AFP_ENG_20051028.0154" type="story" >
<HEADLINE>
Riot rocks Paris suburb after teenagers killed
</HEADLINE>
<DATELINE>
CLICHY-SOUS-BOIS, France, Oct 28
</DATELINE>
<TEXT>
<P>
Dozens of youths went
on a rampage, burning vehicles and vandalising buildings in a tough
Paris suburb Friday in an act of rage following the death by
electrocution of two teenagers trying to flee police.
</P>
…
</DOC>
Test Collection
NIST Text Analysis Conference Data
46.
Evaluation
Evaluation Metrics:
Three ROUGE metrics from the NIST competitions are used to evaluate the performance of the proposed system: ROUGE-1, ROUGE-2, and ROUGE-SU4.
ROUGE is an N-gram co-occurrence statistic between a candidate system summary and a set of human model summaries. ROUGE-N (with n = 1 for ROUGE-1) is calculated as follows:
ROUGE-N = ( Σ_{S ∈ {Reference Summaries}} Σ_{gram_n ∈ S} Count_match(gram_n) ) / ( Σ_{S ∈ {Reference Summaries}} Σ_{gram_n ∈ S} Count(gram_n) )
Reference calculation: Four (4) human model summaries created by judges according to the NIST competition guidelines
Gold Standard: Lead baseline system; collects the first four (4) lines of the most recent document
47.
Evaluation
Evaluation Metric: ROUGE-1
Unigram co-occurrence statistic between a system summary and a set of four
human reference summaries. ROUGE-1 is calculated as follows:
Semi-automatic summary vs. 1 Reference Summary
System Candidate Summary Sentence
Police detained 14 people Saturday after a second [night] of [rioting] that broke out in a working-class
[Paris] suburb following the deaths of two youths who were electrocuted while trying to evade police.
3 unigrams found
Human reference Summary Sentence
On successive [nights] the [rioting] spread to other parts of [Paris] and then to other cities.
16 unigrams total
Unigrams: {night}, {rioting}, {Paris}
ROUGE-1 = 3 / 16 = 0.1875
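The calculation above can be reproduced with a clipped-count recall over unigrams. The sketch below uses exact token matching; the slide's example additionally matches "night"/"nights" via stemming, which is omitted here, so a toy sentence pair is used instead.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: matched unigrams (clipped) / total reference unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(cand[t], ref[t]) for t in ref)
    return matched / sum(ref.values())

# Toy pair (not the slide's sentences, which would also require stemming):
score = rouge1_recall("the riot spread", "the riot spread to paris")
# 3 matched unigrams / 5 reference unigrams = 0.6
```

Against multiple reference summaries, the matched and total counts are summed over all references before dividing, as in the formula on the previous slide.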
48.
Evaluation
Evaluation Metric: ROUGE-SU4
A skip-bigram co-occurrence statistic that allows up to four (4) words to appear between the two (2) words of a bigram, as long as they occur in the same order as in the human reference summary.
Semi-automatic summary vs. 1 Reference Summary
System Candidate Summary Sentence
Police detained 14 people Saturday after a second [night of rioting that broke out] in a working-class
[Paris] suburb following the deaths of two youths who were electrocuted while trying to evade
police.
1 skip bigram found
Human Reference Summary Sentence
On successive [nights the rioting spread to other] parts of [Paris] and then to other cities.
21 skip bigrams total
Skip Bi-grams: {night rioting}
ROUGE-SU4 = 1 / 21 = 0.04762
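The skip-bigram enumeration behind ROUGE-SU4 can be sketched as below; intersecting the candidate's and reference's skip-bigram sets gives the matched count. Note that SU4 also counts unigrams, which this sketch leaves out.

```python
def skip_bigrams(tokens, max_gap=4):
    """All ordered word pairs with at most max_gap words between them."""
    pairs = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.add((tokens[i], tokens[j]))
    return pairs

pairs = skip_bigrams("night of rioting that broke out".split())
# ("night", "rioting") is one of the skip-bigrams (gap of one word).
```

Dividing the size of the intersection of candidate and reference skip-bigram sets by the reference total yields the recall shown in the slide's worked example.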
49.
Results: Approach 1
System Ranking by MDS via SDS Semantic Analysis Approach Variations
50. Discussion: Approach 1
• The poorer performances of Systems 5, 10R and 16R show that stop word removal is absolutely necessary for improvement, even with the semantic analysis. Without it, systems could not outperform the gold standard baseline system. Slight improvement is shown from the semantic analysis SDS-based MDS System 6 over its relative, System 5.
• The improved ROUGE score of System 9 over System 16P shows some importance for adding more semantic cueing and semantic class scoring to the selection of sentences. The weights are similar, but System 16P takes weight away from the semantic cueing and semantic class scoring and gives it to "df"; hence the drop.
• Another related class of tested systems comprises Systems 9, 10R and 11H. These systems differ mostly in their alternative frequency measures. System 11H's use of the well-known tf*idf measure outperforms the pure "df" measure that the other two use. tf*idf, along with the semantic cueing and semantic class scoring, allowed System 11H to obtain a higher score than the gold standard baseline.
52. Discussion: Approach 2
• All systems displayed are instances of the STC MDS system, except System 0 and this work's System 100, which is fMDS by SDS SLC from Approach 1.
• Although its performance on its own was not as promising compared to the veteran systems, the addition of System 3100's semantic triples clustering greatly improves performance by more than 10 rankings over the gold standard baseline. System 3100 also used a minimum cluster density of 2 and a ranking method that gave preference to the cluster aggregate score over the cluster density.
• Systems 1600, 1200, and 2700 show improvement over the automatic gold standard baseline with semantic triples clustering alone; however, each additional ranking method shows added improvement.
54. Discussion: Approach 3
• Systems Conn1 and Conn2 show only slight improvement over the gold standard baseline System 0 in Table 11, with System Conn2 performing best with stop-word pruning of the clusters based on their semantic roles (i.e., if a stop word was found within a subject, verb or object slot, it was removed from consideration).
• Because these system variations show a minor drop in performance against the gold standard baseline system on the other ROUGE metrics, the approach does not perform satisfactorily, but may be relevant for improvement in future work.
55.
Conclusion
This work sought to answer the question: what effects does semantic analysis of sentences have on the improvement of focused multi-document summarization (fMDS)?
1. What is the effect on overall system performance of using the semantic class scoring of sentences for improving fMDS?
Even though it was shown that tf*idf is extremely important in selecting the best sentences, a gap remains that this semantic analysis starts to fill. The semantic class and semantic cue scoring still improved several places over the gold standard baseline system.
56.
Conclusion
2. What is the effect on overall system performance of clustering sentences based on semantic analysis for improving fMDS?
This work's System 3100 outperformed the gold standard baseline System 0 by over 10 places. This performance improvement is mainly attributed to the semantic analysis technique of filtering the sentences by clustering their semantic triples. The semantic triples represent the most basic "meaning" of the sentences during the filtering process.
However, a small drop in performance of two places was observed when attempting to focus the semantic triples themselves. This is possibly due to the absence of the focus terms within the main propositions of the sentences, even though those terms may appear elsewhere within the sentence. Using the query feature helped mitigate this effect for the semantic triples clustering in System 3100; hence the better performance without the query overlap determination.
57.
Conclusion
3. What is the effect on overall system performance of measuring the conceptual connectivity of sentence triples for improving fMDS?
Unfortunately, the technique used for sentence intra- and inter-connectivity did not perform well enough against the gold standard baseline system. This approach was able to capture slightly more vocabulary, as denoted by its slightly higher ROUGE-1 score, but other scores were slightly lower compared to the performance of the gold standard baseline system.
Important note: within all three semantic analysis approaches, no word sense disambiguation (WSD) was performed. Even for terms within semantic roles across multiple triples that can be clustered together, the actual "meaning" may differ. It would be worthwhile to add WSD utilizing the words that appear around each discovered role. This may help improve the accuracy of the systems implemented in this research.
58. Contributions:
› Provides an improvement over the gold standard baseline by more
than ten positions.
› Proposes “semantic triples clustering” along with “semantic class
scoring” and “semantic cue scoring” as methods to improve
extractive fMDS.
› Provides a comparison of the semantic analysis techniques on
performance for fMDS that can be used later for new abstractive
summarization.
› Created a light-weight, portable fMDS system with no training.
Conclusion
59. Significance:
Improved over the gold standard baseline
The research provides a more comparable semantic analysis against fundamental techniques and a gold standard baseline.
More domain-independent improvement due to no need for training
Can be used as a new baseline and can be tested easily on other corpora.
Simpler and less expensive than extensively corpus-trained a priori systems
Other veteran methods are too expensive and time-consuming to reproduce. This research does not rely on extensive corpus training or on building tailored, domain-dependent resources, and does not carry the associated cost in time.
Compressing the sentences into a basic form of meaning takes a step in the direction of an abstractive technique
The semantic triples used for this extractive fMDS can be modified to take a step into the relatively unexplored area of abstractive fMDS. Humans tend to extract whole sentences from documents to create a summary; however, they also shorten, move and/or infuse information depending upon importance and length.
Conclusion
60. Conclusion
Further Work
› Direct semantic triplet summaries
› Weight dampening
› Advanced semantic class analysis
61. Publications
Submitted:
Israel, Quinsulon L., Hyoil Han, and Il-Yeol Song. Semantic analysis for focused multi-document
summarization of text. Submitted to ACM Symposium on Applied
Computing (SAC) 2015.
Published:
Israel, Quinsulon L., Hyoil Han, and Il-Yeol Song (2010). Focused multi-document
summarization: human summarization activity vs. automated systems techniques.
Journal of Computing Sciences in Colleges. 25(5): 10-20.
62. Publication Plan
Journals
Fall 2014
Submit to Journal of Intelligent Information Systems:
› Covers the integration of artificial intelligence and database technologies to create next generation
Intelligent Information Systems
› http://www.springer.com/computer/database+management+%26+information+retrieval/journal/10844
Submit to Information Processing & Management:
› Covers experimental and advanced processes related to information retrieval (IR) and a variety of
information systems, networks, and contexts, as well as their implementations and related evaluation.
› http://www.journals.elsevier.com/information-processing-and-management
63. References
• Conroy, J. M., J. D. Schlesinger, et al. (2006). Topic-focused multi-document summarization using an
approximate oracle score. Proceedings of the COLING/ACL on Main conference poster sessions.
Sydney, Australia, Association for Computational Linguistics: 152-159.
• Dang, H. T. (2006). Overview of DUC 2006. Document Understanding Conference.
• Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM 16(2): 264-285.
• Harabagiu, S. and F. Lacatusu (2010). "Using topic themes for multi-document summarization." ACM
Transactions on Information Systems 28(3): 1-47.
• Ouyang, Y., S. Li, et al. (2007). Developing learning strategies for topic-based summarization.
Proceedings of the sixteenth ACM conference on Conference on information and knowledge
management. Lisbon, Portugal, ACM: 79-86.
• Nenkova, A. and K. McKeown (2003). References to named entities: a corpus study. Proceedings of
the 2003 Conference of the North American Chapter of the Association for Computational Linguistics
on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003--short
papers - Volume 2. Edmonton, Canada, Association for Computational Linguistics: 70-72.
• Yih, W.-T., J. Goodman, et al. (2007). Multi-document summarization by maximizing informative
content-words. International Joint Conferences on Artificial Intelligence, Hyderabad, India.
• Wan, X. and J. Yang (2008). Multi-document summarization using cluster-based link analysis.
Proceedings of the 31st annual international ACM SIGIR conference on Research and development
in information retrieval. Singapore, Singapore, ACM: 299-306.
• Wang, D., T. Li, et al. (2008). Multi-document summarization via sentence-level semantic analysis
and symmetric matrix factorization. Proceedings of the 31st annual international ACM SIGIR
conference on Research and development in information retrieval. Singapore, Singapore, ACM: 307-314.
• Wei, F., W. Li, et al. (2008). Query-sensitive mutual reinforcement chain and its application in
query-oriented multi-document summarization. Proceedings of the 31st annual international ACM SIGIR
conference on Research and development in information retrieval. Singapore, Singapore, ACM: 283-290.