This document presents an approach to improve the efficiency of text summarization using submodular functions. It proposes using k-means clustering and hierarchical clustering to group similar text passages into clusters, then selecting sentences from clusters to create a summary that maximizes coverage of the original text while also promoting diversity. The approach models summary quality as the coverage of the summary set plus a term rewarding diversity. Experimental results show this method achieves higher ROUGE evaluation scores than an agglomerative clustering approach.
Adversarial Reinforced Learning for Unsupervised Domain Adaptationtaeseon ryu
안녕하세요 딥러닝 논문읽기 모임입니다 오늘 업로드된 논문 리뷰 영상은 2021 WACB 에서 발표된 Adversarial Reinforced Learning for Unsupervised Domain Adaptation 라는 제목의 논문입니다.
데이터 분류의 자동화를 위해서는 많은양의 학습데이터가 필요합니다. 그렇기에 레이블이 존재하는 데이터로 학습이 끝난 모델을 재활용해서 새로운 도메인에 적용하는 연구인 도메인 어뎁션 분야는 많은 각광을 받고 있습니다.
논문의 특징으로는 크게 세가지를 둘 수 있습니다.
첫 번째로 본 논문에서는 GAN을 이용하여 비지도 방식으로 도메인 어뎁션이 가능한 프레임워크를 제안하였습니다 여기서 이제 강화학습 모델은 소스와 타겟
도메인간 가장 최적의 피처쌍을 선택하는데 사용됩니다
두 번째로 레이블링 되지않은 타겟 도메인에서 가장 적합한 피처를 찾아내기 위해
소스와 타겟간 상관관계를 보상으로 적용하는 정책을 개발하였습니다
마지막으로 제안된 적대적 강화학습 모델을 소스와 타겟 도메인간
최소화하는 피처쌍의 탐색과 각 도메인의 거리 분포상태의
Alignment 학습을 통해 소타대비 이제 성능을 향상 하였습니다
논문에 대한 디테일한 리뷰를 펀디멘탈팀 이근배님이 많은 도움 주셨습니다!
Adversarial Reinforced Learning for Unsupervised Domain Adaptationtaeseon ryu
안녕하세요 딥러닝 논문읽기 모임입니다 오늘 업로드된 논문 리뷰 영상은 2021 WACB 에서 발표된 Adversarial Reinforced Learning for Unsupervised Domain Adaptation 라는 제목의 논문입니다.
데이터 분류의 자동화를 위해서는 많은양의 학습데이터가 필요합니다. 그렇기에 레이블이 존재하는 데이터로 학습이 끝난 모델을 재활용해서 새로운 도메인에 적용하는 연구인 도메인 어뎁션 분야는 많은 각광을 받고 있습니다.
논문의 특징으로는 크게 세가지를 둘 수 있습니다.
첫 번째로 본 논문에서는 GAN을 이용하여 비지도 방식으로 도메인 어뎁션이 가능한 프레임워크를 제안하였습니다 여기서 이제 강화학습 모델은 소스와 타겟
도메인간 가장 최적의 피처쌍을 선택하는데 사용됩니다
두 번째로 레이블링 되지않은 타겟 도메인에서 가장 적합한 피처를 찾아내기 위해
소스와 타겟간 상관관계를 보상으로 적용하는 정책을 개발하였습니다
마지막으로 제안된 적대적 강화학습 모델을 소스와 타겟 도메인간
최소화하는 피처쌍의 탐색과 각 도메인의 거리 분포상태의
Alignment 학습을 통해 소타대비 이제 성능을 향상 하였습니다
논문에 대한 디테일한 리뷰를 펀디멘탈팀 이근배님이 많은 도움 주셨습니다!
Error Estimates for Multi-Penalty Regularization under General Source Conditioncsandit
In learning theory, the convergence issues of the regression problem are investigated with
the least square Tikhonov regularization schemes in both the RKHS-norm and the L 2
-norm.
We consider the multi-penalized least square regularization scheme under the general source
condition with the polynomial decay of the eigenvalues of the integral operator. One of the
motivation for this work is to discuss the convergence issues for widely considered manifold
regularization scheme. The optimal convergence rates of multi-penalty regularizer is achieved
in the interpolation norm using the concept of effective dimension. Further we also propose
the penalty balancing principle based on augmented Tikhonov regularization for the choice of
regularization parameters. The superiority of multi-penalty regularization over single-penalty
regularization is shown using the academic example and moon data set.
This is an introduction of Topic Modeling, including tf-idf, LSA, pLSA, LDA, EM, and some other related materials. I know there are definitely some mistakes, and you can correct them with your wisdom. Thank you~
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deren Lei
Deep reinforcement learning (RL) has been a commonly-used strategy for the abstractive summarization task to address both the exposure bias and non-differentiable task issues. However, the conventional reward ROUGE-L simply looks for exact n-grams matches between candidates and annotated references, which inevitably makes the generated sentences repetitive and incoherent. In this paper, we explore the practicability of utilizing the distributional semantics to measure the matching degrees. Our proposed distributional semantics reward has distinct superiority in capturing the lexical and compositional diversity of natural language.
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
This presentation is about how global terminology can evolve without a centralized organisation. The simple idea is, that everybody has to disclose the identity of at least two identifiers for the same think. These local semantic handshakes will have the effect of global terminological alignment.
Error Estimates for Multi-Penalty Regularization under General Source Conditioncsandit
In learning theory, the convergence issues of the regression problem are investigated with
the least square Tikhonov regularization schemes in both the RKHS-norm and the L 2
-norm.
We consider the multi-penalized least square regularization scheme under the general source
condition with the polynomial decay of the eigenvalues of the integral operator. One of the
motivation for this work is to discuss the convergence issues for widely considered manifold
regularization scheme. The optimal convergence rates of multi-penalty regularizer is achieved
in the interpolation norm using the concept of effective dimension. Further we also propose
the penalty balancing principle based on augmented Tikhonov regularization for the choice of
regularization parameters. The superiority of multi-penalty regularization over single-penalty
regularization is shown using the academic example and moon data set.
This is an introduction of Topic Modeling, including tf-idf, LSA, pLSA, LDA, EM, and some other related materials. I know there are definitely some mistakes, and you can correct them with your wisdom. Thank you~
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deren Lei
Deep reinforcement learning (RL) has been a commonly-used strategy for the abstractive summarization task to address both the exposure bias and non-differentiable task issues. However, the conventional reward ROUGE-L simply looks for exact n-grams matches between candidates and annotated references, which inevitably makes the generated sentences repetitive and incoherent. In this paper, we explore the practicability of utilizing the distributional semantics to measure the matching degrees. Our proposed distributional semantics reward has distinct superiority in capturing the lexical and compositional diversity of natural language.
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
This presentation is about how global terminology can evolve without a centralized organisation. The simple idea is, that everybody has to disclose the identity of at least two identifiers for the same think. These local semantic handshakes will have the effect of global terminological alignment.
Comparative study of classification algorithm for text based categorizationeSAT Journals
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine
learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization techniques.
It provides conceptual views of the collected documents and has important applications in the real world. Text based
categorization is made use of for document classification with pattern recognition and machine learning. Advantages of a number
of classification algorithms have been studied in this paper to classify documents. An example of these algorithms is: Naive Bayes'
algorithm, K-Nearest Neighbor, Decision Tree etc. This paper presents a comparative study of advantages and disadvantages of
the above mentioned classification algorithm.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"IJDKP
Clustering is one of the data mining techniques that have been around to discover business intelligence by grouping objects into clusters using a similarity measure. Clustering is an unsupervised learning process that has many utilities in real time applications in the fields of marketing, biology, libraries, insurance, city-planning, earthquake studies and document clustering. Latent trends and relationships among data objects can be unearthed using clustering algorithms. Many clustering algorithms came into existence. However, the quality of clusters has to be given paramount importance. The quality objective is to achieve
highest similarity between objects of same cluster and lowest similarity between objects of different clusters. In this context, we studied two widely used clustering algorithms such as the K-Means and Fuzzy K-Means. K-Means is an exclusive clustering algorithm while the Fuzzy K-Means is an overlapping clustering algorithm. In this paper we prove the hypothesis “Fuzzy K-Means is better than K-Means for Clustering” through both literature and empirical study. We built a prototype application to demonstrate the differences between the two clustering algorithms. The experiments are made on diabetes dataset
obtained from the UCI repository. The empirical results reveal that the performance of Fuzzy K-Means is better than that of K-means in terms of quality or accuracy of clusters. Thus, our empirical study proved the hypothesis “Fuzzy K-Means is better than K-Means for Clustering”.
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Duality Theory in Multi Objective Linear Programming Problemstheijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering& Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
Theoretical work submitted to the Journal should be original in its motivation or modeling structure. Empirical analysis should be based on a theoretical framework and should be capable of replication. It is expected that all materials required for replication (including computer programs and data sets) should be available upon request to the authors.
The International Journal of Engineering & Science would take much care in making your article published without much delay with your kind cooperation
Scholarly search engines, reference management tools, and academic social networks enable modern researchers to organize their scientific libraries. Moreover, they often provide recommendations for scientific publications that might be of interest to researchers. Because of the exponentially increasing volume of publications, effective citation recommendation is of great importance to researchers, as it reduces the time and effort spent on retrieving, understanding, and selecting research papers. In this context, we address the problem of citation recommendation, i.e., the task of recommending citations for a new paper. Current research investigates this task in different settings, including cases where rich user metadata is available (e.g., user profile, publications, citations). This work focus on a setting where the user provides only the abstract of a new paper as input. Our proposed approach is to expand the semantic features of the given abstract using knowledge graphs – and, combine them with other features (e.g., indegree, recency) to fit a learning to rank model. This model is used to generate the citation recommendations. By evaluating on real data, we show that the expanded semantic features lead to improving the quality of the recommendations measured by nDCG@10.
Semantic annotation is done through first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations, the vectors are taken as features into a classifier, trained and a model is made which can classify a document with ACM classification tree categories, with the help of Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le, and Tomas Mikolov, ''Distributed Representations of Sentences and Documents ICML", 2014
k-Means is a rather simple but well known algorithms for grouping objects, clustering. Again all objects need to be represented as a set of numerical features. In addition the user has to specify the number of groups (referred to as k) he wishes to identify. Each object can be thought of as being represented by some feature vector in an n dimensional space, n being the number of all features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space, these point serve as the initial centers of the clusters. Afterwards all objects are each assigned to center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k. Yet the main principle always remains the same. In this project we will discuss about K-means clustering algorithm, implementation and its application to the problem of unsupervised learning
The project is developed as a part of IRE course work @IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Chris Ohk
RL 논문 리뷰 스터디에서 Evolving Reinforcement Learning Algorithms 논문 내용을 정리해 발표했습니다. 이 논문은 Value-based Model-free RL 에이전트의 손실 함수를 표현하는 언어를 설계하고 기존 DQN보다 최적화된 손실 함수를 제안합니다. 많은 분들에게 도움이 되었으면 합니다.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Automobile Management System Project Report.pdfKamal Acharya
The proposed project is developed to manage the automobile in the automobile dealer company. The main module in this project is login, automobile management, customer management, sales, complaints and reports. The first module is the login. The automobile showroom owner should login to the project for usage. The username and password are verified and if it is correct, next form opens. If the username and password are not correct, it shows the error message.
When a customer search for a automobile, if the automobile is available, they will be taken to a page that shows the details of the automobile including automobile name, automobile ID, quantity, price etc. “Automobile Management System” is useful for maintaining automobiles, customers effectively and hence helps for establishing good relation between customer and automobile organization. It contains various customized modules for effectively maintaining automobiles and stock information accurately and safely.
When the automobile is sold to the customer, stock will be reduced automatically. When a new purchase is made, stock will be increased automatically. While selecting automobiles for sale, the proposed software will automatically check for total number of available stock of that particular item, if the total stock of that particular item is less than 5, software will notify the user to purchase the particular item.
Also when the user tries to sale items which are not in stock, the system will prompt the user that the stock is not enough. Customers of this system can search for a automobile; can purchase a automobile easily by selecting fast. On the other hand the stock of automobiles can be maintained perfectly by the automobile shop manager overcoming the drawbacks of existing system.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Courier management system project report.pdfKamal Acharya
It is now-a-days very important for the people to send or receive articles like imported furniture, electronic items, gifts, business goods and the like. People depend vastly on different transport systems which mostly use the manual way of receiving and delivering the articles. There is no way to track the articles till they are received and there is no way to let the customer know what happened in transit, once he booked some articles. In such a situation, we need a system which completely computerizes the cargo activities including time to time tracking of the articles sent. This need is fulfilled by Courier Management System software which is online software for the cargo management people that enables them to receive the goods from a source and send them to a required destination and track their status from time to time.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
1. Diversity Measure for text
Summarization
Guided by: Dr.Vasudev Verma
Mentor: Litton J Kurisinkel
Submitted By:- Group No.27
Nirat Attri (201201013)
Dhruva Das (201301151)
Siddharth Saklecha (201505570)
2. Introduction
➢ A summary is brief but detailed outline of a
document, which conveys the essence of
the document.
➢ The goal of a text Summarizer is
condensing the source text into a shorter
version while its overall information and
meaning remains same.
➢ The Generated Summary should cover
relevant topics in the original corpus and
diverse enough.
3. Motivation
➢ A short summary, which conveys the
essence of the document, helps in finding
relevant information quickly
➢ Text summarization also provides a way to
cluster similar documents and present a
summary.
➢ Text summarization has become an
important and timely tool for assisting and
interpreting text information in today’ fast-
growing information age.
4. Problem Statement :
➢ Derive a method to improve efficiency of the task
of Text Summarization.
➢ We design novel methods to improve efficiency of the
task of Text Summarization using a class of sub
modular functions.
➢ These functions each combine two terms, one which
encourages the summary to be representative of the
corpus (coverage), and the other which positively
rewards diversity.
➢ Our functions are monotone non-decreasing and
submodular, which means that an efficient scalable
greedy optimization scheme has a constant factor
guarantee of optimality.
Proposed Solution :
5. Submodular Function:
➢ Sub-modular functions are those that satisfy the
property of diminishing returns: for any A⊆B⊆ V
v, a sub-modular function F must satisfy
F(A+v) - F(A) >= F(B + v) - F(B).
That is, the incremental value of v decreases as
the context in which v is considered grows from A
to B.
➢ An equivalent definition, useful mathematically, is
that for any A,B⊆V, we must have that
F(A)+F(B) >= F(AUB)+F(AnB).
If this is satisfied everywhere with equality, then
the function F is called modular.
6. Summarization Tools :
➢ CLUTO
It is a software package for clustering low
and high dimensional datasets and for
analyzing the characteristics of the various
clusters.
➢ ROUGE
Recall-Oriented Understudy for Gisting
Evaluation, is a set of metrics and a
software package used for evaluating
automatic summarization and machine
translation software in natural language
processing.
7. Approach
➢ Two properties of a good summary are relevance and non
redundancy.
➢ Objective functions for extractive summarization usually
measure these two separately and then mix them together
trading off encouraging relevance and penalizing
Redundancy.
➢ The redundancy penalty usually violates the monotonicity
of the objective functions.
➢ In particular, we model the summary quality as
F(S) = L(S) + λ R(S)
where, L(S) measures the coverage, or fidelity, of summary
set S to the document, R(S) rewards diversity in S, and λ is
a trade-off coefficient.
➢ Coverage Measure:
L(S) can be interpreted either as a set function that
measures the similarity of summary set S to the document
to be summarized, or as a function representing some
form of coverage of V by S.
L(S) should be monotone, as coverage improves with a
larger summary.
8. Approach Continue…
Shannon entropy is a well-known monotone submodular
function. So, we take our coverage function as:
L(S) = Σ min { Ci(S) , α Ci(V) } and i ∈ V
Basically, Ci(S) measures how similar S is to element i, or how
much of i is covered by S and Ci (V) is just the largest value
that Ci(S) can achieve.
➢Diversity Measure :
where Pi, i = 1, ...,K is a partition of the ground set V into
separate clusters, and ri ≥ 0 indicates the singleton reward of i
(i.e., the reward of adding i into the empty set). The value ri
estimates the
importance of i to the summary.
9. Approach 1 : K-means Clustering
➢ The dataset was fed into CLUTO to perform K-means and
Hierarchical Clustering to obtain clusters referring to
similar data.
➢ We ran a Grid Search on the values of and to get the best
optimal value to maximize the sub modular function.
➢ Algorithm:
➢ Summary → ⌀
➢ allowedClusters ← allClusters
➢ while size(Summary) ≤ 665:
• pick the cluster most similar to corpus from
allowedClusters → chosenCluster
• chosenSentence ← highest ranking sentence of
chosenCluster based on coverage and diversity
measure
• Summary ← chosenSentence
10. Approach 2 : Agglomerative
Clustering
➢ Agglomerative clustering(also called Hierarchical clustering
analysis or HCA) is a method of cluster analysis which
seeks to build a hierarchy of cluster.
➢ It is a bottom up approac. Each observation starts in its
own cluster, and pairs of clusters are merged as one moves
up the hierarchy.
➢ O(n^3) approach.
➢ Each observation starts in its own cluster and clusters are
succesively merged together. The linkage criteria
determines the metric used for the merge strategy.
➢ In computing the clusters, we used Cosine Similarity
criteria.
11. Challenges Faced
➢ One the challenges that we faced was to understand the
Output given by the Tool Cluto, which Clusters the given
sentences. After which finding the summary of the folder
was just a basic implementation of the formula given in the
paper.
➢ Another challenge was that the time taken to find the
summary for a single folder was quite large. This time
could be drastically reduced by not using the condition of
Sub-modular formula f(A+v) - f(A) > f(B+v) - f(B), that is, the
incremental value of v decreases as the context in which v
is considered grows from A to B. But this would be at the
cost of accuracy.
12. Results and Conclusion
➢ The values for λ and α values for the diversity and
coverage measures giving us the submodular function were
calculated using a sweep search for the best values of
ROGUE scores.
➢ We recovered the best values as follows :
α = 15
λ = 4
➢ The clusters were summarized and their ROGUE scores
calculated to estimate the efficiency
➢ We can conclude by saying that K-means and Heirarchical
clustering using a weighted consideration for both, the
diversity and coverage, gives us the best scores in Text
Summarization.
Apporach ROUGE-R ROUGE-F
K-Means and hierarchical 0.3843 0.3792
Agglomerative 0.3724 0.3674
13. References
➢ Vishal Gupta and Gurpreet Singh Lehal, A Survey of Text
Summarization Extractive Techniques.
➢ Hui Lin and Jeff Bilmes, A Class of Submodular Functions
for Document Summerization.
➢ Hui Lin and Jeff Bilmes, Multi-document Summarization
via Budgeted Maximization of Submodular Function.
➢ Ying Zhao and George Karypis, Criterion Functions for
Document Clustering Experiments and Analysis.
➢ Chin-Yew LIN, A Package for Automatic Evaluation of
Summaries.
➢ DUC 2004,http://duc.nist.gov/data.html