In this era, when a tremendous amount of information is available on the internet, it is important to
provide improved mechanisms for extracting information quickly and efficiently. It is very difficult
for human beings to manually summarize large text documents. There are therefore two problems:
searching for the relevant documents among the many available, and absorbing the relevant
information from them. Automatic text summarization addresses both. Text summarization is the
process of identifying the most important, meaningful information in a document or set of related
documents and compressing it into a shorter version that preserves its overall meaning. More
specifically, Abstractive Text Summarization (ATS) is the task of constructing summary sentences by
merging facts from different source sentences and condensing them into a shorter representation
while preserving the information content and overall meaning. This paper introduces a newly
proposed deep learning technique for abstractive summarization of newspaper articles.
Data mining is knowledge discovery in databases; its goal is to extract patterns and knowledge from
large amounts of data. An important subfield of data mining is text mining, which extracts high-quality
information from text, typically by statistical pattern learning. "High quality" in text mining refers to a
combination of relevance, novelty and interestingness. Text mining tasks include text categorization,
text clustering, entity extraction and sentiment analysis. Natural language processing and analytical
methods are widely used to turn text into data for analysis. This survey covers the various techniques
and algorithms used in text mining.
Complete agglomerative hierarchy document’s clustering based on fuzzy Luhn’s … (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up clustering method in which distances between documents can be obtained by extracting feature values using a topic-based latent Dirichlet allocation (LDA) method. To reduce the number of features, term selection can be performed using Luhn’s idea. These methods can be combined to build better document clusters, but little research discusses this. Therefore, in this research, the term weighting calculation uses Luhn’s idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling LDA combined with term frequency and the fuzzy Sugeno method. The feature values serve as distances between documents, which are clustered with single-, complete- and average-link algorithms. The evaluations show little difference between feature extraction with and without the lower cut-off. However, determining the topic of each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding relevant documents. Using the lower cut-off and fuzzy Sugeno Gibbs LDA with complete-link agglomerative hierarchical clustering yields consistent metric values, suggesting this clustering method produces document clusters that are more relevant to the gold standard.
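Luhn's idea for term selection can be sketched as follows: rank terms by frequency, then discard terms above an upper cut-off (too common to discriminate between documents) and below a lower cut-off (too rare to matter). A minimal sketch, where the cut-off fractions and documents are illustrative assumptions, not values from the paper:

```python
from collections import Counter

def luhn_term_selection(documents, upper_frac=0.15, lower_frac=0.4):
    """Keep mid-frequency terms: drop the most common (upper cut-off)
    and the rarest (lower cut-off) terms, per Luhn's idea."""
    counts = Counter(term for doc in documents for term in doc.split())
    ranked = [t for t, _ in counts.most_common()]   # most frequent first
    upper = int(len(ranked) * upper_frac)           # terms dropped from the top
    lower = int(len(ranked) * lower_frac)           # terms dropped from the bottom
    return set(ranked[upper:len(ranked) - lower])

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog"]
print(sorted(luhn_term_selection(docs)))
```

The surviving mid-frequency terms are the candidates that would then be weighted and fed to the LDA-based feature extraction.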
This paper proposes a natural-language discourse analysis method for extracting
information from news articles in different domains. The discourse analysis uses Rhetorical Structure
Theory (RST) to find the coherent groups of text that are most prominent for extracting information.
RST uses the nucleus-satellite concept to identify the most prominent text in a document. After
discourse analysis, text analysis is performed to extract domain-related objects and relate them. A
knowledge-based system consisting of a domain dictionary is used for information extraction; the
domain dictionary holds a bag of words for each domain. The system is evaluated against a gold
standard and human judgment of the extracted information.
Text mining is a technique that helps users find useful information in large collections of text documents on the web or in databases. Most popular text mining and classification methods adopt term-based approaches; pattern-based methods describe user preferences. This review paper analyzes how text mining works at three levels: sentence level, document level and feature level. We review previously published related work and demonstrate the problems that arise when text mining is performed at the feature level. The paper also presents a text mining technique for compound sentences.
A template based algorithm for automatic summarization and dialogue managemen… (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is a methodology that extracts important events such as dates, places and subjects of interest, and that presents users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key words: cosine similarity, information, natural language, summarization, text mining.
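Cosine similarity, listed among the key words above, compares two texts as term-frequency vectors; a minimal sketch (the whitespace tokenization and the example sentences are illustrative assumptions):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the meeting is on monday",
                        "the meeting happens monday"))
```

A score of 1.0 means identical term distributions; 0.0 means no shared terms, which is why it is a common sentence-relevance measure in extractive summarizers.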
Current multi-document summarization systems can successfully extract summary sentences, but with
many limitations, including low coverage, inaccurate extraction of important sentences, redundancy,
and poor coherence among the selected sentences. The present study introduces a new centroid-based
concept and reports new techniques for extracting summary sentences from multiple documents. In
both techniques, keyphrases are used to weight sentences and documents. The first summarization
technique (Sen-Rich) prefers sentences of maximum richness, while the second (Doc-Rich) prefers
sentences from the centroid document. To demonstrate the application of the new summarization
system to Arabic documents, we performed two experiments. First, we applied the ROUGE measure to
compare the new techniques with the systems presented at TAC 2011; the results show that Sen-Rich
outperformed all systems on ROUGE-S. Second, the system was applied to summarizing multi-topic
documents; using human evaluators, the results show that Doc-Rich is superior, with summary
sentences characterized by greater coverage and cohesion.
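The keyphrase-based weighting used by both techniques can be sketched as scoring each sentence by the keyphrases it contains; the scoring scheme, keyphrase set, and sentences below are illustrative assumptions, not the paper's exact richness formula:

```python
def sentence_richness(sentence, keyphrases):
    """Score a sentence by the fraction of its words covered by keyphrases
    (an illustrative stand-in for the paper's richness measure)."""
    words = sentence.lower().split()
    hits = sum(1 for w in words if w in keyphrases)
    return hits / len(words) if words else 0.0

def summarize(sentences, keyphrases, k=2):
    """Pick the k sentences with the highest keyphrase richness."""
    ranked = sorted(sentences,
                    key=lambda s: sentence_richness(s, keyphrases),
                    reverse=True)
    return ranked[:k]

keys = {"summarization", "coverage", "coherence"}
sents = ["Summarization aims at coverage and coherence",
         "The weather was pleasant yesterday",
         "Coverage matters in summarization"]
print(summarize(sents, keys, k=2))
```

Sen-Rich would rank individual sentences this way; Doc-Rich would instead aggregate such scores per document to choose a centroid document first.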
Design of optimal search engine using text summarization through artificial i… (TELKOMNIKA Journal)
Natural language processing (NLP) is a trending research area that allows developers to create human-computer interactions; it integrates artificial intelligence, computer science and computational linguistics. NLP research focuses on creating devices or machines that operate on a single human command, enabling bots that take instructions from mobile devices to control physical devices through speech tagging. In our paper, we design a search engine that not only displays data according to the user query but also gives a detailed display of the content or topic the user is interested in, using summarization. We find that the designed search engine has optimal response time for user queries when analyzed with the number of transactions as input. The performance analysis also shows that the text summarization method is an efficient way to improve response time in search engine optimization.
Multi-document summarization system using fuzzy logic and genetic algorithm (IAEME Publication)
In recent times, the generation of multi-document summaries has gained a lot of attention among researchers. Mostly, text summarization uses sentence extraction, where the salient sentences in the multiple documents are extracted and presented as a summary. In our proposed system, we have developed a sentence-extraction-based automatic multi-document summarization system that employs fuzzy logic and a genetic algorithm (GA). First, different features are used to identify the significance of sentences, such that each sentence in the documents is assigned a feature score.
A classical or traditional information system provides an answer only after the user submits a complete
query. Presently, almost all relational database systems rely on queries whose syntax and semantics are
completely defined in order to access data, but we often want to use vague terms in a query. The main
objective of a database management system is to provide an environment that is both convenient and
efficient for storing and retrieving information. The recent trend of supporting auto-complete is a first
step toward coping with this problem. We can design both classical and fuzzy databases and effectively
use fuzzy queries on them. Fuzzy databases are developed to manipulate incomplete, unclear and vague
data such as "low", "fast", "very high", and "about". The primary focus of fuzzy logic is natural
language. This paper gives users the flexibility to query a database using natural language by
implementing "interactive fuzzy search", a framework that permits the user to explore the data as they
type, even in the presence of minor errors. The paper applies fuzzy queries to a relational database so
that precise results can be obtained, as well as output for the uncertain terms we commonly use, based
on a membership function.
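A fuzzy query term such as "fast" can be modeled with a membership function that maps a crisp value to a degree in [0, 1]; the ramp shape, thresholds, and sample rows below are illustrative assumptions:

```python
def fast_membership(speed_kmh, low=60.0, high=100.0):
    """Degree to which a speed counts as 'fast': 0 below `low`,
    1 above `high`, linear in between (a simple ramp membership)."""
    if speed_kmh <= low:
        return 0.0
    if speed_kmh >= high:
        return 1.0
    return (speed_kmh - low) / (high - low)

# Fuzzy selection: keep rows whose membership exceeds a cut (an alpha-cut).
rows = [("van", 55), ("sedan", 80), ("sports car", 120)]
fast_rows = [(name, fast_membership(v)) for name, v in rows
             if fast_membership(v) > 0.3]
print(fast_rows)
```

Unlike a crisp `WHERE speed > 100`, this returns partially matching rows with their degrees, which is the behavior a fuzzy query over a relational database aims for.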
An information retrieval system is an effective process that helps a user trace relevant information using natural language processing (NLP). In this research paper, we present an algorithmic Bengali Information Retrieval System (BIRS) that is mathematically and statistically significant. The paper demonstrates two algorithms for lemmatizing Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), and compares them with edit distance for exact lemmatization. We present a Bengali anaphora resolution system using Hobbs’ algorithm to obtain the correct expression of information. As question-answering algorithms, TF-IDF and cosine similarity are developed to find accurate answers in the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that ease the implementation of our task. We have also developed corpora of Bengali root words, synonyms and stop words, and gathered 672 articles from the popular Bengali newspaper ‘The Daily Prothom Alo’ as our source information. To test the system, we created 19,335 questions from this information and obtained 97.22% accurate answers.
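Edit distance, used above as the baseline for exact lemmatization, counts the minimum insertions, deletions and substitutions needed to turn one string into another; a standard dynamic-programming sketch (the example words are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))            # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,      # delete from a
                            curr[j - 1] + 1,  # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(edit_distance("running", "run"))  # 4 deletions
```

A lemmatizer can pick, among candidate root words, the one with the smallest distance to the surface form, which is how an edit-distance baseline for lemmatization is typically set up.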
Sentence Extraction Based on Sentence Distribution and Part of Speech Tagging… (TELKOMNIKA Journal)
Automatic multi-document summarization needs to find representative sentences not only by
sentence distribution, to select the most important sentences, but also by how informative a term is in a
sentence. Sentence distribution is suitable for obtaining important sentences by finding frequent,
well-spread words in the corpus, but it ignores the grammatical information that indicates informative
content. The presence or absence of informative content in a sentence can be indicated by grammatical
information carried by part-of-speech (POS) labels. In this paper, we propose a new sentence weighting
method incorporating sentence distribution and POS tagging for multi-document summarization.
Similarity-based Histogram Clustering (SHC) is used to cluster sentences in the data set, and clusters
are ordered by cluster importance to determine the important ones. Sentence extraction based on
sentence distribution and POS tagging is introduced to extract the representative sentences from the
ordered clusters. The results of the experiment on the Document Understanding Conferences (DUC)
2004 data are compared with those of the sentence distribution method; our proposed method achieved
better results, improving by 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
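Combining a distribution score with a POS-based informativeness score could be sketched as below, assuming tokens are already POS-tagged; the tag set, blend weight, and linear combination are illustrative assumptions, not the paper's formula:

```python
INFORMATIVE_TAGS = {"NOUN", "VERB", "ADJ"}  # assumed informative POS labels

def pos_score(tagged_sentence):
    """Fraction of tokens carrying an informative POS label."""
    if not tagged_sentence:
        return 0.0
    hits = sum(1 for _, tag in tagged_sentence if tag in INFORMATIVE_TAGS)
    return hits / len(tagged_sentence)

def sentence_weight(distribution_score, tagged_sentence, alpha=0.5):
    """Blend a corpus-level sentence distribution score with POS
    informativeness; alpha controls the trade-off."""
    return alpha * distribution_score + (1 - alpha) * pos_score(tagged_sentence)

tagged = [("markets", "NOUN"), ("rallied", "VERB"), ("sharply", "ADV")]
print(sentence_weight(0.8, tagged))
```

Sentences with high weight under this blend would then be extracted from each ordered cluster as its representative.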
Text mining seeks to discover new, previously unknown or hidden information by automatically
extracting it from various written resources. Applying knowledge discovery methods to unstructured
text is known as knowledge discovery in text, text data mining, or simply text mining. Most techniques
used in text mining are founded on the statistical study of a term, either a word or a phrase. Different
text mining algorithms have been used in previous work. For example, the single-link algorithm and
self-organizing maps (SOM) offer an approach for visualizing high-dimensional data and are useful
tools for processing textual data based on projection methods. Genetic and sequential algorithms
provide multiscale representation of datasets and are fast to compute, with low CPU time, based on
Isolet reduced subsets in unsupervised feature selection. We propose a vector space model and
concept-based analysis algorithm that will improve text clustering quality and achieve better text
clustering results. We expect the proposed algorithm to behave well in terms of robustness and
stability with respect to the formation of the neural network.
Conceptual framework for abstractive text summarization (IJNLC)
As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input, processes it by building a rich semantic graph, and then reduces this graph to generate the final summary.
A hybrid approach for text summarization using semantic latent Dirichlet allo… (IJECEIAES)
Automatic text summarization generates a summary containing sentences that reflect the essential and relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach to generating text summaries that combines extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As we increase the number of concepts and words in the summary, the ROUGE precision and F1 scores of our proposed model improve.
Efficient multi-document summary generation using neural network (INFOGAIN Publication)
In the last few years, online information has grown tremendously on the World Wide Web and on users’ desktops, so it has gained much attention in the field of automatic text summarization. Text mining has become a significant research field as it produces valuable data from large amounts of unstructured text. Summarization systems make it possible to search the important keywords of a text, so the consumer spends less time reading the whole document. The main objective of a summarization system is to generate a new form that expresses the key meaning of the contained text. This paper studies various existing techniques along with the need for novel multi-document summarization schemes, motivated by the growing need to provide high-quality summaries in a very short time. In the proposed system, users can quickly and easily access well-formed summaries that express the key meaning of the contained text. The primary focus of this paper lies with the f_β-optimal merge function, a recently presented function that uses the weighted harmonic mean to find a balance between precision and recall. The proposed system utilizes bisecting k-means clustering to improve runtime and neural networks to improve the accuracy of the summary generated by the NEWSUM algorithm.
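The f_β score mentioned above is the weighted harmonic mean of precision and recall: β > 1 favors recall, β < 1 favors precision, and β = 1 gives the familiar F1. A small sketch (the example values are illustrative):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 1.0))           # balanced F1
print(f_beta(0.5, 1.0, beta=2))   # recall-weighted F2
```

An f_β-optimal merge function would choose, among candidate summary merges, the one maximizing this score for the chosen β.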
Machine learning for text document classification-efficient classification ap… (IAES IJAI)
Numerous alternative methods for text classification have been created because of the increase in the amount of online text available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with the estimated values provided by conventional classifiers such as multinomial naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for that category enhances the classifier’s performance. This approach yields a text document categorization method that is both efficient and effective, along with methods for determining the proper relationship between a set of words in a document and its categorization.
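The combination described, blending a cosine-similarity score against a category with a conventional classifier's probability estimate, could be sketched as a linear interpolation; the interpolation weight and the example scores are illustrative assumptions, not the paper's exact scheme:

```python
def combined_score(similarity, classifier_prob, weight=0.5):
    """Linear interpolation of cosine similarity to a category and a
    conventional classifier's (e.g., MNB) probability estimate."""
    return weight * similarity + (1 - weight) * classifier_prob

def classify(doc_scores, weight=0.5):
    """Pick the category with the highest combined score.
    doc_scores: {category: (cosine_similarity, classifier_prob)}."""
    return max(doc_scores,
               key=lambda c: combined_score(*doc_scores[c], weight))

scores = {"sports":   (0.82, 0.60),
          "politics": (0.40, 0.75)}
print(classify(scores))
```

Here the similarity evidence outweighs the classifier's slight preference for "politics", which is exactly the kind of correction the combined approach is meant to provide.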
Summarization of software artifacts is an ongoing field of research among the software engineering community due to the benefits that summarization provides like saving of time and efforts in various software engineering tasks like code search, duplicate bug reports detection, traceability link recovery, etc. Summarization is to produce short and concise summaries. The paper presents the review of the state of the art of summarization techniques in software engineering context. The paper gives a brief overview to the software artifacts which are mostly used for summarization or have benefits from summarization. The paper briefly describes the general process of summarization. The paper reviews the papers published from 2010 to June 2017 and classifies the works into extractive and abstractive summarization. The paper also reviews the evaluation techniques used for summarizing software artifacts. The paper discusses the open problems and challenges in this field of research. The paper also discusses the future scopes in this area for new researchers.
Summarization of software artifacts is an ongoing field of research among the software engineering community due to the benefits that summarization provides like saving of time and efforts in various software engineering tasks like code search, duplicate bug reports detection, traceability link recovery, etc. Summarization is to produce short and concise summaries. The paper presents the review of the state of the art of summarization techniques in software engineering context. The paper gives a brief overview to the software artifacts which are mostly used for summarization or have benefits from summarization. The paper briefly describes the general process of summarization. The paper reviews the papers published from 2010 to June 2017 and classifies the works into extractive and abstractive summarization. The paper also reviews the evaluation techniques used for summarizing software artifacts. The paper discusses the open problems and challenges in this field of research. The paper also discusses the future scopes in this area for new researchers.
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHijcsit
The huge volume of text documents available on the internet has made it difficult to find valuable
information for specific users. In fact, the need for efficient applications to extract interested knowledge
from textual documents is vitally important. This paper addresses the problem of responding to user
queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a
cluster-based information retrieval framework was proposed in this paper, in order to design and develop
a system for analysing and extracting useful patterns from text documents. In this approach, a preprocessing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector
Space Model (VSM) is performed to represent the dataset. The system was implemented through two main
phases. In phase 1, the clustering analysis process is designed and implemented to group documents into
several clusters, while in phase 2, an information retrieval process was implemented to rank clusters
according to the user queries in order to retrieve the relevant documents from specific clusters deemed
relevant to the query. Then the results are evaluated according to evaluation criteria. Recall and Precision
(P@5, P@10) of the retrieved results. P@5 was 0.660 and P@10 was 0.655.
The huge volume of text documents available on the internet has made it difficult to find valuable
information for specific users. In fact, the need for efficient applications to extract interested knowledge
from textual documents is vitally important. This paper addresses the problem of responding to user
queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a
cluster-based information retrieval framework was proposed in this paper, in order to design and develop
a system for analysing and extracting useful patterns from text documents. In this approach, a pre-
processing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector
Space Model (VSM) is performed to represent the dataset. The system was implemented through two main
phases. In phase 1, the clustering analysis process is designed and implemented to group documents into
several clusters, while in phase 2, an information retrieval process was implemented to rank clusters
according to the user queries in order to retrieve the relevant documents from specific clusters deemed
relevant to the query. Then the results are evaluated according to evaluation criteria. Recall and Precision
(P@5, P@10) of the retrieved results. P@5 was 0.660 and P@10 was 0.655.
A statistical model for gist generation a case study on hindi news articleIJDKP
Every day, huge number of news articles are reported and disseminated on the Internet. By generating gist
of an article, reader can go through the main topics instead of reading the whole article as it takes much
time for reader to read the entire content of the article. An ideal system would understand the document
and generate the appropriate theme(s) directly from the results of the understanding. In the absence of
natural language understanding system, it is required to design an appropriate system. Gist generation is a
difficult task because it requires both maximizing text content in short summary and maintains
grammaticality of the text. In this paper we present a statistical approach to generate a gist of a Hindi
news article. The experimental results are evaluated using the standard measures such as precision, recall
and F1 measure for different statistical models and their combination on the article before pre-processing
and after pre-processing.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Elevating forensic investigation system for file clusteringeSAT Journals
Abstract In computer forensic investigation, thousands of files are usually surveyed. Much of the data in those files consists of formless manuscript, whose investigation by computer examiners is very tough to accomplish. Clustering is the unverified organization of designs that is data items, remarks, or feature vectors into groups (clusters). To find a noble clarification for this automated method of analysis are of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the detection of new and valuable information from the documents under investigation. This paper is going to present an tactic that applies text clustering algorithms to forensic examination of computers seized in police investigations using multithreading technique for data clustering. Keywords- Clustering, forensic computing, text mining, multithreading.
Similar to A Newly Proposed Technique for Summarizing the Abstractive Newspapers’ Articles based on Deep Learning (20)
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
DOI: 10.5121/mlaij.2020.7401
A NEWLY PROPOSED TECHNIQUE FOR SUMMARIZING THE ABSTRACTIVE NEWSPAPERS’ ARTICLES BASED ON DEEP LEARNING

Sherif Kamel Hussein1 and Joza Nejer AL-Otaibi2

1 Associate Professor, Department of Communications and Computer Engineering, October University for Modern Sciences and Arts, Giza, Egypt
1,2 Head of Computer Science Department, Arab East Colleges for Graduate Studies, Riyadh, KSA
ABSTRACT
In this new era, where tremendous amounts of information are available on the internet, it is of the utmost importance to provide improved mechanisms to extract information quickly and efficiently. It is very difficult for human beings to manually summarize large text documents. There is therefore a twofold problem: searching for relevant documents among the many available, and absorbing the relevant information from them. To solve these two problems, automatic text summarization is very much necessary. Text summarization is the process of identifying the most important, meaningful information in a document or set of related documents and compressing it into a shorter version that preserves the overall meaning. More specifically, Abstractive Text Summarization (ATS) is the task of constructing summary sentences by merging facts from different source sentences and condensing them into a shorter representation while preserving information content and overall meaning. This paper introduces a newly proposed technique for summarizing newspaper articles abstractively, based on deep learning.
KEYWORDS
Abstractive Text Summarization, Natural Language Processing, Extractive Summarization, Automatic Text
Summarization, Machine Learning.
1. INTRODUCTION
Nowadays, with the dramatic growth of the Internet, there is an explosion in the amount of text data from a variety of sources. This expanding availability of documents has demanded exhaustive research in the area of automatic text summarization [1]. The volume of text is an invaluable source of information and knowledge, which needs effective summarization to be useful. Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document [1]. Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online and to discover and consume the relevant information faster [2]. Text summarization methods can be classified into extractive and abstractive summarization [3]. An extractive summarization method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form. Abstractive summarization is based on understanding the main concepts in a document and then expressing those concepts in clear natural language [3].
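To make the extractive/abstractive distinction concrete, the extractive side can be illustrated with a minimal frequency-based sentence selector. This is a didactic sketch, not the paper's method; the stopword list and scoring rule are assumptions:

```python
from collections import Counter
import re

STOPWORDS = frozenset({"the", "a", "an", "is", "of", "and", "in", "to"})

def extractive_summary(text, num_sentences=2, stopwords=STOPWORDS):
    """Score each sentence by the corpus frequency of its content words
    and return the top-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
    freq = Counter(words)

    def score(sentence):
        # A sentence's score is the summed frequency of its content words.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in stopwords)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

An abstractive system, by contrast, would generate new sentences rather than copy these verbatim.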
There are many reasons why automatic text summarization is useful [4]:
• Summaries reduce reading time.
• When researching documents, summaries make the selection process easier.
• Automatic summarization improves the effectiveness of indexing.
• Automatic summarization algorithms are less biased than human summarizers.
• Personalized summaries are useful in question-answering systems as they provide personalized information.
• Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents that they are able to process.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP). Skyhoshi, a U.S.-based machine learning expert with 13 years of experience who currently teaches people his skills, said that “the technique has proved to be critical in quickly and accurately summarizing voluminous texts, something which could be expensive and time consuming if done without machines.” Machine learning models are usually trained to understand documents and distill the useful information before extracting the required summarized texts. With such a large amount of data circulating in the digital space, there is a need to develop machine learning algorithms that can automatically shorten longer texts and deliver accurate summaries that fluently convey the intended messages. Applying text summarization reduces reading time, accelerates the process of researching for information, and increases the amount of information that can fit in an area [5]. In this study, a machine-learning approach will be used to construct a model that produces abstractive text summarization, implemented in the Python programming language.
1.1. Main Steps for Text Summarization
There are three main steps for summarizing documents: topic identification, interpretation, and summary generation [3]. The first step, topic identification, identifies the most prominent information in the text. There are different techniques for topic identification, such as position, cue phrases, and word frequency; methods based on the position of phrases are the most useful for topic identification. The second step is interpretation, in which different subjects are fused to form a general content. Finally, the third step is summary generation, in which the system uses text generation methods.
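Two of the topic-identification techniques named above, position and cue phrases, can be sketched as a simple sentence scorer. This is an illustrative toy; the cue-phrase list and the weights are assumptions, not taken from the paper:

```python
CUE_PHRASES = ("in conclusion", "in summary", "importantly", "significantly")

def topic_scores(sentences, position_weight=1.0, cue_weight=2.0):
    """Score each sentence for topic identification using position
    (earlier sentences score higher, reflecting lead bias in news text)
    and cue phrases (a bonus when a signalling phrase appears)."""
    n = len(sentences)
    scores = []
    for i, s in enumerate(sentences):
        score = position_weight * (n - i) / n          # position component
        if any(cue in s.lower() for cue in CUE_PHRASES):
            score += cue_weight                        # cue-phrase bonus
        scores.append(score)
    return scores
```

A word-frequency component, the third technique mentioned, could be added to the same score in the obvious way.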
2. LITERATURE REVIEW
2.1. Abstractive Summarization with the Aid of Extractive [6]
This technique proposed a general framework for abstractive summarization which incorporates extractive summarization as an auxiliary task. It is composed of a shared hierarchical document encoder, an attention-based decoder for abstractive summarization, and an extractor for sentence-level extractive summarization. Learning these two tasks jointly with the shared encoder allows the model to better capture the semantics in the document. Furthermore, experiments on the Cable News Network (CNN)/Daily Mail dataset demonstrate that both the auxiliary task and the attention constraint contribute to improving the performance significantly, and their model is comparable to
the state-of-the-art abstractive models. In particular, the general framework of their proposed model is composed of five components: a word-level encoder that encodes the sentences word by word independently, a sentence-level encoder that encodes the document sentence by sentence, a sentence extractor that makes a binary classification for each sentence, a hierarchical attention mechanism that calculates the word-level and sentence-level context vectors for the decoding steps, and a decoder that decodes the output word sequence with a beam-search algorithm.
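The beam-search decoding step can be illustrated with a generic, framework-free sketch. In the real model the next-token scores come from a neural decoder; here that is stubbed by a caller-supplied log-probability function, which is an assumption for illustration only:

```python
import math

def beam_search(next_token_logprobs, start, end, beam_width=3, max_len=10):
    """Generic beam search: keep the `beam_width` highest-scoring partial
    sequences at each step.  `next_token_logprobs(seq)` must return a dict
    mapping candidate tokens to log-probabilities (a stand-in for the
    decoder network)."""
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:               # finished hypotheses stay as-is
                candidates.append((seq, score))
                continue
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0][0]
```

Because it keeps several hypotheses alive, beam search can prefer a sequence whose first token is not the greedy choice, which is why abstractive decoders use it.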
2.2. Toward Abstractive Summarization using Semantic Representations [7]
This technique presented a novel abstractive summarization framework that draws on the recent development of a treebank for the Abstract Meaning Representation (AMR). In this framework, the source text is parsed into a set of AMR graphs, the graphs are transformed into a summary graph, and then text is generated from the summary graph. The authors focus on the graph-to-graph transformation that reduces the source semantic graph into a summary graph, making use of an existing AMR parser and assuming the eventual availability of an AMR-to-text generator. The framework is data-driven, trainable, and not specifically designed for a particular domain. Experiments based on gold-standard AMR annotations and system parses show promising results.
2.3. Abstractive Document Summarization Via Bidirectional Decoder [8]
This technique introduced abstractive document summarization via a bidirectional decoder. It is based on the sequence-to-sequence architecture with an attention mechanism, which is widely used in abstractive text summarization and has achieved a series of remarkable results; however, this method may suffer from error accumulation. The authors proposed a summarization model using a bidirectional decoder (BiSum), in which the backward decoder provides a reference for the forward decoder. They used an attention mechanism at both the encoder and backward decoder sides to ensure that the summary generated by the backward decoder can be understood. Experimental results show that the work can produce higher-quality summaries on the Chinese dataset Transport Topics News (TTNews) and the English dataset Cable News Network (CNN)/Daily Mail.
2.4. A Comprehensive Survey on Extractive and Abstractive Techniques for Text
Summarization [9]
In this technique, the generation of the summary is done by understanding the whole content and representing it in the system's own terms. This is achieved using a Recurrent Neural Network consisting of Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM) cells. The author highlights the recent abstractive techniques used for text summarization and provides information on the standardized datasets, in addition to the testing methodologies used to evaluate the performance of such systems. For instance, the structured approach fundamentally encodes the most indispensable information from the document(s) through mental blueprints like layouts, extraction principles, and elective structures such as tree, ontology, rule, and graph structures. In the tree-based approach, sentences from multiple documents are first clustered according to the themes they represent; these themes are then re-ranked, selected, and ordered according to their significance. The formed syntactic trees are subsequently merged using Fusion Lattice Computation to assimilate information from the different themes, and linearization is carried out to form sentences from the merged tree using tree traversal. On the other hand, the graph-based approach uses the graph data structure for language representation. Sentence formation is subjected to constraints, such as the requirement that each sentence contain a subject, verb, and predicate. Along with this, a compendium is used for linguistic and summary generation purposes. The general
notion in extractive text summarization is to weight the sentences of a document as a function of high-frequency words, disregarding the very-high-frequency common words. The location method exploits the idea of identifying important information in certain parts of the context. Sentence extraction can also be performed using neural network architectures: one such strategy is a classifier that navigates the document sequentially and chooses whether to include each sentence in the summary, while other extractive techniques pick sentences in an arbitrary way. When extractive text summarization is conducted with a neural model, the advantage over traditional, purely mathematical and Natural Language Processing (NLP) techniques is that more context is understood. Such models improve the depiction of sentences by combining the important sentences to shorten the size while maintaining the semantics at the same time.
2.5. Neural Extractive Summarization with Side Information [10]
This technique was implemented by Narayan et al. to achieve summarization using sentence extraction. The model was based on an encoder-decoder architecture consisting of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). The idea behind it was to acquire representations of sentences by applying single-layer convolutional neural networks over sequences of word embeddings, and to use the recurrent neural network to build groupings of sentences. The CNNs were used to capture the important patterns among the sentences in the article and to encode the sequence of words in each sentence, while the document encoder processed the sequence of sentences to obtain the document representation. Abstractive summarization has also been achieved using a sequence-to-sequence encoder-decoder model, whose most famous application is in Neural Machine Translation. It essentially preserves dependencies through the LSTM cells that are used; these cells are the most atomic unit of an encoder and decoder and are a better way to address the problem of dependency. A Time Delay Neural Network (TDNN) together with max-pooling layers is used to achieve this: the TDNN passes the output of the bottom layer to the next input layer, which helps preserve dependencies over long sentences.
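A single LSTM cell, the "atomic unit" of the encoder and decoder described above, can be sketched as a NumPy forward pass. This is a didactic sketch only; the parameter names and shapes are assumptions, not the cited model's code:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, params):
    """One forward step of an LSTM cell.  `params` holds the weights and
    biases of the input (i), forget (f), output (o), and candidate (g)
    gates, each operating on the concatenation [x, h_prev]."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.concatenate([x, h_prev])
    i = sigmoid(params["Wi"] @ z + params["bi"])   # input gate
    f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate
    o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate
    g = np.tanh(params["Wg"] @ z + params["bg"])   # candidate cell state
    c = f * c_prev + i * g      # cell state carries long-range information
    h = o * np.tanh(c)          # hidden state emitted at this step
    return h, c
```

The multiplicative forget gate is what lets the cell state carry information across many steps, which is the dependency-preserving property the text refers to.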
2.6. Abstractive Multi-Document Text Summarization using a Genetic Algorithm
[11]
This technique introduced abstractive multi-document text summarization using a genetic algorithm. It is based on Multi-Document Text Summarization (MDTS), which consists of generating an abstract from a group of two or more documents that represents only the most important information of all the documents. In this article, the authors proposed a new MDTS method based on a Genetic Algorithm (GA). The fitness function is calculated considering two text features: sentence position and coverage. They proposed a binary coding representation, together with selection, crossover, and mutation operators, to improve on the state-of-the-art results. The proposed method consists of three steps. In the first step, the documents of the collection are chronologically ordered, and the original text is adapted to the input format of the GA by separating it into sentences. Text pre-processing is also applied to the collection of documents: first, the text is divided into words separated by commas, and then tags are placed in the text to differentiate quantities, emails, and other entities.
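The GA loop described here (binary coding, selection, crossover, mutation) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the fitness function combining sentence position and coverage is left to the caller, and the operator choices and rates are assumptions:

```python
import random

def ga_select_sentences(num_sentences, fitness, generations=50,
                        pop_size=20, mutation_rate=0.1, seed=0):
    """Binary-coded genetic algorithm: each chromosome is a bit vector
    where bit i means "include sentence i in the summary".  `fitness(bits)`
    is supplied by the caller (e.g. combining position and coverage)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(num_sentences)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]             # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, num_sentences)    # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(num_sentences):           # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

Because the top half of each generation survives unchanged, the best fitness found never decreases across generations.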
3. THE NEWLY PROPOSED SYSTEM
3.1. Overview
The newly proposed system for summarization based on machine learning is constructed by extracting key phrases from sentences and then using a deep learning model to learn the collocation of phrases.
3.2. Block Diagram of the Newly Proposed System
The process flow of the proposed system is divided into three steps: text pre-processing, phrase extraction and text summary generation, as shown in Figure 1.
Figure 1. System block diagram
The data pre-processing phase includes word segmentation, morphological reduction and coreference resolution. After word segmentation, morphological reduction is applied to merge phrases that share the same semantics into one. The original text contains pronouns and demonstratives that would cause ambiguity during model training, so coreference resolution is applied before the model is trained. Phrase extraction includes three sub-steps: phrase acquisition, phrase refinement and phrase combination. A phrase extraction method is used to acquire phrases. Phrase refinement removes redundancy in the semantics or syntactic structure of the extracted phrases before training the model. The purpose of phrase combination is to merge different phrases with similar semantics into one phrase. For text generation, a combined long short-term memory (LSTM) and convolutional neural network (CNN) model is applied; the LSTM-CNN model is trained with deep learning, and after training, the summary is generated. The LSTM-ATS model encoder reads the phrase sequence and encodes it into an internal representation, and the LSTM-ATS model decoder reads the encoded phrase sequence from the encoder and generates the output sequence.
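The three-phase flow can be expressed as a chain of functions. Every stage body below is a hypothetical placeholder standing in for the real component described in the following sections:

```python
def preprocess(text):
    """Data pre-process stub: word segmentation, morphological reduction and
    coreference resolution would run here; this toy version just splits on spaces."""
    return text.split()

def extract_phrases(tokens, window=2):
    """Phrase process stub: acquisition, refinement and combination collapsed
    into a non-overlapping sliding window over the token stream."""
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), window)]

def generate_text(phrases, top_n=2):
    """Text generation stub standing in for the LSTM-ATS encoder/decoder."""
    return " ".join(phrases[:top_n])

def summarize(text):
    # the pipeline: pre-process -> phrase process -> text generation
    return generate_text(extract_phrases(preprocess(text)))

summary = summarize("a b c d e f")   # toy input
```

The point of the sketch is the interface between the three phases: each stage consumes the previous stage's output, exactly as in the block diagram.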
[Figure 1 diagram: Original text → Data pre-process (Word segmentation → Morphological reduction → Coreference resolution) → Processed text → Phrase process (Phrase acquisition → Phrase refinement → Phrase combination) → Phrase sequence → Text generation (LSTM-ATS model encoding → LSTM-ATS model decoding) → Text summary.]
3.3. Flow Chart of the Newly Proposed System
The proposed system aims to develop an Arabic abstractive text summarization system that generates a human-like short summary from a long Arabic text. It is very difficult and time-consuming for a human to manually summarize large text documents. Therefore, this system helps solve this issue by enabling the machine to learn and understand the text and then summarize it into shorter sentences. The developed system uses a deep learning algorithm called Long Short-Term Memory (LSTM) to learn and generate the abstractive summary.
As shown in Figure 2, the user first enters a name and password. The system then checks whether the user is registered. After validating the user, the text is uploaded. The uploaded text goes through the word segmentation stage, followed by morphological reduction and finally coreference resolution. After this phase, the processed text is displayed to the user, and the process moves through phrase acquisition, followed by phrase refinement and finally phrase combination. The resulting phrase sequence is then fed into the LSTM-ATS model for encoding and decoding. Finally, the system generates the text summary and displays it to the user.
Figure 2. System flow chart
[Flow: Start → Enter user name and password → registered? (No → Registration form) → valid? (No → re-enter) → Upload text → Word segmentation → Morphological reduction → Coreference resolution → Processed text → Phrase acquisition → Phrase refinement → Phrase combination → Phrase sequence → LSTM-ATS model encoding → LSTM-ATS model decoding → Text summary → Stop.]
3.3.1. Word segmentation
Word segmentation is the process of dividing written text into words, determining the word boundaries within sentences, and adding spaces between words.
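For space-delimited text, a minimal segmenter can be written with one regular expression; this is only an illustrative stand-in for the system's Arabic-capable segmenter:

```python
import re

def segment(text):
    """Split text into word tokens and individual punctuation marks
    (illustrative only; \\w matches Unicode letters, so Arabic works too)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = segment("Hello, world!")
```

Real segmenters must also handle clitics and attached particles, which a single regular expression cannot do.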
3.3.2. Morphological reduction
When phrases have the same semantics and differ only in morphology, morphological reduction is used to merge these phrases into one.
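As a toy illustration of the idea, inflected variants can be collapsed to a shared key by stripping common suffixes; the English suffix list below is an assumption chosen for demonstration, since a real system would use a proper stemmer (e.g. an Arabic light stemmer):

```python
SUFFIXES = ("ing", "ed", "es", "s")   # illustrative English suffixes

def reduce_form(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def merge_variants(words):
    """Group words sharing a reduced form into one entry."""
    groups = {}
    for w in words:
        groups.setdefault(reduce_form(w), []).append(w)
    return groups

groups = merge_variants(["walks", "walked", "walking", "walk"])
```

All four variants collapse to the single key "walk", which is exactly the merging behaviour the reduction step needs.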
3.3.3. Coreference resolution
Coreference resolution identifies and clusters the expressions and noun phrases that refer to the same entity in the text. During model training, ambiguities may occur due to the many pronouns in the text, so coreference resolution is applied before the model is trained, and the processed text is then displayed.
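A deliberately naive sketch of the idea replaces each pronoun with the most recently seen capitalized token; real coreference resolution requires a full NLP pipeline, so this only illustrates why the step removes ambiguity before training:

```python
PRONOUNS = {"he", "she", "it", "they"}

def resolve_pronouns(tokens):
    """Replace pronouns with the last capitalized (named-entity-like) token.
    Purely illustrative: this heuristic fails on any non-trivial text."""
    last_entity = None
    out = []
    for tok in tokens:
        if tok.lower() in PRONOUNS and last_entity:
            out.append(last_entity)        # substitute the antecedent
        else:
            out.append(tok)
            if tok[:1].isupper():
                last_entity = tok          # remember a candidate antecedent
    return out

resolved = resolve_pronouns(["Ali", "said", "he", "agreed"])
```

After substitution, the training data contains explicit entity mentions instead of ambiguous pronouns.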
3.3.4. Phrase acquisition
Phrase acquisition takes place after the processed text is obtained, in order to extract the acquired phrases.
3.3.5. Phrase refinement
Phrase refinement is the process of filtering the extracted phrases to guarantee correct semantics and syntactic structure before training the model.
3.3.6. Phrase combination
Phrase combination is used to reduce the redundancy of phrases: it combines different phrases with similar semantics into one phrase and outputs the phrase sequence.
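One simple way to combine near-duplicate phrases is to drop any phrase whose word-set overlap (Jaccard similarity) with an already kept phrase exceeds a threshold; the 0.5 threshold here is an illustrative assumption, not the system's actual criterion:

```python
def jaccard(a, b):
    """Word-set overlap between two phrases, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def combine_phrases(phrases, threshold=0.5):
    """Greedily keep a phrase only if it is not too similar to one already kept."""
    kept = []
    for p in phrases:
        if all(jaccard(p, q) < threshold for q in kept):
            kept.append(p)
    return kept

combined = combine_phrases(["deep learning model",
                            "the deep learning model",
                            "genetic algorithm"])
```

The two near-duplicate phrases collapse into one, while the unrelated phrase survives.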
3.3.7. LSTM-ATS model encoding/ decoding
The LSTM-CNN model is trained using deep learning. After the model is trained, the summary is generated.
4. SYSTEM DESIGN
4.1. Hardware Requirements:
• Personal computer with the following configuration:
1. Processor: Intel(R) Core(TM) i7-8550U, a recent processor that supports developing the proposed system.
2. Random-access memory (RAM): 8.00 GB.
3. Hard disk: capacity of 224 GB.
4.2. Software Requirements
• Operating system (OS): Windows 10 is the operating system used with the proposed system [12].
• Python 3: a modern, general-purpose, high-level programming language. It is available for free and runs on every current platform. It is also a dynamic, object-oriented language that can be used for many kinds of software development, and it offers strong support for integration with other languages and tools [13].
• Google Colab: Colaboratory is a free Jupyter notebook environment provided by Google that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share analyses, and access powerful computing resources. In addition, Google Colab provides a built-in version control system using Git, and it is easy to create notes and documentation, including figures and tables, using Markdown. It runs on Google servers using virtual machines [14].
5. IMPLEMENTATION AND TESTING
The implementation is done using Google Colab and the Python programming language. The implementation of the code is based on four steps, as shown in Figure 3.
Figure 3. Implementation steps
5.1. Import all necessary libraries
In the first step, all necessary libraries are imported: nltk, the nltk.corpus stop-word lists, nltk.cluster.util and cosine_distance. Nltk is a natural-language toolkit for Python; the corpus stop-word lists are used to filter out punctuation marks, commas and other common tokens during the implementation.
The cluster.util module is used to call the cosine_distance function, which measures the degree of similarity between sentences, as depicted in the code segment shown in Figure 4.
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
Figure 4. Importing all necessary libraries
10. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
10
5.2. Inserting the Article
Figure 5 shows, the article title that will be entered to the system.
article_url = 'https://sabq.org/kRzGHX'
title = 'تبديل المواليد بين الحماية وسوء الإدارة'
article='
Figure 5. Insertion of the article
5.3. Generation of Clean Sentences
Figure 6 shows the code segment that reads the article, replaces ".." with ".", pads each period with spaces, and then splits the text into sentences at each period.
def read_article(article):
    article = article.replace("..", " . ").replace(". ", " . ").replace(" .", " . ")
    article = article.split(" . ")
    sentences = []
    for sentence in article:
        if sentence == '':
            continue
        print(sentence)
        sentence = sentence.replace("..", " . ")
        sentences.append(sentence.replace(".", " . ").split(" "))
    return sentences
Figure 6. Generation of clean sentences
5.4. Similarity Matrix
Cosine similarity is used to find the similarity between sentences, as shown in the code segment depicted in Figure 7.
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
    all_words = list(set(sent1 + sent2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
    return 1 - cosine_distance(vector1, vector2)
Figure 7. Similarity matrix
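To make the computation in Figure 7 concrete, the same bag-of-words cosine similarity can be written out directly over two hypothetical count vectors:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# shared vocabulary: ["the", "cat", "sat", "ran"]
sent1 = [1, 1, 1, 0]   # "the cat sat"
sent2 = [1, 1, 0, 1]   # "the cat ran"
score = cosine_similarity(sent1, sent2)   # 2 / (sqrt(3) * sqrt(3)) = 2/3
```

Two of the three words overlap, giving a similarity of 2/3; `cosine_distance` in the figure returns 1 minus this value.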
11. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
11
5.5. Generating Summary Method
The generate_summary method keeps calling the other helper functions to drive the summarization pipeline, as depicted in the code segments shown in Figure 8 and Figure 9.
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # ignore if both are the same sentence
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(
                sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
Figure 8. Building the similarity matrix
def generate_summary(file_name, top_n=5):
    nltk.download("stopwords")
    stop_words = stopwords.words('arabic')
    summarize_text = []
    # Step 1 - Read the text and split it
    sentences = read_article(file_name)
    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
    # Step 3 - Rank sentences in the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    # Step 4 - Sort the ranks and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("\nIndexes of top ranked_sentence order are ", ranked_sentence)
    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))
    # Step 5 - Output the summarized text
    print("\nSummarize Text:\n", ". ".join(summarize_text))
Figure 9. Code for generating the summary method
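Step 3 above ranks sentences with PageRank over the similarity graph. A minimal power-iteration version of what nx.pagerank computes, applied to a hypothetical 3-sentence similarity matrix, looks like this:

```python
import numpy as np

def pagerank(sim, d=0.85, tol=1e-8, max_iter=100):
    """Power iteration on a row-normalized similarity matrix."""
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # avoid division by zero for isolated rows
    M = sim / row_sums                   # transition matrix
    r = np.full(n, 1.0 / n)              # start from a uniform rank vector
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * M.T @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# hypothetical pairwise similarities between 3 sentences
sim = np.array([[0.0, 0.8, 0.1],
                [0.8, 0.0, 0.3],
                [0.1, 0.3, 0.0]])
scores = pagerank(sim)                   # sentence 1 is most connected, so it ranks highest
```

The sentence most strongly connected to the others receives the highest score, which is why Step 4 can simply sort by score to pick the summary sentences.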
5.6. Summarizing the Text
Figure 10 shows the article summary in one sentence, along with the rank of each sentence.
Figure 10 Summarized text
5.7. Implementation Stages
There are three stages, namely: the sign-up page, the sign-in page and the summary page.
5.7.1. Sign-up page
To sign up, the user has to enter his credentials: first name, last name, mobile number, e-mail, password and password confirmation.
5.7.2. Sign-in page
The user has to enter the e-mail and password to sign in, in order to upload the article.
5.7.3. Summary page
In Figure 11, the user enters the text and determines the number of sentences, then clicks the Summarize ("لخص") button. As shown in Figure 12, the system displays the summarization of the text with the number of sentences decided by the user.
Figure 11. Summary page
Figure 12. Displaying summary
6. CONCLUSION
Text summarization is the problem of creating a short, accurate and fluent summary of a longer text document. Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online, both to help discover relevant information and to consume it faster. This paper introduced an overview of various past researches and studies in the field of automatic text summarization. The proposed system aims at developing an Arabic abstractive text summarization system that generates a human-like short summary from a long Arabic text. It is very difficult and time-consuming for a human to manually summarize large text documents; therefore, this system helps solve this issue by enabling the machine to learn, understand and summarize the text into shorter sentences. The developed system uses a deep learning algorithm called Long Short-Term Memory (LSTM) to learn and generate the abstractive summary. The implementation of this project achieved a good result by suggesting a useful summary.
Future work will be dedicated to implementing the newly proposed system for summarizing newspaper articles using a deep learning approach. More specifically, a semantic analysis approach will be used to obtain a deeper analysis.
REFERENCES
[1] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut, "Text Summarization Techniques: A Brief Survey", arXiv preprint, arXiv:1707.02268, (2017).
[2] Yogan Jaya Kumar, Ong Sing Goh, Halizah Basiron, Ngo Hea Choon, Puspalata C. Suppiah, "A Review on Automatic Text Summarization Approaches", Journal of Computer Science, 178-190, (2016).
[3] S. A. Babar, "Text Summarization: An Overview", ResearchGate, (2013).
[4] Aishwarya Padmakumar, Akanksha Saran, "Unsupervised Text Summarization using Sentence Embeddings", Semantic Scholar, (2016).
[5] Michael J. Garbade, "A Quick Introduction to Text Summarization in Machine Learning", Towards Data Science, (2018).
[6] Chen Y., Ma Y., Mao X., Li Q., "Abstractive Summarization with the Aid of Extractive Summarization", Springer, vol. 10987, 4-13, (2018).
[7] Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, Noah A. Smith, "Toward Abstractive Summarization Using Semantic Representations", arXiv preprint, arXiv:1805.1039, (2018).
[8] Wan X., Li C., Wang R., Xiao D., Shi C., "Abstractive Document Summarization via Bidirectional Decoder", Springer, vol. 11323, 365-367, (2018).
[9] Mahajani A., Pandya V., Maria I., Sharma D., "A Comprehensive Survey on Extractive and Abstractive Techniques for Text Summarization", Springer, vol. 904, 339-349, (2019).
[10] Narayan S., Papasarantopoulos N., Cohen S. B., Lapata M., "Neural Extractive Summarization with Side Information", arXiv preprint, arXiv:1704.04530, (2017).
[11] Neri Mendoza V., Ledeneva Y., García-Hernández R. A., "Abstractive Multi-Document Text Summarization Using a Genetic Algorithm", Springer, vol. 11524, 423-431, (2019).
[12] Kinnary Jangla, "Windows 10 Revealed: The Universal Windows Operating System for PC, Tablets, and Windows Phone", Apress, (2015).
[13] Reeta Sahoo, Gagan Sahoo, "Computer Science with Python", Saraswati House Pvt Ltd, (2016).
[14] "Google Colab", [Online]. Available: https://colab.research.google.com.