This document discusses using the Expectation Maximization (EM) clustering algorithm for text summarization. It begins with an introduction to text summarization and natural language processing. It then describes applying EM clustering to text that has undergone natural language processing steps such as sentence splitting, tokenization, part-of-speech tagging, and parsing; similar sentences are grouped according to their similarity values in the distance matrix. The clustered sentences can then be used to generate a summary by selecting the most representative sentence from each cluster.
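As a concrete illustration of this pipeline, here is a minimal sketch assuming scikit-learn's GaussianMixture (EM for Gaussian mixtures) over TF-IDF sentence vectors; the sentences and cluster count are illustrative, not from the document.

# Minimal sketch: EM (Gaussian mixture) clustering of sentences for
# extractive summarization. Sentences and n_components are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

sentences = [
    "The cat sat on the mat.",
    "A cat rested on a rug.",
    "Stocks fell sharply on Monday.",
    "Markets declined at the start of the week.",
]

# Vectorize sentences; EM then fits a mixture of Gaussians to these vectors.
X = TfidfVectorizer().fit_transform(sentences).toarray()
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)
labels = gmm.predict(X)

# Pick the sentence closest to each cluster mean as its representative.
summary = []
for k in range(gmm.n_components):
    members = np.where(labels == k)[0]
    dists = np.linalg.norm(X[members] - gmm.means_[k], axis=1)
    summary.append(sentences[members[np.argmin(dists)]])
print(summary)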
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL (ijaia)
The document presents a new model for extractive text summarization that uses BERT (Bidirectional Encoder Representations from Transformers), a pretrained deep bidirectional transformer model, as the text encoder. The model consists of a BERT encoder and a sentence classifier. Sentences are encoded using BERT and classified as to whether they should be included in the summary. Evaluation on the CNN/Daily Mail corpus shows the model achieves state-of-the-art results comparable to other top models according to automatic metrics and human evaluation, making it the first work to apply BERT to text summarization.
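The encoder-plus-classifier design described above can be sketched as follows with the Hugging Face transformers library; the bert-base-uncased checkpoint and the untrained linear head are assumptions, not the paper's released code.

# Sketch of a BERT sentence encoder plus a binary "include in summary" head.
# The checkpoint and the untrained classifier are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # in/out of summary

sentences = ["The economy grew 3% last year.", "He also likes tea."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Use the [CLS] vector of each sentence as its fixed-size representation.
    cls_vectors = encoder(**batch).last_hidden_state[:, 0]
    logits = classifier(cls_vectors)

# Probability that each sentence belongs in the extractive summary
# (meaningful only after the head is trained on labeled data).
print(torch.softmax(logits, dim=-1)[:, 1])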
This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single-document summarization and multi-document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents.
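Since ROUGE recurs throughout this list, a simplified worked example of ROUGE-1 recall (unigram overlap with a human reference) may help; the official toolkit adds stemming, stopword options, and more.

# Simplified ROUGE-1 recall: unigram overlap between a candidate summary
# and a human reference (the official toolkit is more elaborate).
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

# 5 of the 6 reference unigrams appear in the candidate -> 0.833...
print(rouge1_recall("the cat sat on the mat", "the cat was on the mat"))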
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ... (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up method in which the distances between documents can be obtained by extracting feature values with a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn’s idea. Together these methods can build better document clusters, but little research discusses them. Therefore, in this research, the term weighting calculation uses Luhn’s idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values serve as distances between documents, which are clustered with single-, complete-, and average-link algorithms. The evaluations show little difference between feature extraction with and without the lower cut-off, but topic determination for each term based on term frequency and the fuzzy Sugeno method finds more relevant documents than the Tsukamoto method. Using the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering yields consistent metric values, so this clustering method is suggested as a better way to cluster documents in closer agreement with the gold standard.
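A minimal sketch of the clustering step, assuming SciPy's hierarchical clustering over a precomputed document distance matrix; the toy distances stand in for the fuzzy-LDA feature values the abstract derives.

# Sketch: single-, complete-, and average-link agglomerative clustering over
# a symmetric document distance matrix (illustrative values).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])
condensed = squareform(D)  # linkage expects the condensed distance form

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))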
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION (csandit)
There are sixteen known methods for automatic text summarization. In our study we use natural language processing (NLP) within a hybrid approach that improves the selection of important sentences by strengthening sentence scores while reducing the number of long sentences included in the final summary. The base approach used in the algorithm is term occurrences.
Multi Document Text Summarization using Backpropagation Network (IRJET Journal)
The document describes a system for summarizing multiple Hindi language documents using a backpropagation neural network. It extracts features from the documents like sentence length, position, similarity, and subjects. These features are used to train a neural network to learn which sentences should be included in a summary. The system segments, tokenizes, removes stop words and stems sentences before extracting the features. It then generates a summary by selecting important sentences based on the trained network. The system was tested on 70 Hindi news documents and showed improved performance as more documents were used for training.
Dissertation defense slides on "Semantic Analysis for Improved Multi-document... (Quinsulon Israel)
This document outlines Quinsulon Israel's Ph.D. dissertation defense on using semantic analysis to improve multi-document summarization. The dissertation examines using semantic triples clustering and semantic class scoring of sentences to generate summaries. It reviews prior work on statistical, features combination, graph-based, multi-level text relationship, and semantic analysis approaches. The dissertation aims to improve the baseline method and evaluate the effects of semantic analysis on focused multi-document summarization performance.
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A... (ijctcm)
This document summarizes an approach to automatic text summarization using the Simplified Lesk algorithm and WordNet. It analyzes sentences to determine relevance based on semantic information rather than surface features like position or format. Sentences are assigned weights based on the number of overlaps between words' dictionary definitions and the full text. Higher weighted sentences are selected for the summary based on a percentage of the original text length. The approach achieves 80% accuracy on 50% summarizations of diverse texts compared to human summaries.
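The overlap idea can be sketched with NLTK's WordNet interface: count how many context words appear in a sense's gloss. Summing these overlaps over a sentence's words as its weight is our reading of the scheme, not the paper's exact code.

# Sketch of the simplified Lesk overlap count: how many context words
# appear in a word sense's WordNet gloss. A sentence's weight would sum
# these overlaps over its words (our assumption of the weighting scheme).
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def lesk_overlap(word: str, context: set[str]) -> int:
    best = 0
    for sense in wn.synsets(word):
        gloss_words = set(sense.definition().lower().split())
        best = max(best, len(gloss_words & context))
    return best

context = set("the bank approved the loan application".split())
print(lesk_overlap("bank", context))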
Sentiment Analysis In Myanmar Language Using Convolutional LSTM Neural Network (kevig)
In recent years there has been increasing use of social media among people in Myanmar, and writing reviews about products, movies, and trips on social media pages has also become popular. Moreover, most people look for review pages about a product before deciding whether to buy it. Extracting useful reviews of interesting products is important but time consuming, and sentiment analysis is one of the key processes for extracting them. In this paper, a convolutional LSTM neural network architecture is proposed to classify the sentiment of cosmetic reviews written in the Myanmar language. The paper also sets out to build a cosmetic-review dataset for deep learning and a sentiment lexicon in the Myanmar language.
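A hedged sketch of a convolutional-LSTM classifier of the kind proposed, in Keras; the vocabulary size, layer dimensions, and Conv1D-into-LSTM ordering are assumptions.

# Sketch of a convolution + LSTM sentiment classifier in Keras.
# Vocabulary size, dimensions, and layer ordering are assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),   # token ids -> vectors
    layers.Conv1D(64, kernel_size=3, activation="relu"), # local n-gram features
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                     # long-range sequence context
    layers.Dense(1, activation="sigmoid"),               # positive/negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()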
Chunking means splitting sentences into tokens and then grouping them in a meaningful way. When it comes to high-performance chunking systems, transformer models have proved to be the state-of-the-art benchmarks. Performing chunking as a task requires a large-scale, high-quality annotated corpus in which each token carries a particular tag, much as in named entity recognition tasks; these tags are later used in conjunction with pointer frameworks to find the final chunk. Solving this for a specific domain becomes highly costly in time and resources if a large, high-quality training set must be annotated manually, and when the domain is specific and diverse, cold starting becomes even harder because of the large number of manually annotated queries needed to cover all aspects. To overcome the problem, we applied a grammar-based text generation mechanism in which, instead of annotating individual sentences, we annotate grammar templates. We defined various templates corresponding to different grammar rules, and created sentences from these templates and rules with symbol or terminal values chosen from the domain data catalog. This let us create a large number of annotated queries, which were used to train an ensemble transformer-based deep neural network model [24]. We found that grammar-based annotation solved domain-based chunking of input query sentences without any manual annotation, achieving a classification F1 score of 96.97% on tokens from out-of-template queries.
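The template mechanism is not spelled out above, so the following is only a guess at its shape: templates with tagged slots, slot values drawn from a hypothetical domain catalog, and every generated token emitted with its chunk tag (O outside slots).

# Illustrative guess at grammar-template annotation: each template slot is
# filled from a domain catalog, and every generated token carries the tag
# of its slot (or O outside slots), yielding labeled training queries.
import itertools

templates = ["show {METRIC} for {PRODUCT}", "compare {PRODUCT} with {PRODUCT}"]
catalog = {"METRIC": ["sales", "revenue"], "PRODUCT": ["laptops", "phones"]}

def expand(template: str):
    parts = template.split()
    slots = [p.strip("{}") for p in parts if p.startswith("{")]
    for values in itertools.product(*(catalog[s] for s in slots)):
        vals = iter(values)
        tokens = []
        for p in parts:
            if p.startswith("{"):
                tokens.append((next(vals), p.strip("{}")))  # (token, chunk tag)
            else:
                tokens.append((p, "O"))
        yield tokens

for t in templates:
    for query in expand(t):
        print(query)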
Sentence Extraction Based on Sentence Distribution and Part of Speech Tagging... (TELKOMNIKA JOURNAL)
Automatic multi-document summarization needs to find representative sentences not only by sentence distribution, to select the most important sentence, but also by how informative a term is in a sentence. Sentence distribution is suitable for obtaining important sentences by determining frequent and well-spread words in the corpus, but it ignores the grammatical information that indicates informative content. The presence or absence of informative content in a sentence can be indicated by grammatical information carried by part-of-speech (POS) labels. In this paper, we propose a new sentence weighting method incorporating sentence distribution and POS tagging for multi-document summarization. Similarity-based Histogram Clustering (SHC) is used to cluster sentences in the data set, and clusters are ordered by cluster importance to determine the important ones. Sentence extraction based on sentence distribution and POS tagging is introduced to extract the representative sentences from the ordered clusters. The results of the experiment on the Document Understanding Conferences (DUC) 2004 data are compared with those of the Sentence Distribution Method; our proposed method achieved better results, with increases of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
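One possible reading of combining sentence distribution with POS information is sketched below: word-frequency scores boosted for informative POS classes, summed per sentence. The POS weights are invented for illustration, and NLTK's tagger is assumed rather than the paper's tooling.

# One possible reading of distribution + POS weighting: word-frequency
# scores boosted for informative POS classes. The weights are invented.
from collections import Counter
import nltk  # requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

POS_WEIGHT = {"NN": 1.5, "NNS": 1.5, "NNP": 1.5, "VB": 1.2, "VBD": 1.2}

def score_sentences(sentences):
    words = [w.lower() for s in sentences for w in nltk.word_tokenize(s)]
    freq = Counter(words)  # frequent, well-spread words score higher
    scores = []
    for s in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(s))
        scores.append(sum(freq[w.lower()] * POS_WEIGHT.get(t, 1.0)
                          for w, t in tagged))
    return scores

docs = ["The committee approved the budget.", "It met on Tuesday.",
        "The budget funds new schools."]
print(sorted(zip(score_sentences(docs), docs), reverse=True)[0][1])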
Document Classification Using KNN with Fuzzy Bags of Word Representation (suthi)
Abstract — Text classification assigns documents to categories depending on their words, phrases, and word combinations. Many applications use text classification, such as artificial intelligence systems that maintain data by category. Keywords, called Topics, are selected to classify a given document, and from these Topics the main idea of the document can be identified, so selecting them well is an important task. In the proposed system, keywords are extracted from documents using TF-IDF and WordNet: the TF-IDF algorithm selects the important words by which a document can be classified, and WordNet finds the similarity between these candidate words, with the most similar words taken as Topics (keywords). In this experiment we used the TF-IDF model to find similar words for classifying documents; among comparison algorithms, the decision tree gave better accuracy for text classification. We also propose a fuzzy system to classify text written in natural language according to topic. A fuzzy classifier is necessary for this task because a given text can cover several topics to different degrees; traditional classifiers are inappropriate here, as they attempt to sort each text into a single class in a winner-takes-all fashion. The classifier we propose automatically learns its fuzzy rules from training examples. We applied it to classify news articles, and the results are promising. Vector dimensionality matters greatly in text classification, and it can be decreased using clustering based on fuzzy logic: depending on similarity, documents can be classified and formed into clusters according to their Topics, after which the documents can be accessed and stored easily.
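The two steps named in the abstract, TF-IDF candidate selection followed by WordNet similarity filtering, might look like this; the similarity threshold and toy documents are assumptions.

# Sketch of the two steps named above: TF-IDF candidate terms, then
# WordNet similarity to keep related candidates as "Topics".
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

docs = ["The bank approved the loan and mortgage.",
        "Rivers flood the bank after heavy rain."]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

# Top TF-IDF candidates for the first document.
row = X[0].toarray().ravel()
candidates = [terms[i] for i in row.argsort()[::-1][:5] if row[i] > 0]

def similarity(a, b):
    sa, sb = wn.synsets(a), wn.synsets(b)
    if not sa or not sb:
        return 0.0
    return max((s1.path_similarity(s2) or 0.0) for s1 in sa for s2 in sb)

# Keep candidates with a sufficiently similar partner (threshold assumed).
topics = [c for c in candidates
          if any(similarity(c, o) > 0.2 for o in candidates if o != c)]
print(topics)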
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES (ijnlc)
The document summarizes a technique for generating summaries using sentence compression and statistical measures. It first implements a graph-based technique to achieve sentence compression and information fusion. It then uses hand-crafted syntactic rules to prune compressed sentences. Finally, it uses probabilistic measures and word co-occurrence to obtain the summaries. The system can generate summaries at any user-defined compression rate.
This document discusses several approaches for embedding knowledge bases and relations into continuous vector spaces using neural networks. It first describes earlier models like semantic embedding which used simple scoring functions based on distance between entity embeddings. More advanced models like semantic matching energy and neural tensor networks learn separate relation embeddings and use them to calculate entity interactions. The document also discusses applications of these embeddings for tasks like link prediction, question answering and knowledge base expansion. It provides details of various models' scoring functions, training objectives and datasets used for evaluation.
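For concreteness, here is a NumPy sketch of the neural tensor network scoring function g(e1, R, e2) = u · tanh(e1ᵀ W[1:k] e2 + V [e1; e2] + b), with random parameters standing in for trained embeddings; dimensions are illustrative.

# NumPy sketch of the neural tensor network relation score
#   g(e1, R, e2) = u . tanh(e1^T W[1:k] e2 + V [e1; e2] + b)
# with random parameters standing in for trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4                        # embedding size, tensor slices
e1, e2 = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(size=(k, d, d))     # relation-specific tensor
V = rng.normal(size=(k, 2 * d))    # relation-specific linear map
b = rng.normal(size=k)
u = rng.normal(size=k)

bilinear = np.array([e1 @ W[i] @ e2 for i in range(k)])
score = u @ np.tanh(bilinear + V @ np.concatenate([e1, e2]) + b)
print(score)  # higher = the triple (e1, R, e2) is judged more plausible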
The document discusses query-based summarization, including defining the task, evaluation criteria, and different approaches used. Some key approaches discussed are using document graphs to identify relevant sections, rhetorical structure theory to create a graph representation, linguistics techniques like Hidden Markov Models for sentence selection, and machine learning methods like using support vector machines to rank sentences. Different domains like medical and opinion summarization are also outlined.
Increasing interpreting needs call for a more objective and automatic measurement. We hold the basic idea that 'translating means translating meaning', so we can assess interpretation quality by comparing the meaning of the interpreting output with the source input. That is, a translation unit of a 'chunk' called a Frame, which comes from frame semantics, and its components, called Frame Elements (FEs), which come from FrameNet, are proposed to explore the matching rate between target and source texts. A case study in this paper verifies the usability of semi-automatic graded semantic-scoring measurement for human simultaneous interpreting and shows how to use Frame and FE matches to score. Experimental results show that the semantic-scoring metrics correlate significantly with human judgment.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of this research is to arrive at a methodology that helps extract important events such as dates, places, and subjects of interest, and, conveniently, presents users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
This document discusses several approaches for embedding knowledge bases and relations into continuous vector spaces using neural networks. It first describes earlier models like semantic embedding and semantic matching energy which used single hidden layers. It then explains more complex models like neural tensor networks that use tensors to model relations. The document also discusses applications of these embeddings for tasks like link prediction, question answering, and knowledge base expansion. It provides details on model formulations, scoring functions, training objectives, and datasets used for evaluation.
An automatic text summarization using lexical cohesion and correlation of sen... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Newly Proposed Technique for Summarizing the Abstractive Newspapers’ Articl... (mlaij)
In this new era, where tremendous information is available on the internet, it is most important to provide improved mechanisms to extract information quickly and efficiently. It is very difficult for human beings to manually summarize large documents of text, so there is a twin problem of searching for relevant documents among the many available and absorbing relevant information from them. To solve these two problems, automatic text summarization is very much necessary. Text summarization is the process of identifying the most important meaningful information in a document or set of related documents and compressing it into a shorter version while preserving its overall meaning. More specifically, Abstractive Text Summarization (ATS) is the task of constructing summary sentences by merging facts from different source sentences and condensing them into a shorter representation while preserving information content and overall meaning. This paper introduces a newly proposed technique for summarizing abstractive newspaper articles based on deep learning.
THE EFFECTS OF THE LDA TOPIC MODEL ON SENTIMENT CLASSIFICATION (ijscai)
Online reviews are feedback on a product and play a key role in improving it to cater to consumers, but relying on manual categorization of reviews is time consuming and labor intensive. In deep learning, recurrent neural networks can process time-series data, and long short-term memory (LSTM) networks handle long sequences well, with strong experimental support in natural language processing, machine translation, speech recognition, and language modeling. The quality of the extracted data features affects the classification results a model produces. The LDA topic model adds prior and posterior knowledge to the data so that its characteristics can be extracted efficiently; applied to a classifier, this can improve accuracy and efficiency. Bidirectional LSTM networks are variants and extensions of recurrent neural networks, and the deep learning framework Keras, using TensorFlow as the backend, makes it convenient to build a bidirectional LSTM model, providing strong technical support for the experiment. Using the LDA topic model to extract the keywords needed to train the neural network, and thereby strengthening the internal relationships between words, can improve the learning efficiency of the model. Under the same experimental environment, the results are better than those of traditional word-frequency features.
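The LDA keyword step can be sketched with scikit-learn (the abstract does not name its implementation): fit a topic model over reviews and take each topic's top words as keyword features for the BiLSTM. Reviews and counts below are illustrative.

# Sketch of the LDA keyword step: fit a topic model over reviews and take
# each topic's top words as keyword features (scikit-learn assumed).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["great battery and fast screen", "battery died fast, bad screen",
           "lovely scent and smooth texture", "texture is smooth, scent is mild"]

vec = CountVectorizer()
X = vec.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {t}: {top}")  # top words -> keyword features for the BiLSTM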
The project re-implements the architecture of the paper Reasoning with Neural Tensor Networks for Knowledge Base Completion in the Torch framework, achieving similar accuracy results with an elegant implementation in a modern language.
Below are some links for further details:
https://github.com/agarwal-shubham/Reasoning-Over-Knowledge-Base
http://darsh510.github.io/IREPROJ/
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le... (ijtsrd)
This document presents a method for generating suggestions for specific erroneous parts of sentences in Indian languages like Malayalam using deep learning. The method uses recurrent neural networks with long short-term memory layers to train a model on input-output examples of sentences and their corrections. The model takes in preprocessed sentence data and generates a set of possible corrections for erroneous parts through multiple network layers. An analysis of the model shows that it can accurately generate suggestions for a word length of three, but more data and study are required to handle the complex morphology and symbols of Malayalam. The performance of the method is limited by the hardware used; it could be improved with a more powerful system and additional training data.
Chat bot using text similarity approach (dinesh_joshy)
1. There are three main techniques for chat bots to generate responses: static responses using templates, dynamic responses by scoring potential responses from a knowledge base, and generated responses using deep learning to generate novel responses from training data.
2. Text similarity can be measured using string-based, corpus-based, or knowledge-based approaches. String-based measures operate on character sequences while corpus-based measures use word co-occurrence statistics and knowledge-based measures use information from semantic networks like WordNet.
3. Popular corpus-based measures include LSA, ESA, and PMI-IR, which analyze word contexts and co-occurrences in corpora. Knowledge-based measures like Resnik, Lin, and Leacock-Chodorow draw on the WordNet hierarchy.
This document summarizes a research paper that analyzed attribute association in heart disease using data mining techniques. The paper proposed a new measure called AA(I) to quantify the strength of association among attributes in a dataset. The measure was applied to both frequent and infrequent itemsets. The dataset contained 1897 subjects characterized by 21 attributes related to demographics, medical history, and lab tests. Analysis found that risk of cardiovascular disease was higher in males aged 56-65. The proposed association measure could help predict diseases like heart attacks by identifying relationships between comorbid attributes. Future work aims to test the measure on larger health datasets and compare its performance to other algorithms.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
This document summarizes a research paper on color image enhancement using an adaptive filter. It proposes a new algorithm that uses an adaptive filter to obtain the background image from a video based on color information. It then performs adaptive adjustment on the luminance image to get a locally enhanced image. Finally, it applies color restoration to obtain the enhanced color image. The algorithm aims to better preserve color information and reduce halo effects compared to techniques using discrete wavelet transforms. Experimental results show the adaptive filter produces clearer details and more natural colors in enhanced images and video frames.
This document proposes a new negative edge triggered flip-flop circuit called the Switching Transistor Based D Flip-Flop (STDFF) that reduces power consumption and area compared to conventional D flip-flops. The STDFF uses only 8 transistors compared to other designs, reducing power usage by 85% and area by 40%. Simulations show the STDFF operates correctly as a negative edge triggered flip-flop with reduced delay. It is proposed that this low power design is suitable for battery-powered mobile devices.
This document discusses using video compression techniques to compress multichannel neural signals. It proposes using a multiwavelet transform to decorrelate the signals, followed by vector quantization to exploit correlations between electrodes. Motion estimation and compensation are also used to reduce redundancy between successive neural frames by determining motion vectors, similar to how video compression analyzes frame-to-frame motion. The goal is to significantly reduce the large amounts of neural data for easier wireless transmission without degrading quality.
This document summarizes a study on the effects of elevated temperatures on the compressive and splitting tensile strengths of ultra-high strength concrete. Cubes and cylinders of M100 grade concrete were exposed to temperatures between 50-250°C for durations of 1-4 hours. Testing found that compressive and splitting tensile strengths initially increased with temperature up to 100°C but then decreased with further increases in temperature. The maximum strengths were observed when specimens were heated to 100°C for 1 hour. Understanding how high-strength concrete properties change after fire exposure can help determine the load capacity of damaged structures.
This document describes the design of a low power multiserial to Ethernet gateway for unmanned aerial vehicle data acquisition systems. The gateway uses an FPGA and Ethernet controller chip to interface with multiple serial devices and transmit the data over Ethernet. The FPGA implements UART modules to interface with sensors and an ADC. It collects data from the serial devices and sends it to the Ethernet module packed in Ethernet frames. This simplifies wiring and allows the data to be transmitted to the ground station computer over Ethernet for processing and storage.
This document summarizes a research paper that proposes a secure communal verification and key exchange scheme for mobile communications. The scheme uses nested one-time secrets, with an outer secret shared between a user and the home location register, and an inner secret shared between a user and the current visited location register. This allows for efficient verification when a user does not change locations, reducing computation costs. The scheme aims to improve security and performance over existing verification schemes for mobile networks. It uses different verification mechanisms like timestamps, one-time secrets, and nonces depending on the situation to optimize efficiency.
Improvement of Text Summarization using Fuzzy Logic Based Method (IOSR Journals)
The document describes a method for improving text summarization using fuzzy logic. It proposes using fuzzy logic to determine the importance of sentences based on calculated feature scores. Eight features are used to score sentences, including title words, length, term frequency, position, and similarity. Sentences are then ranked based on their fuzzy logic-determined scores. The highest scoring sentences are extracted to create a summary. An evaluation of summaries generated using this fuzzy logic method found it performed better than other summarizers in accurately reflecting the content and order of human-generated reference summaries. The method could be expanded to multi-document summarization and automatic selection of fuzzy rules based on input type.
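A hand-rolled sketch of fuzzy sentence scoring follows: triangular membership functions over normalized feature values and a tiny max-min rule base. It uses only three of the paper's eight features, and the membership shapes and rules are invented for illustration, not the paper's rule base.

# Hand-rolled sketch of fuzzy sentence scoring: triangular memberships over
# normalized features and a tiny max-min rule base (both illustrative).
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def score(title_overlap, position, term_freq):
    # Memberships for "high" on each normalized feature in [0, 1].
    high = {
        "title": tri(title_overlap, 0.2, 1.0, 1.8),
        "pos": tri(position, 0.2, 1.0, 1.8),
        "tf": tri(term_freq, 0.2, 1.0, 1.8),
    }
    # Rules: IF title high AND tf high -> important; IF position high -> important.
    r1 = min(high["title"], high["tf"])
    r2 = high["pos"]
    return max(r1, r2)  # aggregated importance in [0, 1]

print(score(title_overlap=0.8, position=0.1, term_freq=0.9))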
Extractive Summarization with Very Deep Pretrained Language Model (gerogepatton)
The document describes a study that used BERT (Bidirectional Encoder Representations from Transformers), a pretrained language model, for extractive text summarization. The researchers developed a two-phase encoder-decoder model where BERT encoded sentences from documents and classified them as included or not included in the summary. They evaluated the model on the CNN/Daily Mail corpus and found it achieved state-of-the-art results comparable to previous models based on both automatic metrics and human evaluation.
Chunking means splitting the sentences into tokens and then grouping them in a meaningful way. When it comes to high-performance chunking systems, transformer models have proved to be the state of the art benchmarks. To perform chunking as a task it requires a large-scale high quality annotated corpus where each token is attached with a particular tag similar as that of Named Entity Recognition Tasks. Later these tags are used in conjunction with pointer frameworks to find the final chunk. To solve this for a specific domain problem, it becomes a highly costly affair in terms of time and resources to manually annotate and produce a large-high-quality training set. When the domain is specific and diverse, then cold starting becomes even more difficult because of the expected large number of manually annotated queries to cover all aspects. To overcome the problem, we applied a grammar-based text generation mechanism where instead of annotating a sentence we annotate using grammar templates. We defined various templates corresponding to different grammar rules. To create a sentence we used these templates along with the rules where symbol or terminal values were chosen from the domain data catalog. It helped us to create a large number of annotated queries. These annotated queries were used for training the machine learning model using an ensemble transformer-based deep neural network model [24.] We found that grammar-based annotation was useful to solve domain-based chunks in input query sentences without any manual annotation where it was found to achieve a classification F1 score of 96.97% in classifying the tokens for the out of template queries.
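The template mechanism described above can be pictured with a small sketch. The following is a minimal illustration only, not the authors' implementation: the catalog entries, the two templates, and the BIO-style tag names are hypothetical stand-ins for the domain data catalog and grammar rules the abstract mentions.

```python
import random

# Hypothetical domain data catalog (stand-in for the real one).
CATALOG = {
    "BRAND": ["acme", "globex"],
    "PRODUCT": ["laptop", "phone"],
    "ATTRIBUTE": ["price", "battery life"],
}

# Grammar templates: each slot carries its chunk tag, so every generated
# sentence is born with token-level labels and needs no manual annotation.
TEMPLATES = [
    [("show me the", "O"), ("ATTRIBUTE", "B-ATTR"), ("of the", "O"),
     ("BRAND", "B-BRAND"), ("PRODUCT", "B-PROD")],
    [("compare", "O"), ("BRAND", "B-BRAND"), ("and", "O"), ("BRAND", "B-BRAND")],
]

def generate(template):
    tokens, tags = [], []
    for symbol, tag in template:
        text = random.choice(CATALOG[symbol]) if symbol in CATALOG else symbol
        for i, word in enumerate(text.split()):
            tokens.append(word)
            # First word keeps the B- tag; continuation words get I- tags.
            tags.append(tag if i == 0 or tag == "O" else "I-" + tag[2:])
    return list(zip(tokens, tags))

for template in TEMPLATES:
    print(generate(template))
```

Running such a generator over all templates and catalog combinations yields a large tagged training set of the kind used to train the ensemble model described above.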
Conceptual framework for abstractive text summarizationijnlc
As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input, processes it by building a rich semantic graph, and then reduces this graph to generate the final summary.
A hybrid approach for text summarization using semantic latent Dirichlet allo...IJECEIAES
Automatic text summarization generates a summary that contains sentences reflecting the essential and relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach for generating text summaries that combines extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As the number of concepts and the number of words in the summary increase, the ROUGE precision and F1 scores of our proposed model improve.
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
This document proposes a methodology to extract information from big data sources like course handouts and directories and represent it in a graphical, ontological tree format. Keywords are extracted from documents using natural language processing techniques and used to generate a hierarchical tree based on the DMOZ open directory project. The trees provide a comprehensive overview of document content and structure. The method is implemented using Python for natural language processing and Java for visualization. Evaluation on computer science course handouts shows the trees accurately represent topic coverage and depth. Future work aims to increase the number of keywords extracted.
Taxonomy extraction from automotive natural language requirements using unsup...ijnlc
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural language requirements of the automotive industry. The approach is based on the distributional hypothesis and the special characteristics of domain-specific German compounds. We extract taxonomies by using clustering techniques in combination with general thesauri. Such a taxonomy can be used to support requirements engineering in early stages by providing a common system understanding and an agreed-upon terminology. This work is part of an ontology-driven requirements engineering process, which builds on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common hierarchical clustering techniques.
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
This paper proposes a multi-document summarization system that uses bisect k-means clustering, an optimal merge function, and a neural network. The system first preprocesses input documents through stemming and removing stop words. It then applies bisect k-means clustering to group similar sentences. The clusters are merged using an optimal merge function to find important keywords. The NEWSUM algorithm is used to generate a primary summary for each keyword. A neural network trained on sentence classifications is then used to classify sentences in the primary summary as positive or negative. Only positively classified sentences are included in the final summary to improve accuracy. The system aims to generate a concise and accurate summary in a short period of time from multiple documents on a given topic.
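The bisecting step of such a system is easy to sketch. Below is a generic bisect k-means routine over sentence vectors, assuming scikit-learn's KMeans; it illustrates only the clustering stage, not the paper's merge function, NEWSUM step, or neural classifier.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisect_kmeans(X, k):
    """Bisecting k-means: repeatedly split the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]            # start with one cluster of everything
    while len(clusters) < k:
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)           # indices of the cluster to split
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X = np.random.RandomState(0).rand(12, 4)      # e.g. 12 sentence vectors
for group in bisect_kmeans(X, 3):
    print(group)
```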
Automatic Text Summarization: A Critical ReviewIRJET Journal
This document provides a literature review and critical analysis of automatic text summarization techniques. It discusses extractive and abstractive summarization approaches and reviews 10 papers published between 2009-2021 on topics like graph-based, keyword-based, and feature-based summarization methods. The document aims to identify strengths and limitations of the approaches discussed and opportunities for future work in automatic text summarization.
The document presents an ensemble model for chunking natural language text that combines a transformer model (RoBERTa) with a bidirectional LSTM and CNN model. The authors train these models on common chunking datasets like CoNLL 2000 and English Penn Treebank. They find that by using an ensemble of the transformer and RNN-CNN models, which compensate for each other's weaknesses, they are able to achieve state-of-the-art results on chunking, with an F1 score of 97.3% on CoNLL 2000, exceeding previous work. The transformer model provides attention-based contextual embeddings while the RNN-CNN model uses custom embeddings including POS tags to improve accuracy on tags that the transformer model struggles with.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieving short and important text from the source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but far less effort has been made for Hindi. Here, we discuss various techniques in terms of features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATIONIJDKP
In our study we use an approach that combines natural language processing (NLP) with term occurrences to improve the quality of important-sentence selection, thickening the sentence score while reducing the number of long sentences that would be included in the final summarization. There are sixteen known methods for automatic text summarization. In our paper we utilize the term frequency approach and build an algorithm to re-filter sentence scores.
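As a rough sketch of the term-occurrence idea, the snippet below scores each sentence by its average term frequency, so long sentences are not automatically favoured; this is our simplification for illustration, not the paper's full algorithm.

```python
import re
from collections import Counter

def summarize(text, n=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    tf = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(sentence):
        # Average term frequency: "thickens" the score of sentences built from
        # frequent terms while penalising padding in long sentences.
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(tf[t] for t in tokens) / (len(tokens) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:n])
    return ' '.join(s for s in sentences if s in top)   # keep original order

print(summarize("Cats purr. Cats and dogs play. Dogs bark loudly. Cats sleep."))
```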
A domain specific automatic text summarization using fuzzy logicIAEME Publication
This paper proposes a domain-specific automatic text summarization technique using fuzzy logic. It extracts important sentences from documents using both statistical and linguistic features. The technique first preprocesses documents by segmenting sentences, removing stop words, and stemming words. It then extracts eight features from each sentence, including title words, sentence length, position, and similarity. Fuzzy logic is used to assign an importance score to each sentence based on these features. Sentences are ranked and the top sentences are selected for the summary based on the compression rate. The technique was evaluated on news articles and shown to generate good quality summaries compared to other summarizers based on precision, recall, and F-measure.
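To make the fuzzy scoring concrete, here is a toy single-rule sketch in plain Python; the membership function, feature values, and the one rule are invented for illustration and are much simpler than the paper's eight-feature rule base.

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Normalized feature scores for one sentence (hypothetical values).
features = {"title": 0.6, "length": 0.4, "position": 0.9, "similarity": 0.7}

# Degree to which each feature is HIGH (membership peaking at 1.0).
high = {name: tri(value, 0.3, 1.0, 1.7) for name, value in features.items()}

# Toy rule: IF title is HIGH AND position is HIGH THEN sentence is IMPORTANT.
# Fuzzy AND = min; the rule's firing strength becomes the importance score.
importance = min(high["title"], high["position"])
print(round(importance, 3))
```

Ranking the sentences by such rule outputs and cutting at the compression rate gives the extract.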
Comparative analysis of c99 and topictiling text segmentation algorithmseSAT Journals
Abstract: In this paper, the work done includes the extraction of information from image datasets which contain natural text. Segmenting natural text from an image is very difficult, so precision is the most important factor to keep in mind. To minimize error rates, an error filtration technique is provided, with filtration adopted during image segmentation, specifically segmentation of the text present in images. Furthermore, a comparative analysis of two text segmentation algorithms, C99 and TopicTiling, on image documents is presented. To assess how well each algorithm works, each was applied to different datasets and the results were compared. The work done also demonstrates the efficiency of TopicTiling over C99. Index Terms: text segmentation, text extraction, image documents, C99, TopicTiling.
The document presents a comparative analysis of two text segmentation algorithms, C99 and TopicTiling, that are applied to extract natural text from image documents. It first discusses related work on text segmentation techniques. It then provides an overview of the two-phase implementation: 1) text is extracted from images using preprocessing, thresholding, boundary detection and text recognition, and 2) the extracted text is segmented using C99 and TopicTiling, and the results of each are compared. The analysis shows that TopicTiling performs more efficiently than C99 at segmenting text from images.
This document describes a method for sentence similarity based text summarization using clusters. It involves preprocessing text, extracting primitives from sentences, linking primitives, computing sentence similarity, merging similarity values, clustering similar sentences, and extracting a representative sentence from each cluster to generate a summary. Key steps include identifying common elements (primitives) between sentences, representing sentences as vectors of primitives, computing similarity based on shared primitives, clustering similar sentences, pruning clusters to remove dissimilar sentences, ranking clusters by importance, and selecting a representative sentence from each cluster for the summary. The goal is to automatically generate a short summary that captures the essential information from a collection of documents or text on the same topic.
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKijnlc
In recent years, social media use has been increasing among people in Myanmar, and writing reviews on social media pages about products, movies, and trips has become popular. Most people look for review pages about a product they want to buy before deciding whether they should buy it. Extracting useful reviews about interesting products is important and time-consuming, and sentiment analysis is one of the important processes for extracting useful reviews of products. In this paper, a Convolutional LSTM neural network architecture is proposed to analyse the sentiment classification of cosmetic reviews written in the Myanmar language. The paper also sets out to build a cosmetic-review dataset for deep learning and a sentiment lexicon in the Myanmar language.
Text Summarization using Expectation Maximization Clustering Algorithm
Ms. Meghana N. Ingole, Mrs. M. S. Bewoor, Mr. S. H. Patil
Department of Computer Engineering,
Bharati Vidyapeeth Deemed University, College of Engineering, Pune, India.
International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 4, July-August 2012, pp. 168-171.
ABSTRACT
A large amount of data is available on the internet due to the growth of the World Wide Web. It is very difficult for human beings to manually find the useful and significant data. This problem can be resolved by using text summarization. Text summarization is the process of condensing an input text file into a shorter version while preserving its overall content and meaning. This paper is about text summarization using natural language processing. It describes a system in which the first module implements the phases of natural language processing, namely splitting, tokenization, part-of-speech tagging, chunking and parsing, and the second module implements the Expectation Maximization clustering algorithm to find sentence similarity. Based on the sentence similarity values, the text can easily be summarized.
Keywords: document graph, splitting, similarity, tokenization.
I. INTRODUCTION
Overview: There is an abundant amount of information available on the web. It is easy for a user to search for particular information by entering a keyword, but it is difficult to find the useful and significant information. Text summarization is an important tool for assisting in the interpretation of data when a vast amount of it is available; it means not only condensing the information but also maintaining the original meaning. Abstractive summarization is a method that consists of understanding the meaning of a text and then restating it in fewer words, whereas extractive summarization means picking only the important text from the original document and concatenating it into a shorter version. Interpretation of the text can be done using natural language processing (NLP). In NLP, the input text has to go through different phases: sentences are split, parsed, tokenized and finally chunked.

This pre-processing is done before generating the summary. The aim of the research work is twofold: first, to implement the Expectation Maximization clustering algorithm; second, to perform query-dependent summarization by removing ambiguity. The WordNet dictionary is used for interpreting the text. The input text is processed so that each sentence is considered a single node, and each node is compared with every other node. This comparison is done by calling the WordNet dictionary to calculate a weight, and the result is represented in the form of a document graph. The Expectation Maximization clustering algorithm is applied before summarization to generate an effective summary.

II. RELATED WORK
The Expectation Maximization clustering algorithm is used for query-dependent text summarization. This technique avoids cumbersome work for the user, who can easily obtain a query-dependent summary of a text document. The proposed work is context-sensitive text summarization using the Expectation Maximization clustering algorithm; different clustering algorithms can also be implemented in its place. The main focus of the summarization is natural language processing, which removes sentence ambiguity and yields an accurate summary that maintains the originality of the text. The phases of NLP perform the important task of chunking the text into the form of a graph by understanding its meaning.

However, the findings of several related papers differ somewhat from our analysis of the algorithm. Paper [1] considers how to achieve high accuracy and how to represent count data, examining different data clustering models; finally, a mixture model made up of different distributions is optimized by expectation maximization and minimum description length. In paper [2], unsupervised clustering of documents is considered, where each document is a text file, and a probabilistic model is used.
The model is one of word counts, with different components corresponding to different themes. Here Expectation Maximization is used as a tool to calculate maximum a posteriori (MAP) estimates of the parameters and to obtain meaningful document clusters; finally, the MAP estimates are computed by Gibbs sampling. Paper [3] is about sentence clustering within a document. Since sentence similarity generally does not place sentences in a metric space, that paper uses a fuzzy relational clustering algorithm operating on data in the form of a square matrix; in the graph representation used with Expectation Maximization, each object in the graph is interpreted as a likelihood. The paper's result is that the algorithm is capable of identifying overlapping clusters of semantically related sentences; the drawback is time complexity, because PageRank is applied in every EM cycle, which leads to long convergence times when there is a large number of clusters. In [4], the Expectation Maximization algorithm is used for the interaction between image segmentation and object recognition: image segmentation is done in the E step and reconstruction of the object in the M step, with Active Appearance Models (AAMs) used to capture the shape and appearance of the image. The technique of [5] uses the Expectation Maximization algorithm to estimate the statistical parameters of classes in a Bayesian decision rule; it is composed of a context-sensitive initialization and an iterative procedure for the statistical parameters, and different algorithms are used for unsupervised classification in multistage clustering. Paper [6] focuses on co-clustering and constrained clustering to improve performance, considering supervised and unsupervised constraints to increase effectiveness; the Expectation Maximization algorithm is used to optimize the model, with an NE extractor constructing document constraints from overlapping entities and WordNet providing word constraints based on semantic distance.
III. SYSTEM DESCRIPTION
1. Input
An input text file containing a set of sentences, probably from the same context, is fed to the system. Stop words like "the", "him" and "had", which do not contribute to understanding the main ideas present in the text, can be removed. This long text can then be pre-processed through the NLP phases. The distance matrix is calculated, and the Expectation Maximization clustering algorithm is implemented, which builds the document graph and generates the specific summary. Every node in a cluster maintains an association with every other node.

Figure 1. Phases of text analysis

This association (edge weight) must be greater than or equal to the cluster threshold value, which is taken as input from the user.

2. NLP Parser Engine

Figure 2. Phases of the NLP parser engine

2.1 Split Sentence
Recognizing the end of a sentence is not an easy task for a computer. The input data is split into separate sentences at newline characters and converted into an array of paragraphs using a split method. This can be done by treating each of the characters '.', '!', '?' as a separator rather than as a definite end-of-sentence marker.

2.2 Tokenization
Tokenization separates the input text into individual tokens. Punctuation marks, spaces and word terminators are the word-breaking characters.

2.3 Part-of-Speech Tagger
The POS tagger is applied for grammatical semantics. Part-of-speech tagging is applied after tokenization: the input to the tagging algorithm is the string of words of a natural language sentence, and the output is a single best POS tag for each word. Part-of-speech tags include:
NN: noun; DT: determiner; VBN: verb, past participle; IN: preposition; CC: coordinating conjunction; NNP: proper noun; DD: common determinant; VBD: verb, past tense.
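The paper does not name a toolkit for these phases, but as a sketch they map directly onto NLTK's standard calls (`sent_tokenize`, `word_tokenize`, `pos_tag`); the sample text is ours, and resource names may vary slightly across NLTK versions.

```python
import nltk
nltk.download('punkt', quiet=True)                       # sentence splitter model
nltk.download('averaged_perceptron_tagger', quiet=True)  # POS tagger model

text = "Cloud computing delivers services over the internet. It scales easily!"

# 2.1 Split: sentence boundaries via a trained tokenizer rather than raw '.', '!', '?'.
for sentence in nltk.sent_tokenize(text):
    # 2.2 Tokenize: break on punctuation, spaces and word terminators.
    tokens = nltk.word_tokenize(sentence)
    # 2.3 POS-tag: one best Penn Treebank tag (NN, DT, VBD, ...) per token.
    print(nltk.pos_tag(tokens))
```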
2.4 Chunker
Text chunking divides the text into parts of words, forming groups such as verb groups and noun groups. It divides the input text into such phrases and assigns each a type, such as NP for noun phrase, VP for verb phrase, or PP for prepositional phrase, with the chunk borders indicated by square brackets. Each word is assigned only one unique tag.
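A minimal chunker in this spirit, assuming NLTK's `RegexpParser` (the paper does not specify an implementation); the NP and VP patterns are illustrative:

```python
import nltk

tagged = [("the", "DT"), ("quick", "JJ"), ("dog", "NN"),
          ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]

# Cascade of chunk rules: an NP is an optional determiner, any adjectives and
# a noun; a VP is a verb followed by an NP. Brackets in the printed tree mark
# the chunk borders.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>}
  VP: {<VB.*><NP>}
"""
print(nltk.RegexpParser(grammar).parse(tagged))
```

The printed tree brackets "the quick dog" as an NP and "chased a cat" as a VP, with each word carrying its single tag.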
2.5 Parser
The parser generates the parse tree for a given sentence. Parsing converts an input sentence into a hierarchical structure that corresponds to the units of meaning in the sentence.
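For illustration only, a toy context-free grammar with NLTK's chart parser shows what such a hierarchical structure looks like; a real system would use a broad-coverage trained parser rather than a hand-written grammar like this one.

```python
import nltk

# Tiny hand-written grammar, sufficient for one example sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the' | 'a'
NN -> 'dog' | 'cat'
VBD -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased a cat".split()):
    print(tree)   # (S (NP the dog) (VP chased (NP a cat)))
```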
3. Build Distance Matrix
Lexical semantics begins with the recognition that a word is a conventional association between lexicalized concepts. Each pair of sentence nodes is compared through the WordNet dictionary to calculate a weight, and these weights populate the distance matrix.
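A sketch of how such a WordNet-based matrix might be built, assuming NLTK's WordNet interface and plain path similarity; averaging the best word-to-word scores is our simplification of "calculating the weight", not necessarily the paper's exact formula.

```python
import itertools
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet', quiet=True)

def word_sim(w1, w2):
    """Best WordNet path similarity over all sense pairs of two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_sim(s1, s2):
    """Average best word-to-word similarity: a rough edge weight."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    return sum(max(word_sim(a, b) for b in t2) for a in t1) / len(t1)

sentences = ["the cat sleeps", "a dog barks", "the kitten naps"]
n = len(sentences)
matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for i, j in itertools.combinations(range(n), 2):
    matrix[i][j] = matrix[j][i] = sentence_sim(sentences[i], sentences[j])
print(matrix)
```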
4. User Fires a Query
The distance matrix is the input to the clustering algorithm. Once the distance matrix giving the associativity of the different sentences is ready, the user can fire a query.
5. Clustering
5.1 Expectation Maximization Clustering Algorithm
The EM algorithm is an iterative procedure that consists of two alternating steps: an expectation step followed by a maximization step. The expectation is taken with respect to the unknown underlying variables, using the current estimates of the parameters and conditioned upon the observations. The maximization step provides new estimates of the parameters. At each iteration, the estimated parameters provide an increase in the maximum-likelihood (ML) function until a local maximum is achieved. Initialize k distribution parameters, each corresponding to a cluster center, and iterate between the two steps; whether the global maximum of the likelihood function is reached depends on the initial starting points.

Expectation step: assign points to clusters. Each point $x_i$ receives a membership weight for each cluster $C_k$:

$$ w_{ik} = \Pr(x_i \in C_k) = \frac{\Pr(x_i \mid C_k)}{\sum_{j=1}^{n} \Pr(x_i \mid C_j)} $$

Maximization step: estimate the model parameters that maximize the likelihood for the given assignment of points, e.g. the fraction of points belonging to cluster $C_k$:

$$ r_k = \frac{1}{n} \sum_{i=1}^{n} \frac{\Pr(x_i \in C_k)}{\sum_{j} \Pr(x_i \in C_j)} $$
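As a concrete sketch, scikit-learn's GaussianMixture runs exactly this alternation of E and M steps; here each sentence's row of the similarity matrix is treated as its feature vector. The matrix values are invented for illustration, and the paper does not state that a Gaussian mixture is the specific density model used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical distance-matrix output of step 3: entry (i, j) is the
# WordNet-based edge weight between sentence i and sentence j.
similarity = np.array([[1.0, 0.2, 0.8, 0.1],
                       [0.2, 1.0, 0.3, 0.7],
                       [0.8, 0.3, 1.0, 0.2],
                       [0.1, 0.7, 0.2, 1.0]])

gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
labels = gmm.fit_predict(similarity)              # hard cluster labels
memberships = gmm.predict_proba(similarity)       # the soft w_ik of the E step
print(labels)
print(memberships.round(2))
```

`predict_proba` returns the membership weights $w_{ik}$ computed in the expectation step; the hard labels simply pick the cluster with the largest membership.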
6. Build Document Graph
Depending upon the weights, sentence similarity can be found; the sentences with the maximum weight can be considered for the summary.

Figure 3. Document graph. The figure shows the document graph (similarity matrix) when clustercomputing.txt is uploaded as the input text file, together with the node table data. The node table contains each node number and node (sentence); the document graph shows the similarity between each node and every other node.

IV. RESULT

Figure 4. Graphical analysis with respect to time and space complexity.
Graphical analysis of time complexity in terms of computational cycles: Hierarchical: 304, EM: 13.
Graphical analysis of space complexity in terms of computational cycles: Hierarchical: 105, EM: 1.
V. CONCLUSION
In this work we presented a structure-based technique to create query-specific summaries for text documents. The clustering algorithm plays an important role in the space and time consumed to produce the result. Different clustering algorithms can be used in the same framework, and their performance can be measured in terms of result quality and execution time. We show with a user survey that our approach performs better than other state-of-the-art approaches.

VI. FUTURE WORK
In the future, we plan to extend our work to account for links between the documents of the dataset. We will also try to apply the same algorithm in different applications.

VII. REFERENCES
[1] "Count Data Modeling and Classification Using Finite Mixtures of Distributions", IEEE Transactions on Neural Networks, Vol. 22, No. 2, February 2011.
[2] "Inference for Probabilistic Unsupervised Text Clustering", IEEE, 2005.
[3] "Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm", IEEE Transactions on Knowledge and Data Engineering, 2011.
[4] "Synergy between Object Recognition and Image Segmentation Using the Expectation-Maximization Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 8, August 2009.
[5] "A Context-Sensitive Clustering Technique Based on Graph-Cut Initialization and Expectation-Maximization", IEEE Geoscience and Remote Sensing Letters, Vol. 5, No. 1, January 2008.
[6] "Constrained Text Co-clustering with Supervised and Unsupervised Constraints", IEEE Transactions on Knowledge and Data Engineering, 2012.