The volume of text resources in digital libraries and on the internet has been increasing steadily, making the organization of text documents a practical need. Clustering techniques are used to organize large numbers of objects automatically into a small number of coherent groups, and such document collections are widely used in information retrieval and natural language processing tasks. Clustering algorithms require a metric that quantifies how dissimilar two given documents are; this difference is usually computed with a similarity or distance measure such as Euclidean distance or cosine similarity. Analyzing the similarity measure used in a text mining task can help identify a suitable clustering algorithm for a specific problem. This survey discusses existing work on text similarity by partitioning it into three significant approaches: string-based, knowledge-based, and corpus-based similarity.
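The two measures the survey names can be sketched directly on bag-of-words term-frequency vectors. This is a minimal illustration of the idea, not code from the survey:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the two documents' term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_distance(doc_a: str, doc_b: str) -> float:
    """Euclidean distance between the same term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))
```

Cosine rewards shared terms regardless of document length, while Euclidean distance is length-sensitive, which is one reason cosine is the more common choice for document clustering.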
International Journal of Engineering Research and Development (IJERD) - IJERD Editor
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
An Application of Pattern Matching for Motif Identification - CSCJournals
Pattern matching is one of the central and most widely studied problems in theoretical computer science. Solutions to the problem play an important role in many areas of science and information processing, and their performance has a great impact on applications including database querying, text processing, and DNA sequence analysis. In general, pattern matching algorithms are characterized by the shift value, the direction of the sliding window, and the order in which comparisons are made. Performance can be enhanced to a great extent by a larger shift value and fewer comparisons needed to compute that shift. This paper proposes an algorithm for finding motifs in DNA sequences. The algorithm preprocesses the pattern string (motif) and, on a mismatch between the pattern (motif) and the DNA sequence, considers the four consecutive nucleotides that immediately follow the aligned pattern window. A theoretical analysis indicates that the proposed algorithm works efficiently for motif identification.
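The shift-value idea the abstract describes is the basis of classic matchers such as Boyer-Moore-Horspool. The sketch below is that classic algorithm, shown only to illustrate shift-based matching; it is not the paper's motif algorithm:

```python
def horspool_search(text: str, pattern: str) -> int:
    """Boyer-Moore-Horspool: slide a window over the text; on a mismatch,
    jump ahead by a precomputed bad-character shift instead of one position.
    Returns the index of the first match, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1 if m else 0
    # Preprocessing: how far the window may jump for each possible text char.
    shift = {c: m for c in set(text)}
    for i, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

A larger average shift means fewer windows examined, which is exactly the lever the abstract says such algorithms optimize.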
This document presents a semantic-based similarity measure (SBSM) algorithm to improve text clustering. The algorithm assigns semantic weights to document terms and phrases based on their use as arguments in Proposition Bank notation, and calculates the similarity between a document and a query by matching the weighted terms and phrases. Experimental results on a dataset show that SBSM with Proposition Bank notation outperforms traditional measures such as cosine and Jaccard similarity. By capturing semantic information within documents, the algorithm enables more accurate similarity assessment and clustering.
Seeds Affinity Propagation Based on Text Clustering - IJRES Journal
The objective is to find, among all partitions of the data set, the best partition according to some quality measure. Affinity propagation is a low-error, high-speed, flexible, and remarkably simple clustering algorithm that may be used for forming teams of participants in business simulations and experiential exercises, and for organizing participants' preferences over simulation parameters. This paper proposes an efficient affinity propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of the approach is (1) to prune unnecessary message exchanges during the iterations and (2) to compute the convergence values of pruned messages after the iterations in order to determine the clusters.
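For reference, the message exchanges that the paper prunes are the responsibility and availability updates of standard affinity propagation (Frey and Dueck). This is a minimal unpruned sketch of that baseline, not the paper's optimized variant:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    """Plain affinity propagation on a similarity matrix S: points exchange
    responsibility (R) and availability (A) messages until exemplars emerge.
    Returns, for each point, the index of its chosen exemplar."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    rows = np.arange(n)
    for _ in range(iters):
        # Responsibility r(i,k): s(i,k) minus the best competing candidate.
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[rows, idx].copy()
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * Rnew
        # Availability a(i,k): support accumulated from other points for k.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        colsum = Rp.sum(axis=0)
        Anew = np.minimum(0, colsum[None, :] - Rp)
        np.fill_diagonal(Anew, colsum - Rp.diagonal())
        A = damping * A + (1 - damping) * Anew
    return (A + R).argmax(axis=1)
```

A typical usage sets the diagonal of S (the "preference") to the median similarity, which controls how many exemplars, and hence clusters, emerge.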
Semantic similarity, and semantic relatedness measures in particular, are very important at present because of the huge demand for natural-language-processing applications such as chatbots and information retrieval systems, including knowledge-base-backed FAQ systems. Current approaches generally use similarity measures that ignore the context-sensitive relationships between words, which leads to erroneous similarity predictions of little use in real-life applications. This work proposes a novel approach that produces an accurate relatedness measure for any two words in a sentence by taking their context into consideration. This context correction yields more accurate similarity predictions and therefore higher accuracy in information retrieval systems.
Chat Bot Using a Text Similarity Approach - dinesh_joshy
1. There are three main techniques for chat bots to generate responses: static responses using templates, dynamic responses that score candidate responses from a knowledge base, and generated responses that use deep learning to produce novel replies from training data.
2. Text similarity can be measured using string-based, corpus-based, or knowledge-based approaches. String-based measures operate on character sequences, corpus-based measures use word co-occurrence statistics, and knowledge-based measures use information from semantic networks such as WordNet.
3. Popular corpus-based measures include LSA, ESA, and PMI-IR, which analyze word contexts and co-occurrences in corpora. Knowledge-based measures include Resnik, Lin, and Leacock-Chodorow.
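Of the three families above, the string-based one is the simplest to show concretely. Levenshtein (edit) distance is a standard string-based measure; this sketch is illustrative and not taken from the presentation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of insertions, deletions, and
    substitutions turning a into b, via dynamic programming over prefixes."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # substitute
        prev = curr
    return prev[len(b)]
```

Normalizing by the longer string's length turns the distance into a similarity in [0, 1], which makes it comparable with the other measure families.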
A Survey of String Matching Algorithms - IJERA Editor
String matching algorithms play an important role in finding the places where one or several strings (patterns) occur within a larger body of text (e.g., a data stream, a sentence, a paragraph, or a book). Their applications cover a wide range, including intrusion detection systems (IDS) in computer networks, bioinformatics, plagiarism detection, information security, pattern recognition, document matching, and text mining. This paper presents a short survey of well-known, recently updated, and hybrid string matching algorithms. These algorithms fall into two major categories: exact string matching and approximate string matching. The classification criteria were chosen to highlight important features of the matching strategies in order to identify challenges and vulnerabilities.
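Knuth-Morris-Pratt is the canonical example of the exact-matching category such surveys cover. This sketch is a standard textbook implementation, included only to make the category concrete:

```python
def kmp_search(text: str, pattern: str) -> list:
    """Knuth-Morris-Pratt exact matching: scan the text once, using a
    prefix (failure) table to avoid re-examining characters.
    Returns all starting positions of the pattern."""
    if not pattern:
        return []
    # fail[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    out, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            out.append(i - k + 1)
            k = fail[k - 1]       # continue searching for overlapping matches
    return out
```

Approximate matching, the survey's other category, instead tolerates a bounded number of edits between pattern and text, as in the edit-distance family of measures.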
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Sentence Validation by Statistical Language Modeling and Semantic Relations - Editor IJCATR
This paper deals with sentence validation, a sub-field of natural language processing. It has applications in many areas because it involves understanding and manipulating natural language (English in most cases); the effort is directed at understanding and extracting the important information delivered to the computer, making efficient human-computer interaction possible. Sentence validation is approached in two ways: a statistical approach and a semantic approach. In both, a database is trained on sample sentences from the Brown corpus of NLTK. The statistical approach uses a trigram technique based on the N-gram Markov model, with modified Kneser-Ney smoothing to handle zero probabilities. As a second statistical test, sentences containing named entities are tagged and chunked using predefined grammar rules and semantic tree parsing, and the chunked sentences are fed into another database on which testing is carried out. Finally, semantic analysis is performed by extracting entity-relation pairs, which are then tested. After the results of all three approaches are compiled, graphs are plotted and the variations are studied; a comparison of the three models is thus calculated and formulated. Graphs of the probabilities produced by the three approaches clearly demarcate them and illuminate the project's findings.
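The trigram model at the core of the statistical approach reduces to counting trigrams and their bigram contexts. The sketch below uses simple add-k smoothing as a stand-in for the modified Kneser-Ney smoothing the paper uses (which is considerably more involved); it is illustrative only:

```python
from collections import defaultdict

def train_trigrams(sentences):
    """Count trigrams and their bigram contexts over tokenized sentences,
    padding each sentence with boundary markers."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(toks) - 2):
            bi[(toks[i], toks[i + 1])] += 1
            tri[(toks[i], toks[i + 1], toks[i + 2])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3, vocab_size, k=0.1):
    """Add-k smoothed P(w3 | w1, w2) - a simplified placeholder for the
    modified Kneser-Ney smoothing described in the paper."""
    return (tri[(w1, w2, w3)] + k) / (bi[(w1, w2)] + k * vocab_size)
```

A sentence's validity score under this model is the product (or log-sum) of its trigram probabilities; smoothing keeps unseen trigrams from zeroing out the whole sentence.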
The document presents a knowledge-based method for measuring semantic similarity between texts. It combines word-to-word semantic similarity metrics with information about word specificity to calculate a text-to-text similarity score. An example application shows how word similarity scores from WordNet are combined using the Wu & Palmer metric to determine the semantic similarity between two text segments. The method is evaluated on paraphrase identification tasks and shown to outperform approaches based only on lexical matching.
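The combination the summary describes can be written down explicitly. As a hedged reconstruction following the Corley-Mihalcea formulation this line of work uses (symbols are standard in that literature, not quoted from the document), the text-to-text score averages the best word-level matches in each direction, weighted by specificity (idf):

```latex
\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left(
  \frac{\sum_{w \in T_1} \max\mathrm{Sim}(w, T_2)\,\mathrm{idf}(w)}
       {\sum_{w \in T_1} \mathrm{idf}(w)}
  +
  \frac{\sum_{w \in T_2} \max\mathrm{Sim}(w, T_1)\,\mathrm{idf}(w)}
       {\sum_{w \in T_2} \mathrm{idf}(w)}
\right)
```

where the word-to-word score $\max\mathrm{Sim}$ can be the Wu and Palmer metric mentioned above, $\mathrm{sim}_{wup}(c_1, c_2) = \dfrac{2\,\mathrm{depth}(\mathrm{lcs}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}$, computed over the WordNet hierarchy with $\mathrm{lcs}$ the least common subsumer of the two concepts.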
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... - ijnlc
The tremendous increase in the number of available research documents impels researchers to propose topic models that extract the latent semantic themes of a document collection. How to extract the hidden topics of a collection has become a crucial task for many topic model applications, and conventional topic modeling approaches suffer from scalability problems as the size of the collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments demonstrate the performance of the proposed algorithm: in the evaluation, it achieves topic coherence comparable to LDA implemented in the same MapReduce framework.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Much previous research has shown that the use of rhetorical relations can enhance applications such as text summarization, question answering, and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to the redundancy problem in cluster-based text summarization of multiple documents. We exploit the rhetorical relations that exist between sentences to group similar sentences into clusters that identify themes of common information, and candidate summary sentences are extracted from these clusters. Cluster-based summarization is then performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidates. We evaluated the method by measuring the cohesion and separation of the clusters constructed from rhetorical relations, and the ROUGE scores of the generated summaries. The experimental results show that the method performs well, demonstrating the promising potential of applying rhetorical relations to text clustering for multi-document summarization.
Latent Semantic Word Sense Disambiguation Using Global Co-Occurrence Information - csandit
The document describes a novel word sense disambiguation method using global co-occurrence information and non-negative matrix factorization. It proposes using global co-occurrence frequencies between words in a large corpus rather than local context windows to address data sparseness issues. An experiment compares the method to two baselines and shows it achieves the highest and most stable precision for word sense disambiguation.
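Non-negative matrix factorization, the decomposition the method relies on, is commonly computed with Lee-Seung multiplicative updates. The sketch below is that generic procedure applied to an arbitrary non-negative matrix (in the paper's setting, a global word co-occurrence matrix); it is not the paper's full disambiguation pipeline:

```python
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V ~= W @ H using Lee-Seung multiplicative
    updates for the Frobenius-norm objective. Both factors stay non-negative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        # Each update multiplies by a ratio that is 1 at a stationary point.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For word sense disambiguation, the rows of W give each word a non-negative mixture over latent "topics", and a word occurrence is disambiguated by comparing its context against those mixtures.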
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Textual Document Categorization Using Bigram Maximum Likelihood and KNN - Rounak Dhaneriya
In recent years, text mining has evolved into a vast field of research in machine learning and artificial intelligence. Text mining is difficult to conduct on data in an unstructured format. This research work focuses on classifying textual data drawn from three works of literature: Oliver Twist, Don Quixote, and Pride and Prejudice. We used two algorithms, KNN and bigram-based maximum likelihood, for this purpose, and evaluated accuracy using a confusion matrix. The results suggest that text classification using bigram-based maximum likelihood performs well.
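The bigram maximum-likelihood idea can be sketched as follows: train one bigram language model per class (here per book) and assign a new passage to the class whose model gives it the highest likelihood. This is an illustrative reconstruction with add-one smoothing, not the paper's exact implementation:

```python
import math
from collections import defaultdict

class BigramClassifier:
    """Assign a text to the class whose add-one-smoothed bigram language
    model gives it the highest log-likelihood."""

    def __init__(self):
        self.bigrams = {}   # class -> bigram counts
        self.unigrams = {}  # class -> context (first-word) counts
        self.vocab = set()

    def fit(self, labeled_texts):
        for label, text in labeled_texts:
            bg = self.bigrams.setdefault(label, defaultdict(int))
            ug = self.unigrams.setdefault(label, defaultdict(int))
            toks = text.lower().split()
            self.vocab.update(toks)
            for a, b in zip(toks, toks[1:]):
                bg[(a, b)] += 1
                ug[a] += 1

    def predict(self, text):
        toks = text.lower().split()
        V = len(self.vocab) + 1  # +1 for unseen words

        def log_likelihood(label):
            bg, ug = self.bigrams[label], self.unigrams[label]
            return sum(math.log((bg[(a, b)] + 1) / (ug[a] + V))
                       for a, b in zip(toks, toks[1:]))

        return max(self.bigrams, key=log_likelihood)
```

Smoothing matters here: without it, a single bigram unseen in a class's training text would drive that class's likelihood to zero.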
The document discusses methods for statistical spelling correction based on decision theory. It defines the spelling correction problem and previous approaches. It then presents two new methods: 1) Evaluating the error rate per sentence using Bayes optimal and approximate algorithms based on dynamic programming. 2) Evaluating the error rate per word also using Bayes optimal and approximate algorithms like the BCJR algorithm. The goal is to minimize error rates with computational efficiency for real-world applications.
An automatic text summarization using lexical cohesion and correlation of sen... - eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
An Index Based K-Partitions Multiple Pattern Matching Algorithm - IDES Editor
The study of pattern matching is one of the fundamental applications and emerging areas in computational biology, and searching DNA-related data is a common activity for molecular biologists. This paper explores the applicability of a new pattern matching technique, the Index-Based K-Partition Multiple Pattern Matching algorithm (IKPMPM), to DNA sequences. The approach avoids unnecessary comparisons in the DNA sequence; as a result, the number of comparisons gradually decreases, and the comparisons-per-character ratio of the proposed algorithm falls accordingly when compared to other popular existing methods. The experimental results show a considerable performance improvement.
Extractive summarization selects objects from the entire collection without modifying the objects themselves; for text, the goal is to select whole sentences without modifying them.
The following are the questions this presentation tries to answer:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are the business applications?
The first few slides explain extractive summarization with its pros and cons; the next section explains abstractive summarization. The last section highlights business applications of each.
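The simplest extractive method the slides cover can be sketched directly: score each sentence by the frequency of its content words and keep the top scorers in their original order. This is a generic frequency-based recipe, not code from the presentation; the small stopword set is an illustrative assumption:

```python
import re
from collections import Counter

def extractive_summary(text: str, n: int = 2) -> str:
    """Frequency-based extractive summarization: rank sentences by the
    average corpus frequency of their non-stopword tokens, keep the top n."""
    stop = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "for"}
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop]
    freq = Counter(words)

    def score(sent):
        toks = [w for w in re.findall(r"[a-z']+", sent.lower()) if w not in stop]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    top = sorted(sorted(sentences, key=score, reverse=True)[:n],
                 key=sentences.index)  # restore original document order
    return " ".join(top)
```

Because the output sentences are copied verbatim, this method inherits the pros (fluency, faithfulness) and cons (redundancy, no paraphrase) the slides attribute to extractive approaches.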
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHS - ijcsit
The paper proposes a method to check whether two documents exhibit textual plagiarism. The technique is based on extrinsic plagiarism detection and is quite similar to one used to grade short answers. Triplets and their associated information are extracted from both texts and stored in friendship matrices. The two friendship matrices are then compared and a similarity percentage is calculated, which is used to make the plagiarism decision.
This is a short presentation that explains the well-known TextRank paper, which used graphs to produce summaries and document indices (keywords).
Link to paper: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
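The keyword-extraction side of TextRank can be sketched in a few lines: build a co-occurrence graph over the token sequence and run PageRank on it. This is a simplified reconstruction (weighted edges, no part-of-speech filtering, plain power iteration), not the paper's exact setup:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=50, top=3):
    """TextRank for keywords: co-occurrence graph over the token sequence,
    weighted PageRank, highest-scoring words returned."""
    weight = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:                 # no self-loops
                weight[w][words[j]] += 1.0
                weight[words[j]][w] += 1.0
    out = {u: sum(weight[u].values()) for u in weight}
    rank = {u: 1.0 for u in weight}
    for _ in range(iters):
        rank = {v: (1 - damping) + damping * sum(rank[u] * weight[u][v] / out[u]
                                                 for u in weight[v])
                for v in weight}
    return sorted(rank, key=rank.get, reverse=True)[:top]
```

Running the same ranking over a sentence-similarity graph instead of a word graph gives the summarization variant the presentation also covers.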
Based on the Influence Factors in the Heterogeneous Network t-path Similarity... - IJRESJOURNAL
ABSTRACT: Existing meta-path-based similarity calculation methods for heterogeneous networks largely depend on direct links and ignore both the influence factors of different entities and the time differences between different paths. This paper proposes Hete-DS, a similarity algorithm for heterogeneous networks based on the time-path influence factors. Building on the HeteRecom algorithm, it compensates for HeteRecom's consideration of only path-type weights by incorporating a time factor into the weight computation of object relations along a path, strengthening the effectiveness of the path weights. At the same time, an object-relation matrix is established from the influence factors of different entities, accounting for differences between nodes linked under the same path type. Experimental results on multiple data sets show that Hete-DS achieves higher accuracy than the HeteRecom algorithm.
The document presents a new ontology matching system based on a multi-agent architecture. The system takes ontologies described in XML, RDF Schema, and OWL as input. It uses multiple matchers and filtering to generate mappings between ontology entities. The mappings are then validated. The system is implemented as a multi-agent system with different agent types responsible for resources, matching, generating mappings, and filtering/validating mappings. The architecture allows for robust, flexible, and scalable ontology matching.
Sentence Validation by Statistical Language Modeling and Semantic RelationsEditor IJCATR
This paper deals with Sentence Validation - a sub-field of Natural Language Processing. It finds various applications in
different areas as it deals with understanding the natural language (English in most cases) and manipulating it. So the effort is on
understanding and extracting important information delivered to the computer and make possible efficient human computer
interaction. Sentence Validation is approached in two ways - by Statistical approach and Semantic approach. In both approaches
database is trained with the help of sample sentences of Brown corpus of NLTK. The statistical approach uses trigram technique based
on N-gram Markov Model and modified Kneser-Ney Smoothing to handle zero probabilities. As another testing on statistical basis,
tagging and chunking of the sentences having named entities is carried out using pre-defined grammar rules and semantic tree parsing,
and chunked off sentences are fed into another database, upon which testing is carried out. Finally, semantic analysis is carried out by
extracting entity relation pairs which are then tested. After the results of all three approaches is compiled, graphs are plotted and
variations are studied. Hence, a comparison of three different models is calculated and formulated. Graphs pertaining to the
probabilities of the three approaches are plotted, which clearly demarcate them and throw light on the findings of the project.
The document presents a knowledge-based method for measuring semantic similarity between texts. It combines word-to-word semantic similarity metrics with information about word specificity to calculate a text-to-text similarity score. An example application shows how word similarity scores from WordNet are combined using the Wu & Palmer metric to determine the semantic similarity between two text segments. The method is evaluated on paraphrase identification tasks and shown to outperform approaches based only on lexical matching.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a documents collection. However, how to extract the hidden topics of the documents collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from the scalability problem when the size of documents collection increases. In this paper, the Correlated Topic Model with variational ExpectationMaximization algorithm is implemented in MapReduce framework to solve the scalability problem. The proposed approach utilizes the dataset crawled from the public digital library. In addition, the full-texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. The experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has a comparable performance in terms of topic coherences with LDA implemented in MapReduce framework.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Many of previous research have proven that the usage of rhetorical relations is capable to enhance many applications such as text summarization, question answering and natural language generation. This work proposes an approach that expands the benefit of rhetorical relations to address redundancy problem for cluster-based text summarization of multiple documents. We exploited rhetorical relations exist between sentences to group similar sentences into multiple clusters to identify themes of common information. The candidate summary were extracted from these clusters. Then, cluster-based text summarization is performed using Conditional Markov Random Walk Model to measure the saliency scores of the candidate summary. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations and ROUGE score of generated summaries. The experimental result shows that our method performed well which shows promising potential of applying rhetorical relation in text clustering which benefits text summarization of multiple documents.
Latent Semantic Word Sense Disambiguation Using Global Co-Occurrence Informationcsandit
The document describes a novel word sense disambiguation method using global co-occurrence information and non-negative matrix factorization. It proposes using global co-occurrence frequencies between words in a large corpus rather than local context windows to address data sparseness issues. An experiment compares the method to two baselines and shows it achieves the highest and most stable precision for word sense disambiguation.
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Textual Document Categorization using Bigram Maximum Likelihood and KNNRounak Dhaneriya
In recent years, text mining has evolved into a vast field of research in machine learning and artificial intelligence. Text mining is a difficult task to conduct on data in an unstructured format. This research work focuses on the classification of textual data from three literature books: Oliver Twist, Don Quixote, and Pride and Prejudice. We use two different algorithms, KNN and bigram-based maximum likelihood, for this purpose, and evaluate accuracy using a confusion matrix. The results suggest that text classification using bigram-based maximum likelihood performs well.
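As a hedged illustration of the bigram maximum-likelihood idea described above (not the authors' code — the toy corpora, add-one smoothing, and tokenization are all assumptions for illustration), each class can be scored by the smoothed bigram log-likelihood of a passage under that book's language model:

```python
from collections import defaultdict
import math

def train_bigram_model(sentences):
    """Count bigram and unigram frequencies for one class."""
    bigrams, unigrams = defaultdict(int), defaultdict(int)
    for tokens in sentences:
        padded = ["<s>"] + tokens
        for prev, curr in zip(padded, padded[1:]):
            bigrams[(prev, curr)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams

def log_likelihood(tokens, model, vocab_size, alpha=1.0):
    """Add-one-smoothed bigram log-likelihood of a token sequence."""
    bigrams, unigrams = model
    padded = ["<s>"] + tokens
    score = 0.0
    for prev, curr in zip(padded, padded[1:]):
        score += math.log((bigrams[(prev, curr)] + alpha) /
                          (unigrams[prev] + alpha * vocab_size))
    return score

# Toy corpora standing in for passages from the books (invented examples)
corpus = {
    "oliver_twist": [["please", "sir", "i", "want", "some", "more"]],
    "don_quixote":  [["windmills", "are", "giants", "to", "me"]],
}
vocab = {w for sents in corpus.values() for s in sents for w in s} | {"<s>"}
models = {label: train_bigram_model(sents) for label, sents in corpus.items()}

def classify(tokens):
    """Pick the class whose bigram model gives the highest likelihood."""
    return max(models, key=lambda c: log_likelihood(tokens, models[c], len(vocab)))

print(classify(["i", "want", "some", "more"]))  # → oliver_twist
```

A real system would train on the full book texts and hold out test passages, but the scoring rule is the same.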
The document discusses methods for statistical spelling correction based on decision theory. It defines the spelling correction problem and previous approaches. It then presents two new methods: 1) Evaluating the error rate per sentence using Bayes optimal and approximate algorithms based on dynamic programming. 2) Evaluating the error rate per word also using Bayes optimal and approximate algorithms like the BCJR algorithm. The goal is to minimize error rates with computational efficiency for real-world applications.
An automatic text summarization using lexical cohesion and correlation of sen...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
An Index Based K-Partitions Multiple Pattern Matching AlgorithmIDES Editor
The study of pattern matching is one of the fundamental applications and emerging areas in computational biology. Searching DNA-related data is a common activity for molecular biologists. In this paper we explore the applicability of a new pattern matching technique, the Index-Based K-Partitions Multiple Pattern Matching algorithm (IKPMPM), to DNA sequences. The approach avoids unnecessary comparisons in the DNA sequence; as a result, the number of comparisons decreases and the comparisons-per-character ratio of the proposed algorithm is lower than that of other popular existing methods. The experimental results show a considerable performance improvement.
Extractive approaches select objects from the entire collection without modifying the objects themselves; the goal is to select whole sentences as they are.
Following are the questions which I tried to answer in this ppt
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are different methods?
How is deep learning used for text summarization?
Business applications
The first few slides explain extractive summarization with its pros and cons; the next section explains abstractive summarization. The last section highlights business applications of each.
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHSijcsit
The paper proposes a method to check whether two documents exhibit textual plagiarism. The technique is based on extrinsic plagiarism detection and is quite similar to the one used to grade short answers. Triplets and associated information are extracted from both texts and stored in friendship matrices. The two friendship matrices are then compared and a similarity percentage is calculated, which is used to make the decision.
This is a short presentation that explains the famous TextRank papers that used graphs to produce summaries and document indices (keywords).
Link to paper : https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
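A minimal sketch of the TextRank idea from the linked paper, assuming a sliding-window word co-occurrence graph and plain power-iteration PageRank (the window size, damping factor, and iteration count are illustrative choices, not the paper's exact settings):

```python
from collections import defaultdict

def textrank(tokens, window=2, d=0.85, iters=50):
    """Score keywords: build an undirected co-occurrence graph over a
    sliding window, then rank nodes by power-iteration PageRank."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                graph[w].add(tokens[j])
                graph[tokens[j]].add(w)
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        # New scores are computed from the previous iteration's scores.
        scores = {w: (1 - d) + d * sum(scores[v] / len(graph[v])
                                       for v in graph[w])
                  for w in graph}
    # Words sorted from highest to lowest score = keyword candidates.
    return sorted(scores, key=scores.get, reverse=True)

words = "graph based ranking model for text processing graph ranking".split()
print(textrank(words)[:3])
```

In the paper the same ranking over a sentence-similarity graph yields extractive summaries; only the graph construction changes.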
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Based on the Influence Factors in the Heterogeneous Network t-path Similarity...IJRESJOURNAL
ABSTRACT: Existing meta-path-based similarity measures for heterogeneous networks depend largely on direct links, ignoring the influence factors of different entities and the time differences between paths. This paper proposes Hete-DS, a similarity algorithm for heterogeneous networks based on influence factors and path timing. Building on the HeteRecom algorithm, which considers only the weights of different path types, Hete-DS incorporates a time factor into the computation of path weights, strengthening their effectiveness. At the same time, an object-relation matrix built from the influence factors of different entities accounts for the differences between nodes linked under the same path type. Experimental results on multiple datasets show that Hete-DS achieves higher accuracy than HeteRecom.
The document presents a new ontology matching system based on a multi-agent architecture. The system takes ontologies described in XML, RDF Schema, and OWL as input. It uses multiple matchers and filtering to generate mappings between ontology entities. The mappings are then validated. The system is implemented as a multi-agent system with different agent types responsible for resources, matching, generating mappings, and filtering/validating mappings. The architecture allows for robust, flexible, and scalable ontology matching.
The document discusses different types of pollution that harm the environment, including air pollution from vehicle exhaust and coal combustion, water pollution from discharge without proper treatment, and how pollution threatens both the Earth and humanity. It urges everyone to help clean the environment as it is our shared home.
This document discusses how certain technological developments may threaten the planet's survival. It argues that non-biodegradable manufacturing like plastic packaging could take over a millennium to decompose, polluting the environment. Dependence on non-renewable energy sources such as coal, oil and gas risks depleting them within 100 years, while their extraction and use releases pollution and greenhouse gases. Nuclear power poses radiation risks illustrated by Chernobyl, and could become more dangerous if nuclear weapons are used in war, with impacts like mass casualties and long-term land contamination. The proliferation of hard-to-control technologies like biological weapons may also threaten mass extinctions of life.
This document discusses several causes of environmental problems including air pollution and waste from increasing population levels. It suggests governments introduce laws to limit emissions and promote renewable energy as well as impose green taxes. Individuals can take public transport, reduce packaging, and recycle more. Both governments and individuals must work together to address these issues.
Herpes zoster, also known as shingles, is caused by the reactivation of the varicella zoster virus which causes chickenpox. It presents with severe pain and a rash of grouped vesicles in a dermatomal distribution, most commonly on the thoracic region. Complications can include secondary bacterial infection, disseminated lesions, herpes zoster ophthalmicus, and post-herpetic neuralgia. Treatment involves antiviral medications like acyclovir or famciclovir to reduce symptoms and duration.
The document discusses prostaglandin production in ovine placentas. It finds that PGHS-1 is expressed in trophoblast epithelial cells and weakly in maternal tissues, while PGHS-2 expression increases specifically in trophoblast cells near term and during labor. Glucocorticoid treatment and spontaneous labor both significantly increase PGHS-2 levels in trophoblasts, indicating its role in elevated prostaglandin production during labor. The study elucidates the cellular localization of PGHS isozymes in ovine placentas and how their expression changes with gestation and the onset of labor.
A game-based blended learning platform allows teachers to create Kahoot quizzes and games for students to play to test their knowledge on class topics. Students join the game by entering a pin code displayed by the teacher and answer multiple choice questions on their own devices, competing to be the top scorer. The platform provides instant feedback on student performance through scores, rankings, and optional post-game surveys to gauge learning and engagement.
In the early days, the main emphasis was on the cognitive aspects of learning and traditional classroom instruction using outdated, conventional techniques. Today, in this world of constant innovation and discovery, scientists and gadget experts are continuously producing new technological devices. No doubt technology has made our lives much easier and better in many respects. In developed countries, technology facilitates and helps students and teachers learn in more effective ways, but in a country like India the development of technology is not yet up to that mark; we are still moving along the path of progress. Thus, this paper describes a conceptual framework for futuristic studies of future technologies such as M-Learning, E-Learning, iPod- and iPad-based self-efficacy learning, and Virtual Learning Environments (VLE). The investigator highlights studies related to trends in futurology and innovations that could prove an important aspect of educational technology.
This document provides 5 tips for saving money: 1) Save loose change in a jar and convert it to bills for free at the bank. 2) Use a 30-day rule to avoid impulse purchases by waiting 30 days to buy something expensive. 3) Sign up for a savings account at a bank to earn interest and track spending online. 4) Ask your employer to divide paychecks between checking and savings accounts. 5) Create a budget and savings accounts for different expenditures like vacations or emergencies. The tips helped the author save over $400 in loose change over 1.5 years and set up separate accounts for specific purposes.
This document provides information about a study conducted on customer satisfaction with the Maruti Suzuki Ertiga car. It includes the following key points:
1. An introduction to the Ertiga car, mentioning its launch date in India and competitors. It is powered by 1.4L petrol and 1.3L diesel engines.
2. The objectives of the study are to understand customer satisfaction with dealer services, spare part availability and costs, car features like mileage and comfort, and the effect of advertisements.
3. The conclusion indicates that most customers recommend the Ertiga based on their positive brand preference and satisfaction levels. People are influenced in their purchase decision by friends and relatives as well as services and
The document discusses an automated system of administration (ASA SPC) for suburban passenger companies developed by ZAO "Group of companies ISKANDER". The system allows for automated control of processes from ticket sales to analytical reporting. It includes modules for ticket sales via terminals, mobile services, and online. The system provides benefits like reduced queues and optimized schedules for passengers and optimization of costs and revenue for carriers. The developer plans further expansions like integrating multiple agencies and developing additional modules.
AutoCAD is computer-aided design software used by architects, engineers, and designers. It allows 2D digital drawing and the creation of 3D images. It requires a processor of at least 3.0 GHz, 2 GB of RAM, and 6 GB of disk space, and runs on Windows 7/8 with a 1024x768-pixel display. It includes menu bars, toolbars, and windows for producing drawings.
This document contains a list of English words and their Spanish translations. Some of the words translated include:
- "Though" translated to "Aunque"
- "Van" translated to "Camioneta"
- "About" translated to "Acerca de"
Educational technologies can be used to share content as slides. A team of four people, Bruna Benezatto, Edinelza Venâncio da Costa, Mona Priantti Rabelo, and Rosineide Bento, created a lesson on SlideShare to teach others about educational technologies.
Say Goodbye To Boring Dates - 10 Fun Dating Ideasdsmith76
Creative dates help you bond with your partner, problem-solve, and create the sort of memories you can chew on when you need a pick-me-up.
Here are 10 date ideas. Try one this weekend!!
This document summarizes information about Ebola virus. It describes how Peter Piot received a thermos of blood from a nun who had fallen ill with a mysterious disease in Congo in 1976. Piot and his colleagues studied the epidemic and found clues about how the virus spreads. The recent 2014 outbreak in West Africa caused over 6000 deaths. The document provides details on the structure and life cycle of Ebola virus, how it is transmitted from animals to humans and between humans, symptoms of the disease, and challenges in developing treatments.
Educational technologies can be used to share content in slide format. A team of four people created a lesson on SlideShare to share their ideas about the use of technology in teaching.
The Internet of Things refers to uniquely identifiable physical objects connected to the internet, enabling control and data sharing over a network. The term was first proposed by Kevin Ashton in 1999 to describe the concept of objects connected to the internet and interacting with one another. Core IoT technologies include wireless sensor networks, RFID, and the integration of the digital and physical worlds to …
Seismic waves are elastic waves that travel through the Earth's layers as a result of earthquakes, explosions, or volcanic eruptions. There are three main types of seismic waves: primary waves, which cause back-and-forth motion in the direction of travel; secondary waves, which cause changes in shape; and surface waves such as Love waves, which form at surfaces and differ from body waves.
This document summarizes changes to MNREGS implementation in Samastipur, India between 2011 and the present. It describes how the program shifted from a top-down, compliance-focused approach to a more participatory one centered around measuring social and livelihood impacts. Key accomplishments included connecting over 350 isolated villages and remote fields through new infrastructure like roads, bridges, and plantations, improving rural connectivity and livelihoods.
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...acijjournal
Representation of the semantic information contained in words is needed for any Arabic text mining application. More precisely, the purpose is to better take into account the semantic dependencies between words, expressed by their co-occurrence frequencies. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, the Root-based (stemming) and Stem-based (light stemming) approaches, for measuring the similarity between Arabic words with the well-known model Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures such as Euclidean distance, cosine similarity, the Jaccard coefficient, and the Pearson correlation coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the Stem-based approach outperforms the Root-based one, because the latter affects word meanings.
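The distance functions and similarity measures named in this abstract all have short closed forms. A hedged pure-Python sketch over hypothetical term-frequency vectors (the two vectors are invented for illustration, not data from the paper) might look like:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard(u, v):
    """Generalized Jaccard coefficient for weighted vectors."""
    inter = sum(min(a, b) for a, b in zip(u, v))
    union = sum(max(a, b) for a, b in zip(u, v))
    return inter / union

def pearson(u, v):
    """Pearson correlation: covariance over product of std deviations."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Hypothetical term-frequency vectors for two stemmed words over
# a shared set of context documents.
w1 = [2, 0, 1, 3]
w2 = [1, 1, 0, 2]

print(round(cosine(w1, w2), 3))  # → 0.873
```

In an LSA setting these functions would be applied to the rows of the reduced term matrix rather than to raw counts.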
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field of Natural Language Processing. This report sets out to examine the numerous document similarity algorithms and determine which ones are the most useful. It addresses the question by categorizing them into three types: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are then compared using a series of benchmark datasets and evaluations that test every area in which each algorithm could be used.
This document presents a semantic-based similarity measure (SBSM) algorithm to improve text clustering. The algorithm assigns semantic weights to document terms and phrases based on their use as arguments in Proposition Bank notation, and calculates the similarity between a document and a query by matching the weighted terms and phrases. Experimental results on a dataset show that SBSM with Proposition Bank notation achieves better performance than traditional similarity measures for text clustering.
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science, and technology, covering new teaching methods, assessment, validation, and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
In this study, a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction, Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA), is performed and gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques in reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than LSI and PCA; the results show that clustering of features improves the accuracy of document classification.
The document describes a system for semantic textual similarity (STS) that uses various techniques to estimate the semantic similarity between texts. The system combines lexical, syntactic, and semantic information sources using state-of-the-art algorithms. In SemEval 2016 tasks, the system achieved a mean Pearson correlation of 75.7% on the monolingual English task and 86.3% on the cross-lingual Spanish-English task, ranking first in the cross-lingual task. The system utilizes techniques such as word embeddings, paragraph vectors, tree-structured LSTMs, and word alignment to capture semantic similarity.
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
This document discusses methods for calculating semantic similarity between terms in ontologies. It summarizes the Wu-Palmer algorithm, which calculates similarity based on the depth of terms from their closest common ancestor. The document also describes a modified "tbk" algorithm that adds a penalization factor for terms in neighboring hierarchies to address limitations of Wu-Palmer. The paper proposes a new algorithm that calculates similarity based on the direct distance between terms rather than their distance from the root node. It argues this approach could provide better results than existing edge-based algorithms like Wu-Palmer and tbk.
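The Wu-Palmer measure summarized above has a compact definition: sim(a, b) = 2 · depth(LCS(a, b)) / (depth(a) + depth(b)), where LCS is the closest common ancestor. A sketch over a toy is-a hierarchy (the taxonomy and node names are hypothetical, not from the paper):

```python
def depth(node, parent):
    """Depth of a node measured from the root (the root has depth 1)."""
    d = 1
    while node in parent:
        node = parent[node]
        d += 1
    return d

def lcs(a, b, parent):
    """Lowest common subsumer: collect a's ancestors (including a),
    then walk up from b until a shared node is found."""
    ancestors = {a}
    n = a
    while n in parent:
        n = parent[n]
        ancestors.add(n)
    n = b
    while n not in ancestors:
        n = parent[n]
    return n

def wu_palmer(a, b, parent):
    c = lcs(a, b, parent)
    return 2 * depth(c, parent) / (depth(a, parent) + depth(b, parent))

# Toy is-a hierarchy, child -> parent (for illustration only)
parent = {"dog": "canine", "wolf": "canine", "canine": "mammal",
          "cat": "feline", "feline": "mammal", "mammal": "animal"}

print(wu_palmer("dog", "wolf", parent))  # LCS is "canine" → 0.75
```

Siblings with a deep common ancestor score high (dog/wolf → 0.75), while terms that only meet near the root score lower (dog/cat → 0.5), which is exactly the limitation the "tbk" penalization factor targets.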
A Survey on Unsupervised Graph-based Word Sense DisambiguationElena-Oana Tabaranu
This paper presents comparative evaluations of graph-based word sense disambiguation techniques using several measures of word semantic similarity and several ranking algorithms. Unsupervised word sense disambiguation has received a lot of attention lately because of its fast execution time and its ability to make the most of a small input corpus. Recent state-of-the-art graph-based systems have tried to close the gap between the supervised and unsupervised approaches.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...kevig
Combined cosine-linear regression model similarity with application to handwr...IJECEIAES
This document presents a combined cosine-linear regression model for calculating similarity between handwritten word images. It first provides an overview of commonly used similarity and distance measures such as Euclidean, Manhattan, Minkowski, cosine, Jaccard, and Chebyshev distances. It then compares the performance of these measures on a handwritten Arabic document dataset, finding that cosine distance performs best; however, cosine distance is affected by the size of the visual codebook used. The document therefore proposes a floating threshold based on a linear regression model that considers both the codebook size and the number of image features, in order to better measure similarity between word images. Experiments on a historical Arabic document collection demonstrate the effectiveness of this combined cosine-linear regression model.
Scene text recognition in mobile applications by character descriptor and str...eSAT Journals
This document presents a method for scene text recognition in mobile applications using character descriptors and structure configuration. It proposes using a character descriptor that combines feature detectors and descriptors to extract text features effectively. It also models character structure using stroke configuration maps derived from character boundaries and skeletons. The method was tested on various datasets and achieved accuracy rates above 70%, outperforming existing methods. It can detect text regions and recognize text information for applications like text understanding and retrieval on mobile devices.
This document discusses various techniques for document clustering and retrieval, including cosine similarity, k-means clustering, hierarchical clustering, and the EM algorithm. Cosine similarity measures the similarity between document vectors and is often used to compare documents, with higher values indicating more similar documents. K-means clustering partitions documents into k groups so as to maximize intra-cluster similarity, while hierarchical clustering creates a dendrogram of document clusters by progressively merging the most similar pairs. The EM algorithm computes maximum likelihood estimates for document clustering when data is incomplete. Evaluation of document clusters considers internal metrics like intra-cluster similarity and inter-cluster dissimilarity.
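The k-means procedure described above can be sketched in a few lines. This is a minimal Lloyd's-algorithm illustration on invented 2-D document vectors (a real system would use TF-IDF vectors and often cosine rather than Euclidean distance):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means: assign each point to its nearest
    centroid, then recompute centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical 2-D embeddings of six documents, in two obvious groups.
docs = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
clusters = kmeans([list(d) for d in docs], k=2)
print([len(c) for c in clusters])
```

On this toy data the two natural groups are recovered regardless of which points are sampled as initial centroids.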
Behavior study of entropy in a digital image through an iterative algorithmijscmcj
Image segmentation is a critical step in computer vision tasks, constituting an essential issue for pattern recognition and visual interpretation. In this paper, we study the behavior of entropy in digital images through an iterative mean shift filtering algorithm. The order of a digital image in gray levels is defined. The behavior of Shannon entropy is analyzed and then compared, taking into account the number of iterations of our algorithm, with the maximum entropy that could be achieved under the same order. The induced equivalence classes allow us to interpret entropy as a hyper-surface in real m-dimensional space. The difference between the maximum entropy of order n and the entropy of the image is used to group the iterations, in order to characterize the performance of the algorithm.
This document summarizes a research paper that proposes and compares fuzzy and Naive Bayes models for detecting obfuscated plagiarism in Marathi language texts. It first provides background on plagiarism detection and describes different types of plagiarism, including obfuscated plagiarism. It then presents the fuzzy semantic similarity model, which uses fuzzy logic rules and semantic relatedness between words to calculate similarity scores between texts. Next, it describes the Naive Bayes model for plagiarism detection using Bayes' theorem. The paper compares the performance of the fuzzy and Naive Bayes models on precision, recall, F-measure, and granularity, and finds that the Naive Bayes model provides more accurate detection of obfuscated plagiarism.
This presentation is a briefing of a paper about Networks and Natural Language Processing. It describes many graph based methods and algorithms that help in syntactic parsing, lexical semantics and other applications.
This document summarizes a survey on string similarity matching search techniques. It discusses how string similarity matching is used to find relevant information in text collections. The document reviews different algorithms for string matching, including edit distance, NR-grep, n-grams, and approaches based on hashing and locality-sensitive hashing. It analyzes techniques like pattern matching, threshold-based joins, and vector representations. The goal is to present an overview of the field and compare algorithm performance for similarity searches.
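The edit distance reviewed in this survey is the classic Levenshtein dynamic program; a minimal sketch using a rolling row to keep memory at O(n):

```python
def levenshtein(s, t):
    """Dynamic-programming edit distance with unit-cost
    insertions, deletions, and substitutions."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distance from "" to each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion from s
                          curr[j - 1] + 1,     # insertion into s
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # → 3
```

Threshold-based similarity joins typically abandon a row early once every cell exceeds the threshold, a refinement omitted here for brevity.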
Recognition of Words in Tamil Script Using Neural NetworkIJERA Editor
In this paper, word recognition using neural network is proposed. Recognition process is started with the partitioning of document image into lines, words, and characters and then capturing the local features of segmented characters. After classifying the characters, the word image is transferred into unique code based on character code. This code ideally describes any form of word including word with mixed styles and different sizes. Sequence of character codes of the word form input pattern and word code is a target value of the pattern. Neural network is used to train the patterns of the words. Trained network is tested with word patterns and is recognized or unrecognized based on the network error value. Experiments have been conducted with a local database to evaluate the performance of the word recognizing system and obtained good accuracy. This method can be applied for any language word recognition system as the training is based on only unique code of the characters and words belonging to the language.
Similar to A SURVEY ON SIMILARITY MEASURES IN TEXT MINING (20)
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Low power architecture of logic gates using adiabatic techniquesnooriasukmaningtyas
The growing significance of portable systems to limit power consumption in ultra-large-scale-integration chips of very high density, has recently led to rapid and inventive progresses in low-power design. The most effective technique is adiabatic logic circuit design in energy-efficient hardware. This paper presents two adiabatic approaches for the design of low power circuits, modified positive feedback adiabatic logic (modified PFAL) and the other is direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design. By improving the performance of basic gates, one can improvise the whole system performance. In this paper proposed circuit design of the low power architecture of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches and their results are analyzed for powerdissipation, delay, power-delay-product and rise time and compared with the other adiabatic techniques along with the conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs with DC-DB PFAL technique outperform with the percentage improvement of 65% for NOR gate and 7% for NAND gate and 34% for XNOR gate over the modified PFAL techniques at 10 MHz respectively.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman Rimland, and Hegemonic Stability theories, examines China's role
in Central Asia. This study adheres to the empirical epistemological method and has taken care of
objectivity. This study analyze primary and secondary research documents critically to elaborate role of
china’s geo economic outreach in central Asian countries and its future prospect. China is thriving in trade,
pipeline politics, and winning states, according to this study, thanks to important instruments like the
Shanghai Cooperation Organisation and the Belt and Road Economic Initiative. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence on other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Recycled Concrete Aggregate in Construction Part III
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING

Machine Learning and Applications: An International Journal (MLAIJ) Vol.3, No.1, March 2016
DOI : 10.5121/mlaij.2016.3103

M.K.Vijaymeena (1) and K.Kavitha (2)

(1) M.E. Scholar, Department of Computer Science & Engineering, Nandha Engineering College, Erode-638052, Tamil Nadu, India
(2) Assistant Professor, Department of Computer Science & Engineering, Nandha Engineering College, Erode-638052, Tamil Nadu, India
ABSTRACT
The volume of text resources in digital libraries and on the internet has been increasing, and organizing these text documents has become a practical need. Clustering techniques are used to organize large numbers of objects automatically into a small number of coherent groups. These documents are widely used for information retrieval and natural language processing tasks. Clustering algorithms require a metric for quantifying how dissimilar two given documents are; this difference is often measured by a similarity measure such as Euclidean distance or cosine similarity. The similarity measure process in text mining can be used to identify the suitable clustering algorithm for a specific problem. This survey discusses the existing work on text similarity by partitioning it into three significant approaches: String-based, Knowledge-based and Corpus-based similarities.
Keywords
Similarity measure, Corpus, Co-occurrence, Semantics, Lexical analysis, Synsets.
1. INTRODUCTION
Document clustering is a process that involves a computational burden in measuring the similarity between document pairs. A similarity measure is a function which assigns a real number between 0 and 1 to a pair of documents: a value of zero means that the documents are completely dissimilar, whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used for computing document similarity; the different features present in the documents are represented by vector-based models.
Figure 1: High-level text mining functional architecture.
Text similarity measures play an increasingly vital role in text related research and applications in
several tasks such as text classification, information retrieval, topic tracking, document clustering,
questions generation, question answering, short answer scoring, machine translation, essay
scoring, text summarization, topic detection and others.
Finding similarity between words is a necessary part of text similarity, and it is the primary stage for sentence and document similarities. Figure 1 represents a high-level text mining functional architecture. Words can be similar in two ways: lexically and semantically. Words are lexically similar if they have a similar character sequence. If the words have the same theme, they are semantically similar; if they have dissimilar themes but are used in the same context, they are related rather than similar. String-Based algorithms are used to measure lexical similarity, while Corpus-Based and Knowledge-Based algorithms are based on semantic similarity.
String-Based measures operate on string sequences and the composition of characters. A string metric is used for measuring the similarity or dissimilarity between text strings; it is used for approximate string matching and comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words based on information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure which determines the degree of similarity between words based on information derived from semantic networks. Each type is described briefly below.
This paper is arranged as follows: Section 2 presents String-Based algorithms, dividing them into character-based and term-based measures. Sections 3 and 4 introduce Corpus-Based and Knowledge-Based algorithms respectively. The concept of an advanced text mining system is introduced in Section 5, applications of text mining are discussed in Section 6, and Section 7 concludes the survey.
2. STRING-BASED SIMILARITY
String similarity measures operate on string sequences and the composition of characters. A metric that measures the distance between text strings is called a string metric; it is used for string matching and comparison. This survey presents various string similarity measures, which have been implemented in the SimMetrics package [1]. Of the algorithms introduced, seven are character-based while the others are term-based distance measures; they are discussed in the following sections.
2.1. Character-based Similarity Measures
The Longest Common Substring (LCS) measure computes the similarity between two strings as the length of the longest contiguous chain of characters that is present in both strings. The distance between two strings can also be measured by counting edit operations: the minimum number of operations required to transform one string into the other, where the allowed operations are insertion, deletion or substitution of a single character and transposition of two adjacent characters, defines the Damerau-Levenshtein distance [2, 3]. The Levenshtein distance between two strings a and b is computed as

    lev_a,b(i, j) = max(i, j)                                 if min(i, j) = 0,
    lev_a,b(i, j) = min( lev_a,b(i-1, j) + 1,
                         lev_a,b(i, j-1) + 1,
                         lev_a,b(i-1, j-1) + 1_(ai ≠ bj) )    otherwise,

where 1_(ai ≠ bj) is the indicator function, equal to 0 if ai = bj and equal to 1 otherwise, and lev_a,b(i, j) is the distance between the first i characters of a and the first j characters of b.
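The recurrence above maps directly onto a dynamic-programming routine; a minimal sketch in Python (the function name `levenshtein` and the row-by-row layout are ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))           # base case: lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        cur = [i]                            # base case: lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),    # substitution (indicator term)
            ))
        prev = cur
    return prev[-1]
```

Keeping only the previous row reduces memory from O(|a|·|b|) to O(|b|).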
The Jaro distance measure is based on the number and order of the characters that are common to two strings; it is mainly used for record linkage [4, 5]. Jaro-Winkler is an extension of the Jaro distance; a prefix scale gives more favourable ratings to strings that match from the beginning, for a set prefix length [6]. The Jaro distance dj of two given strings s1 and s2 is

    dj = 0                                          if m = 0,
    dj = (1/3) · ( m/|s1| + m/|s2| + (m - t)/m )    otherwise,

where m is the number of matching characters and t is half the number of transpositions.
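A minimal sketch of the Jaro distance in Python, assuming the standard matching window of max(|s1|, |s2|)/2 - 1 (the function name is ours):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro distance: characters match if equal and within the search window."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                       # find matching characters
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                      # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2                                          # t = half the transpositions
    return (m / len1 + m / len2 + (m - t) / m) / 3
```

For the classic pair "MARTHA"/"MARHTA", m = 6 and t = 1, giving dj = 17/18 ≈ 0.944.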
The Needleman-Wunsch algorithm [7] follows dynamic programming and was one of the first applications of dynamic programming to biological sequence comparison. It performs a global alignment in order to find the best alignment over the entirety of two sequences, and it is suitable only when the two sequences are of similar length.
Smith-Waterman [8] is another example of dynamic programming. It performs a local alignment in order to find the best alignment over the conserved domain of two sequences, and it is very useful for dissimilar sequences that contain similar regions. A matrix H is built as follows:

    H(i, 0) = 0 for 0 ≤ i ≤ m,   H(0, j) = 0 for 0 ≤ j ≤ n,
    H(i, j) = max{ 0,
                   H(i-1, j-1) + s(ai, bj),
                   max over k ≥ 1 of ( H(i-k, j) - Wk ),
                   max over l ≥ 1 of ( H(i, j-l) - Wl ) },

where a and b are strings over an alphabet, m = length(a), n = length(b), s(a, b) is a similarity function on the alphabet, H(i, j) denotes the maximum similarity score between a suffix of a[1...i] and a suffix of b[1...j], and Wi is the gap-scoring scheme.
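The H matrix above can be filled in a straightforward double loop. The sketch below assumes a linear gap penalty (Wk = k · gap), so each inner maximization reduces to a single previous cell; the scoring values are illustrative:

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = 1) -> int:
    """Best local-alignment score with a linear gap penalty W_k = k * gap."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # H(i,0) = H(0,j) = 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # align a_i with b_j
                          H[i - 1][j] - gap,    # gap in b
                          H[i][j - 1] - gap)    # gap in a
            best = max(best, H[i][j])
    return best
```

Unlike Needleman-Wunsch, the zero in the max lets an alignment restart anywhere, which is what makes the alignment local.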
An N-gram [9] is a sub-sequence of n items from a given text sequence. N-gram similarity algorithms compare the n-grams drawn from the characters of two strings: a score is computed by dividing the number of shared n-grams by the maximal number of n-grams. N-gram models are used widely in statistical natural language processing. In speech recognition, phonemes and phoneme sequences are modelled using an n-gram distribution; for parsing, words are modelled such that each n-gram is composed of n words; and n-grams are also used for language identification.
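A minimal character-bigram similarity sketch under set semantics (the function names and the choice n = 2 are ours):

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1: str, s2: str, n: int = 2) -> float:
    """Shared n-grams divided by the maximal number of n-grams."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / max(len(g1), len(g2))
```

For "night" and "nacht", the only shared bigram is "ht", giving 1/4 = 0.25.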
2.2. Term-based Similarity Measures
The Block Distance [10] is also known as Manhattan distance. It computes the distance that would be travelled to get from one data point to another if a grid-like path is followed; the block distance between two items is the sum of the differences of their corresponding components. In an n-dimensional real vector space with a fixed Cartesian coordinate system, the taxicab distance d1 between vectors p and q is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes:

    d1(p, q) = Σ (i = 1 to n) |pi - qi|,

where p and q are vectors.

Cosine Similarity is the similarity measure primarily used to measure the cosine of the angle between two document vectors:

    cos(p, q) = (p · q) / ( ||p|| ||q|| ).
Dice’s coefficient [11] is defined as twice the number of terms common to the compared strings, divided by the total number of terms present in both strings:

    Dice(X, Y) = 2 |X ∩ Y| / ( |X| + |Y| ).

Euclidean distance is measured by calculating the square root of the sum of squared differences between the elements of the two vectors:

    d(p, q) = sqrt( Σ (i = 1 to n) (pi - qi)^2 ).

Jaccard similarity [12] is computed as the number of shared terms divided by the number of all unique terms present in both strings:

    Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|.
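The term-based measures above can be sketched on whitespace-tokenized strings; the helper names and the use of raw term counts are our assumptions:

```python
import math
from collections import Counter

def term_vectors(s1: str, s2: str):
    """Term-count vectors over the union vocabulary of two strings."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]

def block_distance(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))        # Manhattan / taxicab

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def cosine(p, q):
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

def dice(s1: str, s2: str):
    a, b = set(s1.split()), set(s2.split())
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(s1: str, s2: str):
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)
```

For "the cat sat" vs "the cat ran", the two shared terms out of four unique ones give Jaccard 0.5 and Dice 2/3.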
A simple vector-based approach is the Matching Coefficient, which simply counts the number of matching terms; both vectors should be non-zero. The Simple Matching Coefficient (SMC) is a statistic mainly used to compare the similarity and diversity of sample sets. Consider two objects A and B, each with n binary attributes. SMC is defined as:

    SMC = ( M00 + M11 ) / ( M00 + M01 + M10 + M11 ),

where:

M11 is the total number of attributes where A and B both equal 1.
M01 is the total number of attributes where attribute A is 0 and attribute B is 1.
M10 is the total number of attributes where attribute A is 1 and attribute B is 0.
M00 is the total number of attributes where A and B both equal 0.

The overlap coefficient is similar to the Dice coefficient, but it considers two strings a full match if one is a subset of the other. The overlap coefficient, also called the Szymkiewicz-Simpson coefficient, is a similarity measure related to the Jaccard index. It measures the overlap between two sets, and is calculated by dividing the size of the intersection by the size of the smaller of the two sets:

    overlap(X, Y) = |X ∩ Y| / min( |X|, |Y| ).

If set X is a subset of Y or vice versa, the overlap coefficient is 1.
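Both set-based coefficients are short one-liners; a sketch (function names are ours):

```python
def smc(a, b) -> float:
    """Simple Matching Coefficient for equal-length binary attribute vectors."""
    assert len(a) == len(b)
    matches = sum(1 for x, y in zip(a, b) if x == y)  # M00 + M11
    return matches / len(a)                           # denominator is n

def overlap(x: set, y: set) -> float:
    """Szymkiewicz-Simpson overlap coefficient of two sets."""
    return len(x & y) / min(len(x), len(y))
```

Note that SMC, unlike Jaccard, also rewards shared absences (M00).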
3. CORPUS-BASED SIMILARITY
Corpus-Based similarity is a semantic similarity measure which determines the similarity between words based on information gained from corpora. Figure 2 shows the Corpus-Based similarity measures. A corpus is a large collection of texts used for language research.

Hyperspace Analogue to Language (HAL) [13, 14] creates a semantic space in which the words are represented by the columns and the strengths of association between words are represented by the rows. The user of the algorithm can drop the low-entropy columns of the matrix. As the text is analyzed, a focus word is placed at the beginning of a sliding window and the co-occurring neighbouring words are recorded. Word-ordering information is captured by treating each co-occurrence differently depending on whether the neighbouring word appeared before or after the focus word.

Latent Semantic Analysis (LSA) is an important Corpus-Based technique. It assumes that words with close meanings will mostly occur in similar pieces of text. A matrix containing word counts per paragraph is constructed from a text dataset; singular value decomposition (SVD) is then used to reduce the number of columns while preserving the similarity structure among the rows.
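A toy LSA sketch with NumPy: a word-by-paragraph count matrix is reduced with a truncated SVD, after which related words sit closer in the reduced space. The matrix contents and the choice k = 2 are illustrative only:

```python
import numpy as np

# Toy word-by-paragraph count matrix (rows = words, columns = paragraphs).
X = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "engine"
    [0, 0, 2, 1],   # "doctor"
    [0, 0, 1, 2],   # "nurse"
], dtype=float)

# Truncated SVD: keep the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
words_k = U[:, :k] * s[:k]   # word vectors in the reduced LSA space

def cos(u, v) -> float:
    """Cosine similarity between two reduced word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the reduced space, "car" and "engine" (which co-occur) score higher with each other than with "doctor".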
Figure 2: Corpus-Based Similarity Measures
Generalized Latent Semantic Analysis (GLSA) is a framework for computing semantically motivated term vectors. It extends the LSA approach by focusing on term vectors instead of the dual document-term representation. Explicit Semantic Analysis (ESA) is a measure used for computing the semantic relatedness of arbitrary texts; the semantic relatedness between two texts is expressed as the cosine measure between their vectors.
4. KNOWLEDGE-BASED SIMILARITY
Knowledge-Based similarity is a semantic similarity measure based on identifying the degree of similarity between words using information derived from semantic networks [15]. The most popular semantic network is WordNet, a large lexical database of the English language. Nouns, verbs, adverbs and adjectives are grouped into sets of cognitive synonyms known as synsets, and the synsets are interlinked by means of conceptual-semantic and lexical relations. Knowledge-based measures can be divided into two groups: measures of semantic similarity and measures of semantic relatedness. Measures of semantic similarity have been defined between words and concepts. The prominence of word-to-word similarity metrics is due to the availability of resources that specifically encode relations between words and concepts (e.g. WordNet), and to the various test beds that allow for their evaluation (e.g. TOEFL or SAT analogy/synonymy tests). Consider two input text segments T1 and T2. The similarity between the text segments can be determined using the following scoring function from [15]:

    sim(T1, T2) = (1/2) · ( Σ (w in T1) maxSim(w, T2) · idf(w) / Σ (w in T1) idf(w)
                          + Σ (w in T2) maxSim(w, T1) · idf(w) / Σ (w in T2) idf(w) ),

where maxSim(w, T) is the highest similarity between word w and any word in segment T, and idf(w) is the inverse document frequency of w. This similarity score lies between 0 and 1: a score of 1 indicates identical text segments, and a score of 0 indicates that there is no semantic overlap between the two segments. The word similarity metrics Lesk, Leacock & Chodorow, Wu & Palmer, Lin, Resnik, and Jiang & Conrath are word-to-word metrics which, for any given pair of words, select the highest concept-to-concept similarity.
The Leacock & Chodorow (1998) similarity is determined as follows:

    sim_lch = -log( length / (2 · D) ),

where length is the shortest path length between the two concepts (using node counting) and D is the maximum depth of the taxonomy.
The Lesk similarity between two concepts is a function of the overlap between their corresponding definitions, as provided by a dictionary. It is based on an algorithm proposed by Lesk (1986) as a solution to the word sense disambiguation problem. The Lesk similarity measure can be applied to semantic networks, and in conjunction with any dictionary that provides definitions for words.
The Wu and Palmer (1994) similarity metric measures the depths of the two given concepts in the WordNet taxonomy and the depth of their least common subsumer (LCS), and combines these figures into a similarity score:

    sim_wup = 2 · depth(LCS) / ( depth(c1) + depth(c2) ).
The measure introduced by Resnik (1995) returns the information content (IC) of the least common subsumer of two concepts:

    sim_res = IC( LCS(c1, c2) ),

where IC is defined as:

    IC(c) = -log P(c),

and P(c) is the probability of encountering an instance of concept c in a large corpus. The Lin (1998) metric builds on Resnik's measure by adding a normalization factor consisting of the information content of the two input concepts:

    sim_lin = 2 · IC( LCS(c1, c2) ) / ( IC(c1) + IC(c2) ).
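The Resnik and Lin measures can be sketched on a toy is-a taxonomy; the taxonomy, the probabilities P(c) and all function names below are illustrative, not WordNet values:

```python
import math

# Hand-built is-a hierarchy (child -> parent) and corpus probabilities P(c).
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "animal", "animal": None,
}
PROB = {
    "dog": 0.2, "wolf": 0.05, "canine": 0.25,
    "cat": 0.2, "feline": 0.2, "carnivore": 0.5, "animal": 1.0,
}

def ancestors(c):
    """Path from concept c up to the root, including c itself."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def lcs(c1, c2):
    """Least common subsumer: the nearest shared ancestor."""
    shared = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in shared)

def ic(c) -> float:
    return -math.log(PROB[c])          # IC(c) = -log P(c)

def resnik(c1, c2) -> float:
    return ic(lcs(c1, c2))             # sim_res = IC(LCS)

def lin(c1, c2) -> float:
    return 2 * resnik(c1, c2) / (ic(c1) + ic(c2))
```

Since "dog" and "wolf" share the rarer subsumer "canine", their Resnik score exceeds that of "dog" and "cat", whose LCS is the more common "carnivore".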
5. ADVANCED TEXT MINING SYSTEM
A text mining system follows the general model provided by various data mining applications. The main areas are pre-processing tasks, core mining tasks, presentation-layer components with browsing functionality, and refinement techniques. Figure 3 shows the system architecture for an advanced text mining system.

Pre-processing prepares data for knowledge discovery operations and removes the noise present in the data. Generally, pre-processing tasks convert the information from each original source into a canonical format.

Core mining operations are considered the heart of a text mining system. They include trend analysis, pattern discovery and incremental knowledge discovery algorithms.

Presentation-layer components consist of the GUI and pattern-browsing functionality. Visualization tools and user-facing query editors also fall under this architectural category. Presentation-layer components include graphical tools for creating and modifying concept clusters and for creating annotated profiles for specific concepts.

Refinement techniques are methods used to filter redundant information and to cluster closely related data. They represent a full, comprehensive suite of pruning, ordering, generalization, and clustering approaches.
Figure 3: System architecture for a text mining system.
6. APPLICATIONS OF TEXT MINING
Similarity measures in text mining are increasing functions of common features and decreasing functions of dissimilar features. Similarity measures are applied in various fields and for different purposes within the text mining domain. Unstructured text mining is an area gaining significant importance in business applications: text mining is being applied to answer various business questions, to optimize day-to-day operational efficiencies and to improve long-term strategic decisions. The following are practical real-world instances where text mining has been successfully applied.
6.1. Text Mining in the Automotive Industry
It is essential that auto companies explore every avenue for reducing costs. One of the most important levers in the cost equation for automobile manufacturers is optimizing warranty cost, and one of its underutilized inputs is service technicians' comments. Automobile companies can take the following actions, which will optimize dealer inventory for spare parts, reduce warranty-related cost erosion and help suppliers deliver quality components:
Auto manufacturers can share the results of technicians' comments with specific product suppliers and undertake joint initiatives to reduce the number of defective components.

Automobile companies can build an early warning system by monitoring the frequency of occurrence of keywords specified in a watch list, such as "brake failure", "short circuit" or "low mileage"; in a few cases these may indicate legal liability.

If a component was manufactured internally, the specific manufacturing process that was responsible for the defective component can be re-engineered to eliminate recurrence.

By calculating the frequency of occurrence of selected spare parts or auto components, the results can be used as an input to forecast regional needs for spare parts. This is known as auto spare-part inventory optimization.

Most dealer management systems have a preliminary taxonomy for classifying defects, which varies in quality depending on how well it was originally defined. If most defects get classified under a miscellaneous section, keyword and theme frequency analysis can guide the creation of new defect classifications for text mining.
Figure 4: Text mining in Auto Industry.
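The watch-list early-warning idea described above can be sketched as a simple phrase-frequency check; the watch list, threshold and function name are hypothetical:

```python
import re
from collections import Counter

# Hypothetical watch list of warranty-risk phrases and an alert threshold.
WATCH_LIST = {"brake failure", "short circuit", "low mileage"}
THRESHOLD = 2   # alert when a phrase appears more often than this

def alerts(comments, watch_list=WATCH_LIST, threshold=THRESHOLD):
    """Return watch-list phrases whose frequency across comments exceeds threshold."""
    counts = Counter()
    for text in comments:
        low = text.lower()
        for phrase in watch_list:
            counts[phrase] += len(re.findall(re.escape(phrase), low))
    return {p for p, c in counts.items() if c > threshold}
```

In practice the counts would be windowed by time and region before comparing against the threshold.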
6.2. Text Mining in the Healthcare Industry
The healthcare industry is a huge spender on technology, with the proliferation of hospital management systems and of low-cost devices used to log patient statistics. With the sudden increase in the breadth and depth of patient data, it becomes worthwhile to mine the comments in doctors' diagnosis transcripts. Outputs of the mining process can yield information that benefits the healthcare industry in numerous ways.
The top ten diseases per region can be isolated by keyword frequencies, and the mix of tablets/medicines stocked at outlets with limited storage can be optimized accordingly. It should be noted that the frequency of disease-related keywords will change over time.

Based on doctors' comments, an early warning system can be built on text mining outputs in order to detect sudden changes. For example, if the keyword frequency of "lungs" or "breathing" exceeds 60 appearances in the last month for a given region, it can hint at adverse environmental conditions that result in respiratory problems, and a proactive intervention can be activated to remedy the situation.
Figure 5: Text mining in Healthcare Industry.
7. CONCLUSION
Text mining is a significant research area which has been gaining increasing popularity in recent years. Measuring the similarity between text documents is an important operation of text mining. In this survey, three text similarity approaches were discussed: String-based, Corpus-based and Knowledge-based similarities. String-based measures operate on character composition and string sequences. Corpus-based similarity is a semantic similarity measure which determines the similarity between words based on information gained from large corpora. Knowledge-based similarity is a semantic similarity measure based on the degree of similarity between words and concepts. Some of these algorithms have been combined in many studies as hybrid similarity measures. Useful similarity packages such as SimMetrics, WordNet Similarity and NLTK were mentioned. The system architecture for an advanced text mining system and the functionalities of its components were described. Real-world applications of text mining in environments such as the automobile industry and the healthcare industry were also discussed.
REFERENCES
[1] Chapman, S. (2006). SimMetrics : a java & c# .net library of similarity metrics,
http://sourceforge.net/projects/simmetrics/.
[2] Hall, P. A. V. & Dowling, G. R. (1980). Approximate string matching, Comput. Surveys, 12:381-402.
[3] Peterson, J. L. (1980). Computer programs for detecting and correcting spelling errors, Comm.
Assoc. Comput. Mach., 23:676-687.
[4] Jaro, M. A. (1989). Advances in record linkage methodology as applied to the 1985 census of
Tampa Florida, Journal of the American Statistical Society, vol. 84, 406, pp 414-420.
[5] Jaro, M. A. (1995). Probabilistic linkage of large public health data file, Statistics in Medicine 14
(5-7), 491-8.
[6] Winkler W. E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-
Sunter Model of Record Linkage, Proceedings of the Section on Survey Research Methods,
American Statistical Association, 354–359.
[7] Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48(3): 443-53.
[8] Smith, T. F. & Waterman, M. S. (1981). Identification of Common Molecular Subsequences, Journal of Molecular Biology 147: 195-197.
[9] Barrón-Cedeño, A., Rosso, P., Agirre, E. & Labaka, G. (2010). Plagiarism Detection across Distant Language Pairs, In Proceedings of the 23rd International Conference on Computational Linguistics, pages 37-45.
[10] Eugene F. K. (1987). Taxicab Geometry , Dover. ISBN 0-486-25202-7.
[11] Dice, L. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3).
[12] Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des
Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547-579.
[13] Lund, K., Burgess, C. & Atchley, R. A. (1995). Semantic and associative priming in a high-
dimensional semantic space. Cognitive Science Proceedings (LEA), 660-665.
[14] Lund, K. & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-
occurrence. Behavior Research Methods, Instruments & Computers, 28(2),203-208.
[15] Mihalcea, R., Corley, C. & Strapparava, C. (2006). Corpus based and knowledge-based measures
of text semantic similarity. In Proceedings of the American Association for Artificial
Intelligence.(Boston, MA).
AUTHORS
M.K.VIJAYMEENA received the B.E. degree in Computer Science and Engineering from Jansons Institute of Technology in 2014. She is currently pursuing the M.E. degree in Computer Science and Engineering at Nandha Engineering College, Erode, India. Her current research interests include data mining, text mining and information retrieval.

Mrs. K.KAVITHA received the M.Tech. degree in Information Technology from SNS College of Technology, Coimbatore in 2011. She is currently working as an Assistant Professor at Nandha Engineering College, Erode, India. Her current research interests include data mining and network security.