Pattern matching is one of the central and most widely studied problems in theoretical computer science. Solutions to the problem play an important role in many areas of science and information processing, and their performance has a great impact on applications including database querying, text processing, and DNA sequence analysis. In general, pattern matching algorithms are characterized by the shift value, the direction of the sliding window, and the order in which comparisons are made. Performance can be enhanced to a great extent by a larger shift value and by fewer comparisons needed to compute that shift. In this paper we propose an algorithm for finding motifs in DNA sequences. The algorithm preprocesses the pattern string (motif) and, on a mismatch between the pattern (motif) and the DNA sequence, computes the shift from the four consecutive nucleotides that immediately follow the aligned pattern window. Theoretically, we find that the proposed algorithm works efficiently for motif identification.
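The shift principle described above can be sketched in a few lines. This is a hedged illustration, not the authors' exact method: it is a Quick-Search-style matcher that decides the shift from the single character following the window, whereas the paper's algorithm uses four following nucleotides.

```python
def quick_search(text, pattern):
    """Quick-Search-style sketch: after each window comparison, the shift
    is decided by the character immediately following the current window.
    (The paper's algorithm extends this idea to four following nucleotides;
    this sketch uses one character only, to illustrate the shift principle.)"""
    n, m = len(text), len(pattern)
    # Preprocessing: for each symbol in the pattern, the distance from its
    # rightmost occurrence to just past the pattern's end.
    shift = {c: m - i for i, c in enumerate(pattern)}
    matches = []
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        if pos + m == n:
            break
        # The character just past the window decides how far to slide.
        nxt = text[pos + m]
        pos += shift.get(nxt, m + 1)
    return matches
```

With a DNA alphabet of only four symbols, the expected shift per step is small, which is precisely why extending the lookahead to several following nucleotides (as the paper does) pays off.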
A Survey of String Matching Algorithms (IJERA Editor)
String matching algorithms play an important role in finding the places where one or several strings (patterns) occur in a large body of text (e.g., a data stream, a sentence, a paragraph, a book, etc.). Their applications cover a wide range, including intrusion detection systems (IDS) in computer networks, bioinformatics, plagiarism detection, information security, pattern recognition, document matching, and text mining. In this paper we present a short survey of well-known, recently updated, and hybrid string matching algorithms. These algorithms can be divided into two major categories, known as exact string matching and approximate string matching. The classification criteria were selected to highlight important features of the matching strategies, in order to identify challenges and vulnerabilities.
An Index Based K-Partitions Multiple Pattern Matching Algorithm (IDES Editor)
The study of pattern matching is a fundamental application and an emerging area in computational biology. Searching DNA-related data is a common activity for molecular biologists. In this paper we explore the applicability of a new pattern matching technique, the Index based K-partition Multiple Pattern Matching algorithm (IKPMPM), to DNA sequences. The approach avoids unnecessary comparisons in the DNA sequence. As a result, the number of comparisons decreases, and the comparisons-per-character ratio of the proposed algorithm is reduced accordingly when compared to other existing popular methods. The experimental results show a considerable performance improvement.
Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification? (IJORCS)
An algorithm for locating all occurrences of a finite number of keywords in an arbitrary string, also known as multiple string matching, is commonly required in information retrieval (such as sequence analysis, evolutionary biological studies, gene/protein identification, and network intrusion detection) and text editing applications. Although Aho-Corasick has been one of the most commonly used exact multiple string matching algorithms, Commentz-Walter has been introduced as a better alternative in the recent past. The Commentz-Walter algorithm combines ideas from both Aho-Corasick and Boyer-Moore. Large-scale, rapid, and accurate peptide identification is critical in computational proteomics. In this paper, we critically analyze the time complexity of Aho-Corasick and Commentz-Walter for their suitability in large-scale peptide identification. According to the results we obtained on our dataset, we conclude that Aho-Corasick performs better than Commentz-Walter, contrary to common belief.
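For reference, the Aho-Corasick automaton underlying the comparison above can be sketched compactly: a keyword trie plus failure links, scanned over the text in a single pass. This is a minimal illustrative implementation, not the code benchmarked in the paper.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: build a keyword trie, compute
    failure links by BFS, then scan the text once, reporting every
    keyword occurrence as (start_index, keyword)."""

    def __init__(self, keywords):
        self.goto = [{}]   # per-state transition maps
        self.fail = [0]    # failure links
        self.out = [[]]    # keywords recognized at each state
        for word in keywords:
            state = 0
            for ch in word:
                if ch not in self.goto[state]:
                    self.goto[state][ch] = self._new_state()
                state = self.goto[state][ch]
            self.out[state].append(word)
        # BFS from the root's children to fill in failure links.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, t in self.goto[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(ch, 0)
                self.out[t] = self.out[t] + self.out[self.fail[t]]

    def _new_state(self):
        self.goto.append({})
        self.fail.append(0)
        self.out.append([])
        return len(self.goto) - 1

    def search(self, text):
        hits, state = [], 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits.append((i - len(word) + 1, word))
        return hits
```

Because the scan never re-reads text characters, the search phase is linear in the text length regardless of how many keywords (e.g., peptides) are loaded, which is the property the paper's complexity analysis turns on.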
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR... (IJCSEA Journal)
The document presents an algorithm for optimized searching using non-overlapping iterative neighbor intervals. The algorithm aims to reduce the number of checked conditions by saving the frequency of replicated words and using non-overlapping intervals based on the plane sweep algorithm. It does this by focusing the search on a smaller subspace of relevant intervals near the minimum frequency keyword. The algorithm iterates through ranges, eliminating unsatisfied keywords to detect relevant ranges. This improves efficiency and reduces the number of comparisons compared to previous methods.
A Comparative Result Analysis of Text Based Steganographic Approaches (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Discovering Novel Information with Sentence Level Clustering From Multi-docu... (irjes)
The document presents a novel fuzzy clustering algorithm called FRECCA that clusters sentences from multi-documents to discover new information. FRECCA uses fuzzy relational eigenvector centrality to calculate page rank scores for sentences within clusters, treating the scores as likelihoods. It uses expectation maximization to optimize cluster membership values and mixing coefficients without a parameterized likelihood function. An evaluation shows FRECCA achieves superior performance to other clustering algorithms on a quotations dataset, identifying overlapping clusters of semantically related sentences.
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching (IJERA Editor)
This document presents a novel framework for identifying Short Tandem Repeats (STRs) using parallel string matching. It begins with background on STRs and challenges with existing sequential algorithms. It then describes a two-phase methodology - first applying a basic improved right prefix algorithm sequentially, then applying it in parallel using multi-threading on multicore processors. Results show the basic algorithm outperforms Boyer-Moore, Knuth-Morris-Pratt and brute force algorithms sequentially. When applied in parallel, processing time is reduced from 80ms sequentially to 40ms in parallel on multicore systems. The parallel STR identification framework allows efficient searching of repeats in large genomes.
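The parallel phase described above can be sketched generically: split the text into chunks (overlapping by one pattern length so boundary matches are not lost) and search the chunks concurrently. This is a hedged illustration of the divide-and-search idea using Python threads, not the paper's improved right-prefix algorithm.

```python
from concurrent.futures import ThreadPoolExecutor

def find_in_chunk(args):
    """Find all occurrences of pattern starting in [start, end); the search
    window is extended by len(pattern)-1 so matches straddling the chunk
    boundary are still found. Offsets returned are global."""
    text, pattern, start, end = args
    stop = min(end + len(pattern) - 1, len(text))
    hits, i = [], start
    while True:
        i = text.find(pattern, i, stop)
        if i == -1 or i >= end:   # matches starting at >= end belong to the next chunk
            break
        hits.append(i)
        i += 1
    return hits

def parallel_find(text, pattern, workers=4):
    """Split the text into roughly equal chunks and search them concurrently."""
    chunk = max(1, len(text) // workers)
    tasks = [(text, pattern, s, min(s + chunk, len(text)))
             for s in range(0, len(text), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(find_in_chunk, tasks)
    return sorted(h for part in results for h in part)
```

Restricting each chunk to report only matches starting before its boundary keeps the chunks non-overlapping in output, so no occurrence is counted twice.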
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. The hierarchical clustering model produces tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically under an unsuitable cluster count.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to a poor clustering solution. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; the DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability with the support of labels and concept relations in the dimensionality reduction process.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
This paper improves the skip-gram model for learning word and phrase embeddings. It proposes using phrases instead of words as training samples, which greatly reduces the number of samples. It also introduces subsampling frequent words to speed up training and improve performance on infrequent words. Further, it compares different methods for reducing computational complexity like hierarchical softmax and negative sampling, finding negative sampling works best. Empirical tests on analogical reasoning tasks show the best model uses hierarchical softmax with subsampling and is trained on billions of words. The paper also demonstrates additive compositionality of word vectors.
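The subsampling of frequent words mentioned above has a simple closed form in the skip-gram paper: each occurrence of word w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a chosen threshold (around 1e-5). A minimal sketch:

```python
import math
import random

def keep_probability(freq, total, t=1e-5):
    """Probability of KEEPING one occurrence of a word, per the skip-gram
    subsampling heuristic: sqrt(t / f), clipped to 1.0 for rare words."""
    f = freq / total
    return min(1.0, math.sqrt(t / f))

def subsample(tokens, counts, t=1e-5, rng=random.random):
    """Drop occurrences of frequent words at random; `counts` maps each
    word to its corpus frequency. `rng` is injectable for testing."""
    total = sum(counts.values())
    return [w for w in tokens if rng() < keep_probability(counts[w], total, t)]
```

Very frequent words like "the" are kept with tiny probability, while any word rarer than the threshold is always kept, which is what speeds up training while improving the vectors of infrequent words.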
This work considers the multi-objective optimization problem constrained by a system of bipolar fuzzy relational equations with max-product composition. An integer-optimization-based technique for order of preference by similarity to the ideal solution is proposed for solving such a problem. Some critical features associated with the feasible domain and optimal solutions of the bipolar max-Tp equation constrained optimization problem are studied. An illustrative example verifying the idea of this paper is included. This is the first attempt to study bipolar max-T equation constrained multi-objective optimization problems from an integer programming viewpoint.
The document presents a new ontology matching system based on a multi-agent architecture. The system takes ontologies described in XML, RDF Schema, and OWL as input. It uses multiple matchers and filtering to generate mappings between ontology entities. The mappings are then validated. The system is implemented as a multi-agent system with different agent types responsible for resources, matching, generating mappings, and filtering/validating mappings. The architecture allows for robust, flexible, and scalable ontology matching.
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu... (IOSR Journals)
The document describes a hybrid computational intelligence method for clustering sense-tagged Nepali documents. It combines self-organizing maps (SOM), particle swarm optimization (PSO), and k-means clustering. Feature vectors are generated from the sense-tagged documents and the hybrid algorithm is applied in three phases: (1) SOM produces prototype vectors from the feature vectors, (2) PSO initializes k-means centroids, (3) k-means clusters the prototypes. The method aims to address limitations of bag-of-words representations by incorporating word sense information. Experiments show the approach effectively clusters sense-tagged Nepali texts.
Matching a string with a pattern is known as string matching. What are these strings and patterns? The string is the text, entered by the user, that is to be checked, and it is matched against the pattern that is already in the database. Copy the link given below and paste it into a new browser window to get more information on string matching: www.transtutors.com/homework-help/computer-science/string-match.aspx
This document summarizes a survey on string similarity matching search techniques. It discusses how string similarity matching is used to find relevant information in text collections. The document reviews different algorithms for string matching, including edit distance, NR-grep, n-grams, and approaches based on hashing and locality-sensitive hashing. It analyzes techniques like pattern matching, threshold-based joins, and vector representations. The goal is to present an overview of the field and compare algorithm performance for similarity searches.
The given presentation covers strings, string matching, and the naive method of string matching. The naive method has O((n-m+1)*m) time complexity. The presentation also discusses the problems with the naive approach and lists approaches that can be applied to reduce the time complexity.
The document discusses various algorithms for pattern searching in a text, including:
1. Naive pattern searching which slides the pattern over the text and checks for matches in O(nm) time in worst case.
2. KMP algorithm which uses a preprocessing step to construct a lps array to avoid rematching characters, improving worst case to O(n).
3. Rabin-Karp algorithm which computes hashes of patterns and substrings to quickly eliminate non-matching candidates before character matching.
4. Finite automata based algorithm which preprocesses the pattern to construct a state machine, allowing searches in O(n) time.
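The KMP preprocessing named in step 2 can be sketched directly: build the lps table (length of the longest proper prefix that is also a suffix), then scan the text once, falling back through the table on mismatch instead of re-examining matched characters. A minimal textbook implementation:

```python
def build_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = lps[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps

def kmp_search(text, pattern):
    """Scan the text once; on mismatch, fall back via the lps table
    instead of rewinding the text pointer, giving O(n + m) overall."""
    lps, hits, k = build_lps(pattern), [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = lps[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = lps[k - 1]   # keep matching for overlapping occurrences
    return hits
```

Note the final fallback after a full match lets the search report overlapping occurrences, which the naive method also finds but at O(nm) cost.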
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex... (CSCJournals)
Biclusters are required for analyzing the expression patterns of genes by comparing rows in expression profiles, and for analyzing the expression profiles of samples by comparing columns in the gene expression matrix. In the process of biclustering we need to cluster both genes and samples. The algorithm presented in this paper is based on the two-way clustering approach, in which genes and samples are clustered using parallel fuzzy C-means clustering with the message passing interface; we call it MFCM. MFCM is applied to cluster genes and samples so as to maximize the membership function values of the data set. It is a parallelized rework of a parallel fuzzy two-way clustering algorithm for microarray gene expression data [9], intended to study the efficiency and parallelization improvement of the algorithm. The algorithm uses a gene entropy measure to filter the clustered data and find biclusters. The method is able to obtain highly correlated biclusters of the gene expression dataset.
Huffman coding is one of the most famous algorithms for compressing text. There are four phases in the Huffman algorithm: the first is to group the characters; the second is to build the Huffman tree; the third is the encoding; and the last is the construction of the coded bits. The principle of the Huffman algorithm is that characters which appear often are encoded with short bit sequences, while characters which appear rarely are encoded with longer bit sequences. The Huffman compression technique can provide savings of around 30% over the original bits. It works based on the frequency of characters: the more often the same character recurs, the higher the compression rate gained.
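The four phases named above (count characters, build the tree, derive codes, emit bits) can be sketched compactly; a minimal illustrative Huffman coder using a heap:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree from character frequencies and return the
    bit-string code for each character; frequent characters get shorter codes."""
    freq = Counter(text)                       # phase 1: group characters
    if len(freq) == 1:                         # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries are (frequency, tiebreaker, node); a node is either a
    # leaf character or a (left, right) pair of subtrees.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                       # phase 2: build the tree
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    codes = {}
    def walk(node, prefix):                    # phase 3: derive codes
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```

Phase 4 is then just `"".join(codes[ch] for ch in text)`; for a text like "aaaabbc" this emits 10 bits instead of 56 in 8-bit encoding, showing how the savings grow with character repetition.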
ADAPTIVE AUTOMATA FOR GRAMMAR BASED TEXT COMPRESSION (csandit)
The Internet and the ubiquitous presence of computing devices are generating a continuously growing amount of information. However, the entropy of this information is not uniform, which allows data compression algorithms to reduce the demand for more powerful processors and larger data storage equipment. This paper presents an adaptive rule-driven device, the adaptive automaton, as the device used to identify repetitive patterns to be compressed in a grammar-based lossless data compression scheme.
International Journal of Engineering Research and Development (IJERD), IJERD Editor
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the number of available research documents impels researchers to propose topic models that extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the document collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has comparable performance, in terms of topic coherence, with LDA implemented in the MapReduce framework.
International Journal of Computational Engineering Research (IJCER), ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Latent Semantic Word Sense Disambiguation Using Global Co-Occurrence Information (csandit)
The document describes a novel word sense disambiguation method using global co-occurrence information and non-negative matrix factorization. It proposes using global co-occurrence frequencies between words in a large corpus rather than local context windows to address data sparseness issues. An experiment compares the method to two baselines and shows it achieves the highest and most stable precision for word sense disambiguation.
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem (IRJET Journal)
The document proposes an algorithm for dynamically assigning papers to reviewers based on keywords. It discusses:
1) Existing exact string matching algorithms like brute force, KMP, and Boyer-Moore are ineffective for this problem since keyword phrases may be similar but not exact matches.
2) The algorithm uses dynamic programming to calculate an "expertise distance" between each paper's keywords and a reviewer's keywords based on the edit distance between the strings. Reviewers with lower expertise distances are better matches.
3) This approach accounts for minor differences in keyword phrases while still capturing the underlying semantic similarity, making it more suitable than exact string matching for the paper-reviewer assignment problem.
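The edit-distance computation at the heart of point 2 is the classic dynamic program. The `expertise_distance` aggregation below is a hypothetical illustration of combining per-keyword distances, not necessarily the paper's exact formula:

```python
def edit_distance(a, b):
    """Classic dynamic program: dp[i][j] is the minimum number of
    insertions, deletions, and substitutions turning a[:i] into b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a[i-1]
                           dp[i][j - 1] + 1,         # insert b[j-1]
                           dp[i - 1][j - 1] + cost)  # substitute / match
    return dp[m][n]

def expertise_distance(paper_keywords, reviewer_keywords):
    """Hypothetical aggregation: for each paper keyword, take the edit
    distance to its closest reviewer keyword, and sum. Lower is a
    better match."""
    return sum(min(edit_distance(p, r) for r in reviewer_keywords)
               for p in paper_keywords)
```

This is why "data minning" still ranks a reviewer close to "data mining" (distance 1), where exact string matching would report no match at all.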
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING (mlaij)
The volume of text resources in digital libraries and on the internet has been increasing, and organizing these text documents has become a practical need. Clustering techniques are used to automatically organize a great number of objects into a small number of coherent groups. These documents are widely used for information retrieval and natural language processing tasks. Different clustering algorithms require a metric for quantifying how dissimilar two given documents are; this difference is often measured by a similarity measure such as Euclidean distance or cosine similarity. The similarity measure process in text mining can be used to identify the suitable clustering algorithm for a specific problem. This survey discusses existing work on text similarity by partitioning it into three significant approaches: String-based, Knowledge-based, and Corpus-based similarities.
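The two measures named above can be sketched over simple bag-of-words term-frequency vectors; a minimal illustration (the whitespace tokenizer is an assumption for the example, real systems would use proper tokenization and weighting such as tf-idf):

```python
import math
from collections import Counter

def tf_vector(doc):
    """Bag-of-words term-frequency vector for a whitespace-tokenized doc."""
    return Counter(doc.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors:
    1.0 for identical direction, 0.0 for no shared terms."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def euclidean_distance(u, v):
    """Straight-line distance between the two vectors over the union
    of their terms; 0.0 for identical documents."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0) - v.get(t, 0)) ** 2 for t in terms))
```

Note the two behave differently under document length: cosine similarity is scale-invariant (a document and its concatenation with itself score 1.0), while Euclidean distance grows with repetition, which is one reason the choice of measure affects which clustering algorithm is suitable.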
Performance Efficient DNA Sequence Detection algo (Rahul Shirude)
The document proposes a parallel string matching algorithm using CUDA to perform DNA sequence detection on a GPU. It aims to optimize GPU core utilization and achieve high performance gains over sequential approaches. The proposed algorithm finds correct pattern matches in DNA database sequences. Experimental results show the parallel GPU approach has much higher performance than sequential CPU methods.
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
This paper improves the skip-gram model for learning word and phrase embeddings. It proposes using phrases instead of words as training samples, which greatly reduces the number of samples. It also introduces subsampling frequent words to speed up training and improve performance on infrequent words. Further, it compares different methods for reducing computational complexity like hierarchical softmax and negative sampling, finding negative sampling works best. Empirical tests on analogical reasoning tasks show the best model uses hierarchical softmax with subsampling and is trained on billions of words. The paper also demonstrates additive compositionality of word vectors.
This work considers the multi-objective optimization problem constrained by a system of bipolar fuzzy relational equations with max-product composition. An integer optimization based technique for order of preference by similarity to the ideal solution is proposed for solving such a problem. Some critical features associated with the feasible domain and optimal solutions of the bipolar max-Tp equation constrained optimization problem are studied. An illustrative example verifying the idea of this paper is included. This
is the first attempt to study the bipolar max-T equation constrained multi-objective optimization problems
from an integer programming viewpoint.
The document presents a new ontology matching system based on a multi-agent architecture. The system takes ontologies described in XML, RDF Schema, and OWL as input. It uses multiple matchers and filtering to generate mappings between ontology entities. The mappings are then validated. The system is implemented as a multi-agent system with different agent types responsible for resources, matching, generating mappings, and filtering/validating mappings. The architecture allows for robust, flexible, and scalable ontology matching.
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...IOSR Journals
The document describes a hybrid computational intelligence method for clustering sense-tagged Nepali documents. It combines self-organizing maps (SOM), particle swarm optimization (PSO), and k-means clustering. Feature vectors are generated from the sense-tagged documents and the hybrid algorithm is applied in three phases: (1) SOM produces prototype vectors from the feature vectors, (2) PSO initializes k-means centroids, (3) k-means clusters the prototypes. The method aims to address limitations of bag-of-words representations by incorporating word sense information. Experiments show the approach effectively clusters sense-tagged Nepali texts.
Matching the String with a pattern is known as String Match”. Now, what are these strings and patterns? Alright, the string is that which is to be checked entered by the user and is matched with the pattern which is already in the database. Copy the link given below and paste it in new browser window to get more information on String Match:- www.transtutors.com/homework-help/computer-science/string-match.aspx
This document summarizes a survey on string similarity matching search techniques. It discusses how string similarity matching is used to find relevant information in text collections. The document reviews different algorithms for string matching, including edit distance, NR-grep, n-grams, and approaches based on hashing and locality-sensitive hashing. It analyzes techniques like pattern matching, threshold-based joins, and vector representations. The goal is to present an overview of the field and compare algorithm performance for similarity searches.
Given presentation tell us about string, string matching and the navie method of string matching. Well this method has O((n-m+1)*m) time complexicity. It also tells the problem with naive approach and gives list of approaches which can be applied to reduce the time complexicity
The document discusses various algorithms for pattern searching in a text, including:
1. Naive pattern searching which slides the pattern over the text and checks for matches in O(nm) time in worst case.
2. KMP algorithm which uses a preprocessing step to construct a lps array to avoid rematching characters, improving worst case to O(n).
3. Rabin-Karp algorithm which computes hashes of patterns and substrings to quickly eliminate non-matching candidates before character matching.
4. Finite automata based algorithm which preprocesses the pattern to construct a state machine, allowing searches in O(n) time.
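As an illustration of the KMP preprocessing mentioned in point 2, here is a standard construction of the lps (longest proper prefix which is also a suffix) array and the search that uses it, sketched in Python; this is the textbook construction, shown for illustration rather than code from any of the surveyed documents:

```python
def build_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = lps[k - 1]      # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps

def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text,
    in O(n + m) time, never re-examining matched text characters."""
    lps, k, hits = build_lps(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = lps[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = lps[k - 1]      # continue searching for overlapping matches
    return hits
```

On a mismatch the pattern is shifted by `k - lps[k-1]` positions in one step, which is where the improvement over the naive method comes from.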
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...CSCJournals
Biclusters are needed to analyze gene expression patterns: genes are analyzed by comparing rows of the expression profiles, and samples by comparing columns of the gene expression matrix. In the process of biclustering we need to cluster both genes and samples. The algorithm presented in this paper is based on a two-way clustering approach in which the genes and samples are clustered using parallel fuzzy C-means clustering with the Message Passing Interface; we call it MFCM. MFCM is applied for clustering on genes and samples, maximizing the membership function values of the data set. It is a parallelized rework of a parallel fuzzy two-way clustering algorithm for microarray gene expression data [9], intended to study the efficiency and parallelization improvement of the algorithm. The algorithm uses a gene entropy measure to filter the clustered data and find biclusters. The method is able to obtain highly correlated biclusters of the gene expression dataset.
Huffman coding is one of the most famous algorithms for compressing text. There are four phases in the Huffman algorithm: grouping the characters, building the Huffman tree, encoding, and constructing the coded bits. The principle of the algorithm is that characters that appear often are encoded with a short series of bits, while characters that rarely appear are encoded with a longer series. The Huffman technique can typically save about 30% of the original bits. It works on the frequency of characters: the more often a character occurs, the higher the compression rate gained.
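A minimal sketch of the phases described above (grouping characters by frequency, building the tree by repeatedly merging the two least frequent nodes, then deriving the codes), using Python's standard heapq; the function name and tie-breaking scheme are illustrative:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent characters get shorter bit strings."""
    freq = Counter(text)                       # phase 1: group the characters
    # Heap entries are (frequency, tie_breaker, tree); tie_breaker keeps
    # comparisons away from the tree payload, which may be a char or a pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                         # degenerate single-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:                       # phase 2: build the Huffman tree
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (t1, t2)))
        tie += 1
    codes = {}
    def walk(tree, prefix):                    # phase 3: derive the bit codes
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes
```

Encoding (phase 4) is then a simple lookup per character; because the code is prefix-free, the bit stream can be decoded unambiguously.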
ADAPTIVE AUTOMATA FOR GRAMMAR BASED TEXT COMPRESSIONcsandit
The Internet and the ubiquitous presence of computing devices everywhere are generating a continuously growing amount of information. However, the information entropy is not uniform. This allows the use of data compression algorithms to reduce the demand for more powerful processors and larger data storage equipment. This paper presents an adaptive rule-driven device, the adaptive automaton, as the device to identify repetitive patterns to be compressed in a grammar-based lossless data compression scheme.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a documents collection. However, how to extract the hidden topics of the documents collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from the scalability problem when the size of the documents collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has a comparable performance in terms of topic coherence with LDA implemented in the MapReduce framework.
International Journal of Computational Engineering Research(IJCER)ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal in English, published monthly. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Latent Semantic Word Sense Disambiguation Using Global Co-Occurrence Informationcsandit
The document describes a novel word sense disambiguation method using global co-occurrence information and non-negative matrix factorization. It proposes using global co-occurrence frequencies between words in a large corpus rather than local context windows to address data sparseness issues. An experiment compares the method to two baselines and shows it achieves the highest and most stable precision for word sense disambiguation.
Algorithm of Dynamic Programming for Paper-Reviewer Assignment ProblemIRJET Journal
The document proposes an algorithm for dynamically assigning papers to reviewers based on keywords. It discusses:
1) Existing exact string matching algorithms like brute force, KMP, and Boyer-Moore are ineffective for this problem since keyword phrases may be similar but not exact matches.
2) The algorithm uses dynamic programming to calculate an "expertise distance" between each paper's keywords and a reviewer's keywords based on the edit distance between the strings. Reviewers with lower expertise distances are better matches.
3) This approach accounts for minor differences in keyword phrases while still capturing the underlying semantic similarity, making it more suitable than exact string matching for the paper-reviewer assignment problem.
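The edit-distance computation that point 2 builds on can be sketched with the standard dynamic-programming recurrence; the function below is a generic Levenshtein distance, shown as an illustration rather than the paper's exact "expertise distance":

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, O(len(a) * len(b))."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                    # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n]
```

Under this measure, nearly identical keyword phrases such as "data mining" and "data minning" are one edit apart, so a reviewer whose keywords differ only slightly from a paper's keywords still scores as a close match.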
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING mlaij
The volume of text resources has been increasing in digital libraries and on the internet, and organizing these text documents has become a practical need. Clustering is used to organize large numbers of objects into a small number of coherent groups automatically. These documents are widely used for information retrieval and natural language processing tasks. Different clustering algorithms require a metric for quantifying how dissimilar two given documents are. This difference is often measured by a similarity measure such as Euclidean distance or cosine similarity. The similarity measure process in text mining can be used to identify the suitable clustering algorithm for a specific problem. This survey discusses the existing works on text similarity by partitioning them into three significant approaches: string-based, knowledge-based and corpus-based similarities.
Performance Efficient DNA Sequence DetectionalgoRahul Shirude
The document proposes a parallel string matching algorithm using CUDA to perform DNA sequence detection on a GPU. It aims to optimize GPU core utilization and achieve high performance gains over sequential approaches. The proposed algorithm finds correct pattern matches in DNA database sequences. Experimental results show the parallel GPU approach has much higher performance than sequential CPU methods.
A Comparison of Serial and Parallel Substring Matching Algorithmszexin wan
This document summarizes a study comparing serial and parallel substring matching algorithms. It describes implementing a naive serial algorithm and dynamic programming algorithm in parallel using RPI's supercomputer. Testing on randomly generated and real-world text data showed the parallel naive algorithm outperformed the parallel dynamic algorithm for smaller files, while the dynamic algorithm scaled better to larger files. Analysis found the parallel algorithms grew exponentially with input size due to blocking MPI calls, though the dynamic implementation had memory issues for large files due to its 2D array structure. On average, the parallel implementations provided a 25x speedup over the fastest serial algorithm.
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemLiwei Ren任力偉
The bad character shift rule of the Boyer-Moore string search algorithm is studied in this paper for the purpose of extending it to more general string matching problems. An abstract string matching problem is defined in general terms, and an optimized string matching algorithm based on the bad character heuristic is proposed to solve the abstract problem efficiently.
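A minimal sketch of the classic bad-character heuristic discussed in this paper, in Python; this shows only the standard rule, not the paper's abstract extension:

```python
def bad_character_table(pattern):
    """Rightmost index of each character in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}

def bm_bad_char_search(text, pattern):
    """Boyer-Moore search using only the bad-character rule
    (the good-suffix rule is omitted for brevity)."""
    last, m, n = bad_character_table(pattern), len(pattern), len(text)
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:   # compare right to left
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1
        else:
            # Shift so the rightmost occurrence of the mismatched text
            # character lines up with it; max(1, ...) keeps progress forward.
            s += max(1, j - last.get(text[s + j], -1))
    return hits
```

When the mismatched text character does not occur in the pattern at all, the window skips past it entirely, which is what makes the rule effective on large alphabets.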
This document summarizes various stemming algorithms used for information retrieval. It discusses rule-based stemming algorithms like the Porter stemmer and Lovins stemmer which use language rules to remove suffixes and extract word stems. It also describes statistical stemming methods which use machine learning on text corpora to identify morphological variants. Finally, it analyzes different techniques for conflating words during stemming, such as affix removal, successor variety, table lookup, and n-gram methods.
A survey of Stemming Algorithms for Information Retrievaliosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field of Natural Language Processing. This report sets out to examine the numerous document similarity algorithms, and determine which ones are the most useful. It addresses the most effective document similarity algorithm by categorizing them into 3 types of document similarity algorithms: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are also compared in our work using a series of benchmark datasets and evaluations that test every possible area that each algorithm could be used in.
Proposed Method for String Transformation using Probablistic ApproachEditor IJMTER
This document discusses a probabilistic approach to string transformation and describes four modules: 1) approximate string search, 2) candidate selection for spelling error correction, 3) candidate generation for spelling error correction, and 4) query reformulation. It presents a new statistical learning approach to string transformation and describes how this approach can be applied to spelling error correction of queries and query reformulation for web search.
The Improved Hybrid Algorithm for the Atheer and Berry-ravindran Algorithms IJECEIAES
Exact string matching is one of the important ways of solving basic problems in computer science. This research proposes a hybrid exact string matching algorithm called E-Atheer. The algorithm combines good features of the searching and shifting techniques of the Atheer and Berry-Ravindran algorithms, respectively. The proposed algorithm showed better performance in the number of attempts and character comparisons compared to the original, recent, and standard algorithms. The E-Atheer algorithm was tested on several types of databases: DNA, Protein, XML, Pitch, English, and Source. The best performance in the number of attempts is achieved when the algorithm is executed on the Pitch dataset; the worst is on the DNA dataset. The best and worst databases in the number of character comparisons with the E-Atheer algorithm are the Source and DNA databases, respectively.
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...acijjournal
Representation of the semantic information contained in words is needed for any Arabic text mining application. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the co-occurrence frequencies of these words. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, the root-based (stemming) and stem-based (light stemming) approaches, for measuring the similarity between Arabic words with the well-known abstractive model, Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, Jaccard coefficient, and the Pearson correlation coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the stem-based approach outperforms the root-based one, because the latter affects the meanings of words.
Combining text and pattern preprocessing in an adaptive dna pattern matcherIAEME Publication
This paper presents an adaptive DNA pattern matching algorithm that combines both text and pattern preprocessing to efficiently detect patterns in DNA sequences. It initializes a pattern preprocessor similar to KMP and builds a suffix tree for the text. It then marks regions in the text and finds the maximum overlap between each region and the pattern. The algorithm searches the region with highest overlap first before moving to other regions. Experiments show it performs faster than KMP, with running time of O(m) where m is the pattern length.
The document summarizes string matching algorithms. It discusses the naive string matching algorithm which compares characters at each shift to find matches. It also describes the Rabin-Karp algorithm which uses hashing to match the hash value of the pattern to the hash value of substrings in the text. If the hash values match, it then checks for an exact character match. The Rabin-Karp algorithm has better average-case performance than the naive algorithm but the same worst-case performance of O((n-m+1)m) time.
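The hash-then-verify idea summarized above can be sketched as follows; the base and modulus below are arbitrary illustrative choices:

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Rolling-hash search: compare hashes first, and verify characters
    only on a hash hit, giving good average-case behavior."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(base, m - 1, mod)              # weight of the window's leading char
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for s in range(n - m + 1):
        # Character check only when hashes agree (guards spurious hits).
        if p_hash == t_hash and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:
            # Slide the window: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * h) * base
                      + ord(text[s + m])) % mod
    return hits
```

The worst case still degrades to O((n-m+1)m) when many windows collide with the pattern's hash, which matches the summary above.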
Sentence Validation by Statistical Language Modeling and Semantic RelationsEditor IJCATR
This paper deals with sentence validation, a sub-field of Natural Language Processing. It finds various applications in different areas, as it deals with understanding natural language (English in most cases) and manipulating it. The effort is thus on understanding and extracting important information delivered to the computer, making efficient human-computer interaction possible. Sentence validation is approached in two ways: a statistical approach and a semantic approach. In both approaches the database is trained with the help of sample sentences from the Brown corpus of NLTK. The statistical approach uses a trigram technique based on the N-gram Markov model and modified Kneser-Ney smoothing to handle zero probabilities. As another statistical test, tagging and chunking of sentences containing named entities is carried out using pre-defined grammar rules and semantic tree parsing, and the chunked sentences are fed into another database, upon which testing is carried out. Finally, semantic analysis is carried out by extracting entity-relation pairs, which are then tested. After the results of all three approaches are compiled, graphs are plotted and the variations studied. Hence, a comparison of three different models is calculated and formulated. Graphs pertaining to the probabilities of the three approaches are plotted, which clearly demarcate them and throw light on the findings of the project.
Indexing for Large DNA Database sequencesCSCJournals
Bioinformatics data consists of a huge amount of information due to the large number of sequences, the very long sequence lengths and the daily new additions, and this data needs to be accessed efficiently for many purposes. What makes one DNA data item distinct from another is its DNA sequence: a combination of the four characters A, C, G and T, with varying lengths. Using a suitable representation of DNA sequences, and a suitable index structure to hold this representation in main memory, leads to efficient processing by accessing the DNA sequences through the index, and reduces the number of disk I/O accesses. To avoid false hits, the number of candidate DNA sequences that need to be checked at the end is reduced by pruning, so there is no need to search the whole database. We need a suitable index for searching DNA sequences efficiently, with suitable index size and searching time; the selection of the relation fields on which the index is built has a big effect on both. Our experiments use the n-gram wavelet transformation with single-field and multi-field index structures in a relational DBMS environment. Results show the need to consider index size and search time carefully when using indexing. Increasing the window size decreases the amount of I/O references. The use of single-field and multi-field indexing is highly affected by the window size value: increasing it leads to better search time for a special index type with single-field indexing, while the search time is almost equally good with most index types when using multi-field indexing. The storage space needed for RDBMS index types is almost the same as, or greater than, the actual data.
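As a rough illustration of how an n-gram index supports this kind of pruning (a simplified in-memory sketch under assumed data layout, not the paper's multi-field RDBMS index):

```python
from collections import defaultdict

def build_ngram_index(sequences, n=4):
    """Map every length-n substring to the ids of sequences containing it.

    `sequences` is a hypothetical dict {seq_id: dna_string}; n=4 mirrors a
    small window size over the {A, C, G, T} alphabet.
    """
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(seq_id)
    return index

def candidate_sequences(index, query, n=4):
    """Prune: only sequences containing every n-gram of the query can
    possibly contain the query itself."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))
```

Only the surviving candidates then need a full (and more expensive) sequence comparison to rule out false hits, which is the pruning step the abstract describes.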
Recognition of Words in Tamil Script Using Neural NetworkIJERA Editor
In this paper, word recognition using neural network is proposed. Recognition process is started with the partitioning of document image into lines, words, and characters and then capturing the local features of segmented characters. After classifying the characters, the word image is transferred into unique code based on character code. This code ideally describes any form of word including word with mixed styles and different sizes. Sequence of character codes of the word form input pattern and word code is a target value of the pattern. Neural network is used to train the patterns of the words. Trained network is tested with word patterns and is recognized or unrecognized based on the network error value. Experiments have been conducted with a local database to evaluate the performance of the word recognizing system and obtained good accuracy. This method can be applied for any language word recognition system as the training is based on only unique code of the characters and words belonging to the language.
ROBUST TEXT DETECTION AND EXTRACTION IN NATURAL SCENE IMAGES USING CONDITIONA...ijiert bestjournal
In natural scene images, text detection is an important task used in many content-based image analysis applications. A maximally stable extremal region (MSER) based method is used for scene text detection. This MSER-based method includes the stages of character candidate extraction, text candidate construction, text candidate elimination and text candidate classification. The main limitation of this method is detecting highly blurred text in low-resolution natural scene images, and the current technology does not focus on any text extraction method. In the proposed system, a Conditional Random Field (CRF) model is used to assign each candidate component to one of two classes (text and non-text) by considering both unary component properties and binary contextual component relationships, using connected component analysis. The proposed system also performs text extraction using OCR.
An Application of Pattern matching for Motif Identification
K. K. Senapati, D. R. Das Adhikary & G. Sahoo
International Journal of Biometrics and Bioinformatics (IJBB), Volume (6), Issue (5): 2012. 135
An Application of Pattern Matching for Motif Identification
K. K. Senapati kksenapati@bitmesra.ac.in
Department of Computer Science & Engineering.
Birla Institute of Technology, Mesra
Ranchi, 835215, India
D. R. Das Adhikary dibya21@gmail.com
Department of Computer Science & Engineering.
Birla Institute of Technology, Mesra
Ranchi, 835215, India
G. Sahoo gsahoo@bitmesra.ac.in
Department of Information Technology.
Birla Institute of Technology, Mesra
Ranchi, 835215, India
Abstract
Pattern matching is one of the central and most widely studied problems in theoretical computer science.
Solutions to the problem play an important role in many areas of science and information processing. Its
performance has a great impact on many applications, including database querying, text processing and DNA
sequence analysis. In general, pattern matching algorithms differ in the shift value, the direction of
the sliding window and the order in which comparisons are made. The performance of an algorithm can
be enhanced to a great extent by a larger shift value and fewer comparisons to obtain the shift
value. In this paper we propose an algorithm for finding motifs in DNA sequences. The algorithm is based
on preprocessing of the pattern string (motif), considering the four consecutive nucleotides of the DNA that
immediately follow the aligned pattern window in the event of a mismatch between the pattern (motif) and the DNA
sequence. Theoretically, we found that the proposed algorithm works efficiently for motif identification.
Key words: Pattern Matching, Pattern String, Motif Finding, Motif Identification, Gene Identification,
Preprocessing.
1. INTRODUCTION
Pattern matching, finding one or, more generally, all the occurrences of a given query string
(pattern) in a possibly very large text, is an old and fundamental problem in computer science. It
emerges in applications ranging from text processing and music retrieval to bioinformatics. This task,
collectively known as string matching, has several different variations. The most natural and important of
these is exact string matching, in which, as the name suggests, one wishes to find only occurrences that
are exactly identical to the pattern string; this is the focus of our work. The field of approximate string
matching, on the other hand, finds occurrences that are similar to the pattern string.
Given a text array T[1 . . . n] of n characters and a pattern array P[1 . . . m] of m characters, the
problem is to find an integer s, called a valid shift, where 0 ≤ s ≤ n - m and T[s + 1 . . . s + m] = P[1 . . . m]. In
other words, the problem is to determine whether P occurs in T, i.e., whether P is a substring of T. The elements of P and T are
characters drawn from some finite alphabet such as {0, 1} or {A, B, . . . , Z, a, b, . . . , z} [1] [2].
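This definition can be checked directly with a naive scan over all shifts; a minimal Python sketch for reference (illustrative only, not one of the optimized algorithms discussed below), using 0-based indexing so that shift s corresponds to T[s . . . s+m-1]:

```python
def valid_shifts(T, P):
    """Return every valid shift s with 0 <= s <= n - m and T[s:s+m] == P."""
    n, m = len(T), len(P)
    return [s for s in range(n - m + 1) if T[s:s + m] == P]
```

Every exact-matching algorithm in this paper computes the same set of shifts; they differ only in how much of this work they can safely skip.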
All pattern-matching algorithms scan the text with the help of a window whose length equals that of the
pattern. The first step is to align the left ends of the window and the text, and then compare the
corresponding characters of the window and the pattern. After a whole match or a mismatch of the
pattern, the window is shifted forward over the text until its right end reaches the
end of the text. The algorithms vary in the order in which character comparisons are made and the
distance by which the window is shifted over the text after each attempt. Many pattern matching algorithms
are available, with their own merits and demerits depending on the pattern length, periodicity and alphabet set.
An efficient approach is to move the pattern over the text using the best shift value. To this end, several
algorithms have been proposed to obtain a better shift value, for example: Boyer–Moore [3], Quick Search
[4], Karp-Rabin [5], Raita [6] and Berry–Ravindran [7]. The efficiency of an algorithm lies in two phases:
the pre-processing phase and the searching phase. An effective searching phase can be established by altering
the order in which characters are compared in each attempt and by choosing an optimum shift value that allows
a maximum skip over the text [8]. The differences between the various algorithms are mainly due to the shifting
procedure and the speed at which a mismatch is detected.
The rest of the paper is organized as follows. Section 2 reviews several efficient algorithms in
practice. Section 3 describes the proposed algorithm in detail. Section 4 presents the complexity analysis of
the proposed algorithm. Section 5 gives the experimental comparison between the proposed algorithm
and the other algorithms. Section 6 concludes the paper.
2. PREVIOUS WORK
Many promising data structures and algorithms discovered by the theoretical community are never
implemented or tested at all, because only their theoretical behavior is analyzed while their
performance in practice is neglected. Several pattern matching
algorithms have been proposed in the literature to date, as discussed below.
In the naive or Brute Force technique [1] [2], the pattern is aligned with the left end of the text
and each pattern character is compared with the corresponding text character. If a
mismatch occurs, or the pattern is exhausted, the pattern is shifted one position to the right and the
search resumes from the start of the pattern, until the text is exhausted or a match is found.
Naturally the number of comparisons performed is very high, since each time the pattern is
shifted right by only one position. The worst case complexity of the algorithm is O(mn). The
number of comparisons can be reduced if the pattern can be moved by more than one position at a time. This
is the idea behind the KMP algorithm [1][2][9][10]. The KMP algorithm compares the pattern with the text from left to right,
just as the BF algorithm does. When a mismatch occurs, the KMP algorithm moves the pattern to the
right while maintaining the longest overlap of a prefix of the pattern with a suffix of the part of the text
matched so far. The KMP algorithm performs at most about 2n text comparisons, and the worst case
complexity of the algorithm is O(m+n).
The Boyer-Moore algorithm [2][3] differs from the other algorithms in one key feature: instead of comparing the pattern
characters from left to right, the comparison proceeds from right to left, starting from
the rightmost character of the pattern. In case of a mismatch it uses two functions, the last occurrence
function and the good suffix function. If the mismatched text character does not occur in the pattern, the last
occurrence function returns m, where m is the length of the pattern, so the maximum possible shift
is m. The worst case running time of the algorithm is O(mn).
The Quick Search (QS) algorithm [2][4] performs comparisons in left to right order; its shifting criterion
looks at the one character immediately to the right of the pattern window and applies the bad character
shift rule. The worst case time complexity of QS is the same as that of the Horspool algorithm, but it can take more steps in practice.
The Berry-Ravindran (BR) algorithm [2][7][11], proposed by Berry and Ravindran, performs shifts by applying the
bad character shift rule to the two consecutive characters immediately to the right of the text
window. The preprocessing time complexity is O(m + |Σ|^2) and the searching time complexity is O(mn).
In this paper, the idea is to reduce the number of comparisons performed by obtaining as large a shift
as possible: a shift value greater than the length of the pattern is obtained when the pattern mismatches
with the text at a given alignment.
3. PROPOSED ALGORITHM
The proposed algorithm works in two phases: the preprocessing phase and the searching phase.
3.1. Preprocessing phase
In the Berry-Ravindran algorithm, the two consecutive characters immediately to the right of the pattern window
are considered in the function brBc [7]. In the proposed algorithm, the four consecutive characters
immediately to the right of the window are considered instead. Initially, the indexes of these four consecutive characters
in the text, counted from the left, are m, m+1, m+2 and m+3 for a, b, c and d respectively. The function MBrBc(a, b, c, d)
is computed for every four characters a, b, c, d with respect to the pattern.
MBrBc(a, b, c, d) is defined as follows (0-based indexing of the pattern P[0 . . . m-1]):

MBrBc[a,b,c,d] = min {
    1       if P[m-1] = a
    2       if P[m-2]P[m-1] = ab
    3       if P[m-3]P[m-2]P[m-1] = abc
    m - i   if P[i]P[i+1]P[i+2]P[i+3] = abcd, 0 ≤ i ≤ m-4
    m + 1   if P[0]P[1]P[2] = bcd
    m + 2   if P[0]P[1] = cd
    m + 3   if P[0] = d
    m + 4   otherwise
}

This function finds the position of the rightmost occurrence of 'abcd' in the pattern and computes the shift
value so that the 'abcd' in the pattern and the 'abcd' (if any) in the text immediately to the right of the
window are aligned. If 'abcd' does not occur in the pattern, the cases in which only a portion of it, i.e.
'a', 'ab', 'abc', 'bcd', 'cd' or 'd', occurs as a suffix or prefix of the pattern are considered instead.
We improve the algorithm programmatically by using a one-dimensional array called the shift array (SA).
The shift array stores the shift value for every four-character combination over the given alphabet. For
any four given characters we generate a unique number and use it as the index at which the shift
value for those four characters is stored. For any 'abcd' the value is calculated as follows:
index = (a*1000) + (b*100) + (c*10) + d
SA[index] = shift value
We use this technique to store the shift values, which is more efficient than the other possible techniques.
3.2. Searching Phase
In the proposed algorithm, after each attempt the window is shifted to the right using the shift value
computed for the four consecutive characters immediately to the right of the window. The MBrBc function calculates
the shift value based on the rightmost occurrence of the four consecutive characters, say abcd,
immediately to the right of the window. The probability that four consecutive characters abcd occur in
the pattern is lower than that of a pair ab. Thus MBrBc always provides a better shift than brBc (the Berry-Ravindran
bad character function) [12].
The searching phase of the algorithm works in the following way.
Step 1: Compare the characters of the window with the corresponding text characters from the left and the
right simultaneously [13][14]. If a mismatch occurs during comparison, the algorithm goes to Step 2; otherwise the
comparison continues until a complete match is found. The algorithm then stops and reports the
corresponding position of the pattern in the text. If all occurrences of the pattern in the
text are to be found, the algorithm continues to Step 2.
Step 2: In this step, the shift value is taken from the shift array, depending on the four text characters
placed immediately after the pattern window. The window is shifted to the right by this shift
value and the algorithm returns to Step 1; this process continues until the end of the text.
Suppose we have a text T[0 . . . n-1] and a pattern
P[0 . . . m-1], and we start searching for P in T. The algorithm compares the pattern with the selected text window
from both sides (right and left) simultaneously. After a match or mismatch, the MBrBc shift value is used to
shift the window to the right. This procedure is repeated until the window is placed beyond position n - m + 1, that is,
until the last character of the pattern would be placed beyond the last character of the text.
The Searching Phase of the Motif Finding Algorithm

Search_Motif(T, P)
    n ← length[T]
    m ← length[P]
    i ← 0
    s ← n - m
    while i ≤ s do
        left ← 0
        right ← m - 1
        while left ≤ right do
            if P[left] = T[i + left] and P[right] = T[i + right]
                left ← left + 1
                right ← right - 1
            else
                break
        end while
        if left > right
            print "pattern occurs at" T[i] to T[i + m - 1]
        i ← i + MBrBc(T[i + m], T[i + m + 1], T[i + m + 2], T[i + m + 3])
    end while
end procedure
3.3 Working Example
Consider the text and pattern strings shown below, where the text length is n = 15 and the pattern length is m = 8.
Text (T) = G C A T C G C A G A G A G T A
Pattern (P) = G C A G A G A G
First Attempt:
T: G C A T C G C A G A G A G T A
P: G C A G A G A G
Shift = 1 (MBrBc[G][A][G][A])

Second Attempt:
T: G C A T C G C A G A G A G T A
P:   G C A G A G A G
Shift = 2 (MBrBc[A][G][A][G])

Third Attempt:
T: G C A T C G C A G A G A G T A
P:       G C A G A G A G
Shift = 2 (MBrBc[A][G][T][A])
First Attempt: In the first attempt, we align the sliding window with the text from the left. A
match occurs between text character G and pattern character G at the left end of the window, and a
mismatch occurs between text character A and pattern character G at the right end of the window.
We therefore take the four text characters immediately following the window, (G, A, G, A). According
to the MBrBc function the shift is 1, so the window is shifted one step to the right.
Second Attempt: In the second attempt, a mismatch occurs between text character C and pattern
character G at the left end of the window, and a match occurs between text character G and pattern
character G at the right end of the window. We therefore take the four text characters immediately
following the window, (A, G, A, G). According to the MBrBc function the shift is 2, so the window is
shifted two steps to the right.
Third Attempt: In the third attempt, a mismatch occurs between text character T and pattern character
G at the left end of the window, and a match occurs between text character G and pattern character G
at the right end of the window. We therefore take the four text characters immediately following the
window, (A, G, T, A). According to the MBrBc function the shift is 2, so the window is shifted two
steps to the right.
Fourth Attempt: In the fourth attempt, we align the sliding window with the text after the shift. All
the characters of the pattern match the text, so a match is reported. The window is then moved on to
search for the next occurrence of the pattern in the text.
4. ANALYSIS
The space complexity is O(m/2 + 1), where m is the pattern length. The preprocessing time complexity is
O(m + |Σ|^4). The worst case time complexity is O(nm). The worst case occurs when, at each attempt, all
the compared characters of the text match and at the same time the shift value is 1, i.e. the
last character of the pattern equals the first character after the window.
Example:
Text (T): AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Pattern (P): AAAAA
The best case time complexity is O(n/(m+4)). The best case occurs when, at each attempt, a mismatch is
found in the first comparison and at the same time the shift value is m+4.
Example:
Text (T): AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Pattern (P): SSSSS
5. EXPERIMENTAL EVALUATIONS
To assess the performance of our algorithm, we considered all the well-known algorithms stated above
for comparison with the proposed algorithm. The algorithms were compared on test cases and the
corresponding results are discussed.
5.1 Environment
In the experiments we used a PC with an Intel(R) Core(TM) 2 Duo 2.10 GHz processor (reported speed 1.96 GHz)
and 2 GB of RAM. The host operating system is Windows XP. The source code was compiled using the gcc
compiler without optimization, and all coding was done with the Dev C++ tool.
Fourth Attempt:
T: G C A T C G C A G A G A G T A
P:           G C A G A G A G
(comparison order 1 3 5 7 8 6 4 2; complete match)
5.2 Results &Discussion
The plant genome (Arabidopsis thaliana) consists of 27,242 gene sequences distributed over five
chromosomes (CHR_I to CHR_V) (NCBI site, ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana/CHR_I).
Part of a nucleotide sequence of a gene from Chromosome I (CHR_I) has been used (see below for
details) to test the proposed algorithm. Two types of data have been analyzed for comparison: one with
a small alphabet, |Σ| = 4 (nucleotide sequences), and another with a large alphabet, |Σ| = 20
(amino acid sequences). The algorithms were tested on patterns of size 4, 8, 12, 16 and 20.
To assess the performance of our algorithm, we considered two well-known algorithms (the Brute Force
algorithm and the Berry-Ravindran algorithm [7]) together with improved versions of both
(the Improved Brute Force algorithm and the Improved Berry-Ravindran algorithm) for comparison with the
proposed algorithm. The Improved Brute Force algorithm is a variation of the Brute Force algorithm in which
we apply our method of searching in the searching phase. In the same way, we apply our
method of searching in the searching phase of the Berry-Ravindran algorithm [7] to obtain the Improved Berry-
Ravindran algorithm. The algorithms were compared on test cases and the corresponding results are
discussed below.
Figure 1: Finding Motif in Nucleotide Sequence (time). [Plot: time in milliseconds vs. pattern size (4, 8, 12, 16, 20) for Brute Force, Improved Brute Force, Berry-Ravindran, Improved Berry-Ravindran and the proposed algorithm.]
Figure 2: Finding Motif in Nucleotide Sequence (character comparison). [Plot: number of comparisons vs. pattern size (4, 8, 12, 16, 20) for the same five algorithms.]
Figure 3: Finding Motif in Nucleotide Sequence (attempt). [Plot: number of attempts vs. pattern size (4, 8, 12, 16, 20) for the same five algorithms.]
Figure 4: Finding Motif in Amino Acid Sequence (time). [Plot: time in milliseconds vs. pattern size (4, 8, 12, 16, 20) for the same five algorithms.]
Figure 5: Finding Motif in Amino Acid Sequence (character comparison). [Plot: number of comparisons vs. pattern size (4, 8, 12, 16, 20) for the same five algorithms.]
Figure 6: Finding Motif in Amino Acid Sequence (attempt). [Plot: number of attempts vs. pattern size (4, 8, 12, 16, 20) for the same five algorithms.]
From the above figures it is clear that the proposed algorithm requires less time, fewer character
comparisons and fewer attempts than the others in almost all cases to find all the occurrences of the pattern.
As the experimental results show, the proposed algorithm is more efficient than the others for small patterns,
and motif lengths normally range between 8 and 20. It is therefore well suited to finding motifs in amino
acid sequences as well as in nucleotide sequences [15].
6. CONCLUSIONS
In this paper we present an efficient pattern matching algorithm based on preprocessing of the pattern
string while considering four consecutive characters of the text. The idea of considering four consecutive
characters stems from the fact that an occurrence of four successive characters is less frequent than that
of shorter combinations, so the shift value obtained is larger than those of the Berry-Ravindran
and Brute Force algorithms. The concept of searching from both sides makes the algorithm efficient when
a mismatch is present at the end of the pattern with respect to the aligned text window. Theoretically, we show that
the proposed algorithm shifts the pattern faster than the other compared algorithms. Experimentally, we
show that the proposed algorithm indeed significantly outperforms the compared algorithms in almost all
cases.
7. REFERENCES
[1] T.H. Cormen, C.E. Leiserson, R.L. Rivest. Introduction to Algorithms, MIT Press, First Edition,
1990, pp. 853-885.
[2] C. Charras, T. Lecroq(1997). Handbook of Exact String Matching Algorithms. [online]. Available:
http://www-igm.univ-mlv.fr/~lecroq/string/string.pdf [Oct 08, 2012].
[3] R.S. Boyer, J.S. Moore. A Fast String Searching Algorithm, Communications of the ACM, vol. 20,
pp.762-772, 1977.
[4] D.M. Sunday. A Very Fast Substring Search Algorithm, Journal of Communication of the ACM,
vol. 33, pp. 132-142, 1990.
[5] R.M. Karp, M.O. Rabin. Efficient Randomized Pattern Matching Algorithms, IBM J. Res. Dev,
vol. 31, pp. 249-260, 1987.
[6] T. Raita. “Tuning the Boyer-Moore-Horspool String-Searching Algorithm”, Software: Practice and
Experience, 1992, pp. 879-884.
[7] T. Berry, S. Ravindran. “A Fast String Matching Algorithm and Experimental Results”,
Proceedings of the Stringology Club Workshop’99, 1999, pp. 16-26.
[8] R. Thathoo, A. Virmani, S. Lakshmi, N. Balakrishnan, K. Sekar. TVSBS: A Fast Exact Pattern
Matching Algorithm for Biological Sequences, Current Science, vol. 91, pp. 47-53, Jul. 2006.
[9] D. E. Knuth, H. Morris, V. R. Pratt. Fast Pattern Matching in Strings, SIAM Journal of computing,
vol. 6, pp. 323-350, 1977.
[10]J. H. Morris(Jr), V. R. Pratt. “A Linear Pattern Matching Algorithm”, 40th Technical Report,
University of California, Berkeley, 1970.
[11] Y. Huang, L. Ping, X. Pan, G. Cai. “A Fast Exact Pattern Matching Algorithm for Biological
Sequences”, International Conference on Biomedical Engineering and Informatics, IEEE
Computer Society, Feb. 2008, pp. 8-12.
[12] V. Radhakrishna, B. Phaneendra, V. S. Kumar. “A Two Way Pattern Matching Algorithm Using
Sliding Patterns”, 3rd International Conference on Advanced Computer Theory and Engineering
(ICACTE), 2010, vol. 2, pp. 666-670.
[13]Hussain, M. Zubair, J. Ahmed, J. Zaffar. “Bidirectional Exact Pattern Matching Algorithm”,
TCSET’2010, Feb. 2010, pp. 295.
[14] S. S. Sheik, S. K. Aggarwal, A. Poddar, N. Balakrishnan, K. Sekar. A FAST Pattern Matching
Algorithm, Journal of Chemical Information and Computer Sciences, vol. 44, pp. 1251-1256,
2004.
[15]M.Q. Zhang. “Computational prediction of eukaryotic protein-coding genes”, Nature Reviews
Genetics, vol. 3, Sep. 2002, pp. 698-709.