AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR INTERVALS

International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.5, October 2012
DOI: 10.5121/ijcsea.2012.2505

Elahe Moghimi Hanjani and Mahdi Javanmard
Department of Computer Engineering, Payam Noor University, Tehran, Iran, PO Box 19395-3697
elahe_moghimi@yahoo.com
info@javanmard.com
ABSTRACT
In this paper we attempt to reduce the number of checked conditions by saving the frequency of tandem replicated words and by applying non-overlapping iterative neighbor intervals to the plane sweep algorithm. The essential idea of non-overlapping iterative neighbor search in a document lies in focusing the search not on the full space of solutions but on a smaller subspace of non-overlapping intervals defined by the solutions. The subspace is defined by the range near the keyword with minimum frequency. We repeatedly pick a range and discard the unsatisfied keywords, so that the relevant ranges are detected. The proposed method improves the plane sweep algorithm by efficiently calculating the minimal group of words and enumerating the intervals in a document that contain the minimum-frequency keyword. It decreases the number of comparisons and yields an optimized search, especially for high volumes of data. Efficiency and reliability are also increased compared to previous versions of the technique.
KEYWORDS
Plane sweep algorithm, Replicated data, Partial search, Text retrieval, Proximity search.
1. INTRODUCTION
Even the best search engine will not return good results if the keywords originally selected by the user are unsuitable. The proposed algorithm therefore searches over a set of alternative keywords generated from the user's original keywords, helping the user in subsequent search activities and reducing search time by restricting attention to the relevant range.
The plane sweep algorithm assumes that keywords appearing near each other in a document are related. We therefore use the positions of keywords in a document as the unit of queries. By considering keyword positions, we can find the paragraph or sentence in a document that describes what we want to know. We rank the regions of a document that contain all specified keywords in order of their sizes. This is called proximity search [4,5,8]. Proximity search deals with the locations of, and the distance relationships among, keywords in a document.
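The proximity search described above can be sketched with the standard plane-sweep / sliding-window approach: merge the per-keyword position lists into one sorted stream and slide a window over it to find the smallest region containing every keyword. This is an illustrative sketch, not the authors' exact implementation; the names `offset_lists` and `smallest_cover` are chosen here for clarity.

```python
def smallest_cover(offset_lists):
    """Return the smallest interval (lo, hi) of positions that contains at
    least one occurrence of every keyword, or None if some keyword is absent.

    offset_lists: one sorted list of positions per keyword.
    """
    if any(not lst for lst in offset_lists):
        return None
    # Merge all (position, keyword_id) pairs into one sorted stream.
    events = sorted((pos, kid)
                    for kid, lst in enumerate(offset_lists)
                    for pos in lst)
    need = len(offset_lists)
    count = [0] * need   # occurrences of each keyword inside the window
    covered = 0          # number of distinct keywords the window covers
    best = None
    left = 0
    for right, (pos_r, kid_r) in enumerate(events):
        if count[kid_r] == 0:
            covered += 1
        count[kid_r] += 1
        # Shrink from the left while the window still covers all keywords.
        while covered == need:
            pos_l, kid_l = events[left]
            if best is None or pos_r - pos_l < best[1] - best[0]:
                best = (pos_l, pos_r)
            count[kid_l] -= 1
            if count[kid_l] == 0:
                covered -= 1
            left += 1
    return best
```

With two keywords occurring at positions [1, 10] and [3, 40], the smallest covering region is positions 1 through 3.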
The algorithm finds a target item in an offset list selected according to the minimum-frequency keyword and a distance factor; one can trade accuracy for speed and find the part of the offset list containing the target item even faster. Moving iteratively builds the sequence of solutions generated by the algorithm. It requires finding all partial ranges for a given offset list, using the technique of finding the keyword with the minimum number of repeats.
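The idea of anchoring the search on the least-frequent keyword can be sketched as follows: pick the keyword with the fewest occurrences, then binary-search the other offset lists around each of its positions. This is a hypothetical helper under assumed names (`ranges_near_min_keyword`, distance bound `d`), not the paper's code.

```python
from bisect import bisect_left

def ranges_near_min_keyword(offset_lists, d):
    """For each occurrence of the least-frequent keyword, check via binary
    search whether every other keyword occurs within distance d, and return
    the qualifying candidate ranges as (lo, hi) position pairs.
    """
    # Anchor on the keyword with the fewest occurrences.
    pivot = min(range(len(offset_lists)), key=lambda k: len(offset_lists[k]))
    results = []
    for pos in offset_lists[pivot]:
        lo, hi = pos, pos
        ok = True
        for k, lst in enumerate(offset_lists):
            if k == pivot:
                continue
            i = bisect_left(lst, pos)
            # The nearest occurrence is lst[i-1] or lst[i]; take the closer.
            cands = [lst[j] for j in (i - 1, i) if 0 <= j < len(lst)]
            if not cands:
                ok = False
                break
            nearest = min(cands, key=lambda p: abs(p - pos))
            if abs(nearest - pos) > d:
                ok = False
                break
            lo, hi = min(lo, nearest), max(hi, nearest)
        if ok:
            results.append((lo, hi))
    return results
```

Because only the shortest offset list is scanned linearly and the rest are probed by binary search, the work per anchor position is logarithmic in the other lists' lengths.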
We implemented and tested a unified approach to generating query suggestions based on phrases. We perform a search for all subsets at query time; in the case where a word overlaps itself repeatedly, we count the number of ordered pairs of symbols that are adjacent in the document, and by using the iterated partial search we limit the search space. The search is therefore performed in less time, especially for large volumes of stored data.
The running time of the proposed algorithm is O((n − β) log k), where n is the
frequency of keyword occurrences in a document, β is the frequency of tandem
replicated data, and k is the number of query terms in a query.
2. RELATED WORKS
In string matching, there are some results on finding k keywords within a given maximum
distance d. Gonnet et al. proposed an algorithm for finding two keywords P1 and P2 within
distance d in O(m1 + m2) time, where m1 < m2 are the numbers of occurrences of the two
keywords. Baeza-Yates and Cunto proposed the abstract data type Proximity and an
O(log n) time algorithm. Manber and Baeza-Yates also proposed an O(n log n) time
algorithm [4,5].
They assume that the maximum distance d is known in advance. Sadakane and Imai proposed
an O(n log k) time algorithm for a restricted version of the problem. The plane sweep
algorithm achieves the same time complexity for the k-keyword proximity problem while
dealing with a generalized version of the problem [4,5].
There are several algorithms for finding exact tandem repeats. Most of them have two
phases: a scanning phase that locates candidate tandem repeats, and an analysis phase
that checks the candidates found during the scanning phase [10]. We use the idea of
finding repeated data in a preprocessing phase to decrease the running time of the
plane sweep algorithm. In the partial search of our algorithm, one is interested in
finding the exact address of the target item and searching around it. The proposed
algorithm therefore limits the searched area to a subset of the document by escaping
from ineffective keywords, which reduces the running time compared with the plane
sweep algorithm.
3. PROPOSED ALGORITHM
Keyword proximity search in a document finds the relevant documents in which all the
terms of a query appear within a relatively small fixed-size window.
The notion of proximity search differs from text search with wildcards in three key ways:
1. The total length of the matched string is bounded, thus there is a cumulative bound on the
length of the arbitrary sequences.
2. The order of the search terms is not specified; thus, in the example, any permutation of A,
B, and C is permitted.
3. Search begins with an index (i.e., a list of occurrences for each term, often
called a position list) instead of the original text [4].
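The classic plane sweep over position lists can be sketched as follows. This is a hypothetical minimal implementation for illustration (function and variable names are my own, not the paper's): merge the per-term position lists into one sorted event stream and slide a window over it until it covers every term, keeping the smallest covering window.

```python
from typing import Dict, List, Optional, Tuple

def smallest_cover(position_lists: Dict[str, List[int]]) -> Optional[Tuple[int, int]]:
    """Return the smallest (start, end) position window that contains at
    least one occurrence of every term.  Position lists are assumed sorted."""
    events = sorted((pos, term) for term, positions in position_lists.items()
                    for pos in positions)
    need = len(position_lists)
    counts: Dict[str, int] = {}
    best = None
    left = 0
    for right, (pos_r, term_r) in enumerate(events):
        counts[term_r] = counts.get(term_r, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == need:
            pos_l, term_l = events[left]
            if best is None or pos_r - pos_l < best[1] - best[0]:
                best = (pos_l, pos_r)
            counts[term_l] -= 1
            if counts[term_l] == 0:
                del counts[term_l]
            left += 1
    return best
```

For example, with A at positions {1, 10}, B at {2, 9} and C at {5}, the smallest covering window is (1, 5).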
The plane sweep algorithm searches the inverted lists until the search range is
detected, whereas the proposed algorithm searches in the neighbourhood of the minimum
keyword; after scanning the search range and obtaining the critical range that is
minimal, it shifts to the next range. The important point is that we remove the
sequences containing replicated data and admit only the related ranges produced by the
algorithm, which significantly reduces the expected number of comparisons relative to
previous algorithms under the above parameters.
We try to establish a relationship between the query and the keywords found in the
searched document, and we also need to reduce the number of comparisons so that the
search operation runs faster. The proposed algorithm reduces the searched area to a
minimum and relies on an optimized search strategy for effectively pruning the search
space.
Definition1. Given k keywords W1, ..., Wk, a set of lists K = {K1, ..., Kk}, where Ki
is the offset list of the i-th keyword Wi, positive integers f1, ..., fk, and k′
(k′ ≤ k), the generalized k-keyword proximity problem is to find the smallest range in
which k′ distinct keywords Wi (1 ≤ i ≤ k) appear at least fi times each.
Note that the problem reduces to the basic plane sweep algorithm when fi = 1 for all
1 ≤ i ≤ k.
If we denote an offset list by w and the frequency of a word by f, we have
w[i+1] = w[i] + f, for 1 ≤ i ≤ |w| − 1. The offset of w for a tandem replicated word
is the offset of its first occurrence.
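The preprocessing step that collapses tandem repeats into single entries with a counter can be sketched as below. This is a minimal illustrative helper (its name and the (offset, word, count) tuple layout are my own assumptions); each run of repeated words keeps only the offset of its first occurrence, as stated above.

```python
def collapse_tandem(words):
    """Collapse consecutive (tandem) repeats of a word into a single
    entry (offset_of_first_occurrence, word, repeat_count)."""
    runs = []
    for offset, w in enumerate(words):
        if runs and runs[-1][1] == w:
            runs[-1][2] += 1        # extend the current tandem run
        else:
            runs.append([offset, w, 1])
    return [tuple(r) for r in runs]
```

For instance, `collapse_tandem(list("ABCCCA"))` yields `[(0, 'A', 1), (1, 'B', 1), (2, 'C', 3), (5, 'A', 1)]`: the run CCC becomes one entry with counter 3 and the offset of its first C.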
Definition2. Let (X, d) be a metric space and R* ⊆ D the set of offsets. A range query
(q, r), with q = Offset[Min(Q)] ∈ D and r ∈ R+, reports all keywords that are within
distance r of q, that is, (q, r) = { u ∈ R* | d(u, q) ≤ r }. The volume defined by
(q, r) is the range space, and all the keywords from R* within it are reported. The
proposed algorithm can be implemented using range queries.
Definition3. A range space is a pair Σ = (D, R*), where D is the search space and R*
is a family of subsets of D. The elements of R* are the ranges of Σ, where Σ is
finite. In optimization queries, we want to return an object that satisfies a certain
condition with respect to the query range. Ineffective searches have no effect on the
result, so we have:

F(D) = { X ⊆ D | X = R*_{j1}, R*_{j2}, ..., R*_{jk} },    R* ⊆ D.

D is the search space, which contains the ranges that can be matched with chains of
basic moves.
Definition4. Let RD be the precision of the ranges that are relevant to the query and
also retrieved by the plane sweep algorithm, in which a better solution has been
found. RD in a document can be defined as the conditional probability that a candidate
range is relevant; this parameter shows the ratio of relevant ranges:

RD = #(Retrieved relevant ranges) / #(Plane sweep candidate ranges)
   = #(Relevant ranges ∩ Plane sweep candidate ranges) / #(Plane sweep candidate ranges).    (1)
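Equation (1) is a straightforward set ratio; a minimal sketch (hypothetical helper, with ranges represented as (start, end) tuples) might look like this:

```python
def range_precision(relevant, candidates):
    """RD of Eq. (1): the fraction of plane-sweep candidate ranges that
    are also relevant to the query."""
    candidates = set(candidates)
    if not candidates:
        return 0.0
    return len(set(relevant) & candidates) / len(candidates)
```

If two of four candidate ranges are relevant, RD = 0.5; a lower RD for a competing method would indicate more wasted candidate ranges.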
First, we find the replicated words in the list of tandem words with their specified
offsets, which is the output of the preprocessing stage; we then take the keyword with
the minimum counter and limit the search area to the range around that keyword. We
reduce the number of comparisons by counting the tandem replicated words and by
removing unsatisfied searches. The main objective is to find the most efficient and
relevant answer to the query.
There are many results that contain the query's keywords, but users are interested in
a much smaller subset. For this purpose we define a range R* that contains the nearest
neighbours of the keyword with the minimum count. Given the proposed algorithm, we can
perform a local search in R*, considering the range
(Pos[Mi] − (|Q| − 1), Pos[Mi] + (|Q| − 1)) around the minimum keyword. We explore R*
using a walk that steps from one R* to a nearby one, following the list of occurrences
of the defined minimum keyword. The algorithm escapes from ineffective keywords by
applying the current partial range.
If we denote the distance factor between two occurrences of the minimum keyword by
DF_MinKeyword, defined as DF_MinKeyword = [DF_{K1,2}, DF_{K2,3}, ..., DF_{Kn,m}],
where m = #ofMinKeyword, then by our condition it must be greater than |Q| − 1, where
|Q| is the length of the query. If one of the distance factors is lower than |Q| − 1,
we must skip the right range of MK_{i−1} or the left range of MK_i to keep the number
of comparisons minimal; in this situation the search ranges do not overlap.
Fig.1 Distance factor between two minimum keywords.
It is possible that only a few keywords in the document attain the minimum count. We
therefore consider another factor, the distance factor: the number of locations
between two occurrences of the minimum keyword. Its value must be greater than the
query length minus one (|Q| − 1); otherwise the search ranges may overlap and the
search results will not be optimal. If this factor is also equal, we pick one of the
minimum-count keywords at random and limit the search range to the distance around
that keyword.
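The non-overlap test on the distance factor can be sketched as follows (a hypothetical helper under my own reading of the rule: an occurrence is kept only if the number of locations between it and the previously kept occurrence is at least |Q| − 1):

```python
def keep_non_overlapping(min_offsets, query_len):
    """Keep occurrences of the minimum keyword whose distance factor to
    the previously kept occurrence is at least query_len - 1, so the
    search ranges built around them cannot overlap."""
    kept = []
    for off in sorted(min_offsets):
        # Distance factor = number of locations strictly between the two
        # occurrences, i.e. off - previous - 1.
        if not kept or off - kept[-1] - 1 >= query_len - 1:
            kept.append(off)
    return kept
```

With offsets [2, 5, 8, 10] and |Q| = 3, the occurrence at 10 is dropped because only one location separates it from 8, which is below |Q| − 1 = 2.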
The implementation of this algorithm counts the number of tandem repeat words in a
document using an inverted file and limits the search range using the partial-range
technique. Two pointers are used to scan the document's offset list from left to
right: pointer Pl, which refers to the left end of the range, and Pr, which refers to
the last offset in the range.
Having defined R* as the searched range, the iterated partial search proceeds as
follows. Set Pl, Pr to the position of the first R*, then iteratively move to the next
R* that contains the item of interest; ranges that are irrelevant to the result are
skipped. Pr moves within the defined range; if its value exceeds the range size, it
returns to the beginning of the next interval pointed to by Pl. After a critical range
has been set, Pl also moves forward one place in the offset list. In this algorithm, a
counter is stored for each offset value; if the counter is greater than one, no
comparison needs to be made and we simply move forward by the size of the defined
range, so the algorithm does not consider every position of the word in the document.
In effect, we skip the repetitive sequences. After each critical range is defined, its
minimality is also checked. This continues until we reach the end of the list.
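A simplified sketch of this two-pointer scan is shown below. It is my own illustrative version, not the paper's implementation: it operates on the collapsed offset list (entries of the form (offset, word, counter)), so each run of tandem repeats is a single entry and is never re-compared regardless of its counter, and it records the covering window at each left position rather than performing the full minimality bookkeeping.

```python
def critical_ranges(offset_list, query_terms):
    """Scan a collapsed offset list with two pointers Pl and Pr, emitting
    (start_offset, end_offset) windows that cover every query term."""
    terms = set(query_terms)
    found = []
    counts = {}
    pr = 0
    for pl in range(len(offset_list)):
        # Advance Pr until the window [Pl, Pr) covers every query term.
        while pr < len(offset_list) and set(counts) != terms:
            word = offset_list[pr][1]
            if word in terms:
                counts[word] = counts.get(word, 0) + 1
            pr += 1
        if set(counts) != terms:
            break  # Pr passed the end of the list: no more candidate ranges.
        found.append((offset_list[pl][0], offset_list[pr - 1][0]))
        # Drop the left entry before advancing Pl.
        word = offset_list[pl][1]
        if word in terms:
            counts[word] -= 1
            if counts[word] == 0:
                del counts[word]
    return found
```

On the collapsed list for "CABAB" with query {A, B, C}, the only covering window is (0, 2).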
Our work improves the plane sweep algorithm by efficiently computing the minimal group
of words as the result.
The most important aspects to measure, apart from accuracy, are the time required for
the algorithm to compute the recommendations and the number of iterations users need
to reach the desired results. The latter can be compared with the number of iterations
required in an environment without removal of ineffective searches and with replicated
data left in place. In practice, there is a trade-off between the encoding overhead
and the amount of achievable reduction.
The number of comparisons in the defined ranges and the number of comparisons on
replicated words are defined below:

C = Σ_{k=1}^{#ofMinKeyword} Σ_{j=1}^{|D|} Σ_{i=1}^{|Q|} (λ_i × i),    C_PS = Σ_{j=1}^{|D|} Σ_{i=1}^{|Q|} (λ_i × i)

The number of comparisons for replicated words is calculated from the following
equation:

ctr = Σ_{k=1}^{|D|} (β_k − 1), if β_k > 1.    (2)

β   = number of replicated words
|D| = length of the document offset list − |Q| − 1
|Q| = length of the query
λ_i = availability factor of a word in the query, describing whether a comparison is
      made: λ_i = 1 if a match occurred in the Doc list, and 0 otherwise.

λ_i is set to one if position i is considered by the algorithm; otherwise no
comparison is made at that position, so only the considered positions contribute to
Formula (3).
As a consequence of the basic plane sweep count, the number of tandem replicated
words, and the removal of comparisons on ineffective keywords, the number of
comparisons of the proposed algorithm can be formulated as:

C_n = C_PS − ctr    (3)

As formula (3) shows, the number of comparisons in the proposed algorithm is, in each
range, the difference between the number of comparisons of the plane sweep algorithm
and the number of comparisons on replicated words. Hence, according to formula (3),
the proposed algorithm reduces the number of comparisons.
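The saving of equations (2) and (3) can be written out directly. This is a minimal sketch (hypothetical function names): each run of β_k tandem repeats costs one comparison instead of β_k, saving β_k − 1.

```python
def saved_comparisons(counters):
    """Eq. (2): total comparisons saved across runs; a run with counter
    c > 1 saves c - 1 comparisons."""
    return sum(c - 1 for c in counters if c > 1)

def proposed_comparisons(plane_sweep_comparisons, counters):
    """Eq. (3): comparisons of the proposed algorithm = plane sweep
    comparisons minus the comparisons saved on replicated words."""
    return plane_sweep_comparisons - saved_comparisons(counters)
```

For counters [1, 3, 1, 2] the saving is (3 − 1) + (2 − 1) = 3, so a plane sweep that would make 10 comparisons makes only 7 in the proposed scheme.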
For example, consider the following document:
Document: {ABCCCABCCBACBBBCBA}

Keyword in Doc:       A  B  C  A  B  C  B  A  C  B  C  B  A
# of repeated words:  1  1  3  1  1  2  1  1  1  3  1  1  1

Fig. 2 Searching on document {ABCCCABCCBACBBBCBA}
In this example the number of occurrences of A is 4, of B is 5, and of C is 4. The
minimum number of tandem repeated words belongs to A and C.
We consider the distance factor as the distance between two occurrences of the same
keyword. The distances between the occurrences of A are D_A = [2, 3, 4], and the
distances between the occurrences of C are D_C = [1, 2, 1].
In the second case, the distances D_{C1,2} and D_{C3,4} are lower than the query
length (|Q| − 1) and are removed from the search, so the search continues around A.
The resulting ranges, specified in Fig. 3, are shown as I1, I2, I3, I4.
Fig. 3 Example of searching with considering minimum repeat words
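The counts in Fig. 2 can be checked with a short script. This is an illustrative verification under my reading of the table (counts are per collapsed tandem run, not per raw character):

```python
from collections import Counter

doc = "ABCCCABCCBACBBBCBA"

# Collapse tandem runs: each run becomes [word, repeat_count].
runs = []
for ch in doc:
    if runs and runs[-1][0] == ch:
        runs[-1][1] += 1
    else:
        runs.append([ch, 1])

# Count collapsed occurrences per keyword.
tandem_counts = Counter(w for w, _ in runs)
```

The run CCC collapses to ['C', 3], and the collapsed occurrence counts come out as A: 4, B: 5, C: 4, matching the text.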
In this section we describe the proposed algorithm and show its improved average
running time. The algorithm is defined below:
(1) Sort the offsets of the keywords P_j (j = 1...n) in the document in increasing
order; a counter for replicated tandem words is also added to the list.
(2) Add the count of each query keyword to the offsetList.
(3) Get the minimum replicated keyword from the list, considering the distance factor.
(4) Repeatedly increase i by one until we reach the end of the Min-keyword offsetList.
(5) If we have passed the end of the list, sort the intervals in a heap, taking the
tandem replicated word counters into account, output them, and finish.
(6) Consider the neighbourhood of the target minimum keyword according to the distance
factor, so that no ranges overlap.
(7) Repeatedly increase Pl by one until the current range is a candidate range or we
have passed the end of the list.
(8) If we have passed the end of the partial list, go to step 4.
(9) Repeatedly increase Pr by one until the current range is no longer a candidate
range.
(10) The range (Pl, Pr−1) is a critical range. Compare its size with the stored
minimum range and replace the minimum range with the current range if (Pl, Pr−1) is
smaller.
(11) Go to step 4.
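The key pruning idea of the steps above, restricting the search to neighbourhoods of the least-frequent keyword (step 6), can be sketched as follows. This is a hypothetical simplification, not the full eleven-step procedure: it only generates the candidate partial ranges of width |Q| − 1 on each side of every occurrence of the minimum keyword.

```python
def partial_ranges(positions, query_len):
    """Pick the query term with the fewest occurrences and emit one
    candidate range [off - (|Q|-1), off + (|Q|-1)] around each of its
    occurrences.  `positions` maps each query term to its sorted offsets."""
    min_term = min(positions, key=lambda t: len(positions[t]))
    ranges = []
    for off in positions[min_term]:
        # Neighbourhood around the minimum keyword (step 6).
        ranges.append((off - (query_len - 1), off + (query_len - 1)))
    return min_term, ranges
```

With A at {1, 10}, B at {2, 9} and C at {5} for a 3-term query, only the single neighbourhood (3, 7) around C needs to be scanned, instead of the whole list.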
Fig. 4 Proposed algorithm flowchart
Suppose the query “BCA” is searched in the following document offset list; the
algorithm needs to compute all the related keywords:
Document: {CABABABCABBB}
A ‘tandem replicated word’ is a string of the form

X^s Z X^y ... W^q

where s, y, q ≥ 1, for some i ∈ {1, 2, ..., n}.
The concatenation of the tandem replicated words is as follows:

∏_{i=1}^{3} W^{q_i} X^{r_i} Z^{y_i}

where q_i ∈ {1}; r_i ∈ {1}; y_i ∈ {1, 3}.
The frequency of replicated words in the list is: W_sim = β / |D| = 2/14 = 0.14.
The ratio for the above example is calculated from the solution ranges found in the
document by the proposed algorithm and by the plane sweep algorithm. A lower ratio
value indicates greater efficiency of the proposed algorithm, because it reduces the
number of comparisons and, as a result, the number of ranges.
The offsets of the keywords in the list are:
K1 ∈ {2, 4, 6, 9}, K2 ∈ {0, 7} and K3 ∈ {1, 3, 5, 8}.
Fig. 5 Searching ranges on K1, K2, K3
The search is defined from left to right with respect to the range for the 3 keywords.
There are two partial ranges in the offset list, R*_1 and R*_2, where:

R*_1 = CAB,  R*_2 = ABCAB
Finding the solution for the above example:
Fig. 6 Result of the algorithm, shown as I1, I2, I3, I4
R*_1, R*_2 are the partial ranges output by the algorithm, and I1, I2, I3, I4 are the
result ranges.
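The worked example can be cross-checked by brute force. The helper below is my own illustrative checker, not the paper's algorithm: it enumerates the smallest substrings of the document containing every query term.

```python
doc = "CABABABCABBB"
query = {"B", "C", "A"}

def windows(doc, query):
    """Return all (start, end) index pairs of the smallest substrings of
    doc that contain every query term (brute force)."""
    best, found = len(doc) + 1, []
    for i in range(len(doc)):
        for j in range(i, len(doc)):
            if query <= set(doc[i:j + 1]):
                if j - i + 1 < best:
                    best, found = j - i + 1, [(i, j)]
                elif j - i + 1 == best:
                    found.append((i, j))
                break  # smallest window starting at i found
    return found
```

Running this gives four minimal ranges, (0, 2), (5, 7), (6, 8) and (7, 9), consistent with the four result ranges I1 through I4; the first lies inside CAB (offsets 0 to 2) and the other three inside ABCAB (offsets 5 to 9).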
Definition5. Let W_sim be the repetition factor of the tandem replicated words. It is
defined as:

W_sim = β / |D|,  0 ≤ W_sim ≤ 1.    (4)

Where |D| is the size of the offset list, and β is the frequency of replicated words
in the list.
This parameter shows the ratio of similarity retrieved from the keywords in a
document.
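Equation (4) is a simple ratio; a one-line sketch (hypothetical function name) is enough to reproduce the earlier example value:

```python
def repetition_factor(num_replicated, offset_list_len):
    """Eq. (4): W_sim = beta / |D|, which lies in [0, 1] whenever
    0 <= beta <= |D|."""
    return num_replicated / offset_list_len
```

For the earlier example, β = 2 and |D| = 14 give W_sim ≈ 0.14.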
4. TEST RESULTS
In order to study the effect of tandem replicated words with the iterative partial
range, we ran experiments on lists of different sizes; the results are shown below.
The data sets and test results are used to assess performance measures for the
algorithm under test.
The experiments (Table 1) treat the sequence as produced by a random file, with
different repetition factors of tandem replicated words and different file sizes. Note
that the running time depends not only on the repetition factor of the input document
but also on the number of keywords in the query. In fact, the greater the repetition
factor (that is, the number of tandem repeat words in the document) and the number of
keywords in the query, the better the proposed algorithm performs.
Table1: Results of the tests on 3-keywords

W_sim \ Data size   500      1000     2000     3000     4000
0.2                 0.0135   0.0218   0.0337   0.0443   0.0535
0.4                 0.0163   0.0234   0.0352   0.0474   0.0590
0.6                 0.0171   0.0244   0.0381   0.0527   0.0668
Fig. 7 Number of Comparisons on 3-keywords query.
We compared the three search algorithms, plane sweep, WPSR, and modified WPSR, on the
data set; the results are shown in Table 2.
Table2: Results of the tests on 3 algorithms with 3-keywords query and RD = 0.6, W_sim = 0.6

Datasize   Plane sweep   WPSR     M-WPSR
3000       0.1495        0.0790   0.0785
6000       0.1654        0.0809   0.0800
12000      0.3114        0.1500   0.1490
18000      0.4764        0.2282   0.2197
24000      0.594         0.2819   0.2801
30000      0.7161        0.3548   0.3448
Fig. 8 Number of comparisons made by the 3 algorithms with a 3-keywords query and
RD = 0.6, W_sim = 0.6
Table3: Results of the tests on offsets
Fig. 9 Number of comparisons made by modified WPSR and the plane sweep algorithm with
a 3-keywords query and W_sim = 0.4
The results above for the three search algorithms on different data sets, with
different values of the parameters W_sim and RD and with varying numbers of keywords
per query, show that the approach leads to better running times, especially on highly
repetitive data.
The proposed algorithm is expected to improve search accuracy and effectiveness; with
the fast-growing availability of information online, where users may not be aware of
the most up-to-date critical keywords, the proposed system is also expected to improve
search efficiency. Finally, we observe that using tandem replicated words with the
iterative partial range improves results for all sizes of the random sample,
especially large ones, and leads to a better running time.
5. CONCLUSION
We remove the sequences which contain replicated data and admit only the related
ranges produced by the proposed algorithm. This significantly reduces the expected
number of comparisons relative to previous algorithms under the above parameters.
The proposed algorithm becomes more advantageous as the amount of data increases,
because the number of retrieved ranges is then reduced further by the algorithm's
pruning of comparisons. The effect on high volumes of data is more significant, and
the method is robust and highly effective. Experimental results show that the new
algorithm performs well in practice.
Datasize   Plane sweep   M-WPSR
900        0.0239        0.0216
1800       0.0424        0.0350
3600       0.0775        0.0600
7200       0.1449        0.1101
14400      0.2751        0.2110
28800      0.5584        0.4202
REFERENCES
[1] Arash Ghorbannia Delavar, Elahe Moghimi Hanjani, (2012) "WPSR: Word Plane Sweep Replicated, Present a
Plane sweep Algorithm to optimize the use of Replicated data in a document", International Journal of Computer
Science Issues, Vol. 9, Issue 2, No 3.
[2] Alan Feuer, Stefan Savev, Javed A. Aslam, (2009) "Implementing and evaluating phrasal query suggestions for
proximity search", Elsevier, College of Computer and Information Science.
[3] Feuer Alan, Savev Stefan, Javed A. Aslam,(2007) "Evaluation Of phrasal query suggestions", in:Proceedings of
the Sixteenth ACM Conference on Information and Knowledge Management (CIKM ‘07), Lisboa, Portugal.
[4] K. Sadakane, H. Imai, (2001) "Fast algorithms for k-word proximity search", IEICE Trans. Fundamentals E84-A
(9) 312–319.
[5] Chirag Gupta, Gultekin Özsoyoglu, Z. Meral Özsoyoglu.(2009)"Efficient k-word proximity search." In The
24th International Symposium on Computer and Information Sciences, ISCIS 2009, 14-16 September 2009,
North Cyprus. pages 123-128, IEEE.
[6] Zobel J., Moffat A. (2006) "Inverted files for text search engines", ACM Computing Surveys (CSUR), v38-2.
[7] B.J. Jansen, A. Spink, T. Saracevic, (2000) " Real life, real users, and real needs: a study and analysis of user
queries on the web", Inf. Process. Manage. 36 (2) 207–227.
[8] S. Kim, I. Lee, K. Park, (2004) "A fast algorithm for the generalized keyword proximity problem given keyword
offsets", Inf. Process. Lett. 91(3) 115–120.
[9] S. Lawrence, C.L. Giles, (1999) "Searching the web: general and scientific information access", IEEE
Communications 37 (1) 116–122.
[10] Atheer A. Matroud, M. D. Hendy and C. P. Tuffley, (2011) "NTRFinder: a software tool to find nested tandem
repeats".
[11] R. Uricaru, A. Mancheron, E. Rivals, (2011) "Novel definition and algorithm for chaining fragments with
proportional overlaps", J. of Computational Biology, Vol. 18(9), p. 1141-54.
[12] E. Adebiyi, E. Rivals, (1999)"Detection of Recombination in Variable Number Tandem Repeat Sequences",
South African Computer Journal (SACJ), 39, p. 1–7.
[13] E. Rivals, (2004)"A Survey on Algorithmic Aspects of Tandem Repeat Evolution", International Journal on
Foundations of Computer Science, Vol. 15, No. 2, p. 225-257.
[14] Atheer A. Matroud , Michael D. Hendy , Christopher P. Tuffley , (2011)"An algorithm to solve the motif
alignment Problem for approximate nested tandem repeats", RECOMB-CG'10 Proceedings of the international
conference on Comparative genomics.
[15] Hongxia Zhou, Liping Du, Hong Yan, (2009)"Detection of tandem repeats in DNA sequences based on
parametric spectral estimation." IEEE transactions on information technology in biomedicine a publication of the
IEEE Engineering in Medicine and Biology Society.
[16] G M Landau, J P Schmidt, D Sokol,(2001)" An algorithm for approximate tandem repeats." Journal of
computational biology a journal of computational molecular cell biology.
[17] Rasolofo Yves, Savoy Jacques, (2003)"Term proximity scoring for keyword-based retrieval systems", 25th
European Conference on IR research, ECIR.
[18] Hao Yan, Shuming Shi, Fan Zhang, Torsten Suel, Ji-rong Wen, (2010)" Efficient Term Proximity Search with
Term-Pair Indexes", the 19th ACM conference on Conference on information and knowledge management
CIKM 10.
[19] Ralf Schenkel, Andreas Broschart, Seungwon Hwang, Martin Theobald, Gerhard Weikum, (2007) "Efficient
Text Proximity Search, String Processing and Information Retrieval".
AUTHORS
Elahe Moghimi Hanjani received her B.Sc. in computer engineering from Azad University Central Tehran
Branch, Tehran, Iran, in 2008, and is an M.Sc. student in computer engineering at Payam Noor University
(PNU). Her research interests include optimizing text retrieval algorithms and data mining.
Mahdi Javanmard received his M.Sc. degree in Electrical Engineering from the University of New
Brunswick, Canada and Ph.D. degree in Electrical and Computer Engineering from Queen’s University at
Kingston, Canada, in 2002 and 2007 respectively. He is a faculty member of Payam Noor University
(PNU) and currently Head of COMSTECH Inter Islamic Network on Virtual Universities (CINVU). He has
been teaching for many years at different universities where he has been involved in their course
development for the Computer Science Department. Additionally, he works as a System Development
Consultant for various companies. Dr Javanmard's research interests are in the areas of Information &
Communication Security, Speech Recognition, Signal Processing, Urban Management & ICT, and
Ultrasound Medical Imaging.