With more and more text data stored in databases, the problem of handling natural language query predicates becomes highly important. Closely related to query optimization for these predicates is the (sub)string estimation problem, i.e., estimating the selectivity of query terms before query execution based on small summary statistics. The Count Suffix Tree (CST) is the data structure commonly used to address this problem. While selectivity estimates based on CSTs tend to be good, CSTs are computationally expensive to build and require a large amount of memory for storage. To fit a CST into the data dictionary of a database system, it has to be pruned severely. Pruning techniques proposed so far are based on term (suffix) frequency or on the tree depth of nodes. In this paper, we propose new filtering and pruning techniques that reduce the building cost and the size of CSTs over natural-language texts. The core idea is to exploit the features of the natural language data over which the CST is built. In particular, we aim to retain only those suffixes that are useful in a linguistic sense, and we use well-known IR techniques to identify them. The most important innovations are as follows: (a) We propose and use a new optimistic syllabification technique to filter out suffixes. (b) We introduce a new affix and prefix stripping procedure that is more aggressive than the conventional stemming techniques commonly used to reduce the size of indices. (c) We observe that misspellings and other language anomalies, such as foreign words, incur an over-proportional growth of the CST; we apply state-of-the-art trigram techniques as well as a new syllable-based non-word detection mechanism to filter out such substrings. Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80%, already during construction, and at the same time increase the accuracy of selectivity estimates computed from the final CST by up to 70%.
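To make the setting concrete, here is a minimal sketch of a count suffix tree with the classic frequency-based pruning that the paper's linguistic filters improve on. The dictionary-based trie layout, the truncation depth and all names are illustrative assumptions, not the paper's implementation.

```python
class CSTNode:
    def __init__(self):
        self.count = 0
        self.children = {}

def build_cst(text, max_suffix_len=8):
    """Insert every suffix (truncated to max_suffix_len), counting occurrences."""
    root = CSTNode()
    for i in range(len(text)):
        node = root
        for ch in text[i:i + max_suffix_len]:
            node = node.children.setdefault(ch, CSTNode())
            node.count += 1
    return root

def prune(node, min_count):
    """Classic frequency-based pruning: drop subtrees seen fewer than min_count times."""
    node.children = {ch: child for ch, child in node.children.items()
                     if child.count >= min_count}
    for child in node.children.values():
        prune(child, min_count)

def estimate_count(root, pattern):
    """Selectivity estimate: occurrence count of pattern, 0 if it was pruned away."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.count

root = build_cst("selectivity estimation on natural language text")
prune(root, min_count=2)
print(estimate_count(root, "ti"))  # 3: once in "selectivity", twice in "estimation"
```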
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Query Distributed RDF Graphs: The Effects of Partitioning (paper) – DBOnto
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and efficient query processing. Existing data partitioning schemes are commonly based on hashing or graph partitioning techniques. The latter techniques split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate efficient query answering, considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and we show that our query answering scheme can efficiently answer many queries.
An Evaluation and Overview of Indices Based on Arabic Documents – IJCSEA Journal
The paper aims at giving an overview of inverted files, signature files, suffix arrays and suffix trees based on an Arabic document collection. It also aims at giving the points of comparison between these techniques and the performance of each technique on each of these points. Any information retrieval system is usually evaluated through its efficiency and effectiveness. There are two aspects of efficiency: time and space. The time measure represents the time needed to retrieve a document relevant to a specified query, while space represents the memory needed to store the indices.

In this paper, four indices are built: inverted file, signature file, suffix array and suffix tree. To measure the performance of each one, a retrieval system is built to compare the results of using these indices.

A collection of 242 Arabic abstracts from the proceedings of the Saudi Arabian National Computer Conferences is used in these systems, and a collection of 60 Arabic queries is run on them. We found that the retrieval results for inverted files are better than the retrieval results for the other indices.
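For illustration, the simplest of the four indices, the inverted file, can be sketched in a few lines. The whitespace tokenizer and the conjunctive query below are simplifying assumptions and do not reflect the Arabic-specific processing the paper would need.

```python
from collections import defaultdict

# Minimal inverted-file sketch: map each term to the IDs of documents containing it.
def build_inverted_file(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query(index, terms):
    # Conjunctive query: return documents containing every term.
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = ["inverted files answer queries fast",
        "suffix trees support substring queries",
        "signature files trade accuracy for space"]
idx = build_inverted_file(docs)
print(query(idx, ["queries"]))  # {0, 1}
```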
The document describes optimizations made to the Near-Synonym System (NeSS) to improve its performance and scalability. The key optimizations included building an index on the suffix array to reduce substring search time from O(L + log N) to O(L), parallelizing the system more efficiently, and keeping a single global suffix array to improve the accuracy of results. These optimizations led to an approximately 20x-40x speedup of NeSS.
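For context, the baseline the optimization starts from is the standard O(L + log N) suffix-array lookup: a binary search over lexicographically sorted suffixes. A minimal sketch follows; the extra index that removes the log N factor is not reproduced here, and bisect with key= requires Python 3.10+.

```python
import bisect

def build_suffix_array(text):
    # Naive O(N^2 log N) construction; fine for illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    # Binary search for the range of suffixes that start with pattern.
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return hi - lo

text = "banana"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "ana"))  # 2
```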
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
The document discusses a neural model called Duet for ranking documents based on their relevance to a query. Duet uses both a local model that operates on exact term matches between queries and documents, and a distributed model that learns embeddings to match queries and documents in the embedding space. The two models are combined using a linear combination and trained jointly on labeled query-document pairs. Experimental results show Duet performs significantly better at document ranking and other IR tasks compared to using the local and distributed models individually. The amount of training data is also important, with larger datasets needed to learn better representations.
Adversarial and reinforcement learning-based approaches to information retrieval – Bhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share a couple of our recent works in this area that we presented at SIGIR 2018.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
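A sketch of the scoring rule as described, with random matrices standing in for trained word2vec projections: query words are looked up in the IN (input) space, document words in the OUT (output) space, and relevance is the mean cosine similarity over all query-document word pairs. The vocabulary and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("cambridge university river city".split())}
dim = 50
W_in = rng.normal(size=(len(vocab), dim))   # input projections (stand-in for trained model)
W_out = rng.normal(size=(len(vocab), dim))  # output projections (stand-in for trained model)

def desm_score(query_words, doc_words):
    # Query words in the IN space, document words in the OUT space.
    q = W_in[[vocab[w] for w in query_words]]
    d = W_out[[vocab[w] for w in doc_words]]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return float((q @ d.T).mean())  # mean cosine over all query-document word pairs

print(desm_score(["cambridge"], ["university", "river", "city"]))
```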
Answer extraction and passage retrieval for... – Waheeb Ahmed
Question Answering Systems (QASs) perform the task of retrieving text portions from a collection of documents that contain the answer to the user's questions. These QASs use a variety of linguistic tools that are only able to deal with small fragments of text. Therefore, to retrieve the documents that contain the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the document collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purpose of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then, we compute the similarity score between the user's question terms and each passage. The top five passages (those with the highest similarity scores) are retrieved. Finally, Answer Extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2% and F1-measure of 87%.
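A sketch of the segment-score-select pipeline just described; the term-overlap similarity below is an illustrative stand-in, not the paper's exact scoring function.

```python
def segment(document, passage_len=3):
    # Split a document into passages of passage_len sentences each.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [" ".join(sentences[i:i + passage_len])
            for i in range(0, len(sentences), passage_len)]

def similarity(question, passage):
    # Fraction of question terms that occur in the passage (illustrative measure).
    q_terms = set(question.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms)

def top_passages(question, ranked_documents, k=5):
    # Segment the top-ranked documents, score every passage, keep the best k.
    passages = [p for doc in ranked_documents[:5] for p in segment(doc)]
    return sorted(passages, key=lambda p: similarity(question, p), reverse=True)[:k]

docs = ["Cairo is the capital of Egypt. It lies on the Nile. It is a large city.",
        "The Nile is the longest river in Africa. It flows north. It ends in a delta."]
print(top_passages("what is the capital of egypt", docs, k=2))
```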
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track – Bhaskar Mitra
We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) Explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence which supports that all three aforementioned strategies can lead to improved retrieval quality.
Scalable Discovery Of Hidden Emails From Large Folders – feiwin
The document describes a framework for reconstructing hidden emails from email folders by identifying quoted fragments and using a precedence graph to represent relationships between emails. It introduces optimizations like email filtering using word indexing and LCS anchoring using indexing to handle large folders and long emails efficiently. An evaluation on the Enron dataset showed the framework could reconstruct hidden emails for many users, and optimizations improved effectiveness.
Automated building of taxonomies for search engines – Boris Galitsky
We build a taxonomy of entities which is intended to improve the relevance of a search engine in a vertical domain. The taxonomy construction process starts from the seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (their generalization) is applied to the search results for existing entities to form commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration.

Taxonomy and paragraph-level syntactic generalization are applied to relevance improvement in search and text similarity assessment. We conduct an evaluation of the search relevance improvement in vertical and horizontal domains and observe a significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter domain. We also perform an industrial evaluation of taxonomy and syntactic generalization-based text relevance assessment and conclude that the proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The proposed algorithm is implemented as part of the Apache OpenNLP.Similarity project.
The session focused on Data Mining using R Language where I analyzed a large volume of text files to find out some meaningful insights using concepts like DocumentTermMatrix and WordCloud.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S... – TELKOMNIKA JOURNAL
Most scientific publishers encourage authors to provide keyphrases for their published articles. Hence, the need to automate keyphrase extraction has increased. However, it is not a trivial task, considering that keyphrase characteristics may overlap with those of non-keyphrases. To date, the accuracy of automatic keyphrase extraction approaches is still considerably low. In response to this gap, this paper proposes two contributions. First, a feature called fact-based sentiment is proposed. It is expected to strengthen keyphrase characteristics since, according to manual observation, most keyphrases are mentioned in neutral-to-positive sentiment. Second, a combination of supervised and unsupervised approaches is proposed to take the benefits of both. It enables automatic hidden pattern detection while keeping candidate importance comparable across candidates. According to the evaluation, fact-based sentiment is quite effective for representing keyphraseness, and the semi-supervised approach is considerably effective at extracting keyphrases from scientific articles.
This project report presents an analysis of compression algorithms like Huffman encoding, run length encoding, LZW, and a variant of LZW. Testing was conducted on these algorithms to analyze compression rates, compression and decompression times, and memory usage. Huffman encoding achieved around 40% compression while run length encoding resulted in negligible compression. LZW performed better than its variant in compression but the variant used significantly less memory. The variant balanced compression performance and memory usage better than standard LZW for large files.
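As a concrete illustration of the simplest algorithm compared, run-length encoding fits in a few lines; on ordinary text, runs are almost always of length 1, which is exactly why the report saw negligible compression from it.

```python
from itertools import groupby

# Run-length encoding: store (symbol, run length) pairs.
def rle_encode(data):
    return [(ch, len(list(run))) for ch, run in groupby(data)]

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

print(rle_encode("aaabbc"))                           # [('a', 3), ('b', 2), ('c', 1)]
print(rle_decode(rle_encode("aaabbc")) == "aaabbc")   # True
```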
International Journal of Engineering and Science Invention (IJESI) – inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field of Engineering, Science and Technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
This document provides an overview of text mining techniques and processes for analyzing Twitter data with R. It discusses concepts like term-document matrices, text cleaning, frequent term analysis, word clouds, clustering, topic modeling, sentiment analysis and social network analysis. It then provides a step-by-step example of applying these techniques to Twitter data from an R Twitter account, including retrieving tweets, text preprocessing, building term-document matrices, and various analyses.
Topic detection by clustering and text mining – IRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
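A sketch of that pipeline using scikit-learn, with stemming omitted for brevity; the corpus and cluster count are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Frequency-based keyword vectors (stop words removed), clustered with k-means;
# each cluster centroid stands for a topic. Stemming is omitted here for brevity.
docs = ["stock market trading shares prices",
        "market prices fall as trading slows",
        "team wins match in final minutes",
        "player scores twice as team wins"]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # documents grouped into two topics, e.g. [0 0 1 1]
```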
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION – IJDKP
In our study we use an approach that combines Natural Language Processing (NLP) with term occurrences to improve the quality of important-sentence selection by thickening sentence scores, while reducing the number of long sentences that would be included in the final summarization. There are sixteen known methods for automatic text summarization. In our paper we utilize a term frequency approach and build an algorithm to re-filter sentence scores.
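A minimal sketch of term-frequency sentence scoring, with a simple length penalty standing in for the re-filtering step; the weights are illustrative assumptions, not the paper's formula.

```python
from collections import Counter

def summarize(text, k=2, max_len=12):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Term frequencies over the whole text drive each sentence's score.
    tf = Counter(w for s in sentences for w in s.lower().split())
    def score(s):
        words = s.lower().split()
        base = sum(tf[w] for w in words) / len(words)
        return base if len(words) <= max_len else base * 0.5  # demote long sentences
    return sorted(sentences, key=score, reverse=True)[:k]

text = ("Term frequency drives the score. Long sentences are demoted by the filter. "
        "The score uses term frequency. Frequency matters.")
print(summarize(text, k=2))
```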
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE... – Kripa (कृपा) Rajshekhar
Recent progress in incorporating word order and semantics into the decades-old, tried-and-tested bag-of-words representation of text meaning has yielded promising results in computational text classification and analysis. This development, and the availability of a large number of legal rulings from the PTAB (Patent Trial and Appeal Board), motivated us to revisit possibilities for practical, computational models of legal relevance - starting with this narrow and approachable niche of jurisprudence. We present results from our analysis and experiments towards this goal using a corpus of approximately 8000 rulings from the PTAB. This work makes three important contributions towards the development of models for legal relevance semantics: (a) Using state-of-the-art Natural Language Processing (NLP) methods, we characterize the diversity and types of semantic relationships that are implicit in practical judgements of legal relevance at the PTAB; (b) We achieve new state-of-the-art results on practical information retrieval using our customized semantic representations on this corpus; (c) We outline promising avenues for future work in the area - including preliminary evidence from human-in-the-loop interaction, and new forms of text representation developed using input from over a hundred interviews with practitioners in the field. Finally, we argue that PTAB relevance is a practical and realistic baseline for performance measurement - with the desirable property of evaluating NLP improvements against "real world" legal judgement.
Spatial databases have become increasingly popular in recent years, and there is growing commercial and research interest in location-based search over spatial databases. Spatial keyword search has been well studied for years due to its importance to commercial search engines. Specifically, a spatial keyword query takes a user location and user-supplied keywords as arguments and returns objects that are spatially and textually relevant to these arguments. Geo-textual indices play an important role in spatial keyword querying. A number of geo-textual indices have been proposed in recent years, which mainly combine the R-tree and its variants with the inverted file. This paper proposes a new index structure that combines the k-d tree and the inverted file for spatial range keyword queries, based on the spatial and textual relevance to the query point within a given range.
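A sketch of such a combination, assuming SciPy's k-d tree for the spatial range part and a plain inverted file for the keyword part; the data and radius are illustrative.

```python
from collections import defaultdict
from scipy.spatial import cKDTree

# The k-d tree answers the spatial range part, the inverted file answers the
# keyword part, and the two candidate sets are intersected.
objects = [((1.0, 2.0), "coffee shop"), ((1.2, 2.1), "book shop"),
           ((8.0, 9.0), "coffee roaster")]
tree = cKDTree([loc for loc, _ in objects])
inv = defaultdict(set)
for i, (_, text) in enumerate(objects):
    for term in text.split():
        inv[term].add(i)

def range_keyword_query(location, radius, keyword):
    spatial = set(tree.query_ball_point(location, r=radius))
    return spatial & inv.get(keyword, set())

print(range_keyword_query((1.1, 2.0), radius=0.5, keyword="shop"))  # {0, 1}
```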
Finding Similar Files in Large Document Repositories – feiwin
The document presents a method for finding similar files in large document repositories by breaking files into chunks, computing hashes for each chunk, and comparing chunk hashes across files to determine similarity. It involves chunking each file, hashing the chunks, constructing a bipartite graph between files and chunks, and building a file similarity graph based on shared chunks above a threshold. The method was implemented and tested on a repository of technical support documents, taking 25-39 minutes to process hundreds of documents totaling hundreds of megabytes in size.
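The chunk-hash-compare step reduces to a few lines; fixed-size chunking is assumed below for brevity, and two files would be linked in the similarity graph when this score passes a threshold.

```python
import hashlib

def chunk_hashes(data, chunk_size=64):
    # Hash each fixed-size chunk; real systems may use content-defined chunking.
    return {hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)}

def similarity(file_a, file_b):
    # Jaccard similarity over the two files' chunk-hash sets.
    a, b = chunk_hashes(file_a), chunk_hashes(file_b)
    return len(a & b) / max(len(a | b), 1)

doc1 = b"identical preamble " * 50 + b"unique ending one"
doc2 = b"identical preamble " * 50 + b"unique ending two"
print(similarity(doc1, doc2))  # high score: most chunks are shared
```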
"The ICAIL conference is the primary international conference addressing research in Artificial Intelligence and Law, and has been organized biennially since 1987"
#ICAIL2017, #ADAI
The purpose of this work is to outline our approach to the development and testing of several computational models for legal relevance in the narrow domain of patent law, specifically as documented through select proceedings of the USPTO PTAB cases.
Contribution #1: “real world” legal judgement
Patent Trial and Appeal Board (PTAB) publicly available dataset, as of Jan 2017 has about 100 zip files containing 10 GB of data (compressed).
PTAB data represent practitioner needs, better than the more commonly used citation graphs
Contribution #2: Doc semantics != Legal Relevance
Disproved the prevailing notion that document semantics implies legal relevance, or is at least correlated with it - e.g. Khoury and Bekkerman in 2016: "if a given document is not in the semantic neighborhood of the query document, it simply cannot be relevant for the query document"
Contribution #3: ~4X improvement in retrieval
Without subsector pre-processing: Recall@100 was 5%, After text pre-processing: Recall@100 was 19%
Contribution #4: Human-in-loop impact is dramatic
Potential for over 50X improvement, where a retrieval task failed Recall@5000 but passed Recall@100 with user feedback
Understanding Natural Language with Corpora-based Generation of Dependency G... – Edmond Lepedus
This document discusses training a dependency parser using an unparsed corpus rather than a manually parsed training set. It develops an iterative training method that generates training examples using heuristics from past parsing decisions. The method is shown to produce parse trees qualitatively similar to conventionally trained parsers. Three avenues for future research using this corpus-based generation method are proposed.
Performance and Profit Evaluations of a Stochastic Model on Centrifuge System... – Waqas Tariq
The paper formulates a stochastic model for a single-unit centrifuge system on the basis of real data collected from the Thermal Power Plant, Panipat (Haryana). Various faults observed in the system are classified as minor, major and neglected faults, wherein the occurrence of a minor fault leads to degradation whereas the occurrence of a major fault leads to failure of the system. Neglected faults are those faults that are neglected/delayed for repair during operation of the system until the system goes to complete failure, such as vibration, abnormal sound, etc.; however, these faults may lead to failure of the system. There is assumed to be a single repair team that, on complete failure of the system, first inspects whether the fault is repairable or non-repairable and accordingly carries out repairs/replacements. Various measures of system performance are obtained using Markov processes and the regenerative point technique. Using these measures, the profit of the system is evaluated. Conclusions regarding the reliability and profit of the system are drawn on the basis of graphical studies.
The document describes several different ecosystems:
1) Tundra ecosystems are the coldest on Earth with little plant life due to freezing temperatures. Common animals include caribou, reindeer, arctic fox, polar bear, and white wolves.
2) Grasslands are dominated by grasses and support a variety of herbivores like antelopes, elephants, and zebras as well as predators such as lions and wolves.
3) Rainforests are warm and wet year-round, containing many species of trees, plants, monkeys, snakes, birds, and insects.
Manish Koli has over 15 years of experience in facility management, commercial management, and customer service roles. He is currently a Senior Executive for facility management at Unitech Limited in Gurgaon, where he handles operations, billing, and collections. Previously he worked as a Commercial Manager for Pioneer Medialine Services, and held roles in customer service and activation for Idea Cellular. He has expertise in Microsoft Office, data analysis, and relationship management.
The document discusses the importance of solidarity and how it is expressed through small actions such as helping friends and family. It also gives examples of showing solidarity in class or at home, such as respecting turns to speak, helping with housework and sharing with others, as well as helping people in need such as beggars.
In 2010 there were approximately 213.9 million international migrants in the world, which means that 3 out of every 100 people were living outside their country of birth. International migration has grown steadily in recent decades, although its growth rate has recently slowed. Roughly half of international migrants are men and the other half women, and their average age is 39, making them relatively young.
This document analyzes the use of functionalized magnetic nanoparticles (FMPs) to enhance oxygen transport in biopolymer production processes. FMPs consist of an iron oxide magnetic core and a coating that can provide additional properties like acting as oxygen carriers. Experiments show FMPs increase mass transfer by inducing microconvection, inhibiting bubble coalescence, and facilitating bubble breakup. For biopolymer fermentations, FMPs may decrease broth viscosity and increase oxygen uptake, allowing higher product concentrations and aeration/agitation rates. A chemical method is described to measure mass transfer rates using sodium sulfite oxidation reactions monitored with UV spectroscopy.
A Novel Method for Encoding Data Firmness in VLSI Circuits – Editor IJCATR
The number of tests, the corresponding test data volume, and test time increase with each new fabrication process technology. Higher circuit densities in system-on-chip (SOC) designs have led to a drastic increase in test data volume. Larger test data size demands not only higher memory requirements, but also an increase in testing power and time. Test data compression methods can be used to solve this problem by reducing the test data volume without affecting overall system performance. The original test data is compressed and stored in memory; thus, the memory size is significantly reduced. The proposed approach combines the selective encoding method and the dictionary-based encoding method to reduce test data volume and test application time. The experiment is done on combinational benchmark circuits designed using the Tanner tool, and the encoding algorithm is implemented using ModelSim.
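A toy sketch of the dictionary-based half of such a scheme: frequent test-data words are replaced by short dictionary indices, and everything else is emitted verbatim behind a flag bit. Word length, dictionary size and flag layout are illustrative assumptions.

```python
from collections import Counter

def build_dictionary(words, size=2):
    # Keep the most frequent test-data words as dictionary entries.
    return [w for w, _ in Counter(words).most_common(size)]

def encode(words, dictionary):
    bits = max((len(dictionary) - 1).bit_length(), 1)
    coded = []
    for w in words:
        if w in dictionary:
            coded.append("0" + format(dictionary.index(w), f"0{bits}b"))  # flag 0 + index
        else:
            coded.append("1" + w)                                         # flag 1 + raw word
    return coded

test_data = ["1010", "1010", "0000", "1010", "1111", "0000"]
dictionary = build_dictionary(test_data)
print(encode(test_data, dictionary))
# ['00', '00', '01', '00', '11111', '01'] -- 15 bits vs 24 uncompressed
```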
UNDERSTAND SHORT TEXTS BY HARVESTING & ANALYZING SEMANTIC KNOWLEDGE – Prasadu Peddi
The document proposes a system for understanding short texts by exploiting semantic knowledge from external knowledge bases and web corpora. It discusses challenges in analyzing short texts using traditional NLP methods due to their ambiguous and non-standard language. The system addresses this by harvesting semantic knowledge and applying knowledge-intensive approaches to tasks like segmentation, POS tagging and concept labeling to better understand semantics. An evaluation shows these knowledge-based methods are effective and efficient for short text understanding.
This document summarizes an article from the International Journal of Electronics and Communication Engineering & Technology. The article describes modeling and simulating test data compression using Verilog. It discusses using efficient bitmask selection and dictionary selection techniques to significantly reduce testing time and memory requirements for system-on-chip designs. The techniques aim to generate maximum matching patterns for test data compression by developing a bitmask selection method and efficient dictionary selection method that considers bitmasks.
Column store databases approaches and optimization techniques – IJDKP
A column-store database stores data column by column. The need for column-store databases arose from the demand for efficient query processing in read-intensive relational databases, for which extensive research has been performed on efficient data storage and query processing. This paper gives an overview of storage and performance optimization techniques used in column stores.
Partitioning of Query Processing in Distributed Database System to Improve Th... – IRJET Journal
This document discusses improving query processing throughput in distributed database systems through partitioning algorithms. It proposes using a graph partitioning algorithm called Congestion Avoidance (CA) to partition query tasks in a way that avoids system congestion and improves throughput. The CA algorithm iteratively identifies congestion points that reduce throughput and moves tasks between partitions to potentially increase throughput. It is evaluated as being faster than other partitioning algorithms while achieving comparable throughput improvements. A parallel execution algorithm is also used to concurrently execute partitioned query tasks across distributed nodes to minimize latency and further improve throughput.
Multikeyword Hunt on Progressive Graphs – IRJET Journal
1. The document presents a visualization technique for organizing and exploring large document collections by topic. It uses techniques like removing stop words, stemming, identifying frequent words, generating synonyms, and force-directed layout to generate topics from document collections and visualize them in a graph.
2. Key steps include preprocessing the documents by removing stop words and stemming, identifying frequent words, generating synonyms to group related words into topics, using force-directed layout to position topics based on their relationships, and enabling interactive visualization of the topic graph and retrieval of related documents.
3. The proposed technique aims to help users efficiently browse, organize and find relevant information from large document collections with minimal human intervention through an interactive visualized topic graph.
This document proposes an adaptive algorithm called DyBBS that dynamically adjusts the batch size and execution parallelism in Spark Streaming to minimize end-to-end latency. The algorithm is based on two observations: 1) processing time increases monotonically with batch size, and 2) there is an optimal execution parallelism for a given batch size. DyBBS uses isotonic regression to learn and adapt batch size and parallelism as workload and conditions change. Experimental results show it significantly reduces latency compared to static configurations and other state-of-the-art approaches.
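A sketch of the machinery behind observation (1), under stated assumptions: fit an isotonic model to historical (batch size, processing time) samples, then pick the smallest batch size whose predicted processing time fits within the batch interval (the stability condition). The samples below are synthetic.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Processing time grows monotonically with batch size, so an isotonic fit is a
# natural model; the data here stands in for measurements from a running job.
batch_sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8])                  # seconds of data per batch
proc_times = np.array([1.3, 1.8, 2.3, 2.8, 3.3, 3.8, 4.3, 4.8])   # observed latencies (s)
model = IsotonicRegression(increasing=True).fit(batch_sizes, proc_times)

# Stability: a batch must finish processing before the next batch is full.
candidates = np.linspace(1, 8, 71)
stable = candidates[model.predict(candidates) <= candidates]
print(f"smallest stable batch size: {stable.min():.1f}s")
```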
The document proposes using text distortion and algorithmic clustering based on string compression to analyze the effects of progressively destroying text structure on the information contained in texts. Several experiments are carried out on text and artificially generated datasets. The results show that clustering results worsen as structure is destroyed in strongly structural datasets, and that using a compressor that enables context size choice helps determine a dataset's nature. These results are consistent with those from a method based on multidimensional projections.
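Compression-based clustering typically rests on the normalized compression distance (NCD); a minimal version with zlib looks like this, where related texts compress better together than apart.

```python
import zlib

def C(s: bytes) -> int:
    # Compressed length under zlib at maximum effort.
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: near 0 for similar inputs, near 1 for unrelated.
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy cat " * 20
c = bytes(range(256)) * 4
print(ncd(a, b))  # small: shared structure
print(ncd(a, c))  # larger: unrelated content
```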
A Pilot Study On Computer-Aided Coreference Annotation – Darian Pruitt
This document describes a pilot study on using automatic coreference resolution to aid human annotation of coreference. It finds that pre-annotating data with the predictions of an existing coreference system can both reduce the time needed for human annotation and decrease error rates, by reducing the task to checking and modifying existing annotations rather than creating everything from scratch. The study uses the output of an automatic system for resolving nominal anaphora to pre-annotate German newspaper text, which is then manually edited using annotation software.
To Get any Project for CSE, IT ECE, EEE Contact Me @ 09849539085, 09966235788 or mail us - ieeefinalsemprojects@gmail.com-Visit Our Website: www.finalyearprojects.org
Fota Delta Size Reduction Using FIle Similarity AlgorithmsShivansh Gaur
This paper proposes algorithms to reduce the size of firmware updates transmitted over-the-air (FOTA). It uses a three stage approach of chunking files into variable sized pieces, hashing the chunks, and comparing hashes to find similar chunks between versions. This allows creating "delta" updates that only transmit changed parts, rather than full files. The algorithms were able to reduce FOTA delta sizes by up to 30% compared to existing tools on test firmware pairs, saving bandwidth. Experimental results on three firmware pairs demonstrate size reductions and performance compared to Google's existing FOTA update method.
Survey on Software Data Reduction Techniques Accomplishing Bug TriageIRJET Journal
This document discusses various techniques for software data reduction to improve the accuracy of bug triage. It first provides background on bug triage and the challenges it aims to address like large volumes of low quality bug data. It then surveys literature on related techniques like automated test generation and text mining approaches. The document describes various text mining methods like term-based, phrase-based, concept-based and pattern taxonomy methods. It also covers data reduction techniques and their benefits for bug triage. Different classification techniques for bug identification are explained, including decision trees, nearest neighbor classifier and artificial neural networks.
ADAPTIVE AUTOMATA FOR GRAMMAR BASED TEXT COMPRESSIONcsandit
The Internet and the ubiquitous presence of computing devices anywhere is generating a
continuously growing amount of information. However, the information entropy is not uniform.
It allows the use of data compression algorithms to reduce the demand for more powerful
processors and larger data storage equipment. This paper presents an adaptive rule-driven
device - the adaptive automata - as the device to identify repetitive patterns to be compressed in
a grammar based lossless data compression scheme.
Implementation of query optimization for reducing run timeAlexander Decker
This document discusses query optimization techniques to improve performance. It proposes performing query optimization at compile-time using histograms of data statistics rather than at run-time. Histograms are used to estimate selectivity of query joins and predicates at compile-time, allowing a query plan to be constructed in advance and executed without run-time optimization. The technique uses a split and merge algorithm to incrementally maintain histograms as data changes. Selectivity estimation with histograms allows join and predicate ordering to be determined at compile-time for query plan generation. Experimental results showed this compile-time optimization approach improved runtime performance over traditional run-time optimization.
Automated Essay Scoring Using Efficient Transformer-Based Language ModelsNat Rice
This summary provides an overview of an academic paper that evaluates the performance of efficient transformer-based language models on an automated essay scoring dataset:
The paper explores using smaller, more efficient transformer models rather than larger ones like BERT for automated essay scoring (AES). It evaluates several efficient models - Albert, Reformer, Electra, and MobileBERT - on the ASAP AES dataset. By ensembling multiple efficient models, the paper achieves state-of-the-art results on the dataset using far fewer parameters than typical transformer models, challenging the idea that bigger models are always better for AES. The efficient models show potential for extending the maximum text length they can analyze and reducing computational requirements for AES.
This document is a preprint that summarizes a paper on developing a spell checker model using automata. It discusses using fuzzy automata rather than finite automata for improved string comparison. The proposed spell checker would incorporate autosuggestion features into a Windows application. It would use techniques like edit distance and tries to store dictionaries and suggest corrections. The paper outlines the design of the spell checker, discussing functions like comparing words to a dictionary and considering morphology. Advantages like improved accuracy and speed are discussed along with potential disadvantages like inability to detect all errors.
Dominant block guided optimal cache size estimation to maximize ipc of embedd...ijesajournal
Embedded system software is highly constrained from performance, memory footprint, energy consumption and implementing cost view point. It is always desirable to obtain better Instructions per Cycle (IPC). Instruction cache has major contribution in improving IPC. Cache memories are realized on the same chip where the processor is running. This considerably increases the system cost as well. Hence, it is required to maintain a trade-off between cache sizes and performance improvement offered. Determining the number of cache lines and size of cache line are important parameters for cache designing. The design space for cache is quite large. It is time taking to execute the given application with different cache sizes on an instruction set simulator (ISS) to figure out the optimal cache size. In this paper, a technique is proposed to identify a number of cache lines and cache line size for the L1 instruction cache that will offer best or nearly best IPC. Cache size is derived, at a higher abstraction level, from basic block analysis in the Low Level Virtual Machine (LLVM) environment. The cache size estimated from the LLVM environment is cross validated by simulating the set of benchmark applications with different cache sizes in SimpleScalar’s out-of-order simulator. The proposed method seems to be superior in terms of estimation accuracy and/or estimation time as compared to the existing methods for estimation of optimal cache size parameters (cache line size, number of cache lines).
Dominant block guided optimal cache size estimation to maximize ipc of embedd...ijesajournal
Embedded system software is highly constrained from performance, memory footprint, energy consumption
and implementing cost view point. It is always desirable to obtain better Instructions per Cycle (IPC).
Instruction cache has major contribu
tion in improving IPC. Cache memories are realized on the same chip
where the processor is running. This considerably increases the system cost as well. Hence, it is required to
maintain a trade
-
off between cache sizes and performance improvement offered.
Determining the number
of cache lines and size of cache line are important parameters for cache designing. The design space for
cache is quite large. It is time taking to execute the given application with different cache sizes on an
instruction set simula
tor (ISS) to figure out the optimal cache size. In this paper, a technique is proposed to
identify a number of cache lines and cache line size for the L1 instruction cache that will offer best or
nearly best IPC. Cache size is derived, at a higher abstract
ion level, from basic block analysis in the Low
Level Virtual Machine (LLVM) environment. The cache size estimated from the LLVM environment is cross
validated by simulating the set of benchmark applications with different cache sizes in SimpleScalar’s out
-
of
-
order simulator. The proposed method seems to be superior in terms of estimation accuracy and/or
estimation time as compared to the existing methods for estimation of optimal cache size parameters (cache
line size, number of cache lines).
Indexing for Large DNA Database sequencesCSCJournals
Bioinformatics data consists of a huge amount of information due to the large number of sequences, the very high sequences lengths and the daily new additions. This data need to be efficiently accessed for many needs. What makes one DNA data item distinct from another is its DNA sequence. DNA sequence consists of a combination of four characters which are A, C, G, T and have different lengths. Use a suitable representation of DNA sequences, and a suitable index structure to hold this representation at main memory will lead to have efficient processing by accessing the DNA sequences through indexing, and will reduce number of disk I/O accesses. I/O operations needed at the end, to avoid false hits, we reduce the number of candidate DNA sequences that need to be checked by pruning, so no need to search the whole database. We need to have a suitable index for searching DNA sequences efficiently, with suitable index size and searching time. The suitable selection of relation fields, where index is build upon has a big effect on index size and search time. Our experiments use the n-gram wavelet transformation upon one field and multi-fields index structure under the relational DBMS environment. Results show the need to consider index size and search time while using indexing carefully. Increasing window size decreases the amount of I/O reference. The use of a single field and multiple fields indexing is highly affected by window size value. Increasing window size value lead to better searching time with special type index using single filed indexing. While the search time is almost good and the same with most index types when using multiple field indexing. Storage space needed for RDMS indexing types are almost the same or greater than the actual data.
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
Similar to Improved Count Suffix Trees for Natural Language Data (20)
The Use of Java Swing’s Components to Develop a WidgetWaqas Tariq
Widget is a kind of application provides a single service such as a map, news feed, simple clock, battery-life indicators, etc. This kind of interactive software object has been developed to facilitate user interface (UI) design. A user interface (UI) function may be implemented using different widgets with the same function. In this article, we present the widget as a platform that is generally used in various applications, such as in desktop, web browser, and mobile phone. We also describe a visual menu of Java Swing’s components that will be used to establish widget. It will assume that we have successfully compiled and run a program that uses Swing components.
3D Human Hand Posture Reconstruction Using a Single 2D ImageWaqas Tariq
Passive sensing of the 3D geometric posture of the human hand has been studied extensively over the past decade. However, these research efforts have been hampered by the computational complexity caused by inverse kinematics and 3D reconstruction. In this paper, our objective focuses on 3D hand posture estimation based on a single 2D image with aim of robotic applications. We introduce the human hand model with 27 degrees of freedom (DOFs) and analyze some of its constraints to reduce the DOFs without any significant degradation of performance. A novel algorithm to estimate the 3D hand posture from eight 2D projected feature points is proposed. Experimental results using real images confirm that our algorithm gives good estimates of the 3D hand pose. Keywords: 3D hand posture estimation; Model-based approach; Gesture recognition; human- computer interface; machine vision.
Camera as Mouse and Keyboard for Handicap Person with Troubleshooting Ability...Waqas Tariq
Camera mouse has been widely used for handicap person to interact with computer. The utmost important of the use of camera mouse is must be able to replace all roles of typical mouse and keyboard. It must be able to provide all mouse click events and keyboard functions (include all shortcut keys) when it is used by handicap person. Also, the use of camera mouse must allow users troubleshooting by themselves. Moreover, it must be able to eliminate neck fatigue effect when it is used during long period. In this paper, we propose camera mouse system with timer as left click event and blinking as right click event. Also, we modify original screen keyboard layout by add two additional buttons (button “drag/ drop” is used to do drag and drop of mouse events and another button is used to call task manager (for troubleshooting)) and change behavior of CTRL, ALT, SHIFT, and CAPS LOCK keys in order to provide shortcut keys of keyboard. Also, we develop recovery method which allows users go from camera and then come back again in order to eliminate neck fatigue effect. The experiments which involve several users have been done in our laboratory. The results show that the use of our camera mouse able to allow users do typing, left and right click events, drag and drop events, and troubleshooting without hand. By implement this system, handicap person can use computer more comfortable and reduce the dryness of eyes.
A Proposed Web Accessibility Framework for the Arab DisabledWaqas Tariq
The Web is providing unprecedented access to information and interaction for people with disabilities. This paper presents a Web accessibility framework which offers the ease of the Web accessing for the disabled Arab users and facilitates their lifelong learning as well. The proposed framework system provides the disabled Arab user with an easy means of access using their mother language so they don’t have to overcome the barrier of learning the target-spoken language. This framework is based on analyzing the web page meta-language, extracting its content and reformulating it in a suitable format for the disabled users. The basic objective of this framework is supporting the equal rights of the Arab disabled people for their access to the education and training with non disabled people. Key Words : Arabic Moon code, Arabic Sign Language, Deaf, Deaf-blind, E-learning Interactivity, Moon code, Web accessibility , Web framework , Web System, WWW.
Real Time Blinking Detection Based on Gabor FilterWaqas Tariq
The document proposes a new method for real-time blinking detection based on Gabor filters. It begins by reviewing existing methods and their limitations in dealing with noise, variations in eye shape, and blinking speed. The proposed method uses a Gabor filter to extract the top and bottom arcs of the eye from an image. It then measures the distance between these arcs and compares it to a threshold: a distance below the threshold indicates a closed eye, while a distance above indicates an open eye. The document claims this Gabor filter-based approach is robust to noise, variations in eye shape and blinking speed. It presents experimental results showing the method can accurately detect blinking across different users.
Computer Input with Human Eyes-Only Using Two Purkinje Images Which Works in ...Waqas Tariq
A method for computer input with human eyes-only using two Purkinje images which works in a real time basis without calibration is proposed. Experimental results shows that cornea curvature can be estimated by using two light sources derived Purkinje images so that no calibration for reducing person-to-person difference of cornea curvature. It is found that the proposed system allows usersf movements of 30 degrees in roll direction and 15 degrees in pitch direction utilizing detected face attitude which is derived from the face plane consisting three feature points on the face, two eyes and nose or mouth. Also it is found that the proposed system does work in a real time basis.
Toward a More Robust Usability concept with Perceived Enjoyment in the contex...Waqas Tariq
Mobile multimedia service is relatively new but has quickly dominated people¡¯s lives, especially among young people. To explain this popularity, this study applies and modifies the Technology Acceptance Model (TAM) to propose a research model and conduct an empirical study. The goal of study is to examine the role of Perceived Enjoyment (PE) and what determinants can contribute to PE in the context of using mobile multimedia service. The result indicates that PE is influencing on Perceived Usefulness (PU) and Perceived Ease of Use (PEOU) and directly Behavior Intention (BI). Aesthetics and flow are key determinants to explain Perceived Enjoyment (PE) in mobile multimedia usage.
Collaborative Learning of Organisational KnolwedgeWaqas Tariq
This paper presents recent research into methods used in Australian Indigenous Knowledge sharing and looks at how these can support the creation of suitable collaborative envi- ronments for timely organisational learning. The protocols and practices as used today and in the past by Indigenous communities are presented and discussed in relation to their relevance to a personalised system of knowledge sharing in modern organisational cultures. This research focuses on user models, knowledge acquisition and integration of data for constructivist learning in a networked repository of or- ganisational knowledge. The data collected in the repository is searched to provide collections of up-to-date and relevant material for training in a work environment. The aim is to improve knowledge collection and sharing in a team envi- ronment. This knowledge can then be collated into a story or workflow that represents the present knowledge in the organisation.
Our research aims to propose a global approach for specification, design and verification of context awareness Human Computer Interface (HCI). This is a Model Based Design approach (MBD). This methodology describes the ubiquitous environment by ontologies. OWL is the standard used for this purpose. The specification and modeling of Human-Computer Interaction are based on Petri nets (PN). This raises the question of representation of Petri nets with XML. We use for this purpose, the standard of modeling PNML. In this paper, we propose an extension of this standard for specification, generation and verification of HCI. This extension is a methodological approach for the construction of PNML with Petri nets. The design principle uses the concept of composition of elementary structures of Petri nets as PNML Modular. The objective is to obtain a valid interface through verification of properties of elementary Petri nets represented with PNML.
Development of Sign Signal Translation System Based on Altera’s FPGA DE2 BoardWaqas Tariq
The main aim of this paper is to build a system that is capable of detecting and recognizing the hand gesture in an image captured by using a camera. The system is built based on Altera’s FPGA DE2 board, which contains a Nios II soft core processor. Image processing techniques and a simple but effective algorithm are implemented to achieve this purpose. Image processing techniques are used to smooth the image in order to ease the subsequent processes in translating the hand sign signal. The algorithm is built for translating the numerical hand sign signal and the result are displayed on the seven segment display. Altera’s Quartus II, SOPC Builder and Nios II EDS software are used to construct the system. By using SOPC Builder, the related components on the DE2 board can be interconnected easily and orderly compared to traditional method that requires lengthy source code and time consuming. Quartus II is used to compile and download the design to the DE2 board. Then, under Nios II EDS, C programming language is used to code the hand sign translation algorithm. Being able to recognize the hand sign signal from images can helps human in controlling a robot and other applications which require only a simple set of instructions provided a CMOS sensor is included in the system.
An overview on Advanced Research Works on Brain-Computer InterfaceWaqas Tariq
A brain–computer interface (BCI) is a proficient result in the research field of human- computer synergy, where direct articulation between brain and an external device occurs resulting in augmenting, assisting and repairing human cognitive. Advanced works like generating brain-computer interface switch technologies for intermittent (or asynchronous) control in natural environments or developing brain-computer interface by Fuzzy logic Systems or by implementing wavelet theory to drive its efficacies are still going on and some useful results has also been found out. The requirements to develop this brain machine interface is also growing day by day i.e. like neuropsychological rehabilitation, emotion control, etc. An overview on the control theory and some advanced works on the field of brain machine interface are shown in this paper.
Exploring the Relationship Between Mobile Phone and Senior Citizens: A Malays...Waqas Tariq
There is growing ageing phenomena with the rise of ageing population throughout the world. According to the World Health Organization (2002), the growing ageing population indicates 694 million, or 223% is expected for people aged 60 and over, since 1970 and 2025.The growth is especially significant in some advanced countries such as North America, Japan, Italy, Germany, United Kingdom and so forth. This growing older adult population has significantly impact the social-culture, lifestyle, healthcare system, economy, infrastructure and government policy of a nation. However, there are limited research studies on the perception and usage of a mobile phone and its service for senior citizens in a developing nation like Malaysia. This paper explores the relationship between mobile phones and senior citizens in Malaysia from the perspective of a developing country. We conducted an exploratory study using contextual interviews with 5 senior citizens of how they perceive their mobile phones. This paper reveals 4 interesting themes from this preliminary study, in addition to the findings of the desirable mobile requirements for local senior citizens with respect of health, safety and communication purposes. The findings of this study bring interesting insight to local telecommunication industries as a whole, and will also serve as groundwork for more in-depth study in the future.
Principles of Good Screen Design in WebsitesWaqas Tariq
Visual techniques for proper arrangement of the elements on the user screen have helped the designers to make the screen look good and attractive. Several visual techniques emphasize the arrangement and ordering of the screen elements based on particular criteria for best appearance of the screen. This paper investigates few significant visual techniques in various web user interfaces and showcases the results for better understanding and their presence.
This document discusses the progress of virtual teams in Albania. It provides context on virtual teams and how they differ from traditional teams in their reliance on technology for communication across distances. The document then examines the use of virtual teams in Albania, noting the growing infrastructure and technology usage that enables virtual collaboration. It highlights some virtual team examples in Albanian government and academic projects.
Cognitive Approach Towards the Maintenance of Web-Sites Through Quality Evalu...Waqas Tariq
It is a well established fact that the Web-Applications require frequent maintenance because of cutting– edge business competitions. The authors have worked on quality evaluation of web-site of Indian ecommerce domain. As a result of that work they have made a quality-wise ranking of these sites. According to their work and also the survey done by various other groups Futurebazaar web-site is considered to be one of the best Indian e-shopping sites. In this research paper the authors are assessing the maintenance of the same site by incorporating the problems incurred during this evaluation. This exercise gives a real world maintainability problem of web-sites. This work will give a clear picture of all the quality metrics which are directly or indirectly related with the maintainability of the web-site.
USEFul: A Framework to Mainstream Web Site Usability through Automated Evalua...Waqas Tariq
A paradox has been observed whereby web site usability is proven to be an essential element in a web site, yet at the same time there exist an abundance of web pages with poor usability. This discrepancy is the result of limitations that are currently preventing web developers in the commercial sector from producing usable web sites. In this paper we propose a framework whose objective is to alleviate this problem by automating certain aspects of the usability evaluation process. Mainstreaming comes as a result of automation, therefore enabling a non-expert in the field of usability to conduct the evaluation. This results in reducing the costs associated with such evaluation. Additionally, the framework allows the flexibility of adding, modifying or deleting guidelines without altering the code that references them since the guidelines and the code are two separate components. A comparison of the evaluation results carried out using the framework against published evaluations of web sites carried out by web site usability professionals reveals that the framework is able to automatically identify the majority of usability violations. Due to the consistency with which it evaluates, it identified additional guideline-related violations that were not identified by the human evaluators.
Robot Arm Utilized Having Meal Support System Based on Computer Input by Huma...Waqas Tariq
A robot arm utilized having meal support system based on computer input by human eyes only is proposed. The proposed system is developed for handicap/disabled persons as well as elderly persons and tested with able persons with several shapes and size of eyes under a variety of illumination conditions. The test results with normal persons show the proposed system does work well for selection of the desired foods and for retrieve the foods as appropriate as usersf requirements. It is found that the proposed system is 21% much faster than the manually controlled robotics.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. Performance of an ASR system mainly depends on the availability of large corpus of speech. The conventional method of building a large vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires large speech corpus with sentence or phoneme level transcription of the speech utterances. The transcriptions must also include different speech order so that the recognizer can build models for all the sounds present. But, for Telugu language, because of its complex nature, a very large, well annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands and millions of word forms. A significant part of grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases including several words (that is, tokens) in English would be mapped on to a single word in Telugu.Telugu language is phonetic in nature in addition to rich in morphology. That is why the speech technology developed for English cannot be applied to Telugu language. This paper highlights the work carried out in an attempt to build a voice enabled text editor with capability of automatic term suggestion. Main claim of the paper is the recognition enhancement process developed by us for suitability of highly inflecting, rich morphological languages. This method results in increased speech recognition accuracy with very much reduction in corpus size. It also adapts Telugu words to the database dynamically, resulting in growth of the corpus.
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
Word ambiguity removal is a task of removing ambiguity from a word, i.e. correct sense of word is identified from ambiguous sentences. This paper describes a model that uses Part of Speech tagger and three categories for word sense disambiguation (WSD). Human Computer Interaction is very needful to improve interactions between users and computers. For this, the Supervised and Unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding best suitable domain of word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word sense disambiguation
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
From the existing research it has been observed that many techniques and methodologies are available for performing every step of Automatic Speech Recognition (ASR) system, but the performance (Minimization of Word Error Recognition-WER and Maximization of Word Accuracy Rate- WAR) of the methodology is not dependent on the only technique applied in that method. The research work indicates that, performance mainly depends on the category of the noise, the level of the noise and the variable size of the window, frame, frame overlap etc is considered in the existing methods. The main aim of the work presented in this paper is to use variable size of parameters like window size, frame size and frame overlap percentage to observe the performance of algorithms for various categories of noise with different levels and also train the system for all size of parameters and category of real world noisy environment to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and Accuracy test by applying variable size of parameters. It is observed that, it is really very hard to evaluate test results and decide parameter size for ASR performance improvement for its resultant optimization. Hence, this study further suggests the feasible and optimum parameter size using Fuzzy Inference System (FIS) for enhancing resultant accuracy in adverse real world noisy environmental conditions. This work will be helpful to give discriminative training of ubiquitous ASR system for better Human Computer Interaction (HCI). Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, ubiquitous ASR system, Human Computer Interaction (HCI)
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
Temple of Asclepius in Thrace. Excavation resultsKrassimira Luka
The temple and the sanctuary around were dedicated to Asklepios Zmidrenus. This name has been known since 1875 when an inscription dedicated to him was discovered in Rome. The inscription is dated in 227 AD and was left by soldiers originating from the city of Philippopolis (modern Plovdiv).
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumMJDuyan
(𝐓𝐋𝐄 𝟏𝟎𝟎) (𝐋𝐞𝐬𝐬𝐨𝐧 𝟏)-𝐏𝐫𝐞𝐥𝐢𝐦𝐬
𝐃𝐢𝐬𝐜𝐮𝐬𝐬 𝐭𝐡𝐞 𝐄𝐏𝐏 𝐂𝐮𝐫𝐫𝐢𝐜𝐮𝐥𝐮𝐦 𝐢𝐧 𝐭𝐡𝐞 𝐏𝐡𝐢𝐥𝐢𝐩𝐩𝐢𝐧𝐞𝐬:
- Understand the goals and objectives of the Edukasyong Pantahanan at Pangkabuhayan (EPP) curriculum, recognizing its importance in fostering practical life skills and values among students. Students will also be able to identify the key components and subjects covered, such as agriculture, home economics, industrial arts, and information and communication technology.
𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐭𝐡𝐞 𝐍𝐚𝐭𝐮𝐫𝐞 𝐚𝐧𝐝 𝐒𝐜𝐨𝐩𝐞 𝐨𝐟 𝐚𝐧 𝐄𝐧𝐭𝐫𝐞𝐩𝐫𝐞𝐧𝐞𝐮𝐫:
-Define entrepreneurship, distinguishing it from general business activities by emphasizing its focus on innovation, risk-taking, and value creation. Students will describe the characteristics and traits of successful entrepreneurs, including their roles and responsibilities, and discuss the broader economic and social impacts of entrepreneurial activities on both local and global scales.
বাংলাদেশের অর্থনৈতিক সমীক্ষা ২০২৪ [Bangladesh Economic Review 2024 Bangla.pdf] কম্পিউটার , ট্যাব ও স্মার্ট ফোন ভার্সন সহ সম্পূর্ণ বাংলা ই-বুক বা pdf বই " সুচিপত্র ...বুকমার্ক মেনু 🔖 ও হাইপার লিংক মেনু 📝👆 যুক্ত ..
আমাদের সবার জন্য খুব খুব গুরুত্বপূর্ণ একটি বই ..বিসিএস, ব্যাংক, ইউনিভার্সিটি ভর্তি ও যে কোন প্রতিযোগিতা মূলক পরীক্ষার জন্য এর খুব ইম্পরট্যান্ট একটি বিষয় ...তাছাড়া বাংলাদেশের সাম্প্রতিক যে কোন ডাটা বা তথ্য এই বইতে পাবেন ...
তাই একজন নাগরিক হিসাবে এই তথ্য গুলো আপনার জানা প্রয়োজন ...।
বিসিএস ও ব্যাংক এর লিখিত পরীক্ষা ...+এছাড়া মাধ্যমিক ও উচ্চমাধ্যমিকের স্টুডেন্টদের জন্য অনেক কাজে আসবে ...
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
IGCSE Biology Chapter 14- Reproduction in Plants.pdf
Improved Count Suffix Trees for Natural Language Data
International Journal of Data Engineering (IJDE), Volume (3) : Issue (1) : 2012
Guido Sautter sautter@ipd.uka.de
Researcher
Department of Computer Science
Karlsruhe Institute of Technology
Am Fasanengarten 5, 76128 Karlsruhe, Germany
Klemens Böhm boehm@ipd.uka.de
Full Professor
Department of Computer Science
Karlsruhe Institute of Technology
Am Fasanengarten 5, 76128 Karlsruhe, Germany
Abstract
With more and more text data stored in databases, the problem of handling natural language
query predicates becomes highly important. Closely related to query optimization for these
predicates is the (sub)string estimation problem, i.e., estimating the selectivity of query terms
before query execution based on small summary statistics. The Count Suffix Tree (CST) is the
data structure commonly used to address this problem. While selectivity estimates based on
CST tend to be good, they are computationally expensive to build and require a large amount of
memory for storage. To fit CST into the data dictionary of database systems, they have to be
pruned severely. Pruning techniques proposed so far are based on term (suffix) frequency or on
the tree depth of nodes. In this paper, we propose new filtering and pruning techniques that
reduce the building cost and the size of CST over natural-language texts. The core idea is to
exploit the features of the natural language data over which the CST is built. In particular, we
aim at regarding only those suffixes that are useful in a linguistic sense. We use (well-known) IR
techniques to identify them. The most important innovations are as follows: (a) We propose and
use a new optimistic syllabification technique to filter out suffixes. (b) We introduce a new affix
and prefix stripping procedure that is more aggressive than conventional stemming techniques,
which are commonly used to reduce the size of indices. (c) We observe that misspellings and
other language anomalies like foreign words incur an over-proportional growth of the CST. We
apply state-of-the-art trigram techniques as well as a new syllable-based non-word detection
mechanism to filter out such substrings. – Our evaluation with large English text corpora shows
that our new mechanisms in combination decrease the size of a CST by up to 80%, already
during construction, and at the same time increase the accuracy of selectivity estimates
computed from the final CST by up to 70%.
Keywords: Selectivity Estimation, Count Suffix Tree, Pruning, Text Data
1. INTRODUCTION
With more and more natural language data stored in databases, query processing for this type of
data becomes highly important. To optimize queries over this kind of data, the (sub)string
estimation problem is vital. This is estimating the selectivity of natural language query
predicates, usually term-based before the actual query execution, based on small summary
statistics. The selectivity of a term (or of any substring) is the number of documents in the
underlying collection it appears in. To estimate the selectivity of predicates of the kind
considered here, Count Suffix Trees (CST) are commonly used [8]. According to [8], each CST
node stores the selectivity of the string along the path from the root to the node. The selectivity
of a string can be retrieved in time linear in its length. A CST built over text data can efficiently
solve the selectivity-estimation problem.
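For illustration, the following minimal Python sketch (our own, hypothetical code, not the implementation from [8]) shows the principle: every suffix of an inserted term increments a counter along its path, and a lookup walks the tree in time linear in the query length. Note that the counters below track insertions; document-level selectivity as defined above would increment each node at most once per document.

    class CSTNode:
        def __init__(self):
            self.children = {}  # single-character edge labels
            self.count = 0      # count of the string spelled from the root

    class CountSuffixTree:
        def __init__(self):
            self.root = CSTNode()

        def insert(self, term):
            # insert every suffix of the term, counting each node on its path
            for i in range(len(term)):
                node = self.root
                for ch in term[i:]:
                    node = node.children.setdefault(ch, CSTNode())
                    node.count += 1

        def selectivity(self, query):
            # walk the tree; time is linear in the length of the query
            node = self.root
            for ch in query:
                if ch not in node.children:
                    return 0
                node = node.children[ch]
            return node.count

    tree = CountSuffixTree()
    tree.insert("information")
    print(tree.selectivity("format"))  # 1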
However, CST are computationally expensive to build and have high memory requirements. The
space complexity of a CST is proportional to the size of the underlying alphabet and to the
number of strings stored in the CST¹. A CST built over a large amount of text data may well
exceed 1,000,000 nodes. Since the statistics data structures used by the query optimizers of
database systems have to fit in the data dictionary, i.e., in a very limited amount of memory,
CST used for query optimization need to be reduced in size [4]. This is because they always
have to be in physical main memory. If query optimization caused only a single page fault (i.e.,
the need to swap a memory page from on-disk virtual memory back into physical main memory),
this would annihilate the performance advantage database systems gain from query
optimization. One might think that the 1 KB limit from [8], published in 1996, is too strict
nowadays, and that modern database servers can have statistics that are larger by orders of
magnitude. However, not only the amount of memory available to database servers has rapidly
grown since 1996. The amount of data to be handled has grown at a similar rate, both regarding
the number of relations and of attributes per relation. Thus, the data dictionary has to
accommodate significantly more statistics. Consequently, it is not unrealistic to assume that the
statistics for an individual attribute still have to fit into 1 KB with today’s commercial database
servers [24].
To make the CST meet these restrictive memory requirements, a common solution is to apply
pruning rules. Discarding some nodes, for instance those with the lowest selectivities [7, 8],
saves space. But it also affects the accuracy of estimations. To deal with this problem, methods
for estimating the selectivity of strings that are not retained in the Pruned CST (PST) any more
have been proposed. Algorithms like KVI or MO [8, 7] alleviate estimation inaccuracies due to
pruning to some degree, but do not rule out the problem. Pruning becomes even more
problematic with non-static document collections, e.g., journal archives or the Blogosphere.
Estimation errors may arise due to incorrect node counts [1]. The only solution currently known
is to rebuild the CST over the updated collection. Even though algorithms that reduce space and
time complexity have been proposed [17, 18], CST construction remains computationally
expensive and time consuming.
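A frequency-based pruning rule of the kind discussed above can be sketched in a few lines (again hypothetical code, reusing the CSTNode class from the sketch above; the published strategies [7, 8] are more refined):

    def prune(node, threshold):
        # discard subtrees whose root counter falls below the threshold
        node.children = {ch: child for ch, child in node.children.items()
                         if child.count >= threshold}
        for child in node.children.values():
            prune(child, threshold)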
The goal of our work is finding other ways of reducing the size of the CST, i.e., filtering out
suffixes. We focus on natural-language texts. Our core idea is to find linguistic criteria that let us
decide, prior to insertion in the CST, which strings or suffixes are more likely to be queried. For
the CST, we keep only the latter, and we deal with the rest separately to reduce the size of the
CST during construction already, before the actual pruning. In particular, we apply syllabification,
stemming, and non-word detection. The combination of these mechanisms allows for building a
tree that requires significantly less memory than state-of-the-art CST. More specifically, the
contributions of this paper are as follows:
Design of a New Approximate Syllabification Algorithm for CST-specific Data
Preprocessing
We observe that letter-wise suffixes that do not start at a syllable border carry little semantic
meaning. To filter out these suffixes we propose a fast approximate syllabification routine, based
on the morphological structure of words. In order to avoid filtering too many suffixes, however,
we have to find every syllable boundary, even at the cost of some false positives. Therefore, our
routine is more aggressive than conventional ones. We will show that avoiding the insertion of
suffixes that do not start at syllable borders reduces the size of the CST considerably, not only for
storage, but also during construction.
¹ Note that this mostly holds for languages in which words tend to consist of many letters, like the
languages written in the Latin, Greek, Cyrillic, Arabic, Hebrew, or Thai writing system – these are the ones
we deal with here. For languages written in symbolic writing systems, like Chinese, Japanese, or Korean,
words consist of far fewer symbols, and other data structures are better suited for selectivity estimation
altogether, e.g. histograms.
Design of an Optimistic Suffix and Affix Stripping Algorithm
Stemming, that is conflating different inflections of a term to the same root form, further reduces
the number of suffixes. We observe that traditional stemming algorithms like Porter’s stemmer
[16] are rather conservative, i.e., omit some conflations to avoid errors. We propose a new, more
optimistic stemming procedure. It conflates more terms and thus reduces the number of suffixes
to store. This procedure also includes prefix stripping, which moves the stem up front. The
rationale is that the stem carries the most semantic meaning, and the nodes closest to the root of
a CST are the last ones to be pruned. Note that linguistic errors may occur with this approach.
But we show that their effect on estimation quality is likely to be insignificant.
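To convey the idea, here is a deliberately simplified Python sketch of such an optimistic procedure (the rule lists are hypothetical and much shorter than any practical set; the paper's actual routine differs):

    SUFFIXES = ("ations", "ation", "ings", "ing", "edly", "ed", "es", "ly", "s")
    PREFIXES = ("under", "over", "anti", "non", "un", "re", "de")

    def optimistic_stem(term):
        # strip one suffix, longest match first, keeping a minimal stem
        for suf in SUFFIXES:
            if term.endswith(suf) and len(term) - len(suf) >= 3:
                term = term[:-len(suf)]
                break
        # strip one prefix and append it behind the stem, so the stem
        # moves up front (one possible reading of the idea above)
        for pre in PREFIXES:
            if term.startswith(pre) and len(term) - len(pre) >= 3:
                return term[len(pre):] + "-" + pre
        return term

    print(optimistic_stem("underestimations"))  # "estim-under" (hypothetical output)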
Non-word Filtering
Experience shows that non-words and foreign words incur an over-proportional number of nodes
in the CST. They also increase the memory consumption in the building phase of the tree. We
therefore deploy a q-gram based algorithm for non-word detection to prevent inserting non-words
in the CST. In particular, we exclude terms containing very rare q-grams. To estimate the
selectivity of non-words, we use a variant of what [4] refers to as the q-gram estimator instead.
Because we apply this technique only to words with highly distinctive q-grams, we do not
encounter the errors reported in [4]. In contrast, our evaluation shows that the q-gram technique
yields very good selectivity estimates for non-words.
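A q-gram based filter of this kind can be sketched as follows (Python; the trigram padding, the reference dictionary, and the frequency threshold min_freq are our assumptions, not parameters from the paper):

    from collections import Counter

    def build_trigram_stats(dictionary_words):
        # trigram frequencies collected from a small reference dictionary
        stats = Counter()
        for word in dictionary_words:
            padded = "##" + word.lower() + "#"
            stats.update(padded[i:i+3] for i in range(len(padded) - 2))
        return stats

    def is_non_word(term, stats, min_freq=2):
        # flag the term if it contains a trigram that is (nearly) absent
        # from the reference dictionary
        padded = "##" + term.lower() + "#"
        return any(stats[padded[i:i+3]] < min_freq
                   for i in range(len(padded) - 2))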
New Node Labeling Strategy
Traditional CST construction mechanisms suggest labeling each node with a single character
and applying path compression when the tree is completely built. We propose a new node
labeling mechanism which is syllable-based. It results in a more compact CST. In particular, we
use syllables as the atomic node labels, instead of individual characters. While this increases the
fan-out of the tree, it reduces its depth significantly, resulting in more compact CST.
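The following fragment sketches syllable-labeled insertion (hypothetical code; syllable borders are assumed to be supplied by the syllabification routine described later):

    def insert_syllable_suffixes(tree, syllables):
        # tree is a nested dict mapping a syllable label to [count, subtree];
        # only suffixes starting at a syllable border are inserted
        for i in range(len(syllables)):
            node = tree
            for syl in syllables[i:]:
                entry = node.setdefault(syl, [0, {}])
                entry[0] += 1
                node = entry[1]

    tree = {}
    insert_syllable_suffixes(tree, ["in", "for", "ma", "tion"])
    # top-level labels: "in", "for", "ma", "tion" - tree depth 4 instead of 12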
Extensive Performance Experiments and Feasibility Demonstration
We run a series of evaluations over large English document corpora. It turns out that Syllable
CST require significantly less space. For instance, the size of the CST built over datasets from
the Aquaint Corpus [6] is reduced by up to 80%. We also show that the benefit of Syllable CST
over traditional ones grows with the number of non-words and misspellings in the corpus, by
introducing errors into it in a controlled way. Thanks to the reduced memory occupation,
frequency-based pruning can then use lower thresholds. This results in more accurate
estimations. We experimentally verify that, when pruned to meet the same size, Syllable CST
provide significantly better selectivity estimates than standard ones. On average, the relative
estimation error is reduced by up to 80%.
Although we use English language corpora for our experiments, the techniques presented in this
paper are not restricted to English: They are applicable to any character-based language,
provided that a small reference dictionary (for the non-word filtering) and a stemming and
syllabification routine (for building the Syllable CST) are available.
Paper Outline
Section 2 reviews related work. Section 3 describes the syllabification and non-word detection
techniques, Section 4 the design of the Syllable CST, and how to use it for selectivity estimation.
Section 5 features our evaluation. Section 6 concludes. [23] is a shorter, preliminary version of
this paper. This current article features a new node-labeling strategy, a more detailed description
and discussion of our algorithms, and more extensive experiments which assess our techniques
with documents containing spelling errors.
2. RELATED WORK
The Count Suffix Tree (CST) [8] is a data structure commonly used to estimate the selectivity of
string predicates. Given a collection of documents, all strings and their suffixes are stored in the
CST. Each node is labeled with a string and has a counter that stores the number of occurrences
in the collection. Since CST tend to grow quickly in size when built on large text datasets,
pruning strategies are essential to make the data structure meet memory constraints. Pruning
requires estimating the selectivity of those terms whose node has been discarded. Krishnan et al.
[8] have proposed three families of estimation methods. Among these, the KVI algorithm yields
the most accurate results according to experiments. The MO (Maximal Overlap) algorithm,
proposed by Jagadish et al. [7], outperforms KVI when the statistical short memory property
holds for sequences of symbols. According to MO, the searched pattern is parsed into overlapping
substrings (where overlaps exist), which are considered statistically dependent. Since both KVI and MO
tend to underestimate selectivities, Chaudhuri et al. [4] propose a new estimation model based
on q-gram Markov tables and a regression tree. Bae [1] describes which estimation inaccuracies
may arise in the presence of pruning and tries to overcome the problem by building a Count Q-gram
tree. While it is useful for DNA data (alphabet size 5), [1] also shows that it performs worse as the
alphabet size increases, making it hardly applicable to natural language data (alphabet size 26).
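In our notation, with the query string σ parsed into the longest pieces s_1, ..., s_k still present in the pruned tree, the two estimators can be summarized as follows (o_i denotes the maximal overlap of s_{i-1} and s_i; this is our paraphrase of [8, 7], not a formula taken from this paper):

    \widehat{\Pr}(\sigma) \;=\; \prod_{i=1}^{k} \Pr(s_i)
    \qquad \text{(KVI: pieces treated as independent)}

    \widehat{\Pr}(\sigma) \;=\; \Pr(s_1)\,\prod_{i=2}^{k} \frac{\Pr(s_i)}{\Pr(o_i)}
    \qquad \text{(MO: conditioning on the maximal overlap } o_i\text{)}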
Statistics on collections of text data [2] must be updated when new documents are added to the
collection. Once nodes have been pruned, however, there is no information left on the previous
selectivity of removed strings. If they appear to be frequent in newly added documents, the node
counts are incorrect. This is because they do not include the selectivities before pruning. As
mentioned in Section 1, the only way to solve this problem is to rebuild the CST over the
updated collection. Algorithms have been developed for constructing in-memory suffix trees in
linear time, proportional to the number of strings stored in it, by Weiner [19] and McCreight [15].
Ukkonen [18] later designed an online version. The ever-increasing amount of available text data,
however, calls for disk-based construction algorithms. The “Top Down Disk-based” strategy by
Tian et al. [17], despite the fact that it runs in quadratic time, is faster than the linear in-memory
alternatives, and its space consumption is lower than that of other algorithms described in
literature. Even if disk-based strategies have significantly decreased time and space building
overhead, the computational effort still stands in the way of rebuilding the tree frequently.
other hand, the final CST has to fit in main memory to let query optimizers estimate selectivities
in short time. Thus, pruning and its drawbacks cannot be avoided. The goal of our work therefore
is to find pruning strategies that incur less inaccuracy than the existing standard ones.
We will refer to further related work, in particular to algorithms from computational linguistics
which we adapt and deploy in our context, in Sections 3 and 4.
3. SYLLABLE COUNT SUFFIX TREE
This section proposes a new variant of the CST data structure for selectivity estimation on
collections of natural-language texts, the Syllable CST. Section 3.1 explains how syllabification
reduces the size of the suffix tree. Section 3.2 presents our new syllabification routine. Sections
3.3 and 3.4 explain why Porter’s algorithm is not sufficient for our purpose, and why we
implement another stemming routine. Section 3.5 illustrates the drawbacks of inserting non-words
in the data structure and proposes a solution.
3.1 Syllabification
According to the original definition of suffix tree [19], inserting an index term in the tree implies
generating all of its suffixes and inserting them as well. Given a string σ of length n, defined over
the alphabet Σ and a string terminator symbol $ (not in Σ and lexicographically subsequent to
any symbol in it), the i-th suffix is the substring starting with the i-th character of σ and
terminated by $.
Example 1. Given σ = information$, its suffixes are: (information$, nformation$, formation$,
ormation$, rmation$, mation$, ation$, tion$, ion$, on$, n$, $). □
Some of these suffixes are very unlikely to be ever addressed in a user query. The reason is that
they convey very little semantic meaning, e.g., -rmation. Syllables are natural word building
blocks in many languages. Using syllable-division points to compute suffixes reduces their
number. The remaining ones carry an enhanced semantic message at the same time.
Example 2. Given the syllabified string σ = in-for-ma-tion-$, the set of its syllable suffixes is:
(in-for-ma-tion-$, for-ma-tion-$, ma-tion-$, tion-$, $). □
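The reduction is easy to reproduce (Python; syllable borders are assumed to be given):

    def char_suffixes(term):
        s = term + "$"
        return [s[i:] for i in range(len(s))]

    def syllable_suffixes(syllables):
        parts = syllables + ["$"]
        return ["-".join(parts[i:]) for i in range(len(syllables))] + ["$"]

    print(len(char_suffixes("information")))                    # 12, as in Example 1
    print(len(syllable_suffixes(["in", "for", "ma", "tion"])))  # 5, as in Example 2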
Syllabification proves to be convenient to filter out suffixes that start with unusual combinations
of letters. Since the number of suffixes that need to be stored in the tree decreases, its size is
reduced as well. Figure 1 contrasts memory requirements of a standard CST built over the string
‘information’ with the equivalent Syllable CST. The number of nodes is more than halved. We
expect a similar size reduction for larger datasets as well. Experimental evaluations will confirm
this hypothesis (see Section 5).
Discussion. Syllabification-based filtering is not limited to English; it is applicable to any
character-based language, provided that there is a syllabification routine. In this paper, we use
English text for the examples and for the experiments.
Clearly, filtering out suffixes with the mechanism described here affects selectivity estimation for
substrings that do not start at syllable boundaries. For instance, the Syllable CST in Figure 1
would not be able to estimate the selectivity of the predicate LIKE ‘%nfo%’. We argue that this is
not a severe drawback: queries over natural-language text, even if it is dirty, are likely to
contain rather “natural” text fragments, e.g., LIKE ‘%info%’ or LIKE ‘%inform%’, as opposed to
‘%nfo%’.
FIGURE 1: Syllable CST on the String information$
3.2 The Syllabification Routine
The problem of syllabification of written words is closely tied to the hyphenation (or justification)
task [13]. Algorithms for splitting words at syllable boundaries can be classified as rule-based or
dictionary-based. The latter ones provide orthographically correct syllable division points by
performing a lookup in a dictionary. Although they guarantee greater accuracy, there are several
reasons why we did not pursue this option. First, since we want to minimize space requirements,
the overhead of a dictionary is not tolerable. The dictionary size strongly depends on the content
and the size of the corpus. The New York Times dataset of the Aquaint Corpus, for instance,
contains 352404 different terms. The size of the dictionary should be of the same order of
magnitude. Second, no matter the size of the dictionary, there will always be words that are not
contained in it (foreign language words, proper names, etc.). Furthermore, a language keeps
evolving, and new terms constantly enrich the dictionary. Third, since many syllabification rules
are based on how the word is pronounced, there are differences regarding syllabification of the
same term according to different English dictionaries.
Rule-based hyphenation systems, such as the LATEX algorithm [13], are an alternative. They
are typically faster and require less storage space, but they are inherently subject to errors. The
reason is that, even if a set of general rules has been defined, based on sophisticated linguistic
literature, syllabification is not an “exact” process. Most of the rules are based on the sound of
the spoken word and are not easy to implement. The so-called VV rule is an example [5]. It
suggests separating two consecutive vowels in two distinct syllables if they do not form a
diphthong, or considering them as part of the same syllable otherwise. Without any further
phonological information, it is impossible to tell if adjacent vowels form a diphthong or not. Since
our goal is exploiting syllabification to reduce memory requirements of the CST, however, we
have adopted a rule-based solution anyway. It trades grammatical accuracy for a reduction of
computational overhead and extra data required.
Example 3. The diphthong ie is split in sci-ence, since the vowels are pronounced separately.
The same diphthong in re-triev-al belongs to one syllable. □
To achieve high-quality results, the hyphenation routine of LATEX [13] uses close to 5,000 rules
in 5 levels. The representation of these rules alone would consume more than half of the
memory available for a selectivity-estimation data structure. In addition, processing a term
through 5 levels of rules causes more computational effort than acceptable to obtain selectivity
estimates. Further, the LATEX hyphenation algorithm is tuned towards missing some division
points rather than dividing terms at incorrect positions. Because we remove suffixes not starting
at syllable boundaries, this would result in too many suffixes being filtered out. Consequently, we favor
faulty hyphenation over missed division points to some extent. In other words, the design goal
behind our hyphenation routine is different from the one of the routine used in LATEX.
Because we wanted to exploit syllabification to reduce memory requirements of the suffix tree,
we chose a rule-based approach. To minimize the computational effort, we use a very small set
of rules. To miss as few division points as possible, our rules are more aggressive than the ones
in [5]. The basic idea of our syllabification routine is to determine syllabification points matching
regular expressions over the consonant-vowel structure of the word. Function 1 lists the pseudo-
code of the function computing the word structure. The output string is constructed by mapping
each character to V, in case of a vowel (Line 3), and to C otherwise (Line 4). Function 2 contains
the pseudo-code of the syllabification routine. First the algorithm checks the word length (Line 1):
Words shorter than four characters are left unchanged (e.g., box, cat).
Function 1: computeWordStructure
Input: String word
Output: String wordStructure
1 for (i = 0; i < word.length; i++)
2 if (word[i] is vowel)
3 wordStructure[i] = ’V’;
4 else wordStructure[i] = ’C’;
5 return wordStructure;
In English, the number of syllables equals the number of vowel sounds. We assume that a vowel
sound corresponds to a sequence of consecutive vowels. We compute the number of vowel
sounds (Line 2). If it is less than two, the word is returned (Line 3).
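For illustration, a minimal Java version of countVowelSounds could look as follows; treating y as a vowel is a simplifying assumption of this sketch:

// Illustrative sketch of countVowelSounds(): under the paper's assumption,
// a vowel sound corresponds to a maximal run of consecutive vowels.
public class VowelSounds {

    static int countVowelSounds(String word) {
        int sounds = 0;
        boolean inRun = false;
        for (char c : word.toLowerCase().toCharArray()) {
            boolean vowel = "aeiouy".indexOf(c) >= 0; // 'y' treated as a vowel here
            if (vowel && !inRun) sounds++; // a new run of vowels begins
            inRun = vowel;
        }
        return sounds;
    }

    public static void main(String[] args) {
        System.out.println(countVowelSounds("retrieval")); // runs e, ie, a -> 3
    }
}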
Function 2: getSyllables
Input: String word
Output: String syllabifiedWord
1 if (word.length < 4) return word;
2 int vowelSounds = countVowelSounds(word);
3 if (vowelSounds < 2) return word;
4 wordStructure = computeWordStructure(word);
5 if (wordStructure contains consonantBlends) {
6 replace "CC" with "DD"; // Check patterns with blends
7 if (wordStructure.match("VCDDV"))
8 replace "VCDDV" with "VC-DDV";
9 if (wordStructure.match("CVDDV"))
10 replace "CVDDV" with "CV-DDV";
11 }
12 // Check common patterns
13 if (wordStructure.match("VCCV"))
14 replace "VCCV" with "VC-CV";
15 if (wordStructure.match("VCCCV"))
16 replace "VCCCV" with "VC-CCV";
17 if (wordStructure.match("CVVCV"))
18 replace "CVVCV" with "CVV-CV";
19
20 syllabifiedWord = getDivisionPoints(wordStructure);
21 return syllabifiedWord;
The following step is the construction of the string describing the structure of the word (Line 4).
To find syllable boundaries, we test the word structure against 5 basic patterns (Lines 13-19).
Table 1 shows the patterns and how we derive syllabification points from them; Table 1a
illustrates the process for one word.
Vowel-Consonant-Vowel Structure Derived Syllabification Examples
VCV V-CV mo-tor, lu-nar
VCCV VC-CV un-der, sub-way
VCCCV VC-CCV im-print, pil-grim
VCCCCV VCC-CCV sand-stone
VCCCCCV VCC-CCCV high-street
TABLE 1: The Basic Syllabification Patterns
Step Word Status Word Structure
Input word undoubtedly -
Line 4 word structure undoubtedly VCCVVCCVCCV
Line 13 found division point (pattern VCCV) un-doubtedly VC-CVVCCVCCV
Line 13 found division point (pattern VCCV) un-doub-tedly VC-CVVC-CVCCV
Line 13 found division point (pattern VCCV) un-doub-ted-ly VC-CVVC-CVC-CV
Line 20 Transfer division points from structure to word un-doub-ted-ly VC-CVVC-CVC-CV
TABLE 1a: Syllabification Algorithm Applied to undoubtedly
Consonant blends and digraphs are pairs of consonants that are sounded together and
therefore should not belong to different syllables. However, the common syllabification pattern
VC-CV would incorrectly syllabify words like migration to mig-ra-tion if we did not take the
presence of the consonant blend gr into account. Therefore, we define a set of common English
blends and digraphs (bl, cl, dl, fl, pl, gl, br, cr, dr, fr, pr, tr, ch, gh, ph, sh, th, wh, kn,
ck) and check if any of them appear in the word (Line 5). In case of a match,
we modify the word structure (Line 6) by replacing the digraph’s consonants CC with DD. We
then test the word structure against 14 specifically designed patterns (Lines 7-10), see Table 1b.
If none of these patterns matches, we use the five basic patterns described above (Lines 13-19).
Finally, we retrieve the division points from the positions of the dashes within the word-structure
string (Line 20) and return the syllabified word.
Vowel-Consonant-Vowel Structure Derived Syllabification Examples
VDDV V-DDV fa-ther
VCDDV VC-DDV im-print
VDDCV VDD-CV bank-rupt
VDDCCV VDD-CCV -
VCDDCV VC-DDCV un-thrilled
VCCDDV VCC-DDV off-shore
VDDDDV VDD-DDV bank-crush
VDDCCCV VDD-CCCV
VCDDCCV VCDD-CCV
VCCDDCV VCC-DDCV
VCCCDDV VCC-CDDV off-spring
VDDDDCV VDD-DDCV
VDDCDDV VDD-CDDV high-school
VCDDDDV VCDD-DDV worth-while
TABLE 1b: The Special Syllabification Patterns
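To illustrate how the consonant-vowel patterns drive the division, the following self-contained Java sketch implements Function 1 and the plain VC-CV rule of Table 1, reproducing the trace of Table 1a for undoubtedly. It is a simplified illustration: the remaining patterns of Table 1 and the blend handling of Table 1b are omitted, and treating y as a vowel is an assumption of this sketch.

// Illustrative sketch: compute the C/V structure (Function 1), insert
// division points for the VC-CV pattern, and transfer them to the word.
public class SyllabifySketch {

    static boolean isVowel(char c) {
        return "aeiouy".indexOf(Character.toLowerCase(c)) >= 0;
    }

    // Function 1: map each character to 'V' (vowel) or 'C' (consonant).
    static String wordStructure(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            sb.append(isVowel(word.charAt(i)) ? 'V' : 'C');
        }
        return sb.toString();
    }

    // Insert a dash after "VC" wherever "VCCV" matches, scanning left to
    // right (cf. the trace in Table 1a).
    static String divide(String structure) {
        StringBuilder sb = new StringBuilder(structure);
        for (int i = 0; i + 3 < sb.length(); i++) {
            if (sb.substring(i, i + 4).equals("VCCV")) {
                sb.insert(i + 2, '-');
                i += 2; // skip past the division point just inserted
            }
        }
        return sb.toString();
    }

    // Transfer the division points from the structure back to the word (Line 20).
    static String syllabify(String word) {
        String divided = divide(wordStructure(word));
        StringBuilder out = new StringBuilder();
        int j = 0;
        for (int i = 0; i < divided.length(); i++) {
            if (divided.charAt(i) == '-') out.append('-');
            else out.append(word.charAt(j++));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(syllabify("undoubtedly")); // un-doub-ted-ly
    }
}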
This approach has a few weaknesses. Compound words, for instance, are potential error
sources. They should be divided between the words that make them up, but the sole analysis of
the word structure cannot tell where the exact division point is located. The word sandbox, for
instance, will be divided according to the syllabification rule that suggests splitting a sequence of
three consonants after the first one (Line 16). This produces the incorrect san-dbox.
However, even if the algorithm does not always detect correct division points, we will
experimentally verify that these inaccuracies hardly affect the quality of selectivity estimates. In
Section 4.3, we will discuss the effect of inaccuracies in more detail.
3.3 Stemming
Morphological variants of the same term (plural, past tense, third person singular, etc.) are
responsible for a considerable growth of the data structure. Figure 2 shows how the size of a
Syllable CST built on the string connect increases when its past tense and continuous forms are
included. A common practice to reduce the size of an index is stemming, i.e., conflating inflected
terms to their root form. Conflating connected and connecting with connect is reasonable since
they convey the same semantic message. Several stemming algorithms have been proposed in
literature [9, 16, 14]. Porter’s stemmer [16] probably is the most popular one. It adopts a longest-
match suffix stripping strategy, through a series of linear steps. Since it is fast and does not
require additional storage, we use Porter’s algorithm as a preprocessing step. However, the
number of stems can be further reduced. Porter’s algorithm does not deal with some common
suffixes (e.g., -less, -ution, -ary, etc.). [9] features a detailed description of errors and wrong
conflations made by Porter’s stemmer. Furthermore, it does not deal well with compound
suffixes, namely suffixes obtained by concatenating more than one suffix. The adverb
increasingly, for instance, is not stemmed. The reason is that the suffix -ly is removed only when
it inflects adjectives ending with -ent or -al. Not conflating increasingly with its stem implies the
generation of more suffixes, i.e., nodes.
FIGURE 2: Effect of Morphological Variants of the same Word on the Syllable CST’s Size
Traditionally, stemming only deals with inflectional or derivational suffixes, but rarely attempts to
remove prefixes [11]. However, we observe that removal of prefixes further reduces the number
of nodes of a suffix tree. If we conflated disconnect with its stem connect, we would save the
space required by the additional suffix dis-con-nect. In Section 3.4, we will illustrate how we
manage to reach this size reduction, without losing the information carried by the prefix. We
propose a more aggressive stemming routine based on Porter’s stemmer that lets us decrease
the CST size, without significantly reducing the accuracy of selectivity estimations.
3.4 Stemming Routine
Function 3 lists the pseudo-code of our stemming routine. Our algorithm invokes Porter’s
algorithm (Line 3), but only as a preprocessing step. It then iteratively removes common English
prefixes and suffixes that Porter’s stemmer may not have stripped. The algorithm places a
condition on the minimum length of the final stem. The affix stripping process stops if no affix is
found, or if the removal of one affix would result in a stem shorter than three characters (Line 4).
We divide suffixes into four classes, depending on their inflectional role (adverbial, noun,
adjective, verbal), and use an out-of-alphabet symbol to identify each class (Line 8,
suffixCategoryID). Each removed suffix is coded as a string obtained by concatenation of the
category identifier and its length (Line 8-9). The reason is that a code is typically shorter than the
suffix itself. This reduces the number of bytes to store it. We observe that there are not many
suffixes belonging to the same category that can be attached to the same stem, therefore the
probability of conflations due to the same code is low. To reduce it further, we use the
information on the length. This way, even if -less and -ful are both adjective suffixes, we can
distinguish them based on their length.
In contrast, we do not code prefixes, for the following reason. Dividing prefixes into classes to take
advantage of a codified representation is not trivial. Prefixes add a specific connotation to the
meaning of the word. Numerical prefixes, for instance bi-, tri-, give quantitative information;
negative prefixes express the opposite meaning of a term. However, it is more difficult to state
which message is carried by other prefixes and use it to divide them into classes, as we did with
suffixes.
Function 3: Stemming algorithm
Input: word, suffixes[4], prefixList
Output: stemmedWord
1 Array suffixCodes; // Store coded suffixes
2 Array prefixes; // Store removed prefixes
3 word = porterStem(word);
4 while (affixFound and word.length > 3) {
5 if (affix is suffix and word.length - affix.length > 2) {
6 if (affix in any suffixes[]) {
7 // Check suffix category and code suffix
8 suffixCode = suffixCategoryID + suffixLength;
9 suffixCodes.add(suffixCode);
10 }
11 } else if (affix in prefixList) prefixes.add(affix);
12
13 word = word - affix; // Affix stripping
14 }
15 stemmedWord = word + prefixes + suffixCodes;
16 return stemmedWord;
By reordering prefixes and suffixes, we can further reduce the size of the CST. The algorithm
changes the order of affixes as follows: We syllabify the stem and attach the prefixes and
codified suffixes (Line 15).
Example 4. Suppose that adverbial and adjective suffixes are associated with the symbols α and
β, respectively. Given the word undoubtedly, Table 2 reports the steps of the algorithm and its output.
Figure 3 shows how the number of nodes of the CST built on un-doubt-ed-ly decreases thanks to
this strategy. We can omit the tree branch generated by the prefix. □
FIGURE 3: Moving the Prefix to the End of the Stem
Moving prefixes behind the stem has another benefit: in case of pruning, the stem is the last to be
pruned. This preserves its distinctive semantics as long as possible. We do not move the
prefixes behind the suffixes, however, because the prefix carries more semantic meaning than
the suffixes and should thus not be pruned before them.
Step Word Status Word Parts Split
Input word undoubtedly
Line 3 Porter stem undoubtedli
Line 6 Suffix found undoubted α2
Line 6 Suffix found undoubt α2, β2
Line 11 Prefix found doubt α2, β2 un
Line 15 String stemmed doubtunα2β2
TABLE 2: Stemming Algorithm Applied to undoubtedly
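The following self-contained Java sketch reproduces this trace. The affix inventory (the suffix classes -li and -ed with the category symbols α and β, and the prefix un-) is a toy assumption for this one example, and the Porter step is stubbed out.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the affix stripping and reordering of Function 3,
// reproducing Table 2 for "undoubtedly" (Porter output "undoubtedli").
public class StemSketch {

    public static void main(String[] args) {
        String word = "undoubtedli"; // result of the (stubbed) Porter step
        List<String> suffixCodes = new ArrayList<>();
        List<String> prefixes = new ArrayList<>();

        // Toy affix inventory, assumed for this example only.
        String[][] suffixClasses = { { "li" }, { "ed" } }; // adverbial, adjective
        char[] categoryIds = { 'α', 'β' };
        String[] prefixList = { "un" };

        boolean affixFound = true;
        while (affixFound && word.length() > 3) {
            affixFound = false;
            // Try the coded suffix classes first (Lines 5-10 of Function 3).
            for (int c = 0; c < suffixClasses.length && !affixFound; c++) {
                for (String suf : suffixClasses[c]) {
                    if (word.endsWith(suf) && word.length() - suf.length() > 2) {
                        suffixCodes.add("" + categoryIds[c] + suf.length());
                        word = word.substring(0, word.length() - suf.length());
                        affixFound = true;
                        break;
                    }
                }
            }
            // Then the uncoded prefixes (Line 11).
            if (!affixFound) {
                for (String pre : prefixList) {
                    if (word.startsWith(pre) && word.length() - pre.length() > 2) {
                        prefixes.add(pre);
                        word = word.substring(pre.length());
                        affixFound = true;
                        break;
                    }
                }
            }
        }
        // Reorder: stem first, then prefixes, then coded suffixes (Line 15).
        System.out.println(word + String.join("", prefixes)
                + String.join("", suffixCodes)); // doubtunα2β2
    }
}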
3.5 Non-Word Detection
Typographical errors are a serious problem regarding CST space requirements. Typos and
misspelled words result in undesirable suffixes, i.e., CST nodes. Figure 4 shows how the CST
grows if the incorrect term developement is added to a CST built on development. We observe
that we can save space by not inserting mistyped variants of index terms in the CST, but storing
them in another way, as explained below. The benefit of non-word detection grows with the
number of misspellings in the text collection. This is because the more misspellings are present
in the document collection, the more nodes are created while building the CST, similarly to the
example in Figure 4. An observation of ours, based on our experiments, is that the probability of
identical accidental misspellings is very low. Consequently, the CST nodes created for the
suffixes of misspelled terms are relatively useless: due to their low frequency, they will most
likely be pruned after the CST is built. Nevertheless, they require a considerable amount of
memory while building the CST, so it is beneficial to exclude misspelled words right away.
A common technique for detecting misspellings is based on n-gram analysis [10]. Given a string,
an n-gram is any substring of length n. n-gram analysis requires a set of training words which
must be sufficiently representative of the language. From these words, n-grams are extracted
and inserted into a table (Dictionary Table). We investigate four strategies to detect non-words.
Two of them, Trigram Analysis (TA) and Positional Trigram Analysis (PTA), are based on
conventional trigram analysis, the other ones, Syllable Analysis (SA) and Positional Syllable
Analysis (PSA), are more recent and are similar to [3].
FIGURE 4: Inserting developement in the CST
Example 5. Table 3 illustrates which strings will be inserted in the Dictionary Table from the
word inform according to each strategy. □
Trigram analysis (TA) inf, nfo, for, orm
Positional trigram analysis (PTA) inf_0, nfo_1, for_2, orm_3
Syllable analysis (SA) in, form
Positional syllable analysis (PSA) in_0, form_1
TABLE 3: N-Grams Generated from the Stem inform
In our case, the Dictionary Table is a simple set data structure. With TA, the Dictionary Table
contains trigrams. In the positional case (PTA), we attach the information on the position of the
trigrams within the words to the trigram string. Syllable analysis (SA) generates non-overlapping
syllables, not n-grams; in the positional case (PSA), we store the syllable string together with its
position within the word. In order to determine if a given term is a non-word, we extract from it
the trigrams (in TA and PTA) or syllables (in SA and PSA). We then look up these parts in the
Dictionary Table. We consider a term a non-word if it contains at least one trigram (or syllable,
respectively) that is not present in the dictionary. The idea of using syllables to detect non-words
is borrowed from [3]. They demonstrate that syllables are superior to state-of-the-art character n-
grams, using Indonesian texts. However, the effectiveness of the method on English text is
currently unclear. In our evaluation (Section 5), we will therefore test the strength of syllables in
detecting non-words on English texts and compare it to trigram analysis.
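As an illustration of the four strategies, the following Java sketch (ours; the tiny dictionary in the usage example is assumed) extracts the parts listed in Table 3 and performs the set lookup described above:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of TA/PTA (overlapping trigrams, optionally tagged
// with their position) and SA/PSA (non-overlapping syllables) lookups.
public class NonWordCheck {

    static List<String> trigrams(String word, boolean positional) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            String g = word.substring(i, i + 3);
            grams.add(positional ? g + "_" + i : g);
        }
        return grams;
    }

    static List<String> syllables(String syllabified, boolean positional) {
        List<String> grams = new ArrayList<>();
        String[] parts = syllabified.split("-");
        for (int i = 0; i < parts.length; i++) {
            grams.add(positional ? parts[i] + "_" + i : parts[i]);
        }
        return grams;
    }

    // A term is a non-word if at least one part is missing from the
    // Dictionary Table (a plain set).
    static boolean isNonWord(List<String> parts, Set<String> dictionaryTable) {
        for (String p : parts) {
            if (!dictionaryTable.contains(p)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(List.of("inf", "nfo", "for", "orm"));
        System.out.println(trigrams("inform", false));  // [inf, nfo, for, orm]
        System.out.println(syllables("in-form", true)); // [in_0, form_1]
        System.out.println(isNonWord(trigrams("inforn", false), dict)); // true
    }
}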
n-grams characterize the morphological structure of a language well [20]. However, out-of-
dictionary n-grams do not necessarily identify a mistyped word. Foreign language words, for
instance, show a different morphological structure and could be misclassified as errors. Terms like
Albuquerque or Afghanistan, which contain the uncommon trigrams uqu and fgh respectively, are
considered invalid and would therefore not be inserted in the CST, regardless of their selectivity. We
therefore propose to deal with non-words as follows. We introduce the so-called Invalid N-gram
Table, to store invalid n-grams and their selectivity. We use the information contained in this
table to estimate the selectivity of non-words, as we will explain in Section 4. Memory
requirements of this additional structure are significantly lower than the overhead of storing non-
words and their suffixes in the CST. Note that the Dictionary Table does not require any memory
after the building phase: It is a temporary data structure for testing the validity of index terms and
is discarded after the CST has been completely built.
4. SYLLABLE CST CONSTRUCTION AND SELECTIVITY ESTIMATION
This section describes how we build the Syllable CST and the additional structures used for
selectivity estimation. Section 4.1 describes how we build the Syllable CST and the Invalid N-gram
Table. Section 4.2 presents a new node-labeling strategy, which yields a more concise
representation of the CST. Section 4.3 demonstrates how to estimate the selectivity of strings
with our model. Finally, Section 4.4 presents selectivity estimation in case of pruning.
4.1 Building the Syllable CST
Function 4 is the procedure that inserts index terms in the Syllable CST (SylCST). Prior to the
insertion in the SylCST, every term is decomposed into its trigrams or syllables, according to one
of the strategies described in Section 3.5 (Line 1). The Dictionary Table is checked for the
presence of each n-gram. If at least one n-gram is not in the Dictionary Table (Line 3), the word is
marked as invalid (Line 4). All invalid n-grams are stored in the Invalid N-gram Table, together with their
selectivity (Line 5). The rationale is that we can identify a non-word with its invalid n-grams and
use their selectivity to estimate the selectivity of the entire word. This is similar to the data
structure [4] refers to as a q-gram estimator. As opposed to [4], however, we do not expect
severe overestimations. This is because we expect the invalid n-grams to be rather distinctive. In
particular, the more characteristic the n-grams of a non-word are, the more accurate is the
estimation. The out-of-dictionary trigram fgh, for instance, strongly identifies Afghanistan,
especially in combination with positional analysis. We can reasonably suppose that its selectivity
will be close to the one of the word itself. At the expense of little estimation inaccuracies, we can
save the space required by suffixes of non-words. This approach also prevents the generation of
isolated tree nodes. Non-words suffixes are unlikely to share nodes with English words (e.g., no
English word starts with fghanistan). Only words that contain no invalid n-grams (valid words) are
inserted in the Syllable CST (Line 8). After stemming, they are syllabified, and syllable suffixes
are generated (Lines 9-11). Finally, all suffixes are inserted in the SylCST (Line 12). This
reduces the size of the CST already during construction. Thus, building the CST becomes less
resource-intensive.
Function 4: BuildCST
Input: word, dictionaryTable
Output: invalidTable, SylCST
1 wordNGrams = generateNgrams(word); validWord = true;
2 for each nGram in wordNGrams {
3 if (nGram not in dictionaryTable) {
4 validWord = false;
5 invalidTable.add(nGram, wordSelectivity);
6 }
7 }
8 if (validWord == true) {
9 stem = stem(word);
10 syllabifiedStem = syllabify(stem);
11 syllableSuffixes = syllableSuffixes(syllabifiedStem);
12 SylCST.add(syllableSuffixes);
13 }
14 return SylCST, invalidTable;
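The following self-contained Java sketch mirrors this control flow. The Section 3 routines are stubbed out, and both the SylCST and the Invalid N-gram Table are reduced to counting maps for brevity; summing the selectivities of words that share an invalid n-gram follows the behavior described in Example 7 (Section 5.3.2).

import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of Function 4: route invalid words to the Invalid
// N-gram Table, insert syllable suffixes of valid words into the SylCST.
public class BuildFlow {

    static void insertTerm(String word, int wordSelectivity,
                           List<String> nGrams, // n-grams of word, per Section 3.5
                           Set<String> dictionaryTable,
                           Map<String, Integer> invalidTable,
                           Map<String, Integer> sylCst) {
        boolean validWord = true;
        for (String g : nGrams) {
            if (!dictionaryTable.contains(g)) {
                validWord = false;
                // Accumulate the word's selectivity on the invalid n-gram.
                invalidTable.merge(g, wordSelectivity, Integer::sum);
            }
        }
        if (validWord) {
            for (String suffix : syllableSuffixes(syllabify(stem(word)))) {
                sylCst.merge(suffix, 1, Integer::sum);
            }
        }
    }

    // Stubs standing in for the Section 3 routines sketched above.
    static String stem(String w) { return w; }
    static String syllabify(String w) { return w; }
    static List<String> syllableSuffixes(String s) { return List.of(s + "-$"); }
}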
4.2 Node Labeling Strategies
Standard CST construction mechanisms label each node with a single character and then apply
path compression [19], i.e., collapsing unary children with their parent node. We refer to this
mechanism as Standard Labeling (SL). We introduce a new node labeling strategy. When
inserting syllable suffixes in the tree, we label each node with a syllable, instead of a single
character, and then apply path compression. We will refer to our labeling mechanism as Atomic
Labeling (AL). As we will explain later, this approach yields a more compact representation of the
Syllable CST. Since the two structures have the same content, the difference in size is due
exclusively to how syllables drive the path-compression mechanism.
FIGURE 5: Labeling each Node with a Syllable Instead of a Single Character
Example 6. Consider a SylCST that is built over the strings in-ter-net, in-te-ger. Standard
Labeling ignores internal syllable division points of syllable suffixes and inserts the strings
internet, ternet, net, integer, teger, ger. Conversely, nodes created by Atomic Labeling store
syllables and not single characters. Note that we use our optimization introduced before that
uses only syllable suffixes instead of all suffixes, with both Standard Labeling and Atomic
Labeling. The result is that path compression produces two different tree structures. Figure 5
illustrates this. □
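A minimal Java sketch of Atomic Labeling makes the difference tangible; the map-based trie is an illustrative stand-in for the actual node structure, and path compression is omitted for brevity:

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: each node is labeled with a whole syllable and keeps
// a count; path compression would collapse single-child chains afterwards.
public class AtomicLabelTrie {

    static class Node {
        int count = 0;
        Map<String, Node> children = new HashMap<>(); // syllable -> child
    }

    final Node root = new Node();

    // Insert one syllable suffix, e.g. "ter-net-$", one syllable per node.
    void insert(String syllableSuffix) {
        Node cur = root;
        for (String syl : syllableSuffix.split("-")) {
            cur = cur.children.computeIfAbsent(syl, k -> new Node());
            cur.count++;
        }
    }

    public static void main(String[] args) {
        AtomicLabelTrie t = new AtomicLabelTrie();
        for (String s : new String[] { "in-ter-net-$", "ter-net-$", "net-$",
                                       "in-te-ger-$", "te-ger-$", "ger-$" }) {
            t.insert(s);
        }
        // The root fans out over whole syllables, so the paths of in-ter-net
        // and in-te-ger split directly below the node labeled "in".
        System.out.println(t.root.children.keySet()); // distinct first syllables
    }
}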
The SylCST obtained with Standard Labeling has a small number of nodes in its first level,
because the fan-out of a node is at most the alphabet size. On the other hand, it is relatively
deep, since the depth is at most the length of the second longest suffix in the CST, and thus has
many internal nodes, such as the one labeled with te. Conversely, Atomic Labeling produces a
Syllable CST which is broader in its first level. The fan-out of the root is equal to the number of
syllables in the tree, the fan-out of other nodes is at most this number. This CST is also
shallower. This is because the depth is at most the number of syllables in the suffix with the
second-most syllables in the CST. Thus, we hypothesize that Atomic Labeling can further reduce
CST size, without affecting selectivity estimation. The reason is that, with Atomic Labeling, the
paths of suffixes with common prefixes, but different syllabification points (e.g. in-te-ger, in-ter-
net) split at the start of the first syllable they do not have in common. This leads to internal nodes
with a high fan-out. With standard labeling, in contrast, the paths split at the first character they
do not have in common, resulting in internal nodes with a lower fan-out. Mathematically, for a
fixed number of leaves (each leaf represents one suffix), the number of internal nodes is
negatively correlated with the average fan-out of the inner nodes: A lower fan-out results in a
deeper tree with more internal nodes; a higher fan-out results in a shallower tree with fewer
internal nodes. Experimental results will confirm this hypothesis.
Clearly, the atomic labeling strategy prevents estimating substrings that do not start and end at
syllable borders. However, this is not a problem if queries use keywords as predicates. This is
likely for textual data. If other wildcard predicates were frequent, the standard labeling strategy
would be advantageous.
4.3 Selectivity Estimation
Once the CST has been built, it can be used for selectivity estimation. The string in question is
first decomposed into its n-grams. This is to determine whether its structure respects the morphological
profile described by n-gram analysis. This means searching for the presence of any of its n-
grams in the Invalid N-gram Table. If no match is found, then the string, if present, must have
been stored in the CST. The tree is traversed from the root to the node labeled with the string,
and its count stores the selectivity sought. Conversely, if the string contains at least one invalid n-
gram, then its selectivity estimate is the minimum of the selectivities of its invalid n-grams.
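A compact Java sketch of this lookup logic could look as follows; the CST traversal is abstracted into a count map here purely for illustration:

import java.util.List;
import java.util.Map;

// Illustrative sketch of the estimation procedure of Section 4.3: check the
// Invalid N-gram Table first, otherwise look the string up in the CST.
public class Estimator {

    static int estimate(String term, List<String> nGrams,
                        Map<String, Integer> invalidTable,
                        Map<String, Integer> cstCounts) {
        // If the term contains invalid n-grams, the estimate is the minimum
        // of their selectivities.
        int min = Integer.MAX_VALUE;
        for (String g : nGrams) {
            Integer sel = invalidTable.get(g);
            if (sel != null && sel < min) min = sel;
        }
        if (min != Integer.MAX_VALUE) return min;
        // Otherwise the term, if present, is in the CST; its node count is
        // the selectivity sought (the map stands in for the tree traversal).
        return cstCounts.getOrDefault(term, 0);
    }
}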
We expect some estimation errors due to syllabification errors. To illustrate, consider again the
incorrect syllabification of the word sandbox (Section 3.2), which is divided according to the
syllabification rule that suggests splitting a sequence of three consonants after the first one. This
produces the incorrect san-dbox. However, this has no effect at all when estimating the
selectivity of the term sandbox. But the selectivity of the term box will be slightly underestimated.
This is because the occurrences of the suffix box in the term sandbox are not counted due to the
incorrect syllabification. However, we do not expect this effect to induce severe errors. This is
because we expect the selectivity of a basic (non-compound) word w to be much larger than the
selectivity of a specific compound word w is part of. In addition, our experiments will show that,
even if the impact of incorrect syllabifications is non-negligible, the benefit of syllabification
outweighs it by far.
4.4 Pruning
Since both the Invalid N-gram Table and the Syllable CST still have high memory requirements
when built on large text corpora, we cannot do without pruning. We have implemented the
common frequency-based pruning strategy. Given the maximum size of a CST (maximum
number of nodes), we iteratively remove nodes whose count is under a threshold T. In each
iteration, we increase T until the CST has the desired size. To estimate the selectivity of a valid
string that is no longer in the pruned CST, we introduce a syllable-based variant of the MO estimator [7].
If the searched string σ is syllabified as σA-σB-σC, its estimated selectivity (ESel) is:
ESel(σ) = Sel(σAB) · Sel(σBC) / Sel(σB)
where σAB = σAσB, σBC = σBσC. If any of the previous terms is not in the CST because it has been
pruned, then the selectivity of the string is estimated as the value of the pruning threshold T.
Given a non-word, if the Invalid Table has to be pruned as well, and none of its invalid n-grams
is found, its selectivity is set to the pruning threshold.
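For illustration, a minimal Java sketch of this estimate with the pruning fallback is given below; the counts in the usage example are assumed, and sel values of -1 stand in for parts pruned from the CST:

// Illustrative sketch of the syllable-based MO estimate for a pruned string
// syllabified as sigA-sigB-sigC; t is the pruning threshold T.
public class MoEstimate {

    static double estimate(double selAB, double selBC, double selB, double t) {
        if (selAB < 0 || selBC < 0 || selB <= 0) return t; // pruned parts
        return selAB * selBC / selB;
    }

    public static void main(String[] args) {
        // Assumed counts: Sel(sigAB) = 40, Sel(sigBC) = 30, Sel(sigB) = 120.
        System.out.println(estimate(40, 30, 120, 5)); // 40 * 30 / 120 = 10.0
        System.out.println(estimate(40, -1, 120, 5)); // falls back to T = 5.0
    }
}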
5. EXPERIMENTAL EVALUATION
We evaluate the performance of our Syllable CST. In particular, we first assess how
syllabification, atomic node labeling and non-word filtering, reduce the size of the CST. Second,
we compare the selectivity-estimation accuracy of the standard CST and the different variants of
the Syllable CST. Third, we study how pruning affects the accuracy of selectivity estimates for
the different CST types. Finally, we measure the impact of the number of errors in the data; in
particular, we examine whether our linguistics-based optimizations still work well in the presence of many
errors.
For our experiments, we use four English newswire text corpora, Reuters-21578 (Reuters) [12]
and three datasets of the Aquaint Corpus (APW, XIE, NYT) [6]. Since the collections contain
data in SGML format, we first parse them to extract text fields only. We then tokenize them to
extract single words. Stop words are filtered out, and all terms are converted to lowercase. Table
4 contains statistics on the number of documents, the number of distinct terms, and the size of a
complete CST (number of nodes) built on the collection.
Documents Distinct Terms CST Size (Nodes)
Reuters 21578 32554 86772
APW 239576 207616 558633
XIE 479433 243932 633899
NYT 314452 352404 979383
TABLE 4: Corpora Statistics
5.1 Effect of Syllabification
The Syllable CST built on the collections significantly reduces memory requirements compared
to the standard version. Table 5 shows that the size of the statistics data structure is more than
halved. Note that the figures quantify size as the number of nodes in the CST. The memory
footprint resulting from the number of nodes is implementation-specific; the currently optimal
implementation [22] takes about 8.5 bytes per node.
Corpus Full CST SL SylCST AL SylCST
Reuters 86772 41565 (52.1%) 37298 (57.0%)
APW 558633 308764 (44.7%) 234047 (58.1%)
XIE 633899 307001 (51.6%) 271018 (57.2%)
NYT 979383 526955 (46.2%) 399432 (59.2%)
TABLE 5: CST Size (in Nodes) and Size Reduction
Figure 6 shows graphically that a Syllable CST constructed according to the atomic mechanism
(AL SylCST, see 4.2) always has smaller memory requirements than the tree built according to
the standard algorithm (SL SylCST, see 4.2). This confirms the hypothesis from Section 4.2.
Memory requirements of a CST built on the NYT corpus are 40% of the initial size. This means
both that we need less memory to build the tree, and that we can prune the tree at a lower
threshold. The latter results in higher estimation accuracy.
FIGURE 6: Atomic Labeling Yields a Syllable CST with Reduced Memory Requirements
5.2 Effect of N-gram Analysis
n-gram analysis is initialized on a small reference dictionary of common English words (69004
terms, 650 KB). Each dictionary entry is Porter stemmed, and its n-grams are computed
according to one of the strategies from Section 3.5 and stored in the Dictionary Table. Each
index term is then processed, and out-of-dictionary n-grams are inserted in the Invalid N-gram
Table. Table 6 reports the number of entries of each table.
TA SA PTA PSA
Dictionary 5888 10305 22880 15101
Invalid Table Reuters 3954 9240 11868 11071
Invalid Table APW 6873 39728 41803 51726
Invalid Table XIE 7517 49179 45601 62277
Invalid Table NYT 8421 68914 63951 88623
TABLE 6: Dictionary and Invalid N-gram Table Size
The Invalid Table in turn is retained since we use it to estimate the selectivity of non-words.
Table 6 shows that the greater the corpus size, the larger is the Invalid N-gram Table, and its
memory requirements can become non-negligible. To limit its size, we set its maximum number
of entries to an eighth of the tree size. This is roughly the acceptable size ratio proposed in [4] for
the n-gram table. We followed the frequency-based approach proposed in [4] to prune the Invalid
Table. I.e., we remove the entries with the lowest frequencies until the n-gram table has at most
one eighth the number of entries as the CST has nodes. This increases the estimation error only
insignificantly (by less than 0.1%) because the pruning threshold is very low, compared to that of
the CST. The reason is that most entries in the Invalid Table are due to misspellings, which
rarely have a frequency above 2. Only few entries represent proper names and therefore have a
higher frequency. By keeping these high-selectivity n-grams, we can compute better estimates
for non-words that turn out to be frequent. This is because of our hypothesis that the selectivity
of invalid n-grams is rather strictly related to the selectivity of the words they have been
generated from (see Section 4.1).
We evaluate the strength of each strategy (trigram and syllable analysis, with and without
considering in-word position) regarding non-word detection and the impact on the size of the
statistics data structure. Figure 7 shows that n-gram analysis alone considerably reduces the size
of the CST, without syllabification in this experiment. An inspection of the corpora reveals that
the size reduction increases with the number of non-words in the corpus: The reduction is highest
for the Aquaint XIE corpus, which contains the most non-English words, and lowest for the
Reuters corpus, which is the cleanest. Thus, non-word filtering is particularly beneficial if the data
is not very clean.
FIGURE 7: Size Reduction due to N-Gram Analysis
SL SylCST AL SylCST
Reuters
TA 34538 (-60.2%) 30514 (-64.8%)
SA 29191 (-66.4%) 25630 (-70.5%)
PTA 26454 (-69.5%) 23374 (-73.1%)
PSA 25847 (-70.2%) 22721 (-73.8%)
APW
TA 239898 (-57.1%) 178799 (-68.0%)
SA 197059 (-64.7%) 147959 (-73.5%)
PTA 153005 (-72.6%) 110260 (-80.3%)
PSA 154910 (-72.3%) 113454 (-79.7%)
XIE
TA 216907 (-65.8%) 190864 (-69.9%)
SA 179375 (-71.7%) 156724 (-75.3%)
PTA 126221 (-80.1%) 111010 (-82.5%)
PSA 132886 (-79.0%) 116124 (-81.7%)
NYT
TA 419359 (-57.2%) 313492 (-68.0%)
SA 340327 (-65.3%) 256467 (-73.8%)
PTA 255629 (-73.9%) 183295 (-81.3%)
PSA 261281 (-73.3%) 190746 (-80.5%)
TABLE 7: Syllable CST Size (in Nodes) and Size Reduction
We observe that the positional variants are better at detecting and removing non-words. In
particular, positional trigram analysis performs better than the corresponding syllable strategy
and also has a smaller Invalid Table (see Table 6). Figure 7 shows that, for the Aquaint corpora,
more than half of the size of the CST without non-word filtering is attributable to non-words. The
size of the Syllable CST built exclusively over valid words is reported in Table 7. These results
demonstrate that syllable analysis is superior to state-of-the-art techniques in the non-positional
case. It filters out more words and yields a smaller CST. In the positional case, it does not
improve the results obtained with positional trigram analysis.
Figure 8 shows that positional trigram analysis and Atomic Labeling yield a very compact tree.
More specifically, the size of the CST built over all three Aquaint corpora is reduced to 20% of its
initial size using positional trigram analysis to filter out non-words. With Standard Labeling, non-
word filtering and syllabification still shrink the CST to at most 35% of its original size (see Table
7).
FIGURE 8: Size Reduction Obtained by Building the Syllable CST over Valid Words only
This means that, compared to existing techniques, (a) building the CST requires significantly less
memory, and (b) for a given memory size, we can significantly lower the pruning threshold. The
latter lets the MO algorithm better estimate the selectivity
of pruned suffixes. The experiments in the next sections will demonstrate this.
5.3 Accuracy of Estimations
The following sections report on experimental results on the accuracy of selectivity estimates
computed on the Syllable CST. Section 5.3.1 presents the metrics used. We follow the approach
adopted in [8, 7, 4] and evaluate positive queries (Section 5.3.2) and negative queries (Section
5.3.3). Positive queries are those that contain terms contained in the corpus, i.e., queries with a
selectivity greater than 0; negative queries, in turn, have a selectivity of 0. Finally, Section 5.4
demonstrates that estimation inaccuracies due to pruning are less severe on the Syllable CST.
5.3.1 Evaluation Metrics
For positive queries, we evaluate the accuracy of our estimation model based on the average
relative error, as suggested in [4]. It is defined as the ratio:
Average Relative Error (ARE) = |ESel − Sel| / Sel'
where ESel is the estimated selectivity and Sel the real selectivity of a given string. This
definition of the average relative error metric includes the correction suggested in [4] to
overcome the penalizing effect on low-selectivity strings. More specifically, given a corpus C of
|C| documents, if the actual selectivity of a string is smaller than 100/|C|, then the denominator is set to
100/|C|, formally:
Sel’ := max(Sel, 100/|C|)
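For illustration, the following small Java method computes the corrected ARE; the numbers in the usage example are assumed:

// Illustrative sketch of the ARE metric with the correction from [4]: the
// denominator is clamped at 100/|C| for very rare strings.
public class AreMetric {

    static double relativeError(double estimated, double actual, long corpusSize) {
        double selPrime = Math.max(actual, 100.0 / corpusSize);
        return Math.abs(estimated - actual) / selPrime;
    }

    public static void main(String[] args) {
        // A string occurring twice in a 10,000-document corpus, estimated at 3:
        // Sel' = max(2, 0.01) = 2, so the relative error is |3 - 2| / 2 = 0.5.
        System.out.println(relativeError(3, 2, 10_000)); // 0.5
    }
}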
We consider the quartile distributions introduced by the same authors [4] to show how the
accuracy of the estimator is biased. We bucketize the error distribution over the intervals
[-100%,-75%), [-75%,-50%), [-50%,-25%), [-25%,0%), [0,25%), [25%,50%), [50%,75%),
[75%,100%), [100%,∞).
Estimates that fall in the interval [0,25%) are exact estimations and small overestimations,
whereas the ones that fall in the first four buckets are underestimations.
Following again [4], we use the average absolute error and its percentage of the corpus size as
evaluation metric for negative queries.
5.3.2 Positive Queries
Testing the CST against positive queries means estimating the selectivity of strings that are
present in the collection. Unless the tree has been pruned, these strings are in the CST. To
evaluate the accuracy of our estimator for positive queries, we take the corpus terms and
estimate their selectivity as described in Section 4.3. Figure 9 shows the results. The average
relative error for the Syllable CST, without introducing n-gram analysis to skip non-words, is
minimal for Reuters (3.5%) and maximal for Aquaint NYT, where it is slightly over 10%. The
different average errors for the different corpora are due to different average document sizes:
The larger the documents, the more terms are frequent, and the higher the document
frequencies of the terms, for the same number of documents. This incurs larger absolute errors
(the numerator of the relative error) for the same number of documents (a hundredth of which is a
lower bound of the denominator, cf. the ARE formula) and thus yields a higher relative
error. Average document size is highest in the Aquaint NYT corpus and lowest in the Reuters
corpus; Aquaint XIE and APW lie in between. The average relative estimation errors behave
accordingly.
FIGURE 9: Average Relative Error
The experimental results indicate that conflations due to our stemming algorithm do not
introduce significant selectivity-estimation errors. The benefits gained from non-word detection in
turn come at the cost of some errors: The average relative error for the SylCST without non-word
filtering (the leftmost data points in Figure 9) is lower than with any of the non-word filtering
strategies enabled (the TA, SA, PTA, and PSA data points). Further, there are more errors with
n-gram analysis (the TA and PTA data points) than with syllable analysis (the SA and PSA data
points). Overestimations, due to invalid words identified by the same invalid n-gram, penalize the
estimation of non-words (see Example 7). However, the average relative error is always under
20% even for Aquaint NYT, which contains the largest percentage of non-words. For Reuters,
errors are almost negligible.
Example 7. Consider the two terms Albuquerque and Unterbauquerträger (the latter being a
German word from the civil engineering domain). They both will be identified as non-words due
to trigram uqu. Non-positional trigram analysis conflates these terms in the uqu bucket. In
consequence, the selectivity of both words is over-estimated as the sum of their individual
selectivities. Positional trigram analysis avoids this by taking the in-word position into account:
Albuquerque belongs to the uqu_3 bucket, while Unterbauquerträger belongs to uqu_7. □
Quartile distributions are shown in Figure 10. Each graph refers to a corpus and plots the
distribution of estimation errors according to each n-gram strategy. We normalize the absolute
frequency to the total number of patterns tested and plot the distribution using a logarithmic scale
for the y-axis. A linear scale would not reveal the differences between non-word detection
strategies, since the number of overestimations and underestimations is always negligible. For
all strategies, far fewer than 10% of the estimates fall outside the central bucket. The worst case is Aquaint NYT with
non-positional trigram analysis. This is due to two reasons: (a) Non-positional trigram analysis
incurs the most errors of all the non-word filtering strategies, and (b) the Aquaint NYT corpus has
the largest average document size, which generally results in a higher error rate (see above).
However, 92% of estimations still fall in the central bucket. Non-positional trigram analysis yields
less accurate results because it conflates more non-words under the same invalid trigrams (see
Example 7); the trigrams alone are then insufficient to identify individual non-words, and estimation errors are more likely. The number of
estimations in the [100, ∞) bucket is interesting: It shows that if overestimations occur, they are
likely to bear a very large error. We think that this effect has to do with the MO algorithm. It
overestimates rare combinations of frequent substrings.
5.3.3 Negative Queries
We test our estimation model with negative patterns, which are strings that are not present in the
indexed collection. The estimation should return selectivities close to zero. In our experiments,
we generate a set of negative strings by randomly introducing errors into corpus words. Table 8
presents the results.
Reuters Aquaint APW
Aquaint XIE Aquaint NYT
FIGURE 10: Quartile Distribution of Estimation Accuracy
For Reuters, the error is under 0.02%. We observe that errors tend to become larger the larger
the documents in the corpus, for the reasons explained above. However, they remain below
0.15% even with non-positional trigram analysis to filter non-words. This is four times less than
the 0.6% worst case reported in [4]. This demonstrates that, even though our model does not
return a selectivity of zero, the error induced is not significant.
Terms SylCST TA SA PTA PSA
Reuters 32,554 2.2 (0.01%) 5.5 (0.02%) 4.2 (0.01%) 4.5 (0.01%) 4.0 (0.01%)
APW 207,616 49.5 (0.02%) 185.6 (0.09%) 126.4 (0.06%) 133.4 (0.06%) 123.6 (0.06%)
XIE 243,932 47.1 (0.02%) 349.7 (0.14%) 211.9 (0.09%) 265.5 (0.11%) 237.0 (0.10%)
TABLE 8: Absolute Error and Percentage of Corpus Size for Negative Queries
CST size (nodes) CST CST-TA CST-PTA SylCST SylCST-TA SylCST-PTA
Reuters
32,000 17.4% (7) 18.1% (6) 17.3% (5) 6.56% (1) 7.48% (1) 6.93% (0)
16,000 17.8% (29) 18.2% (27) 17.3% (23) 6.56% (4) 7.47% (3) 6.92% (2)
8,000 19.7% (109) 19.4% (104) 17.9% (97) 6.56% (14) 7.46% (12) 6.91% (10)
4,000 29.4% (332) 26.1% (323) 21.8% (310) 6.56% (53) 7.42% (50) 6.87% (46)
APW
32,000 12.5% (61) 15.6% (55) 11.1% (44) 8.89% (8) 14.8% (6) 13.4% (4)
16,000 48.3% (214) 42.2% (202) 26.5% (179) 17.7% (31) 21.1% (28) 16.4% (22)
8,000 163.0% (666) 129.0% (645) 80.1% (607) 49.9% (113) 44.6% (106) 30.0% (95)
4,000 452.0% (1726) 352.1% (1688) 227.3% (1635) 148.8% (355) 120.1% (341) 77.2% (319)
XIE
32,000 4.53% (22) 12.5% (18) 14.3% (14) 4.59% (3) 14.0% (2) 16.3% (1)
16,000 19.0% (90) 21.8% (80) 19.1% (68) 7.66% (11) 15.8% (9) 17.0% (6)
8,000 71.6% (313) 57.3% (293) 38.7% (266) 20.7% (44) 24.1% (39) 21.2% (31)
4,000 216.6% (863) 159.1% (828) 100.2% (779) 62.7% (152) 52.7% (144) 37.5% (127)
NYT
32,000 27.6% (122) 27.2% (115) 18.7% (101) 16.1% (16) 21.0% (14) 19.5% (10)
16,000 98.4% (413) 81.3% (396) 50.8% (364) 35.6% (65) 35.7% (61) 27.5% (54)
8,000 303.4% (1220) 242.6% (1196) 152.8% (1142) 99.4% (223) 84.8% (215) 56.5% (199)
4,000 833.5% (3016) 659.3% (2981) 413.6% (2917) 281.2% (663) 228.6% (649) 147.4% (628)
TABLE 9: Average Relative Error and Pruning Threshold by CST Size
5.4 Pruning
Despite all reductions, the Syllable CST for larger corpora still requires too much memory to fit in
the data dictionary; see Table 7 for the exact numbers. One might think that the memory
available to database servers nowadays can easily accommodate the complete CST, and that
limits such as the 1 KB limit from [8], dating from 1996, are obsolete. However, not only has the amount of
physical memory grown since 1996, but also the number of relations and attributes that
database servers must handle. The memory available for the data dictionary has to
accommodate statistics for significantly more attributes. Consequently, the memory available for
an individual statistics data structure has grown less than the physical memory available in total.
In commercial database servers, assuming 1 KB as a limit for the statistics for an individual
attribute is not unrealistic [24]. All this means that we cannot do without pruning. Our
experiments will show that the Syllable CST can be pruned at a lower threshold, compared to the
standard CST, because of its inherently reduced size. As a result, the estimations are
significantly more accurate.
We have pruned the CST and the Syllable CST iteratively to meet the same final size of 4,000
nodes. Table 9 contains the average relative error and the respective pruning threshold for each
tree size. For readability, we restrict this table to the standard CST and the Syllable CST with
standard labeling, and we give only the results for trigram analysis. Appendix A provides the
complete results. The Syllable CST is always more accurate than the corresponding standard
version when the tree is pruned. The atomic node labeling strategy gives an additional slight
advantage.
The SylCST provides good estimation results even with the minimum required size for Reuters:
The average relative error is slightly over 40%. In general, the Syllable CST always gives the
best estimations. This is due to considerably lower pruning thresholds: The figures show that the
value of the pruning threshold decreases by up to 80%, compared to standard CST. This leaves
a more accurate basis for the MO algorithm that computes the estimates for pruned strings: The
relative estimation error is reduced by up to 70%, compared to the technique from [4].
5.5 Noisy Data
Because the Syllable CST relies on linguistic features of the documents, it is susceptible to
misspellings. It is unclear how the Syllable CST performs if the documents contain significantly
more errors than a newswire corpus. Such noisy data occurs in the Blogosphere, for instance. To
assess the behavior of the Syllable CST in the presence of many errors, we run experiments on
documents containing errors. In order to control the error rate at arbitrarily fine granularity and to
measure its effect, we use the same corpora as above and introduce artificial errors. To stress
the algorithm, we describe experiments in which we have introduced random misspellings in
10% of the terms. The misspellings we introduce are equally distributed among removal,
insertion, and replacement of a random character within the term.
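A minimal Java sketch of this error injection, as we understand the described procedure, could look as follows:

import java.util.Random;

// Illustrative sketch: with 10% probability, remove, insert, or replace one
// random character of a term, the three cases being equally likely.
public class NoiseInjector {

    static final Random RND = new Random();

    static String maybeMisspell(String term) {
        if (RND.nextDouble() >= 0.10 || term.isEmpty()) return term;
        int pos = RND.nextInt(term.length());
        char c = (char) ('a' + RND.nextInt(26));
        switch (RND.nextInt(3)) {
            case 0:  return term.substring(0, pos) + term.substring(pos + 1); // removal
            case 1:  return term.substring(0, pos) + c + term.substring(pos); // insertion
            default: return term.substring(0, pos) + c + term.substring(pos + 1); // replacement
        }
    }
}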
The standard CST turns out to be impossible to build over the complete Aquaint corpora with so many
misspellings: The CST grows beyond the available memory due to the suffixes caused by the misspellings. In
particular, the standard CST without non-word filtering grows so large that it exceeds the
memory limits of a JVM running with 1.5 GB. This is by far more than a database server can
allocate for building statistics data structures. The use of another programming language, e.g.,
C++ instead of Java, and a highly memory-optimized CST implementation might mitigate this
problem, but will not do away with it. Thus, for this series of experiments, we only use the first
50,000 documents of each of our test corpora. Table 10 shows the results of a comparison
between the standard CST and the Syllable CST with standard node labeling, each without non-
word filtering and with non-positional and positional trigram analysis.
The numbers show that, as one might expect, the benefit of n-gram analysis is very high when
the data contains many misspellings: Positional trigram analysis reduces the size of the un-
pruned CST by about 65%, i.e., 65% less memory is required to build the CST. Further, the
results show that misspellings do not affect the benefit of syllabification to a significant degree:
Syllabification still reduces the average relative error of selectivity estimates by about 50% in
most of the cases. Only the error of the 4,000-node SylCST for the XIE corpus is of the same
order of magnitude as that of the corresponding standard CST. Furthermore, n-gram analysis affects the
average relative estimation error only negligibly.
In general, the average relative estimation error shows the same tendencies as for the complete
Aquaint corpora without artificial misspellings: (a) Both syllabification and non-word filtering
reduce the size of the un-pruned CST, (b) syllabification reduces the pruning threshold for the
pruned CST, (c) due to (b), syllabification improves selectivity-estimation accuracy significantly,
(d) non-word filtering can incur some estimation errors, but does not decrease overall accuracy
by much. However, the advantage of the Syllable CST over the standard CST is higher for the
16,000 and 8,000 node trees, and lower for the 4,000 node tree. We attribute this effect to local
skews in the distributions of the document frequencies of the terms. With a larger number of
documents, as in the previous experiments, those skews level out and yield a rather predictable
development of the relative estimation error. Conversely, these skews have a more significant
effect in this current experiment, due to the considerably smaller number of documents.
Corpus CST Type Un-pruned CST Size Average Relative Error (Pruning Threshold) at CST Size:
16,000 8,000 4,000
APW
CST 1710729 137.9% (1004) 219.3% (2405) 392.3% (4847)
CST-TA 921349 125.6% (904) 186.7% (2231) 303.9% (4607)
CST-PTA 491327 113.7% (795) 174.5% (2058) 263.0% (4389)
SylCST 807641 28.4% (273) 65.8% (774) 201.7% (1887)
SylCST-TA 449642 28.0% (219) 64.5% (677) 171.8% (1753)
SylCST-PTA 227461 26.4% (156) 68.9% (572) 189.0% (1598)
XIE
CST 1065209 146.1% (522) 237.0% (1355) 419.1% (2897)
CST-TA 586189 134.1% (464) 193.9% (1277) 310.3% (2759)
CST-PTA 322911 127.5% (385) 178.2% (1169) 274.4% (2648)
SylCST 493905 34.1% (116) 126.5% (378) 401.4% (1013)
SylCST-TA 282024 26.5% (87) 86.3% (320) 225.1% (947)
SylCST-PTA 148285 26.9% (56) 90.6% (252) 205.9% (854)
NYT
CST 2649161 125.2% (1925) 203.9% (4462) 362.5% (8482)
CST-TA 1436417 118.4% (1799) 182.9% (4282) 310.7% (8212)
CST-PTA 762014 109.2% (1655) 167.9% (4079) 283.2% (7908)
SylCST 1261474 30.8% (558) 51.8% (1534) 115.2% (3688)
SylCST-TA 704858 34.0% (486) 57.5% (1408) 130.7% (3505)
SylCST-PTA 354332 34.4% (390) 54.6% (1272) 143.9% (3356)
TABLE 10: Experiment Results with Noisy Data
6. CONCLUSIONS
Estimating the selectivity of query terms is essential for query optimization and in other contexts.
The estimates have to be available before the actual query processing and need to be based on
small summary statistics. The memory limitations result from the need to permanently hold the
statistics used for query optimization in physical main memory. If query optimization caused only
a single page fault (i.e., the need to swap a memory page from on-disk virtual memory back into
physical main memory), this would annihilate the performance advantage a database system
gains from optimizing query execution.
Selectivity estimation for string predicates frequently relies on Count Suffix Trees (CST) [4, 7, 8].
While they provide good estimates, their storage requirements are prohibitively high. Pruning
tries to solve this problem, by trading estimation accuracy for reduced memory needs. So far,
pruning strategies are mostly based on frequency and tree depth. In this paper, we have
proposed new techniques that reduce the size of CST over natural-language texts. We exclude
suffixes that do not make sense from a linguistic point of view, regardless of their frequency.
Syllabification has proven to be a suitable tool for generating suffixes that carry an enhanced
semantic message, compared to letter-wise suffixes. A more aggressive stemming routine lets
us further reduce the CST size, without affecting the quality of selectivity estimates by much.
Further, a very concise n-gram data structure allows (a) for filtering out non-words during CST
construction already, and (b) for estimating their selectivity precisely.
The various filtering techniques described here are mutually independent. They are applicable to
other languages as well, provided that there is a stemming procedure, a syllabification routine, or
a dictionary of the language for the n-gram filtering. Since all the filtering takes place during CST
construction, significantly less memory is required to build the CST. The combination of these
approaches, together with a new node labeling strategy, yields a much more compact CST: For
English text, estimation accuracy is the same as with a classical CST, with only 20-30% of the
nodes. From another perspective, with the same number of nodes, the new techniques reduce
the average estimation error by up to 70%.
7. REFERENCES
[1] J. Bae and S. Lee. “Substring count estimation in extremely long strings”. IEICE
Transactions on Information and Systems, E89-D(3):1148–1156, 2006.
[2] R. Baeza-Yates and B. Ribeiro-Neto. “Modern information retrieval”. Addison-Wesley
Longman, 1st edition, 1999.
[3] S. Bressan and R. Irawan. “Morphologic non-word error detection”. In Proceedings of the
15th International Workshop on Database and Expert Systems Applications (DEXA ’04),
pages 31–35, 2004.
[4] S. Chaudhuri, V. Ganti, and L. Gravano. “Selectivity estimation for string predicates:
Overcoming the underestimation problem”. In Proceedings of ICDE 2004, Boston, MA,
USA, 2004.
[5] D. W. Cummings. “American English spelling: an informal description”. Johns Hopkins
University Press, 1988.
[6] D. Graff. “The Aquaint Corpus of English News Text”. Linguistic Data Consortium,
Philadelphia, 2002.
[7] H. Jagadish, O. Kapitskaia, and D. Srivastava. “One-dimensional and multi-dimensional
substring selectivity estimation”. The International Journal on Very Large Data Bases, 9(3):
214–230, 2000.
[8] P. Krishnan, J. S. Vitter, and B. Iyer. “Estimating alphanumeric selectivity in the presence
of wildcards”. In ACM SIGMOD International Conference on Management of Data, pages
12–13. ACM, 1996.
[9] R. Krovetz. “Viewing morphology as an inference process”. In Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 191–203, 1993.
[10] K. Kukich. “Techniques for automatically correcting words in text”. ACM Computer
Surveys, 24:379–439, 1992.
[11] M. Lennon, D. Pierce, B. Tarry, and P. Willett. “An evaluation of some conflation
algorithms for information retrieval”. Journal of Information Science, 3:177–183, 1981.
[12] D. D. Lewis. Reuters-21578 [online] Available at:
http://www.daviddlewis.com/resources/testcollections/reuters21578/.
[13] F. M. Liang. “Word hy-phen-a-tion by com-put-er”. PhD thesis, Stanford University,
Stanford, August 1983.
[14] J. B. Lovins. “Development of a stemming algorithm”. Mechanical Translation and
Computational Linguistics, 11:22–31, 1968.
[15] E. M. McCreight. ”A space-economical suffix tree construction algorithm”. J. Assoc.
Comput. Mach., 23(2): 262–272, 1976.
[16] M. F. Porter. ”An algorithm for suffix stripping”. Program, 14(3):130–137, 1980.
[17] Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. “Practical methods for constructing suffix
trees”. The International Journal on Very Large Data Bases, 14(3): 281–299, 2005.
[18] E. Ukkonen. “On-line construction of suffix trees”. Algorithmica, 14(3): 249–260, 1995.
[19] P. Weiner. “Linear pattern matching algorithms”. In Proceedings of the 14th Annual
Symposium on Switching and Automata Theory, pages 1–11, 1973.
[20] E. M. Zamora, J. Pollock, and A. Zamora. “The use of trigram analysis for spelling error
detection”. Information Processing and Management, 17:305–316, 1981.
[21] Z. Chen, F. Korn, N. Koudas, S. Muthukrishnan. “Selectivity Estimation for Boolean
Queries”. In Proceedings of PODS 2000, Dallas, TX, USA, 2000
[22] R. Giegerich, S. Kurtz, J. Stoye. “Efficient Implementation of Lazy Suffix Trees”. Software:
Practice and Experience, Volume 33, No 11, John Wiley & Sons Ltd., 2003
[23] G. Sautter, C. Abba, K. Böhm. “Improved Count Suffix Trees for Natural Language Data”.
In Proceedings of IDEAS 2008, Coimbra, Portugal, 2008
[24] Personal communication with Torsten Grabs, Microsoft SQL Server development team.