This document discusses query optimization techniques to improve the performance of object querying in Java. It presents the Java Query Language (JQL), which allows programmers to express queries over object collections in Java through a declarative syntax. The key aspects of the JQL implementation are a compiler that compiles JQL queries to Java code and a query evaluator that applies optimizations such as hash joins and nested-loop joins to evaluate the queries efficiently.
Computer Engineering and Intelligent Systems                                www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol. 3, No. 1, 2012

Query Optimization to Improve Performance of the Code Execution
Swati Tawalare* and S. S. Dhande
Dept. of CSE, SIPNA’s College of Engineering & Technology, Amravati, INDIA
* E-mail of the corresponding author: swatitawalare18@rediffmail.com
Abstract
Object-Oriented Programming (OOP) is one of the most successful techniques for abstraction. Bundling objects together into collections, and then operating on these collections, is a fundamental part of mainstream object-oriented programming languages. Object querying is an abstraction of such operations over collections, whereas manual implementations work at a low level, forcing developers to specify how a task must be done. Some object-oriented languages allow programmers to express queries explicitly in the code; these queries are optimized using query optimization techniques from the database domain. In this regard, we have developed a technique that performs query optimization at compile time, reducing the burden of optimization at run time and improving the performance of code execution.
Keywords: querying; joins; compile time; run time; histograms; query optimization
Introduction
Query processing is the sequence of actions that takes as input a query formulated in the user language and delivers as its result the data asked for. Query processing involves query transformation and query execution. Query transformation is the mapping of queries and query results back and forth through the different levels of the DBMS. Query execution is the actual data retrieval according to some access plan, i.e. a sequence of operations in the physical access language. An important task in query processing is query optimization. Usually, user languages are high-level, declarative languages that allow users to state what data should be retrieved, not how to retrieve it. For each user query, many different execution plans exist, each having its own associated costs. The task of query optimization is, ideally, to find the best execution plan, i.e. the execution plan that costs the least according to some performance measure. Usually, one has to accept merely feasible execution plans, because the number of semantically equivalent plans is too large to allow for enumerative search. A query is an expression that describes the information that one wants to search for in a database.
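As a toy illustration (not from the paper; all numbers are invented), an optimizer can compare two join orders for a three-way join purely by their estimated intermediate result sizes:

    // Toy comparison of two join orders for A join B join C (illustrative sketch;
    // the cardinalities and selectivities are invented for the example).
    public class JoinOrderToy {
        public static void main(String[] args) {
            long a = 1_000, b = 10_000, c = 100;   // relation cardinalities
            double selAB = 0.001, selBC = 0.0001;  // estimated join selectivities
            double abFirst = a * b * selAB;        // |A join B| estimated at 10,000 rows
            double bcFirst = b * c * selBC;        // |B join C| estimated at 100 rows
            // Pick the order with the smaller estimated intermediate result.
            System.out.println(abFirst <= bcFirst ? "(A join B) first" : "(B join C) first");
        }
    }

Here the second order wins by two orders of magnitude; a bad size estimate would flip the decision, which is why the estimation accuracy discussed next matters.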
For example, query optimizers select the most efficient access plan for a query based on the estimated costs of competing plans. These costs are in turn based on estimates of intermediate result sizes. Sophisticated user interfaces also use estimates of result sizes as feedback to users before a query is actually executed. Such feedback helps to detect errors in queries or misconceptions about the database. Query result sizes are usually estimated using a variety of statistics that are maintained for relations in the database. These statistics merely approximate the distribution of data values in the attributes of the relations. Consequently, they represent an inaccurate picture of the actual contents of the database. The resulting size-estimation errors may undermine the validity of the optimizer's decisions or render the user interface application unreliable. Earlier work has shown that errors in query result size estimates may increase exponentially with the number of joins. This result, in conjunction with the increasing complexity of queries, demonstrates the critical importance of accurate estimation. Several techniques have been proposed in the literature to estimate query result sizes, including histograms, sampling, and parametric techniques. Of these, histograms approximate the frequency distribution of an attribute by grouping attribute values into
“buckets” (subsets) and approximating the true attribute values and their frequencies from summary statistics maintained in each bucket. The main advantages of histograms over other techniques are that they incur almost no run-time overhead, they do not require the data to fit a probability distribution or a polynomial and, for most real-world databases, there exist histograms that produce low-error estimates while occupying reasonably small space. Although histograms are used in many systems, the histograms proposed in earlier works are not always effective or practical. For example, equi-depth histograms work well for range queries only when the data distribution has low skew, while serial histograms have only been proven optimal for equality joins and selections when a list of all the attribute values in each bucket is maintained.
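As a minimal sketch of the idea (not from the paper; the class name and bucket policy are illustrative), an equi-width histogram can estimate the result size of an equality selection from per-bucket counts alone:

    // Illustrative equi-width histogram for result-size estimation.
    class EquiWidthHistogram {
        private final int min, bucketWidth;
        private final long[] counts;                       // row frequency per bucket

        EquiWidthHistogram(int min, int max, int numBuckets, int[] data) {
            this.min = min;
            this.bucketWidth = Math.max(1, (max - min + 1) / numBuckets);
            this.counts = new long[numBuckets];
            for (int v : data)                             // one pass to build the summary
                counts[Math.min((v - min) / bucketWidth, numBuckets - 1)]++;
        }

        // Estimated number of rows with attribute == value, under the usual
        // assumption that values are spread uniformly within a bucket.
        double estimateEquality(int value) {
            int b = Math.min((value - min) / bucketWidth, counts.length - 1);
            return (double) counts[b] / bucketWidth;
        }
    }

The estimate costs a constant-time lookup at run time, which is the "almost no run-time overhead" property noted above; its error grows with the skew inside a bucket, echoing the skew sensitivity just discussed.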
Implementing operations over collections of objects with conventional techniques severely lacks abstraction. Step-by-step instructions must be provided as to how to iterate over the collection, select elements, and operate on them. Manually specifying the implementation of these operations fixes how they are to be evaluated, making it impossible to accommodate changes in the state of the program that could make another approach superior. This is especially problematic when combining two collections of objects, frequently an expensive operation. Object querying is a way to select an object, or set of objects, from a collection or set of collections.
JQL is an addition to Java that provides the capability to query collections of objects. These queries can be applied to objects in the program's collections, or used to check expressions on all instances of specific types at run time. Queries hand the implementation details over to the query engine: by providing abstractions for handling sets of objects, they make the code smaller and permit the query evaluator to choose optimization approaches dynamically, even as the situation changes at run time. Equivalent Java code and a JQL query give the same set of results, but the JQL code is elegant, brief, and abstracts away the exact method of finding the matches; the evaluator achieves this by generating dynamic join-ordering strategies. JQL was designed as an extension to Java, and a prototype JQL system was implemented, in order to investigate the performance and usability characteristics of object querying. Queries can be evaluated over one collection of objects, or over many collections, allowing the inspection of relationships between objects in these collections. In short, the Java Query Language provides first-class object querying for Java.
For example, consider a program that models a university with separate Student and Faculty classes. A
difficulty with this decomposition is representing students who are also teachers. One solution is to keep
separate Student and Faculty objects that are related by name.
The following code can then be used to identify students who are teachers:
List<Tuple2<Faculty, Student>> matches = new ArrayList<Tuple2<Faculty, Student>>();
for (Faculty f : allFaculty) {
    for (Student s : allStudents) {
        if (s.name.equals(f.name)) {
            matches.add(new Tuple2<Faculty, Student>(f, s));
        }
    }
}
In database terms, this code is performing a join on the name field for the allFaculty and allStudent
collections. The code is cumbersome and can be replaced with the following object query, which is more
succinct and, potentially, more efficient:
List<Tuple2<Faculty, Student>> matches;
matches = selectAll(Faculty f=allFaculty, Student s=allStudents : f.name.equals(s.name));
This gives exactly the same set of results as the loop code. The selectAll primitive returns a list of tuples
containing all possible instantiations of the domain variables (i.e. those declared before the colon) where
the query expression holds (i.e. that after the colon). The domain variables determine the set of objects
which the query ranges over: they can be initialized from a collection (as above); or, left uninitialized to
range over the entire extent set (i.e. the set of all instantiated objects) of their type. Queries can define as
many domain variables as necessary and can make use of the usual array of expression constructs found in
Java. One difference from normal Java expressions is that Boolean operators, such as && and ||, do not
imply an order of execution for their operands. This allows flexibility in the order they are evaluated,
potentially leading to greater efficiency. As well as its simplicity, there are other advantages to using this
query in place of the loop code. The query evaluator can apply well-known optimizations which the
programmer might have missed. By leaving the decision of which optimization to apply until runtime, it
can make a more informed decision based upon dynamic properties of the data itself (such as the relative
size of input sets), something that is, at best, difficult for a programmer to do. A good example, which
applies in this case, is the so-called hash join. The idea is to avoid enumerating all of allFaculty ×
allStudents when there are few matches. A hash map is constructed from the larger of the two collections,
mapping the value being joined upon (in this case name) back to its objects. This still requires O(sf)
time in the worst case, where s = |allStudents| and f = |allFaculty|, but in practice it is likely to be linear in
the number of matches (in contrast with a nested loop, which always takes O(sf) time).
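To make the strategy concrete, the following is a minimal sketch of such a hash join. It assumes the
Faculty, Student and Tuple2 classes from the example above (plus the usual java.util imports) and is
illustrative only, not the JQL evaluator's actual code:

// Build a map from the larger collection, keyed on the joined value (name).
Map<String, List<Faculty>> byName = new HashMap<String, List<Faculty>>();
for (Faculty f : allFaculty) {
    List<Faculty> bucket = byName.get(f.name);
    if (bucket == null) {
        bucket = new ArrayList<Faculty>();
        byName.put(f.name, bucket);
    }
    bucket.add(f);
}
// Probe the map with the other collection; only matching pairs are enumerated.
List<Tuple2<Faculty, Student>> matches = new ArrayList<Tuple2<Faculty, Student>>();
for (Student s : allStudents) {
    List<Faculty> bucket = byName.get(s.name);
    if (bucket != null) {
        for (Faculty f : bucket) {
            matches.add(new Tuple2<Faculty, Student>(f, s));
        }
    }
}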
Implementation
We have prototyped a system, called the Java Query Language (JQL), which permits queries over object
extents and collections in Java. The implementation consists of three main components: a compiler, a query
evaluator and a runtime system for tracking all active objects in the program. The latter enables the query
evaluator to range over the extent sets of all classes. Our purpose in doing this is twofold: firstly, to assess
the performance impact of such a system; secondly, to provide a platform for experimenting with the idea
of using queries as first-class language constructs.
JQL Query Evaluator
The core component of the JQL system is the query evaluator. This is responsible for applying whatever
optimizations it can to evaluate queries efficiently. The evaluator is called at runtime with a tree
representation of the query (called the query tree). The tree itself is either constructed by the JQL Compiler
(for static queries) or by the user (for dynamic queries).
Evaluation Pipeline
The JQL evaluator evaluates a query by pushing tuples through a staged pipeline. Each stage, known as a
join in the language of databases, corresponds to a condition in the query. Only tuples matching a join’s
condition are allowed to pass through to the next stage. Those tuples which make it through to the end are added
to the result set. Each join accepts two lists of tuples, L(eft) and R(ight), and combines them together
producing a single list. We enforce the restriction that, for each intermediate join, either both inputs come
from the previous stage or one comes directly from an input collection and the other comes from the
previous stage. This is known as a linear processing tree and it simplifies the query evaluator, although it
can lead to inefficiency in some cases. The way in which a join combines tuples depends upon its type and
the operation (e.g. ==, <, etc.) it represents. JQL currently supports two join types:
nested-loop join and hash join. A nested-loop join is a two-level nested loop which iterates over every pair
in L × R and checks for a match. A hash join builds a temporary hash table which it uses to check for
matches; this provides the best performance, but can be used only when the join operator is == or equals().
Future implementations may take advantage of B-trees for scanning sorted ranges of a collection.
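As an illustration of what a pipeline stage does, here is a hedged sketch of a nested-loop join stage. The
tuple representation (an Object array) and the predicate interface are hypothetical simplifications, not the
actual JQL classes:

import java.util.ArrayList;
import java.util.List;

// A tuple is modeled as an Object[]; the stage's condition is a predicate
// over the combined tuple. Both are illustrative stand-ins.
interface TuplePredicate {
    boolean holds(Object[] tuple);
}

class NestedLoopJoinStage {
    // Combine every pair from L and R, keeping combined tuples that
    // satisfy this stage's condition.
    static List<Object[]> join(List<Object[]> left, List<Object[]> right, TuplePredicate cond) {
        List<Object[]> out = new ArrayList<Object[]>();
        for (Object[] l : left) {
            for (Object[] r : right) {
                Object[] combined = new Object[l.length + r.length];
                System.arraycopy(l, 0, combined, 0, l.length);
                System.arraycopy(r, 0, combined, l.length, r.length);
                if (cond.holds(combined)) {
                    out.add(combined);
                }
            }
        }
        return out;
    }
}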
JQL Implementation
Our implementation of JQL has two main components. The first is a frontend, which compiles JQL
expressions into Java code. The second component is the query evaluation back-end, which is responsible
for evaluating the query and choosing the best evaluation strategy it can.
JQL Compiler
The JQL compiler is relatively straightforward — it compiles JQL expressions into equivalent Java code,
which can then be compiled by a normal Java compiler. This Java code binds domain variables to
collections, and builds a tree representation of the query expression, which the evaluator uses at runtime.
The compiler ‘inlines’ queries which only use one domain variable; they are compiled into normal loops,
but with hooks to allow for query caching. Inlined queries cannot have their stages dynamically ordered,
but dynamic ordering is of almost no importance to single-variable queries (whilst different orderings may
‘short-circuit’ evaluation for individual objects more quickly, the gain is minute compared to the
performance gain of inlining the query). Using the compiler also allows for checking of the
well-formedness of the query — for example, that query expressions only use domain variables that are
declared for that query.
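As a hedged illustration (the code actually emitted by the compiler may differ), a single-variable query
such as selectAll(Student s=allStudents : s.name.equals("Smith")) could be inlined to a plain loop:

List<Student> results = new ArrayList<Student>();
for (Student s : allStudents) {
    // Single-stage query: the expression is evaluated directly per object.
    if (s.name.equals("Smith")) {
        results.add(s);
    }
}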
Query Evaluator
The core of query evaluation is carried out through the query's pipeline. A query pipeline consists
of a series of stages, each corresponding to a part of the query expression. Each stage can take either one or
two input sets of tuples, and for all stages except the very first, one of these input sets is the result set of the
previous stage in the pipeline. This structure is known, in the database community, as a ‘left-branching tree’,
and we impose it to simplify the problem of ordering stages. In each stage, if there is more than one input
set, the input tuples are combined (we describe the join techniques used for this combination shortly), and
the stage’s expression is evaluated using the values in the combined tuple in the place of their
corresponding domain variables. If the expression evaluates to true, the combined tuple is added to the
stage’s result set. The set of tuples which pass the final stage are returned to the program as the results of
the query.
Literature Review & Related Work
The Program Query Language (PQL) (S. Chiba 1995) is a similar system which allows the programmer to
express queries capturing erroneous behavior over the program trace. A key difference from other systems
is that static analysis was used in an effort to answer some queries without needing to run the program; as a
fallback, queries which could not be resolved statically are compiled into the program's byte code and
checked at runtime. Hobatr and Malloy (C. Hobatr & B. A. Malloy 2001) present a query-based debugger
for C++ that uses the OpenC++ Meta-Object Protocol and the Object Constraint Language (OCL) (Ihab F.
Ilyas et al. 2003). This system consists of a frontend for compiling OCL queries to C++, and a backend that
uses OpenC++ to generate the instrumentation code necessary for evaluating the queries. Related to JQL is
the recent development of Microsoft's Language Integrated Query (LINQ) project, which aims to add
first-class querying support to .NET languages (in particular Visual Basic .NET and C#)
(Y.E. Ioannidis, R. Ng, K. Shim & T.K. Selis 1992). LINQ operates by translating queries into additional
methods on collections of objects, which then perform filtering and mapping on the collection. LINQ's
scope is wider than JQL's, providing integrated querying for object collections, XML structures and SQL
databases. It is unclear at present what optimizations LINQ provides for object querying, or whether it
provides incrementalized caching. Finally, there are a number of APIs available for Java which provide
access to SQL databases. These are quite different from the approach we have presented in this work, as
they do not support querying over collections and/or object extents. Furthermore, they do not perform any
query optimizations, instead relying on the database back end to do this (Darren Willis, David J. Pearce &
James Noble 2008). Our main contribution is the design and implementation of JQL, a prototype object
querying system that allows developers to easily write object queries in a Java-like language. This
prototype provides automatic optimization of operations that join multiple collections together.
Analysis of Problem
In our research, we prefer static query optimization at compile-time over dynamic query optimization
because it reduces the query's run-time. The existing JQL evaluator estimates selectivity by sampling
some number of tuples, but this does not lead to an efficient ordering of joins and predicates in a query.
Therefore, we propose estimating the selectivities of the joins and predicates from histograms, which
gives us an efficient ordering of joins and predicates in a query. Once we collect this information, we can
form the query plan from the order of joins and predicates. After we get the query plan at
compile-time, we execute that plan at run-time to reduce the execution time. Experimental results indicate
that our approach yields run-times lower than those of the existing JQL code, because we optimize the
query and handle data updates using histograms.
Given a query Q,
1. We use the histogram H to estimate the selectivities of the query predicates and of the joins.
2. From these estimates we obtain the join order and predicate order, which are used to construct the query plan.
A. The first execution of a query Q uses the histogram H to estimate the selectivity, and the result of
the query is computed. For a subsequent execution of the same query Q after a time T, however, the same
histogram H may have become invalid, because the underlying data may have been updated between the
first and second executions of the query.
B. We first check whether the query is present in the query log. If it is, the time elapsed between
consecutive executions of Q is computed from the log; if that value is greater than a pre-specified time
interval, we recompute the histogram directly, on the assumption that the data may have been modified
within that interval.
C. Otherwise, we compute the error through the error-estimate function and, based on that estimate, we
decide whether or not to recompute the histogram. If the query is not present in the log, we execute the
query using the initial histogram, which avoids the overhead cost of incremental histogram maintenance.
Our experimental results compare the run-times of our approach and the JQL approach on all four
benchmark queries. The difference in run-times arises because, in our approach, selectivities are estimated
using histograms that are incrementally maintained; the resulting join order is, most of the time, faster than
the exhaustive join-order strategy used by JQL. As the number of joins in a query increases, JQL's
approach becomes more expensive, and our approach performs much better than its exhaustive
join-ordering strategy.
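The following is a minimal sketch of the refresh policy described in steps A-C, assuming an in-memory
query log and externally supplied thresholds; all names are illustrative rather than our actual
implementation:

import java.util.HashMap;
import java.util.Map;

class HistogramRefreshPolicy {
    private final Map<String, Long> queryLog = new HashMap<String, Long>();
    private final long maxAgeMillis; // the pre-specified time interval
    private final double maxError;   // bound on the error estimate of equation (1)

    HistogramRefreshPolicy(long maxAgeMillis, double maxError) {
        this.maxAgeMillis = maxAgeMillis;
        this.maxError = maxError;
    }

    // Returns true if the histogram should be recomputed before running the query.
    boolean shouldRecompute(String query, double estimatedError, long now) {
        Long lastRun = queryLog.put(query, now);
        if (lastRun == null) {
            return false;                  // query not in the log: use the initial histogram
        }
        if (now - lastRun > maxAgeMillis) {
            return true;                   // data assumed modified: recompute directly
        }
        return estimatedError > maxError;  // otherwise decide via the error estimate
    }
}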
Proposed Work
We form the query plan from the order of joins and predicates in a query. After we obtain the query
plan at compile time, we execute that plan at run-time to reduce the execution time. Experimental results
indicate that our approach yields run-times lower than those of the existing JQL code, because we optimize
the query and handle data updates using histograms. We intend to have the query
plans generated at compile time. A query plan is a step-by-step ordered procedure describing the order in
which the query predicates need to be executed; thus, at run-time, the time required for plan construction
is omitted. The code must therefore work in static mode: without knowing the inputs at
compile-time, we need to be able to derive some information about them, such as the sizes of relations, by
estimating them to generate the query plan. Given a join query, its selectivity needs to be estimated to
design plans.
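As a hedged illustration of plan construction, stages can simply be sorted by estimated selectivity, most
selective first; the Stage class here is a hypothetical stand-in, not our actual plan representation:

import java.util.Comparator;
import java.util.List;

class Stage {
    String predicate;    // e.g. "f.name.equals(s.name)"
    double selectivity;  // estimated from the histogram
    Stage(String predicate, double selectivity) {
        this.predicate = predicate;
        this.selectivity = selectivity;
    }
}

class PlanBuilder {
    // Order joins and predicates so the most selective stages run first,
    // shrinking intermediate results as early as possible.
    static void orderBySelectivity(List<Stage> stages) {
        stages.sort(Comparator.comparingDouble((Stage s) -> s.selectivity));
    }
}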
A histogram is one of the basic quality tools. It is used to graphically summarize and
display the distribution and variation of a data set. A frequency distribution shows how often each
different value in a set of data occurs. The main purpose of a histogram is to clarify the presentation of
data: the same information can be presented in a table, but the graphic format usually makes it easier to see
relationships, and it is a useful tool for breaking data into regions or bins in order to determine the
frequencies of certain events or categories of data. The approach followed for maintenance in
nearly all commercial systems is to recompute histograms periodically (e.g., every night), regardless of the
number of updates performed on the database. This approach has two disadvantages. First, any significant
updates to the data since the last recomputation could result in poor estimations by the optimizer. Second,
because the histograms are recomputed from scratch by discarding the old histograms, the recomputation
phase for the entire database can be computationally very intensive and may have to be performed when the
system is lightly loaded.
4.1 The Split & Merge Algorithm
The split and merge algorithm helps reduce the cost of building and maintaining histograms for large tables.
The algorithm is as follows. When a bucket count reaches the threshold T, we split the bucket into two
halves instead of recomputing the entire histogram from the data. To keep the number of buckets (β)
fixed, we merge the two adjacent buckets whose total count is least and does not exceed the threshold T,
if such a pair of buckets can be found. Only when a merge is not possible do we recompute the histogram
from the data. Merging two adjacent buckets merely involves adding the counts of the two
buckets and disposing of the boundary between them. To split a bucket, an approximate median value in
the bucket, obtained from the backing sample, is selected to serve as the boundary between the two new
buckets. As new tuples are added, we increment the counts of the appropriate buckets. When a count
exceeds the threshold T, either the entire histogram is recomputed or, using split and merge, we split and
merge the buckets: we iterate through the list of buckets, split every bucket which exceeds the threshold,
and then try to merge the adjacent pair of buckets whose combined count is least and does not exceed the
threshold. If we fail to find any pair of buckets to merge, we recompute the histogram from the data;
otherwise we return the new set of buckets. This resolves the problem of incrementally maintaining the
histograms. Having estimated the selectivities of the joins and predicates, we obtain the join and predicate
ordering at compile-time; a sketch of the maintenance step follows.
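Below is a hedged sketch of the split-and-merge maintenance step under simplifying assumptions: buckets
are numeric ranges, and the range midpoint stands in for the approximate median drawn from the backing
sample. The class names are illustrative:

import java.util.ArrayList;
import java.util.List;

class Bucket {
    double lo, hi;   // value range covered by the bucket
    long count;      // number of tuples falling in the range
    Bucket(double lo, double hi, long count) { this.lo = lo; this.hi = hi; this.count = count; }
}

class SplitMergeHistogram {
    // Split overfull buckets, then merge the cheapest adjacent pair to keep
    // the bucket count fixed. Returns null when no merge is possible,
    // signalling that the histogram must be recomputed from the data.
    static List<Bucket> splitAndMerge(List<Bucket> buckets, long T) {
        List<Bucket> out = new ArrayList<Bucket>();
        for (Bucket b : buckets) {
            if (b.count >= T) {
                double median = (b.lo + b.hi) / 2; // stand-in for the backing-sample median
                out.add(new Bucket(b.lo, median, b.count / 2));
                out.add(new Bucket(median, b.hi, b.count - b.count / 2));
            } else {
                out.add(b);
            }
        }
        // One merge per extra bucket produced by splitting.
        while (out.size() > buckets.size()) {
            int best = -1;
            long bestCount = T + 1; // only pairs whose total does not exceed T qualify
            for (int i = 0; i + 1 < out.size(); i++) {
                long c = out.get(i).count + out.get(i + 1).count;
                if (c < bestCount) { bestCount = c; best = i; }
            }
            if (best < 0) return null; // no merge possible: recompute from data
            Bucket left = out.get(best);
            Bucket right = out.get(best + 1);
            out.set(best, new Bucket(left.lo, right.hi, left.count + right.count));
            out.remove(best + 1);
        }
        return out;
    }
}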
4.2 Incremental Maintenance of Histograms
We propose an incremental technique which maintains approximate histograms within specified error
bounds at all times with high probability, and never accesses the underlying relations for this purpose.
There are two components to our incremental approach: (i) maintaining a backing sample, and (ii) a
framework for maintaining an approximate histogram that performs a few program instructions in response
to each update to the database, and detects when the histogram is in need of an adjustment of one or more
of its bucket boundaries. Such adjustments make use of the backing sample. There is a fundamental
distinction between the backing sample and the histogram it supports: the histogram is accessed more
frequently than the sample and uses less memory, and hence it can be stored in main memory while the
sample is likely stored on disk. A backing sample is a uniform random sample of the tuples in a relation
that is kept up to date in the presence of updates to the relation. For each tuple, the sample contains the
unique row id and one or more attribute values. We argue that maintaining a backing sample is useful for
histogram computation, selectivity estimation, etc. In most sampling-based estimation techniques,
whenever a sample of size n is needed, either the entire relation is scanned to extract the sample, or several
random disk blocks are read. In the latter case, the tuples in a disk block may be highly correlated, and
hence to obtain a truly random sample, n disk blocks must be read. In contrast, a backing sample can be
stored in consecutive disk blocks, and can therefore be scanned by reading sequential disk blocks.
Moreover, for each tuple in the sample, only the unique row id and the attribute(s) of interest are retained.
Thus the entire sample can be stored in only a small number of disk blocks, for even faster retrieval.
Finally, an indexing structure for the sample can be created, maintained and stored; the index enables quick
access to sample values within any desired range.
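A standard way to keep such a sample uniform under inserts is reservoir sampling; the following hedged
sketch assumes the row id and the attribute of interest can both be represented as longs, and is not our
actual storage layout:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class BackingSample {
    // Each entry holds {rowId, attributeValue}.
    private final List<long[]> sample = new ArrayList<long[]>();
    private final int capacity;     // target sample size
    private final Random rng = new Random();
    private long tuplesSeen = 0;

    BackingSample(int capacity) {
        this.capacity = capacity;
    }

    // Called once per inserted tuple; keeps the sample uniformly random
    // over all tuples seen so far (classic reservoir sampling).
    void onInsert(long rowId, long attrValue) {
        tuplesSeen++;
        if (sample.size() < capacity) {
            sample.add(new long[] { rowId, attrValue });
        } else {
            long j = (long) (rng.nextDouble() * tuplesSeen);
            if (j < capacity) {
                sample.set((int) j, new long[] { rowId, attrValue });
            }
        }
    }
}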
4.3 Estimating Selectivity Using Histogram
The selectivity of a predicate in a query is a decisive aspect for a query plan generation. The ordering of
predicates can considerably affect the time needed to process a join query. To have the query plan ready at
compile-time, we need to have the selectivities of all the query predicates. To calculate these selectivities,
we use histograms. The histograms are built using the number of times an object is accessed. For this, we
partition the domain of the predicate into intervals called windows. With the help of past queries, the
selectivity of a predicate is derived with respect to its window. That is, if a table T has 100,000 rows, a
query contains a selection predicate of the form T.a = 10, and a histogram shows that the selectivity of T.a = 10
is 10%, then the cardinality estimate for the fraction of rows of T that must be considered by the query is
10% × 100,000 = 10,000. This histogram approach helps us estimate the selectivity of a join and hence
decide on the order in which the joins have to be executed. So, we get the join ordering and the predicate
ordering in the query expression at compile-time itself. Thus, from this available information, we can
construct a query plan.
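The calculation is straightforward; a hedged sketch, assuming an equi-width histogram over a numeric
attribute (names are illustrative), follows:

class SelectivityEstimator {
    // Estimate the number of rows matching T.a = v. bucketCounts, lo and
    // width describe the histogram; totalRows is the row count of T.
    static long estimateCardinality(long[] bucketCounts, double lo, double width,
                                    double v, long totalRows) {
        int idx = (int) ((v - lo) / width);
        if (idx < 0 || idx >= bucketCounts.length) {
            return 0; // value falls outside the histogram's range
        }
        long total = 0;
        for (long c : bucketCounts) {
            total += c;
        }
        if (total == 0) {
            return 0;
        }
        double selectivity = (double) bucketCounts[idx] / total;
        return Math.round(selectivity * totalRows); // e.g. 10% of 100,000 = 10,000
    }
}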
4.4 Building the Histogram
From the data distribution, we build a histogram that records the frequency of the values assigned to
each bucket. If the data is numerical, we can easily define ranges and assign the values to
buckets accordingly. If the data is categorical, we partition the data into ranges according to
the letter the values start with and assign them to buckets appropriately. Next, we perform some sample
query executions.
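For the numeric case, a hedged sketch of an equi-width builder (compatible with the estimator above)
might look as follows:

class HistogramBuilder {
    // Build an equi-width histogram with beta buckets over [lo, hi].
    static long[] buildHistogram(double[] values, int beta, double lo, double hi) {
        long[] counts = new long[beta];
        double width = (hi - lo) / beta;
        for (double v : values) {
            int idx = (int) ((v - lo) / width);
            if (idx == beta) {
                idx = beta - 1; // include the upper boundary in the last bucket
            }
            if (idx >= 0 && idx < beta) {
                counts[idx]++;
            }
        }
        return counts;
    }
}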
4.5 Method Outline for Error Estimation
For each attribute in a database table, we compute the error estimate as the standard deviation between
the updated data values and the old data values in the histogram buckets. Then, for every table, we have
error estimates for all the attributes, and we take a weighted average of the attributes' error estimates. The
underlying data could be mutable, so we need a technique by which we can restructure
the histograms accordingly. Thus, if the database is updated between query executions, we
compute the estimation error of the histogram using equation (1), given at the end of the paper.
Conclusion
In this paper we have shown that query optimization strategies from the database domain can be used to
improve run-time performance in a programming language. We proposed a technique for query
optimization at compile-time that reduces the burden of optimization at run-time: we use
histograms to estimate the selectivities of the joins and predicates in a query and, based on those
estimates, order the query's joins and predicates. From the join and predicate order, we obtain
the query plan at compile-time and then execute that plan at run-time, improving the
performance of the code's execution.
References
C. Hobatr & B. A. Malloy (2001), "Using OCL-queries for debugging C++", In Proceedings of the IEEE
International Conference on Software Engineering (ICSE), pp. 839-840, IEEE Computer Society Press,
2001.
Darren Willis, David J. Pearce & James Noble (2008), "Caching and Incrementalisation in the Java Query
Language", In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems,
Languages and Applications (OOPSLA), pp. 1-18, 2008.
Ihab F. Ilyas et al. (2003), "Estimating Compilation Time of a Query Optimizer", In Proceedings of the
ACM SIGMOD International Conference on Management of Data, pp. 373-384, 2003.
S. Chiba (1995), "A Metaobject Protocol for C++", In Proceedings of the ACM Conference on
Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), pp. 285-299, ACM
Press, 1995.
Y.E. Ioannidis, R. Ng, K. Shim & T.K. Selis (1992), "Parametric Query Optimization", In Proceedings of
the Eighteenth International Conference on Very Large Databases (VLDB), pp. 103-114, 1992.
Equation (1), referenced in Section 4.5, gives the estimation error of the histogram. The exact form of the
formula was lost in the source; the following is a reconstruction consistent with the symbol definitions
below and with the standard-deviation description of Section 4.5:

\mu_a = \sqrt{\frac{1}{\beta}\sum_{i=1}^{\beta}\left(qf \cdot F_i - B_i\right)^2}, \qquad T_i = \frac{\sum_a W_a \mu_a}{\sum_a W_a} \qquad (1)

where
• µa is the estimation error for each attribute
• β is the number of buckets
• N is the number of tuples in R
• S is the number of selected tuples
• Fi is the frequency of bucket i as recorded in the histogram
• qf = S/N is the query frequency
• Bi is the observed frequency of bucket i
• Ti is the error estimate for each individual table
• Wi are the weights for each attribute, depending on its rate of change
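A hedged sketch of computing this error estimate, under the reconstruction above, could be:

class ErrorEstimator {
    // Per-attribute estimation error: RMS deviation between the frequencies
    // predicted by the histogram (scaled by the query frequency qf = S/N)
    // and the observed frequencies, per the reconstruction of equation (1).
    static double attributeError(long[] histFreq, long[] observedFreq, double qf) {
        int beta = histFreq.length; // number of buckets
        double sum = 0;
        for (int i = 0; i < beta; i++) {
            double d = qf * histFreq[i] - observedFreq[i];
            sum += d * d;
        }
        return Math.sqrt(sum / beta);
    }

    // Table-level error: weighted average of the attribute errors.
    static double tableError(double[] attributeErrors, double[] weights) {
        double weighted = 0, totalWeight = 0;
        for (int i = 0; i < attributeErrors.length; i++) {
            weighted += weights[i] * attributeErrors[i];
            totalWeight += weights[i];
        }
        return weighted / totalWeight;
    }
}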