IEEE PROJECTS 2015
1 Crore Projects is a leading provider of guidance for IEEE projects and real-time project work.
It has provided guidance to thousands of students and helped them benefit from training across all technologies.
Dot Net
.NET Project Domain List 2015
1. IEEE projects based on data mining and knowledge engineering
2. IEEE projects based on mobile computing
3. IEEE projects based on networking
4. IEEE projects based on image processing
5. IEEE projects based on multimedia
6. IEEE projects based on network security
7. IEEE projects based on parallel and distributed systems
Java Project Domain List 2015
1. IEEE projects based on data mining and knowledge engineering
2. IEEE projects based on mobile computing
3. IEEE projects based on networking
4. IEEE projects based on image processing
5. IEEE projects based on multimedia
6. IEEE projects based on network security
7. IEEE projects based on parallel and distributed systems
ECE IEEE Projects 2015
1. MATLAB projects
2. NS2 projects
3. Embedded projects
4. Robotics projects
Eligibility
Final-year students of:
1. BSc (CS)
2. BCA / BE (CS)
3. B.Tech (IT)
4. BE (CS)
5. MSc (CS)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME (all branches)
10. BE (ECE / EEE / E&I)
TECHNOLOGIES USED AND TRAINED IN
1. .NET
2. C#
3. ASP
4. VB
5. SQL Server
6. Java
7. J2EE
8. STRINGS
9. Oracle
10. VB.NET
11. Embedded
12. MATLAB
13. LabVIEW
14. Multisim
CONTACT US
1 CRORE PROJECTS
Door No. 214/215, 2nd Floor,
No. 172, Raahat Plaza (Shopping Mall), Arcot Road, Vadapalani, Chennai,
Tamil Nadu, INDIA - 600 026
Email: 1croreprojects@gmail.com
Website: 1croreprojects.com
Phone: +91 97518 00789 / +91 72999 51536
International Journal of Computational Engineering Research (IJCER)
International Journal of Computational Engineering Research (IJCER) is dedicated to protecting personal information and will make every reasonable effort to handle collected information appropriately. All information collected, as well as related requests, will be handled as carefully and efficiently as possible in accordance with IJCER standards for integrity and objectivity.
International Journal of Engineering Research and Applications (IJERA) is an open-access, online, peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low-Power VLSI Design, etc.
Efficient Schema-Based Keyword Search in Relational Databases (IJCSEIT Journal)
Keyword search in relational databases allows users to search for information without knowing the database schema or using Structured Query Language (SQL). In this paper, we address the problem of generating and evaluating candidate networks. In candidate network generation, overhead is caused by the growing number of joining tuples as the size of the minimal candidate network increases. To reduce this overhead, we propose candidate network generation algorithms that generate a minimum number of joining tuples according to the maximum number of tuple sets. We first generate a set of joining tuples, the candidate networks (CNs). Because it is difficult to obtain an optimal query processing plan while generating a large number of joins, we also develop a dynamic CN evaluation algorithm (D_CNEval) that generates connected tuple trees (CTTs) while reducing the size of intermediate join results. The performance of the proposed algorithms is evaluated on the IMDB and DBLP datasets and compared with existing algorithms.
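The idea behind a candidate network can be sketched with a toy example. This is only an illustration, not the paper's D_CNEval algorithm: the two-table schema (movies, actors) and the keywords are hypothetical, and one fixed join stands in for the generated networks.

```python
# Hedged sketch: a candidate network (movies JOIN actors) answering a
# multi-keyword query. The schema and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE actors (id INTEGER PRIMARY KEY, movie_id INTEGER, name TEXT);
INSERT INTO movies VALUES (1, 'Heat'), (2, 'Ronin');
INSERT INTO actors VALUES (1, 1, 'Pacino'), (2, 2, 'De Niro'), (3, 1, 'De Niro');
""")

def keyword_search(keywords):
    # Each keyword must appear somewhere in the joined (movie, actor) row:
    # the join plays the role of one candidate network's joining tuples.
    where = " AND ".join("(m.title LIKE ? OR a.name LIKE ?)" for _ in keywords)
    params = [p for k in keywords for p in (f"%{k}%", f"%{k}%")]
    sql = f"""SELECT m.title, a.name FROM movies m
              JOIN actors a ON a.movie_id = m.id WHERE {where}"""
    return conn.execute(sql, params).fetchall()

print(keyword_search(["Heat", "Niro"]))  # joined tuple covering both keywords
```

Only the joined row ("Heat", "De Niro") covers both keywords, which is exactly the kind of connected tuple tree the abstract's evaluation step produces.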
Using Page Size for Controlling Duplicate Query Results in Semantic Web (IJwest)
The semantic web is the web of the future. The Resource Description Framework (RDF) is a language for representing resources on the World Wide Web. When these resources are queried, the problem of duplicate query results occurs. Existing techniques use hash-index comparison to remove duplicate query results. The major drawback of using a hash index is that a slight change in formatting or word order changes the hash, so query results are no longer considered duplicates even though they have the same contents. We present an algorithm for detecting and eliminating duplicate query results from the semantic web using both hash-index and page-size comparisons. Experimental results show that the proposed technique removes duplicate query results efficiently, solves the problems of using a hash index alone for duplicate handling, and can be embedded in an existing SQL-based query system for the semantic web. Further research could add flexibility to existing SQL-based query systems for the semantic web to accommodate other duplicate-detection techniques as well.
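The hash-plus-size comparison described above can be sketched as follows. The 2% size tolerance is an assumption for illustration, not the paper's value, and real page sizes would come from HTTP responses rather than string lengths.

```python
# Hedged sketch: flag two results as duplicates when either their content
# hashes match, or their sizes are within a small tolerance (catching
# reformatted copies that a hash alone would miss). Tolerance is assumed.
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def likely_duplicate(a, b, size_tolerance=0.02):
    if content_hash(a) == content_hash(b):
        return True  # byte-identical content
    # Size comparison: reformatting or reordering barely changes length.
    bigger = max(len(a), len(b), 1)
    return abs(len(a) - len(b)) / bigger <= size_tolerance

print(likely_duplicate("same words, new order", "new order, same words"))
```

The reordered string has a different hash but an identical size, so it is still caught, which is the failure case of hash-only detection that the paper targets.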
Performance Evaluation of Query Processing Techniques in Information Retrieval (idescitation)
The first element of the search process is the query. Because the user query is on average restricted to two or three keywords, it is ambiguous to the search engine. Given the user query, the goal of an Information Retrieval (IR) system is to retrieve information that might be useful or relevant to the user's information need; hence query processing plays an important role in an IR system. Query processing can be divided into four categories: query expansion, query optimization, query classification, and query parsing. In this paper, an attempt is made to evaluate the performance of query processing algorithms in each category. The evaluation is based on a dataset specified by the Forum for Information Retrieval [FIRE15]. The criteria used for evaluation are precision and relative recall, and the analysis is based on the importance of each step in query processing. The experimental results show the significance of each step in query processing, as well as the relevance of web semantics and spelling correction in the user query.
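The two evaluation criteria named above can be made concrete. Note the relative-recall definition here (a system's relevant hits over the pooled relevant hits of all compared systems) is a common one and may differ from the paper's; the document sets and judgments are invented.

```python
# Hedged sketch of the abstract's evaluation criteria on toy result lists.

def precision(retrieved, relevant):
    # Fraction of retrieved documents that are relevant.
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def relative_recall(system_hits, all_systems_hits, relevant):
    # Relevant hits of one system over the pooled relevant hits of all.
    relevant = set(relevant)
    pooled = set().union(*all_systems_hits) & relevant
    return len(set(system_hits) & relevant) / len(pooled) if pooled else 0.0

docs_a = ["d1", "d2", "d3"]          # system A's results (hypothetical)
docs_b = ["d2", "d4"]                # system B's results (hypothetical)
relevant = ["d2", "d3", "d4", "d9"]  # judged relevant documents

print(precision(docs_a, relevant))   # 2 of 3 retrieved are relevant
print(relative_recall(docs_a, [docs_a, docs_b], relevant))
```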
Architecture of an ontology based domain-specific natural language question a... (IJwest)
A question answering (QA) system aims at retrieving precise information from a large collection of documents in response to a query. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain based on ontological information, a step towards semantic web question answering. The proposed architecture defines four basic modules suitable for enhancing current QA capabilities with the ability to process complex questions. The first module is question processing, which analyses and classifies the question and also reformulates the user query. The second module retrieves the relevant documents. The next module processes the retrieved documents, and the last module performs the extraction and generation of a response. Natural language processing techniques are used for processing the question and documents and also for answer extraction. Ontology and domain knowledge are used for reformulating queries and identifying the relations. The aim of the system is to generate a short and specific answer to a question asked in natural language in a specific domain. We achieved 94% accuracy of natural language question answering in our implementation.
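The four-module flow described above can be sketched as a pipeline of stubs. Every function here is a toy stand-in for the paper's NLP and ontology machinery; the function names and the one-document corpus are assumptions for illustration.

```python
# Hedged sketch of the four NLQA modules wired together.

def process_question(question):
    # Module 1: classify the question and reformulate it as a keyword query.
    qtype = "WHO" if question.lower().startswith("who") else "WHAT"
    keywords = [w.strip("?").lower() for w in question.split()[1:]]
    return qtype, keywords

def retrieve_documents(keywords, corpus):
    # Module 2: fetch documents that mention any query keyword.
    return [d for d in corpus if any(k in d.lower() for k in keywords)]

def process_documents(docs, keywords):
    # Module 3: rank sentences by how many query keywords they contain.
    sents = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    return sorted(sents, key=lambda s: -sum(k in s.lower() for k in keywords))

def extract_answer(sentences, qtype):
    # Module 4: generate a short response from the best candidate sentence.
    return sentences[0] if sentences else "No answer found"

corpus = ["Ada Lovelace wrote the first program. She worked with Babbage."]
qtype, kw = process_question("Who wrote the first program?")
print(extract_answer(process_documents(retrieve_documents(kw, corpus), kw), qtype))
```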
There is a vast amount of unstructured Arabic information on the Web; this data is usually organized as semi-structured text and cannot be used directly. This research proposes a semi-supervised technique that extracts binary relations between two Arabic named entities from the Web. Several works have addressed relation extraction from Latin-script texts, and as far as we know, there is no work for Arabic text using a semi-supervised technique. The goal of this research is to extract a large list or table of named entities and relations in a specific domain. A small handful of instance relations are required as input from the user. The system exploits summaries from the Google search engine as source text; these instances are used to extract patterns, and the output is a set of new entities and their relations. The results of four experiments show that precision and recall vary with relation type: precision ranges from 0.61 to 0.75, while recall ranges from 0.71 to 0.83. The best result is obtained for the (player, club) relationship, with 0.72 and 0.83 for precision and recall respectively.
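The bootstrapping loop (seed pairs, then patterns, then new pairs) can be sketched in a few lines. The English toy snippets stand in for Arabic search-engine summaries, and the literal-context "patterns" are a deliberate simplification of the paper's pattern learning.

```python
# Hedged sketch of semi-supervised relation bootstrapping on toy text.
import re

snippets = [
    "Messi plays for Barcelona this season.",
    "Ronaldo plays for Madrid again.",
    "Salah joined Liverpool in 2017.",
]
seeds = [("Messi", "Barcelona")]  # one (player, club) instance from the user

def learn_patterns(seeds, texts):
    # A pattern is the literal text between the two seed entities.
    pats = set()
    for e1, e2 in seeds:
        for t in texts:
            m = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), t)
            if m:
                pats.add(m.group(1))
    return pats

def extract_pairs(patterns, texts):
    # Apply each learned pattern to harvest new entity pairs.
    pairs = set()
    for p in patterns:
        for t in texts:
            for m in re.finditer(r"(\w+)" + re.escape(p) + r"(\w+)", t):
                pairs.add((m.group(1), m.group(2)))
    return pairs

patterns = learn_patterns(seeds, snippets)  # learns " plays for "
print(extract_pairs(patterns, snippets))    # also finds (Ronaldo, Madrid)
```

A single seed yields the pattern " plays for ", which then extracts a new (player, club) pair the user never supplied; real systems add pattern scoring to keep precision up.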
Towards a Query Rewriting Algorithm Over Proteomics XML Resources (CSCJournals)
Querying and sharing Web proteomics data is not an easy task. Given that several data sources can be used to answer the same sub-goals in the global query, there can be many candidate rewritings. The user query is formulated using concepts and properties related to proteomics research (a domain ontology), and semantic mappings describe the contents of the underlying sources. In this paper, we propose a characterization of the query rewriting problem that models the semantic mappings as an associated hypergraph. The generation of candidate rewritings can then be formulated as the discovery of the minimal transversals of a hypergraph. We exploit and adapt algorithms from hypergraph theory to find all candidate rewritings for a query answering problem. In future work, relevance criteria could help determine optimal, high-quality rewritings according to user needs and source performance.
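The minimal-transversal formulation can be illustrated by brute force on a tiny hypergraph. Each hyperedge is the set of sources able to answer one sub-goal, so a minimal transversal covers every sub-goal with no redundant source. The three-source example is invented, and real transversal algorithms avoid this exponential enumeration.

```python
# Hedged sketch: minimal transversals (hitting sets) of a toy hypergraph.
from itertools import combinations

def minimal_transversals(edges):
    vertices = sorted(set().union(*edges))
    hits = lambda s: all(s & e for e in edges)  # s intersects every edge
    found = []
    for size in range(1, len(vertices) + 1):
        for combo in combinations(vertices, size):
            s = set(combo)
            # Keep s only if it hits every edge and contains no smaller hit.
            if hits(s) and not any(f <= s for f in found):
                found.append(s)
    return found

# Each edge lists the sources that can answer one sub-goal of the query.
edges = [{"src1", "src2"}, {"src2", "src3"}, {"src1", "src3"}]
print(minimal_transversals(edges))  # each transversal = one candidate rewriting
```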
Ontology Based Approach for Semantic Information Retrieval System (IJTET Journal)
Abstract—Information retrieval plays an important role in current search engines, which perform searches based on keywords; this yields an enormous amount of data from which the user cannot easily identify the essential and most important information. This limitation may be overcome by a new web architecture known as the semantic web, which replaces keyword-based search with a conceptual, or semantic, search technique. Natural language processing is commonly implemented in a QA system to accept users' questions, and several steps convert each question into a query that retrieves an exact answer. In conceptual search, the search engine interprets the meaning of the user's query and the relations among the concepts a document contains with respect to a particular domain, producing specific answers instead of lists of results. In this paper, we propose an ontology-based semantic information retrieval system using the Jena semantic web framework: the user enters an input query, which is parsed by the Stanford Parser, and a triplet extraction algorithm is applied. For each input query, a SPARQL query is formed and executed against the knowledge base (ontology), which finds the appropriate RDF triples and retrieves the relevant information through the Jena framework.
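The step from an extracted triplet to a SPARQL query can be sketched as simple string construction. The namespace prefix and the (Chennai, locatedIn) triple are invented; the paper's system would run the resulting query through Jena against a real ontology rather than just printing it.

```python
# Hedged sketch: turn a (subject, predicate) pair from triplet extraction
# into a SPARQL query whose object is the unknown to be answered.

def triplet_to_sparql(subject, predicate, obj_var="?answer"):
    # Ask for every object related to the subject by the predicate.
    return (
        "PREFIX ex: <http://example.org/onto#>\n"
        f"SELECT {obj_var} WHERE {{ ex:{subject} ex:{predicate} {obj_var} . }}"
    )

q = triplet_to_sparql("Chennai", "locatedIn")
print(q)
```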
In this research work, we developed a semi-deterministic algorithm and a scoring system that takes advantage of Latent Semantic Indexing (LSI) for crawling web pages that belong to a particular domain or are specific to a topic. The proposed algorithm calculates a preference factor in addition to the LSI score to determine which web page should be preferred for crawling by the multi-threaded crawler. By doing this, we were able to produce a retrieval system with high recall and precision values, as it builds a queue specific to a particular domain or topic, which would not be possible in breadth-first or purely LSI-based information retrieval systems.
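The prioritised crawl frontier described above can be sketched with a heap. How the LSI score and preference factor are combined is not stated here, so the simple product (and the scores themselves) are assumptions for illustration.

```python
# Hedged sketch: a crawl frontier ordered by LSI score x preference factor.
import heapq

frontier = []  # min-heap; priorities are negated so the best page pops first

def enqueue(url, lsi_score, preference_factor):
    priority = lsi_score * preference_factor  # assumed combination rule
    heapq.heappush(frontier, (-priority, url))

def next_page():
    return heapq.heappop(frontier)[1]

enqueue("http://example.org/on-topic", lsi_score=0.9, preference_factor=1.2)
enqueue("http://example.org/off-topic", lsi_score=0.2, preference_factor=1.0)
print(next_page())  # the on-topic page is crawled first
```

This is what makes the queue domain-specific: unlike breadth-first crawling, discovery order no longer dictates visit order.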
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Professional fuzzy type-ahead search in XML type-ahead search techni...
Abstract – This is a research venture on the new information-access paradigm called type-ahead search, in which systems compute answers to a keyword query on the fly as users type in the query. In this paper we study how to support fuzzy type-ahead search in XML. Supporting fuzzy search is important when users have limited knowledge about the exact representation of the entities they are looking for, such as people records in an online directory. We have developed and deployed several such systems, some of which are used by many people on a daily basis. The systems received overwhelmingly positive feedback from users due to their friendly interfaces with the fuzzy-search feature. We describe the design and implementation of the systems, demonstrate several of them, and show that our efficient techniques can indeed allow this search paradigm to scale to large amounts of data.
Index Terms - type-ahead, large data set, server side, online directory, search technique.
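Fuzzy type-ahead matching can be sketched as edit-distance comparison against name prefixes. The edit-distance threshold and the name list are assumptions, and the linear scan stands in for the indexed XML structures a real system would use.

```python
# Hedged sketch: suggest directory entries whose prefix is within a small
# edit distance of what the user has typed so far.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_type_ahead(typed, names, max_edits=1):
    # Compare against each name's prefix of the same length as the input.
    return [n for n in names
            if edit_distance(typed.lower(), n.lower()[:len(typed)]) <= max_edits]

names = ["Katherine", "Catherine", "Kathleen", "Robert"]
print(fuzzy_type_ahead("kath", names))
```

Typing "kath" also surfaces "Catherine" (one substitution away), which is exactly the limited-knowledge case the abstract motivates.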
Immune-Inspired Method for Selecting the Optimal Solution in Semantic Web Ser... (IJwest)
The increasing interest in developing efficient and effective optimization techniques has led researchers to turn their attention towards biology. Biology offers many clues for designing novel optimization techniques: such approaches exhibit self-organizing capabilities and allow promising solutions to be reached without a central coordinator. In this paper we address the problem of dynamic web service composition using the clonal selection algorithm. To assess the optimality of a given composition, we use the QoS attributes of the services involved in the workflow as well as the semantic similarity between these components. The experimental evaluation shows that the proposed approach performs better than other approaches such as the genetic algorithm.
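One iteration of clonal selection can be sketched on a toy fitness function. In the paper, fitness would combine QoS attributes and semantic similarity; here it is just closeness to a target value, and the selection size, clone count, and mutation step are all illustrative assumptions.

```python
# Hedged sketch: select the best candidates, clone and hypermutate them,
# then reselect -- the core loop of the clonal selection algorithm.
import random

random.seed(42)

def fitness(x):
    return -abs(x - 0.7)  # toy stand-in for a QoS/similarity score

def clonal_selection_step(population, n_select=3, clones_per=4, step=0.05):
    # 1. Select the best antibodies (candidate compositions).
    best = sorted(population, key=fitness, reverse=True)[:n_select]
    # 2. Clone them and apply hypermutation (small random perturbations).
    clones = [b + random.uniform(-step, step)
              for b in best for _ in range(clones_per)]
    # 3. Reselect from parents plus clones, keeping population size fixed.
    return sorted(best + clones, key=fitness, reverse=True)[:len(population)]

pop = [random.random() for _ in range(8)]
for _ in range(20):
    pop = clonal_selection_step(pop)
print(round(pop[0], 2))  # best candidate converges towards 0.7
```

Because parents survive reselection, the best fitness never regresses; the self-organizing improvement happens with no central coordinator beyond the selection rule.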
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection (1 Crore Projects)
Privacy Policy Inference of User-Uploaded Images on Content Sharing Sites1crore projects
IEEE PROJECTS 2015
1 crore projects is a leading Guide for ieee Projects and real time projects Works Provider.
It has been provided Lot of Guidance for Thousands of Students & made them more beneficial in all Technology Training.
Dot Net
DOTNET Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
Java Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2015
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US
1 CRORE PROJECTS
Door No: 214/215,2nd Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamin Nadu, INDIA - 600 026
Email id: 1croreprojects@gmail.com
website:1croreprojects.com
Phone : +91 97518 00789 / +91 72999 51536
IEEE PROJECTS 2015
1 crore projects is a leading Guide for ieee Projects and real time projects Works Provider.
It has been provided Lot of Guidance for Thousands of Students & made them more beneficial in all Technology Training.
Dot Net
DOTNET Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
Java Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2015
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US
1 CRORE PROJECTS
Door No: 214/215,2nd Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamin Nadu, INDIA - 600 026
Email id: 1croreprojects@gmail.com
website:1croreprojects.com
Phone : +91 97518 00789 / +91 72999 51536
IEEE PROJECTS 2015
1 crore projects is a leading Guide for ieee Projects and real time projects Works Provider.
It has been provided Lot of Guidance for Thousands of Students & made them more beneficial in all Technology Training.
Dot Net
DOTNET Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
Java Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2015
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US
1 CRORE PROJECTS
Door No: 214/215,2nd Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamin Nadu, INDIA - 600 026
Email id: 1croreprojects@gmail.com
website:1croreprojects.com
Phone : +91 97518 00789 / +91 72999 51536
On Summarization and Timeline Generation for Evolutionary Tweet Streams1crore projects
IEEE PROJECTS 2015
1 crore projects is a leading Guide for ieee Projects and real time projects Works Provider.
It has been provided Lot of Guidance for Thousands of Students & made them more beneficial in all Technology Training.
Dot Net
DOTNET Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
Java Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2015
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US
1 CRORE PROJECTS
Door No: 214/215,2nd Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamin Nadu, INDIA - 600 026
Email id: 1croreprojects@gmail.com
website:1croreprojects.com
Phone : +91 97518 00789 / +91 72999 51536
Expression of Query in XML object-oriented database (IJCATR)
With the advent of object-oriented databases, the concept of behavior in databases was introduced; previously, relational databases provided only a logical model of the data and paid no attention to the operations applied to it. This paper presents a method for querying object-oriented databases. The method yields appropriate results when the user expresses restrictions in combinational form (disjunctive and conjunctive) and assigns each restriction a weight according to its importance. The results are then sorted by their degree of membership in the response set. Queries are subsequently expressed using XML labels. The aim is to simplify queries so that the objects they return closely match the user's needs and expectations.
Abstract:
An increasing number of applications rely on RDF, OWL 2, and SPARQL for storing and querying data. SPARQL, however, is not targeted towards end-users, and suitable query interfaces are needed. Faceted search is a prominent approach for end-user data access, and several RDF-based faceted search systems have been developed. There is, however, a lack of rigorous theoretical underpinning for faceted search in the context of RDF and OWL 2. In this paper, we provide such solid foundations. We formalise faceted interfaces for this context, identify a fragment of first-order logic capturing the underlying queries, and study the complexity of answering such queries for RDF and OWL 2 profiles. We then study interface generation and update, and devise efficiently implementable algorithms. Finally, we have implemented and tested our faceted search algorithms for scalability, with encouraging results.
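As a minimal illustration of the faceted-search idea this abstract describes (not the paper's formalism), the sketch below filters a toy set of RDF-style triples by successively chosen facet values; the triples, facet names, and values are invented for the example.

```python
# Toy faceted search over RDF-style (subject, predicate, object) triples.
# Each selected facet (predicate, object) narrows the set of matching subjects.
triples = [
    ("paper1", "type", "Article"), ("paper1", "year", "2015"),
    ("paper2", "type", "Article"), ("paper2", "year", "2014"),
    ("book1", "type", "Book"), ("book1", "year", "2015"),
]

def facet_search(triples, selections):
    """Return subjects matching every selected (predicate, object) facet."""
    subjects = {s for s, _, _ in triples}
    for pred, obj in selections:
        subjects &= {s for s, p, o in triples if p == pred and o == obj}
    return subjects

print(facet_search(triples, [("type", "Article"), ("year", "2015")]))
```

In a real system each facet value would also carry a count of remaining matches, and the interface would be regenerated after every selection.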
A New Top-k Conditional XML Preference Queries (IJAIA)
Preference querying technology is an important issue in a variety of applications, ranging from e-commerce to personalized search engines, and much recent research in both the Artificial Intelligence and Database fields has been dedicated to this topic. Several formalisms allowing preference reasoning and specification have been proposed in the Artificial Intelligence domain. In the Database field, interest has focused mainly on extending the standard Structured Query Language (SQL) and the eXtensible Markup Language (XML) with preference facilities in order to provide personalized query answering. More precisely, the interest in the database context centers on the notion of the Top-k preference query and on the development of efficient methods for evaluating such queries. A Top-k preference query returns the k data tuples that are most preferred according to the user's preferences. Of course, Top-k preference query answering depends closely on the particular preference model underlying the semantics of the operators responsible for selecting the best tuples. In this paper, we consider Conditional Preference queries (CP-queries), where preferences are specified by a set of rules expressed in a logical formalism. We introduce Top-k conditional preference queries (Top-k CP-queries) and present the operators BestK-Match and Best-Match for evaluating them.
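A rough sketch of the Top-k preference idea (not the paper's CP-query semantics or its BestK-Match operator): score each tuple by a set of weighted preference rules and keep the k highest scorers. The rules, weights, and tuples below are invented for the example.

```python
# Illustrative Top-k preference selection: each rule is a predicate over a
# tuple with a weight; a tuple's score is the sum of weights of satisfied rules.
import heapq

rules = [
    (lambda t: t["color"] == "red", 2.0),   # prefer red items
    (lambda t: t["price"] < 100, 1.0),      # prefer cheap items
]

def top_k(tuples, rules, k):
    """Return the k tuples with the highest total rule-weight score."""
    score = lambda t: sum(w for pred, w in rules if pred(t))
    return heapq.nlargest(k, tuples, key=score)

items = [
    {"id": 1, "color": "red", "price": 150},
    {"id": 2, "color": "blue", "price": 90},
    {"id": 3, "color": "red", "price": 80},
]
print(top_k(items, rules, k=2))
```

In a genuine CP-query evaluator the rules would be conditional ("if X then prefer Y over Z") and compared by dominance rather than a flat weighted sum.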
Enhancing keyword search over relational databases using ontologies (CS and IT)
Keyword Search Over Relational Databases (KSORDB) provides an easy way for casual users to access relational databases using a set of keywords. Although much research has been done and several prototypes have been developed recently, most of this research implements exact (also called syntactic or keyword) match. So, if there is a vocabulary mismatch, the user cannot get an answer even though the database may contain relevant data. In this paper we propose a system that overcomes this issue. Our system extends existing schema-free KSORDB systems with semantic match features: if there are no or very few answers, the system exploits a domain ontology to progressively return related terms that can be used to retrieve more relevant answers for the user.
Efficient Filtering Algorithms for Location-Aware Publish/Subscribe (IJSRD)
Location-based services are now widely used. Previous systems use a pull (user-initiated) model, in which a user issues a query to a server that responds with location-aware answers. To deliver results to users with fast response times, a push (server-initiated) model is becoming an important computing model for next-generation location-based services. In the push model, subscribers register spatio-textual subscriptions to capture their interests, and publishers post spatio-textual messages; a high-performance location-aware publish/subscribe system then delivers publishers' messages to the relevant subscribers. In this paper, we address the research challenges that arise in designing such a location-aware publish/subscribe system. We propose an R-tree based index that integrates textual descriptions into R-tree nodes, and we design efficient filtering algorithms and effective pruning techniques to achieve high performance. The method supports both conjunctive queries and ranking queries.
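The matching predicate at the heart of such a system can be sketched naively as below (the paper's contribution is an R-tree with embedded textual descriptions; this linear scan only shows the spatio-textual condition). The subscriptions and message are invented for the example.

```python
# Naive spatio-textual matching: a subscription has a rectangular region of
# interest and a set of required terms; a message has a point and a term set.
subs = [
    {"id": "s1", "box": (0, 0, 10, 10), "terms": {"coffee"}},
    {"id": "s2", "box": (20, 20, 30, 30), "terms": {"coffee"}},
    {"id": "s3", "box": (0, 0, 10, 10), "terms": {"pizza"}},
]

def matches(sub, msg):
    """A message matches if its point lies inside the subscription's box and
    its text contains all subscribed terms (a conjunctive query)."""
    x, y = msg["point"]
    x1, y1, x2, y2 = sub["box"]
    return x1 <= x <= x2 and y1 <= y <= y2 and sub["terms"] <= msg["terms"]

msg = {"point": (5, 5), "terms": {"coffee", "shop"}}
print([s["id"] for s in subs if matches(s, msg)])
```

An index-based system would prune whole groups of subscriptions at once: an R-tree node whose bounding box misses the point, or whose aggregated term set lacks a required term, is skipped without examining its children.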
Existing work on keyword search relies on an element-level model (data graphs) to compute keyword query results: elements mentioning keywords are retrieved from this model, and paths between them are explored to compute Steiner graphs. A KRG (Keyword Element Relationship Graph) instead captures relationships at the keyword level; the relationships it captures are not direct edges between tuples but stand for paths between keywords.
We propose to route keywords only to relevant sources, reducing the high cost of processing keyword search queries over all sources. A multilevel scoring mechanism computes the relevance of routing plans based on scores at the level of keywords, data elements, element sets, and the subgraphs that connect these elements.
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE... (IJSEA)
With the emergence of XML as the de facto format for storing and exchanging information over the Internet, the search for ever more innovative and effective querying techniques is a major, ongoing concern of the XML database community. Most studies addressing this problem are oriented towards the evaluation of so-called exact queries which, unfortunately (especially in the case of semi-structured documents), are likely to yield either overabundant results (for vague queries) or empty results (for very precise queries). From the observation that users who pose queries are not necessarily interested in all possible solutions, but rather in those closest to their needs, an important field of research has opened on the evaluation of preference queries. In this paper, we propose an approach for evaluating such queries when the preferences concern the structure of the document. The solution revolves around an evaluation plan in three phases: rewriting, evaluation, and merge. The rewriting phase uses a partitioning-transformation operation on the initial query to obtain a hierarchical set of preference path queries, which are holistically evaluated in the second phase by an instrumented version of the TwigStack algorithm. The merge phase synthesizes the best results.
Context-Based Diversification for Keyword Queries over XML Data
Jianxin Li, Chengfei Liu, Member, IEEE, and Jeffrey Xu Yu, Senior Member, IEEE
Abstract—While keyword query empowers ordinary users to search vast amounts of data, the ambiguity of keyword query makes it difficult to effectively answer keyword queries, especially for short and vague keyword queries. To address this challenging problem, in this paper we propose an approach that automatically diversifies XML keyword search based on its different contexts in the XML data. Given a short and vague keyword query and XML data to be searched, we first derive keyword search candidates of the query by a simple feature selection model. And then, we design an effective XML keyword search diversification model to measure the quality of each candidate. After that, two efficient algorithms are proposed to incrementally compute top-k qualified query candidates as the diversified search intentions. Two selection criteria are targeted: the k selected query candidates are most relevant to the given query while they have to cover a maximal number of distinct results. At last, a comprehensive evaluation on real and synthetic data sets demonstrates the effectiveness of our proposed diversification model and the efficiency of our algorithms.
Index Terms—XML keyword search, context-based diversification
1 INTRODUCTION
KEYWORD search on structured and semi-structured
data has attracted much research interest recently, as it
enables users to retrieve information without the need to
learn sophisticated query languages and database structure
[1]. Compared with keyword search methods in informa-
tion retrieval (IR) that prefer to find a list of relevant docu-
ments, keyword search approaches in structured and semi-
structured data (denoted as DB and IR) concentrate more
on specific information contents, e.g., fragments rooted at
the smallest lowest common ancestor (SLCA) nodes of a
given keyword query in XML. Given a keyword query, a
node v is regarded as an SLCA if 1) the subtree rooted at v contains all the keywords, and 2) there does not exist a descendant node v' of v such that the subtree rooted at v' contains all the keywords. In other words, if a node is an SLCA, then its ancestors are definitely excluded from being SLCAs, so the minimal information content under the SLCA semantics can be used to represent the specific results in XML keyword search. In this paper, we adopt the well-accepted SLCA semantics [2], [3], [4], [5] as the result metric for keyword queries over XML data.
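As a minimal illustration of the SLCA semantics (not the paper's implementation), the following sketch identifies SLCAs on a toy tree whose nodes are Dewey-style paths; the index contents and helper names are invented for the example:

```python
from itertools import product

def is_ancestor(a, b):
    """True if node a is a proper ancestor of node b (a is a path prefix)."""
    return len(a) < len(b) and b[:len(a)] == a

def slca(inverted_index, keywords):
    """Nodes whose subtree contains all keywords and that have no
    descendant whose subtree also contains all keywords."""
    lists = [inverted_index[k] for k in keywords]
    lcas = set()
    for combo in product(*lists):           # brute force: fine for toy data
        prefix = combo[0]
        for node in combo[1:]:              # longest common path prefix
            i = 0
            while i < min(len(prefix), len(node)) and prefix[i] == node[i]:
                i += 1
            prefix = prefix[:i]
        lcas.add(prefix)
    # keep only the smallest LCAs: drop any LCA that has a descendant LCA
    return {a for a in lcas if not any(is_ancestor(a, b) for b in lcas)}

# Toy tree: the root paper (0,) has "database" in node (0, 0) and "query" in
# (0, 1); a nested element (0, 2, 0) contains both keywords on its own.
index = {
    "database": [(0, 0), (0, 2, 0)],
    "query":    [(0, 1), (0, 2, 0)],
}
result = slca(index, ["database", "query"])
```

The root is a common ancestor of both keywords, but it is excluded because the deeper node (0, 2, 0) already contains them, matching condition 2) above.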
In general, the more keywords a user's query contains, the easier it is to identify the user's search intention. However, when the given keyword query contains only a small number of vague keywords, deriving the user's search intention becomes very challenging due to the high ambiguity of this type of keyword query. Although user involvement is sometimes helpful for identifying the search intentions of keyword queries, an interactive process may be time-consuming when the relevant result set is large. To address this,
we will develop a method of providing diverse keyword
query suggestions to users based on the context of the given
keywords in the data to be searched. By doing this, users
may choose their preferred queries or modify their original
queries based on the returned diverse query suggestions.
Example 1. Consider a query q = {database, query} over the DBLP data set. There are 21,260 publications or venues containing the keyword "database" and 9,896 publications or venues containing the keyword "query", which together contribute 2,040 results that contain both keywords. Directly reading the keyword search results would be time-consuming and user-unfriendly due to the huge number of results. Just computing all the SLCA results of q using XRank [2] takes 54.22 s. Even if the system processing time is made acceptable by accelerating the keyword query evaluation with efficient algorithms [3], [4], the unclear and repeated search intentions in the large set of retrieved results will frustrate users. To address the problem, we derive different search semantics of the original query from the different contexts of the XML data, which can be used to explore different search intentions of the original query. In this study, the contexts can be modeled by extracting some relevant feature terms of the query keywords from the XML data, as shown in Table 1. We can then compute the keyword search results for each search intention. Table 2 shows part of the statistical information of the answers related to the keyword query q, which classifies the ambiguous keyword query into different search intentions.
The problem of diversifying keyword search was first studied in the IR community [6], [7], [8], [9], [10]. Most of
J. Li and C. Liu are with the Faculty of Science, Engineering and
Technology, Swinburne University of Technology, Melbourne, VIC 3122,
Australia. E-mail: {jianxinli, cliu}@swin.edu.au.
J.X. Yu is with the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong.
E-mail: yu@se.cuhk.edu.hk.
Manuscript received 18 Apr. 2013; revised 2 June 2014; accepted 10 June
2014. Date of publication 7 July 2014; date of current version 28 Jan. 2015.
Recommended for acceptance by S. Sudarshan.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2014.2334297
660 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 3, MARCH 2015
1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
them perform diversification as a post-processing or re-ranking step of document retrieval based on the analysis
of result set and/or the query logs. In IR, keyword search
diversification is designed at the topic or document level.
For example, Agrawal et al. [7] model user intents at the topical level of a taxonomy, and Radlinski and Dumais [11] obtain possible query intents by mining query logs.
However, such taxonomies and query logs are not always easy to obtain. In addition, the diversified results in IR are often modeled at the document level. To improve the precision of query diversification in structured databases or semi-structured data, it is desirable to consider both the structure and the content of the data in the diversification model, so the problem of keyword search diversification needs to be reconsidered in this setting. Liu et al. [12] were the first to measure the difference of XML keyword search results by comparing their feature sets. However, the feature-set selection in [12] is limited to metadata in XML, and it is also a post-process method of search result analysis.
Different from the above post-process methods, another line of work addresses the problem of intent-based keyword query diversification by constructing structured query candidates [13], [14]. Their basic idea is to first map each keyword to a set of attributes (metadata), and then construct a large number of structured query candidates by merging the attribute-keyword pairs. They assume that each structured query candidate represents a type of search intention, i.e., a query interpretation. However, these works are hard to apply in real applications due to the following three limitations:
- A large number of structured XML queries may be generated and evaluated;
- There is no guarantee that the structured queries to be evaluated can find matched results, due to the structural constraints;
- Similar to [12], the process of constructing structured queries has to rely on the metadata information in XML data.
To address the above limitations and challenges, we initiate a formal study of the diversification problem in XML keyword search, which can directly compute the diversified results without retrieving all the relevant candidates. Towards this goal, given a keyword query, we first derive the correlated feature terms for each query keyword from the XML data based on mutual information in probability theory, which has been used as a criterion for feature selection [15], [16]. The selection of our feature terms is not limited to the labels of XML elements. Each combination of the feature terms and the original query keywords may represent one of the diversified contexts (also denoted as specific search intentions). We then evaluate each derived search intention by measuring its relevance to the original keyword query and the novelty of its produced results. To compute diversified keyword search efficiently, we propose one baseline algorithm and two improved algorithms based on the observed properties of diversified keyword search results.
The remainder of this paper is organized as follows. In Section 2, we introduce a feature selection model and define the problem of diversifying XML keyword search. In Section 3, we describe the procedure of extracting the relevant feature terms for a keyword query based on the explored feature selection model. In Section 4, we propose three efficient algorithms to identify a set of qualified and diversified keyword query candidates and evaluate them based on our proposed pruning properties. In Section 5, we provide extensive experimental results to show the effectiveness of our diversification model and the performance of our proposed algorithms. We describe the related work in Section 6 and conclude in Section 7.
2 PROBLEM DEFINITION
Given a keyword query q and XML data T, our target is to derive the top-k expanded query candidates with high relevance and maximal diversification for q in T. Here, each query candidate represents a context, or search intention, of q in T.
2.1 Feature Selection Model
Consider XML data T and its relevance-based term-pair dictionary W. The composition method of W depends on the application context and does not affect our subsequent discussion. As an example, it can simply be the full set or a subset of the terms comprising the text in T, or a well-specified set of term-pairs relevant to some applications.
In this work, the distinct term-pairs are selected based on their mutual information, as in [15], [16]. Mutual information has been used as a criterion for feature selection and feature transformation in machine learning. It can characterize both the relevance and the redundancy of variables, as in minimum-redundancy feature selection. Assume we have an XML tree T and its sample result set R(T). Let Prob(x, T) be the probability of term x appearing
TABLE 1
Top 10 Selected Feature Terms of q

keyword  | features
database | systems; relational; protein; distributed; oriented; image; sequence; search; model; large.
query    | language; expansion; optimization; evaluation; complexity; log; efficient; distributed; semantic; translation.
TABLE 2
Part of the Statistical Information for q

database systems + query (#results):
  language 71; expansion 5; optimization 68; evaluation 13; complexity 1;
  log 12; efficient 17; distributed 50; semantic 14; translation 8.
relational database + query (#results):
  language 40; expansion 0; optimization 20; evaluation 8; complexity 0;
  log 2; efficient 11; distributed 5; semantic 7; translation 5.
... ...
in R(T), i.e., Prob(x, T) = |R(x, T)| / |R(T)|, where |R(x, T)| is the number of results containing x. Let Prob(x, y, T) be the probability of terms x and y co-occurring in R(T), i.e., Prob(x, y, T) = |R(x, y, T)| / |R(T)|. If terms x and y are independent, then knowing x gives no information about y and vice versa, so their mutual information is zero. At the other extreme, if terms x and y are identical, then knowing x determines the value of y and vice versa. Therefore, this simple measure can quantify how much the observed word co-occurrences maximize the dependency of feature terms while reducing their redundancy. In this work, we use the widely accepted mutual information model as follows:

MI(x, y, T) = Prob(x, y, T) * log( Prob(x, y, T) / (Prob(x, T) * Prob(y, T)) ).   (1)
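A direct transcription of Equation (1), with made-up result sets standing in for R(x, T) and R(y, T); the zero-co-occurrence convention is an implementation choice, not from the paper:

```python
import math

def mi(results_x, results_y, n_results):
    """MI(x, y, T) = P(x,y) * log( P(x,y) / (P(x) * P(y)) ), Equation (1)."""
    p_x = len(results_x) / n_results
    p_y = len(results_y) / n_results
    p_xy = len(results_x & results_y) / n_results
    if p_xy == 0:
        return 0.0   # never co-occur: treat as zero dependency
    return p_xy * math.log(p_xy / (p_x * p_y))

# "database" occurs in sample results {1,2,3,4}, "systems" in {2,3,4,5},
# out of 10 sample results in total
score = mi({1, 2, 3, 4}, {2, 3, 4, 5}, 10)
```

Here P(x,y) = 0.3 exceeds P(x) * P(y) = 0.16, so the pair scores positively, i.e., the terms co-occur more often than independence would predict.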
For each term in the XML data, we need to find a set of feature terms, which can be selected in any way, e.g., the top-m feature terms, or the feature terms whose mutual information values exceed a threshold set by domain applications or data administrators. The feature terms can be precomputed and stored before query evaluation. Thus, given a keyword query, we can obtain a matrix of features for the query keywords using the term-pairs in W. The matrix represents a space of search intentions (i.e., query candidates) of the original query w.r.t. the XML data. Therefore, our first problem is to select a subset of query candidates that has the highest probability of interpreting the contexts of the original query. In this work, the selection of query candidates is based on an approximate sample at the entity level of the XML data.
Consider query q = {database, query} over the DBLP XML data set again. Its corresponding matrix can be constructed from Table 1. Table 3 shows the mutual information scores for the query keywords in q. Each combination of the feature terms in the matrix may represent a search intention with specific semantics. For example, the combination "query expansion database systems" targets publications discussing the problem of query expansion in the area of database systems; e.g., one such work, "query expansion for information retrieval", published in Encyclopedia of Database Systems in 2009, will be returned. If we replace the feature term "systems" with "relational", the generated query changes to search for publications on query expansion over relational databases, for which the returned results are empty because no work on this problem over relational databases is reported in the DBLP data set.
2.2 Keyword Search Diversification Model
In our model, we not only consider the probability of new
generated queries, i.e., relevance, we also take into account
their new and distinct results, i.e., novelty. To embody the
relevance and novelty of keyword search together, two cri-
teria should be satisfied: 1) the generated query qnew has the
maximal probability to interpret the contexts of original
query q with regards to the data to be searched; and 2) the
generated query qnew has a maximal difference from the pre-
viously generated query set Q. Therefore, we have the
aggregated scoring function
scoreðqnewÞ ¼ Probðqnew j q; TÞ Ã DIFðqnew; Q; TÞ; (2)
where Probðqnew j q; TÞ represents the probability that qnew is
the search intention when the original query q is issued over
the data T; DIFðqnew; Q; TÞ represents the percentage of
results that are produced by qnew, but not by any previously
generated query in Q.
2.2.1 Evaluating the Probabilistic Relevance of an Intended Query Suggestion w.r.t. the Original Query
Based on Bayes' theorem, we have

Prob(q_new | q, T) = Prob(q | q_new, T) * Prob(q_new | T) / Prob(q | T),   (3)

where Prob(q | q_new, T) models the likelihood of generating the observed query q while the intended query is actually q_new, and Prob(q_new | T) is the query generation probability given the XML data T.
The likelihood value Prob(q | q_new, T) can be measured by computing the probability of the original query q being observed in the context of the features in q_new. Given an original query q = {k_i}, 1 <= i <= n, and a generated new query candidate q_new = {s_i}, 1 <= i <= n, where k_i is a query keyword in q, s_i is a segment that consists of the query keyword k_i and one of its features f_{ij_i} in W, and 1 <= j_i <= m. Here, we assume that for each query keyword, only the top-m most relevant features will be retrieved from W to generate new query candidates. To deal with multi-keyword queries, we make the independence assumption on the probability that f_{ij_i} is the intended feature of the query keyword k_i. That is,

Prob(q | q_new, T) = ∏_{k_i ∈ q, f_{ij_i} ∈ q_new} Prob(k_i | f_{ij_i}, T).   (4)
According to the statistical sample information, the intent of a keyword can be inferred from the occurrences of the keyword and its correlated terms in the data. Thus, we can compute the probability Prob(k_i | f_{ij_i}, T) of interpreting a keyword k_i into a search intent f_{ij_i} as follows:

Prob(k_i | f_{ij_i}, T) = Prob(f_{ij_i} | k_i, T) * Prob(k_i, T) / Prob(f_{ij_i}, T)
                        = (|R({k_i, f_{ij_i}}, T)| / |R(T)|) / (|R(f_{ij_i}, T)| / |R(T)|)
                        = |R(s_i, T)| / |R(f_{ij_i}, T)|,   (5)

where s_i = {k_i, f_{ij_i}}.
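Under the independence assumption of Equation (4), the likelihood is just a product of the per-keyword ratios from Equation (5). A small sketch with illustrative counts (the numbers are invented, not DBLP values):

```python
# Sketch of Equations (4)-(5): the likelihood of q given q_new is the product
# over keywords of |R(s_i, T)| / |R(f_ij, T)|. Counts below are illustrative.
from math import prod

def likelihood(pair_counts, feature_counts):
    """prod_i |R(s_i, T)| / |R(f_ij, T)| for each keyword-feature segment."""
    return prod(p / f for p, f in zip(pair_counts, feature_counts))

# q_new = {"database systems", "query expansion"}: suppose 71 results contain
# {database, systems} out of 500 containing "systems", and 5 results contain
# {query, expansion} out of 40 containing "expansion".
lik = likelihood([71, 5], [500, 40])
```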
TABLE 3
Mutual Information Scores w.r.t. Terms in q (x 10^-4)

database: systems 7.06; relational 3.84; protein 2.79; distributed 2.25; oriented 2.06; image 1.73; sequence 1.31; search 1.1; model 1.04; large 1.02.
query: language 3.63; expansion 2.97; optimization 2.3; evaluation 1.71; complexity 1.41; log 1.17; efficient 1.03; distributed 0.99; semantic 0.86; translation 0.70.
Consider a query q = {database, query} and a query candidate q_new = {database systems, query expansion}. Prob(q | q_new, T) gives the probability of a publication addressing the problem of "database query" in the context of "systems and expansion", which can be computed by (|R({database, systems}, T)| / |R(systems, T)|) * (|R({query, expansion}, T)| / |R(expansion, T)|). Here, |R({database, systems}, T)| represents the number of keyword search results of the query {database, systems} over the data T. |R(systems, T)| represents the number of keyword search results of running the query {systems} over T; this number can be obtained without running the query, because it equals the size of the keyword node list of "systems" over T. Similarly, we can calculate the values of |R({query, expansion}, T)| and |R(expansion, T)|. In this work, we adopt the widely accepted SLCA semantics to model XML keyword search results.
Given the XML data T, the query generation probability of q_new can be calculated by the following equation:

Prob(q_new | T) = |R(q_new, T)| / |R(T)| = |∩_{s_i ∈ q_new} R(s_i, T)| / |R(T)|,   (6)

where ∩_{s_i ∈ q_new} R(s_i, T) represents the set of SLCA results obtained by merging the node lists R(s_i, T) for s_i ∈ q_new using the XRank algorithm in [4], a popular method that computes the SLCA results by traversing the XML data only once.
Given an original query and the data, the value 1/Prob(q | T) is relatively unchanged with regard to the different generated query candidates. Therefore, Equation (3) can be rewritten as follows:

Prob(q_new | q, T) = γ * ∏_i (|R(s_i, T)| / |R(f_{ij_i}, T)|) * (|∩_i R(s_i, T)| / |R(T)|),   (7)

where k_i ∈ q, s_i ∈ q_new, f_{ij_i} ∈ s_i, and γ = 1/Prob(q | T) can be any value in (0, 1] because it does not affect the expanded query candidates w.r.t. an original keyword query q and data T.
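Dropping the constant factor γ / |R(T)|, as Equation (7) allows when only comparing candidates, the relevance of a candidate reduces to the likelihood product times its SLCA result count. A sketch with the same invented counts as before:

```python
# Sketch of the candidate-comparison core of Equations (6)-(7): likelihood
# product times |R(q_new, T)|, omitting the constant gamma / |R(T)|.
from math import prod

def relevance(pair_counts, feature_counts, n_slca_results):
    """prod_i |R(s_i,T)| / |R(f_ij,T)| times the candidate's SLCA count."""
    return prod(p / f for p, f in zip(pair_counts, feature_counts)) * n_slca_results

# invented counts: segments match 71/500 and 5/40; q_new yields 5 SLCA results
rel = relevance([71, 5], [500, 40], 5)
```

Because γ / |R(T)| is identical for every candidate of the same original query, ranking by this quantity ranks candidates exactly as Equation (7) would.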
Although the above equation can model the probabilities of generated query candidates (i.e., the relevance between these query candidates and the original query w.r.t. the data), different query candidates may have overlapping result sets. Therefore, we should also take into account the novelty of the results of these query candidates.
2.2.2 Evaluating the Probabilistic Novelty of an Intended Query w.r.t. the Original Query
As we know, an important property of the SLCA semantics is exclusivity, i.e., if a node is taken as an SLCA result, then its ancestor nodes cannot become SLCA results. Due to this exclusive property, the process of evaluating the novelty of a newly generated query candidate q_new depends on the evaluation of the other previously generated query candidates Q. As such, the novelty DIF(q_new, Q, T) of q_new against Q can be calculated as follows:

DIF(q_new, Q, T) = |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}|,   (8)

where R(q_new, T) represents the set of SLCA results generated by q_new; ∪_{q'∈Q} R(q', T) represents the set of SLCA results generated by the queries in Q, which excludes the duplicate and ancestor nodes; v_x ⪯ v_y means that v_x is a duplicate of v_y (for "=") or an ancestor of v_y (for "≺"); and R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)} is an SLCA result set that satisfies the exclusive property.
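A sketch of Equation (8), again using Dewey-style paths so that the duplicate-or-ancestor test v_x ⪯ v_y becomes a prefix comparison; the result sets are toy data, not DBLP results:

```python
def is_anc_or_dup(a, b):
    """True if a == b, or a is a proper ancestor (path prefix) of b."""
    return b[:len(a)] == a

def dif(r_new, r_old):
    """Fraction of q_new's SLCA results that are neither duplicates nor
    ancestors of any previously returned result, over the merged set."""
    novel = {v for v in r_new if not any(is_anc_or_dup(v, w) for w in r_old)}
    return len(novel) / len(r_new | r_old)

r_new = {(0, 1), (0, 2, 0), (0, 3)}   # (0, 2, 0) duplicates an old result
r_old = {(0, 2, 0), (0, 4)}
novelty = dif(r_new, r_old)
```

Two of the three new results are genuinely novel, and the merged set has four nodes, giving a novelty of 0.5.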
By doing this, we can avoid presenting repeated or overlapping SLCA results to users. In other words, the consideration of novelty allows us to incrementally refine the diversified results into more specific ones as we process more query candidates.
Our problem is to find the top-k qualified query candidates and their relevant SLCA results. To do this, we can compute the absolute score of the search intention for each generated query candidate. However, to reduce the computational cost, an alternative is to calculate the relative scores of queries. Therefore, we have the following equation transformation. Substituting Equations (7) and (8) into Equation (2), we obtain the final equation
score(q_new) = γ * ∏_i (|R(s_i, T)| / |R(f_{ij_i}, T)|) * (|∩_i R(s_i, T)| / |R(T)|)
    * |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}|

  = (γ / |R(T)|) * ∏_i (|R(s_i, T)| / |R(f_{ij_i}, T)|) * |∩_i R(s_i, T)|
    * |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}|

  ∝ ∏_i (|R(s_i, T)| / |R(f_{ij_i}, T)|) * |∩_i R(s_i, T)|
    * |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}|,   (9)

where s_i ∈ q_new, s_i = k_i ∪ f_{ij_i}, k_i ∈ q, and q' ∈ Q; the symbol ∝ indicates that the left side of the equation depends only on the right side, because the value γ / |R(T)| remains unchanged when calculating the diversification scores of different search intentions.
Property 1 (Upper Bound of Top-k Query Suggestions). Suppose we have found k query suggestions Q = {q_1, ..., q_k} for an original keyword query q. If, for every q_x ∈ Q, we have score(q_x) >= ∏_i (|R(s_i, T)| / |R(f_{ij_i}, T)|) * min{|L_{k_i}|}, where R(s_i, T) (or R(f_{ij_i}, T)) is the result set of evaluating s_i (or f_{ij_i}) in the entity sample space of T and min{|L_{k_i}|} is the size of the shortest keyword node list over the keywords k_i in q, then the evaluation algorithm can be safely terminated, i.e., we can guarantee that the current k query suggestions are the top-k qualified ones w.r.t. q in T.
Proof. From Equation (9), we can see that Property 1 is proved if the following inequality holds:

min{|L_{k_i}|} >= |∩_i R(s_i, T)| * |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}|.
Since s_i = k_i ∪ f_{ij_i} ⊇ k_i, we must have min{|L_{s_i}|} <= min{|L_{k_i}|}. In addition, |∩_i R(s_i, T)| is bounded by the minimal size min{|L_{s_i}|} of the keyword node lists that contain s_i in T, i.e., |∩_i R(s_i, T)| <= min{|L_{s_i}|}. As such, we have |∩_i R(s_i, T)| <= min{|L_{k_i}|}. Because
the fraction |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}| <= 1, we can derive that

|∩_i R(s_i, T)| * |{v_x | v_x ∈ R(q_new, T) ∧ ∄ v_y ∈ {∪_{q'∈Q} R(q', T)} : v_x ⪯ v_y}| / |R(q_new, T) ∪ {∪_{q'∈Q} R(q', T)}| <= min{|L_{k_i}|}.

Therefore, the property is proved. □
Since we incrementally generate and assess the intended query suggestions in descending order of their probabilistic relevance values, Property 1 guarantees that we can skip the unqualified query suggestions and terminate our algorithms as early as possible once the top-k qualified ones have been identified.
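The termination test implied by Property 1 can be sketched as follows; the helper name and all numbers are illustrative, and the upper bound is the likelihood product times min{|L_{k_i}|}:

```python
def can_terminate(kth_score, likelihood_product, min_keyword_list_len):
    """True if no remaining candidate can beat the current k-th score, since
    likelihood_product * min|L_ki| upper-bounds any future candidate's score
    (candidates arrive in descending likelihood order)."""
    return kth_score >= likelihood_product * min_keyword_list_len

# current k-th score 3.2; the next candidate's likelihood product is 0.01 and
# the shortest keyword node list has 200 nodes, so its score is at most 2.0
safe = can_terminate(3.2, 0.01, 200)
```

Once this test succeeds, every later candidate has an even smaller likelihood product, so the k retained suggestions are final.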
3 EXTRACTING FEATURE TERMS
To address the problem of extracting meaningful feature
terms w.r.t. an original keyword query, there are two rele-
vant works [17], [18]. In [17], Sarkas et al. proposed a solu-
tion of producing top-k interesting and meaningful
expansions to a keyword query by extracting k additional
words with high “interestingness” values. The expanded
queries can be used to search more specific documents. The
interestingness is formalized with the notion of surprise [19],
[20], [21]. In [18], Bansal et al. proposed efficient algorithms
to identify keyword clusters in large collections of blog
posts for specific temporal intervals. Our work integrates
both of their ideas together: we first measure the correlation
of each pair of terms using our mutual information model
in Equation (1), which is a simple surprise metric; and then
we build term correlated graph that maintains all the terms
and their correlation values. Different from [17], [18], our
work utilizes entity-based sample information to build a
correlated graph with high precision for XML data.
In order to measure the correlation of a pair of terms efficiently, we use a statistical method that measures how much the co-occurrences of a pair of terms deviate from the independence assumption, where the entity nodes (e.g., the nodes with the "*" node types in the XML DTD) are taken as the sample space. For instance, given a pair of terms x and y, their mutual information score can be calculated based on Equation (1), where Prob(x, T) (or Prob(y, T)) is the number of entities containing x (or y) divided by the total entity size of the sample space, and Prob({x, y}, T) is the number of entities containing both x and y divided by the total entity size of the sample space.
In this work, we build the term correlation graph offline, i.e., we precompute it before processing queries. The correlation values among terms are also recorded in the graph, which is used to generate the term-feature dictionary W. During the XML data tree traversal, we first extract the meaningful text information from the entity nodes in the XML data, filtering out stop words. We then produce a set of term-pairs by scanning the extracted text. After that, all the generated term-pairs are recorded in the term correlation graph. In the procedure of building the correlation graph, we also record the count of each term-pair generated from different entity nodes. As such, after the XML data tree has been traversed completely, we can compute the mutual information score for each term-pair based on Equation (1). To reduce the size of the correlation graph, the term-pairs whose correlation is lower than a threshold can be filtered out. Based on the offline-built graph, we can select the top-m distinct terms as features for each given query keyword on the fly.
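The offline construction just described can be sketched as a single pass over entity texts; the whitespace tokenization, stop-word list, and threshold here are simplifying assumptions, not the paper's DBLP setup:

```python
import math
from collections import Counter
from itertools import combinations

def build_graph(entities, stopwords, threshold):
    """Count per-entity term co-occurrences, score each pair with the
    mutual information of Equation (1), and keep pairs above threshold."""
    term_count, pair_count = Counter(), Counter()
    for text in entities:
        terms = {t for t in text.lower().split() if t not in stopwords}
        term_count.update(terms)
        pair_count.update(frozenset(p) for p in combinations(sorted(terms), 2))
    n = len(entities)
    graph = {}
    for pair, c in pair_count.items():
        x, y = tuple(pair)
        p_xy = c / n
        # p_xy * log(p_xy / (p_x * p_y)) with p_x = count_x / n, p_y = count_y / n
        score = p_xy * math.log(c * n / (term_count[x] * term_count[y]))
        if score >= threshold:
            graph[pair] = score
    return graph

ents = ["database query optimization", "database systems", "query optimization"]
g = build_graph(ents, {"the", "a"}, threshold=0.0)
```

With these three toy entities, only the pairs that co-occur more often than independence predicts ({query, optimization} and {database, systems}) survive the threshold of zero.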
4 KEYWORD SEARCH DIVERSIFICATION ALGORITHMS
In this section, we first propose a baseline algorithm to retrieve the diversified keyword search results. Then, two anchor-based pruning algorithms are designed to improve the efficiency of keyword search diversification by utilizing the intermediate results.
4.1 Baseline Solution
Given a keyword query, the intuitive idea of the baseline algorithm is to first retrieve the relevant feature terms with high mutual information scores from the term correlation graph of the XML data T; then generate a list of query candidates sorted in descending order of their total mutual information scores; and finally compute the SLCAs as keyword search results for each query candidate and measure its diversification score. As such, the top-k diversified query candidates and their corresponding results can be chosen and returned.
Different from traditional XML keyword search, our work needs to evaluate multiple intended query candidates and generate a whole result set in which the results are diversified and distinct from each other. Therefore, we have to detect and remove the duplicate or ancestor SLCA results that have already been seen whenever we obtain newly generated results. The detailed procedure is shown in Algorithm 1. Given a keyword query q with n keywords, we first load its precomputed relevant feature terms from the term correlation graph G of the XML data T, which is used to construct a matrix M_{m×n}, as shown in line 1. We then generate a new query candidate q_new from the matrix M_{m×n} by calling the function GenerateNewQuery(), as shown in line 2. New query candidates are generated in descending order of their mutual information scores. Lines 3-7 show the procedure of computing Prob(q | q_new, T). To compute the SLCA results of q_new, we need to retrieve the precomputed node lists of the keyword-feature term pairs in q_new from T by getNodeList(s_{i_x j_y}, T). Based on the retrieved node lists, we can compute the likelihood of generating the observed query q while the intended query is actually q_new, i.e., Prob(q | q_new, T) = ∏_{f_{i_x j_y} ∈ s_{i_x j_y} ∈ q_new} (|l_{i_x j_y}| / getNodeSize(f_{i_x j_y}, T)) using Equation (5), where getNodeSize(f_{i_x j_y}, T) can be quickly obtained from the precomputed statistics of T. After that, we call the function ComputeSLCA(), which can be implemented using any existing XML keyword search method. In lines 8-16, we compare the SLCA results of the current query and the previous queries in order to obtain the distinct and diversified SLCA results. At line 17, we compute the final score of q_new as a diversified query candidate w.r.t. the previously generated query candidates in Q. At last, we compare the new query
and the previously generated query candidates and replace the unqualified ones in Q, as shown in lines 18-23. After processing all the possible query candidates, we return the top-k generated query candidates with their SLCA results.
Algorithm 1. Baseline Algorithm
input: a query q with n keywords, XML data T and its term correlated graph G
output: top-k search intentions Q and the whole result set F
1: M_{m×n} = getFeatureTerms(q, G);
2: while (q_new = GenerateNewQuery(M_{m×n})) ≠ null do
3:   f = null and prob_s_k = 1;
4:   l_{i_x j_y} = getNodeList(s_{i_x j_y}, T) for s_{i_x j_y} ∈ q_new ∧ 1 <= i_x <= m ∧ 1 <= j_y <= n;
5:   prob_s_k = ∏_{f_{i_x j_y} ∈ s_{i_x j_y} ∈ q_new} (|l_{i_x j_y}| / getNodeSize(f_{i_x j_y}, T));
6:   f = ComputeSLCA({l_{i_x j_y}});
7:   prob_q_new = prob_s_k * |f|;
8:   if F is empty then
9:     score(q_new) = prob_q_new;
10:  else
11:    for all result candidates r_x ∈ f do
12:      for all result candidates r_y ∈ F do
13:        if r_x == r_y or r_x is an ancestor of r_y then
14:          f.remove(r_x);
15:        else if r_x is a descendant of r_y then
16:          F.remove(r_y);
17:    score(q_new) = prob_q_new * |f| / (|f| + |F|);
18:  if |Q| < k then
19:    put q_new : score(q_new) into Q;
20:    put q_new : f into F;
21:  else if score(q_new) > score(q'_new) for some q'_new ∈ Q then
22:    replace q'_new : score(q'_new) with q_new : score(q_new);
23:    F.remove(q'_new);
24: return Q and result set F;
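A compact sketch of Algorithm 1's control flow, assuming the candidate generator, SLCA computation, and likelihood are supplied externally; for brevity it deduplicates exact repeats only, omitting the ancestor/descendant removal of lines 11-16:

```python
def baseline(candidates, compute_slca, likelihood, k):
    """Keep the k best-scoring candidates while collecting distinct results."""
    Q, F = {}, set()           # candidate -> score, global distinct results
    for q_new in candidates:   # descending mutual-information order
        results = set(compute_slca(q_new))
        prob = likelihood(q_new) * len(results)      # lines 3-7
        fresh = results - F                          # drop already-seen results
        score = prob * len(fresh) / max(len(fresh) + len(F), 1)  # line 17
        if len(Q) < k:                               # lines 18-20
            Q[q_new] = score
        else:                                        # lines 21-23
            worst = min(Q, key=Q.get)
            if score <= Q[worst]:
                continue
            del Q[worst]
            Q[q_new] = score
        F |= fresh
    return Q, F

Q, F = baseline(
    ["database systems;query expansion", "relational database;query expansion"],
    lambda q: {1, 2} if "systems" in q else {2, 3},  # stand-in SLCA sets
    lambda q: 0.5,                                   # stand-in likelihood
    k=1,
)
```

With k = 1, the second candidate's only fresh result is penalized against the two already-kept results, so its score (about 0.33) cannot displace the first candidate's score of 1.0.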
In the worst case, every possible query in the matrix may be chosen as one of the top-k qualified query candidates. In this worst case, the complexity of the algorithm is O(m^{|q|} * L_1 * Σ_{i=2}^{|q|} log L_i), where L_1 is the shortest node list of any generated query, |q| is the number of original query keywords, and m is the number of selected features for each query keyword. In practice, the complexity of the algorithm can be reduced by lowering the number m of feature terms, which bounds the number m^{|q|} of generated query candidates.
4.2 Anchor-Based Pruning Solution
By analyzing the baseline solution, we can see that its main cost is spent on computing SLCA results and removing unqualified SLCA results from the newly and previously generated result sets. To reduce this computational cost, we design an anchor-based pruning solution, which avoids the unnecessary computation of unqualified SLCA results (i.e., duplicates and ancestors). In this section, we first analyze the interrelationships between the intermediate SLCA candidates already computed for the generated query candidates Q and the nodes to be merged for answering the newly generated query candidate q_new. We then give the detailed description and algorithm of the anchor-based pruning solution.
4.2.1 Properties of Computing Diversified SLCAs
Definition 1 (Anchor nodes). Given a set of query candidates
Q that have been already processed and a new query candidate
qnew, the generated SLCA results F of Q can be taken as the
anchors for efficiently computing the SLCA results of qnew by
partitioning the keyword nodes of qnew.
Example 2. Fig. 1 shows the usability of anchor nodes for computing SLCA results. Consider two SLCA results X1 and X2 (assume X1 precedes X2) for the current query set Q. For the next query q_new = {s1, s2, s3} and its keyword instance lists L = {l_s1, l_s2, l_s3}, the keyword instances in L will be divided into four areas by the anchor X1: 1) L_{X1_anc}, in which all the keyword nodes are ancestors of X1, so L_{X1_anc} cannot generate new and distinct SLCA results; 2) L_{X1_pre}, in which all the keyword nodes are previous siblings of X1, so we may generate new SLCA results if those results are still bounded in the area; 3) L_{X1_des}, in which all the keyword nodes are descendants of X1, so it may produce new SLCA results that will replace X1; and 4) L_{X1_next}, in which all the keyword nodes are next siblings of X1, so it may produce new results, but it may be further divided by the anchor X2. If there is no intersection of all the keyword node lists in an area, then the nodes in this area can be pruned directly; e.g., l3_s1 and l3_s2 can be pruned without computation if l3_s3 is empty in L_{X1_des}. Similarly, we can process L_{X2_pre}, L_{X2_des} and L_{X2_next}. After that, a set of new and distinct SLCA results can be obtained with regard to the new query set Q ∪ {q_new}.
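Assuming the range encoding scheme mentioned in the proof of Theorem 1, the four-area partition can be sketched as interval comparisons against the anchor's (start, end) range; all node ranges below are invented:

```python
def partition(anchor, nodes):
    """Split keyword nodes into ancestor / previous / descendant / next areas
    relative to the anchor's (start, end) encoding range."""
    a_start, a_end = anchor
    areas = {"anc": [], "pre": [], "des": [], "next": []}
    for start, end in nodes:
        if start < a_start and end > a_end:
            areas["anc"].append((start, end))    # encloses anchor: prunable
        elif end < a_start:
            areas["pre"].append((start, end))    # entirely before the anchor
        elif start > a_end:
            areas["next"].append((start, end))   # entirely after the anchor
        else:
            areas["des"].append((start, end))    # inside the anchor's range
    return areas

# anchor X1 has range (10, 20); four keyword nodes fall one in each area
areas = partition((10, 20), [(1, 30), (2, 5), (12, 13), (25, 27)])
```

Only the "pre", "des", and "next" areas need SLCA computation; the "anc" area is discarded outright, matching Theorem 1.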
Theorem 1 (Single Anchor). Given an anchor node v_a and a new query candidate q_new = {s_1, s_2, ..., s_n}, its keyword node lists L = {l_{s_1}, l_{s_2}, ..., l_{s_n}} can be divided into four areas anchored by v_a: the keyword nodes that are ancestors of v_a, denoted L_{v_a_anc}; the keyword nodes that are previous siblings of v_a, denoted L_{v_a_pre}; the keyword nodes that are descendants of v_a, denoted L_{v_a_des}; and the keyword nodes that are next siblings of v_a, denoted L_{v_a_next}. We have that L_{v_a_anc} does not generate any new result; each of the other three areas may generate new and distinct SLCA results
Fig. 1. Usability of anchor nodes.
LI ET AL.: CONTEXT-BASED DIVERSIFICATION FOR KEYWORD QUERIES OVER XML DATA 665
7. individually; no new and distinct SLCA results can be gener-
ated across the areas.
Proof. For the keyword nodes in the ancestor area, all of
them are ancestors of the anchor node va, which is an
SLCA node w.r.t. Q. All these keyword nodes can be
directly discarded due to the exclusive property of
SLCA semantics. For the keyword nodes in the area
Lva-pre (or Lva-next), we need to compute their SLCA candidates,
but the computation can be bounded by the preorder
(or postorder) traversal number of va, where we assume
the range encoding scheme is used in this work. For the
keyword nodes in Lva-des, if they produce SLCA candidates
that are bounded in the range of va, then these candidates
will be taken as new and distinct results while va is
removed from the result set. Finally, we certify that the
keyword nodes across any two areas cannot produce new
and distinct SLCA results. This is because the only possible
SLCA candidates they may produce are the node va and its
ancestor nodes, and these nodes cannot become new and
distinct SLCA results due to the anchor node va. For
instance, if we select some keyword nodes from Lva-pre
and some keyword nodes from Lva-des, then their corresponding
SLCA candidate must be an ancestor of va, which cannot
become a new and distinct SLCA result due to the exclusive
property of SLCA semantics. Similarly, no new result can be
generated if we select some keyword nodes from the other
two areas Lva-pre and Lva-next (or Lva-des and Lva-next). □
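The four-way split of Theorem 1 can be made concrete with range-encoded nodes. The sketch below is illustrative, not the paper's implementation: each node is assumed to carry a (start, end) range as in the range encoding scheme mentioned in the proof, and the function name is our own.

```python
def partition_by_anchor(nodes, anchor):
    """Split keyword nodes into the four areas anchored by `anchor`.

    With range encoding, a node (s, e) is an ancestor of anchor (as, ae)
    iff s < as and e > ae; a descendant iff s > as and e < ae; a previous
    sibling iff e < as; a next sibling iff s > ae.
    """
    a_start, a_end = anchor
    areas = {"anc": [], "pre": [], "des": [], "next": []}
    for s, e in nodes:
        if s < a_start and e > a_end:
            areas["anc"].append((s, e))   # discarded: cannot yield new SLCAs
        elif e < a_start:
            areas["pre"].append((s, e))   # may yield results bounded left of va
        elif s > a_start and e < a_end:
            areas["des"].append((s, e))   # may yield results replacing va
        elif s > a_end:
            areas["next"].append((s, e))  # may be split again by later anchors
    return areas
```

For the anchor range (16, 23) of Example 3 below, a node encoded (1, 30) would fall into the ancestor area and a node encoded (25, 28) into the next-sibling area.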
Example 3. Fig. 2 gives a small tree that contains four different
terms k1, k2, f1 and f2, with the encoding ranges
attached to all the nodes. Assume f1 and f2 are two feature
terms of a given keyword query {k1, k2}, and f1 has
higher correlation with the query than f2. After the first
intended query {k1, k2, f1} is evaluated, the node v10 will
be the SLCA result. When we process the next intended
query {k1, k2, f2}, the node v10 will be an anchor node to
partition the keyword nodes into four parts by using its
encoding range [16-23]: Lpre = {v2, v3, v4, v5, v6, v7, v9}, in
which each node's postorder number is less than 16;
Lnext = {v14, v15, v16, v17}, in which each node's preorder
number is larger than 23; Lanc = {v1, v8}, in which each
node's preorder number is less than 16 while its postorder
number is larger than 23; and the rest of the nodes belong
to Ldes = {v11, v12, v13}. Although v8 contains f2, it cannot
contribute more specific results than the anchor node v10
due to the exclusive property of SLCA semantics. So we only
need to evaluate {k1, k2, f2} over the three areas Lpre, Lnext
and Ldes individually. Since no node contains f2 in Ldes,
all the keyword nodes in Ldes can be skipped. At last, the
nodes v4 and v16 will be two additional SLCA results. If
there are further intended queries to be processed, then
their relevant keyword nodes can be partitioned by
the current three anchor nodes v4, v10, v16. Generally, the
number of keyword nodes that can be skipped increases with
the number of selected anchor nodes, which can
significantly reduce the time cost of the algorithms, as
shown in the experiments.
Theorem 2 (Multiple Anchors). Given multiple anchor nodes
Va and a new query candidate qnew = {s1, s2, ..., sn}, its keyword
node lists L = {ls1, ls2, ..., lsn} can be maximally
divided into (3 × |Va| + 1) areas anchored by the nodes in
Va. Only the nodes in (2 × |Va| + 1) areas can generate new
SLCA candidates individually, and these (2 × |Va| + 1)
areas are independent for computing SLCA
candidates.
Proof. Let Va = (x1, ..., xi, ..., xj, ..., x|Va|), where xi precedes
xj for i < j. We first partition the space into four
areas using x1, i.e., Lx1-anc, Lx1-pre, Lx1-des and Lx1-next.
For Lx1-next, we partition it further using x2 into four
areas. We repeat this process until we partition the area
Lx(|Va|-1)-next into four areas Lx|Va|-anc, Lx|Va|-pre, Lx|Va|-des and
Lx|Va|-next by using x|Va|. Obviously the total number of
partitioned areas is 3 × |Va| + 1. As the ancestor areas do not
contribute to SLCA results, we end up with 2 × |Va| + 1
areas that need to be processed.
Theorem 1 guarantees that any keyword nodes
across different areas cannot contribute new and distinct
SLCA results. This is because any possible result
candidate generated by these keyword nodes
must be an ancestor of at least one current SLCA result.
Due to the exclusive property of SLCA semantics, no
new and distinct SLCA results can be produced across
different areas. □
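The repeated splitting in the proof of Theorem 2 can be sketched directly: each successive anchor splits only the "next" area left by the previous one, which is where the 3 × |Va| + 1 bound comes from. The (start, end) range representation and the function name are our assumptions, not the paper's code.

```python
def partition_by_anchors(nodes, anchors):
    """Partition range-encoded nodes by anchors sorted in preorder.

    Each anchor (a_start, a_end) contributes an ancestor, a previous-sibling,
    and a descendant area; the remaining next-sibling area is split further
    by the following anchor. The result has 3 * len(anchors) + 1 areas
    (some possibly empty), matching Theorem 2.
    """
    areas = []
    rest = list(nodes)
    for a_start, a_end in anchors:
        anc = [n for n in rest if n[0] < a_start and n[1] > a_end]
        pre = [n for n in rest if n[1] < a_start]
        des = [n for n in rest if n[0] > a_start and n[1] < a_end]
        areas.extend([anc, pre, des])
        rest = [n for n in rest if n[0] > a_end]  # "next" area: split further
    areas.append(rest)  # the final next-sibling area of the last anchor
    return areas
```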
Property 2. Consider the (2 × |Va| + 1) effective areas
anchored by the nodes in Va. If there exists si ∈ qnew none of
whose instances appear in an area, then the area can be pruned
because it cannot generate SLCA candidates of qnew.
The reason is that any area that can possibly generate new
results must contain at least one matched instance for each
keyword in the query, based on the SLCA semantics.
Therefore, if an area contains no instance of some keyword,
then the area can be removed definitely without
any computation.
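Property 2 amounts to a cheap emptiness test per area before any SLCA computation. A minimal sketch (representing an area as one node list per keyword of qnew is our assumption):

```python
def area_is_prunable(area_keyword_lists):
    """An area can be pruned if some keyword of qnew has no instance in it
    (Property 2): the area then cannot contain a subtree matching every
    keyword, so no SLCA candidate of qnew can come from it."""
    return any(len(lst) == 0 for lst in area_keyword_lists)
```

For instance, an area whose lists are [[v5, v6], []] is pruned outright, while [[v5], [v6]] must still be evaluated.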
Following Example 3, if a third intended
query needs to be evaluated and v4, v10 and v16 have
been selected as anchor nodes, then we have ten areas of
keyword nodes: Lv4-pre = {v3}; Lv4-anc = {v1, v2}; Lv4-des =
{v5, v6}; Lv4-next/v10-pre = {v7, v9}; Lv10-anc = {v1, v8};
Lv10-des = {v11, v12, v13}; Lv10-next/v16-pre = {v15}; Lv16-anc =
{v1, v14}; Lv16-des = {v17}; and Lv16-next = NIL. Three of
them can be directly discarded due to the exclusive property
of SLCA semantics, i.e., Lv4-anc, Lv10-anc and Lv16-anc.
Lv16-next can also be removed because it is empty. The rest
are first filtered if they do not contain all the keywords
of the third intended query, before computing the
possible new and distinct SLCA results.
Fig. 2. Demonstrating the usability of anchor nodes in a tree.
666 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 3, MARCH 2015
4.2.2 Anchor-Based Pruning Algorithm
Motivated by the properties of computing diversified
SLCAs, we design the anchor-based pruning algorithm.
The basic idea is described as follows. We generate the
first new query and compute its corresponding SLCA
candidates as a start point. When the next new query is
generated, we can use the intermediate results of the pre-
viously generated queries to prune the unnecessary nodes
according to the above theorems and property. By doing
this, we only generate the distinct SLCA candidates every
time. That is to say, unlike the baseline algorithm, the
diversified results can be computed directly without fur-
ther comparison.
The detailed procedure is shown in Algorithm 2. Similar
to the baseline algorithm, we need to construct the matrix of
feature terms and retrieve their corresponding node lists,
where the node lists can be maintained using an R-tree index.
We can then calculate the likelihood of generating the
observed query q when the issued query is qnew. Different
from the baseline algorithm, we utilize the intermediate
SLCA results of previously generated queries as anchors
to efficiently compute the new SLCA results for the following
queries. For the first generated query, we can compute
the SLCA results using any existing XML keyword search
method, as the baseline algorithm does (line 18).
Here, we use a stack-based method to implement the function
ComputeSLCA().
Algorithm 2. Anchor-Based Pruning Algorithm
Input: a query q with n keywords, XML data T and its
term correlated graph G
Output: top-k query intentions Q and the whole result
set F
1: M_{m×n} = getFeatureTerms(q, G);
2: while (qnew = GenerateNewQuery(M_{m×n})) ≠ null do
3:   Lines 3-5 in Algorithm 1;
4:   if F is not empty then
5:     for all vanchor ∈ F do
6:       get l_{ixjy}-pre, l_{ixjy}-des, and l_{ixjy}-next by calling
         Partition(l_{ixjy}, vanchor);
7:       if ∀ l_{ixjy}-pre ≠ null then
8:         f' = ComputeSLCA({l_{ixjy}-pre}, vanchor);
9:       if ∀ l_{ixjy}-des ≠ null then
10:        f'' = ComputeSLCA({l_{ixjy}-des}, vanchor);
11:       f += f' + f'';
12:       if f'' ≠ null then
13:         F.remove(vanchor);
14:       if ∃ l_{ixjy}-next = null then
15:         break the for-loop;
16:       l_{ixjy} = l_{ixjy}-next for 1 ≤ ix ≤ m ∧ 1 ≤ jy ≤ n;
17:   else
18:     f = ComputeSLCA({l_{ixjy}});
19:   score(qnew) = prob(q|qnew) * |f| * |f| / (|F| + |f|);
20:   Lines 18-23 in Algorithm 1;
21: return Q and result set F;
The results of the first query are taken as anchors to
prune the node lists of the next query, reducing its evaluation
cost (lines 5-16). Given an anchor node vanchor, for
each node list l_{ixjy} of a query keyword in the current new
query, we may get three effective node lists l_{ixjy}-pre, l_{ixjy}-des
and l_{ixjy}-next using the R-tree index by calling the function
Partition(). If a node list is empty, e.g., l_{ixjy}-pre = null, then
we do not need to get the node lists for the other query keywords
in the same area, e.g., in the left-bottom area of
vanchor, because that area cannot produce any SLCA candidates
at all. Consequently, the nodes in this area cannot generate
new and distinct SLCA results. If all the node lists
have at least one node in the same area, then we compute
the SLCA results by the function ComputeSLCA(), e.g.,
ComputeSLCA({l_{ixjy}-des}, vanchor), which merges the nodes in
{l_{ixjy}-des}. If the SLCA results are descendants of vanchor,
then they are recorded as new distinct results and vanchor
is removed from the temporary result set. Through
line 19, we can obtain the final score of the current query
without comparing the SLCA results of f with those of F. At
last, we record the score and the results of the new
query into Q and F, respectively. After all the necessary
queries are computed, the top-k diversified queries and
their results are returned.
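For reference, the SLCA semantics that ComputeSLCA() must realize can be written down independently of the anchor-based machinery. The brute-force sketch below is our own reference formulation over a parent-pointer tree, not the paper's stack-based implementation:

```python
def slca(parent, keyword_occurrences):
    """Reference SLCA computation.

    parent: dict mapping child -> parent (the root has no entry or None).
    keyword_occurrences: one iterable of nodes per query keyword.
    Returns the SLCA set: nodes whose subtree contains every keyword and
    that have no proper descendant doing the same (exclusive property).
    """
    def chain(v):                       # v and all of its ancestors
        while v is not None:
            yield v
            v = parent.get(v)

    nodes = set(parent) | {p for p in parent.values() if p is not None}
    contains = {v: set() for v in nodes}
    for i, occs in enumerate(keyword_occurrences):
        for n in occs:
            for a in chain(n):          # propagate keyword i up the tree
                contains[a].add(i)

    k = len(keyword_occurrences)
    full = {v for v in nodes if len(contains[v]) == k}
    # keep only full nodes with no full proper descendant
    return {v for v in full
            if not any(v in chain(parent.get(u)) for u in full if u != v)}
```

Any optimized variant (stack-based, anchor-pruned) must return exactly this set.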
4.3 Anchor-Based Parallel Sharing Solution
Although the anchor-based pruning algorithm can avoid
unnecessary computation cost of the baseline algorithm, it
can be further improved by exploiting the parallelism of
keyword search diversification and reducing the repeated
scanning of the same node lists.
4.3.1 Observations
According to the semantics of keyword search diversifica-
tion, only the distinct SLCA results need to be returned to
users. We have the following two observations.
Observation 1. Anchored by the SLCA result set Va of previously
processed query candidates Q, the keyword
nodes of the next query candidate qnew can be classified
into 2 × |Va| + 1 areas. According to Theorem 2, no new
and distinct SLCA results can be generated across areas.
Therefore, the 2 × |Va| + 1 areas of nodes can be processed
independently, i.e., we can compute the SLCA results
area by area. This makes parallel keyword search
diversification efficient, with no communication cost
among processors.
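Observation 1 makes each area an independent unit of work. A minimal sketch using a thread pool (the pool mechanism and function names are our choices; the paper does not fix them at this level):

```python
from concurrent.futures import ThreadPoolExecutor

def compute_slcas_parallel(areas, compute_slca_for_area):
    """Evaluate each of the 2*|Va|+1 effective areas independently.

    Since no SLCA result spans two areas (Theorem 2), workers need no
    communication, and the concatenation of per-area results is the
    final answer; map() preserves the area order.
    """
    with ThreadPoolExecutor() as pool:
        per_area = pool.map(compute_slca_for_area, areas)
    return [r for results in per_area for r in results]
```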
Observation 2. Because there are term overlaps between
the generated query candidates, the intermediate partial
results of the previously processed query candidates
may be reused for evaluating the following queries, by
which the repeated computations can be avoided.
4.3.2 Anchor-Based Parallel Sharing Algorithm
To make the parallel computing efficient, we utilize the
SLCA results of previous queries as the anchors to partition
the node lists that need to be computed. By assigning areas
to processors, no communication cost among the processors
is required. Our proposed algorithm guarantees that the
results generated by each processor are SLCA results of
the current query. In addition, we also take into account the
shared partial matches among the generated query
candidates, by which we can further improve the efficiency
of the algorithm.
Different from the above two proposed algorithms, we
first generate and analyse all the possible query candidates
Qnew. Here, we use a vector V to maintain the shared query
segments among the generated queries in Qnew. And then,
we begin to process the first query like the above two algorithms.
When the next query arrives, we check its
shared query segments and exploit parallel computing to
evaluate the query candidate.
To do this, we first check whether the new query candidate qnew
contains shared parts c in V. For each shared part in c, we
need to check its processing status. If the status has already
been set as "processed", then the partial matches
of the shared part have been computed before; in this case,
we only need to retrieve the partial matches from the
previously cached results. Otherwise, we have not
computed the partial matches of the shared part before, so we
have to compute them from the original node
lists of the shared segments, after which the processing status
is set as "processed". And then, the node lists of the
rest of the query segments are processed. In this algorithm, we
also exploit parallel computing to improve the performance
of query evaluation. At the beginning, we specify the maximal
number of processors and the current processor's id
(denoted as PID). And then, we distribute the nodes that
need to be computed to the processors in round-robin fashion.
When all the required nodes have arrived at a processor, the
processor is activated to compute the SLCA results of
qnew or the partial matches for the shared segments in qnew.
After all active processors complete their SLCA computations,
we get the final SLCA results of qnew. At last, we
calculate the score of qnew and compare it with the
previous ones. If its score is larger than that of one of the
query candidates in Q, then the query candidate qnew, its
score score(qnew) and its SLCA results are recorded.
After all the necessary queries are processed, we return
the top-k qualified and diversified query candidates and
their corresponding SLCA results. The detailed algorithm is
not provided due to limited space.
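The shared-segment bookkeeping described above can be sketched as a small cache keyed by the segment's term set, with the "processed" status implicit in key membership. The structure and names are our assumptions, not the paper's data layout:

```python
class SegmentCache:
    """Cache partial matches of shared query segments so each segment is
    evaluated at most once across all generated query candidates."""

    def __init__(self):
        self._store = {}  # frozenset(terms) -> cached partial matches

    def get_or_compute(self, segment_terms, compute):
        key = frozenset(segment_terms)
        if key not in self._store:              # status: not yet "processed"
            self._store[key] = compute(segment_terms)
        return self._store[key]                 # status: "processed"
```

Because the key is a frozen set of terms, two queries sharing the segment {k1, f1} hit the same cache entry regardless of term order.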
5 EXPERIMENTS
In this section, we show the extensive experimental results
for evaluating the performance of our baseline algorithm
(denoted as baseline evaluation BE) and anchor-based algo-
rithm (denoted as anchor-based evaluation AE), which were
implemented in Java and run on a 3.0 GHz Intel Pentium
4 machine with 2 GB RAM running Windows XP. Our
anchor-based parallel sharing algorithm (denoted as ASPE)
was implemented using six computers, which serve
as six processors for parallel computation.
5.1 Data Set and Queries
We use a real data set, DBLP [22] and a synthetic XML
benchmark data set XMark [23] for testing the proposed
XML keyword search diversification model and our
designed algorithms. The size of DBLP data set is 971 MB
and the size of the generated XMark data set is 697 MB. Compared
with the DBLP data set, the synthetic XMark data set has
varied depths and complex data structures, but it
does not contain clear semantic information because
its data are synthetic. Therefore, we only use the DBLP data set to
measure the effectiveness of our diversification model in
this work.
For each XML data set used, we selected some terms
based on the following criteria: 1) a selected term should
often appear in user-typed keyword queries; 2) a selected
term should highlight different semantics when it co-occurs
with feature terms in different contexts. Based on the crite-
ria of selection, we chose some terms for each data set as fol-
lows. For DBLP data set, we selected “database, query,
programming, semantic, structure, network, domain, dynamic,
parallel”. And we chose “brother, gentleman, look, free, king,
gender, iron, purpose, honest, sense, metals, petty, shakespeare,
weak, opposite” for XMark data set. By randomly combining
the selected terms, we generated 14 sets of keyword queries
for each data set.
In addition, we also designed 10 specific DBLP keyword
queries with clear search intentions as the ground truth
query set. And then, we modified the clear keyword queries
into vague queries by masking some keywords, e.g., “query
optimization parallel database”, “query language database” and
“dynamic programming networks” where the keywords with
underlines are masked.
5.2 Effectiveness of Diversification Model
Three metrics will be applied to evaluate the effectiveness of
our diversification model by assessing its relevance, diver-
sity and usefulness.
5.2.1 Test Adapted-nDCG
The normalized discounted cumulative gain (nDCG) has
established itself as the standard evaluation measure when
graded relevance values are available [9], [13], [24]. The first
step in the nDCG computation is the creation of a gain vec-
tor G. The gain G½kŠ at rank k can be computed as the rele-
vance of the result at this rank to the user’s keyword query.
The gain is discounted logarithmically with increasing rank,
to penalize documents lower in the ranking, reflecting the
additional user effort required to reach them. The dis-
counted gain is accumulated over k to obtain the DCG value
and normalized using the ideal gain at rank k to finally
obtain the nDCG value. For the relevance-based retrieval,
the ideal gain may be considered as an identical one for
some users. However, for the diversification-based retrieval,
the ideal gain cannot be considered as an identical one for
users because users may prefer to see different results for a
query. Therefore, we need to adapt the nDCG measure to test the
effectiveness of the query diversification model.
Since it is difficult to find the ground truth for query
diversification, we carefully selected the ground truth for
our tested keyword queries. Here, we employed human
assessors to manually mark the ground truth. We invited
three groups of users to conduct a user study, where each group
has ten undergraduate volunteers and all users share
similar scoring criteria, i.e., interestingness, meaningfulness,
and novelty. In this paper, Google and Bing search
engines are used to explore diversified search intentions as
a standard diverse suggestion set for a given keyword
query. To do this, the volunteers in the first and second
668 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 3, MARCH 2015
groups evaluated the original query using Google and Bing
search engines in the domain of DBLP (informatik.uni-trier.de/~ley/db/).
By checking at most 100 relevant results,
each user ui can identify 10 diversified queries based on
her/his satisfaction and give each diversified query qdiv a
satisfaction credit Credit(ui, qdiv) in the range [0-10]. By
doing this, each group can produce a set of different diversified
queries where the credit of each query is given by its
average score:

\frac{\sum_{i=1}^{10} Credit(u_i, q_{div})}{10},

where Credit(ui, qdiv) = 0 if
qdiv does not appear in the list of 10 diversified queries of ui.
After that, the top-10 diversified queries with the highest
scores can be selected for each group. As such, the first and
second groups produce two independent ideal gains for
a keyword query using the Google and Bing search engines.
Similarly, the third group of users marked the diversified
query suggestions produced by our work, which is used to
calculate the gain G[k]. Based on the formula

nDCG_k = \frac{DCG_k}{idealDCG_k},

we can calculate the nDCG of each keyword query
based on the statistical information supplied by the three
groups of users.
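The nDCG computation just described can be sketched as follows. The base-2 logarithmic discount is a common convention that we assume here, since the section does not fix the discount base:

```python
import math

def ndcg_at_k(gains, ideal_gains, k):
    """DCG_k accumulates rank-discounted gains; nDCG_k normalizes by the
    DCG of the ideal gain vector (best gains ranked first)."""
    def dcg(gain_vector):
        return sum(g / math.log2(rank + 2)        # ranks are 0-based here
                   for rank, g in enumerate(gain_vector[:k]))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A ranking identical to the ideal one scores exactly 1.0; misordering the same gains pushes the score below 1.0.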
Figs. 3a, 3b, 3c, and 3d show the nDCG values of keyword
queries over the DBLP data set where only the top-10
diversified query suggestions are taken into account. From
Figs. 3a and 3b, we can see that the nDCG values are no less
than 0.8 when k ≥ 2, i.e., our proposed diversification model
has an 80 percent chance of approaching the ground truth of
ideal diversified query suggestions. Fig. 3c shows that for query
set q9, our diversification model only has a 60 percent
chance of approaching the ground truth of ideal diversified
query suggestions. Fig. 3d also shows that the average nDCG
value of the 11th keyword query set q11 based on the Google
engine is near 0.6. For these kinds of queries, we need to
recommend more diversified query suggestions to improve the
effectiveness of our diversification model.
5.2.2 Test Differentiation of Diversified Query Results
To show the effectiveness of diversification model, we also
tested the differentiation between result sets by evaluating
our suggested query candidates. This assessment is used to
measure the possibility of diversified results based on the
suggested diversified query candidates. To do this, we first
selected the top-5 diversified query suggestions for each
original keyword query. And then, we evaluated each
diversified query suggestion over the DBLP data set, respectively.
From each list of relevant results, we chose its top-10
most relevant results. As such, we have 50 results as a
sample for each original keyword query to test the diversity
of its results. The assessing metric can be summarized by
the following equation:
\frac{\sum_{q_i, q_j \in top\text{-}kDS(q)} \left(1 - \frac{\sum_{r_i \in R(q_i),\, r_j \in R(q_j)} Sim(r_i, r_j)}{|R(q_i)| \times |R(q_j)|}\right)}{C_k^2}, \qquad (10)

where top-kDS(q) represents the top-k diversified query suggestions;
R(qi) represents the set of selected most relevant
results for the diversified query suggestion qi; and
Sim(ri, rj) represents the similarity of two results ri and rj,
measured by an n-gram model.
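Eq. (10) reads as the average pairwise dissimilarity between the result samples of the top-k suggestions. A sketch, where we assume Jaccard similarity over character bigrams as the n-gram model (the paper does not specify the exact variant):

```python
from itertools import combinations

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def sim(r1, r2):
    """Assumed n-gram similarity: Jaccard over character bigrams."""
    a, b = bigrams(r1), bigrams(r2)
    return len(a & b) / len(a | b) if a | b else 1.0

def differentiation(result_lists):
    """Eq. (10): average over all C(k,2) suggestion pairs of one minus the
    mean cross-list similarity of their result samples."""
    pairs = list(combinations(result_lists, 2))
    total = 0.0
    for Ri, Rj in pairs:
        mean_sim = sum(sim(ri, rj) for ri in Ri for rj in Rj) / (len(Ri) * len(Rj))
        total += 1 - mean_sim
    return total / len(pairs)
```

Identical result lists give a differentiation of 0; fully disjoint texts approach 1.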
Similarly, we tested the differentiation of diversified
query results using DivQ in [13] and compared its experimental
results with ours (denoted as DivContext). Fig. 4
demonstrates the differentiation of the results with
regard to the 14 selected sets of queries over the DBLP data
set. The experimental results show that our diversification
model provides significantly better differentiated results
than DivQ. This is because structural differences in the structured
queries generated by DivQ cannot exactly reflect the
differentiation of the results generated by these structured
queries. For the sixth, 10th and 11th sets of queries, DivQ
can reach up to 80 percent because these three sets of queries
contain a few specific keywords and their produced structured
queries (i.e., query suggestions in DivQ) have significantly
different structures.
Fig. 3. nDCG values versus top-k query suggestions.
The experimental study
shows that our context-sensitive diversification model outperforms
the DivQ probabilistic model due to the consideration
of both word context (i.e., the term correlated graph) and
structure (i.e., the exclusive property of SLCA).
5.2.3 Test Adapted Keyword Queries
This part of the experiment builds on the general assumption
that users often pick the queries of interest ranked at
the top of the list of query suggestions. Therefore, in this
section, we first selected 10 specific keyword queries
with clear semantics and manually generated 10 corre-
sponding vague keyword queries by removing some key-
words from each specific keyword query. And then, we
computed a list of diversified query suggestions for each
vague keyword query. By checking the positions of the
specific queries appearing in their corresponding query
suggestion lists, we can test the usefulness of our diversi-
fication model. Here, we utilized a simple metric to
quantify the value of usefulness. For instance, if the spe-
cific keyword query occurs at the top 1-5 positions in its
corresponding suggestion list, then the usefulness of the
diversification model is marked by 1. If it is located at
the top 6-10 positions, then the usefulness is marked by
0.5. If it is located at the position range 11-20, then the
usefulness is marked as 0.25. Otherwise, the suggestion
is not useful, because users seldom
check more than 20 suggestions in sequence, i.e., the
usefulness is labeled as zero.
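The position-based usefulness credit described above, written as a small function (1-based positions; None meaning the specific query never appears in the suggestion list):

```python
def usefulness(position):
    """Map the rank of the specific query in the suggestion list to the
    usefulness credit defined in the text: 1-5 -> 1.0, 6-10 -> 0.5,
    11-20 -> 0.25, otherwise (or absent) -> 0.0."""
    if position is None or position > 20:
        return 0.0
    if position <= 5:
        return 1.0
    if position <= 10:
        return 0.5
    return 0.25
```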
We also compared our work with the diversification
model in [13]. Since [13] uses structured queries to represent
query suggestions while ours uses a set of terms as query
suggestions, we manually transformed their structured
queries into the corresponding term sets based on the con-
text of the structured queries. To make the comparison
fair, a structured query is considered a specific keyword
query if its corresponding term set contains all the keywords
in the specific keyword query.
From Fig. 5, we can see that the specific keyword queries
appear within the top five query suggestions for all ten adapted
vague queries by our method (DivContext). But for the DivQ method,
only one of them can be observed within the top five query
suggestions, three are found within the top 10 suggestions,
another three within the top 20 suggestions, and the rest do not
get into the top 20 suggestions. From the experimental
results, we can see that a qualified suggestion
model should consider the relevance of the query
suggestions, as well as the novelty of the results generated
by these query suggestions.
5.3 Efficiency of Diversification Approaches
Figs. 6a and 6b show the average response time of comput-
ing all the SLCA results for the selected original keyword
queries by using XRank [2], and directly computing their
diversifications and corresponding SLCA results by using
our proposed BE, AE and ASPE. According to the experimental
results, our diversified algorithms consumed only 20 percent
of the time of XRank to obtain the top-k qualified diversified
query suggestions and their diverse results, where k is 5, 10,
and 15, respectively. When k = 20, our algorithms consumed at
most 30 percent of the time of XRank. Here, the time of XRank
only counts the cost of computing SLCA results for the original
keyword queries. If we further took into account the time of
selecting diverse results, our algorithms would perform much
better than XRank. Especially when users are only interested
in a few search contexts, our context-based keyword query
diversification algorithms can greatly outperform post-processing
diversification approaches. The main reason is that the
computational cost is bounded by the minimal keyword node list:
the specific feature terms with short node lists can be used
to skip many keyword nodes in the other node lists, so the
efficiency can be improved greatly.
Fig. 4. Differentiation of diversified query results.
Fig. 5. Usefulness of diversification model.
Fig. 6. Average time comparison of original keyword queries and their
suggested query candidates.
From Fig. 7a, we can see that BE consumed more time to
answer a vague query as the number of diversified query
suggestions increases over the DBLP data set, i.e., it takes up to
5.7 s to answer a vague query with diversified results when the
number of qualified suggestions is set to 20. However, AE
and ASPE can do so in about 3.5 and 2 s, respectively. This
is because many nodes can be skipped by anchor nodes
is because lots of nodes can be skipped by anchor nodes
without computation. Another reason is that when the num-
ber of suggestions is small, e.g., 5, we can quickly identify
the qualified suggestions and safely terminate the evaluation
with the guarantee of the upper bound. As such, the quali-
fied suggestions and their diverse results can be output.
Fig. 7b shows the time cost of evaluating the keyword
queries over the XMark data set. Although BE
is slower than AE and ASPE, it can still finish each
query evaluation in about 1.5 s. This is because: 1) the size
of keyword nodes is not as large as that of DBLP; 2) the key-
word nodes are distributed evenly in the synthetic XMark
data set, which limits the performance of AE and ASPE.
6 RELATED WORK
Diversifying results of document retrieval has been intro-
duced [6], [7], [8], [9]. Most of the techniques perform diver-
sification as a post-processing or re-ranking step of
document retrieval. Related work on result diversification
in IR also includes [25], [26], [27]. Santos et al. [25] used a
probabilistic framework to diversify document ranking, by
which web search result diversification is addressed. They
also applied a similar model to search result diversification
through sub-queries in [26]. Gollapudi and
Sharma [27] proposed a set of natural axioms that a diversi-
fication system is expected to satisfy, by which it can
improve user satisfaction with the diversified results. Differ-
ent from the above relevant works, in this paper, our diver-
sification model was designed to process keyword queries
over structured data. We have to consider the structures of
data in our model and algorithms, not limited to pure text
data like the above methods. In addition, our algorithms can
incrementally generate query suggestions and evaluate
them. The diversified search results can be returned with
the qualified query suggestions without depending on the
whole result set of the original keyword query.
Recently, there have been some relevant works discussing the
problem of result diversification in structured data. For
instance, [28] conducted thorough experimental evaluation
of the various diversification techniques implemented in
a common framework and proposed a threshold-based
method to control the tradeoff between relevance and diver-
sity features in their diversification metric. But it is a big chal-
lenge for users to set the threshold value. Hasan et al. [29]
developed efficient algorithms to find top-k most diverse set
of results for structured queries over semi-structured data.
As we know, a structured query can express a
much clearer search intention of a user. Therefore, diversifying
structured query results is less significant than diversifying
keyword search results. In [30], Panigrahi et al. focused on
the selection of diverse item set, not considering structural
relationships of the items to be selected. The most relevant
work to ours is the approach DivQ in [13] where Demidova
et al. first identified the attribute-keyword pairs for an origi-
nal keyword query and then constructed a large number of
structured queries by connecting the attribute-keyword
pairs using the data schema (the attributes can be mapped to
corresponding labels in the schema). The challenging prob-
lem in [13] is that two generated structured queries with
slightly different structures may still be considered as differ-
ent types of search intentions, which may hurt the effective-
ness of diversification as shown in our experiments.
However, our diversification model in this work utilizes
mutually co-occurring feature term sets as contexts to represent
different query suggestions, and the feature terms are
selected based on both their mutual correlation and the distinct
result sets. The structure of the data is considered by
satisfying the exclusive property of SLCA semantics.
7 CONCLUSIONS
In this paper, we first presented an approach to search
diversified results of keyword query from XML data based
on the contexts of the query keywords in the data. The
diversification of the contexts was measured by exploring
their relevance to the original query and the novelty of their
results. Furthermore, we designed three efficient algorithms
based on the observed properties of XML keyword search
results. Finally, we verified the effectiveness of our diversifi-
cation model by analyzing the returned search intentions
for the given keyword queries over DBLP data set based on
the nDCG measure and the possibility of diversified query
suggestions. Meanwhile, we also demonstrated the efficiency
of our proposed algorithms by running a substantial
number of queries over both the DBLP and XMark data sets.
From the experimental results, we see that our proposed
diversification algorithms can return qualified search intentions
and results to users in a short time.
Fig. 7. Average time cost of queries.
ACKNOWLEDGMENTS
This work was supported by the ARC Discovery Projects
under Grants No. DP110102407 and DP120102627, and par-
tially supported by a grant from the RGC of the Hong Kong
SAR, No. 418512.
REFERENCES
[1] Y. Chen, W. Wang, Z. Liu, and X. Lin, “Keyword search on struc-
tured and semi-structured data,” in Proc. SIGMOD Conf., 2009,
pp. 1005–1010.
[2] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, "XRank:
Ranked keyword search over XML documents," in Proc. SIGMOD
Conf., 2003, pp. 16–27.
[3] C. Sun, C. Y. Chan, and A. K. Goenka, "Multiway SLCA-based
keyword search in XML data," in Proc. 16th Int. Conf. World Wide
Web, 2007, pp. 1043–1052.
[4] Y. Xu and Y. Papakonstantinou, "Efficient keyword search for
smallest LCAs in XML databases," in Proc. SIGMOD Conf., 2005,
pp. 537–538.
[5] J. Li, C. Liu, R. Zhou, and W. Wang, "Top-k keyword search over
probabilistic XML data," in Proc. IEEE 27th Int. Conf. Data Eng.,
2011, pp. 673–684.
[6] J. G. Carbonell and J. Goldstein, “The use of MMR, diversity-
based reranking for reordering documents and producing sum-
maries,” in Proc. SIGIR, 1998, pp. 335–336.
[7] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong,
“Diversifying search results,” in Proc. 2nd ACM Int. Conf. Web
Search Data Mining, 2009, pp. 5–14.
[8] H. Chen and D. R. Karger, “Less is more: Probabilistic models for
retrieving fewer relevant documents,” in Proc. SIGIR, 2006,
pp. 429–436.
[9] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A.
Ashkan, S. Büttcher, and I. MacKinnon, "Novelty and diver-
sity in information retrieval evaluation," in Proc. SIGIR, 2008,
pp. 659–666.
[10] A. Angel and N. Koudas, “Efficient diversity-aware search,” in
Proc. SIGMOD Conf., 2011, pp. 781–792.
[11] F. Radlinski and S. T. Dumais, “Improving personalized web
search using result diversification,” in Proc. SIGIR, 2006, pp. 691–
692.
[12] Z. Liu, P. Sun, and Y. Chen, "Structured search result differ-
entiation," Proc. VLDB Endowment, vol. 2, no. 1, pp. 313–324,
2009.
[13] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, “DivQ:
Diversification for keyword search over structured databases,” in
Proc. SIGIR, 2010, pp. 331–338.
[14] J. Li, C. Liu, R. Zhou, and B. Ning, "Processing XML keyword
search by constructing effective structured queries," in Advances
in Data and Web Management. New York, NY, USA: Springer, 2009,
pp. 88–99.
[15] H. Peng, F. Long, and C. H. Q. Ding, “Feature selection based on
mutual information: Criteria of max-dependency, max-relevance,
and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[16] C. O. Sakar and O. Kursun, “A hybrid method for feature selec-
tion based on mutual information and canonical correlation analy-
sis,” in Proc. 20th Int. Conf. Pattern Recognit., 2010, pp. 4360–4363.
[17] N. Sarkas, N. Bansal, G. Das, and N. Koudas, "Measure-driven
keyword-query expansion," Proc. VLDB Endowment, vol. 2,
no. 1, pp. 121–132, 2009.
[18] N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa, “Seeking sta-
ble clusters in the blogosphere,” in Proc. 33rd Int. Conf. Very Large
Data Bases, 2007, pp. 806–817.
[19] S. Brin, R. Motwani, and C. Silverstein, “Beyond market baskets:
Generalizing association rules to correlations,” in Proc. SIGMOD
Conf., 1997, pp. 265–276.
[20] W. DuMouchel and D. Pregibon, “Empirical bayes screening for
multi-item associations,” in Proc. 7th ACM SIGKDD Int. Conf.
Knowl. Discovery Data Mining, 2001, pp. 67–76.
[21] A. Silberschatz and A. Tuzhilin, "On subjective measures of inter-
estingness in knowledge discovery," in Proc. 1st Int. Conf. Knowl.
Discovery Data Mining, 1995, pp. 275–281.
[22] [Online]. Available: http://dblp.uni-trier.de/xml/
[23] [Online]. Available: http://monetdb.cwi.nl/xml/
[24] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation
of IR techniques," ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446,
2002.
[25] R. L. T. Santos, C. Macdonald, and I. Ounis, “Exploiting query
reformulations for web search result diversification,” in Proc. 16th
Int. Conf. World Wide Web, 2010, pp. 881–890.
[26] R. L. T. Santos, J. Peng, C. Macdonald, and I. Ounis, “Explicit
search result diversification through sub-queries,” in Proc. 32nd
Eur. Conf. Adv. Inf. Retrieval, 2010, pp. 87–99.
[27] S. Gollapudi and A. Sharma, “An axiomatic approach for result
diversification,” in Proc. 16th Int. Conf. World Wide Web, 2009,
pp. 381–390.
[28] M. R. Vieira, H. L. Razente, M. C. N. Barioni, M. Hadjieleftheriou,
D. Srivastava, C. Traina Jr., and V. J. Tsotras, "On query result
diversification," in Proc. IEEE 27th Int. Conf. Data Eng., 2011,
pp. 1163–1174.
[29] M. Hasan, A. Mueen, V. J. Tsotras, and E. J. Keogh, “Diversifying
query results on semi-structured data,” in Proc. 21st ACM Int.
Conf. Inf. Knowl. Manag., 2012, pp. 2099–2103.
[30] D. Panigrahi, A. D. Sarma, G. Aggarwal, and A. Tomkins, “Online
selection of diverse results,” in Proc. 5th ACM Int. Conf. Web Search
Data Mining, 2012, pp. 263–272.
Jianxin Li received the BE and ME degrees in
computer science from Northeastern University,
Shenyang, China, in 2002 and 2005, respec-
tively. He received the PhD degree in computer
science from the Swinburne University of Tech-
nology, Australia, in 2009. He is currently a post-
doctoral research fellow at Swinburne University
of Technology. His research interests include
database query processing and optimization,
and social network analytics.
Chengfei Liu received the BS, MS, and PhD
degrees in computer science from Nanjing Uni-
versity, China, in 1983, 1985, and 1988, respec-
tively. He is currently a professor at the Faculty
of Science, Engineering and Technology,
Swinburne University of Technology, Australia.
His current research interests include keyword
search on structured data, query processing and
refinement for advanced database applications,
query processing on uncertain data and big data,
and data-centric workflows. He is a member of
the IEEE and a member of the ACM.
Jeffrey Xu Yu received the BE, ME, and PhD
degrees in computer science from the University
of Tsukuba, Tsukuba, Japan, in 1985, 1987, and
1990, respectively. He is currently a professor in
the Department of Systems Engineering and
Engineering Management, The Chinese Univer-
sity of Hong Kong, Hong Kong. His current
research interests include graph mining, graph
databases, social networks, keyword search, and
query processing and optimization. He is a senior
member of the IEEE, a member of the IEEE
Computer Society, and a member of the ACM.
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
672 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 3, MARCH 2015