International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 6, November-December (2013), pp. 70-77
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

AUTHENTICATED INDEXING FOR THE QUERY DEPENDENT K-NEAREST NEIGHBOURS IN SPATIAL DATABASE

K. Padmapriya1, Dr. S. Sridhar2
1 Research Scholar, Department of Computer Science and Engineering, Sathyabama University, Chennai, India
2 Research Supervisor, Department of Computer Science and Engineering, Sathyabama University, Chennai, India

ABSTRACT

Various indexing models have been proposed in the areas of information retrieval and artificial intelligence, but most existing algorithms do not take the significant differences among queries into account; they attempt to solve the problem with a single model for every query. In this paper we propose indexing models for multiple queries, which we call Indexing for Multiple Queries (IMQ). As a first step, we use k-Nearest Neighbours (kNN) for indexing the queries. The method comes in an online and an offline form. In the online method, we build an indexing model for a given query from the labelled queries nearest to it and then index the documents with respect to that query. We also describe two offline methods that build the indexing models in advance to make indexing more efficient. Experiments on different datasets show that both the proposed online and offline methods outperform the baseline method, which uses a single indexing model.

Keywords: Information retrieval, indexing for multiple queries, k-Nearest Neighbour.

1. INTRODUCTION

Because searching and information retrieval keep growing, indexing will always be an important research topic. In search, indexing is performed as follows: given a query, the documents related to the query are retrieved from the document repository and sorted by their relevance to the query using an indexing model, and the list of top-indexed documents is presented to the user. The central problem in this setting is developing an indexing model that provides the best relevance. Many models have been proposed for indexing, such as the vector space model [24, 25], the Boolean model [3], BM25 [22] and language models for IR [14, 19].
Recently, learning-to-index techniques from machine learning have been applied to automatic indexing model construction [6, 7, 8, 12, 18, 29, 30]. By applying machine learning algorithms to labelled training data, these methods can build an indexing model effectively. The training data contain queries, their associated documents, and relevance labels describing the association between each query and its documents. In this paper, we also base our model on index learning.

Previously, a single indexing function was used to handle all queries. This may not be appropriate, especially in web search, where queries differ in their semantics, in the user objectives they represent, in how they appear, and in how many relevant documents they have in the document repository. Queries may be informational, transactional or navigational; they may be phrases, combinations of phrases or natural language sentences; they may be product names, personal names or terminology; they can be long or short, popular or unpopular. Hence a single indexing function does not give appropriate results and yields lower accuracy in relevance indexing.

The importance of conducting query-dependent indexing is well understood in the IR community. However, much effort has gone into query classification [4, 5, 13, 15, 23] and little into query-dependent indexing model learning and construction. Kang and Kim [13] classified queries into two categories based on search intention and used two different indexing models for the two categories. Following previous work [9], we propose query-dependent indexing model construction based on k-Nearest Neighbours. We use training queries, each represented as a point in a query feature space. During indexing, we retrieve the k nearest training queries for a given test query, learn an indexing model from them, and then index the documents relevant to the test query with that model. The advantages of the proposed methods are:
1. Query indexing exploits the useful information of similar queries while neglecting dissimilar queries.
2. Queries are classified dynamically, and the similar queries are selected per test query.

Our experimental results demonstrate the advantage of this approach over both the single indexing model and the query classification method. Since kNN requires online training of an indexing model for each query, it is expensive in practice; we therefore propose two methods that move the training offline. In particular, the offline methods remain accurate, in terms of prediction loss, provided the learning algorithm is stable under minor changes to the training examples.

2. PREVIOUS WORK

Little work has been done on query-dependent indexing; most existing work addresses query classification and learning to index. Many methodologies have been proposed for query classification. In [4, 5, 26], queries are classified according to their topic, for instance computers, information and entertainment, as in the KDD Cup 2005 task. In [13, 15, 23, 27], queries are classified according to the user's search need, for instance topic distillation, home page finding and named page finding. Support vector machines have also been applied to this classification task.
Learning to index has also been studied extensively and applied to information retrieval. Earlier work categorises the approaches into three groups (a small sketch of the pair-wise transformation follows the list):
1. Point-wise approach [18] - transforms indexing into classification or regression on single documents.
2. Pair-wise approach [6, 9, 12] - performs indexing as classification on pairs of documents.
3. List-wise approach [7, 29, 30] - minimises a loss function defined on lists of documents.
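As an illustration of the pair-wise idea, the following minimal sketch (not taken from the paper; all names and data are illustrative) turns graded relevance labels into classification examples on document pairs, which a binary classifier such as a linear SVM can then learn from.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC  # any binary classifier works for the pair-wise reduction

def to_pairwise(doc_features, labels):
    """Turn one query's documents into pairwise classification examples.

    For every pair of documents with different relevance labels, the input is
    the feature difference and the target says which document is better.
    """
    X, y = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # ties carry no ordering information
        X.append(doc_features[i] - doc_features[j])
        y.append(1 if labels[i] > labels[j] else -1)
    return np.array(X), np.array(y)

# Illustrative data: 4 documents with 3 features each and graded labels 0..4.
docs = np.array([[0.9, 0.1, 0.3],
                 [0.2, 0.8, 0.5],
                 [0.4, 0.4, 0.9],
                 [0.1, 0.2, 0.2]])
labels = np.array([3, 1, 2, 0])

X_pairs, y_pairs = to_pairwise(docs, labels)
model = LinearSVC().fit(X_pairs, y_pairs)

# Scoring new documents with the learned weight vector induces a ranking.
scores = docs @ model.coef_.ravel()
print(np.argsort(-scores))  # indices from most to least relevant
```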
3. INDEXING USING k-NEAREST NEIGHBOURS

In practice, we categorise queries into two types:
1. Popular queries - have many related documents, and the features behind their popularity are important for indexing.
2. Rare queries - have very few related documents, so popularity features are not needed for indexing.
Hence we use different types of indexing models for different queries. A direct approach would be to apply a hard classification model that assigns each query to a category and to train one indexing model per category. However, it is not easy to achieve good performance with this approach.

Fig. 1. Sample distribution of the query data (two-dimensional PCA projection)

When we examine the data, it is not easy to draw clear boundaries between queries of the various categories. We represent each query by the 27-dimensional feature vector defined in [27] and then reduce the space to two dimensions using Principal Component Analysis (PCA). Plotting the queries in this reduced space gives the graph in Fig. 1. Since queries from different categories are mixed together, they cannot be separated by hard classification. However, a query's neighbours belong to the same category as the query with high probability; we call this the "Queries Locality Property (QLP)".

3.1 kNN Online method

We use the kNN method for query-dependent indexing. Each training query qi has a feature vector and a corresponding labelled data set Sqi, where i = 1, 2, ..., m, and queries are represented as points in a Euclidean query feature space. Given a test query q, we find its k nearest training queries by Euclidean distance. We then train a local indexing model online using these closest training queries Ck(q), and index the test query's documents with this local model (an SVM). The working principle of the algorithm is illustrated in Fig. 2, where the red circle represents the test query q, the blue circles represent training queries, and the large circle encloses the neighbours of q.

Fig. 2. Representation of the kNN online method

For each query, we apply a reference model (BM25) to find its top T indexed documents and take the mean of their feature values as the query features.
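The query-feature construction just described could look like the following sketch. It is not the authors' code; the array layout and the stand-in BM25 scores are assumptions made purely for illustration.

```python
import numpy as np

def query_features(doc_features, bm25_scores, T=50):
    """Build a query's feature vector as described in Section 3.1.

    doc_features: (n_docs, n_feats) matrix of per-document features for this query.
    bm25_scores:  (n_docs,) scores from the reference model h_r (BM25).
    Returns the mean feature vector of the top-T BM25-ranked documents.
    """
    top = np.argsort(-bm25_scores)[:T]        # indices of the T highest-scoring documents
    return doc_features[top].mean(axis=0)     # mean over the top-T documents

# Illustrative usage with random data standing in for real retrieval features.
rng = np.random.default_rng(0)
docs = rng.random((200, 27))                  # 27 features per document, as in [27]
scores = rng.random(200)                      # assumed BM25 scores for the same documents
q_vec = query_features(docs, scores, T=50)
print(q_vec.shape)                            # (27,)
```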
Algorithm 1: kNN Online method
Step 1: Use the reference model hr to find the top T indexed documents for the test query q, and build the query features of q from those documents.
Step 2: Find the kNNs of q among the training queries, Ck(q), using Euclidean distance in the query feature space.
Step 3: Learn a local model hq using the training set SCk(q) = ∪qi∈Ck(q) Sqi.
Step 4: Apply hq to the documents related to q to obtain the indexed list.
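A minimal sketch of Algorithm 1 follows. It is illustrative only: the helper names (query_features and to_pairwise from the sketches above), the use of scikit-learn's LinearSVC as the local model, and the data layout are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC  # stands in for the local SVM-style indexing model

def knn_online_index(q_vec, test_docs, train_query_vecs, train_sets, k=300):
    """Algorithm 1 (online): train a local model from the k nearest training queries.

    q_vec:            feature vector of the test query (e.g. from query_features above).
    test_docs:        (n_docs, n_feats) features of the documents to index for q.
    train_query_vecs: (m, n_qfeats) feature vectors of the m training queries.
    train_sets:       per training query, a (pairwise_X, pairwise_y) tuple of labelled
                      examples, e.g. produced by to_pairwise above.
    k:                neighbourhood size; 300 is within the range the paper reports
                      as working well.
    """
    # Step 2: k nearest training queries by Euclidean distance in query feature space.
    dists = np.linalg.norm(train_query_vecs - q_vec, axis=1)
    neighbours = np.argsort(dists)[:k]

    # Step 3: pool the labelled data of the neighbours and train the local model h_q.
    X = np.vstack([train_sets[i][0] for i in neighbours])
    y = np.concatenate([train_sets[i][1] for i in neighbours])
    h_q = LinearSVC().fit(X, y)

    # Step 4: score and sort the test query's documents with h_q.
    scores = test_docs @ h_q.coef_.ravel()
    return np.argsort(-scores)
```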
3.2 kNN Offline method - 1

To improve efficiency, we move the model training offline. For each training query qi, we first find its closest queries Ck(qi) in the query feature space and learn a local model hqi from SCk(qi). During testing, we find the kNNs Ck(q) of a new query q and compare SCk(q) with every SCk(qi) to find the training query qi' whose neighbourhood is most similar to that of q. We then use hqi' to index the documents of q. In Fig. 3, the circled solid dot represents the selected training query qi'.

Fig. 3. Representation of kNN Offline method - 1

Offline, for each training query qi, we compute its kNNs among the training queries in the query feature space, denote them Ck(qi), and use the training data set SCk(qi) to learn a model hqi.

Algorithm 2: kNN Offline method - 1
Step 1: Use the reference model hr to compute the top T indexed documents for the test query q, and build its query features from those documents.
Step 2: Find the kNNs of q among the training queries, Ck(q), using Euclidean distance in the query feature space.
Step 3: Select the training query qi' whose neighbourhood data set SCk(qi') is most similar to SCk(q).
Step 4: Apply the precomputed model hqi' to the documents related to q to obtain the indexed list.

3.3 kNN Offline method - 2

Instead of finding the kNNs of the test query q, we compute only its single closest training query qi' in the query feature space and directly apply the precomputed model hqi' learned from SCk(qi') to q. This reduces the online cost to a single nearest-neighbour search.

Fig. 4. Representation of kNN Offline method - 2

The per-training-query models hqi are precomputed exactly as in offline method 1: for each training query qi, find Ck(qi) and learn hqi from SCk(qi).

Algorithm 3: kNN Offline method - 2
Step 1: Use the reference model hr to compute the top T indexed documents for the test query q, and build its query features from those documents.
Step 2: Find the nearest training query of q, denoted qi'.
Step 3: Apply the precomputed model hqi' to the documents related to q to obtain the indexed list.
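The offline variants can be sketched as below, again only as an illustration under assumed data structures and helper names: local models are precomputed per training query, method 1 selects the model whose neighbourhood best matches the test query's (the paper does not spell out the similarity measure, so neighbour-set overlap is used as a stand-in), and method 2 simply reuses the model of the single nearest training query.

```python
import numpy as np
from sklearn.svm import LinearSVC

def precompute_local_models(train_query_vecs, train_sets, k=300):
    """Offline step shared by both methods: learn one local model h_qi per training query."""
    models, neighbour_sets = [], []
    for q_vec in train_query_vecs:
        dists = np.linalg.norm(train_query_vecs - q_vec, axis=1)
        ck = np.argsort(dists)[:k]                        # Ck(qi), includes qi itself
        X = np.vstack([train_sets[j][0] for j in ck])
        y = np.concatenate([train_sets[j][1] for j in ck])
        models.append(LinearSVC().fit(X, y))              # h_qi learned from S_Ck(qi)
        neighbour_sets.append(set(ck.tolist()))
    return models, neighbour_sets

def knn_offline1_index(q_vec, test_docs, train_query_vecs, models, neighbour_sets, k=300):
    """Offline method 1: reuse the model of the training query whose neighbourhood
    is most similar to the test query's (overlap used as an assumed similarity)."""
    dists = np.linalg.norm(train_query_vecs - q_vec, axis=1)
    ck_q = set(np.argsort(dists)[:k].tolist())
    best = int(np.argmax([len(ck_q & s) for s in neighbour_sets]))
    scores = test_docs @ models[best].coef_.ravel()
    return np.argsort(-scores)

def knn_offline2_index(q_vec, test_docs, train_query_vecs, models):
    """Offline method 2: reuse the model of the single nearest training query."""
    nearest = int(np.argmin(np.linalg.norm(train_query_vecs - q_vec, axis=1)))
    scores = test_docs @ models[nearest].coef_.ravel()
    return np.argsort(-scores)
```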
4. EXPERIMENTAL RESULTS

We used two data sets: Dataset1 contains 1500 training queries and 500 test queries, and Dataset2 contains 3000 training queries and 600 test queries. Each query comes with its labelled relevant documents, and relevance is graded as Perfect (4), Excellent (3), Good (2), Fair (1) and Bad (0). A feature vector is defined for each query-document pair. In our experiments, we used Ranking SVM [12] as the baseline learning algorithm; it has a single parameter n representing the trade-off between model complexity and empirical loss, and we set n = 0.01 for all methods. In the kNN methods, we used BM25 as the reference model to index the documents, selected the top T = 50 documents, and created the query features from them.

Fig. 5a. Indexing accuracies on Dataset1 (curves: SM, kNN Online, kNN Offline-1, kNN Offline-2)

Fig. 5b. Indexing accuracies on Dataset2 (curves: SM, kNN Online, kNN Offline-1, kNN Offline-2)

We compared our methods with the single model approach (SM). From Fig. 5a and Fig. 5b we can see that the proposed methods perform comparably with each other and all outperform the baseline algorithm. We conducted t-tests, and the results show that the improvements of the kNN methods over SM are statistically significant on both Dataset1 and Dataset2. For SM, we observed that errors in query classification are the main cause of degraded document indexing results, which also shows how difficult it is to develop a query-dependent indexing method that beats the conventional indexing methods. The kNN methods, in contrast, successfully exploit the indexing patterns of similar queries and attain better indexing performance.

5. CONCLUSION AND FUTURE WORK

In this paper, we have discussed indexing documents for search using different models for different types of queries. We have defined a kNN model for learning indexing functions and presented two offline variants to improve the efficiency of the method. Our experimental results show that the proposed models outperform the baseline algorithm. When only a small number of neighbours is used, the kNN methods perform poorly because of inadequate training data; as the number of neighbours increases, performance improves gradually thanks to the additional information. Conversely, if too many neighbours are used (approaching all 1500 training queries, as in SM), performance begins to degrade again. Hence the best performance is achieved when the number of neighbours takes values in a relatively wide range, roughly 300 to 700. In future work, we plan to reduce the complexity of the online method by using kD-trees, to further reduce the complexity of the offline methods by clustering the training queries, and to examine whether metrics other than Euclidean distance perform better for this task.
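The kD-tree idea mentioned in the future work could be realised roughly as follows. This is only a sketch using SciPy's cKDTree as a stand-in for an actual implementation, which the paper does not provide; the data here are random placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the training-query feature vectors.
rng = np.random.default_rng(0)
train_query_vecs = rng.random((1500, 27))     # stand-in for real 27-dimensional query features
tree = cKDTree(train_query_vecs)

# At query time, the tree lookup replaces the linear scan over all training queries.
q_vec = rng.random(27)
dists, neighbours = tree.query(q_vec, k=300)  # k nearest training queries, as in Algorithm 1
print(neighbours[:5])
```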
6. REFERENCES

1. S. Agarwal and P. Niyogi, "Stability and generalization of bipartite ranking algorithms", Proceedings of COLT 2005, pp. 32-47.
2. M. Richardson, A. Prakash, and E. Brill, "Beyond PageRank: machine learning for static ranking", WWW '06: Proceedings of the 15th International Conference on World Wide Web, New York, NY, USA, 2006, pp. 707-715.
3. R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley, May 1999.
4. S. M. Beitzel, E. C. Jensen, A. Chowdhury, and O. Frieder, "Varying approaches to topical web query classification", SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2007, pp. 783-784.
5. S. M. Beitzel, E. C. Jensen, O. Frieder, D. Grossman, D. D. Lewis, A. Chowdhury, and A. Kolcz, "Automatic web query classification using labeled and unlabeled training data", SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2005, pp. 581-582.
6. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to rank using gradient descent", ICML '05: Proceedings of the 22nd International Conference on Machine Learning, New York, NY, USA, 2005, pp. 89-96.
7. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, "Learning to rank: from pairwise approach to listwise approach", ICML '07, volume 227 of ACM International Conference Proceeding Series, 2007, pp. 129-136.
8. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences", J. Mach. Learn. Res., 4:933-969, 2003.
9. D. S. Guru and H. S. Nagendraswamy, "Clustering of interval-valued symbolic patterns based on mutual similarity value and the concept of k-mutual nearest neighbourhood", ACCV (2), 2006, pp. 234-243.
10. K. Jarvelin and J. Kekalainen, "Cumulated gain-based evaluation of IR techniques", ACM Trans. Inf. Syst., 20(4):422-446, 2002.
11. T. Joachims, "Making large-scale support vector machine learning practical", in Advances in Kernel Methods: Support Vector Learning.
12. T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li, "LETOR: Benchmark dataset for research on learning to rank for information retrieval", SIGIR '07: Proceedings of the Learning to Rank Workshop at the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
13. T. Joachims, "Optimizing search engines using clickthrough data", Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
14. I. Kang and G. Kim, "Query type classification for web document retrieval", SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.
15. J. Lafferty and C. Zhai, "Document language models, query models, and risk minimization for information retrieval", SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2001, pp. 111-119.
16. U. Lee, Z. Liu, and J. Cho, "Automatic identification of user goals in web search", WWW '05: Proceedings of the 14th International Conference on World Wide Web, New York, NY, USA, 2005, pp. 391-400.
17. T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma, "Support vector machines classification with a very large-scale taxonomy", SIGKDD Explor. Newsl., 7(1):36-43, 2005.
18. R. Nallapati, "Discriminative models for information retrieval", SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2004, pp. 64-71.
19. J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval", Research and Development in Information Retrieval, 1998, pp. 275-281.
20. F. P. Preparata and M. I. Shamos, "Computational Geometry: An Introduction (Monographs in Computer Science)", Springer, August 1985.
21. S. Robertson, "Overview of the Okapi projects", Journal of Documentation, 1998, pp. 275-281.
22. D. E. Rose and D. Levinson, "Understanding user goals in web search", WWW '04: Proceedings of the 13th International Conference on World Wide Web, New York, NY, USA, 2004, pp. 13-19.
23. G. Salton, "The SMART Retrieval System - Experiments in Automatic Document Processing", Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971.
24. G. Salton and M. E. Lesk, "Computer evaluation of indexing and text processing", J. ACM, 15(1):8-36, 1968.
25. D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, "Building bridges for web query classification", SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2006, pp. 131-138.
26. J. Xu and H. Li, "AdaRank: a boosting algorithm for information retrieval", SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2007, pp. 391-398.
27. R. Song, J.-R. Wen, S. Shi, G. Xin, T.-Y. Liu, T. Qin, X. Zheng, J. Zhang, G. Xue, and W.-Y. Ma, "Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004", Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), 2004.
28. E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information", Advances in NIPS, vol. 15, 2003.
29. Y. Yue, T. Finley, F. Radlinski, and T. Joachims, "A support vector method for optimizing average precision", SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2007, pp. 271-278.
30. Y. Ioannidis and Y. Kang, "Randomized algorithms for optimizing large join queries", ACM SIGMOD, 1990.
31. Y. Angeline Christobel and P. Sivaprakasam, "Improving the Performance of K-Nearest Neighbor Algorithm for the Classification of Diabetes Dataset with Missing Values", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3, 2012, pp. 155-167, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
32. Mousmi Chaurasia and Dr. Sushil Kumar, "Natural Language Processing Based Information Retrieval for the Purpose of Author Identification", International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010, pp. 45-54, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
33. Prakasha S, Shashidhar Hr and Dr. G T Raju, "A Survey on Various Architectures, Models and Methodologies for Information Retrieval", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182-194, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
