International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 – 6375(Online) Volu...
A novel approach towards developing a statistical dependent and rank

Published in: Technology, News & Politics

  1. 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), © IAEME

A NOVEL APPROACH TOWARDS DEVELOPING A STATISTICAL DEPENDENT AND RANKING MEASURE FOR KEYWORD SEARCH OVER XML DATA

Dayananda P (1), Dr. Rajashree Shettar (2)
(1) Assistant Professor, Department of Information Science and Engg, MSRIT, Bangalore-54
(2) Professor, Department of Computer Science and Engg, RVCE, Bangalore-59

ABSTRACT

Extensible Markup Language (XML) defines a set of conventions for encoding documents in a format that is both human-readable and machine-readable. XML is widely used to represent arbitrary data structures. Since XML is largely accepted as a standard for data representation, it is the markup language most often targeted for keyword search. In this paper, a statistical dependent and ranking measure for keyword search over XML data is proposed. The proposed method consists of the following steps: 1) Indexing, 2) Selecting the exact T-type node, 3) Data search and ranking of search results. A node type T is considered the desired search-for node type if XML nodes of type T are informative enough to contain the relevant information and T is related to every keyword in the query. First, the input XML data is given to the indexing process, which converts it into an indexed format to make search easier. Then, the corresponding T-type node is selected through our proposed statistical dependent formulae. Once the T-type node is selected, the relevant data is obtained by sorting the node type paths.
Finally, ranking is done based on the search results obtained from the previous steps with our designed ranking measure. This work addresses the two challenges addressed by the TF*IDF strategy and improves the effectiveness of the search for node type and of the ranking of search results.

Keywords: XML keyword search, Indexing, Search for node type, Data search, Ranking measure.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 3, May-June (2013), pp. 229-247
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com
  2. 2. 1. INTRODUCTION

The Internet is the repository for large amounts of information, and the quantity of XML information shared over the World Wide Web is growing rapidly. Text-centric XML document collections are becoming more and more common, even though the large majority of XML data is data-centric. As a result, it has become useful to provide means to manage these collections. This can be done with document-clustering methods that automatically arrange very large collections into smaller sub-collections. Unfortunately, the majority of the research on structured document processing [1], [3] is still focused on data-centric XML. The processing and management of XML documents [4] have already become popular research issues, with the major difficulty in this area being the need to index the documents optimally for storage and retrieval. Several search methods have grown up in the IR research community that basically depend on a set of weighted keywords in a search query to decide the proximity of the query and a document in the feature space. However, retrieval of XML documents departs from the conventional retrieval strategy, because XML documents have nested elements and the semantics of data values are indicated by tags. As a result, the notion of keyword proximity used in IR [13] is too simple to be effective in XML search.

Keyword search is a handy way to query XML documents, since it permits users to issue keyword queries easily, without knowledge of complex query languages or of the structure of the underlying data. The majority of the research efforts in XML keyword search focus on keyword proximity search in either the tree model or the general digraph model.
Both approaches commonly assume that a smaller sub-structure of the XML document containing all query keywords indicates a better result. Smallest Lowest Common Ancestor (SLCA) is a simple and effective semantics in the tree model for XML keyword proximity search [15, 8]. Every SLCA result of a keyword query is a smallest XML node that 1) covers all keywords in its descendants and 2) has no single proper descendant that covers all query keywords. Being based on the tree model, however, the SLCA semantics does not capture ID reference data, which is generally available and significant in XML databases. As a result, it may return a large tree containing irrelevant data. XML documents may, on the other hand, be modeled as digraphs to take ID reference edges into account. The main concept in the digraph model, which looks for minimal connected subtrees in the graph, is called reduced subtrees [14]. However, the problem of finding all reduced subtrees and enumerating results by increasing size of the reduced subtrees is NP-hard [17, 10].

Current XML keyword and natural language query answering approaches depend on heuristics that assume certain properties of the DB schema. Though these heuristics are intuitively reasonable, they are ad hoc enough that they are often violated in practice, even in the highest-quality XML schemas. Thus present approaches suffer from low precision, low recall, or both [19]. Attention is now turning to the end-user effectiveness of such search systems. Traditional IR similarity metrics have been ported to the new domain and combined with domain-specific structural features. There is also evidence of significant improvements in effectiveness, both through developing new methods and through tuning existing ones [20].

The motivation of our research is to design and develop a technique for keyword search over XML data. Our real motivation is the XML search technique presented in [10], which uses a TF*IDF strategy to address two challenges.
When analyzing the existing work [10], we found that the term frequency-based score computation was not very effective in selecting the exact T-type node.

  3. 3. Incorporating other features along with frequency can lead to a more effective T-type search in XML data. When the search output for a user is large, the ranking of the search results becomes more important. This problem can be addressed with an effective ranking mechanism.

The proposed methodology addresses these two challenges. In addition, this work addresses effectiveness and efficiency in terms of result relevance by tackling the challenges identified in [10]: identifying the user's search intention, resolving keyword ambiguity, and ranking the search results effectively. The proposed method consists of the following steps:

1) Indexing: The input XML data is given to the indexing process, which converts it into two indices (data index and node index) that make search easier.
2) Selecting the exact T-type node: The corresponding T-type nodes are selected through our designed statistical dependent formulae, Dscore and Tscore.
3) Data search and ranking of search results: Once the T-type nodes are selected, the relevant data are obtained by sorting the node type paths. Finally, the search results obtained from the previous steps are ranked with our designed ranking measure, which uses a correlation measure.

The rest of the paper is organized as follows. The literature on keyword search over XML data is presented in Section 2, and the proposed research methodology in Section 3. The proposed method is discussed in Section 4, while the results and experiments are discussed in Section 5. Section 6 concludes.

2. RELATED WORK

Jianhua Feng and Guoliang Li et al. [5] presented fuzzy type-ahead search in XML data, an information-access paradigm in which the system searches XML data on the fly as the user types query keywords. It allows users to explore data as they type, even in the presence of minor errors in their keywords. Their approach has the following features: 1) Search as you type: it extends autocomplete by supporting queries with multiple keywords in XML data. 2) Fuzzy: it can find high-quality answers whose keywords match the query keywords approximately. 3) Efficient: effective index structures and search algorithms achieve a very high interactive speed. They presented effective index structures and top-k algorithms to achieve this speed, and examined effective ranking functions and early termination techniques to progressively identify the top-k relevant answers. Their implementation achieved high search efficiency and result quality.

Wei Wa et al. [6] presented a multidimensional search approach that allows users to perform fuzzy searches for structure and metadata conditions in addition to keyword conditions. Their techniques score each dimension individually and integrate the three dimension scores into a meaningful unified score. They also designed indexes and algorithms to efficiently identify the most relevant files that match multidimensional queries. Experimental evaluation showed that their relaxation and scoring framework for fuzzy query conditions in non-content dimensions can significantly improve ranking accuracy.

  4. 4. Ziyang Liu et al. [7] presented an XML search engine, Target Search, that addresses an open problem in XML keyword search: given relevant matches to keywords, how to compose query results properly so that they can be effectively ranked and easily digested by users. Intuitively, each query has a search target, and each result should contain exactly one instance of the search target along with its evidence. Target Search composes atomic and intact query results driven by users' search targets.

Chunxiao Liu et al. [8] presented a user-friendly top-k keyword search approach based on the relationship between keywords. The SLCA of a keyword search is first obtained by the LISA II algorithm. Then the structure of the SLCA is leveraged to infer the relationship between keywords, i.e., the keyword search is translated into twig queries. Next, the relationship between keywords is estimated from the structure of the twig queries, and the twig queries are ranked according to those relationships. Finally, all results of the ordered twig queries are obtained by the TJFast algorithm.

Yiqun Chen and Jinyin Cao [9] presented an approach to type-ahead keyword search in XML data, called Take XIR. This IR-style approach uses the statistics of the underlying XML data to address the following challenges in an XML IR system: (1) identify the user search intention, i.e. identify the keywords that express user interests and identify the nodes the user wants to search for and search via; (2) resolve keyword ambiguity problems: synonyms and polysemy exist in natural language, and a keyword can appear as the text value or tag name of different XML nodes and carry different meanings.
They modeled XML data as a graph, analyzed the identification of user search intention and result ranking in the presence of keyword ambiguities, and used the related definitions and formulae to build a query prediction technique that improves search efficiency.

Jiang Li and Junhu Wang [11] observed that XML keyword search provides a simple and user-friendly way of retrieving data from XML databases, but that the ambiguities of keywords make it difficult to answer keyword queries effectively. XReal uses the statistics of the underlying data to resolve keyword ambiguity problems. However, they found that the formula presented for inferring the search-for node type suffers from inconsistency and abnormality problems. They proposed a dynamic reduction factor scheme as well as an algorithm, Dynamic Infer, to resolve these two problems. Experimental results are provided to verify the effectiveness.

Liang Jeff Chen and Yannis Papakonstantinou [12] presented a series of algorithms that incorporate both efficient semantic pruning and top-K processing to support top-K keyword search [23]. They presented a join-based algorithm that processes nodes bottom-up and reduces keyword query evaluation to relational joins. Several optimizations were presented to further improve its efficiency. They then incorporated the idea of the top-K join from relational databases and presented a join-based top-K algorithm to compute the top-K results. Extensive experimental results confirmed the advantages of their algorithms over previous algorithms in both efficiency and top-K processing.

Zhifeng Bao et al. [10] studied the problem of effective XML keyword search, which includes the identification of user search intention and result ranking in the presence of keyword ambiguities. They used statistics to infer the user search intention and rank the query results.
In particular, they defined XML TF and XML DF, on which they based formulae to compute the confidence level of each candidate node type to be a search-for/search-via node, and further proposed an XML TF*IDF similarity ranking scheme to capture the hierarchical structure of XML data. Finally, the popularity of a query result (captured by ID reference relationships) was considered to handle the case in which multiple results have comparable relevance scores.

  5. 5. As an extension of [10], this work makes several major updates: 1) our ranking framework uses the correlation concept presented in Section 4, which outperforms the ranking concepts in [10]; 2) the exact T-type node is selected as described in Section 4; 3) a new index and algorithm are designed in Section 4.

3. RESEARCH METHODOLOGY

Definition 3.1 (Structural node): An XML node labeled with a tag name is called a structural node. A structural node that has children is called an internal node; otherwise, it is called a leaf node.

Definition 3.2 (T-type node): A node type T is considered the desired search-for node type if (i) T is intuitively related to every query keyword, (ii) XML nodes of type T are informative enough to contain enough relevant information, and (iii) XML nodes of type T are not so overwhelming that they contain too much irrelevant information.

Definition 3.3 (Data node): A leaf node of the XML data that contains a text value and has no tag name is called a data node.

The primary intention of our research is to design and develop a technique for keyword search over XML data. The work is motivated by the XML search technique given in [10], which uses a TF*IDF strategy to address two challenges. When analyzing the existing work [10], we found that the term frequency-based score computation was not very effective in selecting the exact T-type node. Incorporating other features along with frequency can lead to a more effective T-type search in XML data. Also, the ranking of the search results is important for users when the search output is large.
This problem can be addressed with an effective ranking mechanism.

The proposed methodology addresses both of the challenges mentioned above. It consists of three major steps: 1) Indexing, 2) Selecting the exact T-type node, 3) Data search and ranking of search results. First, the input XML data is given to the indexing process, which converts it into an indexed format to make search easier. Then the corresponding T-type nodes are selected through our designed statistical dependent formulae. Once the T-type nodes are selected, the relevant data are obtained by similarity matching with the input query. Finally, the search results obtained from the previous steps are ranked with our designed ranking measure. The proposed algorithm will be implemented in Java, and its performance will be compared with an existing algorithm in terms of precision, recall and ranking measure with two different datasets.
  6. 6. 4. PROPOSED METHOD

1. Indexing

The approach presented in [10] builds two indices for data processing: a keyword inverted list and a frequency table. The keyword inverted list retrieves, in document order, the list of data nodes whose values contain the input keyword. A B+-tree index is built on top of each inverted list. The second index, the frequency table, stores only the frequency (the number of T-typed nodes that contain keyword k in their subtrees in the XML data) for each combination of keyword k and node type T in the XML document. When a query keyword is searched, the approach in [10] does not identify whether the keyword is a node name or a data value, which leads to more complex query processing.

To overcome these drawbacks, a specific indexing method is proposed that builds two indices, a node index and a data index, for structural nodes and data nodes respectively. These two indices are shown in Table 1 and Table 2 for the DBLP XML document. In contrast to the indices presented in [10], the node index stores the node name of each structural node, the frequency of occurrence of each structural node either in T-typed nodes or in their subtrees, and the prefix path of the corresponding T-typed nodes. The data index stores the value of each data node, the corresponding node name, and the frequency of occurrence of each data node in the XML document. The data index is linked to the node index through the node name. Scores computed with reference to the two indices are used to determine the exact T-typed node for a given keyword query. Thus, the proposed indexing approach addresses structural nodes and data separately in the XML database and results in effective query processing.
Fig. 1 shows the partial structure of the DBLP XML database and Fig. 2 shows a partial data subtree of the DBLP XML database.

Fig. 1. Partial data tree structure for the 'DBLP' XML database (tree rooted at dblp, with children such as inproceedings, phdthesis, article and mastersthesis, and leaves such as author, title, year, school, ee, cdrom, url and month)
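The two indices described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the function name `build_indices`, the key layout of the two counters, and the miniature `SAMPLE` document are all hypothetical, and frequencies are simple occurrence counts.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document; all values are made up for illustration.
SAMPLE = """<dblp>
  <article><author>Frank Manola</author><year>1995</year></article>
  <article><author>Frank Manola</author><year>1993</year></article>
  <phdthesis><author>Tolga Yurek</author><year>1997</year></phdthesis>
</dblp>"""

def build_indices(xml_text):
    """Return (node_index, data_index) as frequency counters (assumed layout)."""
    node_index = Counter()   # (node name, prefix path) -> frequency
    data_index = Counter()   # (data value, node name) -> frequency

    def walk(elem, prefix):
        if prefix:  # the root has no prefix path, so it is not indexed
            node_index[(elem.tag, ",".join(prefix))] += 1
        text = (elem.text or "").strip()
        if text and len(elem) == 0:  # leaf element carrying a text value
            data_index[(text, elem.tag)] += 1
        for child in elem:
            walk(child, prefix + [elem.tag])

    walk(ET.fromstring(xml_text), [])
    return node_index, data_index

node_index, data_index = build_indices(SAMPLE)
print(node_index[("author", "dblp,article")])   # frequency of author under dblp,article
print(data_index[("Frank Manola", "author")])   # frequency of this author value
```

Keeping structural nodes and data values in separate counters mirrors the point made above: a query keyword can be looked up as a node name or as a data value without scanning a single mixed list.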
  7. 7. Table 1: Node index

Sr. no.  Node       Frequency  Path
300      author     212898     dblp,article
302      url        106805     dblp,article
303      publisher  4          dblp,article
307      year       72         dblp,phdthesis
311      publisher  3          dblp,phdthesis
319      author     14         dblp,www
320      editor     21         dblp,www
321      booktitle  1          dblp,www
324      title      2609       dblp,proceedings
326      series     1955       dblp,proceedings

Table 2: Data index

Sr. no.  Data                                      Node    Frequency
30       db/labs/gte/index.html#TR-0169-12-91-165  url     1
32       db/labs/gte/TR-0231-08-93-165.html        ee      1
33       Sandra Heiler                             author  7
35       TR-0231-08-93-165                         volume  8
36       1993                                      year    4144
38       GTE/MANO93c.pdf                           cdrom   1
42       June                                      month   5
44       db/labs/gte/index.html#TM-0014-06-88-165  url     1
45       GTE/MANO88.pdf                            cdrom   1
46       db/labs/gte/TM-0332-11-90-165.html        ee      1

3. SEARCH FOR NODE TYPE-T

When selecting the exact T-type node for a given keyword query, the tag matching a keyword may occur many times in different T-type nodes and their subtrees, which makes the search-for node type process more complex. To overcome this drawback, we propose two mathematical scores with which the optimal T-type nodes are selected: 1) Dscore and 2) Tscore. Dscore is the ratio of the depths of the ancestor nodes of the keywords in a given query, and Tscore gives the percentage score of each node type having the best depth score (Dscore).

a) Dscore

For a given input query 'q', the depth of the lowest common ancestor (LCA) node of all the keywords in the query and the depth of the highest common ancestor (HCA) node of the same keywords are computed first. The ratio of these two depths is the Dscore.
  8. 8. Fig. 2. Partial data subtree structure for the 'DBLP' XML database (subtree dblp, article with leaves month, ee, author and cdrom)

$$D_{score} = \frac{\text{depth of LCA node}}{\text{depth of HCA node}} \qquad (1)$$

The LCA nodes with the lowest Dscore values are selected as the probable node types for the given query 'q'. From this set of candidates, the best node is selected as the T-type node for the given query keywords. To do so, a Tscore percentage is estimated.

b) Tscore

The Tscore percentage answers the question: for a keyword query, what is the chance of occurrence of keyword 'k' at node type T? This can be expressed with conditional probability. If 'q' and 'T' are events, the probability of 'q' given 'T' is denoted P(q/T) and is expressed as

$$P(q/T) = \frac{P(q \cap T)}{P(T)} \qquad (2)$$

where P(q/T) is the chance of event 'q' given that event 'T' has occurred, P(q ∩ T) is the probability of occurrence of event 'q' in event 'T', and P(T) is the probability of occurrence of event 'T'.
  9. 9. With reference to the conditional probability P(q/T), the probability of 'q' given 'T', equation (2) can be written as the sum of the probabilities of occurrence of the keywords at node type T:

$$P(q/T) = \sum_{k \in q \cap T} \frac{P(k)}{P(T)} \qquad (3)$$

$$P(q/T) = \frac{1}{P(T)} \times \sum_{k \in (q \cap T)} P(k) \qquad (4)$$

P(T) is constant over the keywords ('k' = 1 to n) in the query, so

$$P(q/T) = \alpha \times \sum_{k=1}^{n} P(k), \quad \alpha = \frac{1}{P(T)} \qquad (5)$$

Thus, to estimate the best node type T, the percentage frequency of occurrence of 'k' at that node type is very important. It is taken as the Tscore% of a particular node, and the node having the highest Tscore% is the relevant node type. Therefore,

$$T_{score} = \alpha \times \sum_{k=1}^{n} P(k) \qquad (6)$$

P(k) can also be defined as the frequency of occurrence of 'k' at node type 'T', and P(T) as the frequency of node type T. Equation (6) then becomes

$$T_{score} = \alpha \times \sum_{k=1}^{n} f(k), \quad \text{for } \alpha = \frac{1}{f(T)} \qquad (7)$$

Thus the Tscore percentage is defined as

$$T_{score\%} = \alpha \times \sum_{k=1}^{n} f(k) \times 100 \qquad (8)$$

The percentage score of the optimal node type, Tscore%, is thus the frequency of occurrence of the query keywords at a particular node type, expressed as a percentage of the frequency of occurrence of that node type, as defined in equation (8).

4. DATA SEARCH AND RANKING

Consider an input keyword query containing 'n' keywords. After pre-processing the XML document with the proposed indexing technique, we look up two indices for each keyword in the query: the data index and the node index. The data index holds the frequency and node type information, whereas the node index holds the frequency and path information.

  10. 10. The proposed XML keyword search is carried out in the following steps:

1. It identifies the search intent of the user. To identify the desired search-for node type, we first estimate the Dscore of the LCA nodes in the XML document using equation (1) and choose the nodes having the least Dscore.
2. Then, for each node type having a valid Dscore, we evaluate its Tscore% using equation (8) and choose the node type with the maximum Tscore% as the best search-for node type.
3. For the search-for node type T computed from the valid Tscore%, the prefix paths of the node type are sorted. The sorted prefix paths of the search-for node type are then ranked by the correlation between the sorted paths.

Algorithm 1:
Input: Query; Node_index; Data_index;

Keyword Matching = index( ) {
    Query = "q";
    if (q = node & Node_index != null)
        for (Node_index length) {
            q = keyword[Node_index];
            f = get_nodefrequency(q);
        }
    else if (q = data & Data_index != null)
        for (Data_index length) {
            q = keyword[Data_index];
            f = get_datafrequency(q);
        }
}
// search for node type //
Score = get_Dscore( ) {
    if (Dscore( ) = min) then
        get_Tscore( )
    node_type = max[Tscore( )]
}
// Ranking //
Rank = get_corr( ) {
    if (sum_corr( ) = max) then
        Ry = max[sum_corr( )]
    Check threshold( ) {
        if difference (Rank1 - Rank2) < Threshold
            then select lowest Tscore
        else Rank1.
    }
}

In Algorithm 1, the function get_nodefrequency calculates the frequency of the T-type nodes containing all the query keywords, and the function get_datafrequency retrieves the number of data nodes present under each T-type node. get_Dscore retrieves the list of paths with the lowest Dscore value; from the output of get_Dscore, the path with the highest Tscore is selected. Finally, ranking is done by the get_corr function, which finds the correlation between all paths.
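Steps 1 and 2 of the procedure above (the "search for node type" part of Algorithm 1) can be sketched in Python as follows. This is a hedged sketch, not the authors' Java code: the candidate record layout, the function names, and every number are hypothetical, and the Dscore is assumed to be the LCA depth divided by the HCA depth.

```python
import math

def d_score(lca_depth, hca_depth):
    """Assumed reading of equation (1): depth of LCA node over depth of HCA node."""
    return lca_depth / hca_depth

def t_score_percent(keyword_freqs, node_type_freq):
    """Equation (8): alpha * sum of f(k) * 100, with alpha = 1 / f(T)."""
    alpha = 1.0 / node_type_freq
    return alpha * sum(keyword_freqs) * 100.0

def select_node_type(candidates):
    """Keep the candidates with the least Dscore, then pick the highest Tscore%."""
    scored = [(d_score(c["lca_depth"], c["hca_depth"]), c) for c in candidates]
    least = min(d for d, _ in scored)
    shortlist = [c for d, c in scored if math.isclose(d, least)]
    return max(
        shortlist,
        key=lambda c: t_score_percent(c["keyword_freqs"], c["node_type_freq"]),
    )

# Hypothetical candidates: every number below is invented for illustration.
candidates = [
    {"path": "dblp,book", "lca_depth": 2, "hca_depth": 2,
     "keyword_freqs": [30, 20], "node_type_freq": 200},
    {"path": "dblp,article", "lca_depth": 2, "hca_depth": 2,
     "keyword_freqs": [5, 5], "node_type_freq": 1000},
    {"path": "dblp,www", "lca_depth": 3, "hca_depth": 2,
     "keyword_freqs": [9, 9], "node_type_freq": 20},
]
best = select_node_type(candidates)
print(best["path"])
```

In this made-up input the third candidate has the largest Tscore% but is dropped first because its Dscore is not the least, which is the ordering of the two filters that the steps describe.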
  11. 11. In general, any statistical relationship between two random variables or two sets of data is referred to as dependence, and correlation refers to any of a broad class of statistical relationships involving dependence. There are several correlation coefficients that measure the degree of correlation. The most commonly used is Pearson's correlation coefficient, obtained by dividing the covariance of the two variables by the product of their standard deviations. Given two series of n values from the sorted paths, X and Y, written x_i and y_i with i = 1, 2, ..., n, the sample correlation coefficient is used to estimate the population Pearson correlation 'r' between X and Y. The sample correlation coefficient used for ranking is written as

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \times \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (9)$$

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \times \left(\frac{y_i - \bar{y}}{s_y}\right) \qquad (10)$$

where (x_i - \bar{x})/s_x is the standard score, \bar{x} is the sample mean and s_x is the sample standard deviation, and likewise for y, as used in equations (9) and (10). After determining the correlation for each combination of paths of the search-for node type, the sum of the correlations of a path with itself and with the other paths of the node type ranks that node type path.

Correlation map

X \ Y  P1                      P2                      P3                      P4                      P5
P1     Corr(P1,P1)             Corr(P1,P2)             Corr(P1,P3)             Corr(P1,P4)             Corr(P1,P5)
P2     Corr(P2,P1)             Corr(P2,P2)             Corr(P2,P3)             Corr(P2,P4)             Corr(P2,P5)
P3     Corr(P3,P1)             Corr(P3,P2)             Corr(P3,P3)             Corr(P3,P4)             Corr(P3,P5)
P4     Corr(P4,P1)             Corr(P4,P2)             Corr(P4,P3)             Corr(P4,P4)             Corr(P4,P5)
P5     Corr(P5,P1)             Corr(P5,P2)             Corr(P5,P3)             Corr(P5,P4)             Corr(P5,P5)
Rank   Σ(x=1 to 5) corr(Px,P1) Σ(x=1 to 5) corr(Px,P2) Σ(x=1 to 5) corr(Px,P3) Σ(x=1 to 5) corr(Px,P4) Σ(x=1 to 5) corr(Px,P5)

From the correlation map it is observed that the correlation of each pair of paths determines the ranking effectiveness.
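The correlation map and its Rank row can be sketched in Python as follows. The Pearson formula is the standard one; how a path is turned into a numeric series is not specified in the text, so the three series below are purely hypothetical, as are the function names.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rank_sums(paths):
    """Rank row of the map: for each path Py, the sum over x of corr(Px, Py)."""
    return {y: sum(pearson(paths[x], paths[y]) for x in paths) for y in paths}

# Hypothetical numeric series for three paths (invented values).
paths = {"P1": [1, 2, 3, 4], "P2": [2, 4, 6, 8], "P3": [4, 3, 2, 1]}
sums = rank_sums(paths)
for name in paths:
    print(name, round(sums[name], 6))
```

Because each path's correlation with itself is 1, every sum includes a constant 1; the ordering is driven by how strongly a path agrees with the remaining paths.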
The ranking is defined as

$$R_y = \sum_{x=1}^{5} corr(P_x, P_y) \qquad (11)$$

The path of the search-for node type whose 'R_y' value in equation (11) is the highest is ranked as the best search intention, provided that the difference between the correlation sums of the first two ranked paths is greater than or equal to the threshold value. If the difference is less than the threshold, the lowest Tscore% is selected as the desired search-for node type, as given in equation (12).
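The threshold rule just described can be sketched in Python as follows. This is one possible reading, not a confirmed implementation: the text does not say which candidates the lowest Tscore% is taken over, so the sketch assumes the two top-ranked paths; the function name and all values are hypothetical.

```python
def choose_path(sums, t_scores, threshold):
    """Pick the top-ranked path unless the top two correlation sums are too close."""
    ordered = sorted(sums, key=sums.get, reverse=True)
    if len(ordered) == 1:
        return ordered[0]
    first, second = ordered[0], ordered[1]
    if sums[first] - sums[second] >= threshold:
        return first
    # Assumption: the tie is broken between the two contenders only.
    return min((first, second), key=t_scores.get)

# Hypothetical correlation sums and Tscore% values (invented for illustration).
sums = {"P1": 3.2, "P2": 3.1, "P3": 1.0}
t_scores = {"P1": 75.0, "P2": 40.0, "P3": 10.0}
close_call = choose_path(sums, t_scores, threshold=0.5)    # gap below threshold
clear_win = choose_path(sums, t_scores, threshold=0.05)    # gap at or above threshold
print(close_call, clear_win)
```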
  12. 12. $$R = \begin{cases} \text{node type with lowest } T_{score\%}, & \text{if } (Rank1 - Rank2) < \text{threshold} \\ \max(Rank) = Rank1, & \text{otherwise} \end{cases} \qquad (12)$$

5. RESULTS AND COMPARISON

Our proposed statistical dependent and ranking measure for keyword search over XML data was evaluated by implementing the approach in Java (JDK 1.6) on a 3.20 GHz Intel(R) Pentium(R) D machine with 1.00 GB RAM running 32-bit Windows 7 Professional. The experimental results are tabulated and compared with the existing method XReal. The results are generated on the real datasets DBLP, WSU and eBay [10, 2], and are discussed in terms of effectiveness and efficiency.

Effectiveness test: This comprises two tests: 1.1) inferring the desired search-for node type and 1.2) quality measurement using the metrics precision, recall and F-measure.

Efficiency test: This test measures the query response time of the proposed method against XReal for all three real datasets.

Note: Queries under test

Notation  Query
DBLP dataset
QD1       “Java book”
QD2       “author Chen Lei”
QD3       “Jim Gray article”
QD4       “XML twig”
QD5       “Ling tokwang twig”
QD6       “vldb 2000”
QD7       “Philip Bernstein”
QD8       “WISE”
QD9       “ER 2005”
QD10      “LATIN 2006”
WSU dataset
QW1       “230”
QW2       “CAC 101”
QW3       “ECON”
QW4       “Biology”
QW5       “place TODD”
QW6       “days TU TH”
eBay dataset
QE1       “2 days”
QE2       “cpu 933”
QE3       “Hard drive CA”

5.1 Effectiveness test

The effectiveness of our statistical dependent and ranking measure for keyword search over XML data is assessed by identifying the user search intention and resolving the ambiguity issues. The accuracy of our approach is tested by evaluating the user search intention for the search-for node type for the queries tabulated in Table 3, of which a couple have both ambiguity 1 and ambiguity 2 and a few have ambiguity 2.
5.1.1 Inferring the desired search for node type

Among the queries used in Table 3, QD1 and QD3 have both ambiguity 1 (a keyword appears as an XML tag name and as a text value) and ambiguity 2 (a keyword appears as text values of different types of XML nodes), whereas QD2, QD6 and QW1 have only ambiguity 2. As Table 3 shows, on the DBLP dataset both our method and XReal infer the intended search for node type, which SLCA/XSeek does not always do. On the WSU and eBay datasets our method infers the desired search for node type in most cases; because these datasets are small, the root node occurs alongside the search intention. For example, for query QE1 the search intention is auction_info and our approach outputs auction_info; listing.

An example of inferring the desired search for node type with our proposed method follows. We consider a query for which the complete search for node type computation is presented.

Input Query: "java book"
==========================================
1) Dscore

Tag         frequency   path                  Dscore
author      413010      dblp,inproceedings    1.0
author      212898      dblp,article          1.0
title       179060      dblp,inproceedings    1.0
url         179058      dblp,inproceedings    1.0
booktitle   179058      dblp,inproceedings    1.0
title       106834      dblp,article          1.0
url         106805      dblp,article          1.0
ee          73560       dblp,inproceedings    1.0
ee          23442       dblp,article          1.0
title       2609        dblp,proceedings      1.0
url         2491        dblp,proceedings      1.0
booktitle   2293        dblp,proceedings      1.0
author      1996        dblp,incollection     1.0
author      1153        dblp,book             1.0
title       1009        dblp,incollection     1.0
booktitle   1009        dblp,incollection     1.0
url         1006        dblp,incollection     1.0
title       845         dblp,book             1.0
book        845         dblp,book             1.0
url         128         dblp,book             1.0
ee          107         dblp,incollection     1.0
title       72          dblp,phdthesis        1.0
author      72          dblp,phdthesis        1.0
url         38          dblp,www              1.0
title       38          dblp,www              1.0
author      14          dblp,www              1.0
ee          6           dblp,proceedings      1.0
title       5           dblp,mastersthesis    1.0
ee          5           dblp,book             1.0
author      5           dblp,mastersthesis    1.0
ee          1           dblp,phdthesis        1.0
booktitle   1           dblp,www              1.0
2) Tscore

Tag Name    Tscore                 path
booktitle   182361.0               dblp,www
author      125829.6               dblp,mastersthesis
ee          97121.0                dblp,phdthesis
title       58094.4                dblp,mastersthesis
author      44939.142857142855     dblp,www
ee          19424.2                dblp,book
ee          16186.833333333332     dblp,proceedings
author      8738.166666666666      dblp,phdthesis
title       7644.0                 dblp,www
url         7619.105263157894      dblp,www
title       4034.333333333333      dblp,phdthesis
url         2261.921875            dblp,book
ee          907.6728971962616      dblp,incollection
author      545.661751951431       dblp,book
title       343.75384615384615     dblp,book
author      315.2044088176352      dblp,incollection
title       287.8810703666997      dblp,incollection
url         287.79920477137176     dblp,incollection
booktitle   180.73439048562932     dblp,incollection
url         116.228823765556       dblp,proceedings
title       111.33461096205444     dblp,proceedings
booktitle   79.52943741822939      dblp,proceedings
ee          4.143033870830134      dblp,article
author      2.9551616266944736     dblp,article
title       2.7189097103918227     dblp,article
url         2.710790693319601      dblp,article
title       1.6222048475371385     dblp,inproceedings
url         1.6169397625350446     dblp,inproceedings
author      1.5233238904627007     dblp,inproceedings
ee          1.3202963567156063     dblp,inproceedings
booktitle   1.0184465368763194     dblp,inproceedings
book        0.0                    dblp,book
=================================================
3) Ranking

Example: correlation of dblp,proceedings and dblp,incollection
corr(dblp,proceedings; dblp,incollection) = 0.1221784083384564

Ranked sum of correlation:

Path                     Rank
P1 = dblp,book           3.2727014742218543
P2 = dblp,phdthesis      3.1869696826431175
P3 = dblp,incollection   3.0431260287060002
P4 = dblp,www            2.0916351992181195
P5 = dblp,article        1.8924147256281627
P6 = dblp,inproceedings  1.8924147256281627
P7 = dblp,proceedings    0.13822919060961375
P8 = dblp,mastersthesis  0.0

$$R = \begin{cases} \text{node type with the lowest Tscore\%}, & \text{if } d < \text{threshold} \\ \max(\text{Rank}) = \text{Rank1}, & \text{otherwise} \end{cases} \qquad d = \text{Rank1} - \text{Rank2}$$

Selected path is dblp,book
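The selection step of the worked example can be sketched in a few lines of Python. This is only an illustration of the rule in equation 12, not the authors' Java implementation: the function name `select_path`, the per-path `tscore_pct` mapping and both threshold values are assumptions introduced here, since the text does not state the threshold or how Tscore% is aggregated per path. The correlation sums are the ones listed in the ranking output above.

```python
# Correlation sums per path, copied from the "Ranked sum of correlation" listing.
RANK_SUMS = {
    "dblp,book": 3.2727014742218543,
    "dblp,phdthesis": 3.1869696826431175,
    "dblp,incollection": 3.0431260287060002,
    "dblp,www": 2.0916351992181195,
    "dblp,article": 1.8924147256281627,
    "dblp,inproceedings": 1.8924147256281627,
    "dblp,proceedings": 0.13822919060961375,
    "dblp,mastersthesis": 0.0,
}


def select_path(rank_sums, tscore_pct, threshold):
    """Pick the search-for path following the rule of equation 12.

    rank_sums  : dict mapping path -> correlation sum (equation 11)
    tscore_pct : dict mapping candidate path -> Tscore% (hypothetical
                 per-path value; the caller decides which candidates to pass)
    threshold  : hypothetical cut-off on d = Rank1 - Rank2
    """
    ranked = sorted(rank_sums.items(), key=lambda item: item[1], reverse=True)
    (best_path, rank1), (_, rank2) = ranked[0], ranked[1]
    d = rank1 - rank2
    if d >= threshold:
        # The top-ranked path is clearly ahead: keep it.
        return best_path
    # Otherwise fall back to the candidate with the lowest Tscore%.
    return min(tscore_pct, key=tscore_pct.get)


# Placeholder Tscore% values, chosen only to exercise the fallback branch.
TSCORE_PCT = {"dblp,book": 0.0, "dblp,phdthesis": 1.0}

print(select_path(RANK_SUMS, TSCORE_PCT, threshold=0.05))
print(select_path(RANK_SUMS, TSCORE_PCT, threshold=0.5))
```

With the listed sums, d = Rank1 - Rank2 is roughly 0.086, so the first call keeps the top-ranked path and the second call takes the Tscore% fallback. Both calls return dblp,book here only because the placeholder Tscore% values happen to favor the same path.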
Table 3. Effectiveness test on inferring the desired search for node type

Query  Keywords               Intention      XReal          SLCA/XSeek                  Our
DBLP (370MB)
QD1    Java, book             book           book           book ; title/book; article  book
QD2    author, Chen, Lei      inproceedings  inproceedings  author                      inproceedings
QD3    Jim, Gray, article     article        article        article                     article
QD4    XML, twig              inproceedings  inproceedings  title/inproceedings         inproceedings
QD5    Ling, tok, wang, twig  inproceedings  inproceedings  inproceedings               inproceedings
QD6    vldb, 2000             inproceedings  inproceedings  inproceedings               inproceedings
WSU (16.5MB)
QW1    230                    place          course;place   room; crs/course            place;course
QW2    CAC, 101               course         course         course                      course
QW3    ECON                   course         course         prefix/course               course
QW4    Biology                course         course         title/course                course
QW5    place, TODD            course         course         place/course                place;course
QW6    days, TU, TH           course         course         days/course                 place
eBay (0.36MB)
QE1    2, days                auction_info   listing        time_left/listing           auction_info;listing
QE2    cpu, 933               listing        listing        cpu/listing                 Item_info;listing
QE3    Hard, drive, CA        listing        listing        description/listing         listing

5.1.2 Quality measure (Precision, Recall & F-measure)

The quality measure also addresses the effectiveness of our approach by evaluating all the queries under test and summarizing a few metrics, viz., precision, recall and F-measure. Precision is the percentage of the output subtrees that are desired; recall is the percentage of the desired subtrees that are output; and F-measure is the weighted mean value of precision and recall. Because most of the queries on DBLP return more than 100 results, the precision, recall and F-measure values for XReal are those of [10]; the same holds for each query issued on WSU and eBay. The comparison is shown in Figures 3 and 4.
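The three metrics can be made concrete with a short Python sketch. It assumes, for illustration only, that a query's output and the desired answer are given as sets of subtree identifiers; the function name and the identifiers are hypothetical. The paper reports F-measure with the formula F = (precision * recall) / (precision + recall), which is half of the conventional F1 score 2PR/(P + R), so the sketch returns both values side by side.

```python
def precision_recall_f(returned, desired):
    """Return (precision, recall, F as used in the paper, conventional F1).

    returned : iterable of subtree identifiers output by the search
    desired  : iterable of subtree identifiers the user actually wants
    """
    returned, desired = set(returned), set(desired)
    hits = len(returned & desired)
    # Fraction of the output subtrees that are desired.
    precision = hits / len(returned) if returned else 0.0
    # Fraction of the desired subtrees that are output.
    recall = hits / len(desired) if desired else 0.0
    denom = precision + recall
    # Formula adopted in the paper; it omits the factor 2 of the usual F1.
    f_paper = precision * recall / denom if denom else 0.0
    return precision, recall, f_paper, 2 * f_paper


# Hypothetical subtree identifiers, not taken from the datasets.
p, r, f, f1 = precision_recall_f({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t5"})
print(p, r, f, f1)
```

Because of the missing factor of 2, a method with precision and recall both near 100% scores close to 50% under the paper's formula, which is worth keeping in mind when reading the reported F-measure percentages.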
Fig. 3. Precision comparison (percent), XReal versus Proposed: (a) DBLP, (b) WSU and (c) eBay

Fig. 4. Recall comparison (percent), XReal versus Proposed: (a) DBLP, (b) WSU and (c) eBay
Table 4. F-measure (%)

Dataset   XReal   Proposed
DBLP      47.48   48.48
WSU       49.67   37.5
eBay      40.02   44.44

Figure 3 shows that the average precision of our proposed approach is higher than that of XReal for the queries on the DBLP dataset. Figure 4 shows the recall for all three real datasets; the recall of our approach outperforms that of XReal. Further, the F-measure in Table 4 is obtained with the formula F = [(precision * recall) / (precision + recall)], computed from the average precision and recall over all the queries under test. The F-measure of our method is 48.48% on DBLP and 44.44% on eBay, whereas for XReal it is 47.48% on DBLP and 40.02% on eBay. On WSU, XReal scores higher (49.67% against 37.5%).

5.2 Efficiency test

The efficiency test is addressed by evaluating the query response time of our proposed method, which uses the indices for keyword information discussed in section 4. It is carried out by measuring the time taken to search for the node type of the given query. The proposed method is compared with XReal's DupTypeNorm. On the DBLP, WSU and eBay real datasets, our approach is observed to be faster than even DupTypeNorm (three-level information indexing). Fig. 5 shows the response time in seconds of the individual queries under test on the DBLP, WSU and eBay datasets.

Fig. 5. Response time (s) on individual queries, DupTypeNorm versus Proposed method: (a) DBLP, (b) WSU and (c) eBay
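Response time of this kind is typically measured with a small timing harness. The sketch below is an assumption about how such a measurement could be set up, not a description of the authors' procedure: the helper name `time_query`, the best-of-N policy and the stand-in `search` callable are all hypothetical.

```python
import time


def time_query(search, query, repeats=5):
    """Return the best wall-clock time in seconds over several runs.

    search  : callable taking the query string (stand-in for the node-type search)
    query   : keyword query, e.g. "java book"
    repeats : number of runs; the minimum is kept to damp scheduling noise
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search(query)
        best = min(best, time.perf_counter() - start)
    return best


# Dummy search function used only to show the call shape.
elapsed = time_query(lambda q: sorted(q.split()), "java book", repeats=3)
print(f"{elapsed:.6f} s")
```

Taking the minimum over repeated runs is one common choice; reporting the mean or median instead would be equally defensible, as long as both compared methods are timed the same way.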
6. CONCLUSION

In this paper, a statistical dependent and ranking measure for keyword search over XML data is designed, and the approach is analyzed over various real XML datasets. We have also performed a broad analysis of the different approaches to keyword search on XML data available in the literature. We developed representations for identifying the user's search intention, resolving the keyword ambiguity issues, and ranking the desired search intention. This was done by introducing a Node index and a Data index, on whose information the Dscore and Tscore measures were developed to infer the search for node type, together with a Correlation Ranking mechanism to rank the search intention. The results obtained for the queries under test on the different datasets, in terms of effectiveness and efficiency, indicate that the proposed approach outperforms the existing techniques of XML keyword search.

7. REFERENCES

[1] D. Guillaume and F. Murtagh, "Clustering of XML Documents", Computer Physics Communications, Vol. 127, pp. 215-227, 2000.
[2] N. Sundaresan, "A classifier for semi-structured documents", in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 340-344, 2000.
[3] Antoine Doucet and Helena Ahonen-Myka, "Naive clustering of a large XML document collection", in Proceedings of the 1st INEX, Germany, 2002.
[4] S. Abiteboul, P. Buneman and D. Suciu, "Data on the Web", Morgan Kaufmann, 2000.
[5] Jianhua Feng and Guoliang Li, "Efficient Fuzzy Type-Ahead Search in XML Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 5, May 2012.
[6] Wei Wang, Christopher Peery, Amélie Marian, and Thu D. Nguyen, "Efficient Multidimensional Fuzzy Search for Personal Information Management Systems", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 9, September 2012.
[7] Ziyang Liu, Yichuan Cai, and Yi Chen, "TargetSearch: A Ranking Friendly XML Keyword Search Engine", International Conference on Data Engineering, pp. 1101-1104, 2010.
[8] Chunxiao Liu, Xiangfu Meng and Ke Wei, "A Top-k Keywords Searching Approach based on the Relationship of Keywords", IEEE International Conference on Systems, Man, and Cybernetics, October 2012.
[9] Yiqun Chen and Jinyin Cao, "TakeXIR: a Type-Ahead Keyword Search XML Information Retrieval System", I.J. Education and Management Engineering, Vol. 8, pp. 1-5, 2012.
[10] Zhifeng Bao, Jiaheng Lu, Tok Wang Ling and Bo Chen, "Towards an Effective XML Keyword Search", IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 8, pp. 1077-1092, 2010.
[11] Jiang Li and Junhu Wang, "Effectively Inferring the Search-for Node Type in XML Keyword Search", Database Systems for Advanced Applications, pp. 110-124, 2010.
[12] Liang Jeff Chen and Yannis Papakonstantinou, "Supporting Top-K Keyword Search in XML Databases", Data Mining Workshops (ICDMW), pp. 805-812, 2012.
[13] Wilfred Ng and Lau Ho Lam, "A Co-Training Framework for Searching XML Documents", Journal Information Systems, Vol. 32, No. 3, 2007.
[14] B. Kimelfeld and Y. Sagiv, "Efficiently enumerating results of keyword search", in Proceedings of DBPL Conference, pp. 58-73, 2005.
[15] Y. Li, C. Yu, and H. V. Jagadish, "Schema-free XQuery", in VLDB, pp. 72-83, 2004.
[16] A. Schmidt, M. L. Kersten, and M. Windhouwer, "Querying XML documents made easy: Nearest concept queries", in ICDE, pp. 321-329, 2001.
[17] Ralf Schenkel and Martin Theobald, "Structural Feedback for Keyword-Based XML Retrieval", ECIR, pp. 326-337, 2006.
[18] Bo Chen, Jiaheng Lu, and Tok Wang Ling, "Exploiting ID References for Effective Keyword Search in XML Documents", in Proceedings of DASFAA, pp. 529-537, 2008.
[19] Arash Termehchy and Marianne Winslett, "Using Structural Information in XML Keyword Search Effectively", ACM Transactions on Database Systems, Vol. 36, No. 1, 2011.
[20] William Webber, "Evaluating the Effectiveness of Keyword Search", IEEE Data Eng. Bull., Vol. 33, No. 1, pp. 54-59, 2010.
[21] Junfeng Zhou, Zhifeng Bao, Wei Wang, Tok Wang Ling, Ziyang Chen, Xudong Lin and Jingfeng Guo, "Fast SLCA and ELCA Computation for XML Keyword Queries based on Set Intersection", Data Engineering (ICDE), pp. 905-916, April 2012.
[22] Jia-Jian Jiang, Zhi-Hong Deng, Ning Gao, and Sheng-Long Lv, "Guess What I Want: Inferring the Semantics of Keyword Queries Using Evidence Theory", Springer-Verlag Berlin Heidelberg, pp. 388-398, 2012.
[23] Dayananda P and Dr. Rajashree Shettar, "Survey on Information Retrieval in Semi Structured Data", International Journal of Computer Applications, 32(8):1-5, October 2011.
[24] Y. Swapna and S. Ravi Sankar, "A Frame Work For Clustering Time Evolving Data Using Sliding Window Technique", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3, 2012, pp. 377-383, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
