Harish Chaware, Prof. Nitin Chopade / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 3, May-Jun 2013, pp. 264-269

Multiple Web Database Handle Using CTVS Method and Record Matching

Harish Chaware #1, Prof. Nitin Chopade #2
#1 M.E. (Scholar), G.H. Raisoni College of Engineering and Management, Amravati.
#2 M.E. (CSE), G.H. Raisoni College of Engineering and Management, Amravati.

Abstract
Web databases generate query result pages based on a user's query. For many applications, such as data integration, which needs to cooperate with multiple web databases, automatically extracting the data from these query result pages is very important. We present a novel data extraction and alignment method called CTVS that combines both tag and value similarity. CTVS automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table, in which the data values from the same attribute are put into the same column. We also present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. We propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information, such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs.

Keywords: Data extraction, automatic wrapper generation, data record alignment, information integration, record matching, duplicate detection, record linkage.

I. Introduction
Today, more and more databases that dynamically generate Web pages in response to user queries are available on the Web.
These Web databases compose the deep or hidden Web. Compared with pages in the surface Web, which can be accessed through a unique URL, pages in the deep Web are dynamically generated in response to a user query submitted through the query interface of a web database. Upon receiving a user's query, a web database returns the relevant data, either structured or semi-structured, encoded in HTML pages [1]. To compare the query results returned from multiple Web databases, a crucial task is to match the records from different sources that refer to the same real-world entity. The records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases [2].

This seminar focuses on two problems: identifying duplicates, that is, two records describing the same entity, which has attracted much attention from many research fields, including Databases, Data Mining, Artificial Intelligence, and Natural Language Processing; and automatically extracting the data records that are encoded in the query result pages generated by web databases. In general, a query result page contains not only the actual data, but also other information, such as navigational panels, advertisements, comments, information about hosting sites, and so on. The goal of web database data extraction is to remove any irrelevant information from the query result page, extract the query result records (referred to as QRRs in this paper) from the page, and align the extracted QRRs into a table such that the data values belonging to the same attribute are placed in the same table column. We propose two new techniques: record matching and CTVS [3].

Unsupervised Duplicate Detection (UDD)
We use the record matching method Unsupervised Duplicate Detection (UDD) for the specific record matching problem of identifying duplicates among records in query results from multiple Web databases. The key ideas of our method are:
1. We focus on techniques for adjusting the weights of the record fields in calculating the similarity between two records. Two records are considered duplicates if they are "similar enough" on their fields. We believe different fields may need to be assigned different importance weights in an adaptive and dynamic manner.
2. We use a sample of universal data consisting of record pairs from different data sources, as well as the record pairs from the same data source, as an approximation for a negative training set [4].

Combining Tag and Value Similarity (CTVS)
We employ the following two-step method, called Combining Tag and Value
Similarity (CTVS), to extract the QRRs from a query result page p.
• Record extraction identifies the QRRs in p and involves two substeps: data region identification and the actual segmentation step.
• Record alignment aligns the data values of the QRRs in p into a table so that data values for the same attribute are aligned into the same table column.

CTVS improves data extraction accuracy in three ways:
1. New techniques are proposed to handle the case when the QRRs are not contiguous in p, which may be due to the presence of auxiliary information, such as a comment, recommendation, or advertisement. While one existing method can find all data regions containing at least two QRRs in a query result page using data mining techniques, almost all other data extraction methods assume that the QRRs are presented contiguously in only one data region in a page. However, this assumption does not hold for many web databases, where auxiliary information separates the QRRs [5]. We examined 100 websites to determine the extent to which the QRRs in the query result pages are noncontiguous. We found that the QRRs in 26 of the 100 websites are noncontiguous, which indicates that noncontiguous data regions are quite common. Furthermore, 14 of the 26 websites have noncontiguous QRRs with the same parent in the page's HTML tag tree; the other 12 websites have noncontiguous QRRs with different parents in the HTML tag tree. To address this problem, we employ two techniques according to the layout of the QRRs and the auxiliary information in the result page's HTML tag tree:
a. An adapted data region identification method is proposed to identify the noncontiguous QRRs that have the same parent according to their tag similarities.
b. A merge method is proposed to combine different data regions that contain the QRRs (with or without the same parent) into a single data region.
2. A novel method is proposed to align the data values in the identified QRRs, first pairwise and then holistically, so that they can be put into a table with the data values belonging to the same attribute arranged in the same table column. Both tag structure similarity and data value similarity are used in the pairwise alignment process. To our knowledge, this work is the first to combine tag structure and data value similarity to perform the alignment. We observe that the data values within the same attribute usually have the same data type, and in many cases similar data values, because they are results for the same query. Hence, the pairwise alignment is reduced to finding a value-to-value alignment with the maximal data value similarity score under some constraints. After all pairs of records are aligned, a holistic alignment is performed by viewing the pairwise alignment result as a graph and finding the connected components of the graph.
3. A new nested-structure processing algorithm is proposed to handle any nested structure in the QRRs after the holistic alignment. Unlike existing nested-structure processing algorithms that rely on only tag information, CTVS uses both tag and data value similarity information to improve nested-structure processing accuracy.

II. Related Work
An important aspect of duplicate detection is reducing the number of record pair comparisons. Several methods have been proposed for this purpose, including standard blocking [1], the sorted neighborhood method [2], bigram indexing, and record clustering. Even though these methods differ in how they partition the data set into blocks, they all considerably reduce the number of comparisons by comparing only records from the same block.
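To illustrate, the standard blocking idea can be sketched as follows; the records and the blocking key used here are hypothetical, and this is only an approximation of the cited methods:

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key_func):
    """Group records into blocks keyed by a blocking key function."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return blocks

def candidate_pairs(records, key_func):
    """Yield only the record pairs that share a blocking key."""
    for block in block_by_key(records, key_func).values():
        yield from combinations(block, 2)

# Hypothetical records, blocked on the first three letters of the title.
records = [
    {"title": "Database Systems", "source": "A"},
    {"title": "Database Design", "source": "B"},
    {"title": "Neural Networks", "source": "A"},
]
pairs = list(candidate_pairs(records, lambda r: r["title"][:3].lower()))
# Only the two "Dat..." records form a candidate pair; the cross-block
# comparisons with "Neural Networks" are skipped entirely.
```

With b roughly equal-sized blocks, this reduces the comparison count from about n²/2 to about n²/(2b), which is why all of the cited methods adopt some form of it.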
Since any of these methods can be incorporated into UDD to reduce the number of record pair comparisons, we do not consider this issue further. While most previous record matching work is targeted at matching a single type of record, more recent work [1], [3], [5] has addressed the matching of multiple types of records with rich associations between the records. Even though the matching complexity increases rapidly with the number of record types, these works manage to capture the matching dependencies between multiple record types and utilize such dependencies to improve the matching accuracy of each single record type. Unfortunately, the dependencies among multiple record types are not available for many domains. Compared to these previous works, UDD is specifically designed for the Web database scenario, where the records to match are of a single type with multiple string fields. These records are heavily query-dependent and are only a partial and biased portion of the entire data, which makes existing work based on offline learning inappropriate. Moreover, our work focuses on studying and addressing the field weight assignment issue rather than the similarity measure; in UDD, any similarity measure, or some combination of measures, can easily be incorporated.

Our work is also related to the classification problem that uses only a single class of training examples, i.e., either positive or negative, to find data similar to the given class. To date, most single-class classification work has relied on learning from positive and unlabeled data [4], [5], [7]. In [9], multiple classification methods are compared, and it is concluded that one-class SVM and neural network methods are comparable and superior to all the other methods. In particular, a one-class SVM distinguishes one class of data from another by drawing the class boundary of the provided data's class in the feature space. However, it requires a lot of data to induce the boundary precisely, which makes it liable to overfit or underfit the data; moreover, it is very vulnerable to noise.

The record matching works most closely related to UDD are Christen's method [10] and PEBL [11]. Using a nearest-neighbor-based approach, Christen first performs a comparison step to generate weight vectors for each pair of records and selects as training examples those weight vectors that, with high likelihood, correspond to either true matches (i.e., pairs with high similarity scores, used as positive examples) or true nonmatches (i.e., pairs with low similarity scores, used as negative examples). These training examples are then used in a convergence step to train a classifier (either nearest neighbor or SVM) to label the record pairs not in the training set. Combined, these two steps allow fully automated, unsupervised record pair classification, without the need to know the true match and nonmatch status of the weight vectors produced in the comparison step. PEBL [11] classifies Web pages in two stages by learning from positive examples and unlabeled data. In the mapping stage, a weak classifier, e.g., a rule-based one, is used to obtain "strong" negative examples from the unlabeled data, namely those that contain none of the frequent features of the positive examples.
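The mapping-stage idea can be sketched as follows; this is a simplified illustration with word-level features and made-up data, not PEBL's actual feature model:

```python
from collections import Counter

def frequent_features(positives, k=5):
    """Collect the k most frequent features (here simply words) of the positive set."""
    counts = Counter(word for doc in positives for word in doc.split())
    return {word for word, _ in counts.most_common(k)}

def strong_negatives(unlabeled, positives, k=5):
    """Mapping stage: unlabeled documents sharing no frequent positive
    feature are taken as 'strong' negative examples."""
    feats = frequent_features(positives, k)
    return [doc for doc in unlabeled if not feats & set(doc.split())]

# Hypothetical positive and unlabeled documents.
positives = ["cheap flight ticket deals", "flight booking online"]
unlabeled = ["flight schedule update", "gardening tips for spring"]
negatives = strong_negatives(unlabeled, positives)
# "gardening tips for spring" shares no frequent positive word, so it
# becomes a strong negative; "flight schedule update" does not.
```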
In the convergence stage, an internal classifier, e.g., an SVM, is first trained on the positive examples and the automatically identified negative examples, and is then used to iteratively identify new negative examples until it converges.

Web database extraction has received much attention from the Database and Information Extraction research areas in recent years due to the volume and quality of deep web data [12], [13], [14], [16]. As the returned data for a query are embedded in HTML pages, the research has focused on how to extract this data. Earlier work focused on wrapper induction methods, which require human assistance to build a wrapper. More recently, data extraction methods have been proposed to automatically extract the records from the query result pages. In wrapper induction, extraction rules are derived based on inductive learning. A user labels or marks part or all of the item(s) to extract (the target item(s)) in a set of training pages or a list of data records in a page, and the system then learns the wrapper rules from the labeled data and uses them to extract records from new pages. A rule usually contains two patterns, a prefix pattern and a suffix pattern, to denote the beginning and the end, respectively, of the target item. Some existing systems that employ wrapper induction include WIEN, SoftMealy [1], Stalker [2], [3], XWRAP [4], WL2 [7], [10], and Lixto [13]. While wrapper induction has the advantage that no extraneous data are extracted, since the user can label only the items of interest, it requires labor-intensive and time-consuming manual labeling of data. Thus, it is not scalable to a large number of web databases. Moreover, an existing wrapper can perform poorly when the format of a query result page changes, which may happen frequently on the web. Hence, the wrapper induction approach involves two further difficult problems: monitoring changes in a page's format and maintaining a wrapper when a page's format changes.
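For illustration, such a prefix/suffix rule can be sketched as a regular expression; the patterns and the price field below are hypothetical examples, not rules learned by any of the systems named above:

```python
import re

def make_rule(prefix, suffix):
    """Build an extraction rule that captures the target item between a
    learned prefix pattern and a learned suffix pattern."""
    pattern = re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix), re.S)
    return lambda page: pattern.findall(page)

# Hypothetical learned rule: the price appears between "<b>$" and "</b>".
extract_price = make_rule("<b>$", "</b>")
page = "<li>Widget <b>$19.99</b></li><li>Gadget <b>$5.00</b></li>"
prices = extract_price(page)
# prices == ["19.99", "5.00"]
```

The fragility discussed above is visible here: if the site redesigns its template so prices are no longer wrapped in "<b>$ ... </b>", the rule silently extracts nothing.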
To overcome the problems of wrapper induction, some unsupervised learning methods, such as RoadRunner, Omini, IEPAD, ExAlg [1], DeLa [2], PickUp [9], and TISP [10], have been proposed to automatically extract the data from the query result pages. These methods rely entirely on the tag structure in the query result pages. Here, we discuss only DeLa, since we compare its performance with that of CTVS. DeLa models the structured data contained in template-generated web pages as string instances, encoded in HTML tags, of the implied nested type of their web database. A regular expression is employed to model the HTML-encoded version of the nested type. Since the HTML tag structure enclosing the data may appear repeatedly if the page contains more than one instance of the data, the page is first transformed into a token sequence composed of HTML tags and a special token "text" representing any text string enclosed by a pair of HTML tags. Then, continuously repeated substrings are extracted from the token sequence, and a regular expression wrapper is induced from the repeated substrings according to the hierarchical relationships among them. The main problem with this method is that it often produces multiple patterns (rules), and it is hard to decide which one is correct.

III. Proposed Work
In this section, we describe the methods that are used in data extraction and alignment. There is a method for automatically extracting the data records that are encoded in the query result pages generated by web databases. [3]
1 QRR Extraction

Figure 1. QRR extraction framework.

Fig. 1 shows the framework for QRR extraction. Given a query result page, the Tag Tree Construction module first constructs a tag tree for the page, rooted at the <HTML> tag. Each node represents a tag in the HTML page, and its children are the tags enclosed inside it. Each internal node n of the tag tree has a tag string tsn, which includes the tags of n and of all of n's descendants, and a tag path tpn, which includes the tags from the root to n. Next, the Data Region Identification module identifies all possible data regions, which usually contain dynamically generated data, top down, starting from the root node. The Record Segmentation module then segments the identified data regions into data records according to the tag patterns in the data regions. Given the segmented data records, the Data Region Merge module merges the data regions containing similar records. Finally, the Query Result Section Identification module selects one of the merged data regions as the one that contains the QRRs. The following four sections describe each of the last four modules in detail. [6]

1. Data Region Identification
We first assume that some child subtrees of the same parent node form similar data records, which assemble a data region. Existing methods claim that similar data records are typically presented contiguously in a page. Instead, we observed that in many query result pages some additional item that explains the data records, such as a recommendation or comment, often separates similar data records. Hence, we propose a new method to handle noncontiguous data regions so that it can be applied to more web databases.

2. Record Segmentation
The record segmentation algorithm first finds tandem repeats within a data region.
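The tandem-repeat search over a data region's tag sequence can be sketched as follows; the flat tag list is a simplification of the tag tree used by CTVS, and the tags are hypothetical:

```python
def tandem_repeats(tags):
    """Find tandem repeats: substrings of the tag sequence that occur two
    or more times in immediate succession. All candidates are returned;
    selection heuristics then choose among them."""
    n = len(tags)
    repeats = []
    for start in range(n):
        for length in range(1, (n - start) // 2 + 1):
            unit = tags[start:start + length]
            count = 1
            # Count how many times the unit repeats back to back.
            while tags[start + count * length:start + (count + 1) * length] == unit:
                count += 1
            if count > 1:
                repeats.append((start, unit, count))
    return repeats

# Hypothetical data region: three repetitions of a <li><a><i> record template.
region = ["li", "a", "i", "li", "a", "i", "li", "a", "i"]
found = tandem_repeats(region)
# Among the candidates is the full record template ["li", "a", "i"]
# starting at position 0 and repeating 3 times.
```

Because several overlapping candidates are typically found, a selection step is needed, which is the role of the heuristics below.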
The following two heuristics are used for tandem repeat selection:
• If there is auxiliary information within a data region, corresponding to nodes between record instances, the tandem repeat that stops at the auxiliary information is the correct tandem repeat, since auxiliary information is usually not inserted into the middle of a record.
• The visual gap between two records in a data region is usually larger than any visual gap within a record. Hence, the tandem repeat that satisfies this constraint is selected.
• If the preceding two heuristics cannot be used, we select the tandem repeat that starts the data region. [7]

3. Data Region Merge
The data region identification step may identify several data regions in a query result page. Moreover, the actual data records may span several data regions. In the websites we examined, 12 percent had QRRs with different parents in the HTML tag tree. Thus, before we can identify all the QRRs in a query result page, we need to determine whether any of the data regions should be merged. Given any two data regions, we treat them as similar if the segmented records they contain are similar. The similarity between any two records from two data regions is measured by the similarity of their tag strings. The similarity between two data regions is calculated as the average record similarity.

4. Query Result Section Identification
Even after performing the data region merge step, there may still be multiple data regions in a query result page. However, we assume that at most one data region contains the actual QRRs. Three heuristics are used to identify this data region, called the query result section:
• The query result section usually occupies a large space in the query result page.
• The query result section is usually located at the center of the query result page.
• Each QRR usually contains more raw data strings than the other sections do.

4.1 QRR Alignment
QRR alignment is performed by a novel three-step process:
1.
Pairwise QRR Alignment
The pairwise QRR alignment algorithm is based on the observation that the data values belonging to the same attribute usually have the same data type and may contain similar strings, especially since the QRRs are results for the same query. During the pairwise alignment, we require that the data value alignments satisfy the following three constraints:
• Same record path constraint
• Unique constraint
• No cross alignment constraint

2. Holistic Alignment
Given the pairwise data value alignments between every pair of QRRs, the holistic alignment step performs the alignment globally among all QRRs to construct a table in which all data values of the same attribute are aligned in the same table column. Intuitively, if we view each data value in the QRRs as a vertex and each pairwise alignment between two data values as an edge, the pairwise alignment set can be viewed as an undirected graph.

3. Nested Structure Processing
Holistic data value alignment constrains a data value in a QRR to be aligned to at most one data value from another QRR. If a QRR contains a nested structure such that an attribute has multiple values, then some of the values may not be aligned to any other values. Therefore, nested structure processing identifies the data values of a QRR that are generated by nested structures. To do so, CTVS uses both the HTML tags and the data values to identify the nested structures.

4.3 UDD
Our focus is on Web databases from the same domain, i.e., Web databases that provide the same type of records in response to user queries. Suppose there are s records in data source A and t records in data source B, with each record having a set of fields/attributes. Each of the t records in data source B can potentially be a duplicate of each of the s records in data source A. The goal of duplicate detection is to determine the matching status, i.e., duplicate or nonduplicate, of these s * t record pairs.

We present the assumptions and observations on which UDD is based:
1. A global schema for the specific type of result records is predefined, and each database's individual query result schema has been matched to the global schema.
2. Record extractors, i.e., wrappers, are available for each source to extract the result data from HTML pages and insert them into a relational database according to the global schema.

Besides these two assumptions, we also make use of the following two observations:
1. The records from the same data source usually have the same format.
2. Most duplicates from the same data source can be identified and removed using an exact matching method.

Duplicate records exist in the query results of many Web databases, especially when the duplicates are defined based on only some of the fields in a record. A straightforward preprocessing step, exact matching, can merge those records that are exactly the same in all relevant matching fields. [10]

IV. Conclusion
We presented a novel data extraction method, CTVS, to automatically extract QRRs from a query result page. CTVS employs two steps for this task. The first step identifies and segments the QRRs. We improve on existing techniques by allowing the QRRs in a data region to be noncontiguous. The second step aligns the data values among the QRRs. A novel alignment method is proposed in which the alignment is performed in three consecutive steps: pairwise alignment, holistic alignment, and nested structure processing. Experiments on five data sets show that CTVS is generally more accurate than current state-of-the-art methods.

Duplicate detection is an important step in data integration, and most state-of-the-art methods are based on offline learning techniques, which require training data. In the Web database scenario, where the records to match are greatly query-dependent, a pretrained approach is not applicable, as the set of records in each query's results is a biased subset of the full data set. To overcome this problem, we presented an unsupervised, online approach, UDD, for detecting duplicates over the query results of multiple Web databases.
Two classifiers, WCSS and SVM, are used cooperatively in the convergence step of record matching to iteratively identify the duplicate pairs from all potential duplicate pairs. Experimental results show that our approach is comparable to previous work that requires training examples for identifying duplicates from the query results of multiple Web databases.

References
[1] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, "Combining Tag and Value Similarity for Data Extraction and Alignment," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, July 2012.
[2] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, April 2010.
[3] Y. Zhai and B. Liu, "Structured Data Extraction from the Web Based on Partial Tree Alignment," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 12, pp. 1614-1628, Dec. 2006.
[4] W. Su, J. Wang, and F.H. Lochovsky, "Holistic Schema Matching for Web Query Interfaces," Proc. 10th Int'l Conf. Extending Database Technology, pp. 77-94, 2006.
[5] C. Tao and D.W. Embley, "Automatic Hidden-Web Table Interpretation by Sibling Page Comparison," Proc. 26th Int'l Conf. Conceptual Modeling, pp. 566-581, 2007.
[6] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 25-27, 2003.
[7] K. Simon and G. Lausen, "ViPER: Augmenting Automatic Information Extraction with Visual Perceptions," Proc. 14th ACM Int'l Conf. Information and Knowledge Management, pp. 381-388, 2005.
[8] D. Buttler, L. Liu, and C. Pu, "A Fully Automated Object Extraction System for the World Wide Web," Proc. 21st Int'l Conf. Distributed Computing Systems, pp. 361-370, 2001.
[9] J. Wang and F. Lochovsky, "Data-Rich Section Extraction from HTML Pages," Proc. Third Int'l Conf. Web Information Systems Eng., 2002.
[10] K.C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang, "Structured Databases on the Web: Observations and Implications," SIGMOD Record, vol. 33, no. 3, pp. 61-70, 2004.
