
Joint Repairs for Web Wrappers

Presentation of the paper "Joint Repairs for Web Wrappers" (ICDE '16)



  1. 1. Joint Repairs for Web Wrappers. Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche. ICDE, Helsinki, May 19, 2016.
[Title slide; the background reproduces material from the WADaR poster (regex induction from common prefixes and suffixes, value-based fallback expressions, robustness to NER errors, and the optimality guarantee), which is transcribed in full after the final slide.]
  2. 2. Background: Web wrapping. Web wrapping is the process of turning semi-structured (templated) web data into structured form, e.g.:
refcode | postcode | bedrooms | bathrooms | available | price
33453 | OX2 6AR | 3 | 2 | 15/10/2013 | £1280 pcm
33433 | OX4 7DG | 2 | 1 | 18/04/2013 | £995 pcm
Hidden databases are actually a form of dark / dim data (ref. panel on Tuesday).
  3. 3. Background: Web wrapping. Manual / (semi-)supervised wrapping: accurate, but expensive and non-scalable. Unsupervised wrapping (e.g., Wrapidity): less accurate, but cheaper and scalable.
  4. 4. Background: Web wrapping. From (manually or automatically) created examples to XPath-based wrappers. A wrapper is a set of pairs ⟨field, expression⟩ that, once applied to the DOM, return structured records. Even on templated websites, automatic wrapping can be inaccurate.
field: expression
listing: //body
record: //div[contains(@class,'movlist_wrap')]
title: .//span[contains(@class,'title')]/text()
rated: .//span[.='rating:']/following-sibling::strong/text()
genre: .//span[.='genre:']/following-sibling::strong/text()
releaseMo: .//span[@class='release']/text()
releaseDy: .//span[@class='release']/text()
releaseYr: .//span[@class='release']/text()
image: .//@src
runtime: .//span[.='runtime:']/following-sibling::strong/text()
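As an illustration of how such ⟨field, expression⟩ pairs are applied to the DOM, here is a minimal sketch assuming Python with lxml (not the authors' implementation); the record and field expressions mirror the illustrative ones above and are assumptions about the page structure.

from lxml import html

# Illustrative wrapper: a record expression plus per-field expressions,
# mirroring the slide's <field, expression> pairs (assumed page structure).
WRAPPER = {
    "record": "//div[contains(@class,'movlist_wrap')]",
    "fields": {
        "title":   ".//span[contains(@class,'title')]/text()",
        "rated":   ".//span[.='rating:']/following-sibling::strong/text()",
        "runtime": ".//span[.='runtime:']/following-sibling::strong/text()",
    },
}

def apply_wrapper(page_source):
    """Apply the wrapper to one page and return a list of field->value records."""
    dom = html.fromstring(page_source)
    records = []
    for node in dom.xpath(WRAPPER["record"]):
        row = {}
        for field, expr in WRAPPER["fields"].items():
            values = node.xpath(expr)   # may be empty, garbled, or under-segmented
            row[field] = values[0].strip() if values else None
        records.append(row)
    return records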
  5. 5. Problems with wrapping. Inaccurate wrapping results in over- (or under-) segmented data. Example extraction using RoadRunner (Crescenzi et al.):
Attribute_1, Attribute_2:
Ava's Possessions | Release Date: March 4, 2016 | Rated: R | Genre(s): Sci-Fi, Mystery, Thriller, Horror | Production Company: Off Hollywood Pictures | Runtime: 216 min
Camino | Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Action, Adventure, Thriller | Production Company: Bielberg Entertainment | Runtime: 103 min
Cemetery of Splendor | Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Drama | User Score: 4.6 | Production Company: Centre National de la Cinématographie (CNC) | Runtime: 122 min
RS: source relation (Attribute_1, Attribute_2); target schema: Title, Release, Genre, Rating, Runtime.
  6. 6. Questions. The questions we want to answer are: can we fix the data, and use what we learn to repair the wrappers as well? Are the solutions scalable? Why do we care? Companies such as FB and Skyscanner spend millions of dollars of engineering time creating and maintaining wrappers; wrapper maintenance is a major cost of data acquisition from the web.
  7. 7. Fixing the data. The wrapper thinks it is filling this schema: MAKE, MODEL, PRICE…
…but instead it produces this instance:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
If all instances looked like this (i.e., mis-segmentation, but no garbage and no shuffling):
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
…it would be a table-induction problem: TEGRA, WebTables, etc. Moreover, we would still have no clue on how to fix the wrapper afterwards.
  8. 8. What is a good relation? The problem is that wrapper-generated relations really look like this:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
First, we need a way to determine how "far" we are from a good relation.
ū = ⟨u1, u2, …, un⟩: a tuple generated by the wrapper.
Σ = ⟨A1, A2, …, Am⟩: the (target) schema for the extraction.
Ω = {ωA1, …, ωAm}: a set of oracles for Σ, where ωA(u) = 1 if u ∈ dom(A) or u = null, and ωA(u) = 0 otherwise.
The fitness then quantifies how well ū (resp. the whole instance) "fits" Σ. For the instance above, with Ω = {ωMAKE, ωMODEL, ωPRICE}, Σ = ⟨MAKE, PRICE, MODEL⟩, and the fields checked against ωMAKE, ωPRICE, ωMODEL: f(R, Σ, Ω) = 1/2 = 50%.
  9. 9. Problem Definition: Fitness.
Input: a wrapper W, a relation R such that W(P) = R for some set of pages P, and a schema Σ.
Σ = ⟨A1, A2, …, Am⟩: attributes (fields) of the target schema of the relation.
ū = ⟨u1, u2, …, un⟩: a tuple of the wrapper-generated relation R.
Ω = {ωA1, …, ωAm}: a set of oracles for the fields of Σ, such that ωA(u) = 1 if u ∈ dom(A) or u = null, and ωA(u) = 0 otherwise.
We define the fitness of a tuple ū (resp. of the relation R) w.r.t. a schema Σ as:
f(ū, Σ, Ω) = (ωA1(u1) + … + ωAc(uc)) / d, where c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}
resp. f(R, Σ, Ω) = (sum over ū ∈ R of f(ū, Σ, Ω)) / |R|
Example (Σ = ⟨MAKE, MODEL, PRICE⟩):
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
f(R, Σ, Ω) = 1/6 ≈ 17%
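The fitness definition above translates directly into code. The following is a minimal sketch (Python, illustrative only, not the authors' code), with toy stand-ins for the oracles ωMAKE, ωMODEL, ωPRICE.

def tuple_fitness(tup, schema, oracles):
    """f(ū, Σ, Ω): oracle hits over the first c fields, divided by d."""
    c = min(len(schema), len(tup))     # c = min{arity(Σ), arity(R)}
    d = max(len(schema), len(tup))     # d = max{arity(Σ), arity(R)}
    return sum(oracles[schema[i]](tup[i]) for i in range(c)) / d

def relation_fitness(relation, schema, oracles):
    """f(R, Σ, Ω): average tuple fitness over the relation."""
    return sum(tuple_fitness(t, schema, oracles) for t in relation) / len(relation)

# Toy oracles: ω_A(u) = 1 iff u is null or u "looks like" a value of A (assumptions).
ORACLES = {
    "MAKE":  lambda u: u is None or u in {"Audi", "Ford", "Citroën"},
    "MODEL": lambda u: u is None or (bool(u) and not u.startswith("£")),   # crude stand-in
    "PRICE": lambda u: u is None or (u.startswith("£") and u.endswith("k")),
}

relation = [("£19k", "Audi", "A3 Sportback"),
            ("Citroën", "£10k", "C3")]
print(relation_fitness(relation, ("MAKE", "MODEL", "PRICE"), ORACLES))   # -> about 0.33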
  10. 10. Problem Definition: Σ-repairs.
A Σ-repair is a pair σ = ⟨Π, ρ⟩ where:
Π = (i, j, …, k) is a permutation of the fields of R, and
ρ = { ⟨A1, ƐA1⟩, ⟨A2, ƐA2⟩, …, ⟨Am, ƐAm⟩ } is a set of regexes, one per attribute in Σ.
Σ-repairs are applied to a tuple ū as follows: σ(ū) = ⟨ ƐA1(Π(ū)), ƐA2(Π(ū)), …, ƐAm(Π(ū)) ⟩.
The notion of applicability extends naturally to relations σ(R) (i.e., sets of tuples). Similarly, Σ-repairs can be applied to wrappers as well [details in the paper].
Output: a wrapper W' and a relation R' such that W'(P) = R' and R' is of maximum fitness w.r.t. Σ. The goal is to find the Σ-repair that maximises the fitness.
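Purely to illustrate how a Σ-repair σ = ⟨Π, ρ⟩ acts on a tuple, here is a sketch (assumed, not the paper's formalism verbatim) where Π reorders the wrapper fields and each Ɛ_A is an ordinary Python regex applied to the flattened record; the expressions are hypothetical.

import re

def apply_sigma_repair(tup, permutation, rho):
    """σ(ū) = ⟨Ɛ_A1(Π(ū)), …, Ɛ_Am(Π(ū))⟩ with each Ɛ_A given as a regex with one capture group."""
    record = " ".join(tup[i] for i in permutation if tup[i])   # Π(ū), flattened to a string
    repaired = []
    for attribute, pattern in rho.items():
        match = re.search(pattern, record)
        repaired.append(match.group(1) if match else None)
    return tuple(repaired)

# Hypothetical expressions for the MAKE/MODEL/PRICE example.
RHO = {
    "MAKE":  r"Make:\s*(\S+)",
    "MODEL": r"Model:\s*(.+?)(?:\s*£|\s*$)",
    "PRICE": r"(£\d+k)",
}
print(apply_sigma_repair(("£19k", "Make: Audi Model: A3 Sportback"), (0, 1), RHO))
# -> ('Audi', 'A3 Sportback', '£19k')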
  11. 11. Computing Σ-repairs.
We have an atomic misplacement when the correct value for an attribute is (1) entirely misplaced, or (2), if it is over-segmented, its fragments are in adjacent fields of the relation.
Example (MAKE, MODEL, PRICE): "£22k | Ford | C-max Titanium X" is an atomic misplacement; "C-max | £22k X | Ford Titanium" is a non-atomic misplacement.
Complexity [details in the paper]:
1. non-atomic misplacements: NP-complete (red. from Weighted Set Packing)
2. atomic misplacements: polynomial (red. from Stars and Buckets)
Naïve algorithm, for each tuple (steps 2-3 are illustrated in the sketch below):
1. permute the tuple's fields in all possible ways (only needed for non-atomic misplacements)
2. segment the tuple in all possible ways
3. ask the oracles and keep the segmentation of highest fitness
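The sketch below (illustrative Python, assuming an `oracles` map from attribute name to a 0/1 predicate, as in the fitness sketch above) spells out steps 2-3 for a single tuple: enumerate every contiguous segmentation of the token sequence into |Σ| fields and keep the one the oracles score highest.

from itertools import combinations

def best_segmentation(tokens, schema, oracles):
    """Try every way of cutting `tokens` into len(schema) contiguous fields."""
    n, k = len(tokens), len(schema)
    best, best_fit = None, -1.0
    # choose k-1 cut points among the n-1 gaps: O(n^(k-1)) candidates in the worst case
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        candidate = [" ".join(tokens[b:e]) for b, e in zip(bounds, bounds[1:])]
        fit = sum(oracles[a](v) for a, v in zip(schema, candidate)) / k
        if fit > best_fit:
            best, best_fit = candidate, fit
    return best, best_fit

# e.g. best_segmentation("£22k Ford C-max Titanium X".split(), ("MAKE", "MODEL", "PRICE"), ORACLES)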
  12. 12. Approximating Σ-repairs. The naïve algorithm has the following problems:
1. oracles do not (always) exist
2. it fixes one tuple at a time, while the wrapper needs a single fix for each attribute
3. even under the assumption of atomic misplacements we still have to try O(n^k) different segmentations (worst case) before finding the one of maximum fitness
(1) Weak oracles: use noisy NERs in place of oracles; if none is available, it is easy to build one (a toy weak oracle is sketched below). In this work we use ROSeAnn (Chen et al., PVLDB '13).
(2 and 3) Approximate relation-wide repairs: wrappers are programs, so if they make a mistake they make it consistently. There is hope for a common underlying attribute structure.
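Building a weak oracle need not be hard. As a purely illustrative sketch (not ROSeAnn, and not the paper's code), a gazetteer plus a regular expression already gives a usable, if noisy, ω_A; the values and patterns below are assumptions.

import re

def make_weak_oracle(gazetteer=None, pattern=None):
    """Return a 0/1 oracle accepting nulls, gazetteer hits, and full regex matches."""
    def oracle(value):
        if value is None:
            return 1                                   # nulls are always accepted
        if gazetteer and value in gazetteer:
            return 1
        if pattern and re.fullmatch(pattern, value):
            return 1
        return 0
    return oracle

omega_make  = make_weak_oracle(gazetteer={"Audi", "Ford", "Citroën"})
omega_price = make_weak_oracle(pattern=r"£\d+(?:\.\d+)?k?(?:\s*pcm)?")
print(omega_make("Audi"), omega_price("£1280 pcm"), omega_price("Audi"))   # -> 1 1 0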
  13. 13. Finding the right structure. We have to solve two problems: find the underlying structure(s) of the relation, and find a segmentation that maximises the fitness.
An obvious way is sequence labelling (e.g., Markov chains + Viterbi) where oracles are simulated by NERs (so they can make mistakes).
[Figure: a Markov chain over attributes A, B, C, D with SOURCE and SINK states and transition counts obtained from the annotated token grid; Ω = {ωA, ωB, ωC, ωD}.]
The maximum-likelihood sequence is actually ⟨A, D⟩, which "fits" ~28%. It looks like there's another sequence that fits better…
  14. 14. Finding the right structure. The problem is that Markov chains are memory-less: we have to remember the context and make sure our sequence satisfies the oracles more than any other. Ok… this sounds like a max flow!
[Figure: a flow network whose nodes remember their context, e.g., vA,(), vB,(A), vC,(A,B), vD,(A), with capacities given by annotation counts; Ω = {ωA, ωB, ωC, ωD}.]
The sequence corresponding to the max flow is ⟨A, B, C⟩, which "fits" ~32%.
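A minimal sketch of this encoding (assuming Python with networkx; the encoding is inferred from the slide, not taken from the WADaR implementation): each annotated tuple contributes a SOURCE-to-SINK path over context-aware nodes, edge capacities count tuples, and the path carrying the heaviest flow is read off as the likely attribute sequence.

import networkx as nx

def _bump(G, u, v):
    """Increase the capacity of edge (u, v) by one annotated tuple."""
    if G.has_edge(u, v):
        G[u][v]["capacity"] += 1
    else:
        G.add_edge(u, v, capacity=1)

def likely_sequence(annotated_tuples):
    """annotated_tuples: attribute sequences per tuple, e.g. [("MAKE", "MODEL", "PRICE"), ...]."""
    G = nx.DiGraph()
    for seq in annotated_tuples:
        prev, context = "SOURCE", ()
        for attr in seq:
            node = (attr, context)          # the node remembers the attributes seen so far
            _bump(G, prev, node)
            prev, context = node, context + (attr,)
        _bump(G, prev, "SINK")
    _, flow = nx.maximum_flow(G, "SOURCE", "SINK")
    # read off the path carrying the heaviest flow as the likely attribute sequence
    path, node = [], "SOURCE"
    while node != "SINK":
        node = max(flow[node], key=flow[node].get)
        if node != "SINK":
            path.append(node[0])
    return path

print(likely_sequence([("PRICE", "MAKE", "MODEL"),
                       ("PRICE", "MAKE", "MODEL"),
                       ("PRICE", "MAKE", "MODEL"),
                       ("MAKE", "PRICE", "MODEL")]))
# -> ['PRICE', 'MAKE', 'MODEL'] for these made-up annotations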
  15. 15. Finding the right structure. First, annotate the relation using NERs (surrogate oracles) and build the network. Then iteratively compute max flows on the network, i.e., likely sequences of high fitness. We stop when we have covered "enough" of the tuples in the relation.
Example (Ω = {ωMAKE, ωMODEL, ωPRICE}):
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
[Figures: the annotation network; iteration 0 extracts the max-flow sequence PRICE, MAKE, MODEL; iteration 1 extracts MAKE, PRICE, MODEL.]
  16. 16. Fixing the relation (and the wrapper). Max flows represent likely sequences; we use them to eliminate unsound annotations. The remaining annotations can then be used as examples for regex induction: we can use standard regex-induction algorithms to obtain robust expressions, and the induced expressions recover missing (incomplete) annotations. A prefix/suffix-based induction of this kind is sketched below.
Example annotation spans: "£19k Make: Audi Model: A3 Sportback" with MAKE [11,15), MODEL [24,36), PRICE [0,4).
Relation:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
Induced repair:
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, 'ke:␣'), '␣Mo')⟩,
⟨MODEL, substring-after($, 'el:␣')⟩,
⟨PRICE, substring-after(substring-before($, 'kMa␣' || 'kMo␣'), ␣)⟩ }
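A toy version of this prefix/suffix-based induction (Python, illustrative; the window size and the fallback behaviour are assumptions, not WADaR's actual algorithm) could look as follows.

import os
import re

def induce_expression(records, spans, window=10):
    """records: raw record strings; spans: (start, end) of the clean annotation in each record."""
    prefixes = [r[max(0, s - window):s] for r, (s, _) in zip(records, spans)]   # text before the value
    suffixes = [r[e:e + window] for r, (_, e) in zip(records, spans)]           # text after the value
    # longest common *suffix* of the prefixes, longest common prefix of the suffixes
    prefix = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
    suffix = os.path.commonprefix(suffixes)
    if not prefix and not suffix:
        return None            # not enough regularity: fall back to value-based expressions
    return re.compile(re.escape(prefix) + r"(.*?)" + (re.escape(suffix) if suffix else r"$"))

records = ["£19k Make: Audi Model: A3 Sportback",
           "£43k Make: Audi Model: A6 Allroad quattro"]
make_spans = [(11, 15), (11, 15)]                     # the two "Audi" annotations
make_expr = induce_expression(records, make_spans)
print(make_expr.search(records[0]).group(1))          # -> 'Audi'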
  17. 17. Approximating Σ-repairs. When an expression fails to match a minimum number of tuples, we fall back to the NERs: value-based expressions.
Instance (MAKE, MODEL, PRICE):
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
Example (induction threshold 75%): ρ = { ⟨MAKE, value-based($, [Audi, Ford])⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣), ␣)⟩ }
Example (induction threshold 20%): ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, k␣), ␣)⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣), ␣)⟩ }
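For completeness, a value-based expression is little more than a disjunction of the values the annotators accepted; a minimal sketch (Python, illustrative only) follows.

import re

def value_based(annotated_values):
    """Build a regex that matches any of the annotated values (longest first)."""
    alternation = "|".join(re.escape(v) for v in sorted(annotated_values, key=len, reverse=True))
    return re.compile(r"\b(" + alternation + r")\b")

make_expr = value_based({"Audi", "Ford"})
print(make_expr.search("£22k Ford C-max Titanium X").group(1))   # -> 'Ford'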
  18. 18. Evaluation.
Dataset: an enhanced version of the SWDE dataset (https://swde.codeplex.com): 10 domains, 100 websites, 78 attributes, ~100k pages, ~130k records.
Systems: wrapper-generation systems DIADEM, DEPTA, ViNTs, and RoadRunner; baseline wrapper induction/repair system: WEIR (Crescenzi et al., VLDB '13).
Implementation: WADaR (Wrapper and Data Repair), Java + SQL.
  19. 19. Evaluation: Highlights.
[Fig. 2 (Impact of repair): Precision, Recall, and FScore, Original vs. Repaired, per system and domain: ViNTs (RE), ViNTs (Auto), DIADEM (RE), DEPTA (RE), DIADEM (Auto), DEPTA (Auto), and RoadRunner on Auto, Book, Camera, Job, Movie, Nba, Restaurant, and University.]
… to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-Score ≈ 1 after repair). These values appear as highly structured attributes on web pages and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffix dr or door. In these cases, the wrapper induction under-segmented the text due to lack of sufficient examples.
Table III: Attribute-level evaluation.
System | Domain | Attribute | Original F1-Score | Repaired F1-Score
DIADEM | real estate | POSTCODE | 0.304 | 0.947
DIADEM | auto | DOOR | 0 | 0.984
Table IV shows Precision and Recall computed on the sample. In order to estimate the impact of the repair, we computed, for each attribute, the percentage of values that are different before and after the repair step; these numbers are shown in the last column of Table IV. Clearly, the repair is beneficial in all of the cases.
Table IV: Accuracy of large-scale evaluation.
Attribute | Precision | Recall | % Modified values
LOCALITY | 0.993 | 0.993 | 11.34%
OPENING HOURS | 1.00 | 0.461 | 17.14%
LOCATED WITHIN | 1.00 | 0.224 | 29.75%
PHONE | 0.987 | 0.849 | 50.74%
POSTCODE | 0.999 | 0.989 | 9.4%
STREET ADDRESS | 0.983 | 0.98 | 83.78%
WADaR increases F1-score between 15% and 60% (excluding ViNTs).
[Chart: WEIR vs. repair, Precision, Recall, and FScore per domain (Auto, Book, Camera, Job, Movie, Nba, Restaurant, University).]
WADaR is 23% more accurate than WEIR on average.
  20. 20. Evaluation: Robustness. We studied how the F1-score varies w.r.t. annotation noise.
The accuracy numbers are limited to those attributes where our approach induces regular expressions, since it is already clear that annotator errors directly reduce the accuracy of value-based expressions. This is still a significant number of attributes, i.e., 65% in all cases except for RoadRunner on book (35%) and RoadRunner on movie (46%).
[Fig. 8: Annotator recall drop, fixed induction threshold 75% (high dependence on annotation quality).] Figure 8 shows the impact of a drop in recall (x-axis) on F1-Score. As we can see, our approach is robust to a drop in recall until we reach 80% loss, then the performance rapidly decays. This is somewhat expected, since the regular expressions compensate for the missing recall up to the point where the max-flow sequences are no longer able to determine the underlying attribute structure reliably.
[Fig. 9: F1-Score variation with a threshold value of 0.1, i.e., fixed induction threshold 10% (low dependence on annotation quality).] Figure 9 shows the effect on F1-Score if we set a low regex-induction threshold (i.e., 0.1) instead. Clearly, in this case our approach is highly robust to annotator inaccuracy and we notice a loss in performance only after 80-90% loss in recall. In summary, a lower regex-induction threshold is advisable when we know that annotators have low recall. Even involving an annotator with very low accuracy, our approach is robust.
F1 starts being affected when the recall loss reaches ~80%. Precision loss does not affect WADaR until ~300% (random noise).
  21. 21. Evaluation: Scalability. Worst-case scenario: all tuples are annotated with all attribute types, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path on the network. This results in a network with n·k + 2 nodes and n·k + n edges. WADaR scales linearly w.r.t. the size of the relation and polynomially w.r.t. the number of attributes.
[Fig. 3 (Running time): the chart on the left plots the running time over an increasing number of records (with the number of attributes fixed); the chart on the right varies the number of attributes.]
Oracles decouple the problem of finding similar instances from the segmentation. Example (Ω = {ωMAKE, ωMODEL, ωPRICE}):
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
  22. 22. Open issues.
Learning oracles: building oracles is not difficult but still requires engineering time. The IBM SystemT people did some good work in this direction; we can start there.
Missing attributes: right now, if the wrapper fails to recover data, then we cannot repair it. It is possible to manipulate the wrapper to match more content.
Markov chains vs. max flows on wrapped relations: they seem to eventually compute the same sequences, but in a different order… proof? What I know is that max flows best approximate the maximum fitness at every step.
  23. 23. Questions?
References:
L. Chen, S. Ortona, G. Orsi, M. Benedikt. Aggregating Semantic Annotators. PVLDB '13.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. Joint Repairs for Web Wrappers. ICDE '16.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. WADaR: Joint Wrapper and Data Repair. VLDB '15 (Demo).
T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang. DIADEM: Thousands of websites to a single database. PVLDB '15.
T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton. Data Wrangling for Big Data. EDBT '16.

Poster: Joint Wrapper and Data Repair
Stefano Ortona (stefano.ortona@cs.ox.ac.uk, University of Oxford, UK), Giorgio Orsi (giorgio.orsi@cs.ox.ac.uk, University of Oxford, UK), Marcello Buoncristiano (marcello.buoncrisitano@yahoo.it, Università della Basilicata, Italy), Tim Furche (tim.furche@cs.ox.ac.uk, University of Oxford, UK). http://diadem.cs.ox.ac.uk/wadar

Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data. Wrapper: a structure { ⟨R,ƐR⟩, { ⟨A1,ƐA1⟩, …, ⟨Am,ƐAm⟩ } } specifying the objects to be extracted (listings, records, attributes) and the corresponding XPath expressions, e.g., ⟨RATING, //li[@class='second']/p⟩ and ⟨RUNTIME, //li[@class='third']/ul/li[1]⟩. Wrappers are often created algorithmically and in large numbers; tools capable of maintaining them over time are missing. Algorithmically-created wrappers generate data that is far from perfect: data can be badly segmented and misplaced.
[Poster figures: "Web Data Extraction" shows RoadRunner and DEPTA producing an Attribute_1/Attribute_2 relation (Schindler's List | Director: Steven Spielberg Rating: R Runtime: 195 min; Lawrence of Arabia (re-release) | Director: David Lean Rating: PG Runtime: 216 min; Le cercle Rouge (re-release) | Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min); "Joint Data and Wrapper Repair" turns it into the target relation Title | Director | Rating | Runtime (Schindler's List | Steven Spielberg | R | 195 min).]
Example repair expressions:
⟨TITLE, ⟨1⟩, string($)⟩
⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, tor:_), _Rat)⟩
⟨RATING, ⟨2⟩, substring-before(substring-after($, ing:_), _Run)⟩
⟨RUNTIME, ⟨2⟩, substring-after($, time:_)⟩
OBSERVATIONS. Templated websites: data is published following a template. Wrapper behaviour: wrappers rarely misplace and over-segment at the same time; wrappers make systematic errors. Oracles: oracles can be implemented as (ensembles of) NERs; NERs are not perfect, i.e., they make mistakes.
Maximal Repair is NP-complete. When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following: (1) compute all possible non-crossing k-partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (O(nk), a Narayana number); (2) discard tokens never accepted by the oracles in any of the partitions; (3) collapse identical partitions and choose the one with maximal fitness. Without misplacement and over-segmentation, a solution is found in polynomial time by computing non-crossing k-partitions. NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, decide non-crossing, and compute fitness in PTIME. [Figure: candidate partitions φ1-φ4 of "Director: Steven Spielberg Rating: R Runtime: 195 min" into Director | Rating | Runtime.]
Fitness. Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t.
Ω as: [the fitness formula given on slide 9].
Repair: specifies regular expressions that, when applied on the original relation, produce a new relation with higher fitness.
[Figure: examples of noisy or incorrect segmentations such as <Director: Steven>, <195 min>, <Director:><Steven Spielberg>, <Rating: R Runtime:195>, <Spielberg Rating: R Runtime:>, and the WADaR expression ⟨DIRECTOR, //li[@class='first']/div/span⟩.]
APPROXIMATING JOINT REPAIRS
1. Annotation. Each record is interpreted as a string (a concatenation of attributes), which NERs analyse to identify relevant attributes. Entity recognisers make mistakes; WADaR tolerates incorrect and missing annotations. [Figure: the Attribute_1/Attribute_2 relation (Schindler's List; Lawrence of Arabia (re-release); Le cercle Rouge (re-release); The life of Jack Tarantino (coming soon)) with Title, Director, Rating, and Runtime annotations over the text.]
2. Segmentation. Goal: understand the underlying structure of the relation. Two possible ways of encoding the problem: (1) a max-flow sequence in a flow network; (2) the most likely sequence in a memoryless Markov chain. The solutions often coincide. Markov chains: intuitive and faster to compute. Max flows: provably optimal. [Figure: a flow network and a Markov chain over START/SINK and the attributes TITLE, DIRECTOR, RATING, RUNTIME; both the max-flow and the most likely sequence are DIRECTOR, RATING, RUNTIME.]
3. Induction. Input: a set of clean annotations to be used as positive examples. WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length (e.g., SUFFIX = substring-before("_("), PREFIX = substring-after("tor:_"), SUFFIX = substring-before("_Rat"), PREFIX = substring(string-length()-7)). Induced expressions improve recall. When WADaR cannot induce regular expressions (not enough regularity), the data is repaired directly with the annotators; the wrappers are instead repaired with value-based expressions, i.e., a disjunction of the annotated values:
ATTRIBUTE = string-contains("value1"|"value2"|"value3")
Empirical Evaluation.
[Figure: Precision, Recall, and FScore, Original vs. Repaired, per system and domain, as on slide 19.]
5.1 Setting. Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE's data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset.
Table 1: Dataset characteristics.
Domain | Type | Sites | Pages | Records | Attributes
Real Estate | listing | 10 | 271 | 3,286 | 15
Auto | listing | 10 | 153 | 1,749 | 27
Auto | detail | 10 | 17,923 | 17,923 | 4
Book | detail | 10 | 20,000 | 20,000 | 5
Camera | detail | 10 | 5,258 | 5,258 | 3
Job | detail | 10 | 20,000 | 20,000 | 4
Movie | detail | 10 | 20,000 | 20,000 | 4
Nba Player | detail | 10 | 4,405 | 4,405 | 4
Restaurant | detail | 10 | 20,000 | 20,000 | 4
University | detail | 10 | 16,705 | 16,705 | 4
Total | - | 100 | 124,715 | 129,326 | 78
SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation and we therefore refined the ground truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title; the text node includes, other than the model, COLOR, PIXELS, and MANUFACTURER. The ground truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.
Wrapper-generation systems.
We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36], and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages (RoadRunner can be configured for listings but it performs better on detail pages). The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation since these are full-fledged data extraction systems, supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search-result listing and, as such, it does not have a concept of attribute; instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics. We used a naïve heuristic similarity based on relative position in the record and string-edit distance of the row's content. This is a very simple version of more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score computed at attribute level. Both the ground truth and the extracted values are normalised, and exact matching between the extracted values and the ground truth is required for a hit. For space reasons, in this paper we only present the most relevant results. The results of the full evaluation, together with the dataset, gold standard, extracted relations, and the code of the normaliser and of the scorer, are available at the online appendix [1]. All experiments are run on a desktop with an Intel quad-core i7 at 3.40GHz, 16 GB RAM, and Linux Mint OS 17.
5.2 Repair performance. Relation-level accuracy. The first two questions we want to answer are whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) correctly extracted values; (ii) under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content (indeed, websites often publish multiple attribute values within the same text node and the involved extraction systems are not able to split the values into multiple attributes); (iii) over-segmentations, i.e., when attribute values are split over multiple fields (as anticipated in Section 2, this rarely happens since an attribute value is often contained in a single text node; in this setting an attribute value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), but even in this case the splitting happens only when the system can identify a strong regularity within the text node); (iv) misplacements, i.e., values placed or labeled as the wrong attribute, mostly due to lack of semantic knowledge and confusion introduced by overlapping attribute domains; (v) missing values, due to lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or missing values from the domain knowledge (DIADEM). Note that the numbers do not add up to 100% since errors may fall into multiple categories.
Table 2: Wrapper generation system errors.
System | Correct (%) | Under-segmented (%) | Over-segmented (%) | Misplaced (%) | Missing (%)
DIADEM | 60.9 | 34.6 | 0 | 23.2 | 3.5
DEPTA | 49.7 | 44 | 0 | 25.3 | 6
ViNTs | 23.9 | 60.8 | 0 | 36.4 | 15.2
RoadRunner | 46.3 | 42.8 | 0 | 18.6 | 10.4
These numbers clearly show that there is a quality problem in wrapper-generated relations and also support the atomic-misplacement assumption.
Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics. Light (resp. dark)-colored bars denote the quality of the relation before (resp. after) the repair. A first conclusion that can be drawn is that a repair is always beneficial. Of 697 extracted attributes, 588 (84.4%) require some form of repair and the average pre-repair F1-Score produced by the systems is 50%. We are able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) the approach produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of attributes in more than 80% of the cases.
Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM but it reaches a better 70% F1-Score on restaurant; websites in this domain are in fact highly structured and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results. In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%. Performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player. The following are the remaining causes of errors: (i) missing values cannot be repaired, as we can only use the data available in …
[Chart: WEIR vs. repair, Precision, Recall, and FScore per domain.]
Evaluation: 100 websites, 10 domains, 4 wrapper-generation systems. Precision, Recall, and F1-Score computed before and after repair, considering exact matches. WADaR boosts F1-Score between 15% and 60%; performance is consistently close to or above 80%. WADaR against WEIR: WADaR is highly robust to errors of the NERs. WADaR scales linearly with the size of the input relation; optimal joint-repair approximations are computed in polynomial time.
Optimality: WADaR provably produces relations of maximum fitness, provided that the number of correctly annotated tuples is more than the maximum error rate of the annotators.
More questions? Come to the poster later!!!
