Your SlideShare is downloading. ×
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Wrapper Generation Supervised by a Noisy Crowd
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Wrapper Generation Supervised by a Noisy Crowd

324

Published on

We present solutions based on crowdsourcing platforms to support large-scale production of accurate wrappers around data-intensive websites. …

We present solutions based on crowdsourcing platforms to support large-scale production of accurate wrappers around data-intensive websites.
Our approach is based on supervised wrapper induction algorithms which demand the burden of generating the training data to the workers of a crowdsourcing platform. Workers are paid for answering simple membership queries chosen by the system. We present two algorithms: a single worker algorithm (ALF) and a multiple workers algorithm (ALFRED). Both the algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries.
ALFRED estimates the workers’ error rate to decide at runtime how many workers are needed. The experiments that we conducted on real and synthetic data are encouraging: our approach is able to produce accurate wrappers at a low cost, even in presence of workers with a significant error rate.

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
324
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Wrapper Generation Supervised by a Noisy Crowd Valter Crescenzi, Paolo Merialdo, Disheng Qiu Dipartimento di Ingegneria Università degli Studi Roma Tre Via della Vasca Navale, 79, Rome disheng@dia.uniroma3.it
  • 2. Extracting Data 2M pages from IMDB, and we want to extract ... titles, directors etc .... 2
  • 3. Extracting Data 2M pages from IMDB, and we want to extract ... titles, directors etc .... DB#Wrapper! 2
  • 4. Extracting Data 2M pages from IMDB, and we want to extract ... titles, directors etc .... Inference algorithm! DB#Wrapper! 2
  • 5. r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Single Page Other pages 3 Wrapper as XPath To generate wrappers: • From a single annotated page, it generates a pool of XPath • All XPath are correct solutions for the annotated page • Some of the rules do not work correctly in all the target pages
  • 6. page0 r1 r2 r3 Spirited Away Spirited Away Spirited Away r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Single Page Other pages 3 Wrapper as XPath To generate wrappers: • From a single annotated page, it generates a pool of XPath • All XPath are correct solutions for the annotated page • Some of the rules do not work correctly in all the target pages
  • 7. page0 page1 page2 .. r1 r2 r3 Spirited Away City of God Howl’s Moving Castle .. Spirited Away - 9.3 .. Spirited Away City of God null .. r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Single Page Other pages 3 Wrapper as XPath To generate wrappers: • From a single annotated page, it generates a pool of XPath • All XPath are correct solutions for the annotated page • Some of the rules do not work correctly in all the target pages
  • 8. page0 page1 page2 .. r1 r2 r3 Spirited Away City of God Howl’s Moving Castle .. Spirited Away - 9.3 .. Spirited Away City of God null .. r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Single Page Other pages 3 Wrapper as XPath To generate wrappers: • From a single annotated page, it generates a pool of XPath • All XPath are correct solutions for the annotated page • Some of the rules do not work correctly in all the target pages Which one is correct?
  • 9. Extracting Data Inference algorithm! DB#Wrapper! Scalability Accuracy Coverage Supervised Unsupervised Sup.+Annot. NO OK High OK NO High OK OK Low 4
  • 10. Crowdsourcing An opportunity to scale supervised approaches Inference algorithm! DB#Wrapper! 5
  • 11. Scaling Wrapper Inference Scaling out with crowdsourcing platforms opens new challenges: Issues: Contributions: Non-expert workers • Simple interactions • Membership Query (yes/no answer) • Redundant tasks and worker error rate estimation • Active Learning* • Dynamically engaging workers Costs Quality • Quality Model • Sampling algorithm* 6 *[Crescenzi WWW2013]
  • 12. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null Inference Algorithm r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Yes/No ! First annotation Sample Worker’s answers 7
  • 13. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null Inference Algorithm r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Yes/No ! First annotation Sample Worker’s answers 7 Quality Model: P(r1)
  • 14. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null Inference Algorithm • Rules compatible with the answer more likely to be correct For each new answer r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Yes/No ! First annotation Sample Worker’s answers 7 Quality Model: P(r1)
  • 15. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null Inference Algorithm • Rules compatible with the answer more likely to be correct For each new answer • If no rule is good enough: • a new query is selected (Active Learning)* r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Yes/No ! First annotation Sample Worker’s answers 7 *[Crescenzi WWW2013] Quality Model: P(r1)
  • 16. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null Inference Algorithm • Rules compatible with the answer more likely to be correct For each new answer • If no rule is good enough: • a new query is selected (Active Learning)* r1 = /html/table/tr[1]/td/text() r2 = //*[contains(.,”Ratings:”)]/../../tr[1]/td/text() r3 = //*[contains(.,”Director:”)]/../../tr[1]/td/text() .... Yes/No ! First annotation Sample Worker’s answers 7 *[Crescenzi WWW2013] Quality Model: P(r1)
  • 17. Termination Strategies 8 Quality Costs HALTᵣ Expected quality of the wrapper (probability of correctness) HALTMQ Number of used MQ Quality Costs HALTH Uncertainty of the questioned value (trade-off quality/costs) Different termination strategies:
  • 18. Multiple Workers Workers can make mistakes We engage multiple workers on the same task, but how many? ? 9
  • 19. Multiple Workers Workers can make mistakes We engage multiple workers on the same task, but how many? Too many workers Not enough workers Waste of money Quality loss ? 9
  • 20. Multiple Workers Workers can make mistakes We engage multiple workers on the same task, but how many? Too many workers Not enough workers Waste of money Quality loss We apply our quality model at runtime to: • Estimate the workers’ error rates • Select the right number of redundant tasks ? 9
  • 21. Dynamically Engaging Workers Workers answers Most Likely Rule Is it good enough?• Starts with minimal amount of redundancy • Collects workers’ answers • Estimates rule quality and workers’ error rate. Use • workers’ error rate to estimate rule quality • rule quality to estimate workers’ error rate • If no rule is good enough a new worker is engaged Error rate estimation 10 Algorithm main steps:
  • 22. Dynamically Engaging Workers Workers answers Most Likely Rule Is it good enough?• Starts with minimal amount of redundancy • Collects workers’ answers • Estimates rule quality and workers’ error rate. Use • workers’ error rate to estimate rule quality • rule quality to estimate workers’ error rate • If no rule is good enough a new worker is engaged Error rate estimation + 10 Algorithm main steps:
  • 23. Answers “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 0.1 0.1 0.1 NoYes No Yes No No • Two real workers are engaged • A new sequence is defined considering the union of all the answers 11 η = expected error rate Dynamically Engaging Workers
  • 24. Answers “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 0.1 0.1 0.1 NoYes No Yes No No • Two real workers are engaged • A new sequence is defined considering the union of all the answers 11 η = expected error rate Dynamically Engaging Workers
  • 25. • The most likely rule and its values are returned • The most likely rule and its probability is adopted to estimate the η page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null es: P(r1) = 0.9 12 η = expected error rate Answers “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.1 0.1 0.1 NoYes No Dynamically Engaging Workers
  • 26. • The most likely rule and its values are returned • The most likely rule and its probability is adopted to estimate the η page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null es: P(r1) = 0.9 12 η = expected error rate Answers “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.37 0.37 0.37 NoYes No Dynamically Engaging Workers
  • 27. • The most likely rule and its values are returned • The most likely rule and its probability is adopted to estimate the η page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null es: P(r1) = 0.9 P(r1) = 0.93 12 η = expected error rate Answers “Spirited Away” “-” “9.3” η 0.1 0.1 0.1 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.37 0.37 0.37 NoYes No Dynamically Engaging Workers
  • 28. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null P(r1) = 0.95 • When the computation converges, the system checks the termination condition • If it is not met, a new worker is considered and the computation starts again 13 η = expected error rate Answers “Spirited Away” “-” “9.3” η 0.05 0.05 0.05 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.35 0.35 0.35 NoYes No Dynamically Engaging Workers
  • 29. page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null P(r1) = 0.95 P(r1) = 0.95 • When the computation converges, the system checks the termination condition • If it is not met, a new worker is considered and the computation starts again 13 η = expected error rate Answers “Spirited Away” “-” “9.3” η 0.05 0.05 0.05 NoYes No Answers “Spirited Away” “City of God” “9.3” η 0.35 0.35 0.35 NoYes No Dynamically Engaging Workers
  • 30. Experiments - Dataset Site Entity |Pages| www.imdb.com Actor 500k www.imdb.com Movies 500k www.allmusic.com Band 500k www.allmusic.com Albums 500k www.nasdaq.com Stock Quotes 7k 40 attributes manually crafted golden rules Measures: • Costs #MQ • Quality Precision, Recall and F-measure 14
  • 31. Simulating Real Workers 0% 10% 20% 30% 40% 0.00 0.10 0.20 0.30 0.40 0.50 error ratee x 100 Real (and noisy) AMT workers Real workers: 1/3 perfect Average η* = 10% ση* = 11% We simulated the error rate distribution with an exponential function 15
  • 32. η* > η (optimistic) η* = η (correct) η* < η (pessimistic) MQ ~10 ~10 ~30 F ~0.65 ~1 ~1 Wrong Estimation Noisy single worker: - η expected error rate - η* observed error rate 16
  • 33. η* > η (optimistic) η* = η (correct) η* < η (pessimistic) MQ ~10 ~10 ~30 F ~0.65 ~1 ~1 Wrong Estimation Noisy single worker: - η expected error rate - η* observed error rate 16 η close to η*: (good estimation) - few MQ - good F
  • 34. η* > η (optimistic) η* = η (correct) η* < η (pessimistic) MQ ~10 ~10 ~30 F ~0.65 ~1 ~1 Wrong Estimation Noisy single worker: - η expected error rate - η* observed error rate 16 η close to η*: (good estimation) - few MQ - good F η* > η: (too optimistic) - too few MQ - low F
  • 35. η* > η (optimistic) η* = η (correct) η* < η (pessimistic) MQ ~10 ~10 ~30 F ~0.65 ~1 ~1 Wrong Estimation Noisy single worker: - η expected error rate - η* observed error rate 16 η close to η*: (good estimation) - few MQ - good F η* > η: (too optimistic) - too few MQ - low F η > η*: (too pessimistic) - too many MQ - same F
  • 36. η* > η (optimistic) η* = η (correct) η* < η (pessimistic) MQ ~10 ~10 ~30 F ~0.65 ~1 ~1 Wrong Estimation Noisy single worker: - η expected error rate - η* observed error rate 16 η close to η*: (good estimation) - few MQ - good F η* > η: (too optimistic) - too few MQ - low F η > η*: (too pessimistic) - too many MQ - same F Need to estimate the workers’ error rate
  • 37. Dynamically Engaging Workers Algorithm F σF #MQ max-MQ max-|W| |η-η*| ALFη one worker 0.92 17% 7.58 11 1 - ALFREDno 1 1% 18.6 83 9 - ALFRED 1 1% 16.1 44 4 0.8% ALFRED* 1 1% 16.07 40 4 0% Synthetic (and noisy) workers |W| = # workers 17
  • 38. Dynamically Engaging Workers Algorithm F σF #MQ max-MQ max-|W| |η-η*| ALFη one worker 0.92 17% 7.58 11 1 - ALFREDno 1 1% 18.6 83 9 - ALFRED 1 1% 16.1 44 4 0.8% ALFRED* 1 1% 16.07 40 4 0% Synthetic (and noisy) workers |W| = # workers 17 lower quality, less MQ
  • 39. Dynamically Engaging Workers Algorithm F σF #MQ max-MQ max-|W| |η-η*| ALFη one worker 0.92 17% 7.58 11 1 - ALFREDno 1 1% 18.6 83 9 - ALFRED 1 1% 16.1 44 4 0.8% ALFRED* 1 1% 16.07 40 4 0% Synthetic (and noisy) workers |W| = # workers 17 lower quality, less MQ Almost perfect wrapper
  • 40. Dynamically Engaging Workers Algorithm F σF #MQ max-MQ max-|W| |η-η*| ALFη one worker 0.92 17% 7.58 11 1 - ALFREDno 1 1% 18.6 83 9 - ALFRED 1 1% 16.1 44 4 0.8% ALFRED* 1 1% 16.07 40 4 0% Synthetic (and noisy) workers |W| = # workers 17 lower quality, less MQ correct estimation required Almost perfect wrapper
  • 41. Dynamically Engaging Workers Algorithm F σF #MQ max-MQ max-|W| |η-η*| ALFη one worker 0.92 17% 7.58 11 1 - ALFREDno 1 1% 18.6 83 9 - ALFRED 1 1% 16.1 44 4 0.8% ALFRED* 1 1% 16.07 40 4 0% Synthetic (and noisy) workers |W| = # workers 17 lower quality, less MQ correct estimation required accurate estimation, but achieved only at the end Almost perfect wrapper
  • 42. Dynamically Engaging Workers Algorithm F σF #MQ max-MQ max-|W| |η-η*| ALFη one worker 0.92 17% 7.58 11 1 - ALFREDno 1 1% 18.6 83 9 - ALFRED 1 1% 16.1 44 4 0.8% ALFRED* 1 1% 16.07 40 4 0% Synthetic (and noisy) workers |W| = # workers 17 lower quality, less MQ correct estimation required accurate estimation, but achieved only at the end Almost perfect wrapper 2 3 4 0% 25% 50% 75% 100% 2% 6% 92% % |W| |W| |W|
  • 43. Background in solid machine learning and computational learning theories* Conclusions 18 We proposed a framework for wrapper generation: • simple tasks can be completed by non expert workers • cost effective wrapper generation • highly predictable quality of the output wrapper The proposed framework can be applied to other learning tasks: • Crawling • NLP *[Angluin-Laird1988, Angluin2001]
  • 44. Thank you for the attention !! 19
  • 45. Future development Learning framework applied to problems (NLP, Entity Linkage) ALFRED adopted to learn structure-driven crawling algorithm Hybrid approaches human annotations and automatic annotations Alternative models of truth/error rate Optimizing the initial number of workers 20
  • 46. Wrong Estimation Noisy single worker: - η = 0.1 - η* = from 0.05 to 0.4 21 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4F HALTr HALTH HALTMQ 4 6 8 10 12 14 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 MQ HALTr HALTH
  • 47. Wrong Estimation Noisy single worker: - η = from 0 to 0.4 - η* = 0.1 22 0.5 0.6 0.7 0.8 0.9 1 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 F HALTr HALTH HALTMQ 3 10 100 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 MQ HALTr HALTH
  • 48. Sampling & Quality page0 r1 r2 r3 Spirited Away Spirited Away Spirited Away r1 = r2 = r3 23
  • 49. Sampling & Quality page0 r1 r2 r3 Spirited Away Spirited Away Spirited Away r1 = r2 = r3 page0 page1 r1 r2 r3 Spirited Away City of God Spirited Away - Spirited Away City of God r1 = r3 ≠ r2 23
  • 50. Sampling & Quality page0 r1 r2 r3 Spirited Away Spirited Away Spirited Away r1 = r2 = r3 page0 page1 r1 r2 r3 Spirited Away City of God Spirited Away - Spirited Away City of God r1 = r3 ≠ r2 page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null r1 ≠ r3 ≠ r2 23
  • 51. Sampling & Quality page0 r1 r2 r3 Spirited Away Spirited Away Spirited Away r1 = r2 = r3 page0 page1 r1 r2 r3 Spirited Away City of God Spirited Away - Spirited Away City of God r1 = r3 ≠ r2 page0 page1 page2 r1 r2 r3 Spirited Away City of God Howl’s Moving Castle Spirited Away - 9.3 Spirited Away City of God null r1 ≠ r3 ≠ r2 Pages make apparent the differences among the rules Find a small set that makes apparent the same differences observed in the whole set of pages 23
  • 52. Sampling & Quality The problem. Find the smallest set that makes apparent the differences among the rules: (e.g., 100 pages that make apparent the same differences that we would observe in 2M pages). It is a NP-Hard problem !! Reduction to SET-Cover problem: Find the smallest set of pages that cover all the group of rules (group = equivalent rules). The smallest set is not needed: A greedy algorithm O(|Pages|) in time and O(1) in space works very well in practice. 24
  • 53. XPath rules For every page p: if (p makes apparent new differences) representative pages += p An offline algorithm that can be easily parallelized Sampling & Quality 25
  • 54. Sampling Entity Sampling |Pages| P R Movies Biased 250 0.98 0.71 Movies Random 250 0.99 0.99Movies Representative 42 1.00 1.00 Actors Biased 250 1.00 1.00 Actors Random 250 1.00 0.96Actors Representative 30 1.00 1.00 Stocks Biased 86 1.00 0.98 Stocks Random 86 1.00 0.99Stocks Representative 15 1.00 1.00 Albums Biased 258 1.00 0.99 Albums Random 258 1.00 1.00Albums Representative 59 1.00 1.00 Bands Biased 289 1.00 0.68 Bands Random 289 1.00 1.00Bands Representative 36 1.00 1.00 26
  • 55. Sampling Entity Sampling |Pages| P R Movies Biased 250 0.98 0.71 Movies Random 250 0.99 0.99Movies Representative 42 1.00 1.00 Actors Biased 250 1.00 1.00 Actors Random 250 1.00 0.96Actors Representative 30 1.00 1.00 Stocks Biased 86 1.00 0.98 Stocks Random 86 1.00 0.99Stocks Representative 15 1.00 1.00 Albums Biased 258 1.00 0.99 Albums Random 258 1.00 1.00Albums Representative 59 1.00 1.00 Bands Biased 289 1.00 0.68 Bands Random 289 1.00 1.00Bands Representative 36 1.00 1.00 Representative perfect 26
  • 56. Sampling Entity Sampling |Pages| P R Movies Biased 250 0.98 0.71 Movies Random 250 0.99 0.99Movies Representative 42 1.00 1.00 Actors Biased 250 1.00 1.00 Actors Random 250 1.00 0.96Actors Representative 30 1.00 1.00 Stocks Biased 86 1.00 0.98 Stocks Random 86 1.00 0.99Stocks Representative 15 1.00 1.00 Albums Biased 258 1.00 0.99 Albums Random 258 1.00 1.00Albums Representative 59 1.00 1.00 Bands Biased 289 1.00 0.68 Bands Random 289 1.00 1.00Bands Representative 36 1.00 1.00 Biased: recall loss 26
  • 57. Sampling Entity Sampling |Pages| P R Movies Biased 250 0.98 0.71 Movies Random 250 0.99 0.99Movies Representative 42 1.00 1.00 Actors Biased 250 1.00 1.00 Actors Random 250 1.00 0.96Actors Representative 30 1.00 1.00 Stocks Biased 86 1.00 0.98 Stocks Random 86 1.00 0.99Stocks Representative 15 1.00 1.00 Albums Biased 258 1.00 0.99 Albums Random 258 1.00 1.00Albums Representative 59 1.00 1.00 Bands Biased 289 1.00 0.68 Bands Random 289 1.00 1.00Bands Representative 36 1.00 1.00 Random: better than biased but not perfect 26
  • 58. 27 Related Wrapper Generation Automatic Wrappers for Large Scale Web Extraction Nilesh Dalvi et. al VLDB2011 DIADEM T. Furche G. Gottlob ... etc WWW2012 Web Data Extraction Based on Partial Tree Alignment Yanhong Zhai WWW2005 Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina SIGMOD 2003 RoadRunner Crescenzi VLDB2001 Wrapper Induction for information extraction Kushmerick IJCAI97 Active Learning with Multiple Views Ion Muslea JAIR2006 Interactive Wrapper Generation with Minimal User Effort Utku Irmak WWW2006

×