1. DIADEM: Domain-centric, Intelligent, Automated Data Extraction. Tim Furche, Georg Gottlob, Giorgio Orsi. May 11th, 2011, Oxford University Computing Laboratory. Joint work with Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Thomas Lukasiewicz, Christian Schallhart, Andrew Sellers, Gerardo Simari, Cheng Wang
4. Section 1: Web Data Extraction. Data on the Web: there is more of it than we can use. The challenge is no longer availability, but finding, integrating, analysing, …
5. Section 1: Web Data Extraction. Surface vs. Deep Web: the deep web is estimated at 500× the size of the surface web, in an estimated 400,000 deep web databases. What is in it? Products (stores), directories (yellow pages), catalogs (libraries), public DBs (publications, census, data.gov, …), public services (weather, location, …)
12. Section 1: Web Data Extraction. The Web is more than HTML.
13. Section 1: Web Data Extraction. Overview: Introducing Web Data Extraction; Scenarios; Why now?; Supervised Web Data Extraction; Unsupervised Web Data Extraction; DIADEM: OPAL, AMBER, OXPath, IVLIA, Datalog±
15. Section 1: Web Data Extraction. The Need for Web Data Extraction. Information drives business (decision making, trend analysis, …) and is available in troves on the internet, but, as HTML, it is made for humans, not available as structured data. Companies need: product specifications, pricing information, market trends, regulatory information.
18. Section 1: Web Data Extraction. Scenario ➀: Electronics retailer. An electronics retailer wants online market intelligence: a comprehensive overview of the market, with daily information on price, shipping costs, trends, and product mix, by product, geographical region, or competitor, across thousands of products and hundreds of competitors. Nowadays this is done by specialised companies, mostly manually and with interpolation, at large cost.
19. Section 1: Web Data Extraction. Scenario ➁: Supermarket chain. A supermarket chain monitors competitors’ product prices, special offers and promotions (time sensitive), and new products, product formats & packaging.
20. Section 1: Web Data Extraction. Scenario ➂: Hotel agency. An online travel agency with a best price guarantee needs the prices of competing agencies and the average market price.
21. Section 1: Web Data Extraction. Scenario ➃: Hedge fund. The house price index is published at regular intervals by the national statistics agency and affects share values of various industries; a hedge fund uses online market intelligence to predict the house price index.
22. Section 1: Web Data Extraction. And a lot more: monitoring blogs and forums for market intelligence (e.g., complaints, common problems, customer opinions); ranking and analysing product reviews; financial analysts monitoring trends and statistics for products of a certain company or category; interest rates from financial institutions; press releases and financial reports; patent search & analysis; …
33. Section 1: Web Data Extraction. Why Web Data Extraction Now? Trends:
Trend ➊: scale (every business is online): automation at scale
Trend ➋: web applications rather than web documents: automated form filling (deep web navigation)
Trend ➌: structured, common-sense data available: allows more sophisticated automated analysis; also a tool for improved data extraction?
35. Approaches to wrapper construction:
manual (e.g., Web Harvest): the user writes the wrapper, sometimes using wrapping libraries
supervised (e.g., Lixto): the user provides examples and refines the wrapper
semi-supervised: the user provides examples (per site); the wrapper is automatically learned
unsupervised (e.g., DIADEM): entirely automated; some systems omit examples and run analysis directly on all pages, some automatically guess examples
36. Section 2: Supervised Web Data Extraction. Supervised Web Data Extraction: rather than manually writing a wrapper in a programming language, the user records interaction sequences (such as form fillings) and visually selects examples for the data. Currently the gold standard for high-accuracy extraction. Examples: Lixto, Automation Anywhere, Web Harvest, …
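A recorded interaction sequence of this kind can be represented as plain data and replayed later. The sketch below is a toy illustration: the Action type, the selectors, and the stand-in browser are all invented here; real supervised tools such as Lixto drive an embedded browser rather than a log.

```python
# Toy sketch: a recorded interaction sequence stored as data, plus a
# replay loop. The Action type and the fake browser are invented for
# illustration; real tools drive an embedded browser via its DOM API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "fill" | "click"
    selector: str  # how to find the target element
    value: str = ""

recorded = [
    Action("fill", "input[name=q]", "oxford"),
    Action("click", "button[type=submit]"),
]

def replay(actions, browser):
    """Re-execute a recorded sequence against a browser-like object."""
    for a in actions:
        if a.kind == "fill":
            browser.fill(a.selector, a.value)
        elif a.kind == "click":
            browser.click(a.selector)

class LogBrowser:
    """Stand-in browser that just logs what it would do."""
    def __init__(self):
        self.log = []
    def fill(self, sel, val):
        self.log.append(f"fill {sel} <- {val}")
    def click(self, sel):
        self.log.append(f"click {sel}")

b = LogBrowser()
replay(recorded, b)
print(b.log)  # ['fill input[name=q] <- oxford', 'click button[type=submit]']
```

Storing the sequence as data (rather than code) is what lets such tools re-run, edit, and generalise a recording.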
40. Section 2: Supervised Web Data Extraction. Lixto: Extraction & Analysis. Lixto is a sophisticated, visual, semi-automated extraction tool: the user selects examples visually, patterns are derived automatically and then verified; highly scalable extraction and processing with the Lixto server. But it is also a data integration & business analytics suite: data cleaning; data flow scenarios (merge & filter from different web sites); market intelligence & analytics.
47. Section 3: Unsupervised Web Data Extraction. … and we really need it! Search engine providers (Google, Microsoft, Yahoo!) all work on information and data extraction for “vertical”, “object”, and “semantic” search: turning search engines into knowledge bases for decision support.
48. “No one really has done this successfully at scale yet.” (Raghu Ramakrishnan, Yahoo!, March 2009) “Current technologies are not good enough yet to provide what search engines really need. [...] Any successful approach would probably need a combination of knowledge and learning.” (Alon Halevy, Google, Feb. 2009)
49. Section 3: Unsupervised Web Data Extraction. Unsupervised: The Story So Far. Key observation: “database” web sites are generated from templates, so wrapper generators need to identify these templates automatically. Two major approaches:
machine learning from a few hand-labelled examples: similar to semi-supervised, but with only one set of examples for an entire domain; high precision only for simple domains (single entity type, few attributes)
fully automatic exploitation of the repeated structure of result pages: good precision, but needs a lot of data (many records per page, many pages) and doesn’t work for forms (no repetition)
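The repeated-structure idea can be sketched in a few lines: find the element whose children all share the same tag shape and treat those children as data records. This toy version (all names ours) compares only shallow tag signatures; real template-detection systems align trees far more carefully.

```python
# Minimal sketch of the "repeated structure" idea behind unsupervised
# wrapper induction: the element with the most identically shaped
# children is taken as the record container. Toy code, names ours.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def shape(self, depth=2):
        """Tag signature of the subtree, truncated at `depth`."""
        if depth == 0 or not self.children:
            return self.tag
        return (self.tag, tuple(c.shape(depth - 1) for c in self.children))

class TreeBuilder(HTMLParser):
    """Build a bare tag tree; text and attributes are ignored."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
    def handle_endtag(self, tag):
        if self.cur.parent:
            self.cur = self.cur.parent

def record_container(root):
    """Return (node, count) with the most children sharing one shape."""
    best, best_count = None, 1
    stack = [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        shapes = [c.shape() for c in n.children]
        for s in set(shapes):
            if shapes.count(s) > best_count:
                best, best_count = n, shapes.count(s)
    return best, best_count

html = """<html><body><h1>Results</h1><ul>
<li><b>Flat A</b><span>£900</span></li>
<li><b>Flat B</b><span>£850</span></li>
<li><b>Flat C</b><span>£700</span></li>
</ul></body></html>"""
p = TreeBuilder()
p.feed(html)
container, n = record_container(p.root)
print(container.tag, n)  # the <ul> holding three identically shaped <li>s
```

This also makes the slide's caveat concrete: with one record per page, or with a form instead of a result list, there is no repetition for the heuristic to find.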
59. Section 4: DIADEM. DIADEM: Overview. DIADEM combines:
a host of domain-specific annotators, which give us a first “guess” to automatically generate examples
a high-level ontology about domain entities and their phenomenology on web sites of the domain, which allows us to verify & refine these examples
advances in existing techniques for repeated-structure analysis and page & block classification
bottom-up understanding & top-down reasoning
66. Section 4: DIADEM. Achievements in Numbers. 15k–150k facts (5–50 MB) generated per web page; time: usually 30–60 sec, at most a few minutes; 300–400 predicates. Some numbers on the prototype: Java: 293 files with 44,993 lines of code; DLV rules: over 500 rules, over 200 predicates; gazetteers: 111 gazetteers with 48,000 entries; JAPE rules: 23 rule files with 30 rules.
74. Section 4: DIADEM » OPAL. OPAL: Overview. Three-step process: (1) browser extraction and annotation, (2) labelling & segmentation, (3) classification (phenomenological mapping). Model-based, knowledge-driven: the latter two steps are model transformations; a thin layer of domain-dependent concepts: field types and labels, triggers for field & form creation.
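The annotation step can be caricatured as gazetteer matching over the text near each form field. The gazetteers, field types, and labels below are invented for illustration; OPAL's real annotators and domain ontology are far richer than a word lookup.

```python
# Toy sketch of form-field labelling: match a field's label text
# against small domain gazetteers to guess its field type. The
# gazetteers and field types here are invented for illustration.
GAZETTEERS = {
    "location": {"town", "city", "postcode", "area", "location"},
    "price":    {"price", "min price", "max price", "budget"},
    "bedrooms": {"bedrooms", "beds", "bedroom"},
}

def classify_field(label: str) -> str:
    """Map a form-field label to a domain field type, or 'unknown'."""
    normalised = label.lower().replace(":", " ")
    words = normalised.split()
    for field_type, terms in GAZETTEERS.items():
        # match either a single word or the whole (multi-word) label
        if any(w in terms for w in words) or normalised.strip() in terms:
            return field_type
    return "unknown"

for lbl in ["Town or postcode:", "Max Price", "No. of beds", "Keywords"]:
    print(lbl, "->", classify_field(lbl))
```

An "unknown" result is where the model-based part earns its keep: the domain model can still type a field from its position in the form, its options, or the fields around it.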
83. Section 4: DIADEM » AMBER. AMBER: Overview. Three-step process like OPAL: (1) browser extraction and annotation, (2) classification (phenomenological mapping), (3) record segmentation (much harder than in OPAL). Model-based, knowledge-driven: the latter two steps are model transformations; a thin layer of domain-dependent concepts: record and attribute types, triggers for record & attribute creation.
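Record segmentation can be sketched as splitting a stream of annotated nodes whenever a designated separator attribute reappears. The separator choice and the sample data below are invented; AMBER additionally has to choose the separator and verify candidate records against the domain model, which is the hard part.

```python
# Toy sketch of record segmentation: group (attribute, value)
# annotations into records, starting a new record whenever the
# separator attribute reappears. Data and separator are invented.
def segment(annotations, separator="price"):
    """Group (attribute, value) pairs into records, splitting at `separator`."""
    records, current = [], {}
    for attr, value in annotations:
        if attr == separator and current:
            records.append(current)
            current = {}
        current[attr] = value
    if current:
        records.append(current)
    return records

stream = [("price", "£900"), ("location", "Jericho"), ("bedrooms", "2"),
          ("price", "£700"), ("location", "Cowley")]
print(segment(stream))
# [{'price': '£900', 'location': 'Jericho', 'bedrooms': '2'},
#  {'price': '£700', 'location': 'Cowley'}]
```

Note that the second record lacks a bedrooms attribute; tolerating such optional attributes while still rejecting wrong splits is exactly why segmentation needs the domain model.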
94. Section 4: DIADEM » OXPath. How to find a flat with OXPath:
Start at rightmove.co.uk: doc("rightmove.co.uk")
Fill "oxford" into the first visible field: /descendant::field()[1]/{"oxford"}
Click on the next button (the second following field): /following::field()[2]/{click /}
On the refinement form, just continue by clicking on the last field: /descendant::field()[last()]/{click /}
Grab all the prices: //p.price
Putting the steps together: doc("rightmove.co.uk")/descendant::field()[1]/{"oxford"}/following::field()[2]/{click /}/descendant::field()[last()]/{click /}//p.price
95. Section 4: DIADEM » OXPath. State of Web Extraction: no interaction with rich, scripted interfaces, i.e., no actions other than form filling and submission.
➀ Imperative extraction scripts: explicit variable assignments, flow control, etc.; either a proprietary selection language or a mix of XPath & external flow control.
➁ Focus on automation and visual interfaces: no or very limited extraction language, only ad-hoc extraction; no multiway navigation, no optimization.
96. Section 4: DIADEM » OXPath. Why OXPath? There is no “XPath for data extraction”: we want scalability, familiarity, simplicity, and support for web applications.
117. Section 4: DIADEM » Datalog±. Our goal: a unifying framework for DB technology + constraints, Datalog, and DLs (DL-Lite, EL, F-logic Lite), while keeping query answering tractable in data complexity!
118. Section 4: DIADEM » Datalog±. Extend Datalog by allowing in the head:
existential (∃) variables, giving tuple-generating dependencies (TGDs), e.g. employee(X), inProject(X,Y) → ∃Z employee(Z), supervises(Z,X)
equality (=), giving equality-generating dependencies (EGDs), e.g. reports(X,Y), reports(X,Z) → Y = Z
the constant false (⊥), giving negative constraints (NCs), e.g. employee(X), customer(X) → ⊥
What we get is Datalog[∃,=,⊥], i.e., Datalog+; suitably restricted to keep reasoning decidable, Datalog±.
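One chase step for the TGD above can be sketched as follows: whenever the body matches and the head is not yet satisfied, a fresh labelled null is invented for the existential variable and the head atoms are added. This toy code (representation and names ours) only shows the mechanics, not Datalog± syntax or a full chase.

```python
# Toy chase step for the TGD
#   employee(X), inProject(X,Y) -> ∃Z employee(Z), supervises(Z,X).
# Facts are tuples ("predicate", arg, ...); z1, z2, ... are labelled nulls.
from itertools import count

fresh = (f"z{i}" for i in count(1))  # generator of labelled nulls

def chase_step(facts):
    """Apply the TGD once to every body match it is not yet satisfied for."""
    new = set()
    employees = {a[0] for p, *a in facts if p == "employee"}
    projects = {tuple(a) for p, *a in facts if p == "inProject"}
    for x in employees:
        for (x2, y) in projects:
            if x2 != x:
                continue
            # head already satisfied if some supervises(_, x) exists
            if any(p == "supervises" and a[1] == x for p, *a in facts):
                continue
            z = next(fresh)  # invent a witness for ∃Z
            new.add(("employee", z))
            new.add(("supervises", z, x))
    return facts | new

facts = {("employee", "ann"), ("inProject", "ann", "p1")}
facts = chase_step(facts)
print(sorted(facts))
```

The invented supervisor z1 has no inProject fact, so a second chase step adds nothing here; in general, termination of the chase is exactly what the Datalog± fragment restrictions are designed to control.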
119. Section 4: DIADEM » Datalog±. Datalog±: Overview. Fragments by data complexity of query answering: FO-rewritable: Linear, DL-Lite, Sticky-join; PTIME: Guarded, EL.
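FO-rewritability means a query over the ontology compiles into a plain first-order (hence SQL) query over the stored facts, evaluated with no reasoning at query time. A toy illustration for a single linear TGD, with invented predicates:

```python
# Toy illustration of FO-rewritability for one linear TGD,
#   manager(X) -> employee(X):
# the query q(X) :- employee(X) is rewritten into the union
#   q(X) :- employee(X)  ∪  q(X) :- manager(X),
# which is answered directly over the stored facts. Real rewriting
# algorithms handle arbitrary CQs and TGD sets; names here are invented.
facts = {("employee", "bo"), ("manager", "ada")}

def answer_rewritten(facts):
    """Evaluate the rewritten union of queries over the plain facts."""
    rewriting = ["employee", "manager"]  # the two disjuncts of the UCQ
    return {a for p, a in facts if p in rewriting}

print(sorted(answer_rewritten(facts)))  # ['ada', 'bo']
```

In an RDBMS the rewriting would simply be a SQL UNION, which is why FO-rewritable fragments inherit the data complexity of relational query evaluation.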
120. Section 4: DIADEM » Datalog±. Datalog±: In Practice (Experiments). Prototype implementation: Nyaya (http://mais.dia.uniroma3.it/Nyaya/Home.html): implements guarded, weakly-acyclic, linear, and sticky Datalog±; couples a Datalog± engine with an efficient storage mechanism. Comparison with existing semantic data management solutions: IBM IODT [Ma et al., SIGMOD ‘08], Ontotext BigOWLIM [Kiryakov, WWW ‘06], Requiem [Horrocks et al., ISWC ‘09].
121. Section 4: DIADEM » Datalog±. Datalog±: In Practice (Experiments). Paper: Semantic Data Markets: Store, Reason and Query by R. De Virgilio, G. Orsi, L. Tanca and R. Torlone (submitted). Findings: commercial systems do not identify FO-rewritable fragments, although they could then answer queries much faster than they do now; testing the FO-rewritability conditions is easy.
122. Section 4: DIADEM » Datalog±. Datalog±: Updates. If the language of Σ is FO-rewritable: fact updates reduce to updates in an RDBMS; predicate updates reduce to recomputing the rewriting.