3. Web data extraction. Goal: make web content accessible to electronic data processing. A WRAPPER sits between the Web (HTML pages, layout) and corporate EDP applications (structured data, databases, XML). Wrappers: HTML → select, extract, annotate → XML
4.
5. MSO Monadic Datalog Elog Lixto Visual Wrapper = Suite Logic Database theory DB programming Application design
6. Lixto Visual Developer (VD) Navigation Steps Mozilla Web Browser Extraction Configuration
9. Need for Automatic Extraction Technology. Example: Real Estate UK: 17,000 sites, many not covered by aggregators. We do have a list of all homepages (Yellow Pages UK). Manual or semi-automatic wrapping is too expensive: wrapper construction, testing, keeping track of changes. No tool or method can do it fully automatically. Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
10. Need for Automatic Extraction Technology (2). All search engine providers need it! Many work on it. Keywords: vertical search, object search, semantic search. Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet.” Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”
11. The Blackbox we want to construct. Application domain with thousands of websites: URL → BLACKBOX → application-relevant structured data (XML or RDF). To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
12. Relationship to SeCo & Webdam. Real estate, Restaurants. Q: Find apartments in Milan whose prices are average in quarters where restaurant quality > average. Results Web service A Web service
16. Bottom-up (low-level) annotation Monochromatic Rectangle Geographic search facility Postcode input field Active map … ISA ISA Occurs in Price search facility … … Occurs in … 105 105 127 [(02873,227) (03900,417)] Geo-Price-Searchbox ISA [(02873,227) (03900,417)]
17. Top-down reasoning Property Search Facility Property List Single Property Description Specially highlighted property part-of m 1
18. Bottom-up processing Top-down reasoning Monochromatic Rectangle Geographic search facility Postcode input field Active map … ISA ISA Occurs in Price search facility … … Occurs in … 105 105 127 [(02873,227) (03900,417)] Property Search Facility Property List Single Property Description Geo-Price-Searchbox ISA [(02873,227) (03900,417)] Specially highlighted property Phenomenology part-of m 1
19. Datalog for Web-Object Reasoning
table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
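The three rules on this slide can be evaluated directly over a set of facts. The following is a minimal Python sketch, not the authors' implementation: it reads the third rule with the negation stated in the slide's English gloss ("not simultaneously contained in a smaller table"), and the table names and facts are hypothetical.

```python
# Hypothetical facts about one page: three nested tables and two input fields.
tables = {"t_outer", "t_mid", "t_inner"}

# occurs_in(T, F): field F occurs somewhere inside table T.
occurs_in = {
    ("t_outer", "areaselection"), ("t_outer", "priceselection"),
    ("t_mid", "areaselection"), ("t_mid", "priceselection"),
    ("t_inner", "areaselection"),
}

# child(Parent, T): table T is nested directly inside Parent.
child = {("t_outer", "t_mid"), ("t_mid", "t_inner")}


def property_search_masks(tables, occurs_in, child):
    """Evaluate the three rules bottom-up, one stratum at a time."""
    # Rule 1: a table holding both input fields is a good table.
    good = {
        t for t in tables
        if (t, "areaselection") in occurs_in
        and (t, "priceselection") in occurs_in
    }
    # Rule 2: a parent of a good table contains a good table.
    contains_good = {parent for (parent, t) in child if t in good}
    # Rule 3 (with negation, as in the English gloss): a good table
    # that contains no good table is the property search mask.
    return {t for t in good if t not in contains_good}


masks = property_search_masks(tables, occurs_in, child)
print(masks)  # {'t_mid'}
```

Here `t_outer` and `t_mid` are both good tables, but only `t_mid` is returned because `t_outer` has a good table as its child. The negated atom is evaluated only after `containsgoodtable` is fully computed, which is the usual stratified reading of negation in Datalog.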
27. Object creation in Datalog+: table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). Deduction in Datalog+ is undecidable (TGDs).
28. Object creation in Datalog+: table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). Deduction in Datalog+ is undecidable (TGDs). Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
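To make the object-creating rule concrete, here is a minimal Python sketch of a single application of it: every pair of same-coloured, horizontally adjacent tables gets a freshly invented `tablebox` object. The fresh identifiers (`_box1`, ...) and the sample facts are assumptions for illustration; this is one rule application, not a full chase procedure, and it does not check whether a suitable box already exists.

```python
from itertools import count


def create_tableboxes(tables, same_color, neighbour_right):
    """Apply the rule once: invent a new object X for each matching (T1, T2)."""
    fresh = count(1)  # source of fresh object identifiers (hypothetical naming)
    tablebox, contains = set(), set()
    for (t1, t2) in sorted(neighbour_right):
        if t1 in tables and t2 in tables and (t1, t2) in same_color:
            x = f"_box{next(fresh)}"  # the existentially quantified X
            tablebox.add(x)
            contains.add((x, t1))
            contains.add((x, t2))
    return tablebox, contains


# Hypothetical facts: a and b share a colour and are neighbours; c differs.
tables = {"a", "b", "c"}
same_color = {("a", "b")}
neighbour_right = {("a", "b"), ("b", "c")}

boxes, contains = create_tableboxes(tables, same_color, neighbour_right)
print(boxes)     # {'_box1'}
print(contains)  # the two pairs ('_box1', 'a') and ('_box1', 'b')
```

Because each application can introduce new objects that trigger further rules, unrestricted use of such rules may never terminate, which is the intuition behind the undecidability remark on the slide.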
29. Datalog±: family of languages. Incorporates ontological reasoning (> DL-Lite). Further research is needed to extend it into an ideal language for web objects. Transitivity: containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)
30. Datalog±: family of languages. Incorporates ontological reasoning (> DL-Lite). Further research is needed to extend it into an ideal language for web objects. Transitivity: containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3) is unguarded!
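Guardedness is a purely syntactic condition: a rule is guarded if some body atom contains every variable that occurs in the body. The sketch below checks it in Python under two assumed conventions that are not from the slides: an atom is a `(predicate, arguments)` pair, and variables are the terms that start with an upper-case letter.

```python
def is_variable(term):
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def is_guarded(body):
    """True if one body atom (the guard) contains all body variables."""
    body_vars = {a for (_, args) in body for a in args if is_variable(a)}
    return any(body_vars <= set(args) for (_, args) in body)


# Transitivity rule body: no single atom mentions T1, T2 and T3 together.
transitivity = [
    ("containedin", ("T1", "T2")),
    ("containedin", ("T2", "T3")),
]

# Body of the tablebox rule: sameColor(T1,T2) covers both variables.
tablebox_rule = [
    ("table", ("T1",)),
    ("table", ("T2",)),
    ("sameColor", ("T1", "T2")),
    ("isNeighbourRight", ("T1", "T2")),
]

print(is_guarded(transitivity))   # False
print(is_guarded(tablebox_rule))  # True
```

The check confirms the slide's point: the object-creating rule is guarded, while transitivity of `containedin` is not, so it falls outside the guarded fragment.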
45. Result Extraction
<XQ>:
For each atomic result A
Let price = A/.../.../text()
    description = A/.../.../../text()
    type = A/.../.../../text()
    ........
Return
<rental area="Oxford">
  <price> 1,200 </price>
  <bedrooms> 3 </bedrooms>
  <bathrooms> 1 </bathrooms>
  <type> Flat </type>
  <location> George Street, OX1 </location>
  <description> ... </description>
  <otherInfo> Furnished; Long let - more than six months </otherInfo>
  ...
</rental>
Fields labelled on the result page: price, description, type, otherInfo, bathrooms, location
46.
Editor's Notes
The XQuery statement (FLWOR expression) outputs a single XML document per result page. It produces a flat (i.e. no hierarchy) XML document per some prescribed schema. Key intuition: the notion of an atomic result, regardless of presentation (list, table, etc.). Atomic results are analogous to RDBMS query returns (attributes form tuples); field inputs are preserved as element attribute values.
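The extraction step described in these notes can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is an illustration under stated assumptions, not the XQuery used in the talk: the page markup, the `item`/`span` paths, and the second result are hypothetical, since the slide elides the real paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical result page: the element names and paths are invented for
# illustration; the slide does not show the real page structure.
PAGE = """
<results>
  <item><span class="price">1,200</span><span class="bedrooms">3</span><span class="type">Flat</span></item>
  <item><span class="price">950</span><span class="bedrooms">2</span><span class="type">House</span></item>
</results>
"""

FIELDS = ("price", "bedrooms", "type")


def extract_rentals(page_xml, area):
    """Build one flat XML document for a result page.

    Each atomic result becomes one <rental> element; the form input
    (the search area) is preserved as an attribute value.
    """
    root = ET.fromstring(page_xml)
    out = ET.Element("rentals")
    for item in root.findall("item"):  # one iteration per atomic result
        rental = ET.SubElement(out, "rental", {"area": area})
        for field in FIELDS:
            node = item.find(f"span[@class='{field}']")
            if node is not None and node.text:
                ET.SubElement(rental, field).text = node.text.strip()
    return out


doc = extract_rentals(PAGE, "Oxford")
print(ET.tostring(doc, encoding="unicode"))
```

Each `<rental>` plays the role of one tuple, with its child elements as attributes of that tuple, which mirrors the analogy to RDBMS query results made in the notes.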