R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.
Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
DataStax Accelerate - "Extending Gremlin with Foundational Steps" - Discusses Gremlin, the graph traversal language of Apache TinkerPop, and its functional, data-flow style. Examines how Gremlin can be extended into a Domain Specific Language (DSL) to encapsulate foundational and domain-specific logic for better code organization and reusability.
Search engines (e.g., Google.com, Yahoo.com, and Bing.com) have become the dominant model of online search. Large and small e-commerce sites provide built-in search capabilities so that visitors can examine the products they offer. While most large businesses are able to hire the necessary skills to build advanced search engines, small online businesses still lack the ability to evaluate the results of their search engines, which means losing the opportunity to compete with larger businesses. The purpose of this paper is to build an open-source model that can measure the relevance of search results for online businesses, as well as the accuracy of their underlying algorithms. We used data from a Kaggle.com competition to show our model running on real data.
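Relevance of ranked results is conventionally measured with a graded metric such as NDCG; the sketch below is a generic illustration of that kind of measurement (function names are ours, and this is not necessarily the paper's exact model):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance scores."""
    return sum(rel / math.log2(rank + 2)        # rank is 0-based
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the gain of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded judgments for one query's top-4 results (3 = perfect ... 0 = irrelevant):
print(round(ndcg([3, 2, 0, 1]), 3))  # -> 0.985
```

A score of 1.0 means the engine returned results in the ideal relevance order; lower values quantify how far the ranking deviates from it.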
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange" - Boris Glavic
The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds), has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm, by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.
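To make value invention concrete: when a mapping asserts that each source tuple implies a target tuple with an unknown value, that value can be represented by a Skolem term over the rule's variables. A toy illustration in Python (our own encoding, not the paper's algorithms; the relation and function names are invented):

```python
# Toy value invention for the tgd
#   Emp(name, dept) -> ∃A Office(name, A)
# where the unknown office A is invented as the Skolem term f(name).
def skolem(fn, *args):
    """Represent a Skolem term f(args) as a hashable tuple."""
    return (fn,) + args

source = [("Emp", "alice", "sales"), ("Emp", "bob", "it")]

target = [("Office", name, skolem("f", name))
          for (_, name, _dept) in source]

print(target)
# Tuples that agree on the Skolem function's arguments share the same
# invented value - the precision Skolem functions buy over plain labeled nulls.
```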
Hacktoberfest 2020 'Intro to Knowledge Graph' with Chris Woodward of ArangoDB and reKnowledge. Accompanying video is available here: https://youtu.be/ZZt6xBmltz4
Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation.
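The data-flow style of a Gremlin traversal — a pipeline of steps, each transforming a stream of graph elements — can be sketched over a toy property graph (this is an illustration of the idea in plain Python, not the TinkerPop API):

```python
# Minimal property graph plus pipeline-style traversal steps,
# illustrating the data-flow idea behind Gremlin (NOT the real API).
class Graph:
    def __init__(self):
        self.vertices = {}    # id -> properties dict
        self.out_edges = {}   # id -> list of (label, target id)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.out_edges.setdefault(vid, [])

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

def out(graph, ids, label):
    """Step: follow outgoing edges with the given label."""
    return [dst for vid in ids
                for (lbl, dst) in graph.out_edges[vid] if lbl == label]

def values(graph, ids, key):
    """Step: project a property value from each vertex."""
    return [graph.vertices[vid][key] for vid in ids]

g = Graph()
g.add_vertex(1, name="marko")
g.add_vertex(2, name="josh")
g.add_vertex(3, name="lop")
g.add_edge(1, "knows", 2)
g.add_edge(2, "created", 3)

# "What have marko's acquaintances created?" as a chain of steps:
result = values(g, out(g, out(g, [1], "knows"), "created"), "name")
print(result)  # -> ['lop']
```

Each step consumes the stream produced by the previous one, which is exactly the functional composition the abstract describes.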
Full version of http://www.slideshare.net/valexiev1/gvp-lodcidocshort. Same is available on http://vladimiralexiev.github.io/pres/20140905-CIDOC-GVP/index.html
CIDOC Congress, Dresden, Germany
2014-09-05: International Terminology Working Group: full version.
2014-09-09: Getty special session: short version
An increasing amount of valuable semi-structured data has become available online. In this talk, we overview the state of the art in entity ranking over structured data ("linked data").
Introduction to Graphs, Apache TinkerPop and Gremlin through DataStax Enterprise (DSE) Graph and DataStax Studio from the Apache Cassandra & DataStax DC Meetup.
No more struggles with Apache Spark workloads in production - Chetan Khatri
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, DataSet, Dataframe)
Pragmatic explanation: executors, cores, containers, stages, jobs, and tasks in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to Spark's default sort
Why dropDuplicates() doesn't give consistent results, and what the alternative is
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala's concurrent 'Future' explicitly
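The join-strategy bullet can be illustrated without a cluster: a broadcast hash join ships the small side to every task and probes a hash table, avoiding the shuffle-and-sort that a sort-merge join requires. A plain-Python sketch of the two strategies (illustrative only; in Spark the choice is driven by `spark.sql.autoBroadcastJoinThreshold` or an explicit `broadcast()` hint):

```python
def broadcast_hash_join(big, small, key_big, key_small):
    """Build a hash table on the small side, stream the big side once."""
    table = {}
    for row in small:
        table.setdefault(row[key_small], []).append(row)
    return [(b, s) for b in big for s in table.get(b[key_big], [])]

def sort_merge_join(big, small, key_big, key_small):
    """Sort both sides on the key, then merge - what happens after the shuffle."""
    left = sorted(big, key=lambda r: r[key_big])
    right = sorted(small, key=lambda r: r[key_small])
    out, j = [], 0
    for b in left:
        while j < len(right) and right[j][key_small] < b[key_big]:
            j += 1
        k = j
        while k < len(right) and right[k][key_small] == b[key_big]:
            out.append((b, right[k]))
            k += 1
    return out

orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}, {"cust": 1, "amt": 5}]
custs = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
a = broadcast_hash_join(orders, custs, "cust", "id")
b = sort_merge_join(orders, custs, "cust", "id")
assert sorted(map(str, a)) == sorted(map(str, b))  # same join, different cost
```

Both produce the same pairs; the difference in production is that the hash variant never shuffles the large side.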
From Michal Malohlava's talk at MLConf NYC 3/27/15:
Building Machine Learning Applications with Sparkling Water: Writing applications that process and analyze large amounts of data is still hard. It often requires designing and running Machine Learning experiments at small scale, then consolidating them into an application and running them at large scale. Several distributed machine learning platforms try to mitigate this effort. In this talk we will focus on Sparkling Water, which combines the benefits of two platforms: H2O, an open-source distributed math engine providing a tuned Machine Learning library, and Spark, an execution platform for processing large amounts of data. The talk will demonstrate Sparkling Water's features and show its benefits for building rich and robust Machine Learning applications.
Focused mainly on the object-oriented plotting interface of Matplotlib, this review is intended as a go-to reference guide for quick plotting.
#Matplotlib #Python #Pandas #Seaborn
Relaxing global-as-view in mediated data integration from linked data - Alessandro Adamou
Slides of my presentation at Semantic Big Data (SBD 2020), co-located with SIGMOD 2020. The actual presentation has my pre-recorded commentary and a superimposed picture, so these slides can be used as a reference.
Dynamic Factual Summaries for Entity Cards - Faegheh Hasibi
Slides for SIGIR 2017 paper "Dynamic Factual Summaries for Entity Cards"
Entity cards are being used frequently in modern web search engines to offer a concise overview of an entity directly on the results page. These cards are composed of various elements, one of them being the entity summary: a selection of facts describing the entity from an underlying knowledge base. These summaries, while presenting a synopsis of the entity, can also directly address users' information needs. In this paper, we make the first effort towards generating and evaluating such factual summaries. We introduce and address the novel problem of dynamic entity summarization for entity cards, and break it down into two specific subtasks: fact ranking and summary generation. We perform an extensive evaluation of our method using crowdsourcing. Our results show the effectiveness of our fact ranking approach and validate that users prefer dynamic summaries over static ones.
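The two subtasks compose naturally: score the facts, then fill the card's summary under a size budget. A minimal sketch (the paper learns fact scores from features; the facts and numbers below are made up for illustration):

```python
def summarize(scored_facts, k=3):
    """Fact ranking + summary generation: sort facts by score, cut at budget k."""
    ranked = sorted(scored_facts, key=lambda fs: fs[1], reverse=True)
    return [fact for fact, _score in ranked[:k]]

# Illustrative (fact, utility-score) pairs for one entity:
scored = [("genre: jazz", 0.9), ("born: 1926", 0.7),
          ("label: Columbia", 0.6), ("instrument: trumpet", 0.3)]
print(summarize(scored))
# -> ['genre: jazz', 'born: 1926', 'label: Columbia']
```

A dynamic summary re-scores the facts per query, so the same entity can surface different facts for different information needs.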
Interactive Sparkling Water session by Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service, which show text mining on the "Enron Email Dataset" from Infochimps.com plus data visualization using R and Gephi
Source at: http://github.com/ceteri/ceteri-mapred
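The Streaming model used in the talk is simple to sketch: Hadoop pipes input splits through any mapper executable, sorts the emitted key-value lines, and pipes each key's group through a reducer. A minimal word-count pair in Python (the generic streaming pattern, not the talk's actual Enron scripts):

```python
from itertools import groupby

def mapper(lines):
    """Emit word<TAB>1 lines, the key-value format Hadoop Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Hadoop delivers mapper output sorted by key; sum the counts per word."""
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# In a real job, mapper and reducer are separate scripts reading sys.stdin;
# here we pipe them together locally (Hadoop's sort happens between them).
shuffled = sorted(mapper(["to be or not to be"]))
print(list(reducer(shuffled)))
# -> ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

Because the contract is just stdin/stdout lines, the same scripts run unchanged on AWS Elastic MapReduce.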
FlexClock, a Plastic Clock Written in Oz with the QTk toolkit - Jean Vanderdonckt
This paper focuses on the techniques involved in building an interactive application using a plastic user interface. These techniques take advantage of the QTk toolkit, a toolkit that features unusual but interesting concepts with respect to more classical object-oriented toolkits. These features are possible thanks to the underlying programming language used, Oz. In particular, it can support all facilities provided by symbolic records, like XML structures, and more. It also exhibits the capacity to wrap any language entities into higher-order data structures. This paper shows by a case study how the combination of QTk and Oz helps developers write plastic user interfaces very easily.
Metadata and Provenance for ML Pipelines with Hopsworks - Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
Query Rewriting and Optimization for Ontological Databases - Giorgio Orsi
Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog+/- family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so as to produce possibly small and cost-effective UCQ rewritings for an input query.
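The core rewriting step can be caricatured in miniature: a query atom that matches a rule head is replaced by the rule body, and repeating this to a fixpoint yields the union of conjunctive queries. A drastically simplified, predicate-only sketch (our own toy encoding; the real algorithm performs genuine term unification and handles existential and sticky rules):

```python
# Linear rules stored head -> possible bodies, i.e. the tgd body(X) -> head(X)
# read backwards. Arguments are elided: a real rewriter unifies terms.
rules = {"employee": ["manager"], "person": ["employee"]}

def rewrite(query):
    """Compute the UCQ: every query reachable by replacing an atom
    with the body of a rule whose head matches it."""
    ucq, frontier = {query}, [query]
    while frontier:
        q = frontier.pop()
        for i, pred in enumerate(q):
            for body_pred in rules.get(pred, []):
                new_q = q[:i] + (body_pred,) + q[i + 1:]
                if new_q not in ucq:
                    ucq.add(new_q)
                    frontier.append(new_q)
    return ucq

print(sorted(rewrite(("person",))))
# -> [('employee',), ('manager',), ('person',)]
```

Evaluating the resulting union directly on the extensional database gives the certain answers without materializing any inferred data.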
ROSeAnn: Reconciling Opinions of Semantic Annotators - VLDB 2014 - Giorgio Orsi
ROSeAnn - Reconciling Opinions of Semantic Annotators. VLDB 2014 Conference.
A growing number of resources are available for enriching documents with semantic annotations. While originally focused on a few standard classes of annotations, the ecosystem of annotators is now becoming increasingly diverse. Although annotators often have very different vocabularies, with both high-level and specialist concepts, they also have many semantic interconnections. We will show that both the overlap and the diversity in annotator vocabularies motivate the need for semantic annotation integration: middleware that produces a unified annotation on top of diverse semantic annotators. On the one hand, the diversity of vocabulary allows applications to benefit from the much richer vocabulary available in an integrated vocabulary. On the other hand, we present evidence that the most widely used annotators on the web suffer from serious accuracy deficiencies: the overlap in vocabularies from individual annotators allows an integrated annotator to boost accuracy by exploiting inter-annotator agreement and disagreement. The integration of semantic annotations leads to new challenges, both compared to usual data integration scenarios and to standard aggregation of machine learning tools. We overview an approach to these challenges that performs ontology-aware aggregation. We introduce an approach that requires no training data, making use of ideas from database repair. We experimentally compare this with a supervised approach, which adapts maximum entropy Markov models to the setting of ontology-based annotations. We further experimentally compare both these approaches with respect to ontology-unaware supervised approaches, and to individual annotators.
Heuristic Ranking in Tightly Coupled Probabilistic Description Logics - Giorgio Orsi
The Semantic Web effort has steadily been gaining traction in recent years. In particular, Web search companies are realizing that their products need to evolve towards richer semantic search capabilities. Description logics (DLs) have been adopted as the formal underpinnings for Semantic Web languages used in describing ontologies. Reasoning under uncertainty has recently taken a leading role in this arena, given the nature of data found on the Web. In this paper, we present a probabilistic extension of the DL EL++ (which underlies the OWL 2 EL profile) using Markov logic networks (MLNs) as probabilistic semantics. This extension is tightly coupled, meaning that probabilistic annotations in formulas can refer to objects in the ontology. We show that, even though the tightly coupled nature of our language means that many basic operations are data-intractable, we can leverage a sublanguage of MLNs that allows ranking the atomic consequences of an ontology relative to their probability values (called ranking queries) even when these values are not fully computed. We present an anytime algorithm to answer ranking queries, and provide an upper bound on the error that it incurs, as well as a criterion to decide when results are guaranteed to be correct.
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach AMBER is able to identify records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration.
Search engines are the sinews of the web. These sinews have become strained, however: Where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers--never to be sure a better offer isn't just around the corner. What search engines are missing is understanding of the objects and their attributes published on websites.
Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy. The price is that for this task we need to provide DIADEM with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we demonstrate with a first prototype of DIADEM that, in contrast to alchemists, DIADEM has developed a viable formula.
OPAL: a passe-partout for web forms - WWW 2012 (Demonstration) - Giorgio Orsi
Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching. Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels to fields by analyzing structural properties in the HTML encoding and visual features of the page rendering. OPAL interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two properties, allows OPAL to deal effectively with many forms outside of the grasp of existing form filling techniques. In the UK real estate domain, OPAL achieves >99% accuracy in form understanding.
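OPAL's first step — associating a text label with the right form field — can be caricatured with geometry alone (a toy stand-in for OPAL's combined structural and visual analysis; all names and coordinates below are invented):

```python
def associate_labels(labels, fields):
    """Assign each label to the closest field by rendered (x, y) position.
    OPAL additionally uses HTML structure and design-pattern knowledge."""
    assignment = {}
    for text, (lx, ly) in labels.items():
        nearest = min(fields,
                      key=lambda f: (fields[f][0] - lx) ** 2 + (fields[f][1] - ly) ** 2)
        assignment[text] = nearest
    return assignment

labels = {"Min price": (10, 50), "Location": (10, 20)}
fields = {"price_input": (80, 50), "location_input": (80, 20)}
print(associate_labels(labels, fields))
# -> {'Min price': 'price_input', 'Location': 'location_input'}
```

The labeling is then interpreted against a domain ontology to classify each field, which is what lets the demo fill previously unseen forms.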
Querying UML Class Diagrams - FoSSaCS 2012 - Giorgio Orsi
UML Class Diagrams (UCDs) are the best known class-based formalism for conceptual modeling. They are used by software engineers to model the intensional structure of a system in terms of classes, attributes and operations, and to express constraints that must hold for every instance of the system. Reasoning over UCDs is of paramount importance in design, validation, maintenance and system analysis; however, for medium and large software projects, reasoning over UCDs may be impractical. Query answering, in particular, can be used to verify whether a (possibly incomplete) instance of the system modeled by the UCD, i.e., a snapshot, enjoys a certain property. In this work, we study the problem of querying UCD instances, and we relate it to query answering under guarded Datalog +/-, that is, a powerful Datalog-based language for ontological modeling. We present an expressive and meaningful class of UCDs, named UCDLog, under which conjunctive query answering is tractable in the size of the instances.
OPAL: automated form understanding for the deep web - WWW 2012 - Giorgio Orsi
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).
Nyaya: Semantic data markets: a flexible environment for knowledge management... - Giorgio Orsi
We present Nyaya, a flexible system for the management of Semantic-Web data which couples a general-purpose storage mechanism with efficient ontology reasoning and querying capabilities. Nyaya processes large Semantic-Web datasets, expressed in a variety of formalisms, by transforming them into a collection of Semantic Data Kiosks. Each kiosk exposes the native meta-data in a uniform fashion using Datalog±, a very general rule-based language for the representation of ontological constraints. The kiosks form a Semantic Data Market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. Nyaya is easily extensible, robust to updates of both data and meta-data in the kiosks, and can readily adapt to different logical organizations of the persistent storage. The approach has been evaluated using well-known benchmarks, and compared to state-of-the-art research prototypes and commercial systems.
3. Web data extraction. Goal: make web contents accessible to electronic data processing. Wrappers select, extract, and annotate HTML pages from the web, turning their layout into structured data (XML, databases) for corporate EDP applications.
5. Foundations of the Lixto Visual Wrapper suite: MSO, monadic Datalog, and Elog, spanning logic, database theory, database programming, and application design.
6. Lixto Visual Developer (VD): navigation steps, the embedded Mozilla web browser, and the extraction configuration.
9. Need for Automatic Extraction Technology. Example: real estate in the UK, with 17,000 sites, many not covered by aggregators. We do have a list of all homepages (Yellow Pages UK). Manual or semi-automatic wrapping is too expensive: wrapper construction, testing, and keeping track of changes. No tool or method can do it fully automatically. Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
10. Need for Automatic Extraction Technology (2). All search engine providers need it! Many work on it. Keywords: vertical search, object search, semantic search. Raghu Ramakrishnan, Yahoo!, March 2009: "no one really has done this successfully at scale yet". Alon Halevy, Google, Feb. 2009: "Current technologies are not good enough yet to provide what search engines really need. [...] any successful approach would probably need a combination of knowledge and learning."
11. The black box we want to construct: given a URL from an application domain with thousands of websites, the black box outputs application-relevant structured data (XML or RDF). To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
12. Relationship to SeCo & Webdam. Example query over real-estate and restaurant web services: find apartments in Milan whose prices are average in quarters where restaurant quality is above average.
16. Bottom-up (low-level) annotation: diagram relating low-level page elements (a monochromatic rectangle, a postcode input field, an active map with coordinates) via ISA and occurs-in links to higher-level concepts such as a geographic search facility, a price search facility, and a combined geo-price search box.
17. Top-down reasoning: a property search facility, a property list, a single property description, and a specially highlighted property, related by part-of constraints.
18. Bottom-up processing meets top-down reasoning: the low-level annotations of slide 16 are connected to the conceptual structures of slide 17 through the phenomenology of the domain.
19. Datalog for Web-Object Reasoning. If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask (the implication arrows, lost in the slide export, are restored below; the negation in the third rule follows from the prose):

  table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
  goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
  goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
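The rules on this slide can be run directly with a naive evaluator; a minimal sketch (facts invented for illustration; negation is handled by fully computing containsgoodtable before negating it, and we assume child is the direct-child relation, closing it over ancestors):

```python
def property_search_mask(tables, occurs_in, child):
    """Evaluate the slide's rules: the search mask is a table holding both
    selection fields that does not contain a smaller such table."""
    good = {t for t in tables
            if {"areaselection", "priceselection"} <= occurs_in.get(t, set())}
    contains_good = {p for (p, t) in child if t in good}
    # propagate upwards: an ancestor of a container also contains a good table
    changed = True
    while changed:
        new = {p for (p, t) in child if t in contains_good} - contains_good
        contains_good |= new
        changed = bool(new)
    return good - contains_good

tables = {"outer", "inner"}
occurs_in = {"outer": {"areaselection", "priceselection"},
             "inner": {"areaselection", "priceselection"}}
child = {("outer", "inner")}
print(property_search_mask(tables, occurs_in, child))  # -> {'inner'}
```

The smallest table containing both fields wins, exactly as the prose on the slide demands.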
27. Object creation in Datalog± (the existential quantifier, lost in the slide export, is restored below):

  table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

Deduction with arbitrary TGDs is undecidable.
28. Object creation in Datalog± (continued): Datalog± requires guardedness of rule bodies, which restores decidability with linear-time data complexity.
29-30. Datalog±: a family of languages incorporating ontological reasoning (beyond DL-Lite). Further research is needed to extend it into an ideal language for web objects. Transitivity, for example, is unguarded:

  containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)   (unguarded!)
45. Result Extraction <XQ>: For each atomic result A, let

  price       = A/.../.../text()
  description = A/.../.../../text()
  type        = A/.../.../../text()
  ........

Return:

  <rental area="Oxford">
    <price>1,200</price>
    <bedrooms>3</bedrooms>
    <bathrooms>1</bathrooms>
    <type>Flat</type>
    <location>George Street,OX1</location>
    <description>...</description>
    <otherInfo>Furnished; Long let - more than six months</otherInfo>
    ...
  </rental>
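The return clause above assembles one flat XML record per atomic result; an equivalent assembly with Python's standard library, using field values that mirror the slide's example, would be:

```python
import xml.etree.ElementTree as ET

def build_rental(area, fields):
    """Build one flat <rental> record per atomic result, one child per field."""
    rental = ET.Element("rental", area=area)
    for name, value in fields.items():
        ET.SubElement(rental, name).text = value
    return rental

record = build_rental("Oxford", {
    "price": "1,200", "bedrooms": "3", "bathrooms": "1",
    "type": "Flat", "location": "George Street,OX1",
})
print(ET.tostring(record, encoding="unicode"))
```

Each record is deliberately flat (no nesting), matching the prescribed per-page schema the editor's notes describe.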
Editor's Notes
The XQuery statement (a FLWOR expression) outputs a single XML document per result page. It produces a flat (i.e., no hierarchy) XML document per some prescribed schema. Key intuition: the notion of an atomic result, regardless of presentation (list, table, etc.). Atomic results are analogous to RDBMS query returns (attributes form tuples); field inputs are preserved as element attribute values.