Web Data Extraction Como2010
  • The XQuery statement (FLWOR expression) outputs a single XML document per result page. It produces a flat (i.e. no hierarchy) XML document per some prescribed schema. Key intuition: the notion of an atomic result, regardless of presentation (list, table, etc.). Atomic results are analogous to RDBMS query returns (attributes form tuples); field inputs are preserved as element attribute values.

Web Data Extraction Como2010 Presentation Transcript

  • 1. DIADEM A Short Overview Georg Gottlob
  • 2.  
  • 3. Web data extraction
    • Web: HTML pages, layout
    • Wrapper
    • Corporate EDP apps: structured data, databases, XML
    • Goal: make web contents accessible to electronic data processing
    • Wrappers: HTML → select → extract → annotate → XML
  • 4.  
  • 5. MSO Monadic Datalog Elog Lixto Visual Wrapper   =  Suite Logic Database theory DB programming Application design
  • 6. Lixto Visual Developer (VD) Navigation Steps Mozilla Web Browser Extraction Configuration
  • 7.  
  • 8. . .
  • 9. Need for Automatic Extraction Technology
    • Example: real estate in the UK, 17,000 sites, many not covered by aggregators.
    • We do have a list of all homepages (Yellow Pages UK).
    • Manual or semi-automatic wrapping is too expensive: wrapper construction, testing, keeping track of changes.
    • No tool or method can do it fully automatically.
    • Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
  • 10. Need for Automatic Extraction Technology (2)
    • All search engine providers need it! Many work on it.
    • Keywords: vertical search, object search, semantic search.
    • Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”
    • Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”
  • 11. The Blackbox we want to construct
    • Input: a URL from an application domain with thousands of websites.
    • Output: application-relevant structured data (XML or RDF).
    • To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
  • 12. Relationship to SeCo & Webdam
    • Q: Find apartments in Milan whose prices are ≤ average in quarters where restaurant quality > average.
    • [Diagram labels: Real estate, Restaurants, Web service, Web service, Results]
  • 13. How to achieve it?
    • Rationale: Combine existing and new “low level” annotators with “high level” AI and reasoning.
    • Low level annotators:
    • - Bottom-up page analysis.
    • - ML-based entity recognizers
    • - NLP & ontological text annotation
    • - Web page classification & analysis
    • - Basic link analysis
  • 14. High level reasoning:
    • - Goal oriented
    • - Conceptual domain objects
    • - Conceptual interaction elements
    • - High-level object ontology
    • - Domain knowledge
  • 15.  
  • 16. Bottom-up (low-level) annotation
    • [Diagram. Nodes: Monochromatic Rectangle [(02873,227) (03900,417)], Geo-Price-Searchbox [(02873,227) (03900,417)], Geographic search facility, Postcode input field, Active map, Price search facility. Edge labels: ISA, Occurs in. Numeric labels: 105, 105, 127.]
  • 17. Top-down reasoning
    • [Diagram. Concepts: Property Search Facility, Property List, Single Property Description, Specially highlighted property. Edge label: part-of (m : 1).]
  • 18. Bottom-up processing, top-down reasoning
    • [Diagram combining the low-level annotations of slide 16 with the domain concepts of slide 17; the link between the two levels is labelled Phenomenology.]
  • 19. Datalog for Web-Object Reasoning
    • table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
    • goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
    • goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
    • If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
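The three rules on slide 19 can be evaluated bottom-up in a few lines. The following is only a minimal Python sketch, not the DIADEM implementation: the function name property_search_masks, the set-based encoding of facts, and the toy page (page_layout, search_box, footer) are assumptions made for illustration. As on the slide, child is taken to mean direct nesting.

```python
from typing import Set, Tuple


def property_search_masks(
    tables: Set[str],
    occurs_in: Set[Tuple[str, str]],
    child: Set[Tuple[str, str]],
) -> Set[str]:
    """Evaluate the three slide rules bottom-up.

    occurs_in holds (table, field) pairs and child holds (parent, child)
    pairs, following the argument order used on the slide.
    """
    # Rule 1: a table holding both an area and a price selection is "good".
    goodtable = {
        t for t in tables
        if (t, "areaselection") in occurs_in
        and (t, "priceselection") in occurs_in
    }
    # Rule 2: a parent with a good child table contains a good table.
    containsgoodtable = {p for (p, t) in child if t in goodtable}
    # Rule 3 (negation applied after rules 1 and 2 are complete):
    # good tables that contain no good table themselves.
    return goodtable - containsgoodtable


# Hypothetical page: a layout table wrapping a search box and a footer.
tables = {"page_layout", "search_box", "footer"}
occurs_in = {
    ("page_layout", "areaselection"), ("page_layout", "priceselection"),
    ("search_box", "areaselection"), ("search_box", "priceselection"),
    ("footer", "priceselection"),
}
child = {("page_layout", "search_box"), ("page_layout", "footer")}

masks = property_search_masks(tables, occurs_in, child)
```

Here both page_layout and search_box satisfy rule 1, but only search_box survives rule 3, because page_layout contains a good table.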
  • 20. Crucial steps
    • WP1 data model (KRR model)
    • WP2 low & intermediate level annotation
    • WP3 High level ontology and Rules (top down)
    • + mapping HL to Int. Level: Phenomenology
    • WP4 Access, interaction, & navigation
    • WP5 Compilation; Learning XPath expressions
    • WP6 Highly parallel execution on clouds
    • WP7 General methodology
  • 21.  
  • 23. The Data Model
    • Datalog is good but does not suffice.
    • On top of it:
    • Need for object creation
    • Need for ontological reasoning
    • Need for probabilistic reasoning
    • Need for default reasoning
  • 24. Object creation in Datalog+
    • table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).
    • Example tables: PRODUCT (Toshiba Protégé cx, Dell 25416, Dell 23233, Acer 78987) and PRICE (480, 360, 470, 390).
  • 25. Object creation in Datalog+ (same rule and example; the two tables are labelled T1 and T2)
  • 26. Object creation in Datalog+ (same rule and example; the PRICE table is shown again next to T1 and T2)
  • 27. Object creation in Datalog+ (same rule)
    • Deduction in Datalog+ is undecidable (TGDs).
  • 28. Object creation in Datalog+ (same rule)
    • Deduction in Datalog+ is undecidable (TGDs).
    • Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
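To make the object-creating rule concrete, here is a small Python sketch of a single application of it. It is an illustration under stated assumptions, not the project's reasoner: facts are encoded as tuples, the function name create_tableboxes is made up, and the existential variable X is replaced by a fresh identifier such as _box1. A real engine would also have to decide when to stop creating objects, which is exactly where the undecidability mentioned on the slide comes from.

```python
from itertools import count
from typing import Set, Tuple

Fact = Tuple[str, ...]  # (predicate, argument, ...)


def create_tableboxes(facts: Set[Fact]) -> Set[Fact]:
    """Apply the object-creating rule once for every matching pair (T1, T2)."""
    fresh = count(1)
    derived: Set[Fact] = set()
    tables = {f[1] for f in facts if f[0] == "table"}
    for pred, *args in sorted(facts):
        if pred != "isNeighbourRight":
            continue
        t1, t2 = args
        if t1 in tables and t2 in tables and ("sameColor", t1, t2) in facts:
            # A fresh identifier stands in for the existential variable X.
            box = f"_box{next(fresh)}"
            derived |= {
                ("tablebox", box),
                ("contains", box, t1),
                ("contains", box, t2),
            }
    return derived


# The PRODUCT and PRICE tables of the slide, as T1 and T2.
facts = {
    ("table", "T1"), ("table", "T2"),
    ("sameColor", "T1", "T2"),
    ("isNeighbourRight", "T1", "T2"),
}
new_facts = create_tableboxes(facts)
```

The result contains one new tablebox object together with the two contains facts that tie it to T1 and T2.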
  • 29. Datalog±: a family of languages
    • Incorporates ontological reasoning (> DL-Lite).
    • Further research is needed to extend it into an ideal language for web objects.
    • Transitivity: containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)
  • 30. Datalog±: a family of languages (same slide; the transitivity rule is marked: unguarded!)
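Guardedness is easy to check mechanically: a rule body is guarded if one of its atoms mentions every variable that occurs in the body. The Python sketch below assumes a simple encoding (an atom is a predicate name plus a tuple of variable names, with no constants); the names Atom, body_variables and is_guarded are illustrative only.

```python
from typing import List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, variables)


def body_variables(body: List[Atom]) -> Set[str]:
    """Collect every variable occurring in the rule body."""
    return {arg for _, args in body for arg in args}


def is_guarded(body: List[Atom]) -> bool:
    """True if a single body atom (the guard) mentions every body variable."""
    variables = body_variables(body)
    return any(variables <= set(args) for _, args in body)


# Body of the transitivity rule for containedin.
transitivity = [
    ("containedin", ("T1", "T2")),
    ("containedin", ("T2", "T3")),
]
# Body of the object-creating tablebox rule.
object_creation = [
    ("table", ("T1",)), ("table", ("T2",)),
    ("sameColor", ("T1", "T2")),
    ("isNeighbourRight", ("T1", "T2")),
]

transitivity_guarded = is_guarded(transitivity)
object_creation_guarded = is_guarded(object_creation)
```

No atom of the transitivity body covers T1, T2 and T3 at once, so that rule is unguarded, whereas sameColor(T1,T2) guards the body of the tablebox rule.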
  • 31. DL-Lite and its translation into Datalog[∃, ⊥; Lin]
    • Professor ⊑ ∃TeachesTo becomes Professor(x) → ∃y TeachesTo(x,y)
    • ∃TeachesTo⁻ ⊑ Student becomes TeachesTo(x,y) → Student(y)
    • HasTutor⁻ ⊑ TeachesTo becomes HasTutor(x,y) → TeachesTo(y,x)
    • funct(HasTutor) becomes HasTutor(x,y) & HasTutor(x,y’) & Neq(y,y’) → ⊥ (always innocuous!)
    • Professor ⊑ ¬Student becomes Professor(x) & Student(x) → ⊥
    • Fragments shown: DL-Lite core, DL-Lite R, DL-Lite F
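The axiom patterns on this slide translate into rules one by one. The following Python sketch spells out that mapping as string templates; the tagged-tuple encoding of axioms and the function name translate are assumptions for illustration, and only the five patterns shown on the slide are covered.

```python
def translate(axiom: tuple) -> str:
    """Render one DL-Lite axiom (as a tagged tuple) as a Datalog± rule string."""
    kind = axiom[0]
    if kind == "concept_in_exists":           # A ⊑ ∃R
        _, a, r = axiom
        return f"{a}(x) → ∃y {r}(x,y)"
    if kind == "exists_inverse_in_concept":   # ∃R⁻ ⊑ A
        _, r, a = axiom
        return f"{r}(x,y) → {a}(y)"
    if kind == "inverse_role_in_role":        # R⁻ ⊑ S
        _, r, s = axiom
        return f"{r}(x,y) → {s}(y,x)"
    if kind == "functional":                  # funct(R)
        _, r = axiom
        return f"{r}(x,y) & {r}(x,y') & Neq(y,y') → ⊥"
    if kind == "disjoint":                    # A ⊑ ¬B
        _, a, b = axiom
        return f"{a}(x) & {b}(x) → ⊥"
    raise ValueError(f"unsupported axiom pattern: {kind}")


ontology = [
    ("concept_in_exists", "Professor", "TeachesTo"),
    ("exists_inverse_in_concept", "TeachesTo", "Student"),
    ("inverse_role_in_role", "HasTutor", "TeachesTo"),
    ("functional", "HasTutor"),
    ("disjoint", "Professor", "Student"),
]
rules = [translate(axiom) for axiom in ontology]
```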
  • 33. Low & Intermediate Level Annotation: we will use various existing tools and techniques (rather than reinvent the wheel)
    • Named entity recognizers
    • Machine learning
    • Computational linguistics
    • Page layout analysis
    • PDF extraction
  • 34.  
  • 35.  
  • 36.  
  • 37. Extraction from PDF (Tamir Hassan)
  • 39. Navigation & Interaction
  • 41.  
  • 42.  
  • 43. OXPath
    • Extension of XPath
    • Facilitates querying web forms and retrieving the returned data
    • Simulates a user filling out web forms
    • Highly parallelizable (geared towards cloud computing)
    • Navigation and collecting data across multiple pages
  • 44. Result Extraction
    • .... /next-field::*/ {“Renting”} / ... / {...} /.../ {“Submit”} /<XQ>
    • Atomic results regardless of presentation (list, table, etc.)
  • 45. Result Extraction
    • <XQ>: For each atomic result A
    • Let price = A/.../.../text(), description = A/.../.../../text(), type = A/.../.../../text(), ........
    • Return <rental area="Oxford"> <price> 1,200 </price> <bedrooms> 3 </bedrooms> <bathrooms> 1 </bathrooms> <type> Flat </type> <location> George Street,OX1 </location> <description> ... </description> <otherInfo> Furnished; Long let - more than six months </otherInfo> ... </rental>
    • Fields marked on the result page: price, description, type, otherInfo, bathrooms, location
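The return step of slide 45 builds one flat XML record per atomic result, with the form input kept as an attribute. A minimal Python sketch of that step follows, using the standard library rather than XQuery. It assumes the field values have already been located on the page; the function name to_rental_xml and the dictionary encoding are illustrative, while the element names come from the slide.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def to_rental_xml(area: str, fields: Dict[str, str]) -> str:
    """Build one flat <rental> record: one child element per extracted field."""
    # The form input (the searched area) is preserved as an attribute value.
    rental = ET.Element("rental", {"area": area})
    for name, value in fields.items():
        child = ET.SubElement(rental, name)
        child.text = value
    return ET.tostring(rental, encoding="unicode")


# One atomic result, using values shown on the slide.
record = to_rental_xml(
    "Oxford",
    {"price": "1,200", "bedrooms": "3", "bathrooms": "1", "type": "Flat"},
)
```

Every field becomes a direct child of rental, so the record stays flat whether the page presented the result as a list item or a table row.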