DIADEM A Short Overview  Georg Gottlob
 
  • The XQuery statement (a FLWOR expression) outputs a single XML document per result page. It produces a flat (i.e. non-hierarchical) XML document conforming to a prescribed schema. The key intuition is the notion of an atomic result, regardless of presentation (list, table, etc.). Atomic results are analogous to RDBMS query returns (attributes form tuples); field inputs are preserved as element attribute values.
  • Web Data Extraction Como2010

    1. 1. DIADEM A Short Overview Georg Gottlob
    2. 3. Web data extraction. A WRAPPER sits between the Web (HTML pages, layout) and corporate EDP applications (structured data, databases, XML). Goal: make web contents accessible to electronic data processing. Wrappers: HTML → select → extract → annotate → XML.
    3. 5. MSO, Monadic Datalog, Elog, Lixto Visual Wrapper = Suite. Levels: logic, database theory, DB programming, application design.
    4. 6. Lixto Visual Developer (VD): navigation steps, Mozilla web browser, extraction configuration.
    6. 9. Need for Automatic Extraction Technology. Example: real estate in the UK.
        - 17,000 sites, many not covered by aggregators.
        - We do have a list of all homepages (Yellow Pages UK).
        - Manual or semi-automatic wrapping is too expensive: wrapper construction, testing, keeping track of changes.
        - No tool or method can do it fully automatically.
        - Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
    7. 10. Need for Automatic Extraction Technology (2). All search engine providers need it, and many work on it. Keywords: vertical search, object search, semantic search.
        - Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”
        - Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”
    8. 11. The Blackbox we want to construct. Input: a URL from an application domain with thousands of websites. Output: application-relevant structured data (XML or RDF). To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
    9. 12. Relationship to SeCo & Webdam. Q: Find apartments in Milan whose prices are ≤ average in quarters where restaurant quality > average. Diagram labels: Real estate, Restaurants, Web service A, Web service, Results.
    10. 13. How to achieve it? Rationale: combine existing and new “low level” annotators with “high level” AI and reasoning. Low-level annotators:
        - Bottom-up page analysis
        - ML-based entity recognizers
        - NLP & ontological text annotation
        - Web page classification & analysis
        - Basic link analysis
    11. 14. High-level reasoning:
        - Goal oriented
        - Conceptual domain objects
        - Conceptual interaction elements
        - High-level object ontology
        - Domain knowledge
    12. 16. Bottom-up (low-level) annotation. [Diagram. Node labels: Monochromatic Rectangle [(02873,227) (03900,417)], Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox [(02873,227) (03900,417)]. Edge labels: ISA, Occurs in. Node ids: 105, 105, 127.]
    13. 17. Top-down reasoning. [Diagram. Node labels: Property Search Facility, Property List, Single Property Description, Specially highlighted property. Edge labels: part-of, m, 1.]
    14. 18. Bottom-up processing meets top-down reasoning. [Diagram combining the node and edge labels of slides 16 and 17; the link between the low-level annotations and the high-level concepts is labelled Phenomenology.]
    15. 19. Datalog for Web-Object Reasoning.
        table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
        goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
        goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
        If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
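The three rules above can be evaluated bottom-up, computing `containsgoodtable` completely before the negated atom in the third rule is checked (stratified negation). The following Python is only a minimal sketch of that evaluation order, not the DIADEM implementation; the toy page, the node names `inner`, `outer`, `page`, and the function `property_search_masks` are assumptions made for illustration.

```python
# Toy page (hypothetical): three nested tables, each mapped to its parent,
# plus the annotations that occur inside each table.
tables = {"inner", "outer", "page"}
parent = {"inner": "outer", "outer": "page"}  # child -> parent
occurs_in = {
    ("inner", "areaselection"), ("inner", "priceselection"),
    ("outer", "areaselection"), ("outer", "priceselection"),
    ("page", "areaselection"), ("page", "priceselection"),
}


def property_search_masks(tables, occurs_in, parent):
    # Rule 1: a table holding both an area and a price selection is "good".
    good = {
        t for t in tables
        if (t, "areaselection") in occurs_in
        and (t, "priceselection") in occurs_in
    }
    # Rule 2: the parent of a good table contains a good table.
    contains_good = {parent[t] for t in good if t in parent}
    # Rule 3: negation is applied only after contains_good is complete.
    return good - contains_good


masks = property_search_masks(tables, occurs_in, parent)
print(masks)  # {'inner'}
```

Only the innermost table survives, which matches the informal reading of the rules: the search mask is the smallest table containing both input fields.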
    16. 20. Crucial steps
        - WP1: data model (KRR model)
        - WP2: low and intermediate level annotation
        - WP3: high-level ontology and rules (top down), plus mapping from high level to intermediate level (phenomenology)
        - WP4: access, interaction, and navigation
        - WP5: compilation; learning XPath expressions
        - WP6: highly parallel execution on clouds
        - WP7: general methodology
    18. 23. The Data Model. Datalog is good but does not suffice. On top of it:
        - Need for object creation
        - Need for ontological reasoning
        - Need for probabilistic reasoning
        - Need for default reasoning
    19. 24. Object creation in Datalog+ (slides 24 to 28 build up one example).
        table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).
        Example: T1 is the PRODUCT column (Toshiba Protégé cx, Dell 25416, Dell 23233, Acer 78987) and T2 is the PRICE column (480, 360, 470, 390).
        - Deduction in Datalog+ is undecidable (TGDs).
        - Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
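The existential rule above creates a new object X for every pair of tables that matches its body. One way to picture this is a single chase-style step that invents a fresh identifier per match. The sketch below assumes that reading; the function `create_tableboxes` and the identifier scheme `box1`, `box2`, ... are hypothetical and not taken from the slides.

```python
from itertools import count


def create_tableboxes(tables, same_color, neighbour_right):
    """Fire the object-creation rule once for every match of its body."""
    fresh = count(1)  # supplies new, previously unused object ids
    tablebox, contains = set(), set()
    for (t1, t2) in sorted(neighbour_right):
        if t1 in tables and t2 in tables and (t1, t2) in same_color:
            x = f"box{next(fresh)}"  # the existentially quantified X
            tablebox.add(x)
            contains.add((x, t1))
            contains.add((x, t2))
    return tablebox, contains


# T1 = PRODUCT column, T2 = PRICE column, as in the slide example.
boxes, contains = create_tableboxes(
    tables={"T1", "T2"},
    same_color={("T1", "T2")},
    neighbour_right={("T1", "T2")},
)
print(boxes)     # {'box1'}
print(contains)  # the pairs ('box1', 'T1') and ('box1', 'T2'); set order may vary
```

Because new objects can themselves trigger further rules, repeating such steps need not terminate in general, which is the intuition behind the undecidability remark and the guardedness restriction on the slide.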
    24. 29. Datalog±: a family of languages. It incorporates ontological reasoning (> DL-Lite). Further research is needed to extend it so that it becomes an ideal language for web objects. For example, transitivity is unguarded: containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3).
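The transitivity rule is unguarded because no single body atom contains all three variables T1, T2, T3. Taken on its own as a plain Datalog rule (no existential head), it can still be evaluated by a naive fixpoint. The sketch below shows only that; the function `transitive_closure` and the sample node names are assumptions for illustration, and it says nothing about how such a rule interacts with existential rules.

```python
def transitive_closure(pairs):
    """Naive fixpoint for: containedin(A,B), containedin(B,C) -> containedin(A,C)."""
    closure = set(pairs)
    while True:
        derived = {
            (a, d)
            for (a, b) in closure
            for (c, d) in closure
            if b == c
        } - closure
        if not derived:  # nothing new: fixpoint reached
            return closure
        closure |= derived


containedin = {("cell", "row"), ("row", "table"), ("table", "div")}
closure = transitive_closure(containedin)
print(("cell", "div") in closure)  # True
```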
    26. 31. DL-Lite ⊆ Datalog±[⊥, ≠; Lin]. DL-Lite axioms (DL-Lite core, DL-Lite R, DL-Lite F) and their Datalog± translations:
        - Professor ⊑ ∃TeachesTo: Professor(x) → ∃y TeachesTo(x,y)
        - ∃TeachesTo⁻ ⊑ Student: TeachesTo(x,y) → Student(y)
        - HasTutor⁻ ⊑ TeachesTo: HasTutor(x,y) → TeachesTo(y,x)
        - funct(HasTutor): HasTutor(x,y) & HasTutor(x,y') & Neq(y,y') → ⊥ (always innocuous!)
        - Professor ⊑ ¬Student: Professor(x) & Student(x) → ⊥
    28. 33. Low & Intermediate Level Annotation. We will use various existing tools and techniques (rather than re-invent the wheel):
        - Named entity recognizers
        - Machine learning
        - Computational linguistics
        - Page layout analysis
        - PDF extraction
    29. 37. Extraction from PDF (Tamir Hassan)
    31. 39. Navigation & Interaction
    33. 43. OXPath
        - Extension of XPath
        - Facilitates querying web forms and retrieving the returned data
        - Simulates a user filling out web forms
        - Highly parallelizable (geared towards cloud computing)
        - Navigates and collects data across multiple pages
    34. 44. Result Extraction. .... /next-field::*/ {“Renting”} / ... / {...} /.../ {“Submit”} /<XQ>. Atomic results regardless of presentation (list, table, etc.)
    35. 45. Result Extraction. <XQ>:
        For each atomic result A
        Let price = A/.../.../text()
            description = A/.../.../../text()
            type = A/.../.../../text()
            ........
        Return
            <rental area="Oxford">
              <price> 1,200 </price>
              <bedrooms> 3 </bedrooms>
              <bathrooms> 1 </bathrooms>
              <type> Flat </type>
              <location> George Street,OX1 </location>
              <description> ... </description>
              <otherInfo> Furnished; Long let - more than six months </otherInfo>
              ...
            </rental>
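The "for each atomic result, return a flat record" pattern can be mimicked outside XQuery. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes the atomic results have already been extracted into dictionaries; the function `rentals_to_xml`, the wrapping `<results>` root element, and the sample values are assumptions for illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET


def rentals_to_xml(area, atomic_results, fields):
    """Emit one flat <rental> element per atomic result."""
    root = ET.Element("results")  # hypothetical wrapper element
    for result in atomic_results:
        # The form input (here: the area) is kept as an attribute value.
        rental = ET.SubElement(root, "rental", {"area": area})
        for name in fields:
            if name in result:  # skip fields missing from this result
                ET.SubElement(rental, name).text = result[name]
    return ET.tostring(root, encoding="unicode")


xml_text = rentals_to_xml(
    "Oxford",
    [{"price": "1,200", "bedrooms": "3"}],
    fields=["price", "bedrooms", "bathrooms"],
)
print(xml_text)
# <results><rental area="Oxford"><price>1,200</price><bedrooms>3</bedrooms></rental></results>
```

Each record stays flat (no nested structure), and the field list plays the role of the prescribed schema.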