Table Recognition


  1. The DIADEM Ontology
     DIADEM 1.0
     Yiyang Bao (2), Xiaonan Guo (2), Giorgio Orsi (1,2), Christian Schallhart (2), Cheng Wang (2)
     (1) Institute for the Future of Computing, University of Oxford
     (2) Department of Computer Science, University of Oxford
  2. The languages of the web
     • HTML objects provide the data model of a web page.
     • CSS boxes and properties provide the layout.
     • JavaScript provides web dynamics.
     • … ?
     • RDF annotations provide the conceptualization of the domain.
     [Figure: "Real World" and "Web", illustrated by an HTML skeleton (<html>, <head>, <body>, <title>, <div> … </div>), the JavaScript snippet this.value.toLowerCase(); and the RDF terms ox:Property, ox:address, xsd:string]
  3. Why ontology?
     • Ontologies provide a conceptualization of a domain of interest (Gruber ’93).
     • But… we do not only want to model the application domain.
     • We must model the domain of its web representations, i.e., its phenomenology.
     • In the end, it is also an ontology.
     [Figure: RDF terms ox:Property, ox:address, ox:minPrice, ox:partOf, ox:priceSegment, xsd:string]
  4. Why ontology?
     • Can be used to complete an incomplete model.
     • Can be used to verify a model.
     • Must tolerate uncertainty and inconsistency.
  5. A logical model for web extraction
     • Logical model for web entities
         • input and refinement forms
         • result pages
         • page blocks (e.g., ads)
         • …
     • Phenomenological model
         • how logical entities are concretely represented
  6. The building blocks
     • HTML entities
         • labels
         • fields (including links)
         • text nodes and text attributes
     Example (form):
         <form>
           <label for="male">Male</label>
           <input type="radio" name="sex" id="male" />
           <label for="female">Female</label>
           <input type="radio" name="sex" id="female" />
         </form>
     Example (text nodes):
         <div> <span> Price: </span> <span> £ 250 </span> </div>
         rendered as: Price: £ 250
     • Logical entities
         • constructs of our data model
     • Rules
         • describe the phenomenology
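The label and field entities in the form example above can be paired mechanically. The following is a minimal sketch using Python's standard html.parser, assuming labels are tied to fields through matching for/id attributes; the names LabelFieldCollector and pair_labels_with_fields are hypothetical and not part of DIADEM.

```python
from html.parser import HTMLParser


class LabelFieldCollector(HTMLParser):
    """Hypothetical helper: collects <label for=...> texts and <input id=...> fields."""

    def __init__(self):
        super().__init__()
        self.labels = {}          # field id -> label text
        self.fields = {}          # field id -> attribute dict of the <input>
        self._current_for = None  # id targeted by the label currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and "for" in attrs:
            self._current_for = attrs["for"]
            self.labels[self._current_for] = ""
        elif tag == "input" and "id" in attrs:
            self.fields[attrs["id"]] = attrs

    def handle_data(self, data):
        # Text is only recorded while a <label> element is open.
        if self._current_for is not None:
            self.labels[self._current_for] += data.strip()

    def handle_endtag(self, tag):
        if tag == "label":
            self._current_for = None


def pair_labels_with_fields(html):
    """Return {label text: field attributes} for labels whose target field exists."""
    collector = LabelFieldCollector()
    collector.feed(html)
    return {
        collector.labels[field_id]: collector.fields[field_id]
        for field_id in collector.labels
        if field_id in collector.fields
    }


FORM = """<form>
<label for="male">Male</label> <input type="radio" name="sex" id="male" />
<label for="female">Female</label> <input type="radio" name="sex" id="female" />
</form>"""

pairs = pair_labels_with_fields(FORM)
```

Real pages often omit the for attribute, which is presumably why the deck relies on annotations and visual heuristics rather than markup alone.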
  7. The form model
     • Goal: model web form phenomenology
  8. The form model
     • Areas:
         • button
         • location
         • price
         • room
         • type
         • buy/rent
         • order-by
         • display
     • Root entity:
         • RealEstateForm
     • Properties:
         • partOf → hierarchical structures
  9. The form model: elements
     • price
         • type = {min, max}
         • purpose = {buy, rent}
     • currency
     • room
         • category = {bathroom, bedroom, …}
         • type = {min, max}
  10. The form model: elements
     • display
     • per page
     • add-in-time
         • property type
     • button
         • submit
         • reset
         • map search
         • advance submit
         • link button
     • order-by
     • buy
     • rent
     • buy/rent
     • new/resale
     • SSTC
     • other
  11. The form model: phenomenology
     • Based on linguistic annotations and (visual) heuristics.

         buyElement(X,F) :- visibleField(X),
             hasAnnotationFeature(X,"majorType","reform.label"),
             hasAnnotationFeature(X,"minorType","buy"),
             not hasAnnotationFeature(X,"minorType","rent"),
             not hasAnnotationFeature(X,"minorType","includeSSTC"),
             group(Ns,_,_,F), #member(X,Ns).

         radiusElement(X,F) :- visibleField(X),
             hasAnnotationFeature(X,"majorType","reform.label"),
             hasAnnotationFeature(X,"minorType","radius"),
             group(Ns,_,_,F), #member(X,Ns).
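To make the reading of the buyElement rule concrete, here is a rough Python sketch that evaluates the same conditions over in-memory facts. It is an illustration only, not the DIADEM implementation: the fact encodings (a set of visible fields, a set of (node, feature, value) annotation triples, and groups reduced to (members, form) pairs, dropping the two anonymous arguments of group) and all sample identifiers are assumptions.

```python
def has_feature(annotations, node, feature, value):
    """True if the annotation triple (node, feature, value) is among the facts."""
    return (node, feature, value) in annotations


def buy_elements(visible_fields, annotations, groups):
    """Return the set of (X, F) pairs satisfying the buyElement rule body."""
    result = set()
    for members, form in groups:      # group(Ns,_,_,F)
        for x in members:             # #member(X,Ns)
            if (
                x in visible_fields   # visibleField(X)
                and has_feature(annotations, x, "majorType", "reform.label")
                and has_feature(annotations, x, "minorType", "buy")
                # Negated literals: X must not also be annotated as rent or includeSSTC.
                and not has_feature(annotations, x, "minorType", "rent")
                and not has_feature(annotations, x, "minorType", "includeSSTC")
            ):
                result.add((x, form))
    return result


# Hypothetical facts: f1 is a plain "buy" field, f2 is ambiguous (buy and rent),
# f3 is annotated as "buy" but is not visible.
visible = {"f1", "f2"}
annotations = {
    ("f1", "majorType", "reform.label"), ("f1", "minorType", "buy"),
    ("f2", "majorType", "reform.label"), ("f2", "minorType", "buy"),
    ("f2", "minorType", "rent"),
    ("f3", "majorType", "reform.label"), ("f3", "minorType", "buy"),
}
groups = [(["f1", "f2", "f3"], "form1")]

found = buy_elements(visible, annotations, groups)
```

With these sample facts only f1 qualifies: f2 is excluded by the negated rent literal and f3 by the visibility check.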
  12. The form model: segments
     • A segment is:
         • a single element
         • a group of elements
         • a group of segments
         • a pair <segment, label>
     • Segments
         • buttons
         • geographic
         • price
         • room
         • property type
         • buy/rent
         • order-by
         • display
         • per page
         • add in time
         • new/resale
         • SSTC
     • Form
         • real-estate
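The recursive definition of a segment above maps naturally onto a tree-shaped data type. The sketch below is one possible encoding, assuming a segment holds a list of children (elements or further segments) plus an optional label; the class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Element:
    """A single form element, e.g. a minimum-price field."""
    kind: str


@dataclass
class Segment:
    """A group of elements and/or segments, optionally paired with a label."""
    kind: str
    children: List[Union["Element", "Segment"]] = field(default_factory=list)
    label: Optional[str] = None


def elements_of(node):
    """Flatten a segment tree into its elements, in document order."""
    if isinstance(node, Element):
        return [node]
    collected = []
    for child in node.children:
        collected.extend(elements_of(child))
    return collected


# A hypothetical real-estate form: a labelled price segment and a buttons segment.
price = Segment("price", [Element("min price"), Element("max price")], label="Price")
buttons = Segment("buttons", [Element("submit")])
form = Segment("real-estate", [price, buttons])

kinds = [e.kind for e in elements_of(form)]
```

Keeping the label on the segment rather than on each element mirrors the <segment, label> pair from the slide.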
  13. The result-page model
     • Goal: model result-page phenomenology
  14. The result-page model
     • Attributes and values
         • e.g., <price, £250,000>
     • Record
         • groups of pairs <attribute, value>
     • Data area
         • groups of records
     • Mandatory attribute(s)
         • must be present in a record
         • sanity check purposes
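The mandatory-attribute sanity check can be sketched as a simple filter over a data area. This is an illustrative sketch only, assuming a record is a dictionary of attribute/value pairs and a data area is a list of records; the function name valid_records and the sample values other than the price from the slide are hypothetical.

```python
def valid_records(data_area, mandatory=("price",)):
    """Keep only records in which every mandatory attribute has a non-empty value."""
    return [
        record
        for record in data_area
        if all(record.get(attribute) for attribute in mandatory)
    ]


# A hypothetical data area: the second candidate record lacks a price,
# so it is likely not a genuine result record (e.g. an ad block).
data_area = [
    {"price": "£250,000", "location": "Oxford"},
    {"location": "Oxford"},
    {"price": "", "location": "Oxford"},
]

kept = valid_records(data_area)
```

Which attributes count as mandatory is a modelling decision; price is used here only because it is the attribute shown on the slide.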
  15. A Conceptual Model for Data Extraction
     • Conceptual Modelling on the Web
         • Software modelling, e.g., UML and stereotypes
         • Ad hoc languages, e.g., WebML
  16. Linking the domain ontology: OntoX
  17. DIADEM Ontology: discussion
     • Expressive power
         • safe nr-datalog with stratified negation and aggregation
         • pros: easy to compute
         • cons: not robust to uncertainty and inconsistencies
     • Adaptability
         • result-page model is substantially domain independent
         • form model is domain dependent (entity types)
             • the number of entities is limited
  18. Uncertainty, Vagueness and Inconsistencies
  19. Uncertainty, Vagueness and Inconsistencies
     • Origin
         • annotations are noisy
         • entity types are uncertain
     • Multiple models
         • probabilistic models
             • Markov Logic Networks (Lukasiewicz and Simari)
             • C-tables, Bayesian Networks (Olteanu)
         • ASP
             • disjunctive models
             • weak constraints
  20. Thank you!
