Web Data Extraction Como2010
  • The XQuery statement (FLWOR expression) outputs a single XML document per result page. It produces a flat (i.e. no hierarchy) XML document per some prescribed schema. Key intuition: the notion of an atomic result, regardless of presentation (list, table, etc.). Atomic results are analogous to RDBMS query returns (attributes form tuples); field inputs are preserved as element attribute values.
  • Transcript

    • 1. DIADEM A Short Overview Georg Gottlob
    • 2.  
    • 3. Web data extraction WEB HTML pages layout Corporate edp apps structured data, Databases, XML WRAPPER Goal: Make web contents accessible to electronic data processing Wrappers: HTML → select → extract → annotate → XML
    • 4.  
    • 5. MSO Monadic Datalog Elog Lixto Visual Wrapper   =  Suite Logic Database theory DB programming Application design
    • 6. Lixto Visual Developer (VD) Navigation Steps Mozilla Web Browser Extraction Configuration
    • 7.  
    • 8. . .
    • 9. Need for Automatic Extraction Technology Example: Real Estate UK: 17,000 sites. Many not covered by aggregators. We do have a list of all homepages (Yellow Pages UK). Manual or semi-automatic wrapping is too expensive: wrapper construction, testing, keeping track of changes. No tool or method can do it fully automatically. Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
    • 10. Need for Automatic Extraction Technology (2) All search engine providers need it! Many work on it. Keywords: vertical search, object search, semantic search. Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”. Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”
    • 11. The Blackbox we want to construct BLACKBOX Application domain with thousands of websites URL Application relevant Structured data (XML or RDF) To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
    • 12. Real estate Restaurants Relationship to SeCo & Webdam Q: Find apartments in Milan whose prices are ≤ average in quarters where restaurant quality > average. Results Web service A Web service
    • 13. How to achieve it?
      • Rationale: Combine existing and new “low level” annotators with “high level” AI and reasoning.
      • Low level annotators:
      • - Bottom-up page analysis.
      • - ML-based entity recognizers
      • - NLP & ontological text annotation
      • - Web page classification & analysis
      • - Basic link analysis
    • 14. High level reasoning: - Goal oriented - Conceptual domain objects. - Conceptual interaction elements - High-level object ontology - Domain knowledge
    • 15.  
    • 16. Bottom-up (low-level) annotation Monochromatic Rectangle Geographic search facility Postcode input field Active map … ISA ISA Occurs in Price search facility … … Occurs in … 105 105 127 [(02873,227) (03900,417)] Geo-Price-Searchbox ISA [(02873,227) (03900,417)]
    • 17. Top-down reasoning Property Search Facility Property List Single Property Description Specially highlighted property part-of m 1
    • 18. Bottom-up processing Top-down reasoning Monochromatic Rectangle Geographic search facility Postcode input field Active map … ISA ISA Occurs in Price search facility … … Occurs in … 105 105 127 [(02873,227) (03900,417)] Property Search Facility Property List Single Property Description Geo-Price-Searchbox ISA [(02873,227) (03900,417)] Specially highlighted property Phenomenology part-of m 1
    • 19. Datalog for Web-Object Reasoning
      • table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
      • goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
      • goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
      • If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
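The three rules on this slide can be evaluated bottom-up in a few lines. The following is a minimal sketch, not DIADEM's implementation: the table identifiers (`t_outer`, `t_inner`, `t_other`) and the fact sets are hypothetical, and the third rule is read with negation, as the English gloss on the slide states.

```python
# Toy facts; all table identifiers are hypothetical.
tables = {"t_outer", "t_inner", "t_other"}

# occurs_in(T, X): field X occurs somewhere inside table T.
occurs_in = {
    ("t_outer", "areaselection"), ("t_outer", "priceselection"),
    ("t_inner", "areaselection"), ("t_inner", "priceselection"),
    ("t_other", "areaselection"),
}

# child(Parent, T): table T is nested directly inside Parent.
child = {("t_outer", "t_inner")}


def property_search_masks(tables, occurs_in, child):
    # Rule 1: table(T) & occurs_in(T, areaselection)
    #         & occurs_in(T, priceselection) -> goodtable(T)
    good = {
        t for t in tables
        if (t, "areaselection") in occurs_in
        and (t, "priceselection") in occurs_in
    }
    # Rule 2: goodtable(T) & child(Parent, T) -> containsgoodtable(Parent)
    contains_good = {parent for (parent, t) in child if t in good}
    # Rule 3: goodtable(T) & not containsgoodtable(T) -> propertysearchmask(T)
    return good - contains_good


masks = property_search_masks(tables, occurs_in, child)
print(masks)  # {'t_inner'}
```

Both `t_outer` and `t_inner` are good tables, but only the innermost one is reported, which matches the "not simultaneously contained in a smaller table" condition.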
    • 20. Crucial steps
      • WP1 data model (KRR model)
      • WP2 low & intermediate level annotation
      • WP3 High level ontology and Rules (top down)
      • + mapping HL to Int. Level: Phenomenology
      • WP4 Access, interaction, & navigation
      • WP5 Compilation; Learning XPath expressions
      • WP6 Highly parallel execution on clouds
      • WP7 General methodology
    • 21.  
    • 22. Crucial steps
      • WP1 data model (KRR model)
      • WP2 low & intermediate level annotation
      • WP3 High level ontology and Rules (top down)
      • + mapping HL to Int. Level: Phenomenology
      • WP4 Access, interaction, & navigation
      • WP5 Compilation; Learning XPath expressions
      • WP6 Highly parallel execution on clouds
      • WP7 General methodology
    • 23. The Data Model
      • Datalog is good but does not suffice.
      • On top of it:
      • Need for object creation
      • Need for ontological reasoning
      • Need for probabilistic reasoning
      • Need for default reasoning
    • 24. Object creation in Datalog+ table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). PRODUCT Toshiba Protégé cx Dell 25416 Dell 23233 Acer 78987 PRICE 480 360 470 390
    • 25. Object creation in Datalog+ table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). PRODUCT Toshiba Protégé cx Dell 25416 Dell 23233 Acer 78987 PRICE 480 360 470 390 T1 T2
    • 26. Object creation in Datalog+ table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). PRODUCT Toshiba Protégé cx Dell 25416 Dell 23233 Acer 78987 PRICE 480 360 470 390 PRICE 480 360 470 390 T1 T2
    • 27. Object creation in Datalog+ table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). Deduction in Datalog+ undecidable (TGDs)
    • 28. Object creation in Datalog+ table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)). Deduction in Datalog+ undecidable (TGDs) Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
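The object-creation rule has an existentially quantified variable in its head, so firing it means inventing a new object. A minimal sketch of one application round follows, assuming a simple `box1`, `box2`, ... naming scheme for the invented objects and hypothetical fact names (`product_col`, `price_col`, `ad_banner`); it illustrates the idea and is not the project's engine.

```python
from itertools import count


def create_tableboxes(tables, same_color, neighbour_right):
    """Apply once, for every match, the rule
    table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2)
        -> exists X (tablebox(X) & contains(X,T1) & contains(X,T2)).
    """
    fresh = count(1)  # source of new object identifiers (hypothetical scheme)
    tablebox, contains = set(), set()
    for (t1, t2) in sorted(neighbour_right):
        if t1 in tables and t2 in tables and (t1, t2) in same_color:
            x = f"box{next(fresh)}"  # the invented witness for "exists X"
            tablebox.add(x)
            contains.update({(x, t1), (x, t2)})
    return tablebox, contains


# Hypothetical facts: a PRODUCT column and a PRICE column side by side.
tables = {"product_col", "price_col", "ad_banner"}
same_color = {("product_col", "price_col")}
neighbour_right = {("product_col", "price_col"), ("price_col", "ad_banner")}

tablebox, contains = create_tableboxes(tables, same_color, neighbour_right)
print(tablebox)  # {'box1'}
```

The sketch fires the rule once per match. A full chase would also have to decide when a suitable object already exists, which is where the decidability questions mentioned on the slide come in.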
    • 29. Datalog±: family of languages. Incorporates ontological reasoning (>DL-LITE). Further research needed for extending it so as to be an ideal language for web objects. Transitivity: containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)
    • 30. Datalog±: family of languages. Incorporates ontological reasoning (>DL-LITE). Further research needed for extending it so as to be an ideal language for web objects. Transitivity: containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3) unguarded!
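A rule body is guarded when one of its atoms contains every variable that occurs in the body. The sketch below checks that condition and also computes the transitive closure by naive fixpoint iteration. It assumes every argument in a body atom is a variable, and the element names (`cell`, `row`, `table`, `page`) are hypothetical.

```python
def body_variables(body):
    """Collect all variables of a rule body given as (predicate, args) atoms."""
    return {v for (_pred, args) in body for v in args}


def is_guarded(body):
    """True if some single body atom contains all variables of the body."""
    all_vars = body_variables(body)
    return any(set(args) >= all_vars for (_pred, args) in body)


def transitive_closure(pairs):
    """Naive bottom-up evaluation of
    containedin(T1,T2), containedin(T2,T3) -> containedin(T1,T3)."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:  # nothing new: fixpoint reached
            return closure
        closure |= derived


transitivity_body = [
    ("containedin", ("T1", "T2")),
    ("containedin", ("T2", "T3")),
]
tablebox_body = [
    ("table", ("T1",)),
    ("table", ("T2",)),
    ("sameColor", ("T1", "T2")),
    ("isNeighbourRight", ("T1", "T2")),
]

print(is_guarded(transitivity_body))  # False
print(is_guarded(tablebox_body))      # True

facts = {("cell", "row"), ("row", "table"), ("table", "page")}
closure = transitive_closure(facts)
```

Transitivity on its own is plain Datalog and terminates. The difficulty the slide points at arises when such unguarded rules are combined with rules that create new objects.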
    • 31. DL-LITE: DL-Lite axioms and the corresponding rules in Datalog[ , ;Lin]
      • Professor ⊑ ∃TeachesTo: Professor(x) → ∃y TeachesTo(x,y)
      • ∃TeachesTo⁻ ⊑ Student: TeachesTo(x,y) → Student(y)
      • HasTutor⁻ ⊑ TeachesTo: HasTutor(x,y) → TeachesTo(y,x)
      • funct(HasTutor): HasTutor(x,y) & HasTutor(x,y’) & Neq(y,y’) → ⊥ (always innocuous!)
      • Professor ⊑ ¬Student: Professor(x) & Student(x) → ⊥
      • Fragments shown: DL-Lite core, DL-Lite R, DL-Lite F
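To make the rule reading of these axioms concrete, here is a small sketch that applies the two existential-free rules and then checks the two constraints (functionality of HasTutor, disjointness of Professor and Student). The individuals (`ann`, `bob`, `carl`) are hypothetical, and the existential rule for Professor is deliberately left out to keep the example finite.

```python
def derive(has_tutor, teaches_to, student):
    """Apply HasTutor(x,y) -> TeachesTo(y,x) and TeachesTo(x,y) -> Student(y)."""
    teaches_to = set(teaches_to) | {(y, x) for (x, y) in has_tutor}
    student = set(student) | {y for (_x, y) in teaches_to}
    return teaches_to, student


def violations(has_tutor, professor, student):
    """Report violated constraints as tuples."""
    found = []
    # funct(HasTutor): HasTutor(x,y) & HasTutor(x,y') & Neq(y,y') -> false
    for (x, y) in sorted(has_tutor):
        for (x2, y2) in sorted(has_tutor):
            if x == x2 and y < y2:  # y < y2 reports each unordered pair once
                found.append(("funct(HasTutor)", x, y, y2))
    # Professor and Student are disjoint:
    # Professor(x) & Student(x) -> false
    for x in sorted(professor & student):
        found.append(("disjoint(Professor,Student)", x))
    return found


# Hypothetical data: ann has two tutors, which breaks functionality.
has_tutor = {("ann", "bob"), ("ann", "carl")}
professor = {"bob", "carl"}

teaches_to, student = derive(has_tutor, set(), set())
problems = violations(has_tutor, professor, student)
print(problems)  # [('funct(HasTutor)', 'ann', 'bob', 'carl')]
```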
    • 32. Crucial steps
      • WP1 data model (KRR model)
      • WP2 low & intermediate level annotation
      • WP3 High level ontology and Rules (top down)
      • + mapping HL to Int. Level: Phenomenology
      • WP4 Access, interaction, & navigation
      • WP5 Compilation; Learning XPath expressions
      • WP6 Highly parallel execution on clouds
      • WP7 General methodology
    • 33. We will use various existing tools and techniques (rather than re-invent the wheel) Low & Intermediate Level Annotation
      • Named entity recognizers
      • Machine learning
      • Computational linguistics
      • Page layout analysis
      • PDF- Extraction
    • 34.  
    • 35.  
    • 36.  
    • 37. Extraction from PDF Tamir Hassan
    • 38. Crucial steps
      • WP1 data model (KRR model)
      • WP2 low & intermediate level annotation
      • WP3 High level ontology and Rules (top down)
      • + mapping HL to Int. Level: Phenomenology
      • WP4 Access, interaction, & navigation
      • WP5 Compilation; Learning XPath expressions
      • WP6 Highly parallel execution on clouds
      • WP7 General methodology
    • 39. Navigation & Interaction
    • 40. Crucial steps
      • WP1 data model (KRR model)
      • WP2 low & intermediate level annotation
      • WP3 High level ontology and Rules (top down)
      • + mapping HL to Int. Level: Phenomenology
      • WP4 Access, interaction, & navigation
      • WP5 Compilation; Learning XPath expressions
      • WP6 Highly parallel execution on clouds
      • WP7 General methodology
    • 41.  
    • 42.  
    • 43. OXPath
      • Extension of XPath
      • Facilitates querying web forms and retrieving the returned data
      • Simulates a user filling out web forms
      • Highly parallelizable (geared towards cloud computing)
      • Navigation and collecting data across multiple pages
    • 44. Result Extraction
      • .... /next-field::*/ {“Renting”} / ... / {...} /.../ {“Submit”}
      /<XQ> Atomic results regardless of presentation (list, table, etc.)
    • 45. Result Extraction
      • <XQ>: For each atomic result A
      • Let price = A/.../.../text(), description = A/.../.../../text(), type = A/.../.../../text(), ........
      • Return <rental area=Oxford> <price> 1,200 </price> <bedrooms> 3 </bedrooms> <bathrooms> 1 </bathrooms> <type> Flat </type> <location> George Street,OX1 </location> <description> ... </description> <otherInfo> Furnished; Long let - more than six months </otherInfo> ... </rental>
      • Fields marked on the result page: price, description, Type, OtherInfo, Bathrooms, location
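The "for each atomic result, bind fields, return a flat record" pattern on this slide can be imitated with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions, not the OXPath/XQuery machinery itself: the page markup, the class names, the paths in `FIELDS`, and the second listing are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page; markup and class names are invented.
PAGE = """
<html><body>
  <div class="result">
    <span class="price">1,200</span>
    <span class="type">Flat</span>
    <span class="location">George Street,OX1</span>
  </div>
  <div class="result">
    <span class="price">950</span>
    <span class="type">House</span>
    <span class="location">Example Road</span>
  </div>
</body></html>
"""

# One relative path per output field (the "Let" part of the slide).
FIELDS = {
    "price": ".//span[@class='price']",
    "type": ".//span[@class='type']",
    "location": ".//span[@class='location']",
}


def extract_rentals(page, area):
    """Build a flat XML document with one <rental> element per atomic result."""
    root = ET.fromstring(page.strip())
    out = ET.Element("rentals")
    for atomic in root.iterfind(".//div[@class='result']"):
        # The form input (the search area) is kept as an attribute value.
        rental = ET.SubElement(out, "rental", area=area)
        for name, path in FIELDS.items():
            node = atomic.find(path)
            if node is not None and node.text:
                ET.SubElement(rental, name).text = node.text.strip()
    return out


doc = extract_rentals(PAGE, "Oxford")
print(ET.tostring(doc, encoding="unicode"))
```

Each `<rental>` holds only leaf fields, so the output stays flat regardless of whether the page presented the results as a list or a table.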
    • 46. Crucial steps
      • WP1 data model (KRR model)
      • WP2 low & intermediate level annotation
      • WP3 High level ontology and Rules (top down)
      • + mapping HL to Int. Level: Phenomenology
      • WP4 Access, interaction, & navigation
      • WP5 Compilation; Learning XPath expressions
      • WP6 Highly parallel execution on clouds
      • WP7 General methodology
