CS-690 Fall-2009
Querying Capability Modeling and Construction of Deep Web Sources
Paper by: Liangcai Shu, Weiyi Meng, Hai He, and Clement Yu
Presentation by: Fabian Alenius
Problem definition
The Deep Web is the information hidden behind HTML forms
Information in the Deep Web is 400-550 times larger than the surface web
Crawlers need to extract information
Problem definition
Task: Build a model of the query interfaces of Deep Web sources, i.e. what types of form submissions are accepted
A form submission consists of a set of values, one for every field of the form
Querying Capability
Querying capability specifies what queries are valid
An attribute is a field in a form
A query is a set of attributes
A query instance is a set of attribute-value pairs, e.g. {<Author, Tolkien>, <Title, Bilbo>}
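The definitions above can be sketched in Python. This is only an illustration: the aliases `Query` and `QueryInstance` and the helper `query_of` are hypothetical names that do not come from the paper, and treating an empty string as an unfilled field is an assumption.

```python
from typing import Dict, FrozenSet

# A query is a set of attributes (form fields).
Query = FrozenSet[str]
# A query instance maps each attribute to the value submitted for it.
QueryInstance = Dict[str, str]


def query_of(instance: QueryInstance) -> Query:
    """Return the set of attributes that the instance actually fills in.

    Assumption: an empty string means the field was left blank.
    """
    return frozenset(attr for attr, value in instance.items() if value)


# The example instance from the slide: {<Author, Tolkien>, <Title, Bilbo>}
instance: QueryInstance = {"Author": "Tolkien", "Title": "Bilbo"}
query = query_of(instance)
```

Here `query` is the attribute set {Author, Title}; many different instances share that same query.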
Four types of attributes
Functional Attribute
Range Attribute
Categorical Attribute
Value-Infinite Attribute
Conditional Dependency
Given attribute A and condition C
Conditional dependency – the value of C is insignificant if A is null
A -> C
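One way to read this rule is as a normalization step: when A is left empty, any value chosen for C can be dropped without changing the query. The sketch below assumes that reading. The function `normalize` and the field names `Title` and `TitleMatch` are hypothetical and used only for illustration.

```python
from typing import Dict


def normalize(instance: Dict[str, str], dependencies: Dict[str, str]) -> Dict[str, str]:
    """Drop condition values whose governing attribute is null.

    `dependencies` maps each condition field C to the attribute A it
    depends on. Assumption: a missing or empty value counts as null.
    """
    result = dict(instance)
    for condition, attribute in dependencies.items():
        if not result.get(attribute):
            # A is null, so the value of C is insignificant.
            result.pop(condition, None)
    return result


# Hypothetical form: "TitleMatch" (e.g. exact phrase) only matters if "Title" is filled in.
deps = {"TitleMatch": "Title"}
empty_title = normalize({"Title": "", "TitleMatch": "exact"}, deps)
filled_title = normalize({"Title": "Bilbo", "TitleMatch": "exact"}, deps)
```

With an empty `Title`, the `TitleMatch` entry is removed; with `Title` filled in, the instance is returned unchanged.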
Attribute Unions
Attribute A, value type T, and constraints C_1, ..., C_n together form the attribute union AU(A, T, C_1, ..., C_n)
T is not the attribute type but the value type (e.g. phone, zip code)
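As a rough data-structure sketch, an attribute union could be held in a small record type. The class name `AttributeUnion`, the method `form_fields`, and the example values are assumptions made for illustration; the paper may represent attribute unions differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AttributeUnion:
    """AU(A, T, C_1, ..., C_n): an attribute, its value type, and its constraints."""

    attribute: str                      # A
    value_type: str                     # T, e.g. "phone" or "zip code"
    constraints: Tuple[str, ...] = ()   # C_1, ..., C_n

    def form_fields(self) -> Tuple[str, ...]:
        """All form fields covered by this union: the attribute, then its constraints."""
        return (self.attribute,) + self.constraints


# Hypothetical example: a title text box with one constraint field.
title_union = AttributeUnion("Title", "text", ("TitleMatch",))
```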
Valid queries  – accepted;  Invalid queries  – rejected

Querying Capability Modeling

Editor's Notes

  • #6 Functional Attribute – only affects the display of results. Range Attribute – specifies a range, for example price or year. Categorical Attribute – a fixed set of distinct values, usually selection lists. Value-Infinite Attribute – an infinite number of possible values, typically text boxes. Value-Infinite attributes can be categorized into more types, like phone number, zip code, etc. Not addressed in this paper.
  • #10 Enough to fill in one field
  • #11 A query is deemed accepted even if it returns no results. The goal is to find the types of queries that are accepted.
  • #13 First extract the HTML form and output XML. Feed it into the Query Capability Analyzer. Submit queries and get the results. The output is a set of atomic queries.
  • #17 Recall is lower than precision. The method is conservative: if there is no strong evidence, the system treats the query as rejected.
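
The last note can be made concrete with the standard definitions of precision and recall. The sketch below is not the paper's evaluation code, and the query sets in it are invented purely to show the effect: a conservative system that only reports queries it is sure about keeps precision high while missing some accepted queries, which lowers recall.

```python
from typing import FrozenSet, Set, Tuple

Query = FrozenSet[str]


def precision_recall(predicted: Set[Query], actual: Set[Query]) -> Tuple[float, float]:
    """Precision and recall of the queries predicted as accepted."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented example: the source really accepts three queries,
# but the conservative system only reports the two it has strong evidence for.
actual = {
    frozenset({"Author"}),
    frozenset({"Title"}),
    frozenset({"Author", "Title"}),
}
predicted = {frozenset({"Author"}), frozenset({"Title"})}

precision, recall = precision_recall(predicted, actual)
```

In this toy case every reported query is correct, so precision is 1.0, while recall is 2/3 because one accepted query was treated as rejected.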