Querying Capability Modeling

Presentation for the CS 690 course at Purdue on "Querying Capability Modeling and Construction
of Deep Web Sources".

Published in: Technology
  • Functional Attribute – only affects the display of results
  • Range Attribute – specifies a range, for example price or year
  • Categorical Attribute – fixed set of distinct values, usually selection lists
  • Value-Infinite Attribute – infinite number of possible values, typically text boxes
  • Value-Infinite attributes can be categorized into more types, like phone number, zip code, etc. This is not addressed in this paper.
  • Enough to fill in one field
  • A query is deemed accepted even if it returns no results. The goal is to find the types of queries that are accepted.
  • First, extract the HTML form and output XML. Feed it into the Query Capability Analyzer. Submit queries and get the results. The output is a set of atomic queries.
  • Recall is lower than precision. The method is conservative: if there is no strong evidence, the system treats the query as rejected.

    1. CS-690 Fall 2009: Querying Capability Modeling and Construction of Deep Web Sources. Paper by: Liangcai Shu, Weiyi Meng, Hai He, and Clement Yu. Presentation by: Fabian Alenius
    2. Problem definition: The Deep Web is the information hidden behind HTML forms
    3. Information in the Deep Web is 400-550 times larger than the surface web
    4. Crawlers need to extract information
    6. Problem definition: Task: Build a model of the query interfaces of Deep Web sources, i.e. what types of form submissions are accepted
    7. A form submission consists of a set of values, one for every field of the form
    8. Querying Capability: Querying capability specifies what queries are valid
    9. An attribute is a field in a form
    10. A query is a set of attributes
    11. A query instance is a set of attribute-value pairs, e.g. {<Author, Tolkien>, <Title, Bilbo>}
    12. Four types of attributes: Functional Attribute, Range Attribute, Categorical Attribute, Value-Infinite Attribute
    13. Conditional Dependency: Given Attribute A and Condition C
    14. Conditional dependency – the value of C is insignificant if A is null
    15. A -> C
    16. Attribute Unions: Attribute A, Value type T, and Constraints C_1, ..., C_n together form the union AU(A, T, C_1, ..., C_n)
    17. T is not the attribute type, but the value type (e.g. phone, zip code, etc.)
    18. Valid queries – accepted; Invalid queries – rejected
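
The attribute types, attribute unions, and conditional dependency described above can be pictured as a small data model. The following is only an illustrative sketch: the class names, the `active_constraints` helper, and the `match_mode` constraint are hypothetical and do not come from the paper or the slides.

```python
from dataclasses import dataclass
from enum import Enum


class AttributeType(Enum):
    """The four attribute types named on the slides."""
    FUNCTIONAL = "functional"
    RANGE = "range"
    CATEGORICAL = "categorical"
    VALUE_INFINITE = "value-infinite"


@dataclass(frozen=True)
class Attribute:
    name: str
    kind: AttributeType


@dataclass(frozen=True)
class AttributeUnion:
    """AU(A, T, C_1, ..., C_n): attribute, value type, and constraints."""
    attribute: Attribute
    value_type: str          # e.g. "text", "phone", "zip code"
    constraints: tuple = ()  # names of constraint fields (hypothetical)

    def active_constraints(self, value):
        # Conditional dependency A -> C: constraints are insignificant
        # when the attribute itself has no value.
        if value is None or value == "":
            return ()
        return self.constraints


author = AttributeUnion(
    Attribute("Author", AttributeType.VALUE_INFINITE),
    value_type="text",
    constraints=("match_mode",),  # hypothetical constraint field
)
```

With this model, `author.active_constraints("Tolkien")` returns the constraint tuple, while `author.active_constraints(None)` returns an empty tuple.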
    19. Atomic Queries: Atomic Query – a minimal query consisting of Attribute Unions
    20. Formally: Given a valid query S = {AU_1, ..., AU_n}, S is atomic iff every query T ⊂ S is invalid
    21. Goal is to find Atomic Queries
    22. Assign a value only if the attribute is part of the query
    23. Atomic Queries Example: {<Keywords, Crime>}
    24. {<Author, Tolkien>}
    25. {<Title, Bilbo>}
    26. {<ISBN, 9780140285000>}
    27. {<Publisher, Penguin>}
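
The atomicity definition can be checked mechanically once there is some way to decide whether a query is valid. The sketch below assumes a hypothetical `is_valid` oracle and a made-up book source in which any single searchable field is enough, and it only looks at non-empty proper subsets. It illustrates the definition and is not code from the paper.

```python
from itertools import combinations


def is_atomic(query, is_valid):
    """Return True if `query` is valid and every non-empty proper subset is invalid."""
    query = frozenset(query)
    if not is_valid(query):
        return False
    for size in range(1, len(query)):
        for subset in combinations(query, size):
            if is_valid(frozenset(subset)):
                return False
    return True


# Hypothetical source: a query is valid when it fills at least one searchable field.
SEARCHABLE = {"Keywords", "Author", "Title", "ISBN", "Publisher"}


def toy_is_valid(query):
    return any(attribute in SEARCHABLE for attribute in query)
```

Under this toy oracle, `{"Author"}` is atomic, `{"Author", "Title"}` is valid but not atomic (either field alone already suffices), and `{"Format"}` is not atomic because it is not valid at all.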
    28. Querying Capability: Querying capability is described by a BNF grammar
    29. <Query> ::= <AtomicQuery> [<OptionalQuery>]
    30. 30.
    31. <CheckboxGroup> ::= <Checkbox> | ADJ(<Checkbox>, <CheckboxGroup>)
    32. Used to determine if a Query is valid
    33. Interpretation tree example
    34. System overview
    35. Searching algorithm: Brute force is too inefficient
    36. Exploit the proper subset property
    37. Start with the empty set and add attributes, stop at acceptance
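
One way to read the three points above is as a level-wise search: try small attribute sets first, and never probe a superset of a query that has already been accepted, since such a superset cannot be atomic. The sketch below is an interpretation under that assumption, not the paper's actual algorithm. The `is_accepted` oracle stands in for submitting the form and classifying the result page, and the toy source is hypothetical.

```python
from itertools import combinations


def find_atomic_queries(attributes, is_accepted):
    """Level-wise search for atomic queries, smallest attribute sets first."""
    atomic = []
    ordered = sorted(attributes)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            candidate = frozenset(combo)
            # Proper subset property: a superset of a known atomic query
            # has a valid proper subset, so it cannot be atomic. Skip it
            # without submitting anything to the source.
            if any(found <= candidate for found in atomic):
                continue
            if is_accepted(candidate):
                atomic.append(candidate)
    return atomic


# Hypothetical source: accepts ISBN alone, or Author and Title together.
def toy_is_accepted(query):
    return "ISBN" in query or {"Author", "Title"} <= query


found = find_atomic_queries({"Author", "Title", "ISBN", "Format"}, toy_is_accepted)
```

Because every smaller set is either probed and rejected or skipped as a superset of an atomic query, any candidate that is accepted at its level has only invalid proper subsets, which matches the definition of an atomic query.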
    38. Result Page Classification: Classify based on features
    39. Acceptance indicators: sequence numbers (1, 2, 3); large result page size; patterns such as "showing 1 – 10 of 999 results"
    40. Rejection indicators: error messages such as "no valid search words"; similarity to the original search interface; small result page size
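
The indicators listed above could be combined in a simple voting scheme. The sketch below is only a guess at how such features might be scored: the thresholds, the second error phrase, and the function names are assumptions, and the paper's actual classifier may work differently. Ties are resolved as "rejected", in line with the conservative behavior mentioned in the notes.

```python
import re
from difflib import SequenceMatcher

# Patterns such as "showing 1 - 10 of 999 results" (hyphen or en dash).
RESULT_RANGE = re.compile(r"\b\d+\s*[-–]\s*\d+\s+of\s+[\d,]+", re.IGNORECASE)
# Lines that start with an item number such as "1." or "2)".
ITEM_NUMBER = re.compile(r"^\s*(\d+)[.)]\s", re.MULTILINE)
# The first phrase is from the slides; the second is a hypothetical addition.
ERROR_PHRASES = ("no valid search words", "please enter a search term")


def has_sequence_numbers(text):
    numbers = [int(n) for n in ITEM_NUMBER.findall(text)]
    return numbers[:3] == [1, 2, 3]


def classify_result_page(page_text, form_text, min_size=2000, max_similarity=0.8):
    """Return "accepted" or "rejected" from simple feature votes."""
    votes = 0
    if RESULT_RANGE.search(page_text):
        votes += 1
    if has_sequence_numbers(page_text):
        votes += 1
    # Page size counts for acceptance when large, against it when small.
    votes += 1 if len(page_text) >= min_size else -1
    if any(phrase in page_text.lower() for phrase in ERROR_PHRASES):
        votes -= 1
    # A page that looks like the original search form suggests rejection.
    if SequenceMatcher(None, page_text, form_text).ratio() >= max_similarity:
        votes -= 1
    # Conservative: without strong evidence, treat the query as rejected.
    return "accepted" if votes > 0 else "rejected"
```

The `min_size` and `max_similarity` defaults are placeholders; in practice they would need tuning per source.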
    44. Experiments: Used two datasets from three domains: books, music and jobs
    45. TEL-I dataset
    46. Total of 64 sources
    47. Result
    48. Future work: Add value type identification
    49. Bigger dataset
    50. Q&A: Questions?