Our Visual Approach
• Mimic human intuition
• Make use of the common sources of evidence on displayed pages that humans use, including
– Structural regularity
– Visual and content similarity between data records
Previous Approaches Need to Identify the Data Rich Section
Pitfalls:
How to identify the Data Rich Section (DRS)?
The DRS does not contain all the records
The DRS contains noise as well as records
Our Approach
• We find records, not the Data Rich Section
• Extract data records individually on displayed query result pages, while excluding noise items
• Records may be laid out in a grid or a column
• Use clustering algorithms and a set of similarity measures to:
– Identify records
– Exclude noise
Selecting Other Candidate Containers
Filter the set of all container blocks on the page (blue blocks): discard blocks that don't match the width of any candidate container block (orange blocks).
Cluster the remaining blocks by width.
Why width? Web pages are designed for vertical, not horizontal, scrolling.
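The filter-then-cluster step can be illustrated with a minimal Python sketch. It is not the implementation behind these slides: the `Block` type, the `cluster_by_width` function, and the `tolerance` parameter are hypothetical names introduced here, and exact width matching (`tolerance=0`) is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered page block; the fields are illustrative assumptions."""
    name: str
    x: int
    y: int
    width: int
    height: int


def cluster_by_width(containers, candidates, tolerance=0):
    """Discard containers whose width matches no candidate block,
    then group the survivors by the candidate width they match."""
    candidate_widths = sorted({c.width for c in candidates})
    clusters = defaultdict(list)
    for block in containers:
        for width in candidate_widths:
            if abs(block.width - width) <= tolerance:
                clusters[width].append(block)
                break  # each block joins at most one cluster
    return dict(clusters)


# Made-up page layout: three record-like blocks plus a header and an ad.
containers = [
    Block("header", 0, 0, 960, 80),
    Block("r1", 0, 100, 300, 200),
    Block("r2", 320, 100, 300, 200),
    Block("ad", 0, 320, 728, 90),
    Block("r3", 0, 420, 300, 200),
]
candidates = [Block("candidate", 0, 100, 300, 200)]

clusters = cluster_by_width(containers, candidates)
for width, blocks in clusters.items():
    print(width, [b.name for b in blocks])
```

In this made-up layout the header and the ad are discarded because their widths match no candidate block, which leaves a single cluster of the three 300-pixel blocks.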
Selecting Record Containers
Block content similarity measure
Block A – Candidate record block (orange)
Block B – Container block (blue) with the same width as A
The cluster with the maximum number of similar blocks is the winner!
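The winner selection can be sketched as follows. The slides do not define the block content similarity measure, so this example swaps in token-level Jaccard similarity as a stand-in; the function names, the `threshold` value, and the sample texts are all assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for the (unspecified)
    block content similarity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pick_record_cluster(clusters, threshold=0.4):
    """clusters maps a width to (text of candidate block A,
    texts of the same-width container blocks B).
    Returns the width whose cluster has the most blocks similar to A."""
    best_width, best_similar = None, []
    for width, (candidate_text, block_texts) in clusters.items():
        similar = [t for t in block_texts
                   if jaccard(candidate_text, t) >= threshold]
        if len(similar) > len(best_similar):
            best_width, best_similar = width, similar
    return best_width, best_similar


# Made-up clusters keyed by width.
clusters = {
    300: ("Price $10 Title Book Buy now",
          ["Price $12 Title Novel Buy now",
           "Price $8 Title Guide Buy now",
           "Subscribe to our newsletter"]),
    728: ("Sponsored ad banner",
          ["Sponsored ad banner offer",
           "Footer links about contact"]),
}

winner, similar_blocks = pick_record_cluster(clusters)
print(winner, len(similar_blocks))
```

Here the 300-wide cluster wins with two similar blocks, and the newsletter block is left out as noise. Ties are broken in favor of the first cluster encountered, which is an arbitrary choice of this sketch.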
Conclusions: Main Contributions
• A visual approach that directly accesses a rendering engine to get positional and visual features, rather than relying on source code or tag trees
• No need to identify the data rich section
• Uses observations on visual and content similarity, and on structural regularity, to group data items into records
Future Work
• Use a domain schema from schema.org, or a domain ontology, to annotate data records
• Use a domain schema or ontology to annotate query forms too
• Solve label incompleteness and inconsistency issues
• Similarity threshold
– Set by machine learning