Result Page Analysis (Cheng Wang)


  1. Cheng Wang
  2. • A list of results decorated with
         ◦ Side bars
         ◦ Branding banners
         ◦ Advertisements
         ◦ Merchant information
         ◦ Search forms
         ◦ Navigation part
  3. • Data Area Identification
     • Record Segmentation
     • Data Alignment
  4. • Visual information
         ◦ ViDE, VIPER
     • Ontology
         ◦ ODE
     • HTML page based
         ◦ FiVaTech
     • Regular expression
         ◦ EXALG, DELA
  5. • Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.
     • 1. Domain ontology construction
         ◦ Query interface
         ◦ Query result pages
     • 2. Data extraction using the ontology
         ◦ Identify data area
         ◦ Segment records
         ◦ Data value alignment
  6. • Multiple query result pages
         ◦ PADE
  7. • 1. Match query interface elements to data values
         ◦ title=“%orientalism%”
     • 2. Search for voluntary labels in table headers
     • 3. Search for voluntary labels encoded together with data values
         ◦ ISBN No: 0814756654
         ◦ ISBN No: 0789204592
     • 4. Data value formats
         ◦ 18/09/2008 : 20080918
         ◦ 03/18/98 : 19980318
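
Items 3 and 4 of slide 7 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the list of candidate date formats, and the colon-based label pattern are all assumptions made for illustration, based only on the examples shown on the slide.

```python
import re
from datetime import datetime

# Candidate input formats, tried in order. This list is an assumption: the
# slide shows only one day-first and one month-first example. Ambiguous
# inputs such as 03/04/2008 are read day-first here.
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%y")


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no candidate format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


# A label encoded together with its value, e.g. "ISBN No: 0814756654".
LABELLED_VALUE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def split_label(text):
    """Split 'label: value' into (label, value); label is None if absent."""
    match = LABELLED_VALUE.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text.strip()
```

Normalizing every value to one canonical form lets values that were rendered differently on different result pages be compared as plain strings.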
  8. • 1. Value level matching
         ◦ Data value similarity
     • 2. Label level matching
         ◦ Label co-occurrence
     • 3. Label-value matching
         ◦ Check assigned label
         ◦ Assign a suitable label for columns
         ◦ Matching conflict resolution
  9. • 1. Matching is unique → create attribute
     • 2. Matching is 1:1 → alias
         ◦ Category : Subject
     • 3. Matching is 1:n → n+1 attributes
         ◦ Author: {Last Name, First Name}
     • 4. Matching is n:m → n:1 + 1:m
  10. • One result page → one data area
      • Maximum Entropy Model
          ◦ Maximum Correlation Subtree Identification
  11. • 1 result
      • Several results (CABABABAD)
          ◦ Find continuous repeated patterns
          ◦ Visual gap
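
The "continuous repeated patterns" idea on slide 11 can be sketched as a brute-force search for the tandem repeat that covers the most symbols. This is only an illustration under assumptions: the function name is hypothetical, each letter is taken to stand for one child subtree of the data area, and the slide does not say which search procedure is actually used.

```python
def longest_tandem_repeat(seq, min_repeats=2):
    """Find the back-to-back repeated unit that covers the most symbols.

    Returns (unit, start_index, repeat_count), or None if nothing repeats.
    Works on strings or lists. Ties go to the shortest unit, then to the
    earliest start.
    """
    n = len(seq)
    best = None  # (covered_length, start, unit_length, repeats)
    for unit_len in range(1, n // 2 + 1):
        for start in range(n - unit_len * min_repeats + 1):
            unit = seq[start:start + unit_len]
            repeats = 1
            pos = start + unit_len
            # Count how many times the unit follows itself without a gap.
            while seq[pos:pos + unit_len] == unit:
                repeats += 1
                pos += unit_len
            covered = repeats * unit_len
            if repeats >= min_repeats and (best is None or covered > best[0]):
                best = (covered, start, unit_len, repeats)
    if best is None:
        return None
    _, start, unit_len, repeats = best
    return seq[start:start + unit_len], start, repeats
```

On the slide's example, `longest_tandem_repeat("CABABABAD")` returns `("AB", 1, 3)`: the unit `AB` repeats three times starting at index 1, leaving `C` and the trailing `AD` outside the repeated region.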
  12. • Each data value is assigned a label
          ◦ Maximum Entropy Model
          ◦ Match with ontology
      • Label → column
  13. • Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
      • ViDRE: Data Record Extractor
      • ViDIE: Data Item Extractor
      • New measure: revision
  14. • 1. Build a Visual Block tree
      • 2. Extract data records
          ◦ Noise block filtering
          ◦ Block clustering
          ◦ Regroup blocks
      • 3. Partition data records into data items and align them
  15. • Mandatory data items
      • Optional data items
      • Static data items
  16. • Simple one-pass clustering algorithm
          ◦ Take the first block from the list and use it to form a cluster.
          ◦ For each remaining block, compute its similarity to the existing clusters.
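
A minimal sketch of a one-pass clustering loop in the spirit of slide 16. Several parts are assumptions rather than anything stated on the slide: the threshold value, scoring a cluster by the average similarity to its members, and opening a new cluster when no score reaches the threshold. The `similarity` function is supplied by the caller.

```python
def one_pass_cluster(blocks, similarity, threshold=0.8):
    """Cluster blocks in a single pass over the list.

    similarity(a, b) must return a number in [0, 1]. The first block always
    opens the first cluster because no clusters exist yet.
    """
    clusters = []
    for block in blocks:
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            # Assumption: cluster similarity = mean similarity to its members.
            score = sum(similarity(block, m) for m in cluster) / len(cluster)
            if score >= best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([block])  # nothing similar enough: new cluster
        else:
            best_cluster.append(block)
    return clusters


def feature_overlap(a, b):
    """Toy similarity: share of positions where two feature tuples agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

With blocks described by hypothetical `(font, kind)` tuples, `one_pass_cluster([("bold", "text"), ("plain", "text"), ("bold", "text")], feature_overlap)` puts the two bold text blocks in one cluster and the plain one in another.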
  17. • ViDE assumes
          ◦ 1. Blocks in the same cluster all come from different data records.
          ◦ 2. The cluster with the maximum number n of blocks may contain the mandatory values of the data records.
  18. • Step 1: Rearrange the blocks in each cluster.
      • Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
      • Step 3: For every block (in all clusters), determine which group it belongs to.
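
The three regrouping steps on slide 18 can be sketched as follows. The slide does not state how blocks are ordered or how a block's group is chosen, so two stand-in rules are assumed here: blocks are sorted by vertical position, and each block joins the group whose seed block is vertically nearest. The `(name, y)` block representation is also hypothetical.

```python
def regroup(clusters):
    """Regroup clustered blocks into one group per data record.

    clusters: list of clusters, each a list of (name, y) blocks, where y is
    the block's vertical position on the page (an assumed representation).
    """
    # Step 1: order the blocks in each cluster by position (assumed rule).
    ordered = [sorted(cluster, key=lambda b: b[1]) for cluster in clusters]
    # Step 2: the largest cluster supplies one seed block per group.
    seeds = max(ordered, key=len)
    groups = [[seed] for seed in seeds]
    # Step 3: place every other block with the nearest seed (assumed rule).
    for cluster in ordered:
        if cluster is seeds:
            continue
        for block in cluster:
            nearest = min(range(len(seeds)),
                          key=lambda i: abs(seeds[i][1] - block[1]))
            groups[nearest].append(block)
    return groups
```

Seeding from the largest cluster follows the assumption on slide 17 that this cluster holds a mandatory value, so it has exactly one block per record; records that lack an optional item simply end up with smaller groups.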
  19. • WDBt: total number of web databases processed
      • WDBc: number of web databases whose precision and recall are both 100%
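
The two counts on slide 19 are the inputs to the revision measure named on slide 13. The slide gives no formula, so the sketch below assumes that revision is the share of web databases that were not extracted perfectly, (WDBt - WDBc) / WDBt. The function name is hypothetical.

```python
def revision(wdb_total, wdb_correct):
    """Share of web databases whose extraction was not perfect.

    Assumed formula: (WDBt - WDBc) / WDBt, where WDBc counts the databases
    with both precision and recall at 100%.
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must lie between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total
```

Under this reading, a lower revision means fewer sites need manual correction after extraction. For example, with made-up counts of 100 processed databases and 90 perfect ones, the revision would be 0.1.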
  20. [Diagram: Root; Data Area (LCA); children: Record, Separator, Record, Separator, Record]
  21. • Real-estate domain
      • 60 agents’ websites
          ◦ MRP: 95.0%
          ◦ ERP: 90.0%
  22. [Diagram: Root; Data Area; six Record nodes numbered 1, 1, 2, 2, 3, 3, alternating Part A and Part B]
  23. • DIADEM 0.1:
          ◦ Construct real-estate result page ontology
          ◦ Ontological record segmentation (more features)
          ◦ Data labeling and data alignment
      • After:
          ◦ Add visual information
