Cheng Wang
- A list of results decorated with:
  - Side bars
  - Branding banners
  - Advertisements
  - Merchant information
  - Search forms
  - Navigation parts
- Data Area Identification
- Record Segmentation
- Data Alignment
- Visual Information
  - ViDE, VIPER
- Ontology
  - ODE
- HTML Page based
  - FiVaTech
- Regular Expression
  - EXALG, DELA
- ODE: Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.

1. Domain ontology construction
   - Query interface
   - Query result pages
2. Data extraction using the ontology
   - Identify the data area
   - Segment records
   - Align data values
- Multiple Query Result Page
  - PADE
1. Match query interface elements to data values.
   - title = "%orientalism%"
2. Search for voluntary labels in table headers.
3. Search for voluntary labels encoded together with data values.
   - ISBN No: 0814756654
   - ISBN No: 0789204592
4. Data value formats
   - 18/09/2008 : 20080918
   - 03/18/98   : 19980318
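The format normalization in step 4 can be sketched as below. This is a minimal illustration, not the method from the cited work: the function name `normalize_date` and the list of candidate formats are assumptions, and the formats are simply tried in order until one parses.

```python
from datetime import datetime

# Candidate input formats, tried in order. This list is an assumption made
# for illustration; an ambiguous value such as 03/04/2008 would need a
# per-source hint to decide between day-first and month-first.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%d/%m/%y", "%m/%d/%y"]


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no candidate format fits."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime("%Y%m%d")
    return None


print(normalize_date("18/09/2008"))
print(normalize_date("03/18/98"))
```

Both slide examples map to the canonical form shown above (20080918 and 19980318), so values written in different styles become directly comparable.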
1. Value level matching
   - Data value similarity
2. Label level matching
   - Label co-occurrence
3. Label-value matching
   - Check assigned label
   - Assign a suitable label for columns
   - Matching conflict resolution
1. Matching is unique → create attribute
2. Matching is 1:1 → alias
   - Category : Subject
3. Matching is 1:n → n+1 attributes
   - Author: {Last Name, First Name}
4. Matching is n:m → n:1 + 1:m
- One result page → one data area
- Maximum Entropy Model
  - Maximum Correlation Subtree Identification
- 1 result
- Several results (CABABABAD)
  - Find continuous repeated patterns
  - Visual gap
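Finding continuous repeated patterns in a tag string such as CABABABAD can be sketched with a brute-force scan. The function `find_repeated_run` is a hypothetical helper written for this illustration, not the algorithm from the cited work; it returns the run of back-to-back repeats that covers the most symbols.

```python
def find_repeated_run(tags):
    """Return (start, unit, count) for the consecutive repetition of `unit`
    that covers the most symbols, or (0, "", 0) if nothing repeats."""
    best = (0, "", 0)
    n = len(tags)
    for start in range(n):
        # A unit can repeat at least twice only if it fits twice in the rest.
        for size in range(1, (n - start) // 2 + 1):
            unit = tags[start:start + size]
            count = 1
            # Count how many times the unit occurs back to back.
            while tags[start + count * size:start + (count + 1) * size] == unit:
                count += 1
            covered = count * size
            if count >= 2 and covered > best[2] * len(best[1]):
                best = (start, unit, count)
    return best


print(find_repeated_run("CABABABAD"))  # (1, 'AB', 3)
```

For CABABABAD the unit AB repeats three times starting at index 1, which suggests three records framed by non-repeating content. The scan is cubic in the worst case, which is acceptable for the short child-tag sequences of a data area.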
- Each data value is assigned a label
  - Maximum Entropy Model
  - Match with Ontology
- Label → Column
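One way to read the "Label → Column" step is as a vote: once each value carries a label, the column takes the label that most of its values agree on. The sketch below assumes this majority-vote rule, which the slides do not state; `label_columns` and the sample records are hypothetical.

```python
from collections import Counter


def label_columns(records):
    """records: list of rows, each a list of (value, label) pairs in column
    order; label is None when no label could be assigned to the value.
    Returns one label per column (None if no value in it was labeled)."""
    column_labels = []
    for column in zip(*records):  # transpose rows into columns
        votes = Counter(label for _, label in column if label is not None)
        # Majority vote is an assumption of this sketch.
        column_labels.append(votes.most_common(1)[0][0] if votes else None)
    return column_labels


records = [
    [("Orientalism", "title"), ("0814756654", "isbn")],
    [("Culture and Imperialism", "title"), ("0789204592", None)],
    [("E. Said", "author"), ("0394740676", "isbn")],
]
print(label_columns(records))  # ['title', 'isbn']
```

The third row shows why a per-column decision helps: a single mislabeled value ("author") is outvoted by the rest of its column.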
- ViDE: Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
- ViDRE: Data Record Extractor
- ViDIE: Data Item Extractor
- New measure: revision
1. Build a Visual Block tree
2. Extract data records
   - Noise block filtering
   - Block clustering
   - Block regrouping
3. Partition data records into data items and align them
- Mandatory data items
- Optional data items
- Static data items
- Simple one-pass clustering algorithm
  - Take the first block from the list and use it to form a cluster.
  - For each remaining block, compute its similarity to the existing clusters.
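The one-pass clustering above can be sketched as follows. Several details are assumptions of this sketch rather than statements from the slides: a block joins the most similar cluster when the average similarity reaches a threshold and otherwise starts a new cluster, and `toy_similarity` (font plus width) stands in for whatever visual similarity the real system uses.

```python
def one_pass_cluster(blocks, similarity, threshold=0.8):
    """Cluster blocks in a single pass over the list."""
    clusters = []
    for block in blocks:
        best, best_score = None, threshold
        for cluster in clusters:
            # Average similarity between the block and the cluster members.
            score = sum(similarity(block, m) for m in cluster) / len(cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([block])  # the first block always lands here
        else:
            best.append(block)
    return clusters


def toy_similarity(a, b):
    """Hypothetical similarity: same font and comparable width."""
    same_font = 1.0 if a["font"] == b["font"] else 0.0
    width_ratio = min(a["width"], b["width"]) / max(a["width"], b["width"])
    return 0.5 * same_font + 0.5 * width_ratio


blocks = [
    {"font": "bold", "width": 200},
    {"font": "plain", "width": 80},
    {"font": "bold", "width": 190},
    {"font": "plain", "width": 82},
]
clusters = one_pass_cluster(blocks, toy_similarity)
print([[b["font"] for b in c] for c in clusters])
```

Because each block is compared only against clusters that already exist, the result depends on the order of the input list, which is the usual trade-off of a one-pass scheme.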
- ViDE assumes:
  1. Blocks in the same cluster all come from different data records.
  2. The cluster with the maximum number n of blocks may contain the mandatory values of the data records.
- Step 1: Rearrange the blocks in each cluster.
- Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
- Step 3: For every block (in all clusters), determine which group it belongs to.
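The three regrouping steps can be sketched under strong simplifying assumptions that are not in the slides: each block is a dict with a vertical position `y`, "rearranging" means sorting by `y`, and the seed block is taken to be the first block of its record in page order, so a block joins the group of the nearest seed at or above it. The function `regroup` and the sample data are hypothetical.

```python
import bisect


def regroup(clusters):
    """Turn clusters of similar blocks into groups, one group per record."""
    # Step 1: rearrange the blocks of each cluster in page order.
    ordered = [sorted(c, key=lambda b: b["y"]) for c in clusters]
    # Step 2: a cluster with the maximum number n of blocks is the seed;
    # each seed block opens one group.
    seed = max(ordered, key=len)
    seed_ys = [b["y"] for b in seed]
    groups = [[] for _ in seed]
    # Step 3: assign every block to the group of the closest seed block
    # at or above it (assumption: the seed block starts its record).
    for cluster in ordered:
        for block in cluster:
            index = bisect.bisect_right(seed_ys, block["y"]) - 1
            groups[max(index, 0)].append(block)
    for group in groups:
        group.sort(key=lambda b: b["y"])
    return groups


clusters = [
    [{"name": "title", "y": 0}, {"name": "title", "y": 100}, {"name": "title", "y": 200}],
    [{"name": "price", "y": 220}, {"name": "price", "y": 20}, {"name": "price", "y": 120}],
    [{"name": "discount", "y": 130}],
]
for group in regroup(clusters):
    print([b["name"] for b in group])
```

The optional "discount" block ends up only in the second group, which matches the idea that clusters smaller than n hold optional data items.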
- WDBt: total number of web databases processed
- WDBc: number of web databases whose precision and recall are both 100%
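The slides define WDBt and WDBc but not the formula that combines them. Assuming revision is the share of web databases that are not perfectly extracted, that is (WDBt - WDBc) / WDBt, it can be computed as below. Both the formula and the sample counts are assumptions of this sketch, not figures from the slides.

```python
def revision(wdb_total, wdb_correct):
    """Fraction of web databases whose extraction still needs manual revision.

    Assumed definition: (WDBt - WDBc) / WDBt.
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must be between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total


# Hypothetical counts: 100 web databases processed, 85 extracted perfectly.
print(f"{revision(100, 85):.2%}")
```

Unlike precision and recall, which average over records, this measure counts a web database as a failure as soon as a single record or item is wrong, so a lower value is better.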
[Figure: page tree. Root → Data Area (LCA) → Record, Separator, Record, Separator, Record]
- Real-estate domain
- 60 agents' websites
  - MRP: 95.0%
  - ERP: 90.0%
[Figure: page tree. Root → Data Area → Record 1 Part A, Record 1 Part B, Record 2 Part A, Record 2 Part B, Record 3 Part A, Record 3 Part B]
- DIADEM 0.1:
  - Construct a real-estate result page ontology
  - Ontological record segmentation
    - (More features)
  - Data labeling and data alignment
- After:
  - Add visual information
Result Page Analysis (Cheng Wang)
