1. UNDERSTANDING DEEP WEB SEARCH INTERFACES
   Prospectus Presentation
   April 03, 2009
   Ritu Khare
2. Presentation Order
Note: This presentation uses the margin space for writing additional facts.
- Problem Statement (the problem: understanding the semantics of search interfaces)
  - The Deep Web & Challenges
  - Search Interface Understanding
  - Challenges and Significance
- Literature Review Results (how do existing approaches solve this problem?)
  - Settings
  - Reductionist Analysis
  - Holistic Analysis
- Research Questions and Design Ideas (what are the research gaps, and how can they be filled?)
  - Techniques vs. Heterogeneity
  - Semantics and Artificial Designer
3. PROBLEM STATEMENT
- The Deep Web
- Challenges in Accessing the Deep Web
- Doors of Opportunity
- The SIU Process
- About the Stages of the Process
- Why Is SIU Challenging?
- Why Is SIU Significant?
4. The Deep Web
Note: Has many other names: Hidden Web, Dark Web, Invisible Web, subject-specific databases, data-intensive Web sites.
- What is the DEEP WEB?
  - The portion of Web resources that is not returned by search engines through traditional crawling and indexing.
- Where do the contents LIE?
  - In online databases.
[Figure: the Web as seen by search engines, divided into the surface Web, the almost visible Web, and the deep Web.]
5. The Deep Web (continued)
- How are the contents ACCESSED?
  - By filling in HTML forms on search interfaces.
- How are they PRESENTED to users?
  - As dynamic pages / result pages / response pages.
6. Challenges in Accessing Deep Web Contents
Quick fact: The deep Web includes 307,000 sites, 450,000 databases, and 1,258,000 interfaces.
- Scale: the deep Web holds about 500 times more content than the rest of the Web (BrightPlanet.com, 2001), and it grew 3-7 times between 2000 and 2004 (He et al., 2007a).
- The deep Web remains invisible on the Web: a user visits several interfaces before finding the right information and manually reconciles information obtained from different sources.
- Alternative approaches: Invisible Web directories and search-engine browse directories cover only 37% of the deep Web (He et al., 2007a). Not scalable.
7. Opportunities in Accessing Deep Web Contents
Interesting fact: There exist at least 10 million high-quality HTML forms on the deep Web.
- HTML forms on search interfaces provide a useful way of discovering the underlying database structure.
  - The labels attached to fields are very expressive and meaningful.
  - Instructions for users entering data may provide information on data constraints (such as the range and domain of the data) and integrity constraints (mandatory/optional attributes).
- In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
8. The Search Interface Understanding (SIU) Process
[Figure: the SIU process. Input: a search interface backed by an online DB. Stages: A. Representation, B. Parsing, C. Segmentation, D. Segment Processing, and E. Evaluation, in which the system-tagged search interface (the output) and the extracted DB are compared against a manually tagged search interface.]
The SIU process is challenging because search interfaces are designed autonomously by different designers and thus do not have a standard structure (Halevy, 2005).
9. A. Representation and Modeling
Note: This stage builds the foundation for the process.
- This stage formalizes the information to be extracted from a search interface:
  - interface component: any text or form element
  - semantic label: the meaning of a component from a user's standpoint
  - segment: a composite component formed from a group of related components
  - segment label: the semantic label of the segment
10. A. Representation and Modeling (continued)
- Zhang et al. (2004) represent an interface as a list of query conditions.
  - Segment label = query condition (attribute-name, operator, value).
  - A segment consists of the following semantic labels: an attribute name, an operator, and a value.
    (A small data-structure sketch follows below.)
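To make this representation concrete, here is a minimal data-structure sketch in Python. The class and field names (Component, QueryCondition, and so on) are illustrative inventions, not the notation of Zhang et al. (2004).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    """Any text or form element appearing on a search interface."""
    kind: str                              # e.g. "text", "textbox", "select", "radio"
    content: str                           # visible text, or the element's HTML name
    semantic_label: Optional[str] = None   # meaning from the user's standpoint

@dataclass
class QueryCondition:
    """A segment in the Zhang et al. (2004) style: attribute-name, operator(s), value(s)."""
    attribute_name: Component
    operators: List[Component] = field(default_factory=list)
    values: List[Component] = field(default_factory=list)

# Example: the condition "Title  contains  <textbox>"
condition = QueryCondition(
    attribute_name=Component("text", "Title", "attribute-name"),
    operators=[Component("select", "contains", "operator")],
    values=[Component("textbox", "title_kw", "value")],
)
```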
11. B. Parsing
Note: This stage is the first task physically performed on the interface.
- The interface is parsed into a workable memory structure. This can be done in two modes:
  - by reading the HTML source code (a sketch of this mode follows below);
  - by rendering the page in a Web browser, either manually or automatically using a visual layout engine.
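As a hedged illustration of the source-code mode, the sketch below uses Python's standard html.parser to walk a form's HTML and record texts and form elements in document order. It is a simplified illustration, not the parser of any reviewed system.

```python
from html.parser import HTMLParser

FORM_ELEMENTS = {"input", "select", "textarea", "button"}

class InterfaceParser(HTMLParser):
    """Records texts and form elements of an interface in document order."""
    def __init__(self):
        super().__init__()
        self.components = []               # list of (kind, payload) pairs

    def handle_starttag(self, tag, attrs):
        if tag in FORM_ELEMENTS:
            self.components.append((tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.components.append(("text", text))

parser = InterfaceParser()
parser.feed('<form>Gene Name: <input type="text" name="gene"></form>')
print(parser.components)
# [('text', 'Gene Name:'), ('input', {'type': 'text', 'name': 'gene'})]
```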
12. B. Parsing (continued)
- He et al. (2007b) parse the interface into an interface expression (IEXP) with constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row delimiter).
- The IEXP for the figure is: t|te|teee (a sketch of IEXP generation follows below).
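Building on such an event stream, an IEXP-like string could be generated as sketched below. This assumes the parser also reports row-delimiting tags; the event format is invented for illustration and only approximates He et al. (2007b).

```python
FORM_ELEMENTS = {"input", "select", "textarea", "button"}
ROW_DELIMITERS = {"br", "p", "/tr"}     # <BR>, <P>, and </TR> close a layout row

def to_iexp(events):
    """Map a document-order event stream to IEXP constructs:
    't' for any text, 'e' for any form element, '|' for a row delimiter."""
    out = []
    for kind, _payload in events:
        if kind == "text":
            out.append("t")
        elif kind in FORM_ELEMENTS:
            out.append("e")
        elif kind in ROW_DELIMITERS and out and out[-1] != "|":
            out.append("|")
    return "".join(out)

events = [("text", "Search genes"), ("/tr", None),
          ("text", "Gene Name"), ("input", {}), ("/tr", None),
          ("text", "Match"), ("input", {}), ("input", {}), ("input", {})]
print(to_iexp(events))                  # t|te|teee
```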
13. C. Segmentation
Techniques used: rules, heuristics, machine learning.
- A segment has a semantic existence but no physically defined boundaries, which makes this stage a challenging one. It involves:
  - grouping semantically related components (a sub-problem is associating a surrounding text with a form element);
  - assigning semantic labels to components.
14. C. Segmentation (continued)
- He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text and a form element lying on the same line are likely to belong to one segment. In the figure, the three components "Gene Name", the radio button with options 'Exact Match' and 'Ignore Case', and the textbox belong to one segment (a toy sketch of this heuristic follows below).
[Figure: a logical attribute composed of an attribute label, a constraint element, and a domain element.]
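A toy version of the same-line heuristic is sketched below. It assumes each component carries a y coordinate obtained from a rendered layout and groups components whose vertical positions are close; the tolerance and the data are invented, and the real LEX method combines this with richer cues (ending colons, textual similarity, distance).

```python
def group_by_line(components, y_tolerance=5):
    """Group components whose vertical positions are within a small tolerance,
    approximating the 'same line, same segment' heuristic."""
    segments = []
    for comp in sorted(components, key=lambda c: c["y"]):
        if segments and abs(comp["y"] - segments[-1][-1]["y"]) <= y_tolerance:
            segments[-1].append(comp)      # close enough: same segment
        else:
            segments.append([comp])        # start a new segment
    return segments

components = [
    {"kind": "text",    "content": "Gene Name",   "y": 100},
    {"kind": "textbox", "content": "gene",        "y": 101},
    {"kind": "radio",   "content": "Exact Match", "y": 102},
    {"kind": "radio",   "content": "Ignore Case", "y": 102},
    {"kind": "text",    "content": "Organism",    "y": 140},
    {"kind": "select",  "content": "organism",    "y": 141},
]
for seg in group_by_line(components):
    print([c["content"] for c in seg])
# ['Gene Name', 'gene', 'Exact Match', 'Ignore Case']
# ['Organism', 'organism']
```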
15. D. Segment Processing
Techniques used: rules, heuristics, machine learning.
- In this stage,
  - each segment is further tagged with additional meta-information about itself and its components;
  - the extracted information is post-processed: normalization, stemming, removal of stop words.
- He et al. (2007b)'s LEX extracts meta-information about each extracted segment using a Naïve Bayes classification technique (a toy sketch follows below).
  - The extracted information for a segment includes the domain type (finite, infinite), unit (miles, sec), value type (numeric, character, etc.), and layout order position (in the IEXP).
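For flavor, the sketch below shows how a Naïve Bayes classifier could tag a segment with one piece of meta-information (here, its value type) from its label text, using scikit-learn. The features and the training examples are invented and far simpler than those used by LEX.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training examples: a segment's label text -> the value type of its field.
labels = ["Departure date", "Number of passengers", "Author name",
          "Price range in dollars", "Publication year", "City or airport"]
value_types = ["date", "numeric", "character", "numeric", "numeric", "character"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labels)           # bag-of-words features
clf = MultinomialNB().fit(X, value_types)      # Naive Bayes over those features

print(clf.predict(vectorizer.transform(["Year of release"])))   # -> ['numeric']
```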
16. E. Evaluation
Note: An approach is usually tested on a set of interfaces belonging to a particular domain.
- How accurate is the extracted information?
- The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface.
- The results are evaluated using standard metrics (precision, recall, accuracy, etc.); a small example follows below.
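For concreteness, precision and recall over label assignments can be computed as below. The (element, label) pairs are hypothetical, not data from any reviewed evaluation.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (element, label) assignments
    against a manually tagged gold standard."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold      = {("title_box", "Title"), ("author_box", "Author")}
predicted = {("title_box", "Title"), ("isbn_box", "Title")}
print(precision_recall(predicted, gold))   # (0.5, 0.5)
```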
17. Why Is SIU Significant?
Significance: SIU is a prerequisite for several advanced deep Web applications.
- Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on their goals:
  - To increase content visibility on search engines
    - Building a dynamic page repository: Raghavan and Garcia-Molina (2001)
    - Building a database content repository: Madhavan et al. (2008)
  - To increase domain-specific usability
    - Meta-search engines: Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005), Pei et al. (2006), He and Chang (2003), and Wang et al. (2004)
  - To attain knowledge organization
    - Derivation of ontologies: Benslimane et al. (2007)
- These solutions can only be materialized by leveraging the opportunities provided by search interfaces.
18. LITERATURE REVIEW RESULTS
- Review Settings
- Reductionist Analysis
- Holistic Analysis
- Progress Made
19. Literature Review Process
Alias: For quick reference, each work is assigned an alias.
- Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web.

  S.No.  Reference                           Alias
  1      Raghavan and Garcia-Molina (2001)   LITE
  2      Kalijuvee et al. (2001)             CombMatch
  3      Wu et al. (2004)                    FieldTree
  4      Zhang et al. (2004)                 HSP
  5      Shestakov et al. (2005)             DEQUE
  6      Pei et al. (2006)                   AttrList
  7      He et al. (2007)                    LEX
  8      Benslimane et al. (2007)            FormModel
  9      Nguyen et al. (2008)                LabelEx
20. Literature Review Process (continued)
Dimension: facilitates comparison among different works by placing them under the same umbrella.
- The review was done in two phases:
  - Reductionist Analysis: the works were decomposed into small pieces.
    - Each work was visualized as a two-dimensional grid whose horizontal sections refer to the stages of the SIU process. For each stage, the works were analyzed along vertical degrees of analysis known as stage-specific dimensions.
  - Holistic Analysis: each work was studied in its entirety, within a big-picture context.
    - Composite dimensions were created out of the stage-specific dimensions.
21. Reductionist Analysis: Representation

HSP
  Segment (contents):        Conditional pattern: attribute-name, operator*, and value+
  Text label : form element: 1:M
  Meta-information:          (not specified)

DEQUE
  Segment (contents):        Field segment: f, Name(f), Label(f), where f = a field
  Text label : form element: 1:1
  Meta-information:          JavaScript functions; visible and invisible values;
                             Subinfo(F) = {action, method, enctype}; Iset(F) = the initial field set
                             that can be submitted without completing the form; domain(f), type(f);
                             where F = the form

AttrList
  Segment (contents):        Attribute: attribute-name, description, and form element
  Text label : form element: 1:1
  Meta-information:          Domain information for each attribute (set of values and data types)

LEX
  Segment (contents):        Logical attribute Ai: attribute label L, a list of domain elements
                             {Ej, ..., Ek}, and element labels
  Text label : form element: 1:M (1:1 in the case of element label : form element)
  Meta-information:          Site information and form constraints;
                             Ai = (P, U, Re, Ca, DT, DF, VT), where Ai = the i-th attribute,
                             P = layout order position, U = unit, Re = relationship type,
                             Ca = domain element constraint, DT = domain type, DF = default value,
                             VT = value type;
                             Ei = (N, Fe, V, DV), where N = internal name, Fe = format,
                             V = set of values, DV = default value
22. Reductionist Analysis: Parsing

LITE
  Input mode:          HTML source code and visual interface
  Basic step:          Pruning: isolate the elements that directly influence the layout of form
                       elements and labels
  Cleaning up:         Discard images; ignore styling information such as font size, font style,
                       and style sheets
  Resulting structure: Pruned page

CombMatch
  Input mode:          HTML source code
  Basic step:          Chunk partitioning, and finding meta-information about each chunk: find
                       bounding HTML tags, text strings delimited by table-cell tags, etc.
  Cleaning up:         Stop phrases ("optional", "required", "*"), text-formatting HTML tags
  Resulting structure: Chunk list and table index list; each chunk is represented as an 8-tuple
                       describing its meta-information

DEQUE
  Input mode:          HTML text and visual interface
  Basic step:          Preparing the form database: a DOM tree is created for each FORM element
  Cleaning up:         Ignore font size, typefaces, and styling information
  Resulting structure: Pruned tree

LEX
  Input mode:          HTML source code
  Basic step:          Interface expression generation: t = text, e = element, | = row delimiter
                       (<BR>, <P>, or </TR>)
  Cleaning up:         (not specified)
  Resulting structure: String
23. Reductionist Analysis: Segmentation

CombMatch
  Problem:   Assigning a text label to an input element
  Criteria:  A combination of string-similarity and spatial-similarity algorithms
  Technique: Heuristics (string properties, proximity, and layout)

HSP
  Problem:   Finding the 3-tuple <attribute name, operators, values>
  Criteria:  A grammar (set of rules) based on productions and preferences
  Technique: Rules (a best-effort parser that builds a parse tree)

LEX
  Problem:   Assigning text labels to attributes, and assigning element labels to domain elements
  Criteria:  Ending colon, textual similarity with the element name, vertical alignment, distance,
             preference for the current row
  Technique: Heuristics (string properties, layout, and proximity)

LabelEx
  Problem:   Assigning a text label to a form element
  Criteria:  Classifiers (Naïve Bayes and Decision Tree); features include spatial features,
             element type, font type, internal similarity, alignment, label placement, and distance
  Technique: Supervised machine learning
24. Reductionist Analysis: Segment Processing

HSP
  Technique for extracting meta-information: (not specified)
  Post-processing: The Merger module reports conflicting tokens (those that occur in two query
                   conditions) and missing tokens (those that do not occur in any query condition)

LEX
  Technique for extracting meta-information: Naïve Bayes classification (supervised machine learning)
  Post-processing: Removal of meaningless stop words (the, with, any, etc.)

FormModel
  Technique for extracting meta-information: Learning by examples (machine learning)
  Post-processing: (not specified)

LabelEx
  Technique for extracting meta-information: (not specified)
  Post-processing: Heuristics for reconciling multiple labels assigned to an element and for
                   handling dangling elements

(A small sketch of label post-processing follows below.)
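A minimal sketch of the kind of label post-processing mentioned above (lowercasing, punctuation stripping, stop-word removal); the stop-word list is invented, not the one used by any reviewed system.

```python
STOP_WORDS = {"the", "with", "any", "a", "an", "of", "in"}   # illustrative list

def normalize_label(text):
    """Lowercase a label, strip surrounding punctuation, and drop stop words."""
    words = [w.strip(":*()") for w in text.lower().split()]
    return " ".join(w for w in words if w and w not in STOP_WORDS)

print(normalize_label("Title of the Book:"))   # "title book"
```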
25. Reductionist Analysis: Evaluation

LITE
  Test domains:           Semiconductor industry, movies, database technology
  Yahoo subject category: Science, Entertainment, Computers & Internet
  Compared with:          CombMatch (in terms of methodology)
  Metrics:                Accuracy

HSP
  Test domains:           Airfare, automobile, book, job, real estate, car rental, hotel, movies,
                          music records
  Yahoo subject category: Business & Economy, Recreation & Sports, Entertainment
  Compared with:          4 datasets from different sources collected by the authors
  Metrics:                Precision, recall

LabelEx
  Test domains:           Airfare, automobiles, books, movies
  Yahoo subject category: Business & Economy, Recreation & Sports, Entertainment
  Compared with:          Barbosa et al. (2007)'s dataset and HSP (in terms of datasets);
                          classifier ensemble with or without mapping reconciliation (MR);
                          generic classifier vs. domain-specific classifier;
                          generic classifier with MR vs. domain-specific classifier with MR;
                          HSP and LEX (in terms of methodology)
  Metrics:                Recall, precision, F-measure
26. Holistic Analysis

LITE
  Type of semantics:  Partial form capabilities (label associated with a form element)
  Technique:          Heuristics
  Human involvement:  None
  Target application: Deep Web crawler (search-engine visibility)

HSP
  Type of semantics:  Query capability (attribute name, operator, and values)
  Technique:          Rules
  Human involvement:  Manual specification of grammar rules
  Target application: Meta-searchers (domain-specific usability)

LEX
  Type of semantics:  Components belonging to the same logical attribute (labels and form elements);
                      meta-information
  Technique:          Heuristics for grouping; supervised machine learning for meta-information
  Human involvement:  None for grouping; training data for the classifier
  Target application: Meta-searchers (domain-specific usability)

FormModel
  Type of semantics:  Structural units (groups of fields belonging to the same entity);
                      partial form capabilities (label associated with a form element);
                      meta-information
  Technique:          NOT REPORTED for structural units; heuristics for form capabilities;
                      supervised machine learning for meta-information
  Human involvement:  Unknown for structural units; none for the heuristics; training data for
                      learning by examples
  Target application: Ontology derivation (knowledge organization)

LabelEx
  Type of semantics:  Partial form capabilities (label associated with a form element)
  Technique:          Supervised machine learning
  Human involvement:  Classifier training data was manually tagged
  Target application: Deep Web in general (search-engine visibility, domain-specific usability)
27. Progress Made
- SEMANTICS modeled and extracted (Stages A and B)
  - from merely stating what we see, to stating what is meant by what we see
  - from merely associating labels with form elements, to discovering query capabilities
  - from no meta-information, to a lot of meta-information that might be useful for the target application
- TECHNIQUES employed (Stages C and D)
  - A mild transition from naïve techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning).
- DOMAINS explored (Stage E)
  - Only commercial domains: books, used cars, movies, etc.
  - Still-unexplored non-commercial domains: yahoo.com subject categories such as Regional, Society and Culture, Education, Arts and Humanities, Science, Reference, and others.
28. RESEARCH QUESTIONS
- Techniques vs. Design Heterogeneity
- Techniques vs. Domain Heterogeneity
- Simulating a Human Designer
29. Research Questions
Note: Derived from the holistic and reductionist analyses.
- R.Q. #1: Technique vs. Design Heterogeneity
  - What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
- R.Q. #2: Technique vs. Domains
  - How can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches?
- R.Q. #3: Simulating a Human Designer
  - How can we make a machine understand an interface in the same way as a human designer does?
30. Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
Note: Technique is a dimension of the Segmentation and Segment Processing stages.
- Elaborating the question:
  - Techniques: rules, heuristics, and machine learning.
  - Design: the arrangement of interface components.
  - Handling heterogeneity in design: being able to perform the following tasks for any kind of design:
    - Segmentation
      - Semantic tagging
      - Grouping (label assignment is a part of this)
    - Segment processing
31. Research Question #1 (continued)
[Figure: examples of design heterogeneity. An automobile-domain interface annotated with an operator, an attribute-name, and an operand; a movie-domain interface annotated with multiple attribute-names.]
32. Research Question #1 (continued)
Note: This question has been only partially explored.
- Existing efforts to answer:
  - A 2002 study (Kushmerick, 2002) suggests the superiority of machine learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
  - A 2008 study (Nguyen et al., 2008) compared the label-assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine-learning-based (LabelEx). The machine learning technique outperformed the other two.
33. Investigating R.Q. #1: Technique vs. Design Heterogeneity
Tasks to test: Segmentation (grouping, semantic tagging) and Segment Processing.
- However, there is NO comparative study in terms of overall grouping, semantic tagging, and segment processing.

Experiment: A machine learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset belonging to the biology domain (a toy HMM sketch follows below).
  - Grouping accuracy (label assignment included): 86%, a 10% improvement over the heuristic-based state-of-the-art approach LEX.
  - Semantic tagging accuracy: 90%, a 17% improvement over a heuristic-based algorithm designed for comparison.

Planned comparisons:
  - Segmentation performance: machine learning vs. rule-based; various machine learning techniques (classification vs. HMM vs. ...).
  - Segment processing performance: rules vs. heuristics vs. machine learning.
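To make the HMM idea concrete, the toy Viterbi decoder below tags a sequence of component types with semantic roles. The states, the observation alphabet, and all probabilities are invented for illustration; they are not the prospectus's actual model or its trained parameters.

```python
# Hidden states are semantic roles; observations are coarse component types.
states = ["attribute-name", "operator", "value"]
start_p = {"attribute-name": 0.8, "operator": 0.1, "value": 0.1}
trans_p = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.4, "value": 0.5},
    "operator":       {"attribute-name": 0.1, "operator": 0.1, "value": 0.8},
    "value":          {"attribute-name": 0.7, "operator": 0.1, "value": 0.2},
}
emit_p = {
    "attribute-name": {"text": 0.90, "select": 0.05, "textbox": 0.05},
    "operator":       {"text": 0.30, "select": 0.60, "textbox": 0.10},
    "value":          {"text": 0.05, "select": 0.15, "textbox": 0.80},
}

def viterbi(obs):
    """Return the most likely state sequence for an observation sequence."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][p][0] * trans_p[p][s] * emit_p[s][o], V[-2][p][1] + [s])
                for p in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

print(viterbi(["text", "select", "textbox"]))
# ['attribute-name', 'operator', 'value']
```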
34. Investigating R.Q. #1: Technique vs. Design Heterogeneity (continued)
Note: Human intervention is a dimension of the holistic analysis.
- There is NO comparative study measuring the human intervention required by these techniques.

Experiment: Monitoring human intervention (IN PROGRESS), comparing rule-based vs. heuristic vs. machine learning techniques.
  - Rule-based: manual crafting of rules; heuristics: manual observations; machine learning: manual tagging.

Experiment: The HMM was trained using the unsupervised training algorithm Baum-Welch.
  - Evaluation metric: P(O|λ). Result: not promising.

- Next direction: designing unsupervised techniques.
35. Research Question #2: How can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches?
Note: Domain tested is a dimension of the Evaluation stage.
- Elaborating the question:
  - Domain heterogeneity: the deep Web is heterogeneous in terms of domains, i.e., it has databases belonging to all 14 subject categories of Yahoo (Arts & Humanities, Business and Economy, Computers and Internet, Education, Entertainment, Government, Health, etc.).
  - How can we design generic approaches that work for many domains?
    - How do interface designs differ across domains?
    - Which technique should be employed?
36. Research Question #2 (continued)
Note: The deep Web has a balanced domain distribution.
- Existing efforts to answer:
  - 2004: A single grammar (rule-based) generates reasonably good segmentation performance (grouping & semantic tagging) for all domains (Zhang et al., 2004).
  - Higher accuracy can be attained using domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
  - 2008: For label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).
- Still missing:
  - A comparison of domain-specific and generic approaches for overall segmentation performance.
  - An account of the design differences across domains.
  - Generic approaches that yield equally good results for as many domains as possible.
37. Investigating R.Q. #2
Note: The design tendencies of designers from different domains are different.
[Figure: learned state-transition diagrams over the labels Attribute-name, Operator, Operand, and Text-trivial for four domains (Movie, Biology, References & Education, and Automobile), showing different transition probabilities in each domain.]
38. Investigating R.Q. #2: Technique vs. Domain
Note: All experiments were done using the machine learning technique (HMM).

Domain-specific HMM vs. generic HMM, evaluated on segmentation accuracy:
  - Movie:                  winner Generic HMM (by 4.4%)
  - References & Education: winner Domain-specific HMM (by 7%)
  - Automobile:             winner Domain-specific HMM (by 8%)
  - Biology:                winner Domain-specific HMM (by 36%)

- Open question: what is the correlation between design topology and the performance of a domain-specific model?
39. Research Question #3: How can we make a machine understand the interface and extract semantics from it in the same way as a human designer does?
- A human designer or user naturally understands the design and semantics of an interface based on visual cues and on prior experience.
- A machine cannot really "see" an interface and does not have any implicit Web search experience. (How much do visual layout engines assist?)
- Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
- How can we reconcile these differences?
40. Investigating R.Q. #3: Simulating a Human Designer
Note: Existing methods have been able to understand design, attach semantic labels, and derive segments and query capabilities.
- Hypothesis: A machine can be made to understand the interface in the same way as a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.
[Figure: existing methods take a search interface, understand its design, attach semantic labels, and derive segments and query capabilities. The designer/modeler who understands and designs the interface works from Web design knowledge and a conceptual model; recovering the DB schema is shown as a further, deeper step.]
- Extracting the DB schema and the conceptual model is still an open question.
41. Connecting the Dots
[Figure: the same picture annotated with the research questions. R.Q. 1 marks the interface-understanding steps (understand design, attach semantic labels, derive segments, derive query capabilities); R.Q. 2 marks the designer's Web design knowledge; R.Q. 3 marks the open step from the search interface to a conceptual-model-based interface (recovering the DB schema).]
42. THANK YOU!
Suggestions, comments, thoughts, ideas, questions...

Acknowledgements: To my Prospectus Committee members.

References: [1] to [42] (in the prospectus report).


Prospectus presentation

  • 1. 1 UNDERSTANDING DEEP WEB SEARCH INTERFACES Prospectus-Presentation April 03 2009 Ritu Khare
  • 2. This presentation uses this space Presentation Order for writing additional facts. 2  Problem Statement  The Deep Web & Challenges Problem: Understanding Semantics  Search Interface Understanding of Search Interfaces  Challenges and Significance  Literature Review Results  Settings How do existing approaches  Reductionist Analysis solve this problem?  Holistic Analysis  Research Questions and Design Ideas What are the  Techniques Vs Heterogeneity research gaps? How to fill them?  Semantics and Artificial Designer
  • 3. 3 PROBLEM STATEMENT The Deep Web Challenges in Accessing Deep Web Doors of Opportunity The SIU Process About the Stages of the Process Why SIU is Challenging? Why SIU is Significant?
  • 4. HAS MANY OTHER NAMES!! Hidden Web, Dark Web, The Deep Web Invisible Web, Subject- specific databases, Data- intensive Web sites. 4  What is DEEP WEB? WEB  The portion of Web SURFACE (as seen by resources that is not WEB search engines) returned by search ALMOST engines through VISIBLE WEB DEEP traditional crawling and WEB indexing.  Where do the contents LIE?  Online Databases
  • 5. HAS MANY OTHER NAMES!! Hidden Web, Dark Web, The Deep Web Invisible Web, Subject- specific databases, Data- intensive Web sites. 5  How are the contents ACCESSED?  By filling up HTML forms on search interfaces.  How are they PRESENTED to users?  Dynamic Pages /Result Pages /Response Pages
  • 6. QUICK FACT! Challenges Deep Web includes 307,000 sites 450,000 databases in Accessing Deep Web Contents 1,258,000 interfaces 6 500 times more than Increase of 3-7 that of the rest of times from 2000- the Web 2004 (He et al., 2007a). (BrightPlanet.com, 2001). visits several manually reconciles The deep Web interfaces before information obtained remains invisible finding the right from diff. sources. on the Web information. Alternative approaches: Not Scalable Invisible Web directories and Search engine browse directories cover only 37% of the deep Web (He et al., 2007a).
  • 7. INTERESTING FACT! Opportunities There exist at least 10 million high quality HTML forms on the in Accessing Deep Web Contents deep Web 7  HTML Forms on search interfaces provide a useful way of discovering the underlying database structure.  The labels attached to fields are very expressive and meaningful.  Instructions for users to enter data may provide information on data constraints (such as range of data, domain of data), and integrity constraints (mandatory /optional attributes).  In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
  • 8. Search Interface Understanding (SIU) Process 8 Search B. Parsing Online Interface DB (Input) A. Representation Manually C. Segmentation Tagged Search E. Evaluation Interface D. Segment Processing System-Tagged Search Interface Extracted (Output) DB The SIU process is challenging because search interfaces are designed autonomously by different designers and thus, do not have a standard structure (Halevy, 2005).
  • 9. This stage builds A. Representation and Modeling up the foundation for the process 9  This stage formalizes the information to be extracted from a search interface  interface components: Any text or form element  semantic label: Meaning of a component from a user’s standpoint.  segment: a composite component formed with a group of related components  segment label: semantic label of the segment
  • 10. This stage builds A. Representation and Modeling up the foundation for the process 10  Zhang et al. (2004) represent an interface as a list of query conditions  Segment Label = Query Condition Attribute-name Operator Value  Segment consists of following semantic labels:  An attribute name  Operator  Value
  • 11. This stage is the first task physically B. Parsing performed on the interface. 11  The interface is parsed into a workable memory structure. It can be done in two modes: by reading the HTML source code; by rendering the page on a Web browser either manually or automatically using a visual layout engine.
  • 12. This stage is the first task physically B. Parsing performed on the interface. 12  He et al. (2007b) parse the interface into an interface expression (IEXP) with constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row limiter).  IEXP for figure is:  t|te|teee
  • 13. Techniques used : Rules C. Segmentation Heuristics Machine Learning. 13  A segment has a semantic existence but no physically defined boundaries making this stage a challenging one.  Grouping of semantically related components. (A sub-problem is to associate a surrounding text with a form element)  Assignment of semantic labels to components
• 14. C. Segmentation (continued).
- He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text label and a form element that lie on the same line are likely to belong to one segment. In the figure, the three components "Gene Name", the radio button with options 'Exact Match' and 'Ignore Case', and the textbox belong to one segment. [Figure annotations: Logical Attribute; Attribute-label; Constraint Element; Domain Element.]
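A rough illustration of the same-row heuristic (a simplification, not LEX itself; the component representation below is assumed): components are grouped into candidate segments by the row on which they appear.

```python
from collections import defaultdict

# Each parsed component carries its row index (e.g. derived from the IEXP row
# delimiters or from the rendered y-coordinate) and a kind: 'text' or 'element'.
components = [
    {"row": 0, "kind": "text",    "value": "Gene Name"},
    {"row": 0, "kind": "element", "value": "radio:Exact Match/Ignore Case"},
    {"row": 0, "kind": "element", "value": "textbox"},
    {"row": 1, "kind": "text",    "value": "Organism"},
    {"row": 1, "kind": "element", "value": "select"},
]

def group_by_row(components):
    """Same-row heuristic: components on one line form one candidate segment."""
    segments = defaultdict(list)
    for c in components:
        segments[c["row"]].append(c)
    return list(segments.values())

for segment in group_by_row(components):
    label = next((c["value"] for c in segment if c["kind"] == "text"), None)
    elements = [c["value"] for c in segment if c["kind"] == "element"]
    print(f"segment label: {label!r} -> elements: {elements}")
```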
• 15. D. Segment Processing. Techniques used: rules, heuristics, and machine learning.
- In this stage, each segment is further tagged with additional meta-information regarding itself and its components, and the extracted information is post-processed (normalization, stemming, removal of stop words).
- He et al. (2007b)'s LEX extracts meta-information about each extracted segment using a Naïve Bayes classification technique.
- The extracted information for a segment includes domain type (finite, infinite); unit (miles, sec); value type (numeric, character, etc.); and layout order position (in the IEXP).
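For intuition only (this is not LEX's actual classifier or feature set), the following sketch trains a Naïve Bayes classifier with scikit-learn to predict a segment's value type from its label text; the training examples are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: segment label text -> value type (entirely illustrative).
labels = ["Price from", "Max mileage", "Year of publication",
          "Author name", "Title keywords", "City or ZIP code"]
value_types = ["numeric", "numeric", "numeric",
               "character", "character", "character"]

clf = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    MultinomialNB())
clf.fit(labels, value_types)

# Toy model: likely ['numeric', 'character'], though results may vary.
print(clf.predict(["Minimum price", "Director name"]))
```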
• 16. E. Evaluation. An approach is usually tested on a set of interfaces belonging to a particular domain.
- How accurate is the extracted information?
- The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface.
- The results are evaluated using standard metrics (precision, recall, accuracy, etc.).
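A minimal sketch of this comparison, assuming extracted segments can be reduced to comparable tuples (the representation below is illustrative):

```python
def precision_recall_f1(system_segments, manual_segments):
    """Compare system-extracted segments against the manually tagged gold standard."""
    system, gold = set(system_segments), set(manual_segments)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("Title", "contains", "textbox"), ("Price", "less than", "textbox")}
extracted = {("Title", "contains", "textbox"), ("Price", "equals", "textbox")}
print(precision_recall_f1(extracted, gold))   # (0.5, 0.5, 0.5)
```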
• 17. Why SIU is Significant? SIU is a prerequisite for several advanced deep Web applications.
- Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on their goals:
- To increase content visibility on search engines: building a dynamic-page repository (Raghavan and Garcia-Molina, 2001); building a database content repository (Madhavan et al., 2008).
- To increase domain-specific usability: meta-search engines (Wu et al., 2004; He et al., 2004; Chang, He and Zhang, 2005; Pei et al., 2006; He and Chang, 2003; Wang et al., 2004).
- To attain knowledge organization: derivation of ontologies (Benslimane et al., 2007).
- These solutions can only be materialized by leveraging the opportunities provided by search interfaces.
• 18. LITERATURE REVIEW RESULTS: Review Settings, Reductionist Analysis, Holistic Analysis, Progress Made.
• 19. Literature Review Process. ALIAS: for quick reference, each work is assigned an alias.
- Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web:
1. Raghavan and Garcia-Molina (2001): LITE
2. Kalijuvee et al. (2001): CombMatch
3. Wu et al. (2004): FieldTree
4. Zhang et al. (2004): HSP
5. Shestakov et al. (2005): DEQUE
6. Pei et al. (2006): AttrList
7. He et al. (2007): LEX
8. Benslimane et al. (2007): FormModel
9. Nguyen et al. (2008): LabelEx
• 20. Literature Review Process (continued). DIMENSION: facilitates comparison among different works by placing them under the same umbrella.
- The review was done in two phases:
- Reductionist Analysis: the works were decomposed into small pieces. Each work was visualized as a two-dimensional grid whose horizontal sections refer to stages of the SIU process; for each stage, the works were analyzed along vertical degrees of analysis known as stage-specific dimensions.
- Holistic Analysis: each work was studied in its entirety within a big-picture context. Composite dimensions were created out of the stage-specific dimensions.
• 21. Reductionist Analysis: Representation.
Work | Segment: Segment Contents | Text Label : Form Element | Meta-information
HSP | Conditional pattern: Attribute-name, Operator*, and Value+ | 1:M | -
DEQUE | Field segment: f, Name(f), Label(f) (f = field) | 1:1 | JavaScript functions; visible and invisible values; Subinfo(F) = {action, method, enctype}; Iset(F) = initial field set that can be submitted without completing the form; domain(f); type(f) (F = form)
AttrList | Attribute: Attribute-name, description, and form element | 1:1 | Domain information for each attribute (set of values and data types)
LEX | Logical attribute Ai: attr-label L, list of domain elements {Ej, …, Ek}, and element labels (Ai = ith attribute) | 1:M (1:1 for element label : form element) | Site information and form constraint; Ai = (P, U, Re, Ca, DT, DF, VT), where P = layout order position, U = unit, Re = relationship type, Ca = domain element constraint, DT = domain type, DF = default value, VT = value type; Ei = (N, Fe, V, DV), where N = internal name, Fe = format, V = set of values, DV = default value
• 22. Reductionist Analysis: Parsing.
Work | Input Mode | Basic Step: Description | Cleaning Up | Resulting Structure
LITE | HTML source code | Pruning: isolate elements that directly influence the layout of form elements and labels | Discard images and ignore styling information such as font size, font style, and style sheets | Pruned page and visual interface
CombMatch | HTML source code | Chunk partitioning, and finding meta-information about each chunk: find bounding HTML tags, text strings delimited by table-cell tags, etc. | Stop phrases ("optional", "required", "*") and text-formatting HTML tags | Chunk list and table index list; each chunk is represented as an 8-tuple describing meta-information
DEQUE | HTML text | Preparing the form database: a DOM tree is created for each FORM element | Ignore font size, typefaces, and styling information | Pruned tree and visual interface
LEX | HTML source code | Interface expression generation: t = text, e = element, '|' = row delimiter (<BR>, <P>, or </TR>) | - | String (IEXP)
• 23. Reductionist Analysis: Segmentation.
Work | Problem Description | Segmentation Criteria | Technique
CombMatch | Assigning a text label to an input element | Combination of string-similarity and spatial-similarity algorithms | Heuristics (string properties, proximity, and layout)
HSP | Finding the 3-tuple <attribute name, operators, values> | Grammar (set of rules) based on productions and preferences | Rules (best-effort parser to build a parse tree)
LEX | Assigning text labels to attributes, and assigning element labels to domain elements | Ending colon, textual similarity with the element name, vertical alignment, distance, preference to the current row | Heuristics (string properties, layout, and proximity)
LabelEx | Assigning a text label to a form element | Classifiers (Naïve Bayes and decision tree); features considered include spatial features, element type, font type, internal similarity, alignment, label placement, and distance | Supervised machine learning
• 24. Reductionist Analysis: Segment Processing.
Work | Technique for Extracting Meta-information | Post-processing
HSP | - | The Merger module reports conflicting tokens (those that occur in two query conditions) and missing tokens (those that do not occur in any query condition)
LEX | Naïve Bayesian classification (supervised machine learning) | Removal of meaningless stopwords (the, with, any, etc.)
FormModel | Learning by examples (machine learning) | -
LabelEx | - | Heuristics to reconcile multiple labels assigned to an element and to handle dangling elements
• 25. Reductionist Analysis: Evaluation.
Work | Test Domain | Yahoo Subject Category | Compared With | Metrics
LITE | Semiconductor industry, database technology | Science, Entertainment, Movies, Computers & Internet | CombMatch (in terms of methodology) | Accuracy
HSP | Airfare, automobile, book, job, real estate, car rental, hotel, movies, music records | Business & Economy, Recreation & Sports, Entertainment | 4 datasets from different sources collected by the authors | Precision, recall
LabelEx | Airfare, automobiles, books, movies | Business & Economy, Recreation & Sports, Entertainment | Barbosa et al. (2007)'s and HSP's datasets; classifier ensemble with or without mapping reconciliation (MR); generic classifier vs. domain-specific classifier; generic classifier with MR vs. domain-specific classifier with MR; HSP and LEX (in terms of methodology) | Recall, precision, F-measure
• 26. Holistic Analysis.
Work | Type of Semantics | Techniques | Human Involvement | Target Application
LITE | Partial form capabilities (label associated with form element) | Heuristics | None | Deep Web crawler (search engine visibility)
HSP | Query capability (attribute name, operator, and values) | Rules | Manual specification of grammar rules | Meta-searchers (domain-specific usability)
LEX | Components belonging to the same logical attribute (labels and form elements); meta-information | Heuristics; supervised machine learning | None; training data for the classifier | Meta-searchers (domain-specific usability)
FormModel | Structural units (groups of fields belonging to the same entity); partial form capabilities (label associated with form element); meta-information | Not reported; heuristics; supervised machine learning | Unknown; none; training data for learning by examples | Ontology derivation (knowledge organization)
LabelEx | Partial form capabilities (label associated with form element) | Supervised machine learning | Classifier training data was manually tagged | Deep Web in general (search engine visibility, domain-specific usability)
• 27. Progress Made.
- SEMANTICS modeled and extracted (Stages A and B): from merely stating what we see, to stating what is meant by what we see; from merely associating labels with form elements, to discovering query capabilities; from no meta-information to a lot of meta-information that may be useful for the target application.
- TECHNIQUES employed (Stages C and D): a mild transition from naïve techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning).
- DOMAINS explored (Stage E): only commercial domains so far (books, used cars, movies, etc.); non-commercial domains remain unexplored (yahoo.com subject categories such as Regional, Society and Culture, Education, Arts and Humanities, Science, Reference, and others).
• 28. RESEARCH QUESTIONS: Techniques vs. Design Heterogeneity; Techniques vs. Domain Heterogeneity; Simulating a Human Designer.
• 29. Research Questions (derived from the holistic and reductionist analyses).
- R.Q.#1 Technique vs. Design Heterogeneity: what is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
- R.Q.#2 Technique vs. Domains: how can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches?
- R.Q.#3 Simulating a Human Designer: how can we make a machine understand an interface in the same way a human designer does?
• 30. Research Question #1: what is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces? (Technique is a dimension of the Segmentation and Segment Processing stages.) Elaborating the question:
- Techniques: rules, heuristics, and machine learning.
- Design: the arrangement of interface components.
- Handling heterogeneity in design: being able to perform the following tasks for any kind of design: segmentation, i.e., semantic tagging and grouping (label assignment is a part of grouping), and segment processing.
• 31. Research Question #1 (continued). [Figure: example search interfaces illustrating design heterogeneity in the automobile and movie domains, annotated with components such as attribute-name, operator, and operand.]
• 32. Research Question #1 (continued): this question has been only partially explored. Existing efforts to answer:
- A 2002 study (Kushmerick, 2002) suggests the superiority of machine-learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
- A 2008 study (Nguyen et al., 2008) compared the label-assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine-learning-based (LabelEx). The machine-learning technique outperformed the other two.
• 33. Investigating R.Q.#1: Technique vs. Design Heterogeneity. Tasks to test: segmentation (grouping and semantic tagging) and segment processing. However, there is NO comparative study in terms of overall grouping, semantic tagging, and segment processing.
Experiment Description | Evaluation Metrics | Result | Compared With | Improvement
A machine-learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset belonging to the biology domain | Grouping accuracy (label assignment included) | 86% | Heuristic-based state-of-the-art approach LEX | 10%
(same HMM technique) | Semantic-tagging accuracy | 90% | A heuristic-based algorithm designed for comparison | 17%
Planned: compare segmentation performance (machine learning vs. rule-based; various machine-learning techniques, e.g., classification vs. HMM vs. …) and compare segment-processing performance (rules vs. heuristics vs. machine learning).
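To illustrate how an HMM can tag a sequence of interface components with semantic labels (a toy model: the states echo the ones used in this work, but the probabilities and observation symbols below are made up, not the trained model from these experiments):

```python
import numpy as np

# Hidden states (semantic labels) and observations (coarse component types).
states = ["Attribute-name", "Operator", "Operand", "Text-trivial"]
observations = ["text", "select", "textbox", "checkbox"]

# Toy parameters: start, transition, and emission probabilities (made up).
start = np.array([0.7, 0.05, 0.05, 0.2])
trans = np.array([   # rows: from state, columns: to state
    [0.05, 0.45, 0.45, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.60, 0.05, 0.15, 0.20],
    [0.70, 0.05, 0.05, 0.20],
])
emit = np.array([    # rows: state, columns: observation symbol
    [0.90, 0.02, 0.03, 0.05],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.30, 0.45, 0.20],
    [0.85, 0.05, 0.05, 0.05],
])

def viterbi(obs_sequence):
    """Most likely semantic-label sequence for an observed component sequence."""
    obs = [observations.index(o) for o in obs_sequence]
    n, T = len(states), len(obs)
    delta = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans * emit[:, obs[t]]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return [states[s] for s in reversed(path)]

# A row like "Price  [less than v]  [____]" observed as text, select, textbox:
print(viterbi(["text", "select", "textbox"]))
# -> ['Attribute-name', 'Operator', 'Operand'] with these toy parameters
```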
• 34. Investigating R.Q.#1: Technique vs. Design Heterogeneity (human intervention is a dimension of the holistic analysis). There is NO comparative study measuring the human intervention required by these techniques.
Experiment Description | Evaluation Metrics | Result | Compared With
Monitoring human intervention (IN PROGRESS): rule-based techniques require manual crafting of rules, heuristics require manual observations, and machine learning requires manual tagging | - | - | Rule-based vs. heuristics vs. machine learning
The HMM was trained using the unsupervised training algorithm Baum-Welch | P(O|λ) | Not promising | -
- Designing unsupervised techniques.
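For reference, P(O|λ) is the likelihood of an observation sequence O under a model λ, computed with the forward algorithm; a self-contained toy sketch (all parameters invented, not the trained model from this work):

```python
import numpy as np

# Toy 2-state HMM (states: "label text", "form element"); all numbers are made up.
start = np.array([0.8, 0.2])
trans = np.array([[0.3, 0.7],
                  [0.6, 0.4]])
emit = np.array([[0.9, 0.1],     # observation symbols: 0 = text, 1 = element
                 [0.2, 0.8]])

def forward_likelihood(obs):
    """P(O | lambda): likelihood of an observation sequence via the forward algorithm."""
    alpha = start * emit[:, obs[0]]              # initialization
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]     # induction
    return float(alpha.sum())                    # termination

# Likelihood of observing the sequence text, element, element under this model:
print(forward_likelihood([0, 1, 1]))
```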
• 35. Research Question #2: how can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches? (The domain tested is a dimension of the Evaluation stage.) Elaborating the question:
- Domain heterogeneity: the deep Web is heterogeneous in terms of domains, i.e., it has databases belonging to all 14 subject categories of Yahoo (Arts & Humanities, Business and Economy, Computers and Internet, Education, Entertainment, Government, Health, etc.).
- How can generic approaches that work for many domains be designed?
- How do interface designs differ across domains?
- Which technique should be employed?
• 36. Research Question #2 (continued); the deep Web has a balanced domain distribution. Existing efforts to answer:
- 2004: a single grammar (rule-based) yields reasonably good segmentation performance (grouping and semantic tagging) for all domains (Zhang et al., 2004). Higher accuracy can be attained with domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
- 2008: for label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).
- Still missing: a comparison of domain-specific and generic approaches for overall segmentation performance; the design differences across domains; generic approaches that yield equally good results for as many domains as possible.
• 37. Investigating R.Q.#2: design tendencies of designers from different domains are different. [Figure: per-domain transition diagrams over the states Text-trivial, Attribute-name, Operator, and Operand for the Movie, References & Education, Biology, and Automobile domains; the transition probabilities differ markedly across the four domains.]
• 38. Investigating R.Q.#2: Technique vs. Domain (all experiments done using the machine-learning technique, HMM).
Domain | Experiment Description | Evaluation | Winner (Improvement)
Movie | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Generic HMM (4.4%)
Ref & Edu | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (7%)
Automobile | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (8%)
Biology | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (36%)
What is the correlation between design topology and the performance of a domain-specific model?
• 39. Research Question #3: how can we make a machine understand an interface and extract semantics from it in the same way a human designer does?
- A human designer or user naturally understands the design and semantics of an interface based on visual cues and prior experience.
- A machine cannot really "see" an interface and does not have any implicit Web-search experience (how much do visual layout engines assist?).
- Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
- How can we reconcile these differences?
• 40. Investigating R.Q.#3: Simulating a Human Designer. Existing methods have been able to understand the design, attach semantic labels, and derive segments and query capabilities.
- Hypothesis: a machine can be made to understand an interface in the same way a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.
[Figure: from a search interface, the machine understands the design, derives segments, attaches semantic labels, and derives query capabilities; the designer/modeler, who holds Web design knowledge and designs from a conceptual model, motivates the further step of recovering the DB schema and conceptual model.]
Extracting the DB schema and conceptual model is still an open question.
• 41. Connecting the dots. [Figure: the three research questions connected along one pipeline, from a search interface (understand the design, derive segments, attach semantic labels, derive query capabilities) through Web design knowledge and the designer to recovering the DB schema and conceptual model, moving toward a conceptual-model-based interface.]
• 42. THANK YOU! Suggestions, Comments, Thoughts, Ideas, Questions… Acknowledgements: to my Prospectus Committee members. References: [1] to [42] (in the prospectus report).