Prospectus presentation



  1. UNDERSTANDING DEEP WEB SEARCH INTERFACES. Prospectus Presentation, April 03, 2009. Ritu Khare
  2. Presentation Order
      Sidebar: This presentation uses this space for writing additional facts.
      • Problem Statement. Problem: understanding the semantics of search interfaces.
          - The Deep Web & Challenges
          - Search Interface Understanding
          - Challenges and Significance
      • Literature Review Results. How do existing approaches solve this problem?
          - Settings
          - Reductionist Analysis
          - Holistic Analysis
      • Research Questions and Design Ideas. What are the research gaps? How to fill them?
          - Techniques Vs Heterogeneity
          - Semantics and Artificial Designer
  3. PROBLEM STATEMENT: The Deep Web; Challenges in Accessing Deep Web; Doors of Opportunity; The SIU Process; About the Stages of the Process; Why SIU is Challenging?; Why SIU is Significant?
  4. The Deep Web
      HAS MANY OTHER NAMES!! Hidden Web, Dark Web, Invisible Web, Subject-specific databases, Data-intensive Web sites.
      • What is DEEP WEB? The portion of Web resources that is not returned by search engines through traditional crawling and indexing.
      • Where do the contents LIE? Online Databases.
      [Figure: the WEB divided into SURFACE WEB (as seen by search engines), ALMOST VISIBLE WEB, and DEEP WEB.]
  5. The Deep Web
      • How are the contents ACCESSED? By filling in HTML forms on search interfaces.
      • How are they PRESENTED to users? As dynamic pages (also called result pages or response pages).
  6. Challenges in Accessing Deep Web Contents
      QUICK FACT! Deep Web includes 307,000 sites, 450,000 databases, and 1,258,000 interfaces.
      • 500 times more than that of the rest of the Web (, 2001).
      • Increase of 3-7 times from 2000-2004 (He et al., 2007a).
      • The deep Web remains invisible on the Web.
      • A user visits several interfaces before finding the right information, and manually reconciles information obtained from different sources.
      • Alternative approaches are not scalable: invisible Web directories and search engine browse directories cover only 37% of the deep Web (He et al., 2007a).
  7. Opportunities in Accessing Deep Web Contents
      INTERESTING FACT! There exist at least 10 million high-quality HTML forms on the deep Web.
      • HTML forms on search interfaces provide a useful way of discovering the underlying database structure.
      • The labels attached to fields are very expressive and meaningful.
      • Instructions for users to enter data may provide information on data constraints (such as range of data, domain of data) and integrity constraints (mandatory/optional attributes).
      • In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
  8. Search Interface Understanding (SIU) Process
      [Figure: the SIU process. Input: the search interface of an online DB. Stages: A. Representation, B. Parsing, C. Segmentation, D. Segment Processing, E. Evaluation. Output: a system-tagged search interface (extracted DB). A manually tagged search interface feeds the Evaluation stage.]
      The SIU process is challenging because search interfaces are designed autonomously by different designers and thus do not have a standard structure (Halevy, 2005).
  9. A. Representation and Modeling
      Sidebar: This stage builds up the foundation for the process.
      • This stage formalizes the information to be extracted from a search interface.
      • Interface component: any text or form element.
      • Semantic label: the meaning of a component from a user's standpoint.
      • Segment: a composite component formed from a group of related components.
      • Segment label: the semantic label of the segment.
  10. A. Representation and Modeling
      • Zhang et al. (2004) represent an interface as a list of query conditions.
      • Segment label = query condition (attribute name, operator, value).
      • A segment consists of the following semantic labels: an attribute name, an operator, and a value.
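The query-condition representation on slide 10 can be sketched as a small data structure. This is only an illustration, not the model of Zhang et al. (2004): the class name `QueryCondition`, its field names, and the sample values are assumptions made for this example. The cardinalities (zero or more operators, at least one value) follow the "Operator*, Value+" pattern listed for HSP in the representation table on slide 21.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryCondition:
    """One segment of a search interface (hypothetical structure)."""
    attribute_name: str                                   # e.g. the visible text label
    values: List[str]                                     # one or more value elements
    operators: List[str] = field(default_factory=list)    # zero or more operators

    def __post_init__(self) -> None:
        # "Value+" means a condition without any value element is not valid.
        if not self.values:
            raise ValueError("a query condition needs at least one value element")


# An interface is then simply a list of query conditions.
interface = [
    QueryCondition("Gene Name", values=["textbox"],
                   operators=["Exact Match", "Ignore Case"]),
    QueryCondition("Organism", values=["selection list"]),
]
```

Keeping the operators optional lets the same structure describe both plain keyword fields and fields with explicit match modes.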
  11. B. Parsing
      Sidebar: This stage is the first task physically performed on the interface.
      • The interface is parsed into a workable memory structure. This can be done in two modes:
          - by reading the HTML source code;
          - by rendering the page in a Web browser, either manually or automatically using a visual layout engine.
  12. B. Parsing
      • He et al. (2007b) parse the interface into an interface expression (IEXP) with the constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row delimiter).
      • The IEXP for the figure is: t|te|teee
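The interface expression on slide 12 can be approximated with Python's standard `html.parser`. This is a rough sketch, not the parser of He et al. (2007b): the class `IexpBuilder`, the function `interface_expression`, and the sample form are assumptions made for illustration. The row delimiters (`<BR>`, `<P>`, `</TR>`) are the ones listed in the parsing table on slide 22. The sample form is hypothetical and was constructed so that its expression matches the string shown on the slide.

```python
from html.parser import HTMLParser

FORM_ELEMENTS = {"input", "select", "textarea"}
CONTAINER_ELEMENTS = {"select", "textarea"}   # their inner text is not a label


class IexpBuilder(HTMLParser):
    """Collects t (text), e (form element) and | (row delimiter) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._inside_element = 0   # depth inside <select>/<textarea>

    def _row_break(self):
        # Avoid leading or repeated delimiters.
        if self.tokens and self.tokens[-1] != "|":
            self.tokens.append("|")

    def handle_starttag(self, tag, attrs):
        if tag in FORM_ELEMENTS:
            self.tokens.append("e")
            if tag in CONTAINER_ELEMENTS:
                self._inside_element += 1
        elif tag in ("br", "p"):
            self._row_break()

    def handle_endtag(self, tag):
        if tag in CONTAINER_ELEMENTS and self._inside_element:
            self._inside_element -= 1
        elif tag == "tr":
            self._row_break()

    def handle_data(self, data):
        # Whitespace-only runs and option texts do not count as text tokens.
        if data.strip() and not self._inside_element:
            self.tokens.append("t")


def interface_expression(html: str) -> str:
    builder = IexpBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.tokens).strip("|")


# Hypothetical three-row form used only for illustration.
SAMPLE_FORM = """
<form><table>
<tr><td>Search genes</td></tr>
<tr><td>Organism</td><td><select name="org"><option>Human</option><option>Mouse</option></select></td></tr>
<tr><td>Gene Name</td><td><input type="radio" name="m"><input type="radio" name="m"><input type="text" name="gene"></td></tr>
</table></form>
"""

print(interface_expression(SAMPLE_FORM))
```

A real parser would also have to decide how to treat hidden inputs, buttons, and images; this sketch counts every `input` as an element.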
  13. C. Segmentation
      Sidebar: Techniques used: Rules, Heuristics, Machine Learning.
      • A segment has a semantic existence but no physically defined boundaries, which makes this stage a challenging one.
      • Grouping of semantically related components. (A sub-problem is to associate a surrounding text with a form element.)
      • Assignment of semantic labels to components.
  14. C. Segmentation
      • He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text and a form element that lie on the same line are likely to belong to one segment. In the figure, the three components (the text "Gene Name", the radio button with the options "Exact Match" and "Ignore Case", and the textbox) belong to one segment.
      [Figure: a logical attribute annotated with its attribute label, constraint element, and domain element.]
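The same-line heuristic described on slide 14 can be sketched as follows. This is a simplification that illustrates only that single heuristic, not LEX itself: the `Component` type, the `group_by_line` function, and the rule that a row without any form element yields no segment are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    kind: str      # "text" or "element"
    content: str   # visible text or a short description of the form element
    row: int       # index of the visual line the component lies on


def group_by_line(components: List[Component]) -> List[List[Component]]:
    """Group components lying on the same line into one candidate segment."""
    rows = defaultdict(list)
    for component in components:
        rows[component.row].append(component)

    segments = []
    for row in sorted(rows):
        members = rows[row]
        # Assumption: a line holding only text (e.g. a heading) is not a segment.
        if any(c.kind == "element" for c in members):
            segments.append(members)
    return segments


components = [
    Component("text", "Search genes", row=0),
    Component("text", "Gene Name", row=1),
    Component("element", "radio: Exact Match / Ignore Case", row=1),
    Component("element", "textbox", row=1),
]
segments = group_by_line(components)
```

Here the three components on row 1 end up in a single segment, mirroring the "Gene Name" example on the slide.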
  15. D. Segment Processing
      • In this stage:
          - Each segment is further tagged with additional meta-information regarding itself and its components.
          - The extracted information is post-processed: normalization, stemming, removal of stop words.
      • LEX (He et al., 2007b) extracts meta-information about each extracted segment using the Naïve Bayes classification technique.
      • The extracted information for a segment includes domain type (finite, infinite); unit (miles, sec); value type (numeric, character, etc.); and layout order position (in the IEXP).
  16. E. Evaluation
      Sidebar: An approach is usually tested on a set of interfaces belonging to a particular domain.
      • How accurate is the extracted information?
      • The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface.
      • The results are evaluated based on standard metrics (precision, recall, accuracy, etc.).
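The comparison described on slide 16 can be sketched with the standard definitions of precision, recall, and F-measure. The representation of a tagged interface as a set of (component, label) pairs, the function name `evaluate`, and the sample tags are assumptions made for this illustration.

```python
from typing import Iterable, Tuple

Tag = Tuple[str, str]   # (component identifier, semantic label)


def evaluate(system_tags: Iterable[Tag], manual_tags: Iterable[Tag]):
    """Return (precision, recall, F-measure) of system tags against manual tags."""
    system, manual = set(system_tags), set(manual_tags)
    correct = len(system & manual)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(manual) if manual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical tags for one interface.
manual = {("c1", "attribute-name"), ("c2", "operator"),
          ("c3", "operand"), ("c4", "attribute-name")}
system = {("c1", "attribute-name"), ("c2", "operand"), ("c3", "operand")}

precision, recall, f_measure = evaluate(system, manual)
```

In this toy case two of the three system tags are correct and two of the four manual tags are recovered, so precision is 2/3 and recall is 1/2.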
  17. Why SIU is Significant?
      SIGNIFICANCE: SIU is a prerequisite for several advanced deep Web applications.
      • Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on their goals:
          - To increase content visibility on search engines
              Building a dynamic page repository: Raghavan and Garcia-Molina (2001)
              Building a database content repository: Madhavan et al. (2008)
          - To increase domain-specific usability
              Meta-search engines: Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005), Pei et al. (2006), He and Chang (2003), and Wang et al. (2004)
          - To attain knowledge organization
              Derivation of ontologies: Benslimane et al. (2007)
      • These solutions can only be realized by leveraging the opportunities provided by search interfaces.
  18. LITERATURE REVIEW RESULTS: Review Settings; Reductionist Analysis; Holistic Analysis; Progress Made
  19. Literature Review Process
      Sidebar: ALIAS: for quick reference, each work is assigned an alias.
      • Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web.
      S.No.  Reference                            Alias
      1      Raghavan and Garcia-Molina (2001)    LITE
      2      Kalijuvee et al. (2001)              CombMatch
      3      Wu et al. (2004)                     FieldTree
      4      Zhang et al. (2004)                  HSP
      5      Shestakov et al. (2005)              DEQUE
      6      Pei et al. (2006)                    AttrList
      7      He et al. (2007)                     LEX
      8      Benslimane et al. (2007)             FormModel
      9      Nguyen et al. (2008)                 LabelEx
  20. Literature Review Process
      Sidebar: DIMENSION: facilitates comparison among different works by placing them under the same umbrella.
      • The review was done in 2 phases:
          - Reductionist Analysis: The works were decomposed into small pieces. Each work was visualized as a 2-dimensional grid where the horizontal sections refer to stages of the SIU process. For each stage, the works were analyzed along vertical degrees of analysis known as stage-specific dimensions.
          - Holistic Analysis: Each work was studied in its entirety within a big-picture context. Composite dimensions were created out of the stage-specific dimensions.
  21. Reductionist Analysis: Representation
      Stages: A: Representation; B: Parsing; C: Segmentation; D: Segment Processing; E: Evaluation
      • HSP
          - Segment and its contents: Conditional Pattern: Attribute-name, Operator*, and Value+
          - Text label to form element: 1:M
      • DEQUE
          - Segment and its contents: Field segment: f, Name(f), Label(f), where f = field
          - Text label to form element: 1:1
          - Meta-information: JavaScript functions; visible and invisible values; Subinfo(F) = {action, method, enctype}; Iset(F) = initial field set that can be submitted without completing a form; domain(f), type(f), where F = form
      • AttrList
          - Segment and its contents: Attribute: Attribute-name, description and form element
          - Text label to form element: 1:1
          - Meta-information: Domain information for each attribute (set of values and data types)
      • LEX
          - Segment and its contents: Logical Attribute Ai: Attr-label L, list of domain elements {Ej,…Ek}, and element labels, where Ai = ith attribute
          - Text label to form element: 1:M (1:1 in case of "element label" to form element)
          - Meta-information: site information and form constraint; Ai = (P, U, Re, Ca, DT, DF, VT), where P = layout order position, U = unit, Re = relationship type, Ca = domain element constraint, DT = domain type, DF = default value, VT = value type; Ei = (N, Fe, V, DV), where N = internal name, Fe = format, V = set of values, DV = default value
  22. Reductionist Analysis: Parsing
      • LITE
          - Input mode: HTML source code and visual interface
          - Basic step: Pruning. Isolate elements that directly influence the layout of form elements and labels.
          - Cleaning up: Discard images; ignore styling information such as font size, font style, and style sheets.
          - Resulting structure: Pruned page
      • CombMatch
          - Input mode: HTML source code
          - Basic step: Chunk partitioning, and finding meta-information about each chunk: find bounding HTML tags, text strings delimited by table cell tags, etc.
          - Cleaning up: Stop phrases ("optional", "required", "*"); text formatting HTML tags.
          - Resulting structure: Chunk list and table index list. Each chunk is represented as an 8-tuple describing meta-information.
      • DEQUE
          - Input mode: HTML text and visual interface
          - Basic step: Preparing form database. A DOM tree is created for each FORM element.
          - Cleaning up: Ignore font size, typefaces, and styling information.
          - Resulting structure: Pruned tree
      • LEX
          - Input mode: HTML source code
          - Basic step: Interface expression string generation: t = text, e = element, | = row delimiter (<BR>, <P>, or </TR>)
  23. Reductionist Analysis: Segmentation
      • CombMatch
          - Problem description: Assigning a text label to an input element
          - Segmentation criteria: Combination of string similarity and spatial similarity algorithms
          - Technique: Heuristics (string properties, proximity and layout)
      • HSP
          - Problem description: Finding the 3-tuple <attribute name, operators, values>
          - Segmentation criteria: Grammar (set of rules) based on productions and preferences
          - Technique: Rules (best-effort parser to build a parse tree)
      • LEX
          - Problem description: Assigning text labels to attributes, and assigning element labels to domain elements
          - Segmentation criteria: Ending colon, textual similarity with element name, vertical alignment, distance, preference to current row
          - Technique: Heuristics (string properties, layout and proximity)
      • LabelEx
          - Problem description: Assigning a text label to a form element
          - Segmentation criteria: Classifiers (Naïve Bayes and Decision Tree). Features considered include spatial features, element type, font type, internal similarity, alignment, label placement, and distance.
          - Technique: Supervised machine learning
  24. Reductionist Analysis: Segment Processing
      • HSP
          - Post-processing: The Merger module reports conflicting tokens (those that occur in two query conditions) and missing tokens (those that do not occur in any query condition).
      • LEX
          - Technique for extracting meta-information: Naïve Bayesian classification (supervised machine learning)
          - Post-processing: Meaningless stopwords (the, with, any, etc.)
      • FormModel
          - Technique for extracting meta-information: Learning by examples (machine learning)
      • LabelEx
          - Post-processing: Heuristics for reconciliation of multiple labels assigned to an element, and to handle dangling elements.
  25. Reductionist Analysis: Evaluation
      • LITE
          - Test domain: Semiconductor industry, movies, database technology
          - Yahoo subject category: Science, Entertainment, Computers & Internet
          - Comparison with: CombMatch (in terms of methodology)
          - Metrics: Accuracy
      • HSP
          - Test domain: Airfare, automobile, book, job, real estate, car rental, hotel, movies, music records
          - Yahoo subject category: Business & Economy, Recreation & Sports, Entertainment
          - Comparison with: 4 datasets from different sources collected by the authors
          - Metrics: Precision, Recall
      • LabelEx
          - Test domain: Airfare, automobiles, books, movies
          - Yahoo subject category: Business & Economy, Recreation & Sports, Entertainment
          - Comparison with: Barbosa et al. (2007)'s and HSP (in terms of datasets); classifier ensemble with or without Mapping Reconciliation (MR); generic classifier vs domain-specific classifier; generic classifier with MR vs domain-specific classifier with MR; HSP and LEX (in terms of methodology)
          - Metrics: Recall, Precision, F-Measure
  26. Holistic Analysis
      • LITE
          - Type of semantics: Partial form capabilities (label associated with form element)
          - Techniques: Heuristics
          - Human involvement: None
          - Target application: Deep Web crawler (search engine visibility)
      • HSP
          - Type of semantics: Query capability (attribute name, operator and values)
          - Techniques: Rules
          - Human involvement: Manual specification of grammar rules
          - Target application: Meta-searchers (domain-specific usability)
      • LEX
          - Type of semantics: Components belonging to the same logical attribute (labels and form elements). Techniques: Heuristics. Human involvement: None.
          - Type of semantics: Meta-information. Techniques: Supervised machine learning. Human involvement: Training data for classifier.
          - Target application: Meta-searchers (domain-specific usability)
      • FormModel
          - Type of semantics: Structural units (groups of fields belonging to the same entity). Techniques: NOT REPORTED. Human involvement: Unknown.
          - Type of semantics: Partial form capabilities (label associated with form element). Techniques: Heuristics. Human involvement: None.
          - Type of semantics: Meta-information. Techniques: Supervised machine learning. Human involvement: Training data for learning by examples.
          - Target application: Ontology derivation (knowledge organization)
      • LabelEx
          - Type of semantics: Partial form capabilities (label associated with form element)
          - Techniques: Supervised machine learning
          - Human involvement: Classifier training data was manually tagged.
          - Target application: Deep Web in general (search engine visibility, domain-specific usability)
  27. Progress Made
      • SEMANTICS modeled and extracted (Stages A and B)
          - from merely stating what we see, to stating what is meant by what we see
          - from merely associating labels with form elements, to discovering query capabilities
          - from no meta-information to a lot of meta-information that might be useful for the target application
      • TECHNIQUES employed (Stages C and D)
          - A mild transition from naïve techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning)
      • DOMAINS explored (Stage E)
          - Only commercial domains: books, used cars, movies, etc.
          - Still unexplored non-commercial domains: subject categories such as regional, society and culture, education, arts and humanities, science, reference, and others
  28. RESEARCH QUESTIONS: Techniques Vs Design Heterogeneity; Techniques Vs Domain Heterogeneity; Simulating a Human Designer
  29. Research Questions
      Sidebar: Derived from Holistic and Reductionist Analysis.
      • R.Q.#1 Technique Vs Design Heterogeneity: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
      • R.Q.#2 Technique Vs Domains: How can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches?
      • R.Q.#3 Simulating a Human Designer: How can we make a machine understand an interface in the same way as a human designer does?
  30. Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
      Sidebar: Technique is a dimension of the stages Segmentation and Segment Processing.
      Elaborating the question:
      • Techniques: Rules, Heuristics, and Machine Learning.
      • Design: Arrangement of interface components.
      • Handling heterogeneity in design: Being able to perform the following tasks for any kind of design.
          - Segmentation: Semantic Tagging; Grouping (Label Assignment is a part of this)
          - Segment Processing
  31. Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
      [Figures: "Heterogeneity: Automobile Domain" and "Heterogeneity: Movie Domain", with annotations including Multiple, Attribute-name, Operator, and Operand.]
  32. Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
      Sidebar: This question has been only partially explored.
      Existing efforts to answer:
      • A 2002 study (Kushmerick, 2002) suggests the superiority of machine learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
      • A 2008 study (Nguyen et al., 2008) compared the label assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine-learning-based (LabelEx). The machine learning technique outperformed the other two.
  33. Investigating R.Q.#1: Technique Vs Design Heterogeneity
      Sidebar: Tasks to test: Segmentation (Grouping, Semantic Tagging); Segment Processing.
      However, there is NO comparative study in terms of overall grouping, semantic tagging, and segment processing.
      Experiment: A machine learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset belonging to the biology domain.
      Evaluation metric                                      Result  Compared with                                Improvement
      Grouping accuracy (label assignment included)          86%     Heuristic-based state-of-the-art approach LEX  10%
      Semantic tagging accuracy                              90%     A heuristic-based algorithm that was designed  17%
      • Compare segmentation performance:
          - Machine learning vs. rule-based
          - Various machine learning techniques
      • Compare segment processing performance:
          - Rules vs. heuristics vs. machine learning
          - Classification vs. HMM vs. …
  34. Investigating R.Q.#1: Technique Vs Design Heterogeneity
      Sidebar: Human intervention is a dimension of holistic analysis.
      There is NO comparative study to measure human intervention in these techniques.
      • Experiment: Monitoring human intervention (IN PROGRESS)
          - Compared: Rule-based vs heuristics vs machine learning
          - Result: Rule-based: manual crafting. Heuristics: manual observations. Machine learning: manual tagging.
      • Experiment: The HMM was trained using the unsupervised training algorithm Baum-Welch.
          - Evaluation metric: P(O|λ)
          - Result: Not promising
      • Designing unsupervised techniques
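The metric P(O|λ) on slide 34 is the likelihood of an observation sequence O under an HMM λ, and it is conventionally computed with the forward algorithm. The sketch below shows that computation in plain Python. The two states and two observation symbols borrow names from the slides, but every probability is invented for illustration and none comes from the presentation.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Compute P(O | lambda) with the forward algorithm."""
    # Initialisation: probability of starting in each state and emitting O[0].
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    # Induction: extend every partial path by one observation at a time.
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    # Termination: sum over all possible final states.
    return sum(alpha.values())


# Hypothetical toy model: hidden semantic labels emitting component types.
states = ["Attribute-name", "Operand"]
start_p = {"Attribute-name": 0.6, "Operand": 0.4}
trans_p = {
    "Attribute-name": {"Attribute-name": 0.7, "Operand": 0.3},
    "Operand": {"Attribute-name": 0.4, "Operand": 0.6},
}
emit_p = {
    "Attribute-name": {"text": 0.5, "textbox": 0.5},
    "Operand": {"text": 0.1, "textbox": 0.9},
}

likelihood = forward_likelihood(["text", "textbox"], states, start_p, trans_p, emit_p)
```

Baum-Welch re-estimates the model parameters iteratively so that this likelihood does not decrease on the training sequences, which is why it can serve as a metric when no manually tagged data is used.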
  35. Research Question #2: How can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches?
      Sidebar: Domain tested is a dimension of the Evaluation stage.
      Elaborating the question:
      • Domain heterogeneity: The deep Web is heterogeneous in terms of domains, i.e. it has databases belonging to all the 14 subject categories of Yahoo (Arts & Humanities, Business and Economy, Computers and Internet, Education, Entertainment, Government, Health, etc.).
      • How to design generic approaches that work for many domains?
      • How do interface designs differ across domains?
      • Which technique should be employed?
  36. Research Question #2: How can we design approaches that work well for arbitrary domains, and thus avoid the need to design domain-specific approaches?
      Sidebar: The deep Web has a balanced domain distribution.
      Existing efforts to answer:
      • 2004: A single grammar (rule-based) generates reasonably good segmentation performance (grouping and semantic tagging) for all domains (Zhang et al., 2004).
          - Higher accuracy can be attained using domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
      • 2008: For label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).
      • Still missing:
          - A comparison of domain-specific and generic approaches for overall segmentation performance
          - The design differences across domains
          - Generic approaches that give equally good results for as many domains as possible
  37. Investigating R.Q. #2
      Sidebar: Design tendencies of designers from different domains are different.
      [Figure: one diagram per domain (Movie, References & Education, Biology, Automobile) connecting the labels Text-trivial, Attribute-name, Operator, and Operand, with numeric weights on the connections.]
  38. Investigating R.Q. #2: Technique Vs Domain
      Sidebar: All experiments were done using the machine learning technique, HMM.
      Each experiment compares a domain-specific HMM with a generic HMM on segmentation accuracy.
      Domain        Winner (improvement)
      Movie         Generic HMM (4.4%)
      Ref & Edu     Domain-specific HMM (7%)
      Automobile    Domain-specific HMM (8%)
      Biology       Domain-specific HMM (36%)
      What is the correlation between design topology and the performance of the domain-specific model?
  39. Research Question #3: How can we make a machine understand the interface and extract semantics from it in the same way as a human designer does?
      A human designer/user naturally understands the design and semantics of an interface based on visual cues and on prior experience.
      • A machine cannot really "see" an interface and does not have any implicit Web search experience. (How much do visual layout engines assist?)
      • Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
      • How can we reconcile these differences?
  40. Investigating R.Q. #3: Simulating a Human Designer
      Sidebar: Existing methods have been able to understand design, attach semantic labels, and derive segments and query capabilities.
      • Hypothesis: A machine can be made to understand the interface in the same way as a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.
      [Figure: starting from the search interface, the steps Understand Design, Attach Semantic Labels, Derive Segments, Derive Query Capabilities, and Recover DB Schema lead toward the Conceptual Model; the Designer/Modeler understands and designs using Web Design Knowledge.]
      Extracting the DB schema and conceptual model is still an open question.
  41. Connecting the dots
      [Figure: the diagram from the previous slide annotated with R.Q. 1, R.Q. 2, and R.Q. 3, linking the search interface, Web design knowledge, the designer, the conceptual model, and a "conceptual model based interface" marked with a question mark.]
  42. THANK YOU! Suggestions, Comments, Thoughts, Ideas, Questions… Acknowledgements: To my Prospectus Committee Members. References: [1] to [42] (in prospectus report).