An Empirical Study on Using Hidden Markov Models for Search Interface Segmentation
  • A very good morning to everyone here. I am Ritu Khare from Drexel University in the USA, presenting our work on using hidden Markov models for search interface segmentation.
  • The presentation is divided into four parts. First, I will describe the research problem. Second, I will describe the proposed solution to this problem. Then, I will discuss the results of the experiments we carried out. Finally, I will summarize the contributions of this work and some future directions.
  • The motivation behind studying this problem is the deep Web, the portion of the Web that is not returned by search engines like Google through crawling and indexing. Its contents lie in online databases that can only be accessed by filling out the HTML forms found on search interfaces like this one. Researchers have suggested many ways to make these hidden contents more visible and useful to Web users, such as designing metasearch engines and increasing the search-engine visibility of deep Web contents. A critical prerequisite of these solutions is a deep understanding of the semantics of search interfaces.
  • Therefore, we study the problem of interface segmentation, which is central to understanding search interface semantics. Simply stated, search interface segmentation means grouping related attributes together. Let's understand this with the help of this interface. It can be divided into two segments, each forming a different implied query: the top segment has seven components and the bottom has four. This example suggests that a segment can have a varied number, format, and pattern of components.
  • Now let's see why this is a challenging problem. A search interface is designed by human designers in such a way that a user quickly recognizes the segments, based on the visual arrangement of components and on her past experience performing searches with interfaces. In a way, segmentation comes very naturally to human users. At the other extreme, a machine cannot "see" a segment, for a couple of reasons. First, components that are visually close on the interface might be located far apart in the machine-readable HTML code. Second, a machine has no cognitive ability to recognize a segment boundary. In this work, we study whether a machine can "learn" how to segment an interface into implied queries.
  • Many past works address the segmentation problem. They are based on rules and heuristics, which makes them unfit for handling diversity and scale. Also, most of them do not group all components of a segment together, i.e., they suffer from under-segmentation. The proposed approach overcomes these shortcomings by taking a deeper, model-based, holistic approach instead of rules. We incorporate the knowledge a designer uses when designing an interface into a model, and use this model for segmentation. In a way, we create an artificial designer who has the ability to segment.
  • The deep Web has a diverse distribution of subject domains, and the design tendencies of designers differ from domain to domain. For interfaces belonging to a given domain, two kinds of methods can be designed: a domain-specific method and a generic method. Say we have an interface I belonging to domain Di. A domain-specific method for this interface is designed by observing interfaces from domain Di only; a generic method for the same interface is designed by observing interfaces from a random mix of domains. Existing works have compared the accuracies of the two methods and suggest that domain-specific methods always yield better performance. Using the model-based approach of hidden Markov models, we look at the domain question from a fresh perspective: instead of two, we devise three kinds of methods and study in detail why a particular method results in higher accuracy than another.
  • This brings us to the second part of the presentation: the proposed solution to this problem.
  • So what exactly is an HMM? It is best understood with the help of this figure, which shows an example HMM. The hidden nodes are the states, and the white nodes are the symbols, or observations, emitted by the states. Two stochastic processes are involved: the process of transitioning from one state to another, and the process of each state emitting symbols. These give the four important elements of an HMM: a finite set of states, a matrix describing the probability of transitioning from one state to another, a finite set of symbols, and a matrix describing the probability of a given state emitting a symbol. HMMs are needed for real-world processes that are unobservable and difficult to interpret, particularly by a machine. An HMM is used to model such a process and also to explain it, i.e., to determine the possible state transitions the process might have undergone to generate a given sequence of observable symbols; the sketch below illustrates both the four elements and this decoding step.
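    To make the four elements concrete, here is a minimal, self-contained Python sketch of an HMM together with Viterbi decoding, the standard way to recover the most likely hidden-state sequence. The state and symbol names echo the talk's setting, but all probabilities are toy values, not the paper's learned parameters.

      import numpy as np

      # Toy HMM: hidden states and observable symbols (illustrative names only).
      STATES  = ["attribute-name", "operator", "operand"]
      SYMBOLS = ["text", "textbox", "selection-list"]

      # A[i, j] = P(next state j | current state i); rows sum to 1.
      A  = np.array([[0.1, 0.3, 0.6],
                     [0.2, 0.1, 0.7],
                     [0.7, 0.2, 0.1]])
      # B[i, k] = P(symbol k | state i); rows sum to 1.
      B  = np.array([[0.8, 0.1, 0.1],
                     [0.3, 0.1, 0.6],
                     [0.1, 0.6, 0.3]])
      pi = np.array([0.8, 0.1, 0.1])   # initial state distribution

      def viterbi(obs, A, B, pi):
          """Return the most likely hidden-state index sequence for obs."""
          T, N = len(obs), A.shape[0]
          delta = np.zeros((T, N))             # best path probability per state
          psi   = np.zeros((T, N), dtype=int)  # back-pointers
          delta[0] = pi * B[:, obs[0]]
          for t in range(1, T):
              for j in range(N):
                  scores = delta[t - 1] * A[:, j]
                  psi[t, j]   = scores.argmax()
                  delta[t, j] = scores.max() * B[j, obs[t]]
          path = [int(delta[-1].argmax())]
          for t in range(T - 1, 0, -1):
              path.append(int(psi[t, path[-1]]))
          return path[::-1]

      # Decode a component sequence: text, textbox, text, selection-list, textbox.
      obs = [SYMBOLS.index(s) for s in
             ["text", "textbox", "text", "selection-list", "textbox"]]
      print([STATES[i] for i in viterbi(obs, A, B, pi)])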
  • Now let's look at a search interface in greater detail. A search interface consists of a sequence of components that belong to different logical groups. Each component in a group plays a different semantic role, which we call its semantic label. In data-intensive Web applications, each search interface, when submitted to the server, is converted to a structured query expression. For example, assuming the underlying database table is named "Gene," the lower segment can be expressed as select * from Gene where Gene_name="maggie". In a way, each segment in a search interface represents a WHERE clause expressing a query condition (see the sketch below). Thus, for this work, we use a set of three semantic labels: attribute name, operator, and operand. Note that segmentation is a two-fold problem: determining the boundaries of the logical groups, and determining the semantic labels of the components in each group.
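    As a small illustration of this segment-to-WHERE-clause mapping, the sketch below assembles a query string from a segment's three labeled components. The table and column names follow the slide's "Gene" example; the explicit "=" operator is an assumption, since the slide's operator is implicit.

      # One segment = one query condition: its labeled components
      # fill the slots of a WHERE clause.
      segment = {"attribute-name": "Gene_name",
                 "operator": "=",
                 "operand": "maggie"}

      query = (f"SELECT * FROM Gene "
               f"WHERE {segment['attribute-name']} {segment['operator']} "
               f"'{segment['operand']}'")
      print(query)   # SELECT * FROM Gene WHERE Gene_name = 'maggie'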
  • Our primary assumption in this work is that the process of search interface design is probabilistic in nature. Consider this interface and think of how a designer might have laid down its components: first an attribute name, then an operand, then again an attribute name, an operator, and an operand. The designer places these labels based on implicit knowledge that is beyond the natural understanding of a machine. All a machine can observe is that there is a text, followed by a textbox, followed by another text, and so on. The machine can observe the components, but the semantic labels remain hidden. Therefore, we believe the interface design process can be modeled and explained using a hidden Markov model; the small generative sketch below illustrates the intuition.
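    To illustrate the "designer as a probabilistic process" intuition, this sketch samples a short interface: it walks the hidden chain of labels and emits a visible component at each step. It reuses the toy STATES, SYMBOLS, A, B, and pi from the earlier sketch, which are illustrative values only.

      import numpy as np

      rng = np.random.default_rng(seed=0)

      # Walk the hidden chain (the designer's labels) and emit a
      # visible component at each step.
      state = rng.choice(len(STATES), p=pi)
      for _ in range(5):
          symbol = rng.choice(len(SYMBOLS), p=B[state])
          print(f"{STATES[state]:15s} -> {SYMBOLS[symbol]}")
          state = rng.choice(len(STATES), p=A[state])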
  • We believe that an HMM can simulate the process of interface design and act like a human designer, who has both the ability to design an interface using implicit knowledge of semantic labels and segment patterns, and the ability to determine the segment boundaries and semantic labels of a previously designed interface. To accomplish segmentation, we encode this implicit designer's knowledge in an HMM-based artificial designer. As we saw earlier, segmentation is a two-fold process: determining semantic labels and determining boundaries. Therefore, we use a layered HMM with two layers: a T-HMM that tags components with apt semantic labels, and an S-HMM that creates boundaries around related groups of components.
  • Here is how the two-layered HMM functions. Consider the same example interface. A machine parser, with no intelligence embedded and no training provided, reads this interface as a raw sequence of components. This becomes the input to the first layer, the T-HMM, which reads the components as a sequence of semantic labels. That sequence in turn becomes the input to the next layer, the S-HMM, which tags each label with its position in a segment, completing the task of segmentation; the pipeline sketch below spells this data flow out.
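    Here is a hedged sketch of the two-layer data flow, reusing the viterbi() function and T-HMM toys from the first sketch. The S-HMM parameters and position-state names below are hypothetical stand-ins, not the paper's learned model; the only property taken from the talk is that the S-HMM's symbol space equals the T-HMM's state space.

      import numpy as np

      # Layer 1 (T-HMM): raw components -> semantic labels.
      raw = ["text", "textbox", "text", "selection-list", "textbox"]
      labels = [STATES[i] for i in
                viterbi([SYMBOLS.index(s) for s in raw], A, B, pi)]

      # Layer 2 (S-HMM): semantic labels -> positions within a segment.
      S_STATES  = ["segment-begin", "segment-inner", "segment-end"]
      S_SYMBOLS = STATES            # the layers are used in tandem
      A2  = np.array([[0.0, 0.6, 0.4],
                      [0.0, 0.5, 0.5],
                      [1.0, 0.0, 0.0]])   # a segment-end starts a new segment
      B2  = np.array([[0.90, 0.05, 0.05],
                      [0.20, 0.40, 0.40],
                      [0.05, 0.15, 0.80]])
      pi2 = np.array([1.0, 0.0, 0.0])     # an interface opens a segment

      positions = [S_STATES[i] for i in
                   viterbi([S_SYMBOLS.index(l) for l in labels], A2, B2, pi2)]
      # Segment boundaries are read off from the position tags.
      print(list(zip(raw, labels, positions)))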
  • Now let us look at the two layers in greater detail. For the T-HMM, the layer that provides semantic labels, the observation symbols are the raw HTML components, such as text labels and the various form elements. The states are the semantic labels discussed earlier: attribute-name, operator, and operand. In an initial analysis of interfaces, we noticed that certain texts found in real-world interfaces belong to none of these three classes; they are instructions for entering data, descriptions, or examples and constraints. Thus we create a fourth state and call it the text-misc state. The topology obtained from a dataset of 50 random interfaces is shown here. For the S-HMM, the observation symbol space is the same as the T-HMM's state space, as the two layers are used in tandem. The states of the S-HMM are the relative positions of each component with respect to a segment. Here is the state transition topology obtained from observing the 50 randomly selected interfaces.
  • Now to the third part of the presentation: the results of the experiments we carried out.
  • The first experiment was conducted on the biology domain, which we found very interesting and relatively unexplored; most domains used by existing works are commercial ones such as movies and books, so we decided to first dive into a non-commercial domain. We applied the two-layered approach to segment 200 interfaces from this domain. Both the training and testing interfaces belong to the biology domain, so this is a domain-specific method. We found the following results: 86% of the segments were correctly identified, and for these correctly determined segments we measured the accuracy of semantic-label identification. Since in many cases there are multiple instances of attribute-names within a single segment, we measured this accuracy in two ways: in 90% of the cases the correct attribute-name label was identified, and in 99% of the cases at least one instance of the attribute-name was correctly identified by the T-HMM. The accuracy attained for all the semantic labels was quite high, except for text-misc, which was misidentified as attribute-name in most cases; we shall work on improving this in the future. To compare the accuracy of our method with an existing heuristic-based approach, LEX, we implemented LEX and tested it on 100 interfaces from each of four domains: two commercial (auto and movie) and two non-commercial (bio and health). Again, these are domain-specific methods. The second column lists the segmentation accuracy obtained by LEX, and the third column lists the improvement in accuracy attained by our method. We attained such results because LEX does not model the text-misc state and suffered from under-segmentation in many cases; LEX's heuristics are also limited in that they assume an attribute name and operand cannot be more than two rows apart in the HTML code, which is contrary to reality in many domains. You might have noticed the fourth column in the comparison table: it represents a variation of the HMM, and it too outperforms LEX in all the domains. Let's look at the different variations of the two-layered HMM that we created by altering the training data.
  • We noticed that there exist differences among interface designs from different domains. Using HMMs, I derived T-HMM topologies for different domains. This figure shows design tendencies in the auto domain; states indicate the semantic labels assigned to components. Similar state transition topologies were created for the four other domains, and the transitions and their values were found to differ across all five domains. For example, several peculiarities can be seen in the auto domain: all domains have some probability of transitioning from operator to attribute-name except the auto domain, and the transition from operand to operator is found only in this domain. HMMs are thus a useful way of studying the differences and preferences of designers in a particular domain. We also created another HMM with interfaces from a mix of all five domains, which we call the mixed model.
  • Using these six variations of HMMs, we conducted 30 experiments, trying all possible combinations of training and testing data. Each cell belongs to one of three kinds of methods. The green cells represent the domain-specific methods, where the training and test data come from the same domain; this is the method we used in our initial experiments. The orange cells represent the generic methods, where the training data is not consciously curated and comes from a mix of domain interfaces. The remaining cells belong to the cross-domain method, where the training data is from domain X and the test data is from domain Y. The numbers in bold represent the highest accuracy attained when testing interfaces in a given domain, and the numbers in italics represent the weakest performance in that domain. We can see that HMMbio gives the highest performance in four out of five domains, and in three of those it acts as a cross-domain method. Looking in greater detail, first consider the patterns captured by domain-specific models. The first example comes from the automobile domain: the domain-specific model HMMauto gives the best performance in the auto domain because this pattern is peculiar to the auto domain and hence wasn't captured by other models. Similarly, in the bio domain a segment pattern was both peculiar and frequent, and hence wasn't captured by models from other domains, resulting in the best performance by the domain-specific model. Now consider patterns captured by cross-domain models. One segment pattern in the health domain was under-segmented by HMMhealth, as it is rare in that domain; however, it was captured by the cross-domain model HMMbio, since in the bio domain it is common to have a text-misc after a textbox within a segment. Another pattern comes from the movie domain: it was incorrectly segmented by HMMmovie, since operators in selection lists are uncommon in the movie domain, but the pattern is common in the bio domain and so was captured by the cross-domain model HMMbio. Contrary to the previous study and to intuition, domain-specific models do not always yield the highest accuracy; for example, in the movie domain, HMMmovie returned 70% accuracy, less than every other model.
  • Although the domains tested are limited, we can draw some general conclusions. First, when a domain has a pattern that is both peculiar and frequent, that pattern can be recovered by the domain-specific model; the bio and auto domains are examples. Second, when a domain D has a rare pattern and there is another domain B in which the same pattern is frequent, that pattern can be recovered by a cross-domain model trained on interfaces from domain B. In short, it is not that domain-specific models always lead to higher accuracy; rather, the model trained on better examples, better in the sense of the frequency of the design patterns in both domains, produces better results.
  • Finally, the last part of the presentation: the contributions of this work and some future directions.
  • We showed that the interface design process is probabilistic in nature and introduced a new approach to interface segmentation. We are the first to apply HMMs to deep Web interfaces. We tested our method across several domains and found that it achieves high accuracy and outshines a contemporary approach in all domains. We also designed different variations of the HMMs and tested them across all domains, reaching an interesting conclusion: a single model can be used to segment interfaces from multiple domains. For example, HMMbio, trained on biology interfaces, outperformed the other models in four out of five domains.
  • In the future, we want to test our method on more domains and derive a minimal set of models that can cover the various domains present on the deep Web. In terms of improvements, we want to represent more complex segments: some segments are intertwined with components of other segments, and certain segments are quite strange, with the attribute name and operands intertwined in, and composed of, a single component. We also want to extract more information about an attribute, such as its data type and integrity constraints. Using HMMs also posed certain limitations: we had to perform manual tagging to prepare the training data, so we want to explore unsupervised learning methods for preparing it. Another problem was time complexity; we want to explore optimization methods to improve the efficiency of this approach, or use this approach as a pre-processing module for other advanced deep Web tasks.
  • Thank you very much for listening so patiently. Please let me know if you have any questions or comments.
  • ×