AN EMPIRICAL STUDY ON USING HIDDEN MARKOV MODEL FOR SEARCH INTERFACE SEGMENTATION. Ritu Khare and Yuan An, The iSchool at Drexel, Drexel University, USA
Presentation Order. Problem: Interface Segmentation. Solution: Hidden Markov Model. Empirical Results. Summing Up.
Part 1. Problem: Interface Segmentation (Search Interface Segmentation; Challenges; Novelty of the Solution). Solution: Hidden Markov Model. Empirical Results. Summing Up.
Motivation: The Deep Web. What is the deep Web? The portion of the Web that is not returned by search engines through traditional crawling and indexing. Its contents lie in online databases and are accessed by manually filling in HTML forms on search interfaces. How can it be made useful? Meta-search engines, e.g. Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005). Deep Web crawlers, e.g. Raghavan and Garcia-Molina (2001), Madhavan et al. (2008). A prerequisite is a thorough understanding of the semantics of search interfaces.
Search Interface Segmentation. A critical part of understanding the semantics of search interfaces: the segmentation of a search interface into logical groups of implied queries, i.e. grouping related interface components together. Example interface: top segment = 7 components, bottom segment = 4 components.
Why is Segmentation Challenging? Human designer / user: a segment has an apparent semantic existence, conveyed by visual arrangements and past experiences. Machine: cannot "see" a segment; visually close components might be located far apart in the HTML code; no cognitive ability. In this paper, we investigate whether a machine can "learn" how to segment an interface.
The Novelty of the Solution: Model-Based. Shortcomings of existing works (Zhang et al., 2004; He et al., 2004; Raghavan and Garcia-Molina, 2001; Kalijuvee et al., 2001): they use rules and heuristics for segmentation, and these techniques have problems handling scalability and heterogeneity. We overcome these shortcomings with a model-based approach: implicit knowledge (used by a designer to design an interface) is encoded in an HMM (an artificial designer), which performs the segmentation.
The Novelty of the Solution: The Domain Aspect. The deep Web has diverse domains, and interface designs differ across domains. To segment an interface I(Di) from a given subject domain Di, two kinds of methods can be designed: a domain-specific method, built from interfaces from domain Di, and a generic method, built from interfaces from a mix of arbitrary domains D1, D2, D3, … Existing works have compared the accuracies attained by the two methods. Fresh perspective: using hidden Markov models, we do not limit ourselves to comparing the two methods. For a given domain, we investigate what kind of training interfaces result in high segmentation accuracy, and why.
Part 2. Problem: Interface Segmentation. Solution: Hidden Markov Model (Search Interface Analysis; HMM: An Artificial Designer; 2-Layered Approach; Model Specification & Architecture). Empirical Results. Summing Up.
What is an HMM? "A finite state automaton with stochastic state transitions and symbol emissions" (Rabiner, 1989). Needed to model and explain the 'real-world processes' that are implicit and unobservable. Diagram: hidden states q0, q1, q2, q3, q4 linked by transitions, each emitting an observable symbol σ0, σ1, σ2, σ3, σ4. 1. State space: a finite set of states {q0, q1, q2, …, qn}. 2. Transition matrix: probability P(qi → qj) of transitioning from state qi to state qj. 3. Symbol space: a set of output tokens {σ1, σ2, …, σm}. 4. Emission matrix: probability P(qi ↑ σk) of state qi emitting the token σk.
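The four HMM elements listed on this slide can be written down as plain data structures. The following is a minimal sketch, not the authors' model: the state and symbol names, all probabilities, and the START distribution (an initial-state distribution the slide does not mention) are hypothetical values chosen for illustration.

```python
# Hypothetical toy HMM; every name and number below is an illustrative assumption.
STATES = ["attribute-name", "operator", "operand"]   # 1. state space (hidden)
SYMBOLS = ["text", "rb-group", "textbox"]            # 3. symbol space (observable)

# Assumed initial-state distribution (not part of the slide's four elements).
START = {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}

# 2. transition matrix: TRANSITION[qi][qj] = P(qi -> qj); each row sums to 1.
TRANSITION = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
    "operator": {"attribute-name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand": {"attribute-name": 0.7, "operator": 0.1, "operand": 0.2},
}

# 4. emission matrix: EMISSION[qi][sigma_k] = P(qi emits sigma_k); each row sums to 1.
EMISSION = {
    "attribute-name": {"text": 0.9, "rb-group": 0.05, "textbox": 0.05},
    "operator": {"text": 0.1, "rb-group": 0.8, "textbox": 0.1},
    "operand": {"text": 0.05, "rb-group": 0.15, "textbox": 0.8},
}


def joint_probability(states, symbols):
    """Probability that the HMM follows `states` while emitting `symbols`."""
    prob = START[states[0]] * EMISSION[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        prob *= TRANSITION[prev][cur] * EMISSION[cur][sym]
    return prob


# Example call: a text label followed by a textbox.
p = joint_probability(["attribute-name", "operand"], ["text", "textbox"])
```

Both stochastic processes appear in the product: one transition term and one emission term per step.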
Search Interface Analysis. For data-driven Web applications, interface components are translated into structured query (e.g. SQL) expressions: SELECT * FROM Gene WHERE Gene_Name = 'maggie'; A segment in a search interface corresponds to a WHERE clause: it collects values, qualified using a built-in operator, for a particular attribute in the DB schema. Segmentation is a two-fold problem: identification of the boundaries of logical groups, and assignment of semantic labels to components. Semantic labels within a logical group: attribute-name, operator, operand.
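The correspondence between a segment and a WHERE clause can be sketched in a few lines. This is an illustration only: the function name segment_to_query, the SQL_OPERATORS mapping, and the dictionary layout of a segment are assumptions made for this example; only the Gene table and Gene_Name attribute come from the slide.

```python
# Hypothetical mapping from interface operator choices to SQL operators.
SQL_OPERATORS = {"equals": "=", "contains": "LIKE"}


def segment_to_query(table, segment):
    """Turn one labeled segment into a parameterized SQL query.

    `segment` is assumed to be a dict with an 'attribute-name', an 'operand'
    (the value the user typed), and an optional 'operator' (default: equals).
    The table and attribute names are interpolated directly, so they must
    come from a trusted schema, never from user input.
    """
    attribute = segment["attribute-name"]
    operator = SQL_OPERATORS[segment.get("operator", "equals")]
    value = segment["operand"]
    if operator == "LIKE":
        value = f"%{value}%"
    sql = f"SELECT * FROM {table} WHERE {attribute} {operator} ?"
    return sql, (value,)


sql, params = segment_to_query(
    "Gene", {"attribute-name": "Gene_Name", "operand": "maggie"}
)
```

The operand is passed as a query parameter rather than pasted into the string, which is the usual way to keep user-entered values out of the SQL text.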
INTERFACE DESIGN PROCESS. While the components are observable, their semantic roles appear hidden to a machine. The succession of one semantic label by another is similar to the transitioning of HMM states. Example: Text (Gene ID) = attribute-name; Textbox = operand; Text (Gene Name) = attribute-name; RB Group = operator; Textbox = operand.
HMM: An Artificial Designer. An HMM can act like a human designer who can design an interface and determine the segment boundaries and semantic labels of components. We encoded the implicit knowledge required for interface segmentation in an HMM-based artificial designer. We employ a 2-layered HMM: the first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand); the second layer, S-HMM, segments the interface into logical attributes.
2-LAYERED HMM. Parser output: Text, Textbox, Text, RB Group, Textbox. T-HMM output: attribute-name, operand, attribute-name, operator, operand. S-HMM output: begin-segment, end-segment, begin-segment, inside-segment, end-segment.
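Once both layers have produced their tags, recovering the segments is a simple grouping step. The sketch below assumes the three sequences from this slide are already available as parallel lists; the function name group_segments is hypothetical, and the code illustrates only how the boundary tags can be consumed, not how the authors implemented it.

```python
def group_segments(components, labels, boundary_tags):
    """Group (component, semantic label) pairs into segments.

    A 'begin-segment' tag opens a new segment, 'inside-segment' extends the
    current one, and 'end-segment' closes it.
    """
    segments, current = [], []
    for component, label, tag in zip(components, labels, boundary_tags):
        if tag == "begin-segment" and current:
            # A new segment starts before the previous one was closed.
            segments.append(current)
            current = []
        current.append((component, label))
        if tag == "end-segment":
            segments.append(current)
            current = []
    if current:
        # Keep a trailing segment that never received an end tag.
        segments.append(current)
    return segments


# The example sequences from the slide.
components = ["Text", "Textbox", "Text", "RB Group", "Textbox"]
labels = ["attribute-name", "operand", "attribute-name", "operator", "operand"]
tags = ["begin-segment", "end-segment", "begin-segment", "inside-segment", "end-segment"]
segments = group_segments(components, labels, tags)
```

For the slide's example this yields two segments, one with two components and one with three.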
MODEL SPECIFICATION: T-HMM & S-HMM. Diagram: training interfaces are used to build T-HMM and S-HMM; test interfaces are converted to symbol sequences, T-HMM turns these into state sequences, and S-HMM outputs the semantic labels and segment boundaries (of the test interfaces).
Part 3. Problem: Interface Segmentation. Solution: Hidden Markov Model. Empirical Results.


An Empirical Study on Using Hidden Markov Models for Search Interface Segmentation

  • 28. INITIAL EXPERIMENTS: Domain-Specific. First experiment (biology domain): dataset of 200 interfaces; cross-validation with 190 training and 10 testing examples; training by the maximum likelihood method; testing with the Viterbi algorithm. Results are reported separately for T-HMM and S-HMM (*for segments with multiple instances of attr-names, at least 1 was correctly identified). Comparison with LEX (He et al. 2007) on 4 domains: dataset of 100 interfaces each. Why did the 2-layered HMM outperform LEX? LEX does not model text-misc and thus suffered from under-segmentation. LEX considers only those texts as attribute-names that are located within 2-top-row distance from the form element. In reality, attribute-name and operand might be located far apart in the source code.
  • 29. HMM Variations: T-HMM Topology. Design preferences of designers differ across domains. Topology panels: Automobile, Biology, Movie, Health, Reference & Edu, Mixed. Transitions with a probability below 5% are not shown.
  • 30. RESULTS. Three kinds of models: 1. domain-specific, 2. generic, 3. cross-domain. Domain-specific models do not always result in the best performance, e.g. in the movie domain. Figure panels: a pattern captured by the domain-specific model; a pattern captured by the cross-domain model (labels: text-misc, health, automobile).
  • 34. CONTRIBUTIONS. Introduction of a 2-layered HMM approach for interface segmentation, motivated by the probabilistic nature of the interface design process. First work to apply HMMs to deep Web search interfaces. Effectiveness test across representative domains of the deep Web: high segmentation accuracy in most domains; outperformed a previous approach, LEX, by at least 10% in most cases. Design and comparison of various learning models: a single model has the potential to accurately segment interfaces from multiple domains, provided it is trained on data having an appropriate variety and frequency of design patterns. An example is HMMbio, which performed better than other models on 80% of the tested domains. The variety and frequency of patterns in the biology domain help HMMbio contain more design knowledge and be a smarter designer.
  • 35. FUTURE WORK. Design a minimal set of models that reaches as many deep Web domains as possible: involve more domains; each model should return higher accuracy than its domain-specific counterpart. Transition to a new interface representation scheme: distributed segments and segments with intertwined components. Recover the schema of deep Web databases: extract finer details, such as data types and constraints. Overcome the challenges posed by HMMs: manual tagging of training data (explore unsupervised training methods such as the Baum-Welch algorithm); time taken by the Viterbi algorithm for state recovery (find optimization techniques to improve efficiency). Use this method as an off-line pre-processing module for other applications such as meta-search engines and deep Web crawlers.
  • 36. Suggestions, Thoughts, Ideas, Questions… THANK YOU! Acknowledgements: To the anonymous reviewers of CIKM 2009. References: [1] to [23] (in full paper).
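The initial-experiments slide names two standard procedures: maximum likelihood training on manually tagged interfaces and Viterbi decoding on test interfaces. The sketch below shows one common way to do both for a single HMM layer. It is not the authors' implementation: the function names, the toy TRAINING data, and the small additive smoothing term (added here only to avoid taking log(0)) are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_ml(sequences, smoothing=1e-6):
    """Estimate HMM parameters by counting over tagged (symbol, state) sequences.

    `smoothing` is an assumption of this sketch, not part of plain maximum
    likelihood; it keeps every probability strictly positive.
    """
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for symbol, state in seq:
            emit[state][symbol] += 1
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev][cur] += 1
    states = sorted(emit)
    symbols = sorted({symbol for seq in sequences for symbol, _ in seq})

    def normalize(counts, keys):
        total = sum(counts[k] for k in keys) + smoothing * len(keys)
        return {k: (counts[k] + smoothing) / total for k in keys}

    return {
        "states": states,
        "start": normalize(start, states),
        "trans": {s: normalize(trans[s], states) for s in states},
        "emit": {s: normalize(emit[s], symbols) for s in states},
    }


def viterbi(symbols, model):
    """Most probable state sequence for `symbols` (all must occur in training)."""
    states = model["states"]
    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (math.log(model["start"][s]) + math.log(model["emit"][s][symbols[0]]), None)
             for s in states}]
    for symbol in symbols[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(model["trans"][p][s])) for p in states),
                key=lambda item: item[1],
            )
            row[s] = (score + math.log(model["emit"][s][symbol]), prev)
        best.append(row)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical tagged training interfaces: (component type, semantic label) pairs.
TRAINING = [
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("rb-group", "operator"), ("textbox", "operand")],
    [("text", "attribute-name"), ("textbox", "operand"), ("text", "attribute-name"),
     ("rb-group", "operator"), ("textbox", "operand")],
]
model = train_ml(TRAINING)
decoded = viterbi(["text", "textbox", "text", "rb-group", "textbox"], model)
```

Working in log space avoids numerical underflow on long component sequences. The decoding cost grows with the sequence length times the square of the number of states, which is the Viterbi running time the future-work slide refers to.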

Editor's Notes

  1. A very good morning to everyone here. I am Ritu Khare from Drexel University in the USA, presenting our work on using hidden Markov models for search interface segmentation.
  2. The presentation is divided into 4 parts. First, I will describe the research problem. Second, I will describe the proposed solution to this problem. Then, I will talk about the results of the experiments we carried out. Finally, I will specify the contributions of this work and some future directions.
  4. The motivation behind studying this problem is the deep Web. The deep Web is that portion of the Web that is not returned by search engines like Google through crawling and indexing. The contents of the deep Web lie in online databases that can only be accessed by filling in HTML forms on search interfaces like this one. Researchers have suggested many ways to make these hidden contents more useful and visible to Web users, such as designing metasearch engines and increasing the search engine visibility of deep Web contents. A critical prerequisite of these solutions is a deep understanding of the semantics of search interfaces.
  5. Therefore, we are studying the problem of interface segmentation, which is very important in understanding search interface semantics. Simply stated, search interface segmentation means grouping related attributes together. Let's understand this with the help of this interface. It can be divided into 2 segments, where each one forms a different implied query. The top segment has 7 components, and the bottom has 4 components. This example suggests that a segment can have a varied number, format, and pattern of components.
  6. Now let's see why this is a challenging problem. A search interface is designed by human designers in such a way that a user quickly recognizes the segments based on the visual arrangement of components and on her past experience in performing searches using interfaces. In a way, segmentation comes very naturally to human users. At the other extreme, a machine cannot "see" a segment, for a couple of reasons. First, components that are visually close on the interface might be located far apart in the machine-readable HTML code. Second, a machine has no cognitive ability to recognize a segment boundary. In this work, we are studying whether a machine can "learn" how to segment an interface into implied queries.
  7. There have been many works in the past that address the segmentation problem. These are based on rules and heuristics, which makes them unfit for handling diversity and scalability. Also, most of them do not group all components of a segment together, i.e. they suffer from under-segmentation. The proposed approach helps in overcoming these shortcomings by taking a deeper approach to the segmentation problem. As opposed to rules, we adopt a model-based, holistic approach. We incorporate the knowledge used by a designer for designing an interface into a model and use this model for segmentation. In a way, we create an artificial designer who has the ability to segment.
  8. The deep Web has diverse distributions of subject domains, and the design tendencies of designers from different domains also differ from each other. For interfaces belonging to a given domain, 2 kinds of methods can be designed: a domain-specific method and a generic method. Let's say we have an interface I belonging to domain Di. A domain-specific method for this interface will be designed by observing interfaces from domain Di only. A generic method for the same interface will be designed by observing interfaces from a random mix of domains. Existing works have compared the accuracies of the two methods and suggest that domain-specific methods always result in better performance. Using the model-based approach of hidden Markov models, in this work we look at the domain question from a fresh perspective. Instead of 2, we devise 3 kinds of methods and study in detail why a particular method results in higher accuracy than another.
  9. The presentation is divided into 4 parts. First, I will describe the research problem. Second, I will describe the proposed solution to this problem. Then, I will talk about the results of the experiments we carried out. Finally, I will specify the contributions of this work and some future directions.
  10. So what exactly is an HMM? It can be best understood with the help of this figure, which shows an example HMM. The hidden nodes are the states, and the white nodes are the symbols, or observations, emitted by the states. There are two stochastic processes involved here. One is the process of transition from one state to another. The second is the process of emission of symbols by each state. These are the 4 important elements of an HMM: a finite set of states, a matrix that describes the probability of transition from one state to another, a finite set of symbols, and a matrix that describes the probability of emission of a symbol by a given state. HMMs are needed in the context of real-world processes that are unobservable and difficult to interpret, particularly by a machine. An HMM is used to model such processes and also to explain them, i.e. to determine the possible state transitions the process might have undergone to generate a given sequence of observable symbols.
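The four elements above, and the task of explaining an observed sequence, can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the state and symbol names are borrowed from the later slides, every probability value is made up, the start distribution is an extra assumption, and Viterbi decoding is a standard choice that the notes do not name.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    if not observations:
        return []
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((p, best[t - 1][p] * trans_p[p].get(s, 0.0)) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prob * emit_p[s].get(observations[t], 0.0)
            back[t][s] = prev
    # Walk backwards from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all numbers below are illustrative assumptions.
states = ["attribute-name", "operator", "operand"]
start_p = {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}
trans_p = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
    "operator": {"attribute-name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand": {"attribute-name": 0.7, "operator": 0.1, "operand": 0.2},
}
emit_p = {
    "attribute-name": {"text": 0.9, "textbox": 0.05, "selection-list": 0.05},
    "operator": {"text": 0.1, "textbox": 0.1, "selection-list": 0.8},
    "operand": {"text": 0.05, "textbox": 0.8, "selection-list": 0.15},
}
observed = ["text", "textbox", "text", "selection-list", "textbox"]
decoded = viterbi(observed, states, start_p, trans_p, emit_p)
print(decoded)
```

With these made-up numbers the decoded sequence is attribute-name, operand, attribute-name, operator, operand: the hidden labels are recovered from the visible components alone.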
  11. Now let's look at a search interface in greater detail. A search interface consists of a sequence of components that belong to different logical groups. Components in a single group have different semantic roles, which we call semantic labels. For data-intensive Web applications, each search interface, when submitted to the server, is converted to a structured query expression. E.g. assuming the underlying DB table name is “Gene,” the lower segment can be expressed as select * from Gene where Gene_name=“maggie”. In a way, each segment in a search interface represents a WHERE clause expressing a query condition. Thus, for this work, we use a set of 3 semantic labels: attribute name, operator and operand. It should be noted that segmentation is a two-fold problem: it involves determining the boundaries of logical groups and determining the semantic labels of the components in each group.
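The mapping from a labeled segment to a query condition can be sketched as follows. The function names, the default "=" operator for segments without an operator component, and the choice of joining conditions with AND are all assumptions made for illustration; the string formatting is for display only and is not safe for building real SQL.

```python
def segment_to_condition(segment):
    """Turn one segment, a list of (semantic_label, value) pairs, into a condition."""
    attribute = None
    operator = "="  # assumed default when the segment has no operator component
    operand = None
    for label, value in segment:
        if label == "attribute-name":
            attribute = value
        elif label == "operator":
            operator = value
        elif label == "operand":
            operand = value
    if attribute is None or operand is None:
        raise ValueError("a segment needs an attribute name and an operand")
    return f"{attribute} {operator} '{operand}'"


def build_query(table, segments):
    """Combine one condition per segment into a single query (AND is an assumption)."""
    conditions = [segment_to_condition(segment) for segment in segments]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(conditions)


# The lower segment of the example interface: an attribute name and an operand.
gene_segment = [("attribute-name", "Gene_name"), ("operand", "maggie")]
query = build_query("Gene", [gene_segment])
print(query)
```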
  12. In this work, our primary assumption is that the process of search interface design is probabilistic in nature. Consider this interface and think of how a designer might have laid out the components on it. The designer first lays out an attribute name, then an operand, then again an attribute name, an operator and an operand. He places these labels based on some implicit knowledge that is beyond the natural understanding of a machine. All a machine can observe is that there is a text followed by a textbox followed by another text and so on. A machine can observe the components, but the semantic labels remain hidden. Therefore, we believe that the interface design process can be modeled and explained using a hidden Markov model.
  13. We believe that an HMM can simulate the process of interface design and can act like a human designer, who is able to design an interface using implicit knowledge of semantic labels and segment patterns, and who is also able to determine the segment boundaries and semantic labels of a previously designed search interface. To accomplish segmentation, we encoded the designer's implicit knowledge in an HMM-based artificial designer. As we saw earlier, segmentation is a 2-fold process: determination of semantic labels and determination of boundaries. Therefore, we use a layered HMM with 2 layers: T-HMM, which tags components with apt semantic labels, and S-HMM, which creates boundaries around related groups of components.
  14. Here is how the 2-layered HMM functions. Consider the same example interface. A machine parser with no embedded intelligence and no training would read this interface as a raw sequence of components. This becomes the input for the first layer, T-HMM. T-HMM reads these components as a sequence of semantic labels. This in turn becomes the input for the next layer, S-HMM. S-HMM tags these labels with respect to their position in a segment and hence finishes the task of segmentation.
  15. Now let us look at the 2 layers in greater detail. For T-HMM, i.e. the layer that provides semantic labels, the observation symbols consist of the raw HTML components, such as text labels and various form elements. The states are the semantic labels discussed earlier: attribute-name, operator and operand. In our initial analysis of interfaces, we noticed that certain texts found in real-world interfaces belong to none of the 3 classes. They are instructions for entering data, descriptions, or examples and constraints. Thus we create a 4th state and call it the text-misc state. The topology obtained from a dataset of 50 random interfaces is shown here. For S-HMM, the observation symbol space is the same as the state space of T-HMM, as both layers are used in tandem. The states of S-HMM are the relative positions of each component with respect to a segment. Here is the state transition topology obtained from observing 50 randomly selected interfaces.
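Once the second layer has tagged each label with its position in a segment, cutting the sequence into segments is mechanical. The sketch below assumes hypothetical position tags "begin", "inside" and "end"; the notes do not name the S-HMM states, so these names and the function itself are illustrative only.

```python
def split_into_segments(components, labels, positions):
    """Group components into segments using per-component position tags.

    positions uses the assumed tags "begin", "inside" and "end".
    """
    segments = []
    current = []
    for component, label, position in zip(components, labels, positions):
        # A "begin" tag closes any segment that was left open.
        if position == "begin" and current:
            segments.append(current)
            current = []
        current.append((component, label))
        if position == "end":
            segments.append(current)
            current = []
    if current:  # keep a trailing segment that never saw an "end" tag
        segments.append(current)
    return segments


# Raw components (parser output), T-HMM labels, and assumed S-HMM position tags.
components = ["text", "textbox", "text", "selection-list", "textbox"]
labels = ["attribute-name", "operand", "attribute-name", "operator", "operand"]
positions = ["begin", "end", "begin", "inside", "end"]
segments = split_into_segments(components, labels, positions)
for segment in segments:
    print(segment)
```

For this input the sketch yields two segments, one with 2 components and one with 3.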
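A state transition topology of this kind can be estimated from manually tagged interfaces by counting which label follows which. The sketch below assumes simple maximum-likelihood counting with no smoothing, which the notes do not confirm, and the two tagged sequences are invented examples rather than real training data.

```python
from collections import Counter, defaultdict


def estimate_transitions(tagged_sequences):
    """Estimate P(next_state | state) from sequences of manually tagged states."""
    counts = defaultdict(Counter)
    for sequence in tagged_sequences:
        # Count every adjacent (state, next_state) pair in the sequence.
        for state, next_state in zip(sequence, sequence[1:]):
            counts[state][next_state] += 1
    probabilities = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probabilities[state] = {nxt: n / total for nxt, n in followers.items()}
    return probabilities


# Two invented tagged interfaces, using the four T-HMM states from the notes.
training = [
    ["attribute-name", "operand", "attribute-name", "operator", "operand"],
    ["attribute-name", "operand", "text-misc"],
]
topology = estimate_transitions(training)
for state, followers in topology.items():
    print(state, followers)
```

Transitions that never occur in the training data are simply absent from the result, which is how a domain-specific topology can lack an edge that other domains have.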
  17. The first experiment was conducted on the biology domain, which we found to be an interesting and less explored one. Most of the domains used by existing works are commercial ones such as movies and books, so we decided to first dive into a non-commercial domain. We applied the 2-layered approach to segment 200 interfaces from this domain. Both the training and the testing interfaces belong to the biology domain, so this is a domain-specific method. We found the following results. 86% of the segments were correctly identified, and within these correctly determined segments we measured the accuracy of semantic label identification. We found that in many cases there are multiple instances of attribute-name within a single segment, so we decided to measure the accuracy in two ways. In 90% of the cases, the correct attribute-name label was identified, and in 99% of the cases, at least one instance of attribute-name was correctly identified by T-HMM. The accuracy attained for all the semantic labels was high, except for text-misc, which was misidentified as attribute-name in most cases. We shall work on improving this in future. To compare the accuracy of our method with an existing heuristic-based approach, LEX, we implemented LEX and tested it on 100 interfaces from each of 4 domains: two commercial (auto and movie) and two non-commercial (bio and health). Again, these are domain-specific methods. The second column lists the segmentation accuracy obtained by LEX, and the third column lists the improvement in this accuracy attained by our method. The reason for these results is that LEX does not model the text-misc state and suffered from under-segmentation in many cases. The heuristics of LEX are also limited in that they assume that an attribute name and its operand cannot be more than 2 rows apart in the HTML code, which is contrary to reality in many domains. You might have noticed the 4th column in the comparison table. It represents a variation of the HMM, and it too outperforms LEX on all the domains. Let's look at the different variations of the 2-layered HMM that we created by altering the training data.
  18. We noticed that interface designs differ from one domain to another. Using HMMs, I derived T-HMM topologies for different domains. This figure shows the design tendencies in the auto domain. The states indicate the semantic labels assigned to components. A similar state transition topology was created for 4 other domains. The transitions and their values were found to be different in all 5 domains. E.g. several peculiarities can be seen in the auto domain. In all domains there is some probability of transitioning from operator to attribute-name, except in the auto domain. Also, the transition from operand to operator is found only in this domain. HMMs are a useful way of studying the differences and preferences of designers in a particular domain. We also created another HMM from interfaces drawn from a mix of all 5 domains and call it the mixed model.
  19. Using these 6 variations of HMMs, we conducted 30 experiments. We tried all possible combinations of training and testing data. Each of these cells belongs to one of three kinds of methods. The green cells represent the domain-specific methods, i.e. the training and test data come from the same domain. This is the method we used to conduct our initial experiments. The orange cells represent the generic methods, i.e. the training data is not consciously selected and comes from a set of mixed-domain interfaces. The rest of the cells belong to the “cross-domain” method, i.e. the training data is from domain X and the test data is from domain Y. The numbers in bold represent the highest accuracy attained while testing interfaces in a given domain, and the numbers in italics represent the weakest performance by a model in a given domain. We can see that HMMbio gives the highest performance in 4 out of 5 domains, 3 of which are cross-domain cases. Looking in greater detail, let's look at the patterns captured by domain-specific models. The first example comes from the automobile domain. We notice that the domain-specific model HMMauto gives the best performance in the auto domain. This pattern is peculiar to the auto domain and hence was not captured by the other models. Similarly, in the bio domain this segment pattern was peculiar and frequent, and hence was not captured by the other models, resulting in the best performance by the domain-specific model in the bio domain. Let's look at some patterns that were captured by cross-domain models. One is a segment pattern in the health domain. It was under-segmented by HMMhealth, as this is a rare pattern in this domain. However, it was captured by the cross-domain model HMMbio, since in the bio domain it is common to have a text-misc after a textbox within a segment. Another pattern comes from the movie domain. It was incorrectly segmented by HMMmovie, as it is not common to have operators in a selection list in the movie domain. But this pattern is common in the bio domain, so it was captured by the cross-domain model HMMbio. We can see that, contrary to previous studies and intuition, domain-specific models do not always result in the highest accuracy. E.g. in the movie domain, HMMmovie returned 70% accuracy, which was less than that returned by every other model.
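The layout of the experiment grid, 6 models by 5 test domains, can be reproduced with a short sketch that assigns each cell to one of the three kinds of methods. The function names are hypothetical, and the fifth domain appears as the placeholder "other" because these notes name only four of the five domains.

```python
from itertools import product


def method_kind(train, test, mixed_name="mixed"):
    """Classify one (training source, test domain) cell of the grid."""
    if train == mixed_name:
        return "generic"
    if train == test:
        return "domain-specific"
    return "cross-domain"


def experiment_grid(domains, mixed_name="mixed"):
    """Enumerate every combination of model and test domain."""
    models = list(domains) + [mixed_name]
    return {
        (train, test): method_kind(train, test, mixed_name)
        for train, test in product(models, domains)
    }


# "other" is a placeholder for the fifth domain, which the notes do not name.
domains = ["auto", "movie", "bio", "health", "other"]
grid = experiment_grid(domains)
kinds = list(grid.values())
print(len(grid), "experiments")
for kind in ("domain-specific", "generic", "cross-domain"):
    print(kind, kinds.count(kind))
```

The grid has 30 cells: 5 domain-specific, 5 generic and 20 cross-domain.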
  20. Although the domains tested are limited, we can derive some general conclusions. First, when a domain has a pattern that is both peculiar and frequent, that pattern can be captured by the domain-specific model. Examples are the bio and auto domains. Second, when a domain D has a rare pattern and another domain B has the same pattern as a frequent one, that pattern can be recovered by a cross-domain model trained on interfaces from domain B. In short, it is not that domain-specific models always lead to higher accuracy; rather, a model trained on better examples gives better results, where “better” refers to the frequency of design patterns in both domains.
  22. We showed that the interface design process is probabilistic in nature and introduced an approach to interface segmentation. We are the first to apply HMMs to deep Web interfaces. We tested our method across several domains and found that it results in high accuracy and outperforms a contemporary approach in all domains. We designed different variations of the HMMs and tested them across all domains. An interesting conclusion we reached is that we can design a single model that can be used for segmenting interfaces from multiple domains. E.g. HMMbio, trained on biology interfaces, outperformed the other models in 4 out of 5 domains.
  23. In future, we want to test our method on more domains and derive a minimal set of models that can cover the various domains present on the deep Web. In terms of improvement, we want to be able to represent more complex segments. Some segments are intertwined with components of other segments, and certain segments are really unusual: they have the attribute name and operands intertwined in a single component, so the segment is composed of a single component. We also want to be able to extract more information about an attribute, such as data type, integrity constraints, etc. Also, using HMMs imposed certain limitations on the approach. We had to perform manual tagging to prepare training data, so we want to explore some unsupervised learning methods for preparing training data. Another problem was time complexity. We want to explore some optimization methods to improve the efficiency of this approach, or we could use this approach as a pre-processing module for other advanced tasks related to the deep Web.
  24. Thank you very much for listening with patience. Please let me know if you have any questions or comments to make.