An Empirical Study on Using Hidden Markov Models for Search Interface Segmentation
  • A very good morning to everyone here. I am Ritu Khare from Drexel University in the USA, presenting our work on using hidden Markov models for search interface segmentation.
  • The presentation is divided into four parts. First, I will describe the research problem. Second, I will describe the proposed solution to this problem. Then, I will discuss the results of the experiments we carried out. Finally, I will summarize the contributions of this work and some future directions.
  • The motivation behind studying this problem is the deep Web, the portion of the Web that is not returned by search engines like Google through crawling and indexing. Its contents lie in online databases that can only be accessed by filling out the HTML forms found on search interfaces like this one. Researchers have suggested many ways to make these hidden contents more visible and useful to Web users, such as designing metasearch engines and increasing the search-engine visibility of deep Web contents. A critical prerequisite of these solutions is a deep understanding of the semantics of search interfaces.
  • Therefore, we study the problem of interface segmentation, which is central to understanding search interface semantics. Simply stated, search interface segmentation means grouping related attributes together. Let's understand this with the help of this interface. It can be divided into two segments, each forming a different implied query: the top segment has seven components and the bottom has four. This example suggests that a segment can have a varied number, format, and pattern of components.
  • Now let's see why this is a challenging problem. A search interface is designed by human designers in such a way that a user quickly recognizes the segments, based on the visual arrangement of components and on her past experience performing searches with interfaces. In a way, segmentation comes very naturally to human users. At the other extreme, a machine cannot "see" a segment, for a couple of reasons. First, components that are visually close on the interface might be located far apart in the machine-readable HTML code. Second, a machine has no cognitive ability to recognize a segment boundary. In this work, we study whether a machine can "learn" how to segment an interface into implied queries.
  • Many past works address the segmentation problem. They are based on rules and heuristics, which makes them unfit for handling diversity and scale. Also, most of them do not group all components of a segment together, i.e., they suffer from under-segmentation. The proposed approach overcomes these shortcomings by taking a deeper, model-based, holistic approach instead of rules. We incorporate the knowledge a designer uses when designing an interface into a model, and use this model for segmentation. In a way, we create an artificial designer who has the ability to segment.
  • The deep Web has a diverse distribution of subject domains, and the design tendencies of designers differ from domain to domain. For interfaces belonging to a given domain, two kinds of methods can be designed: a domain-specific method and a generic method. Say we have an interface I belonging to domain Di. A domain-specific method for this interface is designed by observing interfaces from domain Di only; a generic method for the same interface is designed by observing interfaces from a random mix of domains. Existing works have compared the accuracies of the two methods and suggest that domain-specific methods always yield better performance. Using the model-based approach of hidden Markov models, we look at the domain question from a fresh perspective: instead of two, we devise three kinds of methods and study in detail why a particular method results in higher accuracy than another.
  • This brings us to the second part of the presentation: the proposed solution to this problem.
  • So what exactly is an HMM? It is best understood with the help of this figure, which shows an example HMM. The hidden nodes are the states, and the white nodes are the symbols, or observations, emitted by the states. Two stochastic processes are involved: the process of transitioning from one state to another, and the process of each state emitting symbols. These give the four important elements of an HMM: a finite set of states, a matrix describing the probability of transitioning from one state to another, a finite set of symbols, and a matrix describing the probability of a given state emitting a symbol. HMMs are needed for real-world processes that are unobservable and difficult to interpret, particularly by a machine. An HMM is used to model such a process and also to explain it, i.e., to determine the possible state transitions the process might have undergone to generate a given sequence of observable symbols; the sketch below illustrates both the four elements and this decoding step.
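    To make the four elements concrete, here is a minimal, self-contained Python sketch of an HMM together with Viterbi decoding, the standard way to recover the most likely hidden-state sequence. The state and symbol names echo the talk's setting, but all probabilities are toy values, not the paper's learned parameters.

      import numpy as np

      # Toy HMM: hidden states and observable symbols (illustrative names only).
      STATES  = ["attribute-name", "operator", "operand"]
      SYMBOLS = ["text", "textbox", "selection-list"]

      # A[i, j] = P(next state j | current state i); rows sum to 1.
      A  = np.array([[0.1, 0.3, 0.6],
                     [0.2, 0.1, 0.7],
                     [0.7, 0.2, 0.1]])
      # B[i, k] = P(symbol k | state i); rows sum to 1.
      B  = np.array([[0.8, 0.1, 0.1],
                     [0.3, 0.1, 0.6],
                     [0.1, 0.6, 0.3]])
      pi = np.array([0.8, 0.1, 0.1])   # initial state distribution

      def viterbi(obs, A, B, pi):
          """Return the most likely hidden-state index sequence for obs."""
          T, N = len(obs), A.shape[0]
          delta = np.zeros((T, N))             # best path probability per state
          psi   = np.zeros((T, N), dtype=int)  # back-pointers
          delta[0] = pi * B[:, obs[0]]
          for t in range(1, T):
              for j in range(N):
                  scores = delta[t - 1] * A[:, j]
                  psi[t, j]   = scores.argmax()
                  delta[t, j] = scores.max() * B[j, obs[t]]
          path = [int(delta[-1].argmax())]
          for t in range(T - 1, 0, -1):
              path.append(int(psi[t, path[-1]]))
          return path[::-1]

      # Decode a component sequence: text, textbox, text, selection-list, textbox.
      obs = [SYMBOLS.index(s) for s in
             ["text", "textbox", "text", "selection-list", "textbox"]]
      print([STATES[i] for i in viterbi(obs, A, B, pi)])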
  • Now let's look at a search interface in greater detail. A search interface consists of a sequence of components that belong to different logical groups. Each component in a group plays a different semantic role, which we call its semantic label. In data-intensive Web applications, each search interface, when submitted to the server, is converted to a structured query expression. For example, assuming the underlying database table is named "Gene," the lower segment can be expressed as select * from Gene where Gene_name="maggie". In a way, each segment in a search interface represents a WHERE clause expressing a query condition (see the sketch below). Thus, for this work, we use a set of three semantic labels: attribute name, operator, and operand. Note that segmentation is a two-fold problem: determining the boundaries of the logical groups, and determining the semantic labels of the components in each group.
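    As a small illustration of this segment-to-WHERE-clause mapping, the sketch below assembles a query string from a segment's three labeled components. The table and column names follow the slide's "Gene" example; the explicit "=" operator is an assumption, since the slide's operator is implicit.

      # One segment = one query condition: its labeled components
      # fill the slots of a WHERE clause.
      segment = {"attribute-name": "Gene_name",
                 "operator": "=",
                 "operand": "maggie"}

      query = (f"SELECT * FROM Gene "
               f"WHERE {segment['attribute-name']} {segment['operator']} "
               f"'{segment['operand']}'")
      print(query)   # SELECT * FROM Gene WHERE Gene_name = 'maggie'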
  • Our primary assumption in this work is that the process of search interface design is probabilistic in nature. Consider this interface and think of how a designer might have laid down its components: first an attribute name, then an operand, then again an attribute name, an operator, and an operand. The designer places these labels based on implicit knowledge that is beyond the natural understanding of a machine. All a machine can observe is that there is a text, followed by a textbox, followed by another text, and so on. The machine can observe the components, but the semantic labels remain hidden. Therefore, we believe the interface design process can be modeled and explained using a hidden Markov model; the small generative sketch below illustrates the intuition.
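    To illustrate the "designer as a probabilistic process" intuition, this sketch samples a short interface: it walks the hidden chain of labels and emits a visible component at each step. It reuses the toy STATES, SYMBOLS, A, B, and pi from the earlier sketch, which are illustrative values only.

      import numpy as np

      rng = np.random.default_rng(seed=0)

      # Walk the hidden chain (the designer's labels) and emit a
      # visible component at each step.
      state = rng.choice(len(STATES), p=pi)
      for _ in range(5):
          symbol = rng.choice(len(SYMBOLS), p=B[state])
          print(f"{STATES[state]:15s} -> {SYMBOLS[symbol]}")
          state = rng.choice(len(STATES), p=A[state])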
  • We believe that an HMM can simulate the process of interface design and act like a human designer, who has both the ability to design an interface using implicit knowledge of semantic labels and segment patterns, and the ability to determine the segment boundaries and semantic labels of a previously designed interface. To accomplish segmentation, we encode this implicit designer's knowledge in an HMM-based artificial designer. As we saw earlier, segmentation is a two-fold process: determining semantic labels and determining boundaries. Therefore, we use a layered HMM with two layers: a T-HMM that tags components with apt semantic labels, and an S-HMM that creates boundaries around related groups of components.
  • Here is how the two-layered HMM functions. Consider the same example interface. A machine parser, with no intelligence embedded and no training provided, reads this interface as a raw sequence of components. This becomes the input to the first layer, the T-HMM, which reads the components as a sequence of semantic labels. That sequence in turn becomes the input to the next layer, the S-HMM, which tags each label with its position in a segment, completing the task of segmentation; the pipeline sketch below spells this data flow out.
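    Here is a hedged sketch of the two-layer data flow, reusing the viterbi() function and T-HMM toys from the first sketch. The S-HMM parameters and position-state names below are hypothetical stand-ins, not the paper's learned model; the only property taken from the talk is that the S-HMM's symbol space equals the T-HMM's state space.

      import numpy as np

      # Layer 1 (T-HMM): raw components -> semantic labels.
      raw = ["text", "textbox", "text", "selection-list", "textbox"]
      labels = [STATES[i] for i in
                viterbi([SYMBOLS.index(s) for s in raw], A, B, pi)]

      # Layer 2 (S-HMM): semantic labels -> positions within a segment.
      S_STATES  = ["segment-begin", "segment-inner", "segment-end"]
      S_SYMBOLS = STATES            # the layers are used in tandem
      A2  = np.array([[0.0, 0.6, 0.4],
                      [0.0, 0.5, 0.5],
                      [1.0, 0.0, 0.0]])   # a segment-end starts a new segment
      B2  = np.array([[0.90, 0.05, 0.05],
                      [0.20, 0.40, 0.40],
                      [0.05, 0.15, 0.80]])
      pi2 = np.array([1.0, 0.0, 0.0])     # an interface opens a segment

      positions = [S_STATES[i] for i in
                   viterbi([S_SYMBOLS.index(l) for l in labels], A2, B2, pi2)]
      # Segment boundaries are read off from the position tags.
      print(list(zip(raw, labels, positions)))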
  • Now let us look at the two layers in greater detail. For the T-HMM, the layer that provides semantic labels, the observation symbols are the raw HTML components, such as text labels and the various form elements. The states are the semantic labels discussed earlier: attribute-name, operator, and operand. In an initial analysis of interfaces, we noticed that certain texts found in real-world interfaces belong to none of these three classes; they are instructions for entering data, descriptions, or examples and constraints. Thus we create a fourth state and call it the text-misc state. The topology obtained from a dataset of 50 random interfaces is shown here. For the S-HMM, the observation symbol space is the same as the T-HMM's state space, as the two layers are used in tandem. The states of the S-HMM are the relative positions of each component with respect to a segment. Here is the state transition topology obtained from observing the 50 randomly selected interfaces.
  • Now to the third part of the presentation: the results of the experiments we carried out.
  • The first experiment was conducted on the biology domain, which we found very interesting and relatively unexplored; most domains used by existing works are commercial ones such as movies and books, so we decided to first dive into a non-commercial domain. We applied the two-layered approach to segment 200 interfaces from this domain. Both the training and testing interfaces belong to the biology domain, so this is a domain-specific method. We found the following results: 86% of the segments were correctly identified, and for these correctly determined segments we measured the accuracy of semantic-label identification. Since in many cases there are multiple instances of attribute-names within a single segment, we measured this accuracy in two ways: in 90% of the cases the correct attribute-name label was identified, and in 99% of the cases at least one instance of the attribute-name was correctly identified by the T-HMM. The accuracy attained for all the semantic labels was quite high, except for text-misc, which was misidentified as attribute-name in most cases; we shall work on improving this in the future. To compare the accuracy of our method with an existing heuristic-based approach, LEX, we implemented LEX and tested it on 100 interfaces from each of four domains: two commercial (auto and movie) and two non-commercial (bio and health). Again, these are domain-specific methods. The second column lists the segmentation accuracy obtained by LEX, and the third column lists the improvement in accuracy attained by our method. We attained such results because LEX does not model the text-misc state and suffered from under-segmentation in many cases; LEX's heuristics are also limited in that they assume an attribute name and operand cannot be more than two rows apart in the HTML code, which is contrary to reality in many domains. You might have noticed the fourth column in the comparison table: it represents a variation of the HMM, and it too outperforms LEX in all the domains. Let's look at the different variations of the two-layered HMM that we created by altering the training data.
  • We noticed that there exist differences among interface designs from different domains. Using HMMs, I derived T-HMM topologies for different domains. This figure shows design tendencies in the auto domain; states indicate the semantic labels assigned to components. Similar state transition topologies were created for the four other domains, and the transitions and their values were found to differ across all five domains. For example, several peculiarities can be seen in the auto domain: all domains have some probability of transitioning from operator to attribute-name except the auto domain, and the transition from operand to operator is found only in this domain. HMMs are thus a useful way of studying the differences and preferences of designers in a particular domain. We also created another HMM with interfaces from a mix of all five domains, which we call the mixed model.
  • Using these six variations of HMMs, we conducted 30 experiments, trying all possible combinations of training and testing data. Each cell belongs to one of three kinds of methods. The green cells represent the domain-specific methods, where the training and test data come from the same domain; this is the method we used in our initial experiments. The orange cells represent the generic methods, where the training data is not consciously curated and comes from a mix of domain interfaces. The remaining cells belong to the cross-domain method, where the training data is from domain X and the test data is from domain Y. The numbers in bold represent the highest accuracy attained when testing interfaces in a given domain, and the numbers in italics represent the weakest performance in that domain. We can see that HMMbio gives the highest performance in four out of five domains, and in three of those it acts as a cross-domain method. Looking in greater detail, first consider the patterns captured by domain-specific models. The first example comes from the automobile domain: the domain-specific model HMMauto gives the best performance in the auto domain because this pattern is peculiar to the auto domain and hence wasn't captured by other models. Similarly, in the bio domain a segment pattern was both peculiar and frequent, and hence wasn't captured by models from other domains, resulting in the best performance by the domain-specific model. Now consider patterns captured by cross-domain models. One segment pattern in the health domain was under-segmented by HMMhealth, as it is rare in that domain; however, it was captured by the cross-domain model HMMbio, since in the bio domain it is common to have a text-misc after a textbox within a segment. Another pattern comes from the movie domain: it was incorrectly segmented by HMMmovie, since operators in selection lists are uncommon in the movie domain, but the pattern is common in the bio domain and so was captured by the cross-domain model HMMbio. Contrary to the previous study and to intuition, domain-specific models do not always yield the highest accuracy; for example, in the movie domain, HMMmovie returned 70% accuracy, less than every other model.
  • Although the domains tested are limited, we can draw some general conclusions. First, when a domain has a pattern that is both peculiar and frequent, that pattern can be recovered by the domain-specific model; the bio and auto domains are examples. Second, when a domain D has a rare pattern and there is another domain B in which the same pattern is frequent, that pattern can be recovered by a cross-domain model trained on interfaces from domain B. In short, it is not that domain-specific models always lead to higher accuracy; rather, the model trained on better examples, better in the sense of the frequency of the design patterns in both domains, produces better results.
  • Finally, the last part of the presentation: the contributions of this work and some future directions.
  • We showed that the interface design process is probabilistic in nature and introduced a new approach to interface segmentation. We are the first to apply HMMs to deep Web interfaces. We tested our method across several domains and found that it achieves high accuracy and outshines a contemporary approach in all domains. We also designed different variations of the HMMs and tested them across all domains, reaching an interesting conclusion: a single model can be used to segment interfaces from multiple domains. For example, HMMbio, trained on biology interfaces, outperformed the other models in four out of five domains.
  • In the future, we want to test our method on more domains and derive a minimal set of models that can cover the various domains present on the deep Web. In terms of improvements, we want to represent more complex segments: some segments are intertwined with components of other segments, and certain segments are quite strange, with the attribute name and operands intertwined in, and composed of, a single component. We also want to extract more information about an attribute, such as its data type and integrity constraints. Using HMMs also posed certain limitations: we had to perform manual tagging to prepare the training data, so we want to explore unsupervised learning methods for preparing it. Another problem was time complexity; we want to explore optimization methods to improve the efficiency of this approach, or use this approach as a pre-processing module for other advanced deep Web tasks.
  • Thank you very much for listening so patiently. Please let me know if you have any questions or comments.
  • ×