This document summarizes an approach to segmenting search interfaces using a two-layered hidden Markov model (HMM). The first layer uses a T-HMM to tag interface components with semantic labels like attribute-name, operator, and operand. The second layer uses an S-HMM to segment the interface into logical attributes by grouping related tagged components. The approach models an artificial designer that learns to segment interfaces by training the HMMs on manually segmented examples. It was tested on 200 biology search interfaces and showed promising results for extracting the underlying database querying semantics from the interface structure. Future work aims to improve schema extraction and domain coverage.
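To make the two-layer idea concrete, here is a minimal sketch of the first (tagging) layer: an HMM whose hidden states are the semantic labels and whose observations are coarse component types, decoded with the Viterbi algorithm. All observation types and probabilities below are illustrative assumptions, not the trained values from the study.

```python
# Toy first-layer (tagging) HMM: hidden states are semantic labels,
# observations are coarse component types. Parameters are illustrative
# stand-ins, not the paper's trained values.

STATES = ["attribute-name", "operator", "operand"]

start = {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}
trans = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.4, "operand": 0.5},
    "operator":       {"attribute-name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand":        {"attribute-name": 0.7, "operator": 0.1, "operand": 0.2},
}
emit = {
    "attribute-name": {"text": 0.85, "textbox": 0.05, "selectlist": 0.05, "checkbox": 0.05},
    "operator":       {"text": 0.30, "textbox": 0.05, "selectlist": 0.55, "checkbox": 0.10},
    "operand":        {"text": 0.05, "textbox": 0.60, "selectlist": 0.25, "checkbox": 0.10},
}

def viterbi(observations):
    """Return the most likely semantic-label sequence for a component list."""
    V = [{s: start[s] * emit[s][observations[0]] for s in STATES}]
    back = []
    for o in observations[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            prev, p = max(((r, V[-1][r] * trans[r][s]) for r in STATES),
                          key=lambda x: x[1])
            scores[s], ptr[s] = p * emit[s][o], prev
        V.append(scores)
        back.append(ptr)
    best = max(STATES, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# "Gene ID" <textbox>  |  "cM Position:" <selectlist> <textbox>
print(viterbi(["text", "textbox", "text", "selectlist", "textbox"]))
# -> ['attribute-name', 'operand', 'attribute-name', 'operator', 'operand']
```

In the study these tables would be estimated from the manually segmented training interfaces rather than set by hand, and the second (S-HMM) layer would then group the resulting label sequence into logical attributes.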
This document summarizes a study on using Hidden Markov Models (HMMs) for search interface segmentation. The researchers applied a two-layered HMM approach, with the first layer tagging interface components with semantic labels and the second layer segmenting the interface. Their experiments showed domain-specific HMMs performed best on interfaces from the same domain, while cross-domain HMMs captured patterns across domains. The study contributed an effective probabilistic approach to interface segmentation and found appropriate training data is key to accurate segmentation across domains.
Purpose of the database system, data abstraction, data model, data independence, data definition language, data manipulation language, database manager, database administrator, database users, overall structure.
ER models, entities, mapping constraints, keys, E-R diagrams, reduction of E-R diagrams to tables, generalization, aggregation, design of an E-R database scheme.
Oracle RDBMS, architecture, kernel, system global area (SGA), database writer, log writer, process monitor, archiver, database files, control files, redo log files, Oracle utilities.
SQL: commands and data types, data definition language commands, data manipulation commands, data query language commands, transaction control language commands, data control language commands.
Joins, equi-joins, non-equi-joins, self joins, other joins, aggregate functions, math functions, string functions, GROUP BY clause, date functions and the concept of null values, sub-queries, views.
PL/SQL, basics of PL/SQL, data types, control structures, database access with PL/SQL, database connections, transaction management, database locking, cursor management.
This document discusses database design using Entity Relationship Diagrams (ERDs). It covers how to draw ERDs using Chen's Model and Crow's Foot notations and define the basic elements of ERDs. Conversion rules are presented to convert ERDs into relational tables for one-to-one, one-to-many, and many-to-many relationships. An example is given to demonstrate drawing an ERD for a company database and converting it into relational tables.
Study on a Hybrid Segmentation Approach for Handwritten Numeral Strings in Form Documents (inventionjournals)
This paper presents a hybrid approach to segmenting single- or multiple-touching handwritten numeral strings in form documents, the core of which is the combined use of foreground, background, and recognition analysis. The algorithm first locates feature points on both the foreground and background skeleton images containing connected numeral strings. Possible segmentation paths are then constructed by matching these feature points, with the unexpected benefit of removing useless strokes. Subsequently, all segmentation paths are validated and ranked by a recognition-based analysis, in which a well-trained two-stage classifier is applied to each separated digit image to obtain its reliability. Finally, a locally optimal strategy is introduced to accelerate the recognition process, and the top-ranked segmentation path is used to decide whether to accept the result. Experimental results show that the proposed method achieves a correct segmentation rate of 96.2 percent on a large dataset collected by the authors.
This document discusses techniques for integrating extracted data and schemas. It begins by introducing the problems of column and instance value matching during data integration. It then describes common database integration techniques like schema matching. It also discusses linguistic, constraint-based, domain-level, and instance-level matching approaches. Finally, it covers issues specific to integrating web query interfaces, such as building a global query interface and matching interfaces through correlation mining and clustering algorithms.
The document introduces query processing and optimization in database management systems. It discusses the three main phases a query passes through: 1) parsing and translation, 2) optimization, and 3) evaluation. In the first phase, the query is converted into an internal representation like relational algebra. In the second phase, rules are applied to transform the representation into a more efficient form. In the third phase, the optimized plan is executed and results are returned. The goal is to retrieve desired information from the database in a predictable, reliable, and timely manner.
MATBASE AUTOFUNCTION NON-RELATIONAL CONSTRAINTS ENFORCEMENT ALGORITHMS (ijcsit)
MatBase is an intelligent prototype data and knowledge base management system based on the Relational (RDM), Entity-Relationship, and (Elementary) Mathematical ((E)MDM) data models, built upon relational database management systems (RDBMS). (E)MDM has 61 constraint types, of which 21 also apply to autofunctions. All five relational (RDM) constraint types are passed by MatBase for enforcement to the corresponding RDBMS host; all non-relational ones are enforced by MatBase through automatically generated code. This paper presents and discusses both the strategy and the implementation of MatBase's autofunction non-relational constraint enforcement algorithms. These algorithms are taught to our M.Sc. students in the Advanced Databases lectures and labs, both at the Ovidius University and in the Department of Engineering in Foreign Languages, Computer Science Taught in English stream of the Bucharest Polytechnic University, and are successfully used by two Romanian software companies.
1. The document contains 44 questions and answers about database management systems (DBMS). It covers topics like what is a database, DBMS, data models, normalization, SQL, and more.
2. The questions range from basic definitions to more advanced concepts in database design like functional dependencies, various normal forms, and distributed database architectures.
3. Key areas covered include data definition language (DDL), data manipulation language (DML), database security, concurrency control, and distributed database architectures.
The document discusses the relational database model. It begins by defining key terms like data, information, database, and DBMS. It then explains the relational model proposed by E.F. Codd, showing an example student database. Codd's rules for relational databases are listed. Types of database anomalies and keys like super keys, candidate keys, and foreign keys are also defined. The advantages of relational databases include structural independence and conceptual simplicity. Disadvantages include increased hardware needs and the potential for poor database design.
This document discusses the SQL language for relational databases. It covers the background and history of SQL, the SQL standards, and the key statements and features of SQL including data definition, data types, schema and table creation, attributes, constraints, keys and referential integrity. The document provides examples of SQL statements and clauses to define schemas, tables, attributes, primary keys, foreign keys and other constraints.
This document contains instructions for an assignment for a Web Technologies course. It includes 6 questions related to TCP vs UDP, features of XML, components of an XML processor, fetching data from XML to HTML, categories of PHP operators, and Active Server Pages (ASP). The questions range from short definitions and comparisons to longer explanations and examples.
Chapter-2 Database System Concepts and Architecture (Kunal Anand)
This document provides an overview of database management systems concepts and architecture. It discusses different data models including hierarchical, network, relational, entity-relationship, object-oriented, and object-relational models. It also describes the 3-schema architecture with external, conceptual, and internal schemas and explains components of a DBMS including users, storage and query managers. Finally, it covers database languages like DDL, DML, and interfaces like menu-based, form-based and graphical user interfaces.
Logical database design and the relational model (database) (welcometofacebook)
The document discusses logical database design and the relational model, including transforming entity-relationship diagrams into relations through a process called normalization to eliminate data anomalies and achieve well-structured relations. It defines the relational data model, explains how to map entities and relationships to tables, and covers the three normal forms to validate and improve table structure through analysis of functional dependencies between attributes.
The document discusses database design, including transforming entity-relationship diagrams into normalized relations, integrating different user views, choosing data storage formats, designing efficient database tables, file organization, and indexes. It covers key database concepts such as relations, primary keys, normalization, foreign keys, and data types. The goal of database design is to structure data in stable, normalized tables that are efficient for storage and access.
Generating requirements analysis models from textual requirements (fortes)
This document describes a process for generating use case models from textual requirements. The process uses the EA-Miner tool to analyze textual requirements and extract information like functional concerns, RDL sentences, and a syntactically tagged document. This extracted information is used to derive initial candidate use cases, actors, and relationships. The candidate model is then refined by activities like removing undesirable use cases, completing abstraction names, adding new use cases/actors, and defining relationships between use cases. The overall goal is to reduce the time and effort required to produce requirements artifacts from textual specifications.
Immune-Inspired Method for Selecting the Optimal Solution in Semantic Web Service Composition (IJwest)
The increasing interest in developing efficient and effective optimization techniques has led researchers to turn their attention toward biology. Biology offers many clues for designing novel optimization techniques; such approaches exhibit self-organizing capabilities and can reach promising solutions without a central coordinator. In this paper we handle the problem of dynamic web service composition using the clonal selection algorithm. To assess the optimality of a given composition, we use the QoS attributes of the services involved in the workflow as well as the semantic similarity between these components. The experimental evaluation shows that the proposed approach performs better than other approaches such as the genetic algorithm.
Cleveree: an artificially intelligent web service for Jacob voice chatbot (TELKOMNIKA JOURNAL)
Jacob is a voice chatbot that uses Wit.ai to get the context of a question and give an answer based on that context. However, Jacob has no variation in its answers and cannot recognize a context well if it has not previously been learned by Wit.ai. This paper therefore proposes two artificial intelligence (AI) features built as a web service: paraphrasing of answers using the Stacked Residual LSTM model, and question summarization using cosine similarity with pre-trained Word2Vec and the TextRank algorithm. These two features are novel designs tailored to Jacob; the AI module is called Cleveree. Cleveree is evaluated using the technology acceptance model (TAM) method and interviews with Jacob admins. The results show that 79.17% of respondents strongly agree that both features are useful and 72.57% strongly agree that both features are easy to use.
Database Design and the ER Model, Indexing and Hashing (Prabu U)
This document provides an overview of database design and the entity-relationship (ER) model. It discusses the database design process, including initial, conceptual, logical, and physical design phases. It then describes the key concepts of the ER model, including entities, attributes, relationships, cardinalities, participation constraints, and keys. The document explains how to design ER diagrams and how to remove redundant attributes. It provides examples of one-to-one, one-to-many, many-to-one, and many-to-many relationships. Finally, it demonstrates how to represent complex attributes like composite, multi-valued, and derived attributes in an ER diagram.
This document discusses the limitations of traditional database technologies and introduces associative technology as an evolution in database storage and retrieval. Some key limitations of traditional databases include disparate data sources, lack of timely information, high costs, and complex systems. Associative technology models data in an 'n' normal form that maps data relationally like human memory. It stores single instances of data values and uses bidirectional pointers to associate related data. This approach eliminates data redundancy and allows for fast, flexible querying of complex, large datasets.
The document discusses conceptual data modeling and entity-relationship (ER) modeling. It describes the key concepts in ER modeling including entities, attributes, relationships, cardinality, participation, and relationship types. It provides examples of how to model different types of relationships, attributes, and entities. The goal of conceptual modeling is to build an abstract yet rigorous model of an organization's data to help communicate requirements and ensure quality.
Fundamentals of Database Systems questions and answers with explanations, for freshers and experienced candidates, for interviews, competitive examinations, and entrance tests.
Availability Assessment of Software Systems Architecture Using Formal Models (Editor IJCATR)
There has been significant effort to analyze, design, and implement information systems that process information and data and solve various problems. On the one hand, the complexity of contemporary systems and the striking increase in the variety and volume of information have led to a great number of components and elements, and to more complex structure and organization of information systems. On the other hand, it is necessary to develop systems that meet all of the stakeholders' functional and non-functional requirements. Considering that evaluating these requirements prior to the design and implementation phases consumes less time and reduces costs, the best time to measure the evaluable behavior of a system is when its software architecture is available. One way to evaluate a software architecture is to create an executable model of it.
The present research performed availability assessment, taking repair, maintenance, and accident time parameters into consideration. Failures of software and hardware components were considered in the architecture of software systems. To describe the architecture easily, the authors used the Unified Modeling Language (UML); however, due to the informality of UML, they also utilized Colored Petri Nets (CPN) for the assessment. Finally, the researchers evaluated a CPN-based executable model of the architecture with CPN Tools.
The document presents an ensemble model for chunking natural language text that combines a transformer model (RoBERTa) with a bidirectional LSTM and CNN model. The authors train these models on common chunking datasets like CoNLL 2000 and English Penn Treebank. They find that by using an ensemble of the transformer and RNN-CNN models, which compensate for each other's weaknesses, they are able to achieve state-of-the-art results on chunking, with an F1 score of 97.3% on CoNLL 2000, exceeding previous work. The transformer model provides attention-based contextual embeddings while the RNN-CNN model uses custom embeddings including POS tags to improve accuracy on tags that the transformer model struggles with.
The document provides an overview of Query-by-Example (QBE) and Datalog, two relational query languages. QBE allows graphical queries to be expressed "by example" using relation templates. It supports queries on single and multiple relations, negation, conditions, ordering results, and aggregate functions. Datalog is a logic-based query language based on rules that define views. It allows recursion and negation. Key features include safety and the power of recursive queries.
A database management system (DBMS) is system software that allows for the creation, management, and use of databases, making it easier to create, retrieve, update and manage large amounts of data in an organized manner. The document discusses the definition, importance, implementation, requirements, and challenges of a DBMS, as well as entity relationship diagrams, modeling, and security concepts related to databases. In conclusion, a DBMS is an effective system for systematic data management that is widely used around the world.
The document discusses database design and normalization. It begins by describing different design alternatives such as using larger or smaller schemas. It then covers first normal form (1NF), which requires attributes to be atomic and domains to be indivisible. Second normal form (2NF) and third normal form (3NF) are introduced to further reduce anomalies. The document also discusses functional dependencies, normal forms like Boyce-Codd normal form (BCNF), decomposition using functional dependencies, and closure of attribute sets. Overall, the document provides an overview of relational database design principles and normalization techniques.
The document discusses the relational data model and its key concepts. The relational model represents a database as a collection of relations (tables). Each row in a relation represents a tuple of related data values. Attributes describe the columns and domains define the possible values for each attribute. Relations have schemas that define the relation name and attributes. Relation states contain sets of tuples that must satisfy integrity constraints defined on the schema.
This dissertation proposal outlines a system that allows non-technical users to design and evolve databases by modeling their data needs through customizable forms. The key goals are to provide an easy-to-use interface for form design, and mapping algorithms that translate user-designed forms into high-quality databases. A preliminary evaluation with nurses found the form modeling interface effective and efficient. Mapping experiments successfully translated forms into databases that matched expert-designed standards. Future work includes usability studies varying form and database complexity, and exploring enhancements to mapping and merging algorithms.
5 tips on how to select a PROM for your study: presentation notes (Keith Meadows)
The document provides 5 tips for selecting a patient reported outcome measure (PROM) for a study:
1. Always have a clear hypothesis about what you want to measure to help identify the appropriate PROM.
2. Ensure the content and individual items of the PROM are relevant to the patient population and disease being studied.
3. Consider if the PROM will be acceptable to complete for participants, considering length, time, and design.
4. Select a PROM that has been developed scientifically with evidence of reliability and validity.
5. Be able to correctly interpret the PROM data and results, and consider collaborating with an expert if needed.
This document presents a multi-level methodology for developing UML sequence diagrams (SQDs) in a systematic way. The methodology has three levels - the object framework level, responsibility assignment level, and visual pattern level. Each level breaks the SQD development process into discrete stages and provides guidelines to help avoid common errors. The goal is to serve as an easy-to-use reference for novice SQD modelers to develop correct and consistent SQDs.
During the development and testing of a plugin for Spoon, which is the environment dedicated to designing Kettle's ETL processes, it can be useful to run debug sessions to track down any errors (bugs) that are found. In this short article we will see how a Kettle plugin can be debugged from our development environment, which we assume to be Eclipse.
The document discusses search interface understanding (SIU), which involves representing, parsing, segmenting, and evaluating search interfaces on the deep web. SIU is challenging because search interfaces are designed autonomously without standard structures. The document outlines the SIU process and key challenges, such as interfaces having no defined boundaries for segmenting semantically related components. Techniques for SIU include rules, heuristics, and machine learning.
Within the life cycle of a software project, I consider the percentage of time devoted to writing project documentation to be important; yet at times, in fact very frequently, the percentage of time devoted to this activity is around zero.
Mike Thelwall is a professor known for his research in the field of webometrics. He received his PhD in mathematics and leads the Statistical Cybermetrics Research Group. Webometrics involves the quantitative analysis of web phenomena such as link analysis, search engine evaluation, and web citation analysis. Thelwall's research has explored using webometrics to study the dissemination of scholarly research and evaluate universities. He has emphasized the need for conceptual frameworks and methodologies to interpret webometrics results and address challenges like the size and changing nature of the web.
Clinicians rely on health information technologies (HITs) for clinical data collection, but current HITs are inflexible and inconsistent with clinicians' needs. The researchers propose a flexible electronic health record (fEHR) system that allows clinicians to easily modify the system as their data collection needs change. The fEHR uses a form-based interface for clinicians to design forms, generates a corresponding form tree structure, and designs a high-quality database from the tree. A user study with 5 nurses found they could effectively replicate their needs in the system, and their efficiency and understanding improved over two rounds of tasks of increasing complexity. The researchers conclude that the fEHR has the potential to reduce HIT problems.
This document describes using a Hidden Markov Model (HMM) approach to segment deep web search interfaces. The HMM acts as an artificial designer that can determine segment boundaries and label components based on acquired knowledge. A two-layered HMM is employed, with the first layer assigning semantic labels and the second layer segmenting the interface. The approach outperforms previous heuristic methods, achieving a 10% improvement in segmentation accuracy. Future work involves extracting more schema details, testing on other domains, and exploring alternative training algorithms.
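The summary above says the HMM acquires its knowledge from manually segmented examples. A plausible way to picture that training step is maximum-likelihood estimation with add-one smoothing over labeled component sequences; the observation vocabulary and data format below are assumptions for illustration, not the paper's actual training procedure.

```python
from collections import Counter, defaultdict

STATES = ["attribute-name", "operator", "operand"]
OBS = ["text", "textbox", "selectlist", "checkbox"]

def train_hmm(labeled, states=STATES, obs_vocab=OBS):
    """Estimate start/transition/emission tables by maximum likelihood with
    add-one smoothing. `labeled` is a list of training interfaces, each a
    list of (observation, state) pairs from a manual segmentation."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in labeled:
        start[seq[0][1]] += 1
        for o, s in seq:
            emit[s][o] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1

    def dist(counts, keys):
        total = sum(counts[k] for k in keys) + len(keys)  # add-one smoothing
        return {k: (counts[k] + 1) / total for k in keys}

    return (dist(start, states),
            {s: dist(trans[s], states) for s in states},
            {s: dist(emit[s], obs_vocab) for s in states})

# One toy training example: "Gene ID" <textbox> "Gene Name" <selectlist> <textbox>
example = [("text", "attribute-name"), ("textbox", "operand"),
           ("text", "attribute-name"), ("selectlist", "operator"),
           ("textbox", "operand")]
start_p, trans_p, emit_p = train_hmm([example])
print(trans_p["attribute-name"])  # operand/operator likeliest after a name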
Vision Based Deep Web Data Extraction on Nested Query Result Records (IJMER)
This document summarizes a research paper on vision-based deep web data extraction from nested query result records. It proposes a technique to extract data from web pages using different font styles, sizes, and cascading style sheets. The extracted data is then aligned into a table using alignment algorithms, including pair-wise, holistic, and nested-structure alignment. The goal is to remove immaterial information from query result pages to facilitate analysis of the extracted data.
This document summarizes a research paper on using clustering approaches to improve the discovery of semantic web services. It begins by defining semantic web services and semantic similarity measures. It then discusses using clustering to eliminate irrelevant services from a collection before applying semantic algorithms. Specifically, it proposes a clustering probabilistic semantic approach (CPLSA) that filters services based on compatibility with a query before clustering the remaining services into semantically related groups using probabilistic latent semantic analysis (PLSA). The document concludes by discussing applications of approximate semantics and challenges in scaling semantic algorithms.
The document discusses automatic data unit annotation in search results. It proposes a method that clusters data units on result pages into groups containing semantically similar units. Then, multiple annotators are used to predict annotation labels for each group based on features of the units. An annotation wrapper is constructed for each website to annotate new result pages from that site. The method aims to improve search response by providing meaningful annotations of data units within results. It is evaluated based on precision and recall for the alignment of data units and text nodes during the annotation process.
Zhao and Huang, DeepSim: Deep Learning Code Functional Similarity (itrejos)
Measuring code similarity is fundamental for many software engineering tasks, e.g., code search, refactoring, and reuse. However, most existing techniques focus on syntactic code similarity only, while measuring functional code similarity remains a challenging problem. In this paper, we propose a novel approach that encodes code control flow and data flow into a semantic matrix in which each element is a high-dimensional sparse binary feature vector, and we design a new deep learning model that measures code functional similarity based on this representation. By concatenating hidden representations learned from a code pair, this new model transforms the problem of detecting functionally similar code into binary classification, which can effectively learn patterns between functionally similar code with very different syntax.
With these components in place, we present the Data Science Machine: an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features to be used for predictive modeling.
ArtForm - Dynamic analysis of JavaScript validation in web forms - Poster (DBOnto)
ArtForm is a tool that uses concolic analysis and symbolic execution to dynamically analyze JavaScript validation in web forms. It infers integrity constraints, models hidden data, and improves search and data extraction. It works by controlling a WebKit browser to execute code symbolically while tracking concrete and symbolic values. Challenges include event handler dependencies, implied constraints, and JavaScript semantics that are difficult to model. Future work focuses on handling more patterns and constraints and targeting exploration of interesting parts of form trees.
1) The document discusses a review of semantic approaches for nearest neighbor search. It describes using an ontology to add a semantic layer to an information retrieval system to relate concepts using query words.
2) A technique called spatial inverted index is proposed to locate multidimensional information and handle nearest neighbor queries by finding the hospitals closest to a given address.
3) Several semantic approaches are described including using clustering measures, specificity measures, link analysis, and relation-based page ranking to improve search and interpret hidden concepts behind keywords.
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis (IRJET Journal)
This document summarizes a research paper about developing a semi-automatic ontology mapping system to improve integration of data from different ontologies on the semantic web. It discusses how the system uses techniques from computational linguistics, information retrieval, and machine learning to map ontologies in an iterative process. The system performs various natural language processing tasks and leverages external resources like domain thesauri and WordNet to strengthen matches during the mapping process. Preliminary case studies show promising results for the semi-automatic ontology mapping system.
The AgentMatcher system matches learners and learning objects (LOs) using a tree-structured representation of metadata. It extracts metadata from LOs using LOMGen and stores it in a database. Learners can enter query parameters as a weighted tree, which is compared to LO metadata trees to find similar LOs. Top matches above a similarity threshold are returned to the learner. LOMGen semi-automatically generates metadata using keywords and allows an administrator to refine selections. This enhances precision over simple keyword searches.
This document discusses techniques for integrating web query interfaces and schemas. It begins with an introduction to information integration and database integration, including schema matching. It then covers pre-processing techniques used for integration like tokenization and stemming. Schema-level matching techniques are discussed like name, description, and constraint-based approaches. Domain and instance-level matching uses value characteristics. Composite domains are handled by detecting delimiters. Similarities from different match indicators can be combined. Web query interface integration is introduced, with the problem being identifying synonym attributes. Schema matching is framed as correlation mining, covering group discovery, match discovery, and matching selection. A clustering approach to 1:1 matching is also presented.
Amit P. Sheth, “Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating and Exploiting Complex Semantic Relationships,” Keynote at the 29th Conference on Current Trends in Theory and Practice of Informatics (SOFSEM 2002), Milovy, Czech Republic, November 22–29, 2002.
Keynote: http://www.sofsem.cz/sofsem02/keynote.html
Related paper: http://knoesis.wright.edu/?q=node/2063
Web Information Extraction Learning based on Probabilistic Graphical Models (GUANBO)
The document describes a graphical model for jointly extracting and resolving product attributes from web pages. The model uses a Dirichlet process prior to handle an unlimited number of attributes. Variational inference is used to approximate the intractable posterior distribution. Experimental results on four domains show the model achieves good performance on attribute extraction and resolution without supervision.
Previous research has focused on quick and efficient generation of wrappers; the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent wrappers from extracting data correctly. We present an efficient algorithm that extracts unstructured Web data into structured data. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The verification framework automatically recovers from changes in the Web source by identifying data on Web pages using dimension reduction techniques. The wrapped data is then applied to one-class classification on numerical features to avoid classification problems. Finally, the resulting data is applied to a top-k query to provide the best ranking based on probability scores. The wrapper verification system relies on one-class classification techniques to overcome previous weaknesses, identifying problems by analysing both the signature and the classifier output. If there are sufficient mislabelled slots, a technique to find a pattern could be explored.
How to store state definitions, including boolean logic decompositions, in a relational structure and integrate them with the state definitions for applications.
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING (IJCI JOURNAL)
This document summarizes a research paper that proposes using a combination of Natural Language Processing and statistical models to match features between different datasets. Specifically, it uses BERT (Bidirectional Encoder Representations from Transformers), a pretrained NLP model, in parallel with Jaccard similarity to measure similarity between feature lists. The hybrid approach reduces time required for manual feature matching compared to previous methods. The paper describes preprocessing data, generating embeddings with BERT, calculating similarity scores with BERT and Jaccard, and outputting top matches above a threshold. It provides example results matching house sales and movie metadata features. The hybrid approach leverages strengths of BERT's semantic understanding and Jaccard's flexibility for special characters.
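The Jaccard half of the hybrid described above is simple to state. A minimal token-level sketch follows; the paper's exact tokenization, weighting, and matching threshold are not shown here, so the split-on-underscore rule is an assumption for illustration.

```python
# Jaccard similarity between two feature names, split on underscores.
# The tokenization rule is an illustrative assumption, not the paper's.
def jaccard(a: str, b: str) -> float:
    x, y = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(x & y) / len(x | y)

print(jaccard("sale_price", "price_of_sale"))  # 2/3, about 0.67
```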
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Page Segmentation (IOSR Journals)
The document proposes an innovative vision-based page segmentation (IVBPS) algorithm to improve hidden web content extraction. It aims to overcome limitations of existing approaches that rely heavily on HTML structure. IVBPS extracts blocks from the visual representation of a page and clusters them to segment the page semantically. It uses layout features like position and appearance to locate data regions and extract records. The algorithm analyzes the entire page structure rather than local regions, allowing it to retain content DOM tree methods may discard. This is expected to significantly improve hidden web extraction performance.
Two Layered HMMs for Search Interface Segmentation
1. 2-Layered HMMs for Search Interface Segmentation. Ritu Khare (under the supervision of Dr. Yuan An, Assistant Professor, iSchool).
2. Order of Presentation: Background (Deep Web; What is Search Interface Understanding?; What is Interface Segmentation?; Why is Segmentation Challenging?); Our Approach for Segmentation (Interface Representation; HMM: The Artificial Designer; 2-Layered Approach; Architecture); Experimentation (Parameters; Result); Contributions; Future Work; References.
3. Background: Deep Web. What is the deep Web: the data that exists on the Web but is not returned by search engines through traditional crawling and indexing; the primary way to access this data is by filling up HTML forms on search interfaces. Characteristics [6]: a large proportion of structured databases; diversity of domains; and a growing scale. Researchers have many goals for the deep Web: designing intra-domain meta-search engines [22, 8, 15, 5, 21]; increasing content visibility on existing search engines [17, 12]; and deriving ontologies from search interfaces [1]. A prerequisite to attaining these goals is an understanding of the search interfaces (slide 4). In this project, we propose an approach to address the segmentation (slide 5) portion of the problem of search interface understanding.
4. Background: What is Search Interface Understanding?
Understanding the semantics of a search interface (shown in the figure) is an intricate process [4]. It involves 4 stages.
Representation: a suitable interface representation scheme is chosen, and the semantic labels (slide 8) to be assigned to interface components are decided. An interface component is any text or HTML form element (textbox, textarea, selection list, radio button, checkbox, file input) that exists inside an HTML form.
Parsing: components are parsed into a suitable structure.
Segmentation: the interface components are assigned semantic labels, and related components are grouped together. Questions like "Which surrounding text is associated with which form element?" (in Figure 2, "Gene ID" is associated with the textbox placed next to it) are also answered in this stage.
Segment-processing: additional information about each segment component, such as domain, constraints, and data type, is extracted.
5. What is Interface Segmentation?
This project focuses on segmentation, the 3rd stage of this process. The figure shows a segmented interface in which the related components are grouped together. The left segment has 7 components. The right segment has 4 components ("cM Position:", a selection list, a textbox, and "e.g., 10.0-40.0").
6. Why is Segmentation Challenging?
From a user's (or designer's) standpoint: by looking at the visual arrangement of components, and based on past experience, the user creates a logical boundary around the related components because they appear to belong to the same atomic query. A machine, on the other hand, is unable to "see" a segment for the following reasons: components that are visually close to each other might be located far apart in the HTML source code, and a machine does not implicitly have any search experience that can be leveraged to identify a segment boundary.
This project aims to investigate whether a machine can "learn" how to understand and segment an interface. Existing works have two shortcomings: some [9, 13, 17] do not group all related components together, i.e., they do not create complete segments; others [23, 7] use rules and heuristics to segment a search interface, and these techniques have problems handling scalability and heterogeneity [10].
7. Our Approach for Segmentation
We incorporate the first-hand implicit knowledge with which a human designer is assumed to have designed an interface. This is accomplished by building an artificial designer using Hidden Markov Models (refer to week 9's slides on the HMM introduction). We visualize segmentation as a two-fold problem: identification of the boundaries of logical attributes (slide 9), and assignment of semantic labels (attribute-name, operator, and operand, described in slide 9) to interface components.
8. Interface Representation
In the figure, each component of the lower segment is marked with a label, which we term a semantic label. The semantic label for a particular component denotes the meaning of the component from a user's or designer's standpoint.
[Figure: an interface annotated with the labels Search Entity, Logical Attribute, Attribute-name, Operator, and Operand.]
9. Interface Representation
Attribute-name: an attribute-name denotes a criterion available for searching a particular entity; e.g., the entity "Genes" can be searched by "Gene ID" and by "Gene Name".
Operand: an attribute-name is usually associated with operand(s), the value(s) entered by the user that are matched against the corresponding field value(s) in the underlying database.
Operator: the user may also be given the option of specifying an operator that further qualifies an operand.
Filling in an HTML form is similar to writing SQL queries. Assuming the underlying database table is named "Gene", the SQL queries for the figure would be:
SELECT * FROM Gene WHERE Gene_ID = 'PF11_0344';
SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie';
Logical Attribute: the predicate in the WHERE clause of each query is created by a group of related components. We combine the semantic roles (attribute-name, operator(s), and operand(s)) of these components into a composite semantic label called a logical attribute. Our approach assumes that a segment corresponds to a logical attribute.
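To make the mapping to SQL concrete, here is a minimal illustrative sketch (our own, not code from the paper; the class and method names are hypothetical) of a logical attribute as a composite of the three semantic roles, rendered as a WHERE-clause predicate:

```python
# Illustrative sketch (not from the paper): a logical attribute as a
# composite of the three semantic roles, rendered as a WHERE predicate.
from dataclasses import dataclass

@dataclass
class LogicalAttribute:
    attribute_name: str   # e.g. "Gene_Name", the searchable criterion
    operator: str         # e.g. "LIKE" or "=", chosen via a radio button
    operand: str          # the user-supplied value from a textbox

    def to_predicate(self) -> str:
        return f"{self.attribute_name} {self.operator} '{self.operand}'"

attr = LogicalAttribute("Gene_Name", "LIKE", "maggie")
print(f"SELECT * FROM Gene WHERE {attr.to_predicate()};")
# -> SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie';
```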
10. HMM: The Artificial Designer
We assume that an HMM can act like a human designer who has the ability to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components. The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the Web page while keeping the semantic role (attribute-name, operand, or operator) of the component in mind.
[Diagram: in designing, knowledge of semantic labels plus a bag of components yields a search interface; in decoding, the 2-layered HMM (the artificial designer) maps a search interface to segments and tagged components.]
11. HMM: The Artificial Designer
While the components are observable, their semantic roles appear hidden to a machine. One semantic label following another is analogous to the transitioning between HMM states. In the figure, ovals are states (semantic labels) and rectangles are emitted symbols (components). The designing ability is provided by training the HMM with suitable algorithms. Once an HMM is trained, it can be used for the decoding process, i.e., for explaining the design of a given search interface.
[Figure: states Attribute-name, Operand, Operator, Attribute-name, Operand emitting the components Text ("Gene ID"), Textbox, Text ("Gene Name"), RadioButton Group, Textbox.]
12. 2-Layered HMM
The decoding problem that we address in this paper is two-fold, involving segmentation as well as the assignment of semantic labels to components. Hence, we employ a layered HMM [14] with 2 layers. The first layer, the T-HMM, tags each component with the appropriate semantic label (attribute-name, operator, or operand). The second layer, the S-HMM, segments the interface into logical attributes.
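The paper does not publish code, but the layering can be illustrated with a self-contained sketch: a standard Viterbi decoder is run twice, and the first layer's predicted tag sequence becomes the second layer's observation sequence. All probabilities below are invented toy numbers, and the "begin"/"inside" state names for the S-HMM are our assumption about how boundary states might be encoded:

```python
# Self-contained sketch of the two-layer decoding idea with toy
# probabilities; state names follow the paper, numbers are invented.
import math

def viterbi(obs, states, start, trans, emit):
    """Standard Viterbi decoder over log-probabilities."""
    lp = lambda p: math.log(p) if p > 0 else float("-inf")
    V = [{s: (lp(start[s]) + lp(emit[s].get(obs[0], 0)), [s]) for s in states}]
    for o in obs[1:]:
        V.append({s: max(((p + lp(trans[q][s]) + lp(emit[s].get(o, 0)), path + [s])
                          for q, (p, path) in V[-1].items()), key=lambda t: t[0])
                  for s in states})
    return max(V[-1].values(), key=lambda t: t[0])[1]

# Layer 1 (T-HMM): semantic labels are hidden, components are observed.
tags = viterbi(
    ["text", "textbox", "text", "textbox"], ["attribute-name", "operand"],
    start={"attribute-name": 0.9, "operand": 0.1},
    trans={"attribute-name": {"attribute-name": 0.1, "operand": 0.9},
           "operand": {"attribute-name": 0.8, "operand": 0.2}},
    emit={"attribute-name": {"text": 0.9, "textbox": 0.1},
          "operand": {"text": 0.1, "textbox": 0.9}})

# Layer 2 (S-HMM): layer 1's tag sequence is the observation sequence;
# the hidden states mark logical-attribute boundaries.
segs = viterbi(
    tags, ["begin", "inside"],
    start={"begin": 1.0, "inside": 0.0},
    trans={"begin": {"begin": 0.2, "inside": 0.8},
           "inside": {"begin": 0.5, "inside": 0.5}},
    emit={"begin": {"attribute-name": 0.9, "operand": 0.1},
          "inside": {"attribute-name": 0.1, "operand": 0.9}})
print(tags, segs)
```

On the toy inputs this prints the predicted tag sequence and, from it, the segment-boundary sequence, mirroring how the T-HMM output feeds the S-HMM.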
13. 2-Layered Approach Architecture
[Diagram: training and test interfaces undergo DOM-tree parsing. Manually tagged state sequences drive T-HMM training, producing the T-HMM specs; T-HMM testing decodes test interfaces into predicted state sequences. These, together with manually tagged state sequences, drive S-HMM training, producing the S-HMM specs; S-HMM testing decodes test interfaces into predicted state sequences.]
14. Experimentation Parameters
Data set: 200 interfaces (NAR collection), http://www3.oup.co.uk/nar/database/c/
Parsing: DOM-trees [3] of components. Trees were traversed in depth-first search order.
Testing and training data: the examples were randomly divided into 20 equal-sized sets. We conducted 20 experiments, each with 190 training and 10 testing examples.
Testing and training algorithms: in both layers, training and testing were performed using the Maximum Likelihood method and the Viterbi algorithm, respectively.
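Because the training sequences are manually tagged, the hidden paths are fully observed, and Maximum Likelihood training reduces to normalized counting. A minimal sketch of that estimation step (the tag names follow the paper; the two training sequences are invented examples):

```python
# Minimal sketch of Maximum Likelihood HMM training: with manually
# tagged sequences the hidden path is observed, so the estimates are
# just normalized counts. The training data here is invented.
from collections import Counter, defaultdict

tagged = [  # (component, semantic label) pairs per training interface
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("radiobutton", "operator"), ("textbox", "operand")],
]

start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
for seq in tagged:
    start[seq[0][1]] += 1                      # count initial states
    for obs, state in seq:
        emit[state][obs] += 1                  # count emissions
    for (_, prev), (_, cur) in zip(seq, seq[1:]):
        trans[prev][cur] += 1                  # count state transitions

normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
start_p = normalize(start)
trans_p = {s: normalize(c) for s, c in trans.items()}
emit_p = {s: normalize(c) for s, c in emit.items()}
print(start_p, trans_p, emit_p, sep="\n")
```

The Baum-Welch training mentioned under Future Work would replace this counting step when tagged sequences are unavailable.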
16. Contributions
We studied a challenging stage (segmentation) of the process of search interface understanding. In the context of the deep Web, this is the third formal empirical study (after [23] and [7]) that groups components belonging to the same logical attribute together. We incorporated the first-hand knowledge of the designer for interface segmentation and component tagging. To the best of our knowledge, this is the first work to apply HMMs to deep Web search interfaces. The interface has been represented in terms of the underlying database, which helped in extracting database querying semantics. Moreover, we tested our method on a less-explored domain (biology) and found promising results.
17. Future Work
To recover the schema of deep Web databases by extracting finer details, such as the data types and constraints of logical attributes.
To do justice to the balanced domain distribution of the deep Web [6], we want to test this method on interfaces from other less-explored domains.
To improve the degree of automation, we want to investigate the use of the Baum-Welch training algorithm.
To minimize the zero emission probabilities, we want to investigate the use of the Synset-HMM [20].
18. References
Benslimane, S. M., Malki, M., Rahmouni, M. K., & Benslimane, D. (2007). Extracting personalised ontology from data-intensive web application: An HTML forms-based reverse engineering approach. Informatica, 18(4), 511-534.
Freitag, D., & McCallum, A. K. (1999). Information extraction with HMMs and shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 31-36.
Gupta, S., Kaiser, G. E., Grimm, P., Chiang, M. F., & Starren, J. (2005). Automating content extraction of HTML documents. World Wide Web, 8(2), 179-224.
Halevy, A. Y. (2005). Why your data won't mix: Semantic heterogeneity. Queue, 3, 50-58.
He, B., & Chang, K. C. (2003). Statistical schema matching across web query interfaces. 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California. 217-228.
He, B., Patel, M., Zhang, Z., & Chang, K. C. (2007a). Accessing the deep web. Communications of the ACM, 50(5), 94-101.
He, H., Meng, W., Lu, Y., Yu, C., & Wu, Z. (2007b). Towards deeper understanding of the search interfaces of the deep web. World Wide Web, 10(2), 133-155.
He, H., Meng, W., Yu, C., & Wu, Z. (2004). Automatic integration of web search interfaces with WISE-Integrator. The VLDB Journal, 13(3), 256-273.
Kalijuvee, O., Buyukkokten, O., Garcia-Molina, H., & Paepcke, A. (2001). Efficient web form entry on PDAs. Proceedings of the 10th International Conference on World Wide Web, Hong Kong.
Kushmerick, N. (2002). Finite-state approaches to web information extraction. 3rd Summer Convention on Information Extraction, 77-91.
Kushmerick, N. (2003). Learning to invoke web forms. On the Move to Meaningful Internet Systems 2003 (pp. 997-1013). Springer Berlin / Heidelberg.
Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. Y. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
19. References
Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. Proceedings of the VLDB Endowment, 1(1), 684-694.
Oliver, N., Garg, A., & Horvitz, E. (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96(2), 163-180.
Pei, J., Hong, J., & Bell, D. (2006). A robust approach to schema matching over web query interfaces. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, Georgia. 46-55.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 129-138.
Russell, S. J., & Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall.
Seymore, K., McCallum, A. K., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 37-42.
Tran-Le, M. S., Vo-Dang, T. T., Ho-Van, Q., & Dang, T. K. (2008). Automatic information extraction from the web: An HMM-based approach. Modeling, Simulation and Optimization of Complex Processes (pp. 575-585). Springer Berlin Heidelberg.
Wang, J., Wen, J., Lochovsky, F., & Ma, W. (2004). Instance-based schema matching for web databases by domain-specific query probing. Thirtieth International Conference on Very Large Data Bases, 30, 408-419.
Wu, W., Yu, C., Doan, A., & Meng, W. (2004). An interactive clustering-based approach to integrating source query interfaces on the deep web. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 95-106.
Zhang, Z., He, B., & Chang, K. C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 107-118.
Zhong, P., & Chen, J. (2006). A generalized hidden Markov model approach for web information extraction. Web Intelligence 2006 (WI 2006), Hong Kong, China. 709-718.