This document summarizes a study on using Hidden Markov Models (HMMs) for search interface segmentation. The researchers applied a two-layered HMM approach, with the first layer tagging interface components with semantic labels and the second layer segmenting the interface. Their experiments showed domain-specific HMMs performed best on interfaces from the same domain, while cross-domain HMMs captured patterns across domains. The study contributed an effective probabilistic approach to interface segmentation and found appropriate training data is key to accurate segmentation across domains.
A Programmatic View and Implementation of XML (CSCJournals)
XML as a markup language defines rules for encoding data in a free format comprehensible to both humans and machines. The use of XML for data integration, file configuration, and interface definition is widely adopted across the software industry.
The purpose of this paper is to examine an implementation of XML as a programming language, extending the capabilities offered by frameworks and simplifying coding tasks. The code becomes a set of functions, all sharing the same pattern and written as XML fragments. The defined language takes advantage of predefined common libraries and provides a means to invoke handlers from user-interface components. Programmers benefit from the simplicity of this language to quickly grasp the logic implemented by a function, which results in better maintainability and more stable rapid development.
This document presents Trustrace, an approach that uses software repository links to improve the trust in automatically recovered traceability links. Trustrace calculates trust values for traceability links based on their similarity scores from information retrieval techniques as well as evidence from other sources like version control commit logs. An empirical study on two systems found that Trustrace improved precision and recall over vector space models and reduced an expert's effort to validate links by up to 50%. The results also tended to improve when using larger version control commit logs.
This document compares model-oriented and process algebra approaches to formal specification languages. It discusses key formal specification styles including model-oriented, algebraic, transition-based, process algebra, logic-based, and reactive approaches. It then evaluates several model-oriented (Z, VDM, B) and process algebra (CSP, CCS) languages based on criteria like abstraction, ambiguity, consistency, concurrency, readability and reusability. Finally, it discusses the B method and its tool support, comparing it to related techniques like Event-B, VDM, TLA, ASM and Z. The document provides an overview of different formal specification approaches and evaluates some example languages in these categories.
This document discusses and compares two formal specification styles: model-oriented and process algebra approaches. It provides an overview of different formal specification languages, including model-oriented languages like B, VDM, and Z, as well as process algebra languages like CSP and CCS. The document analyzes these approaches based on criteria like abstraction, ambiguity, consistency, and concurrency to evaluate their strengths and weaknesses for specifying systems formally.
This document summarizes a research paper about introducing explicit phrase alignment to neural machine translation (NMT) models. The key ideas are: (1) To develop an NMT model that treats phrase alignment as a latent variable during decoding, allowing the use of a phrase-based search space where alignment is available; (2) To design a new decoding algorithm using the available phrase alignment that can impose lexical and structural constraints while maintaining translation quality. Experiments showed the approach makes NMT more interpretable without sacrificing performance, and significantly improves constrained translation tasks.
IRJET - An Analysis of Recent Advancements on the Dependency Parser (IRJET Journal)
This document summarizes recent advancements in dependency parsers. It discusses how dependency parsers have been used to parse languages with free word order like Hindi and analyze source code from various programming languages. Several studies are highlighted that have used dependency parsers to extract semantic relationships, identify errors in automatic speech recognition, incorporate long-distance dependencies, and address feature sparseness issues. Dependency parsers have been shown to outperform other models for tasks like topic detection and can parse biomedical text, though both Link Grammar and Connexor Machinese Syntax parsers were found to have limitations for the biomedical domain.
Doppl is a new programming language being developed that aims to provide natural syntax for parallel programming. The language is focused on shared memory applications and message passing between tasks. This first development diary outlines the goals of the language, which include allowing programmers to model algorithms as state machines and represent data as attribute arrays to improve cache performance. Future iterations will explore additional features like loop structures and type systems.
1) Conditional random fields (CRFs) are a framework for building probabilistic models to segment and label sequence data.
2) CRFs offer advantages over hidden Markov models (HMMs) by relaxing the strong independence assumptions those models make, allowing features of the observation sequence that need not be independent given the state.
3) CRFs also solve the "label bias" problem that affects maximum entropy Markov models (MEMMs) by assigning a single, globally normalized probability to entire label sequences rather than normalizing transition probabilities separately at each state.
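To make the contrast in point 3 concrete, here is a small, hedged numpy sketch (not from the original paper) that scores a toy label sequence two ways: with per-step, locally normalized conditionals as in an MEMM, and with a single globally normalized distribution over whole sequences as in a CRF. The `scores` array stands in for feature-derived potentials and is purely illustrative.

```python
import numpy as np
from itertools import product

# Toy setup: 2 labels, sequence length 3.
# scores[t, prev, curr] plays the role of feature-derived potentials; values are made up.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 2, 2))
labels = (0, 1)

def memm_prob(path, start=0):
    """Product of per-step conditionals, normalized at every step (MEMM-style)."""
    prob, prev = 1.0, start
    for t, y in enumerate(path):
        local = np.exp(scores[t, prev])      # scores available at this step only
        prob *= local[y] / local.sum()       # local normalization -> label bias
        prev = y
    return prob

def crf_prob(path, start=0):
    """Single joint probability over the whole label sequence (CRF-style)."""
    def total(p):
        s, prev = 0.0, start
        for t, y in enumerate(p):
            s += scores[t, prev, y]
            prev = y
        return s
    Z = sum(np.exp(total(p)) for p in product(labels, repeat=len(path)))
    return np.exp(total(path)) / Z

print(memm_prob((0, 1, 1)), crf_prob((0, 1, 1)))
```

The two functions generally disagree because only the second one lets evidence from later steps redistribute probability mass across the whole sequence.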
The document describes a method for learning design patterns from hierarchical labeled data using grammar induction and Bayesian modeling. The method begins with an initial grammar derived directly from the exemplars, then uses Markov chain Monte Carlo optimization to explore more general grammars formed by merging and splitting nonterminal symbols. The optimal grammar balances descriptive power over the exemplars against representation complexity, distilling general patterns from the data in a principled way. The method is demonstrated on geometric models and web pages.
A new language for a new biology: How SBML and other tools are transforming m... (Mike Hucka)
Presentation given at the Victorian Systems Biology Symposium (http://www.emblaustralia.org/About_us/news/mike-hucka.aspx) at the Walter and Eliza Hall Institute in Melbourne, Australia, on 20 August 2013.
Reasoning of database consistency through description logics (Ahmad Karawash)
The document discusses reasoning of database consistency through description logics. It begins with an introduction and overview before covering data models and description logics, description logics and database querying, data integration, and concluding. It describes how entity relationship models are used to describe database structure and how they can be transformed into description logics knowledge bases. This allows reasoning about database consistency, satisfiability, and other properties to identify issues like redundancy. Description logics are also discussed as a way to perform querying and classify queries.
This document discusses criteria for modularization in software design. It defines modules as named entities that contain instructions, logic, and data structures. Good modularization aims to decompose a system into functional units with minimal coupling between modules. Modules should be designed for high cohesion (related elements) and low coupling (dependencies). The types of coupling from strongest to weakest are content, common, control, stamp, and data coupling. The document also discusses different types of cohesion within modules from weakest to strongest. The goal is functional cohesion with minimal coupling between modules.
This document discusses software design principles and methods. It covers topics like abstraction, modularity, coupling and cohesion, and information hiding. It also describes different design methods including functional decomposition, data flow design, design based on data structures, and object-oriented design. Key aspects of these methods are explained, such as the stages of object-oriented analysis and design. The document provides examples to illustrate different design concepts and metrics.
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE (IJCI JOURNAL)
Pseudocode is an artificial and informal language that helps programmers to develop algorithms. In this paper a software tool is described for translating pseudocode into a particular programming language. The tool takes pseudocode as input, compiles it, and translates it into a concrete programming language. The scope of the tool is wide, as it can be extended into a universal programming tool that produces any specified programming language from a given pseudocode. Here we present the solution for translating pseudocode into a programming language by implementing the stages of a compiler.
The document proposes a novel ranking method called Fidelity Rank (FRank) that combines the probabilistic ranking framework with the generalized additive model. It introduces a new fidelity loss function to address problems with existing loss functions like cross entropy. FRank was tested on TREC and web search datasets and significantly outperformed other learning to rank algorithms like RankBoost, RankNet and RankSVM in terms of metrics like MAP and NDCG. Future work could involve theoretical analysis of FRank's generalization bounds and combining it with other machine learning techniques.
Is Fortran still relevant? Comparing Fortran with Java and C++ (ijseajournal)
This paper presents a comparative study to evaluate and compare Fortran with the two most popular programming languages, Java and C++. Fortran has gone through major and minor extensions in the years 2003 and 2008. (1) How much have these extensions made Fortran comparable to Java and C++? (2) What are the differences and similarities in supporting features like templates, object constructors and destructors, abstract data types, and dynamic binding? These are the main questions we are trying to answer in this study. An object-oriented ray tracing application is implemented in these three languages to compare them. By using only one program we ensured there was only one set of requirements, thus making the comparison homogeneous. Based on our literature survey, this is the first study carried out to compare these languages by applying software metrics to the ray tracing application and comparing these results with the similarities and differences found in practice. We motivate the language implementers and compiler developers, by providing binary analysis and profiling of the application, to improve Fortran object handling and processing, and hence make it more prolific and general. This study facilitates and encourages the reader to further explore, study, and use these languages more effectively and productively, especially Fortran.
GENERATING PYTHON CODE FROM OBJECT-Z SPECIFICATIONS (ijseajournal)
ABSTRACT
Object-Z is an object-oriented specification language which extends the Z language with classes, objects, inheritance, and polymorphism, and which can be used to represent the specification of a complex system as a collection of objects. A number of existing works have mapped Object-Z to the C++ and Java programming languages. Python and Object-Z share many similarities: both follow the object-oriented paradigm and support set theory and predicate calculus; moreover, Python supports a functional style of programming that is naturally closer to formal specifications. We therefore propose a mapping from Object-Z specifications to Python code that covers some Object-Z constructs and expresses the specifications in Python in order to validate them. The validations covered by the mapping include preconditions, post-conditions, and invariants, which are built using lambda functions and Python decorators. This work has found Python to be an excellent language for developing libraries that map Object-Z specifications to Python.
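As a rough illustration of the idea described above (not code from the paper), the following sketch shows how a precondition, a postcondition, and a class invariant might be expressed as lambdas and enforced through a Python decorator. The `contract` decorator and the `BankAccount` example are invented names for illustration only.

```python
def contract(pre=lambda self, *a, **k: True,
             post=lambda self, result, *a, **k: True,
             inv=lambda self: True):
    """Wrap a method so that a precondition, postcondition, and invariant
    (all given as lambdas) are checked around the call."""
    def decorate(method):
        def wrapper(self, *args, **kwargs):
            assert pre(self, *args, **kwargs), "precondition violated"
            result = method(self, *args, **kwargs)
            assert post(self, result, *args, **kwargs), "postcondition violated"
            assert inv(self), "class invariant violated"
            return result
        return wrapper
    return decorate

class BankAccount:
    def __init__(self, balance=0):
        self.balance = balance

    @contract(pre=lambda self, amount: amount > 0,
              post=lambda self, result, amount: self.balance >= amount,
              inv=lambda self: self.balance >= 0)
    def deposit(self, amount):
        self.balance += amount
        return self.balance

acct = BankAccount()
acct.deposit(10)        # passes all checks
# acct.deposit(-5)      # would raise AssertionError: precondition violated
```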
This seminar lecture, given at the Gran Sasso Science Institute, provides an overview of software architecture styles, product lines, and my research.
An important aspect of a program, apart from its ability to solve the problem, is its maintainability. A program has to undergo frequent changes in its lifetime because of changes in the problems it must solve. If a program is not written in a manner that allows changes to be incorporated easily, after a while it may become useless altogether.
One way to bring some discipline into programming practices is structured programming. It is a way of creating programs that ensures maintainability, reusability, readability, and ease of debugging.
The document discusses software design principles of coupling and cohesion. It defines coupling as the interdependence between modules, and lists different types of coupling from high to low. Cohesion refers to the degree that the responsibilities within a module belong together, and it categorizes different levels of cohesion from worst to best. The document emphasizes that good design aims for low coupling between modules and high cohesion within modules.
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR (ijistjournal)
Pseudocode is an artificial and informal language that helps developers to create algorithms. In this paper a software tool is described for translating pseudocode into a particular source programming language. This tool compiles the pseudocode given by the user and translates it to a source programming language. The scope of the tool is wide, as it can be extended into a universal programming tool that produces any specified programming language from a given pseudocode. Here we present the solution for translating pseudocode to a programming language by using the different stages of a compiler.
Considerable interest has been given to Multiword Expression (MWE) identification and treatment. The identification of MWEs affects the quality of results of different tasks heavily used in natural language processing (NLP), such as parsing and generation. Different approaches to MWE identification have been applied, such as statistical methods, which are employed as an inexpensive and language-independent way of finding co-occurrence patterns. Another approach relies on linguistic methods for identification, which employ information such as part-of-speech (POS) filters; lexical alignment between languages is also used and produces more targeted candidate lists. This paper presents a framework for extracting Arabic MWEs (nominal or verbal) for bi-grams using a hybrid approach. The proposed approach starts by applying a statistical method and then utilizes linguistic rules to enhance the results by extracting only patterns that match a relevant language rule. The proposed hybrid approach outperforms other traditional approaches.
The document presents a taxonomy called ProMeTA for classifying program metamodels used in program reverse engineering. ProMeTA defines dimensions for characterizing metamodels such as target language, abstraction level, meta-language, and quality attributes. The taxonomy is used to analyze and classify five popular metamodels (ASTM, KDM, FAMIX, SPOOL, UNIQ). Key findings are that metamodels should support widely used languages and standards for long-term use and provide robust functionality and quality.
This document describes a morphological tagger for Korean developed by Chung-Hye Han and Martha Palmer. The tagger takes raw text as input and outputs each word labeled with its lemma and part-of-speech tag, and inflections labeled with inflectional tags. Unlike prior approaches, this tagger performs statistical tagging before morphological analysis. It uses a trigram tagger followed by applying morphological rules to tag unknown words, achieving 95% accuracy on test data.
Comparison of the Formal Specification Languages Based Upon Various Parameters (IOSR Journals)
This document compares various formal specification languages based on different parameters. It describes Z notation, OCL, VDM, SDL and Larch languages. Z notation uses set theory and logic to model state using schemas. OCL uses constraints to describe UML models. VDM uses basic types and functions to formally specify models. SDL specifies systems as communicating finite state machines. Larch uses an interface language and shared language to specify behaviors. The languages differ based on whether they are process-oriented, sequential-oriented, model-oriented or property-oriented and the underlying mathematics used like set theory, logic or algebra.
This document summarizes an approach to segmenting search interfaces using a two-layered hidden Markov model (HMM). The first layer uses a T-HMM to tag interface components with semantic labels like attribute-name, operator, and operand. The second layer uses an S-HMM to segment the interface into logical attributes by grouping related tagged components. The approach models an artificial designer that learns to segment interfaces by training the HMMs on manually segmented examples. It was tested on 200 biology search interfaces and showed promising results for extracting the underlying database querying semantics from the interface structure. Future work aims to improve schema extraction and domain coverage.
The document discusses search interface understanding (SIU), which involves representing, parsing, segmenting, and evaluating search interfaces on the deep web. SIU is challenging because search interfaces are designed autonomously without standard structures. The document outlines the SIU process and key challenges, such as interfaces having no defined boundaries for segmenting semantically related components. Techniques for SIU include rules, heuristics, and machine learning.
This document presents a multi-level methodology for developing UML sequence diagrams (SQDs) in a systematic way. The methodology has three levels - the object framework level, responsibility assignment level, and visual pattern level. Each level breaks the SQD development process into discrete stages and provides guidelines to help avoid common errors. The goal is to serve as an easy-to-use reference for novice SQD modelers to develop correct and consistent SQDs.
This dissertation proposal outlines a system that allows non-technical users to design and evolve databases by modeling their data needs through customizable forms. The key goals are to provide an easy-to-use interface for form design, and mapping algorithms that translate user-designed forms into high-quality databases. A preliminary evaluation with nurses found the form modeling interface effective and efficient. Mapping experiments successfully translated forms into databases that matched expert-designed standards. Future work includes usability studies varying form and database complexity, and exploring enhancements to mapping and merging algorithms.
Career portfolio which illustrates multi-media marketing, search engine marketing and optimization, strategic research and Web analytics accomplishments.
Mike Thelwall is a professor known for his research in the field of webometrics. He received his PhD in mathematics and leads the Statistical Cybermetrics Research Group. Webometrics involves the quantitative analysis of web phenomena such as link analysis, search engine evaluation, and web citation analysis. Thelwall's research has explored using webometrics to study the dissemination of scholarly research and evaluate universities. He has emphasized the need for conceptual frameworks and methodologies to interpret webometrics results and address challenges like the size and changing nature of the web.
Clinicians rely on health information technologies (HITs) for clinical data collection, but current HITs are inflexible and inconsistent with clinicians' needs. The researchers propose a flexible electronic health record (fEHR) system to allow clinicians to easily modify the system based on their changing data collection needs. The fEHR uses a form-based interface for clinicians to design forms, generates a corresponding form tree structure, and designs a high-quality database from the tree. A user study with 5 nurses found they could effectively replicate needs in the system and their efficiency and understanding improved over two rounds of tasks of increasing complexity. The researchers conclude the fEHR has potential to reduce HIT problems and that the database design
You're invited! Commemorative Dinner--Friday, January 28, 2011, at the Desert Diamond Casino (Pima Mine Rd.) & Treaty Exhibit Opening--Wednesday, February 2, 2011, at the Arizona State Museum.
The document announces the 4th Annual Segundo de Febrero Commemorative Dinner to recognize February 2, 1848, the date the Treaty of Guadalupe Hidalgo was signed establishing the border between the US and Mexico and creating the Mexican American community. The event will be held on January 30, 2010 at Desert Diamond Casino featuring a keynote speaker and award ceremony with proceeds benefiting Amistades Inc, a nonprofit for Latino substance abuse prevention. Attendees can RSVP and purchase tickets by January 15th.
This document describes using a Hidden Markov Model (HMM) approach to segment deep web search interfaces. The HMM acts as an artificial designer that can determine segment boundaries and label components based on acquired knowledge. A two-layered HMM is employed, with the first layer assigning semantic labels and the second layer segmenting the interface. The approach outperforms previous heuristic methods, achieving a 10% improvement in segmentation accuracy. Future work involves extracting more schema details, testing on other domains, and exploring alternative training algorithms.
Zhao and Huang, DeepSim: deep learning code functional similarity (itrejos)
Measuring code similarity is fundamental for many software engineering tasks, e.g., code search, refactoring, and reuse. However, most existing techniques focus on code syntactical similarity only, while measuring code functional similarity remains a challenging problem. In this paper, we propose a novel approach that encodes code control flow and data flow into a semantic matrix in which each element is a high-dimensional sparse binary feature vector, and we design a new deep learning model that measures code functional similarity based on this representation. By concatenating hidden representations learned from a code pair, this new model transforms the problem of detecting functionally similar code into binary classification, which can effectively learn patterns between functionally similar code with very different syntactics.
The document introduces query processing and optimization in relational database systems. It discusses the three phases a query passes through: parsing and translation, optimization, and evaluation. Key concepts covered include query metrics like cost based on disk accesses, the role of indexes in reducing costs, algorithms for select and join operations, and the goal of query optimization to find low-cost evaluation plans.
The document introduces query processing and optimization in database management systems. It discusses the three main phases a query passes through: 1) parsing and translation, 2) optimization, and 3) evaluation. In the first phase, the query is converted into an internal representation like relational algebra. In the second phase, rules are applied to transform the representation into a more efficient form. In the third phase, the optimized plan is executed and results are returned. The goal is to retrieve desired information from the database in a predictable, reliable, and timely manner.
This document discusses ontology mapping. It begins with an introduction to the semantic web and ontologies. Ontology mapping is important for allowing different ontologies to be aligned and related. There are different types of ontology mapping including alignment, merging, and mapping. The document then surveys some popular ontology mapping techniques including GLUE, PROMPT, and QOM. It evaluates these techniques and discusses their inputs, outputs, and approaches. The document concludes that semantic web research is important for advancing web technologies and realizing the goals of web 3.0. Future work could involve developing new ontology mapping techniques and publishing research on existing mapping methods.
Nina Grantcharova - Approach to Separation of Concerns via Design Patterns (iasaglobal)
Separation of Concerns aims at managing complexity by establishing a well-organized system where each part adheres to a single and unique purpose while maximizing the system's ability to adapt to change and increasing developers' productivity. The goal of this presentation is to promote the understanding of the principle of Separation of Concerns and to provide a selected set of foundational patterns to aid software architects in the designing of maintainable and extensible systems.
Chunking means splitting sentences into tokens and then grouping them in a meaningful way. When it comes to high-performance chunking systems, transformer models have proved to be the state-of-the-art benchmarks. Performing chunking as a task requires a large-scale, high-quality annotated corpus where each token is attached to a particular tag, similar to Named Entity Recognition tasks. These tags are later used in conjunction with pointer frameworks to find the final chunk. Solving this for a specific domain problem becomes a highly costly affair in terms of time and resources when a large, high-quality training set must be manually annotated. When the domain is specific and diverse, cold starting becomes even more difficult because of the large number of manually annotated queries expected to cover all aspects. To overcome the problem, we applied a grammar-based text generation mechanism where, instead of annotating sentences, we annotate using grammar templates. We defined various templates corresponding to different grammar rules. To create a sentence we used these templates along with the rules, where symbol or terminal values were chosen from the domain data catalog. This helped us to create a large number of annotated queries. These annotated queries were used for training the machine learning model using an ensemble transformer-based deep neural network model [24]. We found that grammar-based annotation was useful for identifying domain-based chunks in input query sentences without any manual annotation, achieving a classification F1 score of 96.97% in classifying the tokens for out-of-template queries.
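The generation mechanism described in the abstract above could look roughly like the following sketch (a hedged illustration, not the authors' implementation): grammar templates with typed slots are expanded against a small made-up domain catalog, producing token/tag pairs that could serve as training data for a chunker. The `catalog`, `templates`, and slot names are all hypothetical.

```python
import itertools

# Hypothetical domain catalog: slot name -> possible surface values.
catalog = {
    "METRIC": ["revenue", "profit", "order count"],
    "REGION": ["europe", "north america"],
    "YEAR":   ["2021", "2022"],
}

# Grammar templates: each token is either a literal (tagged "O")
# or a slot name that will be filled from the catalog and tagged with it.
templates = [
    ["show", "METRIC", "for", "REGION", "in", "YEAR"],
    ["compare", "METRIC", "between", "REGION", "and", "REGION"],
]

def expand(template):
    """Yield (token, tag) sequences for every combination of slot values."""
    slots = [t for t in template if t in catalog]
    for values in itertools.product(*(catalog[s] for s in slots)):
        values = list(values)
        tagged = []
        for tok in template:
            if tok in catalog:
                word = values.pop(0)
                # multi-word values get one tag per word (simple tagging scheme)
                tagged.extend((w, tok) for w in word.split())
            else:
                tagged.append((tok, "O"))
        yield tagged

annotated = [seq for tpl in templates for seq in expand(tpl)]
print(len(annotated), "annotated queries generated")
print(annotated[0])
```

A real system would use many more templates and a genuine domain catalog, but the principle of annotating templates once and generating many labeled sentences is the same.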
PPT : [26 Feb]
Source target transformation
Brain segmentation
Existing models
What are the advantages and disadvantages in the adversarial stretching of the networks?
What changes do we need to perform?
Resources:
https://www.v7labs.com/blog/domain-adaptation-guide
Some Standard Datasets:
Amazon, Webcam, and DSLR are three distinct domains
Domain adaptation is an ML technique that focuses on training models on a source domain and then adapting them to perform well on a target domain, where the source and target domains may have different distributions. Real-world datasets often suffer from domain shift, where data distributions differ due to factors like sensor changes, environment variations, or user demographics.
Types of Domain Adaptation:
Supervised Domain Adaptation (SDA): The model has labels in both the source and target domains.
Unsupervised Domain Adaptation (UDA): Only labeled data from the source domain is available during training.
Semi-Supervised Domain Adaptation (SSDA): A combination of labeled source-domain data and limited labeled target-domain data is used during training.
Multi-Source Domain Adaptation: Extends domain adaptation to multiple source domains. The model adapts to the target domain using knowledge from multiple related sources.
Adaptation Techniques:
Feature-level adaptation: Transforming features from both domains into a common space, reducing domain discrepancy.
Instance-level adaptation:
Model-level adaptation:
This adaptation can involve modifying the architecture, parameters, or other components of the model to reduce the impact of domain shift.
Classifier adaptation:
Generative models:
In the context of domain adaptation, generative models can be used to generate synthetic data in the target domain that is similar to the actual target domain data. This synthetic data can be combined with the source domain data during training to improve the model's ability to generalize to the target domain.
Transfer learning:
Domain adaptation is often a part of transfer learning, where knowledge gained from one task or domain is applied to another.
Applications:
Object recognition, NLP, sentiment analysis
Challenges: dealing with the heterogeneity between domains, selecting suitable adaptation techniques, and addressing the limited availability of labeled data in the target domain.
How it works:
Training a model on the source domain.
Extracting knowledge about the data distribution.
Adapting the model to the target domain by aligning features, modifying the classifier, or learning a shared representation.
Fine-tuning the model on the target domain (optional, depending on the technique).
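As a concrete, hedged illustration of the feature-level adaptation idea mentioned in these notes (not taken from them), the sketch below implements a CORAL-style correlation alignment with numpy: source features are whitened, re-colored to match the target covariance, and shifted to the target mean before a classifier would be trained on them. The arrays are random stand-ins for real source/target features, and the extra mean shift is an assumption beyond plain CORAL.

```python
import numpy as np

def sqrtm_psd(C):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def coral(Xs, Xt, eps=1e-5):
    """Align source features Xs to target features Xt by matching
    second-order statistics (CORAL-style), then shift to the target mean."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    whiten  = np.linalg.inv(sqrtm_psd(Cs))   # remove source correlations
    recolor = sqrtm_psd(Ct)                  # impose target correlations
    return (Xs - Xs.mean(0)) @ whiten @ recolor + Xt.mean(0)

# Random stand-ins for labeled source data and unlabeled target data.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 8)) * 2.0 + 1.0   # source-domain features
Xt = rng.normal(size=(200, 8)) * 0.5 - 1.0   # shifted target-domain features

Xs_aligned = coral(Xs, Xt)
print(np.round(np.cov(Xs_aligned, rowvar=False)[0, 0], 3),
      np.round(np.cov(Xt, rowvar=False)[0, 0], 3))  # variances are now comparable
```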
Disadvantages of Adversarial Domain Adaptation:
1. Computational cost: Training two competing neural networks simultaneously can be expensive.
Self-supervised learning: Leverages unlabeled data from the target domain to learn generalizable representations without relying solely on adversarial training.
This document summarizes the GLORP (Generic Lightweight Object-Relational Persistence) library, an open-source library for object-relational mapping. It discusses GLORP's motivations, including supporting schema changes for a critical application with a complex data model. Key features highlighted are GLORP's declarative mappings, optimized queries, automatic transaction handling, and object-level rollback support. The document also covers GLORP's licensing under the LGPL and acknowledges its contributors.
This document discusses interoperability between software components. It defines interoperability as the ability of independently developed components to interact meaningfully by communicating and exchanging data or services. Achieving interoperability is challenging due to heterogeneity between components in terms of programming languages, platforms, data formats, and assumptions. Common Object Request Broker Architecture (CORBA) and XML are examined as approaches to enabling interoperability, but both make assumptions that can limit their effectiveness and even introduce new interoperability issues in some cases. Shaw's taxonomy of interoperability solutions is also referenced.
Spy On Your Models, Standard talk at EclipseCon 2011 (Hugo Bruneliere)
The document discusses MoDisco, an Eclipse modeling project that uses models to represent and manipulate existing software systems. MoDisco aims to help with software modernization tasks like understanding legacy code, performing quality analysis, and migrating to new technologies. It includes a model browser for navigating and querying large and complex models. MoDisco is developed by a joint team from INRIA and Ecole des Mines de Nantes, with contributions from Mia-Software, Obeo, and other Eclipse members.
IEEE 2014 DOTNET DATA MINING PROJECTS: A novel model for mining association ru... (IEEEMEMTECHSTUDENTPROJECTS)
This was a presentation on my book MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field of Engineering, Science and Technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected for publication through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
This document provides a 50-hour roadmap for building large language model (LLM) applications. It introduces key concepts like text-based and image-based generative AI models, encoder-decoder models, attention mechanisms, and transformers. It then covers topics like intro to image generation, generative AI applications, embeddings, attention mechanisms, transformers, vector databases, semantic search, prompt engineering, fine-tuning foundation models, orchestration frameworks, autonomous agents, bias and fairness, and recommended LLM application projects. The document recommends several hands-on exercises and lists upcoming bootcamp dates and locations for learning to build LLM applications.
A Comparative Study of RDBMs and OODBMs in Relation to Security of Data (inscit2006)
Mansaf Alam and Siri Krishan Wasan
Department of Computer Sciences, Jamia Millia Islamia, New Delhi, India.
Department of Mathematics, Jamia Millia Islamia, New Delhi, India.
- The document discusses lessons learned from building the AMMA Model Engineering Platform, including the need for sound modeling principles with models treated as first-class entities.
- It describes how different technical spaces, like MDE, XML, and databases each have their own modeling conventions defined by metamodels.
- Transforming models across technical spaces requires understanding their different metamodels and representation schemes.
An Empirical Study on Using Hidden Markov Models for Search Interface Segmentation
1. AN EMPIRICAL STUDY ON USING HIDDEN MARKOV MODEL FOR SEARCH INTERFACE SEGMENTATION. Ritu Khare and Yuan An, The iSchool at Drexel, Drexel University, USA
2. Presentation Order: Problem: Interface Segmentation; Solution: Hidden Markov Model; Empirical Results; Summing Up
6. Novelty of the Solution. Solution: Hidden Markov Model; Empirical Results; Summing Up
7. Motivation: The Deep Web. What is the deep Web? The portion of the Web not returned by search engines through traditional crawling and indexing; its contents lie in online databases and are accessed by manually filling in HTML forms on search interfaces. How can it be made useful? Through meta search engines (e.g., Wu et al. 2004; He et al. 2004; Chang, He and Zhang 2005) and deep Web crawlers (e.g., Raghavan and Garcia-Molina 2001; Madhavan et al. 2008). A prerequisite for both is a thorough understanding of the semantics of search interfaces.
8. Search Interface Segmentation. A critical part of understanding the semantics of search interfaces is the segmentation of search interfaces into logical groups of implied queries, i.e., the grouping of related interface components together. In the example interface, the top segment contains 7 components and the bottom segment contains 4 components.
9. Why is Segmentation Challenging? For a human designer or user, a segment has an apparent semantic existence, supported by visual arrangements and past experiences. A machine, in contrast, cannot "see" a segment: it has no cognitive ability, and components that are visually close might be located far apart in the HTML code. In this paper, we investigate whether a machine can "learn" how to segment an interface.
10. The Novelty of the Solution: Model-Based 7 Shortcomings of existing works (Zhang et al., 2004; He et al., 2004; Raghavan and Garcia-Molina, 2001; Kalijuvee et al., 2001): they use rules and heuristics for segmentation, and these techniques have problems in handling scalability and heterogeneity. We overcome these shortcomings with a model-based approach: the implicit knowledge used by a designer to design an interface is encoded in an HMM (an artificial designer), which performs the segmentation.
11. The Novelty of the Solution: The Domain Aspect 8 The deep Web has diverse domains, and interface designs differ across domains. To segment interfaces I(Di) from a given subject domain Di, existing works have compared the accuracies attained by two methods: a domain-specific method, trained on interfaces from domain Di, and a generic method, trained on interfaces from a mix of arbitrary domains D1, D2, D3, ... Using hidden Markov models, we don't limit ourselves to the comparison between the two methods: for a given domain, we investigate what kind of training interfaces result in high segmentation accuracy and why. A fresh perspective.
18. Hidden Markov Models are needed to model and explain 'real-world processes' that are implicit and unobservable. (Figure: hidden states q0 ... q4, each emitting an observable symbol σ0 ... σ4.) An HMM has four elements: 1. State space: a finite set of hidden states {q0, q1, q2, ..., qn}. 2. Transition matrix: the probability P(qi -> qj) of transitioning from state qi to state qj. 3. Symbol space: a set of observable output tokens {σ1, σ2, ..., σm}. 4. Emission matrix: the probability P(qi ↑ σk) of state qi emitting the token σk.
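To make these four elements concrete, here is a minimal sketch, in Python, of how such a model could be written down for the interface-tagging setting discussed on the following slides. It is illustrative only, not the authors' code: the state and symbol names anticipate the semantic labels and component types used later in the talk, and all probability values are invented.

```python
# Illustrative HMM specification: hidden states are semantic labels,
# observable symbols are interface component types. Probabilities are made up.
states = ["attribute-name", "operator", "operand", "text-misc"]
symbols = ["text", "textbox", "radio-button-group", "selection-list", "checkbox"]

# Start distribution and transition matrix: P(next state | current state).
start = {"attribute-name": 0.85, "operator": 0.02, "operand": 0.03, "text-misc": 0.10}
transition = {
    "attribute-name": {"attribute-name": 0.05, "operator": 0.25, "operand": 0.60, "text-misc": 0.10},
    "operator":       {"attribute-name": 0.10, "operator": 0.05, "operand": 0.80, "text-misc": 0.05},
    "operand":        {"attribute-name": 0.60, "operator": 0.10, "operand": 0.20, "text-misc": 0.10},
    "text-misc":      {"attribute-name": 0.50, "operator": 0.05, "operand": 0.25, "text-misc": 0.20},
}

# Emission matrix: P(observed symbol | hidden state).
emission = {
    "attribute-name": {"text": 0.96, "textbox": 0.01, "radio-button-group": 0.01, "selection-list": 0.01, "checkbox": 0.01},
    "operator":       {"text": 0.20, "textbox": 0.05, "radio-button-group": 0.40, "selection-list": 0.30, "checkbox": 0.05},
    "operand":        {"text": 0.05, "textbox": 0.55, "radio-button-group": 0.10, "selection-list": 0.20, "checkbox": 0.10},
    "text-misc":      {"text": 0.90, "textbox": 0.02, "radio-button-group": 0.02, "selection-list": 0.04, "checkbox": 0.02},
}
```

The same dictionary layout is reused in the decoding and estimation sketches further below.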
19. Search Interface Analysis: Semantic Labels 11 For data-driven Web applications, interface components are translated into structured query (e.g. SQL) expressions: SELECT * FROM Gene WHERE Gene_Name = 'maggie'; A segment in a search interface corresponds to a WHERE clause, each collecting values, qualified using a built-in operator, for a particular attribute in the DB schema. Segmentation is thus a two-fold problem: identification of the boundaries of logical groups, and assignment of semantic labels (attribute-name, operator, operand) to the components of each group.
20. INTERFACE DESIGN PROCESS 12 While the components are observable, their semantic roles appear hidden to a machine, and the succession of one semantic label by another is similar to the transitioning of HMM states. (Figure: the example interface as the hidden label sequence Attribute-name, Operand, Attribute-name, Operator, Operand emitting the observable components Text (Gene ID), Textbox, Text (Gene Name), RB Group, Textbox.)
21. HMM: An Artificial Designer 13 An HMM can act like a human designer that can design an interface and determine the segment boundaries and semantic labels of components. We encoded the implicit knowledge required for interface segmentation in an HMM-based artificial designer. We employ a 2-layered HMM: the first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand); the second layer, S-HMM, segments the interface into logical attributes.
22. 2-LAYERED HMM 14 (Figure: the parser reads the interface as the component sequence Text, Textbox, Text, RB Group, Textbox; T-HMM maps it to the label sequence Attribute-name, Operand, Attribute-name, Operator, Operand; S-HMM maps these labels to the segment positions Begin-segment, End-segment, Begin-segment, Inside-segment, End-segment.)
23. MODEL SPECIFICATION: T-HMM & S-HMM 15 (Figure: training interfaces are used to specify both T-HMM and S-HMM; test interfaces are parsed into symbol sequences, T-HMM recovers their state sequences, i.e. semantic labels, and S-HMM then recovers the segment boundaries of the test interfaces.)
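A hypothetical sketch of how the two decoding passes could be chained is shown below; `segment_interface`, `decode_t_hmm`, and `decode_s_hmm` are placeholder names rather than functions from the paper, and any HMM decoder (such as the Viterbi sketch after the next slide) could be plugged in.

```python
from typing import Callable, List, Tuple

# Hypothetical two-layer decoding: the state sequence recovered by T-HMM
# becomes the observed symbol sequence for S-HMM.
def segment_interface(
    components: List[str],
    decode_t_hmm: Callable[[List[str]], List[str]],
    decode_s_hmm: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str, str]]:
    labels = decode_t_hmm(components)    # e.g. ["attribute-name", "operand", ...]
    positions = decode_s_hmm(labels)     # e.g. ["begin-segment", "end-segment", ...]
    return list(zip(components, labels, positions))
```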
28. INITIAL EXPERIMENTS: Domain-Specific 17 FIRST EXP.: BIOLOGY DOMAIN. Dataset: 200 interfaces; cross-validation: 190 training and 10 testing examples; training: maximum likelihood method; testing: Viterbi algorithm. COMPARISON WITH LEX (He et al. 2007): 4 DOMAINS. Dataset: 100 interfaces each. Why did the 2-layered HMM outperform? LEX does not model text-misc and thus suffered from under-segmentation, and LEX considers only those texts as attribute-names that are located within a 2-top-row distance from the form element; in reality, attribute-name and operand might be located far apart in the source code. (*For segments with multiple instances of attribute-names, at least one was correctly identified.)
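The Viterbi decoding used for testing is a standard algorithm; the following is a generic textbook sketch (not the authors' implementation), written against the dictionary-based matrices of the earlier illustrative model.

```python
import math
from typing import Dict, List

def _log(p: float) -> float:
    # Guard against log(0) for transitions/emissions never seen in training.
    return math.log(max(p, 1e-12))

def viterbi(obs: List[str], states: List[str],
            start: Dict[str, float],
            trans: Dict[str, Dict[str, float]],
            emit: Dict[str, Dict[str, float]]) -> List[str]:
    """Most likely hidden-state sequence for an observed symbol sequence."""
    # best[t][s] = (log-probability of the best path ending in state s, predecessor of s)
    best = [{s: (_log(start.get(s, 0.0)) + _log(emit.get(s, {}).get(obs[0], 0.0)), None)
             for s in states}]
    for t in range(1, len(obs)):
        best.append({
            s: max(((best[t - 1][p][0]
                     + _log(trans.get(p, {}).get(s, 0.0))
                     + _log(emit.get(s, {}).get(obs[t], 0.0)), p)
                    for p in states), key=lambda x: x[0])
            for s in states
        })
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])  # predecessor of the current state at step t
    return list(reversed(path))

# Example: decode the raw component sequence of the lower "Gene Name" segment.
# viterbi(["text", "textbox"], states, start, transition, emission)
```

Running the same decoder with the S-HMM's matrices over the recovered label sequence completes the second layer.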
29. HMM Variations: T-HMM Topology. Design preferences of designers from different domains are different. (Figure: T-HMM state transition topologies derived for the Automobile, Biology, Movie, Health, and Reference & Education domains and for the Mixed model; transitions with <5% probability are not shown.)
30. RESULTS 19 Three kinds of methods: 1. Domain-Specific, 2. Generic, 3. Cross-Domain. (Figure: a pattern captured by a domain-specific model in the automobile domain, and a text-misc pattern in the health domain captured by a cross-domain model.) Domain-specific models do not always result in the best performance, e.g. in the movie domain.
34. CONTRIBUTIONS 22 Introduction of a 2-layered HMM approach for interface segmentation, motivated by the probabilistic nature of the interface design process; the first work to apply HMMs to deep Web search interfaces. Effectiveness tested across representative domains of the deep Web: high segmentation accuracy in most domains, outperforming a previous approach, LEX, by at least 10% in most cases. Design & comparison of various learning models: a single model has the potential of accurately segmenting interfaces from multiple domains, provided it is trained on data having an appropriate variety and frequency of design patterns. An example is HMMbio, which performed better than other models on 80% of the tested domains; the variety and frequency of patterns in the biology domain helps HMMbio contain more design knowledge and be a smarter designer.
35. FUTURE WORK 23 Involve more domains: design a minimal set of models that reaches as many deep Web domains as possible, with each model returning higher accuracy than its domain-specific counterpart. Transition to a new interface representation scheme: distributed segments and segments with intertwined components. Recover the schema of deep Web databases: extract finer details, such as data types and constraints. Overcome the challenges posed by HMMs: manual tagging of training data (explore unsupervised training methods such as the Baum-Welch algorithm) and the time taken by the Viterbi algorithm for state recovery (find optimization techniques to improve efficiency). Use this method as an off-line pre-processing module for other applications such as meta-search engines and deep Web crawlers.
36. Suggestions, Thoughts, Ideas, Questions… THANK YOU ! 24 Acknowledgements: To the Anonymous Reviewers of CIKM 2009 References: [1] to [23] (in full paper).
Editor's Notes
A very good morning to everyone here. I am Ritu Khare from Drexel University in the USA, presenting our work on using hidden Markov models for search interface segmentation.
The presentation is divided into 4 parts: First, I will describe the research problem, Second, I will describe the proposed solution to solve this problem. Then, I will talk about the results of the experiments we’d carried out. Finally, I will specify the contributions of this work and some future directions.
The presentation is divided into 4 parts: First, I will describe the research problem, Second, I will describe the proposed solution to solve this problem. Then, I will talk about the results of the experiments we’d carried out. Finally, I will specify the contributions of this work and some future directions.
The motivation behind studying this problem is the deep Web. The deep Web is that portion of the Web that is not returned by search engines like Google through crawling and indexing. The contents of the deep Web lie in online databases that can only be accessed by filling up HTML forms that lie on search interfaces like this. Researchers have suggested many ways to make these hidden contents more useful and visible to Web users, such as designing metasearch engines and increasing the search engine visibility of deep Web contents. A critical pre-requisite of these solutions is a deep understanding of the semantics of search interfaces.
Therefore, we are studying the problem of interface segmentation, which is very important in understanding search interface semantics. Very simply stated, search interface segmentation means grouping related attributes together. Let's understand this with the help of this interface. It can be divided into 2 segments, where each one forms a different implied query. The top segment has 7 components, and the bottom has 4 components. This example suggests that a segment can have a varied number, format, and pattern of components.
Now let's see why this makes for a challenging problem. A search interface is designed by human designers in such a way that a user quickly recognizes the segments based on the visual arrangements of components and on her past experiences in performing searches using interfaces. In a way, segmentation comes very naturally to human users. At the other extreme, a machine cannot "see" a segment, for a couple of reasons. First, components that are visually close on the interface might be located far apart in the machine-readable HTML code. Second, a machine has no cognitive ability to recognize a segment boundary. In this work, we are studying whether a machine can "learn" how to segment an interface into implied queries.
There have been many works in the past that address the segmentation problem. These are based on rules and heuristics, which makes them unfit for handling diversity and scalability. Also, most of them do not group all components of a segment together, i.e. they suffer from under-segmentation. The proposed approach overcomes these shortcomings by taking a deeper, model-based, holistic approach to the segmentation problem: instead of rules, we incorporate the knowledge used by a designer for designing an interface into a model and use this model for segmentation. In a way, we create an artificial designer who has the ability to segment.
The deep Web has a diverse distribution of subject domains, and the design tendencies of designers from different domains also differ from each other. For interfaces belonging to a given domain, 2 kinds of methods can be designed: a domain-specific method and a generic method. Let's say we have an interface I belonging to domain Di. A domain-specific method for this interface will be designed by observing interfaces from domain Di only; a generic method for the same interface will be designed by observing interfaces from a random mix of domains. Existing works have compared the accuracies of the two methods and suggest that domain-specific methods always result in better performance. Using the model-based approach of hidden Markov models, in this work we look at the domain question with a fresh perspective: instead of 2, we devise 3 kinds of methods and study in detail why a particular method results in higher accuracy than another.
The presentation is divided into 4 parts: First, I will describe the research problem, Second, I will describe the proposed solution to solve this problem. Then, I will talk about the results of the experiments we’d carried out. Finally, I will specify the contributions of this work and some future directions.
So what exactly is an HMM? It can be best understood with the help of this figure, which shows an example HMM. The hidden nodes are the states and the white nodes are the symbols, or observations, emitted by the states. There are two stochastic processes involved here: one is the process of transition from one state to another, and the second is the process of emission of symbols by each state. These are the 4 important elements of an HMM: a finite set of states, a matrix that describes the probability of transitioning from one state to another, a finite set of symbols, and a matrix that describes the probability of emission of a symbol by a given state. HMMs are needed in the context of real-world processes that are unobservable and difficult to interpret, particularly by a machine. An HMM is used to model such processes and also to explain them, i.e. to determine the possible state transitions the process might have undergone to generate a given sequence of observable symbols.
Now let's look at a search interface in greater detail. A search interface consists of a sequence of components that belong to different logical groups. Components in a single group have different semantic roles, which we call semantic labels. For data-intensive Web applications, each search interface, when submitted to the server, is converted into a structured query expression. E.g., assuming the underlying DB table name is "Gene," the lower segment can be expressed as select * from Gene where Gene_name="maggie". In a way, each segment in a search interface represents a WHERE clause expressing a query condition. Thus, for this work, we use a set of 3 semantic labels: attribute name, operator, and operand. It should be noted that segmentation is a two-fold problem: it involves determination of the boundaries of logical groups and determination of the semantic labels of the components in each group.
In this work, our primary assumption is that the process of search interface design is probabilistic in nature. Consider this interface and let us think of how a designer might have laid down the components on it. The designer first lays out an attribute name, then an operand, then again an attribute name, an operator, and an operand. He lays out these labels based on some implicit knowledge which is beyond the natural understanding of a machine. All a machine can observe is that there is a text followed by a textbox followed by another text, and so on. A machine can observe the components, but the semantic labels appear hidden. Therefore, we believe that the interface design process can be modeled and explained using a hidden Markov model.
We believe that an HMM can simulate the process of interface design and can act like a human designer who has the ability to design an interface using implicit knowledge of semantic labels and segment patterns, and also the ability to determine the segment boundaries and semantic labels given a previously designed search interface. To accomplish segmentation, we encoded the implicit designer's knowledge in an HMM-based artificial designer. As we saw earlier, segmentation is a 2-fold process: determination of semantic labels and determination of boundaries. Therefore, we use a layered HMM with 2 layers: T-HMM, which tags components with apt semantic labels, and S-HMM, which creates boundaries around related groups of components.
Here is how the 2-layered HMM functions. Consider the same example interface. A machine parser, with no intelligence embedded and no training provided, would read this interface as a raw sequence of components. This becomes the input for the first layer, T-HMM, which reads these components as a sequence of semantic labels. This in turn becomes the input for the next layer, S-HMM, which tags these labels with respect to their position in a segment and hence finishes the task of segmentation.
Now let us look at the 2 layers in greater detail. For T-HMM, i.e. the layer that provides semantic labels, the observation symbols consist of the raw HTML components, such as text labels and the various form elements. The states are the semantic labels discussed earlier: attribute-name, operator, and operand. In an initial analysis of interfaces we noticed that there are certain texts found in real-world interfaces that belong to none of the 3 classes; they are either instructions for entering data, descriptions, or examples and constraints. Thus we create a 4th state and call it the text-misc state. The topology obtained from a specification dataset of 50 random interfaces is shown here. For S-HMM, the observation symbol space is the same as the state space of T-HMM, as the two layers are used in tandem. The states of S-HMM are the relative positions of each component with respect to a segment. Here is the state transition topology obtained from observing 50 randomly selected interfaces.
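Estimating such topologies from manually tagged interfaces amounts to maximum-likelihood counting of transitions and emissions. The sketch below illustrates that step under the same dictionary layout as the earlier examples; it is an assumed, generic implementation rather than the authors' code.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def estimate_hmm(tagged_seqs: List[List[Tuple[str, str]]]):
    """Maximum-likelihood HMM estimates from sequences of (symbol, state) pairs."""
    start_counts: Counter = Counter()
    trans_counts: Dict[str, Counter] = defaultdict(Counter)
    emit_counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in tagged_seqs:
        start_counts[seq[0][1]] += 1                      # state of the first component
        for symbol, state in seq:
            emit_counts[state][symbol] += 1               # emission counts
        for (_, prev_state), (_, state) in zip(seq, seq[1:]):
            trans_counts[prev_state][state] += 1          # transition counts
    def normalize(counts: Counter) -> Dict[str, float]:
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    start = normalize(start_counts)
    trans = {s: normalize(c) for s, c in trans_counts.items()}
    emit = {s: normalize(c) for s, c in emit_counts.items()}
    return start, trans, emit

# Example: one tagged interface, as a sequence of (component, semantic label) pairs.
# estimate_hmm([[("text", "attribute-name"), ("textbox", "operand")]])
```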
The presentation is divided into 4 parts: First, I will describe the research problem, Second, I will describe the proposed solution to solve this problem. Then, I will talk about the results of the experiments we’d carried out. Finally, I will specify the contributions of this work and some future directions.
The first experiment was conducted on the biology domain, as we found this domain a very interesting and less explored one. Most of the domains used by existing works are commercial ones, such as movies and books, so we decided to first dive into a non-commercial domain. We applied the 2-layered approach to segment 200 interfaces from this domain. Both the training and testing interfaces belong to the biology domain, so it is a domain-specific method. We found the following results: 86% of the segments were correctly identified, and over these rightly determined segments, we measured the accuracy of identification of semantic labels. We found that in many cases there are multiple instances of attribute-names within a single segment, so we decided to measure the accuracy in two ways: in 90% of the cases the correct attribute-name label was identified, and in 99% of the cases at least one instance of attribute-name was correctly identified by T-HMM. The accuracy attained for all the semantic labels was quite high, except for text-misc, which was misidentified as attribute-name in most cases; we shall work on improving this in the future. To compare the accuracy of our method with an existing heuristic-based approach, LEX, we implemented LEX and tested it on 100 interfaces from each of the 4 domains: two commercial (auto and movie) and two non-commercial (bio and health). Again, these are domain-specific methods. The second column lists the segmentation accuracy obtained by LEX, and the third column lists the improvement in this accuracy attained by our method. The reason we attained such results is that LEX does not model the text-misc state and suffered from under-segmentation in many cases, and the heuristics of LEX are limited in that they assume that attribute name and operand cannot be more than 2 rows apart in the HTML code, which is contrary to reality in many domains. You might have noticed the 4th column in the comparison table: it represents a variation of the HMM, and it too outperforms LEX on all the domains. Let's look at the different variations of the 2-layered HMM that we created by altering the training data.
We noticed that there exist differences between interface designs from different domains. Using HMMs, I derived T-HMM topologies for different domains. This figure shows design tendencies in the auto domain; states indicate the semantic labels assigned to components. Similarly, this state transition topology was created for the 4 other domains. The transitions and their values were found to differ across all 5 domains. E.g., several peculiarities can be seen in the auto domain: in all domains there is some probability of transitioning from operator to attribute-name except in the auto domain, and the transition from operand to operator is only found in this domain. HMMs are a useful way of studying the differences and preferences of designers in a particular domain. We also created another HMM with interfaces from a mix of all 5 domains and call it the mixed model.
Using these 6 variations of HMMs, we conducted 30 experiments, trying all possible combinations of training and testing data. All these cells belong to one of the three kinds of methods. The green cells represent the domain-specific methods, i.e. the training and test data come from the same domain; this is the method we used to conduct our initial experiments. The orange cells represent the generic methods, i.e. the training data is not consciously created and comes from a bunch of mixed-domain interfaces. The rest of the cells belong to the 'cross-domain' method, i.e. the training data is from domain X and the test data is from domain Y. The numbers in bold represent the highest accuracy attained while testing interfaces in a given domain, and the numbers in italics represent the weakest performance by a model in a given domain. We can see that HMMbio gives the highest performance in 4 out of 5 domains, of which 3 are cross-domain methods. Looking in greater detail, let's first consider patterns captured by domain-specific models. The first example comes from the automobile domain: the domain-specific model HMMauto generates the best performance in the auto domain, because this pattern is peculiar to the auto domain and hence wasn't captured by other models. Similarly, in the bio domain a segment pattern was peculiar and frequent and hence wasn't captured by other models, resulting in the best performance by the domain-specific model in the bio domain. Now let's look at some patterns that were captured by cross-domain models. E.g., a segment pattern in the health domain was under-segmented by HMMhealth, as this is a rare pattern in that domain; however, it was captured by the cross-domain model HMMbio, where it is common to have a text-misc after a textbox within a segment. Another pattern comes from the movie domain: it was incorrectly segmented by HMMmovie, as it is not common to have operators in a selection list in the movie domain, but this pattern is common in the bio domain, so it was captured by the cross-domain model HMMbio. We can see that, contrary to previous studies and intuition, domain-specific models don't always result in the highest accuracy. E.g., in the movie domain, HMMmovie returned 70% accuracy, which was less than that returned by every other model.
Although the domains tested are limited, we can derive some general conclusions. First, when a domain has a peculiar as well as frequent pattern, then that pattern can be captured by the domain-specific model; examples are the bio and auto domains. Second, when a domain D has a rare pattern and there is another domain B that has the same pattern as a frequent one, then that pattern can be recovered by a cross-domain model prepared from interfaces of domain B. In short, it is not that domain-specific models always lead to higher accuracy; instead, the model trained on better examples gives better results, better in the sense of the frequency of design patterns in both domains.
The presentation is divided into 4 parts: First, I will describe the research problem, Second, I will describe the proposed solution to solve this problem. Then, I will talk about the results of the experiments we’d carried out. Finally, I will specify the contributions of this work and some future directions.
We showed that the interface design process is probabilistic in nature and introduced an approach to interface segmentation. We are the first to apply HMMs to deep Web interfaces. We tested our method across several domains and found that it results in high accuracy and outshines a contemporary approach in all domains. We also designed different variations of the HMMs and tested them across all domains. An interesting conclusion we reached is that we can design a single model that can be used for segmenting interfaces from multiple domains; e.g. HMMbio, prepared from biology interfaces, outperformed the other models in 4 out of 5 domains.
In the future, we want to test our method on more domains and derive a minimal set of models that can cover the various domains present on the deep Web. In terms of improvement, we want to be able to represent more complex segments: some segments are intertwined with components of other segments, and certain segments are really strange in that they have the attribute name and operand intertwined in, and composed of, a single component. We also want to be able to extract more information about an attribute, such as data type, integrity constraints, etc. Also, using HMMs posed certain limitations to the approach. We had to perform manual tagging to prepare training data, so we want to explore unsupervised learning methods for preparing training data. Another problem was time complexity: we want to explore optimization methods to improve the efficiency of this approach, or we could use this approach as a pre-processing module for other advanced tasks related to the deep Web.
Thank you very much for listening with patience. Please let me know if you have any questions or comments to make.