Searching Repositories of Web Application Models

Alessandro Bozzon, Marco Brambilla, Piero Fraternali
ICWE 2010, Vienna, July 7th 2010

Context
- Project repositories are a central asset in (Web) software development:
  - they preserve the technical knowledge gathered in past development activities
  - they now extend beyond the boundaries of individual organizations and have a social role in the diffusion of coding and design solutions
  - they allow for reuse of knowledge and artifacts
- Locating relevant information in a vast project repository is problematic. Two options:
  - Manual tagging: time-consuming and prone to errors, omissions, and inconsistencies
  - Automatic analysis: much of the semantics can be lost in the process

Addressed Problem
- Objective: easing the discovery of useful information from past software projects
- Main resource: application models available in companies applying Model-Driven Engineering practices
  - in contrast to existing solutions, which mainly focus on the discovery of code, documentation, and annotations
- Why is dealing with application models an advantage?
  - Increased result quality (thanks to the more valuable information embedded in models with respect to the code)
  - Less need for manual tagging

Related work: Component search
- Retrieval of annotated pieces of software dates back to the '90s. Various approaches:
  - Worldwide search engine based on JavaBeans and CORBA [Agora, Internet Computing, 1998]
  - Search engines for Web services based on an indexed Vector Space Model characterization of their properties [Dustdar et al., ECWS 2005]
  - Significance-based search that exploits graph models of a software component library (usage relations used as links propagating significance) [Inoue et al., TOSEM 2005]
  - Combination of formal and semi-formal specifications to describe the behaviour and structure of components [Khalifa et al., ASEA 2008]

Related work: Source code search
- Several communities and on-line tools for sharing and retrieving code: Google Code, Snipplr, Koders, Codase, Jexamples, SourceForge
  - Keyword queries are matched directly against the code
  - Results are the exact locations where the keyword(s) appear
  - Plus advanced behaviours: regular expressions (Google), wildcards (Codase), restriction to specific concept types (Jexamples, Codase), advanced ranking, e.g., based on relevance of match, activity, date of registration, recency of last update (SourceForge)
- Other approaches:
  - Information retrieval techniques for software reuse [Frakes et al., SIGIR Forum 1987]
  - Taking advantage of code structural information [Holmes and Murphy, ICSE 2005] and [Sourcerer Project by Bajracharya et al., SUITE ICSE workshop 2009]

Related work: Model search
- The problem is usually restricted to searching UML or ER models
  - XML / XMI format for seamlessly indexing UML models, text files, and others [Gibb et al., 2000] [Lorens et al., 2004] [Moogle: Lucredio et al., Models 2008]
  - UML artifacts classified with WordNet terms and extracted through Case-Based Reasoning [Gomes et al., AI Comm., 2004]
  - Database conceptual model retrieval based on text search, schema matching, and structurally-aware scoring methods, with queries by example and keyword-based queries [Schemr: Chen, Halevy, SIGMOD09]
  - IR techniques applied to models and code together, for tracing the association between requirements, design artifacts, and code [Antoniol et al., 2000]
  - […]

Related work: Business Process Discovery
- Different approaches to the extraction of BP models from repositories:
  - Based on the workflow topology only: graph-based comparison or XML-based querying [Beeri et al., VLDB 2006] [Lu et al., BPM 2006] [Shao et al., ICDE 2009]
  - Based on semantic reasoning and discovery, using SPARQL, query by example, SQL-like languages, and so on [Kiefer et al., ESWC 2007] [Goderis et al., ICWS 2006] [Awad et al., EDOC 2008] [Zhuge 2002] [Belhajjame, Brambilla, BPMDS 2009]
  - Based on IR techniques [Dongen, Dijkman et al., CAiSE 2008]

Our contribution
- A model-based search solution, with several innovations:
  - it automatically exploits the semantics of the searched conceptual models
  - it does not require manual annotation
  - it supports alternative indexing and ranking functions, based on the meta-model of the considered DSL(s)
  - it is based on a model-independent framework, which can be customized to any meta-model
- User study to evaluate acceptance and the quality perceived by users
- Performance tests to evaluate scalability

Overall Architecture of the System
(Source: Engineering Web Search Application, Bozzon, Brambilla, Tutorial @ICWE2010)

Overall Architecture of the System
The Content Processing Flow extracts meaningful information from projects and uses it to create the search engine index.
1. CONTENT PROCESSING
  - project analysis: captures project-level, global metadata
  - segmentation: splits the project into smaller units
  - segment analysis: extracts from segments the information to be indexed
  - linguistic normalization: applies the typical normalization operations of IR

Overall Architecture of the System
The Content Processing Flow extracts meaningful information from projects and uses it to create the search engine index.
2. INDEXING
  - each project or segment is physically represented as a document
  - the search engine indexes are built based on the documents
  - the DSL metamodel is taken into account
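The content processing steps and the "one document per segment" idea can be illustrated with a minimal Python sketch. It is not the prototype's implementation (which extends Apache Lucene); the input dictionary layout, the tiny stop-word list, and the function names `normalize` and `process_project` are assumptions made for illustration. The output field names follow the fields listed for the experiments (projectID, projectName, documentType, text).

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to"}


def normalize(text):
    """Linguistic normalization: strip accents, lowercase, tokenize, drop stop words."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", without_accents.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def process_project(project):
    """Turn one project into a list of index documents, one per segment."""
    # Project analysis: capture project-level, global metadata.
    meta = {"projectID": project["id"], "projectName": project["name"]}
    documents = []
    # Segmentation: the input is assumed to be already split into segments.
    for segment in project["segments"]:
        # Segment analysis: collect the text to be indexed from the segment.
        raw_text = " ".join([segment["name"]] + segment.get("labels", []))
        documents.append(
            {**meta, "documentType": segment["type"], "text": normalize(raw_text)}
        )
    return documents
```

For example, a project with a single page segment named "Catalogue Home Page" yields one document whose `text` field holds the normalized tokens of that segment.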

Overall Architecture of the System
The Query and Result Presentation Flow deals with the submitted queries and the production of the result set.
1. USER INTERFACE supports:
  - Keyword-based queries
  - Content-based queries (aka QBE, query by example)
  - Rendering of the results

Overall Architecture of the System
The Query and Result Presentation Flow deals with the submitted queries and the production of the result set.
2. QUERY PROCESSING
  - matches the query to the indexed content using a given similarity criterion
  - produces ranked results

Design Dimensions of Model Retrieval (1/2)
- Segmentation Granularity: the "size" of the atomic unit of retrieval for the user
  - Project
  - Sub-project
  - Model concepts (all or only the main ones)
- Index structure: one or more fields (each associated with a boosting score)
  - Flat: a simple list of terms, without taking into account model semantics
  - Weighted: model concepts are used to weight terms in the ranking
  - Multi-field: terms belonging to different model concepts are collected into separate fields
  - Structured: the model is translated into a representation that reflects the hierarchies and associations among concepts
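The difference between the flat and multi-field index structures can be shown with a small inverted-index sketch in Python. This is only an illustration under assumed data shapes (each document maps a model concept name to its list of terms); the function names are hypothetical and do not come from the prototype.

```python
from collections import defaultdict


def build_flat_index(documents):
    """Flat: one posting set per term, ignoring the model concept it came from."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for terms in fields.values():
            for term in terms:
                index[term].add(doc_id)
    return index


def build_multifield_index(documents):
    """Multi-field: one posting set per (concept, term) pair."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for concept, terms in fields.items():
            for term in terms:
                index[(concept, term)].add(doc_id)
    return index


# Hypothetical documents: concept name -> terms found in that concept.
documents = {
    "d1": {"page": ["catalogue", "home"], "unit": ["products", "list"]},
    "d2": {"page": ["products"], "unit": ["details"]},
}
flat = build_flat_index(documents)
multi = build_multifield_index(documents)
```

With the flat index, a lookup for "products" returns both documents; with the multi-field index, the same term can be restricted to pages only, which is what makes per-concept boosting possible at query time.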

Example of Model Indexing
[Figure: a metamodel, a model, and the model's XML representation, indexed as a multi-field document. Labels in the example: project ID "1", project name "Product Catalogue Application", hypertext model "Product Catalogue", "Catalogue Home Page", "List Products" (list of products in the catalogue), "View Details" (details of a selected product).]

Design Dimensions of Model Retrieval (2/2)
- Query Language and Result Presentation: the way queries and results are presented
  - Keyword-based search
  - Document-based search: the system extracts the most significant words and submits them as a query
  - Search by example: the query is a model, analyzed and matched by similarity
  - Faceted search: exploration using facets (i.e., property-value pairs) extracted from the indexed documents
  - Snippet visualization: with the matching points highlighted in graphical or textual form
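Faceted search, as described above, boils down to counting property-value pairs over the current result set and letting the user narrow the set by one of them. The following Python sketch shows that idea only; the document layout and the function names `facet_counts` and `refine` are assumptions, not the prototype's API.

```python
from collections import Counter


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for doc in results:
        for field in facet_fields:
            if field in doc:
                counts[field][doc[field]] += 1
    return counts


def refine(results, field, value):
    """Keep only the results whose facet field has the selected value."""
    return [doc for doc in results if doc.get(field) == value]
```

A user interface would typically show the counts next to the result list and call `refine` when a facet value is clicked.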

Our model-based search engine prototype
- General-purpose, model-independent, configurable system:
  - Configuration of a general-purpose search engine according to the selected design dimensions
  - Metamodel-aware rules to analyze models and populate the index
    - segmentation and text-extraction steps -> model transformation rules
  - Offline collection analysis -> compute statistics for fine-tuning the retrieval and ranking
    - Stop Domain Concept removal
    - optimization of the weights assigned to each model concept
- Provides a visual interface to perform queries and inspect results
- Content processing has been implemented by extending the text processing and analysis components provided by Apache Lucene

Experiment Settings - Dataset
- 48 real-world WebML projects from WebRatio
  - trouble ticketing, human resource management, multimedia search engines, Web portals, etc.
  - Italian and English
- ~250 modeling concepts
- 3,800 data model entities (with about 35,000 attributes and 3,800 relationships)
- 138 site views with about 10,000 pages and 470,000 units, and 20 Web services
- The overall repository takes around 85 MB of disk space

Experiment Settings - Configurations
- 3 different settings of the design dimensions: A, B, C
- A: flat index structure; B and C: multi-field weighted (projectID, projectName, documentType, text)

Option | Description | A B C
Segmentation Granularity
  Project | Entire project | X
  Sub-project | Subproject | X X
  Single-Concept | Arbitrary model concepts | X X
Index Structure
  Flat | Flat list of words | X X
  Weighted | Words weighted by the model concepts they belong to | X
  Multi-field | Words belonging to each model concept in separate fields | X X
Query Language and Result Presentation
  Keyword-based | Query by keywords | X X X
  Faceted | Query refined through specific dimensions | X X X
  Snippets | Visualization and exploration of result previews | X X X

Experiment C - model-based scoring function
- Experiments A and B exploit a traditional TF-IDF ranking function
- Experiment C exploits the DSL metamodel:
  - mtw(m, t): Model Term Weight, a metamodel-specific boost that depends on the concept m containing the term t
  - dw(d): Document Weight, a metamodel- and model-specific boosting value that expresses the importance of a given document (according to the selected granularity)
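The exact scoring formula is not reproduced in this text. As a rough Python sketch of how the two boosts could be layered on top of TF-IDF, the code below multiplies each term's contribution by a concept weight and the document total by a document weight. The multiplicative combination, the smoothed idf, the simplification of mtw(m, t) to a per-concept weight that ignores t, keying dw by document type, and all weight values are assumptions for illustration, not the authors' function.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency over the whole collection."""
    df = sum(
        1 for doc in documents
        if any(term in terms for terms in doc["fields"].values())
    )
    return math.log((1 + len(documents)) / (1 + df)) + 1.0


def score(query_terms, doc, documents, mtw, dw):
    """TF-IDF score with an assumed per-concept boost (mtw) and per-document boost (dw)."""
    total = 0.0
    for term in query_terms:
        for concept, terms in doc["fields"].items():
            tf = terms.count(term)
            if tf:
                # Assumed model term weight: depends only on the containing concept.
                total += tf * idf(term, documents) * mtw.get(concept, 1.0)
    # Assumed document weight: looked up by document type, default 1.0.
    return dw.get(doc["type"], 1.0) * total
```

Under these assumptions, two documents that contain the same query term score differently when the term sits in differently weighted concepts, which is the effect the model-based ranking aims for.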

User Interface
(A) Rendered result set and facets
(B) Snippet window with highlighted matches

User Evaluation - Perceived Quality
- A user study was conducted with 5 expert WebML designers to assess the quality and perception of the alternative configurations
- Users rated the results in the result sets; votes ranged from 1 (highly inappropriate) to 5 (highly appropriate)
- Experiments B and C received more votes in the high range of the scale
- Success factor: injecting the semantics of the meta-model

User Evaluation - Acceptance
- Users were asked 10 questions about the features of the application
- Votes ranged from 1 (bad) to 5 (good)
  - useful for model maintenance and reuse
  - role in improving the quality of the applications
  - a certain distance between the overall judged quality and the adoption likelihood
  - but there is a bias due to the lack of a graphical viewer

                                                               Avg.  Var.
Features
  Keyword Search                                               3.6   0.24
  Search Result Ranking                                        3.2   0.16
  Faceted Search                                               3.8   0.16
  Match Highlighting                                           3.6   0.24
Application
  Help reducing the maintenance costs?                         3.2   0.56
  Help improving the quality of the delivered application?     3.0   0.4
  Help understanding the model assets in the company?          4.4   0.24
  Help providing better estimates for future application costs? 2.8  0.56
Wrap Up
  Overall evaluation of the system                             4.0   0.4
  Would you use the system in your activities?                 3.0   1.2

Performance Evaluation - Query Time
- About 400 randomly generated 2-term and 3-term keyword queries
- Each query has been executed 20 times
- Query time is well below one second, and the curves indicate sub-linear growth
- The addition of Faceted Search and Snippet Visualization impacts heavily with the number of inde
Legend: NOSeg: No Segmentation; Seg: Segmentation; KS: Keyword Search; FS: Faceted Search; Snip: Snippet
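A measurement of this kind (random 2-term and 3-term queries, each repeated 20 times) could be set up along the following lines in Python. This is a hedged sketch of the methodology only: the `search` callable, the vocabulary, the fixed seed, and the function names are placeholders, and the harness is not the one used for the reported figures.

```python
import random
import statistics
import time


def generate_queries(vocabulary, n_queries, seed=0):
    """Build random keyword queries of 2 or 3 distinct terms each."""
    rng = random.Random(seed)
    return [rng.sample(vocabulary, rng.choice((2, 3))) for _ in range(n_queries)]


def mean_query_time(search, queries, repetitions=20):
    """Average time in seconds per query execution, each query run `repetitions` times."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        for _ in range(repetitions):
            search(query)
        timings.append((time.perf_counter() - start) / repetitions)
    return statistics.mean(timings)
```

Repeating the measurement on indexes built from growing subsets of the repository would give one point per subset for a query-time curve.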

Performance Evaluation - Index Size
- Size grows almost linearly with the number of projects in all configurations
- Baseline configurations feature index sizes about 10 times smaller than the repository size
- Faceted Search doubles the index size
Legend: NOSeg: No Segmentation; Seg: Segmentation; KS: Keyword Search; FS: Faceted Search; Snip: Snippet

Conclusions and future directions
- A metamodel-aware approach and a system prototype for searches over model repositories
- Scalability tests and user studies in different experimental settings
- Future work:
  - Integration of content-based search
  - Improved result visualization: integration in the WebRatio tool-suite for WebML and visual highlighting of the matches in the projects
  - Adaptive fine-tuning for improving precision and recall
  - Experiments with more modeling languages (e.g., BPMN)
  - Definition of generic benchmark criteria for model-driven repository search

Thanks! Questions?
Alessandro Bozzon, Marco Brambilla, Piero Fraternali
[email_address]
