Searching Repositories of Web Application Models
Alessandro Bozzon, Marco Brambilla, Piero Fraternali
ICWE 2010, Vienna, July 7th 2010
Context
  • Project repositories are a central asset in (Web) software development:
    - they preserve the technical knowledge gathered in past development activities
    - they now extend beyond the boundaries of individual organizations and have a social role in the diffusion of coding and design solutions
    - they allow for reuse of knowledge and artifacts
  • Locating relevant information in a vast project repository is problematic. Two options:
    - Manual tagging: time consuming and prone to errors, omissions and inconsistencies
    - Automatic analysis: much of the semantics can be lost in the process
Addressed Problem
  • Objective: easing the discovery of useful information from past software projects
  • Main resource: application models available in companies applying Model-Driven Engineering practices
  • In contrast to existing solutions, which mainly focus on the discovery of code, documentation, and annotations
  • Why is dealing with application models an advantage?
    - Increased result quality (models embed more valuable information than the code)
    - Less need for manual tagging
Related work: Component search
  • Retrieval of annotated pieces of software dates back to the '90s. Various approaches:
    - Worldwide search engine based on JavaBeans and CORBA [Agora, Internet Computing, 1998]
    - Search engines for Web services based on an indexed Vector Space Model characterization of their properties [Dustdar et al., ECWS 2005]
    - Significance-based search that exploits graph models of a software component library (usage relations used as links propagating significance) [Inoue et al., TOSEM 2005]
    - Combination of formal and semi-formal specification to describe behaviour and structure of components [Khalifa et al., ASEA 2008]
Related work: Source code search
  • Several communities and on-line tools for sharing and retrieving code: Google code, Snipplr, Koders, Codase, Jexamples, SourceForge
    - Keyword queries are directly matched to the code
    - Results are the exact locations where the keyword(s) appear
    - Plus advanced behaviours: regular expressions (Google), wildcards (Codase), restriction to specific concept types (Jexamples, Codase), advanced ranking, e.g., by relevance of match, activity, date of registration, recency of last update (SourceForge)
  • Other approaches:
    - Information retrieval techniques for software reuse [Frakes et al., SIGIR Forum 1987]
    - Taking advantage of code structural information [Holmes and Murphy, ICSE 2005] and [Sourcerer Project by Bajracharya et al., SUITE ICSE workshop 2009]
Related work: Model search
  • The problem is usually restricted to searching UML or ER models:
    - XML / XMI format for seamlessly indexing UML models, text files, and others [Gibb et al., 2000] [Lorens et al., 2004] [Moogle: Lucredio et al., Models 2008]
    - UML artifacts classified with WordNet terms and extracted through Case-Based Reasoning [Gomes et al., AI Comm., 2004]
    - Database conceptual model retrieval based on text search, schema matching, and structurally-aware scoring methods, with query-by-example and keyword-based queries [Schemr: Chen, Halevy, SIGMOD09]
    - IR techniques applied to models and code together, for tracing the association between requirements, design artifacts, and code [Antoniol et al., 2000]
    - […]
Related work: Business Process Discovery
  • Different approaches to the extraction of BP models from repositories:
    - Based on the workflow topology only: graph-based comparison or XML-based querying [Beeri et al., VLDB 2006] [Lu et al., BPM 2006] [Shao et al., ICDE 2009]
    - Based on semantic reasoning and discovery, using SPARQL, query by example, SQL-like languages, and so on [Kiefer et al., ESWC 2007] [Goderis et al., ICWS 2006] [Awad et al., EDOC 2008] [Zhuge 2002] [Belhajjame, Brambilla, BPMDS 2009]
    - Based on IR techniques [Dongen, Dijkman et al., CAiSE 2008]
Our contribution
  • A model-based search solution, with several innovations:
    - it automatically exploits the semantics of the searched conceptual models
    - it does not require manual annotation
    - it supports alternative indexing and ranking functions, based on the meta-model of the considered DSL(s)
    - it is based on a model-independent framework, which can be customized to any meta-model
  • User study to evaluate acceptance and the quality perceived by users
  • Performance tests to evaluate scalability
Overall Architecture of the System Engineering Web Search Application Bozzon, Brambilla, Tutorial @ICWE2010
Overall Architecture of the System
The Content Processing Flow extracts meaningful information from projects and uses it to create the search engine index.
1. CONTENT PROCESSING
  • project analysis captures project-level, global metadata
  • segmentation splits the project into smaller units
  • segment analysis extracts from segments the information to be indexed
  • linguistic normalization applies the typical normalization operations of IR
Overall Architecture of the System
The Content Processing Flow extracts meaningful information from projects and uses it to create the search engine index.
2. INDEXING
  • each project or segment is physically represented as a document
  • the search engine indexes are built based on the documents
  • the DSL metamodel is taken into account
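The content processing steps can be illustrated with a minimal Python sketch. It is not the authors' implementation (the prototype extends Apache Lucene components). The project layout as a plain dictionary, the tiny stop-word list, and the function names are assumptions made for illustration. The document fields (project ID, project name, document type, text) follow the ones listed for the experiments.

```python
import re
from dataclasses import dataclass

# Tiny illustrative stop-word list (an assumption, not the prototype's list).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to"}


@dataclass
class Document:
    project_id: str
    project_name: str
    document_type: str
    text: list  # normalized terms


def normalize(text):
    """Linguistic normalization: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def process_project(project):
    """Turn one project (a plain dict in this sketch) into index documents."""
    # Project analysis: global metadata shared by every segment.
    pid, name = project["id"], project["name"]
    docs = []
    # Segmentation: one segment per model element (hypothetical layout).
    for segment in project["segments"]:
        # Segment analysis: collect the text worth indexing.
        raw = " ".join([segment["name"], segment.get("comment", "")])
        docs.append(Document(pid, name, segment["type"], normalize(raw)))
    return docs


project = {
    "id": "1",
    "name": "Product Catalogue Application",
    "segments": [
        {"type": "page", "name": "Catalogue Home Page"},
        {"type": "unit", "name": "List Products",
         "comment": "List of product in the catalogue"},
    ],
}
docs = process_project(project)
```

Each resulting `Document` is what an indexer would then store. A coarser or finer segmentation only changes how `project["segments"]` is produced.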
Overall Architecture of the System
The Query and Result Presentation Flow deals with the submitted queries and the production of the result set.
1. USER INTERFACE supports:
  • Keyword-based queries
  • Content-based queries (aka QBE)
  • Rendering of the results
Overall Architecture of the System
The Query and Result Presentation Flow deals with the submitted queries and the production of the result set.
2. QUERY PROCESSING
  • matches the query to the indexed content using a given similarity criterion
  • produces ranked results
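As a rough illustration of query processing, the sketch below matches a keyword query against indexed term lists and ranks the matches with a plain TF-IDF sum. The exact weighting formula and the toy documents are assumptions. A real engine such as Lucene applies additional factors (for example length normalization) that are omitted here.

```python
import math
from collections import Counter


def rank(query_terms, documents):
    """Return (doc_id, score) pairs sorted by decreasing TF-IDF score."""
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))
    results = []
    for doc_id, terms in documents.items():
        tf = Counter(terms)
        # Sum tf * idf over the query terms present in this document.
        score = sum(tf[t] * math.log(1 + n / df[t])
                    for t in query_terms if tf[t])
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


documents = {
    "home": ["catalogue", "home", "page"],
    "list": ["list", "products", "list", "product", "catalogue"],
    "details": ["view", "details", "selected", "product"],
}
ranking = rank(["list", "catalogue"], documents)
```

Documents that contain none of the query terms are left out of the result set. The others are ordered by how often and how selectively they match.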
Design Dimensions of Model Retrieval (1/2)
  • Segmentation Granularity: the "size" of the atomic unit of retrieval for the user
    - Project
    - Sub-project
    - Model concepts (all or only the main ones)
  • Index structure: one or more fields (each associated with a boosting score)
    - Flat: a simple list of terms, without taking model semantics into account
    - Weighted: model concepts are used to weight terms in the ranking
    - Multi-field: terms belonging to different model concepts are collected into separate fields
    - Structured: the model is translated into a representation that reflects the hierarchies and associations among concepts
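The difference between a flat and a multi-field index structure can be sketched in a few lines of Python. The concept names used as dictionary keys are hypothetical. The sample strings are taken from the product catalogue example in this deck.

```python
# One model element, described by the text attached to each model concept.
# The concept names (keys) are assumptions made for this sketch.
model = {
    "project_name": "Product Catalogue Application",
    "page_name": "Catalogue Home Page",
    "unit_name": "List Products",
    "comment": "List of product in the catalogue",
}


def flat_index(model):
    """Flat: one bag of terms; concept boundaries are discarded."""
    return [term.lower() for text in model.values() for term in text.split()]


def multi_field_index(model):
    """Multi-field: one term list per model concept."""
    return {concept: [t.lower() for t in text.split()]
            for concept, text in model.items()}


flat = flat_index(model)
fields = multi_field_index(model)
```

With the flat structure a match on "catalogue" cannot tell a project name from a comment. With the multi-field structure the field name preserves that distinction, so a ranking function can treat the fields differently.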
Example of Model Indexing (Multi-Field)
Figure panels: Metamodel, Model, Model XML Representation. Indexed document:
  • ID: 1
  • PROJECT NAME: Product Catalogue Application
  • HYPERTEXT MODEL: Product Catalogue; Catalogue Home Page; List Products; List of product in the catalogue; View Details; Details of a selected product
Example of Model Indexing (Multi-Field, Weighted Index)
Figure panels: Metamodel, Model, Model XML Representation. Indexed document:
  • ID: 1
  • PROJECT NAME: Product Catalogue Application
  • HYPERTEXT MODEL: Product|2.0 Catalogue|2.0; Catalogue|1.0 Home|1.0 Page|1.0; List|0.5 Products|0.5; List|0.2 of|0.2 products|0.2 in|0.2 the|0.2 catalogue|0.2; View Details; Details of a selected product
  • Weights shown in the figure: 2.0, 1.0, 0.5
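The `term|weight` notation of the weighted index can be reproduced with a short Python sketch. The mapping from weights to concepts (site view name 2.0, page name 1.0, unit name 0.5, comment 0.2) is inferred from the example and should be read as an assumption, as should the concept names themselves.

```python
# Hypothetical concept-to-weight mapping, inferred from the example above.
CONCEPT_WEIGHTS = {
    "siteview_name": 2.0,
    "page_name": 1.0,
    "unit_name": 0.5,
    "comment": 0.2,
}


def weighted_terms(fields, weights):
    """Emit 'term|weight' tokens, one per term, using the weight of its concept."""
    tokens = []
    for concept, text in fields.items():
        weight = weights[concept]
        tokens.extend(f"{term}|{weight}" for term in text.split())
    return tokens


fields = {
    "siteview_name": "Product Catalogue",
    "page_name": "Catalogue Home Page",
    "unit_name": "List Products",
    "comment": "List of products in the catalogue",
}
tokens = weighted_terms(fields, CONCEPT_WEIGHTS)
```

The same term can appear with different weights (here "Catalogue" at 2.0 and 1.0), which is what lets the ranking favour matches in more important concepts.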
Design Dimensions of Model Retrieval (2/2)
  • Query Language and Result Presentation: the way queries and results are presented
    - Keyword-based search
    - Document-based search: the system extracts the most significant words and submits them as a query
    - Search by example: the query is a model, analyzed and matched by similarity
    - Faceted search: exploration using facets (i.e., property-value pairs) extracted from the indexed documents
    - Snippet visualization: the matching points are highlighted in graphical or textual form
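Faceted search and textual snippet highlighting can be sketched as follows. The facet name `documentType`, the sample result set, and the bracket-based highlighting are assumptions chosen for illustration, not the prototype's actual presentation layer.

```python
import re
from collections import Counter


def facet_counts(results, facet):
    """Count how many results carry each value of the given facet."""
    return Counter(doc[facet] for doc in results)


def refine(results, facet, value):
    """Keep only the results whose facet has the selected value."""
    return [doc for doc in results if doc[facet] == value]


def highlight(text, keyword):
    """Textual snippet: wrap case-insensitive keyword matches in brackets."""
    pattern = f"({re.escape(keyword)})"
    return re.sub(pattern, r"[\1]", text, flags=re.IGNORECASE)


results = [
    {"documentType": "page", "text": "Catalogue Home Page"},
    {"documentType": "unit", "text": "List Products"},
    {"documentType": "page", "text": "Product List Page"},
]
counts = facet_counts(results, "documentType")
pages = refine(results, "documentType", "page")
snippet = highlight("List Products", "list")
```

The facet counts are what a result page would show next to each property value. Selecting a value narrows the result set without a new keyword query.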
Our model-based search engine prototype
  • General purpose, model-independent, configurable system:
    - Configuration of a general purpose search engine according to the selected design dimensions
    - Metamodel-aware rules to analyze models and populate the index
    - Segmentation and text-extraction steps → model transformation rules
  • Offline collection analysis → compute statistics for fine-tuning the retrieval and ranking
    - Stop Domain Concept removal
    - Optimization of the weights assigned to each model concept
  • Provides a visual interface to perform queries and inspect results
  • Content processing has been implemented by extending the text processing and analysis components provided by Apache Lucene
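One plausible reading of "Stop Domain Concept removal" is that terms occurring in almost every document of the collection (for instance names of modeling concepts) are treated like stop words. The sketch below follows that interpretation. The threshold, the function name, and the sample data are assumptions, and the prototype may compute this differently.

```python
from collections import Counter


def domain_stop_terms(documents, max_fraction=0.8):
    """Return terms that occur in more than max_fraction of the documents."""
    df = Counter()
    for terms in documents:
        df.update(set(terms))  # count each term once per document
    limit = max_fraction * len(documents)
    return {term for term, count in df.items() if count > limit}


documents = [
    ["page", "catalogue"],
    ["page", "ticket"],
    ["page", "employee"],
    ["page", "catalogue", "search"],
]
stop_terms = domain_stop_terms(documents)
filtered = [[t for t in terms if t not in stop_terms] for terms in documents]
```

Running such an analysis offline keeps the cost out of the query path, which matches the offline collection analysis step described above.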
Detailed indexing process
Experiment Settings - Dataset
  • 48 real-world WebML projects from WebRatio
    - trouble ticketing, human resource management, multimedia search engines, Web portals, etc.
    - Italian and English
  • ~250 modeling concepts
  • 3,800 data model entities (with about 35,000 attributes and 3,800 relationships)
  • 138 site views with about 10,000 pages and 470,000 units, and 20 Web services
  • The overall repository takes around 85MB of disk space
Experiment Settings - Configurations
  • 3 different settings of the design dimensions: A, B, C
  • A: flat index structure; B and C: multi-field, weighted (projectID, projectName, documentType, text)

Option | Description | Configurations marked (among A, B, C)
Segmentation Granularity
  Project | Entire project | X
  Sub-project | Subproject | X X
  Single-Concept | Arbitrary model concepts | X X
Index Structure
  Flat | Flat list of words | X X
  Weighted | Words weighted by the model concepts they belong to | X
  Multi-field | Words belonging to each model concept in separate fields | X X
Query Language and Result Presentation
  Keyword-based | Query by keywords | X X X
  Faceted | Query refined through specific dimensions | X X X
  Snippets | Visualization and exploration of result previews | X X X
Experiment C - model-based scoring function
  • Experiments A and B exploit a traditional TF-IDF ranking function
  • Experiment C exploits the DSL metamodel:
    - mtw(m, t): Model Term Weight, a metamodel-specific boost that depends on the concept m containing the term t
    - dw(d): Document Weight, a metamodel- and model-specific boosting value that expresses the importance of a given document (according to the selected granularity)
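The scoring formula itself is not legible in this transcript, so the sketch below shows only one plausible way to combine the two boosts with TF-IDF: each term occurrence is scaled by the weight of the concept that contains it, and the sum is scaled by the document weight. This combination is an assumption. For simplicity `mtw` is keyed by concept only, whereas mtw(m, t) may also depend on the term.

```python
def score(query_terms, doc, idf, mtw, dw):
    """Assumed form: dw(d) * sum over t, m of tf(t, m, d) * idf(t) * mtw(m)."""
    total = 0.0
    for term in query_terms:
        for concept, terms in doc["fields"].items():
            tf = terms.count(term)  # term frequency inside this concept
            total += tf * idf.get(term, 0.0) * mtw.get(concept, 1.0)
    return dw.get(doc["type"], 1.0) * total


# Hypothetical weights and statistics, for illustration only.
idf = {"catalogue": 1.0}
mtw = {"page_name": 1.0, "comment": 0.2}
dw = {"page": 1.5}

doc = {
    "type": "page",
    "fields": {
        "page_name": ["catalogue", "home", "page"],
        "comment": ["catalogue", "overview"],
    },
}
result = score(["catalogue"], doc, idf, mtw, dw)
```

Setting every weight to 1.0 reduces this function to an unboosted TF-IDF sum, which corresponds to the ranking used in experiments A and B.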
User Interface
  • (A) Rendered result set and facets
  • (B) Snippet window with highlighted matches
User Evaluation - Perceived Quality
  • A user study was conducted with 5 expert WebML designers to assess the quality and perception of the alternative configurations
  • Users rated the results in the result sets; votes ranged from 1 (highly inappropriate) to 5 (highly appropriate)
  • Experiments B and C got more votes in the high range of the scale
  • Success factor: injecting the semantics of the meta-model
User Evaluation - Acceptance
  • Users were asked 10 questions about the features of the application
  • Votes ranged from 1 (bad) to 5 (good)
    - useful for model maintenance and reuse
    - role in improving the quality of the applications
    - a certain distance between the overall judged quality and the adoption likelihood
  • But there is a bias due to the lack of a graphical viewer

Question                                                        Avg.  Var.
Features
  Keyword Search                                                3.6   0.24
  Search Result Ranking                                         3.2   0.16
  Faceted Search                                                3.8   0.16
  Match Highlighting                                            3.6   0.24
Application
  Help reducing the maintenance costs?                          3.2   0.56
  Help improving the quality of the delivered application?      3.0   0.4
  Help understanding the model assets in the company?           4.4   0.24
  Help providing better estimates for future application costs? 2.8   0.56
Wrap Up
  Overall evaluation of the system                              4.0   0.4
  Would you use the system in your activities?                  3.0   1.2
Performance Evaluation - Query Time
  • About 400 randomly generated 2-term and 3-term keyword queries
  • Each query has been executed 20 times
  • Query time is abundantly sub-second and curves indicate a sub-linear growth
  • The addition of Faceted Search and Snippet Visualization impacts heavily with the number of inde
Chart legend: NOSeg: No Segmentation; Seg: Segmentation; KS: Keyword Search; FS: Faceted Search; Snip: Snippet
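The measurement procedure (about 400 random 2-term and 3-term queries, each executed 20 times) could be scripted along these lines. The vocabulary, the fixed random seed, and the stand-in `search` callable are assumptions. In a real test the callable would submit the query to the search engine.

```python
import random
import time


def random_queries(vocabulary, count, seed=0):
    """Generate `count` keyword queries of 2 or 3 distinct random terms."""
    rng = random.Random(seed)  # fixed seed so the workload is repeatable
    return [rng.sample(vocabulary, rng.choice((2, 3))) for _ in range(count)]


def mean_query_time(search, queries, repetitions=20):
    """Average wall-clock time per query, in seconds."""
    start = time.perf_counter()
    for _ in range(repetitions):
        for query in queries:
            search(query)
    elapsed = time.perf_counter() - start
    return elapsed / (repetitions * len(queries))


vocabulary = ["catalogue", "product", "ticket", "employee", "page", "search"]
queries = random_queries(vocabulary, 400)
# Stand-in search function that does no work; replace with a real engine call.
average_seconds = mean_query_time(lambda query: None, queries)
```

Repeating the same workload for growing numbers of indexed projects yields the kind of curve the slide refers to.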
Performance Evaluation - Index Size
  • Size grows almost linearly with the number of projects in all configurations
  • Baseline configurations feature index sizes about 10 times smaller than the repository size
  • Faceted Search doubles the index size
Chart legend: NOSeg: No Segmentation; Seg: Segmentation; KS: Keyword Search; FS: Faceted Search; Snip: Snippet
Conclusions and future directions
  • A metamodel-aware approach and a system prototype for searches over model repositories
  • Scalability tests and user studies in different experimental settings
  • Future work:
    - Integration of content-based search
    - Improved result visualization: integration in the WebRatio tool suite for WebML and visual highlighting of the matches in the projects
    - Adaptive fine-tuning for improving precision and recall
    - Experiments with more modeling languages (e.g., BPMN)
    - Definition of generic benchmark criteria for model-driven repository search
Thanks! Questions?
Alessandro Bozzon, Marco Brambilla, Piero Fraternali


Editor's Notes

  • #10 Preview: talk about the two processes