
Fostering Synergies - How Semantic Web Technology could influence Software Repositories


Talk given at SUITE 2010

Abstract:
State-of-the-art approaches to mining software repositories store software artifacts from various sources in monolithic relational databases. This puts a lot of querying power in the hands of software miners, but it comes at the cost of enclosing the data and hampering cross-application reuse. In this paper we discuss four problem scenarios to illustrate that Semantic Web technology is able to overcome these limitations. However, it requires that the software engineering research community agree on two prerequisites: (a) a common vocabulary to talk about software repositories, that is, an ontology; (b) a strategy for generating unique and stable references to all software artifacts inside such a repository, that is, Uniform Resource Identifiers (URIs).


  • Search-Driven Software Engineering is all about fulfilling the information needs of developers and maintainers. These information needs can be expressed in terms of questions, such as... (read some of the questions above)
  • The data needed to answer such questions is often locked away in data silos, such as bug tracking systems, version control systems, mailing lists, etc. I say locked away because many of these tools are not made for querying. Further, many information needs span more than one domain. This is where the limitations of the existing systems become apparent. To summarize them: (read examples).
  • (continue) We usually parse, for example, CVS logs or XML exports of bug reports and use some heuristics to establish links between them. Or we build richer source code models by parsing or partially compiling source files. Then we more or less mirror all the information in a relational database and provide a query interface on top of it. Examples are... (name the examples)
  • From the point of view of other researchers and tool builders, we are again building silos that are barely useful for anything other than the originally envisioned purposes.

    There are three main reasons for that:

    First, in theory, database schemas should be exchangeable thanks to DDLs; in practice it is still a painful undertaking.
    Second, relations are local: you cannot simply reference, from your own database, an entity stored in another database. You basically enclose your data in the database.
    Third, there is no consistent way to get the meaning of a relation: a query can join tables on any columns that match by data type, without any check on semantics.
  • We believe that we should overcome these limitations by defining a common data schema, meaning a common syntax and vocabulary to describe software artifacts and the relationships between them. This would, for example, give us the possibility to try out different search engines on different data repositories.

    We should further come up with ways to expose unique identifiers for software artifacts on the web. This enables two things: first, we can reference information across silo boundaries, and second, we could then potentially run distributed queries without all the preprocessing and mirroring effort I mentioned before.
  • We believe that the Semantic Web provides the tools for this. Forget about all the A.I. magic that you might associate with the Semantic Web. It is just a very convenient, yet simple, framework for describing and working with data.

    It provides a graph-based data model, described by simple subject-predicate-object triples, and URIs to reference resources. Vocabulary is described by ontologies. You can search such information graphs with SPARQL, the query language of the Semantic Web.
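The graph model and the pattern-based search described in this note can be sketched in a few lines of plain Python. This is only an illustration of the idea behind matching graph patterns, not SPARQL itself and not code from the talk. The bug, `affects`, and `Foo.java` URIs come from the slides; the second bug, `Bar.java`, the `changedBy` predicate, and the developer URIs are hypothetical and exist only to make the query interesting.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Binding = Dict[str, str]        # variable name -> URI

# URIs taken from the slides.
BUG_124 = "http://myProject.org/bugs/nr124"
AFFECTS = "http://evolizer.org/bugOntology/affects"
FOO = "http://sourcerer.ics.uci.edu/myProject/Foo.java"

# Hypothetical URIs, invented for this sketch only.
BUG_125 = "http://myProject.org/bugs/nr125"
BAR = "http://sourcerer.ics.uci.edu/myProject/Bar.java"
CHANGED_BY = "http://example.org/vocab/changedBy"
ALICE = "http://example.org/developers/alice"
BOB = "http://example.org/developers/bob"

GRAPH: List[Triple] = [
    (BUG_124, AFFECTS, FOO),
    (BUG_125, AFFECTS, BAR),
    (FOO, CHANGED_BY, ALICE),
    (BAR, CHANGED_BY, BOB),
]


def match_triple(pattern: Triple, triple: Triple,
                 binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant URI that does not match.
            return None
    return result


def match_patterns(patterns: Sequence[Triple],
                   graph: Sequence[Triple]) -> List[Binding]:
    """Return every variable binding that satisfies all patterns at once."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        extended: List[Binding] = []
        for binding in bindings:
            for triple in graph:
                candidate = match_triple(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# "Who changed the files affected by bug nr124?"
results = match_patterns(
    [(BUG_124, AFFECTS, "?file"), ("?file", CHANGED_BY, "?dev")],
    GRAPH,
)
developers = [b["?dev"] for b in results]
```

The shared variable `?file` is what joins the bug-tracking statement with the version-control statement, which is the kind of cross-domain question the silos make hard to ask.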


  • Given two repositories, one that stores bug reports and one that stores a full-fledged source code model, we can then, in a third place, make statements about a particular bug and a particular Java class. Such a statement is an S-P-O triple.

    Dereferencing the URIs leads to the resources or, in the case of the ‘affects’ property, to the ontology definition.
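As a minimal sketch of the statement on the example slide, the triple can be written down with nothing but the three URIs shown there. The `Triple` type and the `to_ntriples` helper are names made up for this illustration; the output line follows the N-Triples convention of angle-bracketed URIs terminated by a dot.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement; all three parts are URIs here."""
    subject: str
    predicate: str
    obj: str


# The three URIs from the example slide.
BUG = "http://myProject.org/bugs/nr124"
AFFECTS = "http://evolizer.org/bugOntology/affects"
FOO_JAVA = "http://sourcerer.ics.uci.edu/myProject/Foo.java"


def to_ntriples(triple: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# "Bug nr124 affects Foo.java", stated outside both repositories.
statement = Triple(BUG, AFFECTS, FOO_JAVA)
line = to_ntriples(statement)
print(line)
```

Note that the subject lives in the bug repository, the object in the source code repository, and the predicate in the ontology, yet the statement itself can be stored anywhere.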
  • We need unique and stable identifiers for software engineering artifacts. It’s easy to come up with such URIs for some artifacts, but not so straightforward for others.

    We also need to agree on a common vocabulary (data schema) for software engineering concepts. A source code visualization tool should not need to care whether it works with data retrieved from Evolizer, Google Code Search, Koders, or Sourcerer. These tasks are clearly a community effort.
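For the easy cases mentioned in this note, one conceivable scheme is to derive the URI deterministically from a base address and an identifier the artifact already has. The sketch below is an assumption made for illustration, not the strategy the talk proposes (that strategy is the open research question); `mint_uri` is a hypothetical helper name.

```python
from urllib.parse import quote


def mint_uri(base: str, *segments: str) -> str:
    """Join a base URI and path segments, percent-encoding each segment.

    The same inputs always yield the same URI, which is the minimum
    needed for an identifier to be stable.
    """
    encoded = [quote(segment, safe="") for segment in segments]
    return "/".join([base.rstrip("/")] + encoded)


# Reproduces the two resource URIs used on the example slide.
bug_uri = mint_uri("http://myProject.org", "bugs", "nr124")
file_uri = mint_uri("http://sourcerer.ics.uci.edu", "myProject", "Foo.java")

# Characters that are not URI-safe are escaped rather than dropped.
escaped_uri = mint_uri("http://myProject.org/", "bugs", "nr 124")
```

A bug number or a file path gives a natural key for such a scheme. Artifacts without an obvious long-lived key are presumably the harder cases the note alludes to, and this sketch does not address them.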
  • No need to start from scratch - take existing ontologies and consolidate.

  • Fostering Synergies - How Semantic Web Technology could influence Software Repositories

    1. Fostering Synergies: How Semantic Web Technology could influence Software Repositories. Michael Würsch, Gerald Reif, Serge Demeyer, Harald Gall. University of Zurich, Switzerland; Department of Informatics; software evolution & architecture lab
    2. Developer’s Information Needs ‣ Who has changed this code and why? ‣ How can I persist data in Spring? ‣ What are the subclasses of JComponent? ‣ What class in my project had the most bugs prior to the last release? ‣ ...
    3. Information Silos: Bugzilla, Mailinglists, CVS, Atlassian Jira, Subversion, Wikis ‣ Limited search capabilities ‣ No unified data model ‣ No references across silo boundaries ‣ No cross-domain queries
    4. Leveraging Information: State of the Art. [Diagram: Bugzilla and CVS are preprocessed and mirrored; tools shown: www.google.com/codesearch, www.koders.com, sourcerer.ics.uci.edu, www.evolizer.org]
    5. Again, Silos... ‣ Database schemas are not portable ‣ Relations are local ‣ There is no consistent way of getting the meaning of a relation [Tools shown: www.google.com/codesearch, sourcerer.ics.uci.edu, www.evolizer.org, www.koders.com]
    6. Release your Data! ‣ Use a common vocabulary to describe software artifacts and their relationships ‣ Expose unique identifiers for software artifacts on the web
    7. The Semantic Web/The Web of Data ‣ Graph-based data model described by S-P-O triples ‣ URIs to reference Resources ‣ Ontologies to formalize a common understanding of a domain ‣ SPARQL to search by matching graph-patterns
    8. Example: Building an RDF Graph. Subject: http://myProject.org/bugs/nr124. Predicate: http://evolizer.org/bugOntology/affects. Object: http://sourcerer.ics.uci.edu/myProject/Foo.java
    9. Research Agenda ‣ Come up with a strategy for generating URIs for software artifacts ‣ Develop an ontology of software artifacts and their relationships
    10. Existing Ontologies ‣ EvoOnt: http://www.ifi.uzh.ch/ddis/evo/ ‣ SEON - Software Engineering Ontology: http://evolizer.org ‣ Baetle - Bug And Enhancement Tracking Language: http://code.google.com/p/baetle/ ‣ DOAP - Description of a Project: http://trac.usefulinc.com/doap
    11. Summary. Release your Data! ‣ Formalize a common vocabulary to describe software artifacts and their relationships ‣ Devise strategies to generate URIs for software artifacts ‣ Expose these URIs on the Web. The Semantic Web/The Web of Data ‣ Graph-based data model described by S-P-O triples ‣ URIs to reference Resources ‣ Ontologies to formalize a common understanding of a domain ‣ SPARQL to search by matching graph-patterns. Existing Ontologies ‣ EvoOnt: http://www.ifi.uzh.ch/ddis/evo/ ‣ SEON - Software Engineering Ontology: http://evolizer.org ‣ Baetle - Bug And Enhancement Tracking Language: http://code.google.com/p/baetle/ ‣ DOAP - Description of a Project: http://trac.usefulinc.com/doap. Research Agenda ‣ Come up with a strategy for generating URIs for software artifacts ‣ Develop an ontology of software artifacts and their relationships
