Introduction: SE-CodeSearch is an Internet-scale Semantic-rich Code Search (SICS) engine. It uses Semantic Web knowledge representation and reasoning capabilities to apply static code analysis to incomplete code. The reasoner infers semantic-rich facts as the crawler incrementally indexes new source code.
Paper Title: SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure
Authors: I. Keivanloo, L. Roostapour, P. Schugerl, J. Rilling
For questions contact: email@example.com
In our daily lives we search across many different domains, but search engines are limited by data size and domain complexity. For example, a geographical search engine can answer “who lives in London?”, which is a simple query. But what about “who has relatives in London?” That is a rather complex query, especially when it must be answered over Internet-scale data.
A similar problem exists in Internet-scale source code search. Current engines can answer simple queries easily, but they cannot properly handle queries that require source code semantics to be taken into account.
Similar to the previous slide, the same restriction applies to source code search. For example, you may search for “the file that contains the class definition” (shown in lines 4 and 5). This type of query is quite easy to answer with a keyword search.
Finding a method call statement by specifying the receiver type, however, is less straightforward. This is shown in the example at line 22: search for and return the receiver of the methodB() call. To find the answer, the content of methodA() must be analyzed. Furthermore, methodA() could belong to another library, project, or package. In the worst case, its content might not have been indexed yet and will only become available after the current code segment has been indexed and analyzed. This is because the data source is the Internet, and data becomes available only by crawling different Internet resources. Therefore, an incremental static source code analysis technique is required that can handle Internet-scale and incomplete data.
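The situation above can be sketched in Java. All class and method names here are illustrative (not taken from any indexed corpus); the point is that the receiver type of methodB() is the return type of methodA(), which is declared in a class that may not have been crawled yet.

```java
public class Demo {
    // The query from the slide: what is the receiver type of the
    // methodB() call? Statically, it is the return type of methodA(),
    // declared in LibraryType -- which may belong to another project
    // that the crawler has not indexed yet.
    static String receiverOfMethodB(LibraryType helper) {
        helper.methodA().methodB();   // the call in question
        return helper.methodA().getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(receiverOfMethodB(new LibraryType()));
    }
}

// Content possibly crawled much later, or never:
class LibraryType { ReceiverType methodA() { return new ReceiverType(); } }
class ReceiverType { void methodB() { } }
```

With complete code a compiler resolves the receiver trivially; with incomplete code the resolution must be deferred until LibraryType is indexed, which is exactly the incremental analysis problem the slide describes.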
Current search engines cannot answer questions like the previous one. While such interconnected “semantic” questions are hard to answer in general, providing this service for source code seems feasible, since: the amount of data is smaller, the data is well structured, and the rules are fewer and simpler.
Important note about Internet-scale source code search data characteristics: the data is tera-scale (terabytes); the data can be incomplete, as in the above example where the content of methodA() is not available at first; and the repository evolves, as the content of methodA() may be indexed by the crawler later.
All types of code search queries gathered from the literature are classified here, based on the kind of analysis a search engine must support for each category. PSQ covers queries about code structure (aka structural queries), such as a class definition statement. MDQ covers queries about metadata, such as the code language or application domain. TCQ covers queries that require transitive closure computation, such as object-oriented inheritance trees. MCQ concerns method call statements, one of the most challenging query types, such as the example on the previous slide. AIQ is required when we deal with incomplete repositories, to avoid invalid results; it requires the open world assumption. MXQ emphasizes that a real query may be a mixture of the earlier query types, so the code search engine must allow users to pose such combined queries.
To establish the research objective, we defined a SICS as a ‘Semantic-rich Internet-scale Code Search’ engine, which must support all the query types introduced earlier and, at the same time, support search over tera-scale data. That is, the engine must be able to consider source code semantics while extracting code facts during the static code analysis phase.
We evaluated the code search engines available on the Internet to find out whether any qualifies as a SICS. The result shows two classes of engines, neither of which can be considered a SICS. The first class provides some fine-grained search but is limited to compilable code; that is, all code segments must be available at the fact extraction phase. Remember the methodA() and methodB() example discussed earlier: these engines cannot extract facts if the content of methodA() is unavailable, even if it becomes available later. The other class supports both complete and incomplete code but only coarse-grained queries.
So we found a gap to be bridged, which is our research motivation: a code search engine that supports fine-grained queries without being limited to compilable code.
Nevertheless, there are major challenges for a SICS implementation. First, some code might be indexed without the imported binaries (incomplete code). Second, the code repository evolves 24/7, meaning that some of the required binaries or code might be indexed later. The last challenge is the data size: we can no longer rely on traditional in-memory (RAM-based) code analysis.
Considering the discussed requirements (query types) and challenges, we designed SE-CodeSearch, a knowledge-based code search infrastructure. The overall architecture is shown on the right-hand side. In short, SE-CodeSearch creates a small ontology for each code part, with two main sections: the ontology contains facts as well as static code analysis rules for further fact extraction. These ontologies are saved step by step into the RDF repository. At run time, a backward-chaining reasoner applies the code analysis rules relevant to the given query to find the answer.
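The per-code-part fact extraction can be suggested with a minimal sketch. The triple vocabulary below (`sicsont:extends`, `sicsont:calls`, etc.) is hypothetical shorthand, not the published SICSONT terms; it only illustrates the idea that each crawled code part becomes a small set of RDF-style facts that can be appended to the repository independently.

```java
import java.util.*;

// Minimal sketch: one crawled class yields a handful of RDF-style
// triples (subject, predicate, object). Rules for further inference
// would be stored alongside these facts in the real system.
public class FactExtractorSketch {
    static List<String[]> extractFacts(String className, String superName,
                                       List<String> calledMethods) {
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[]{className, "rdf:type", "sicsont:Class"});
        if (superName != null)
            triples.add(new String[]{className, "sicsont:extends", superName});
        for (String callee : calledMethods)
            triples.add(new String[]{className, "sicsont:calls", callee});
        return triples;
    }

    public static void main(String[] args) {
        for (String[] t : extractFacts("ClassA", "Base",
                                       List.of("LibraryType.methodA")))
            System.out.println(String.join(" ", t));
    }
}
```

Because each code part produces its own self-contained triple set, parts can be indexed in any order and the repository can grow incrementally, as the architecture requires.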
The main part of SE-CodeSearch is its source code ontology, which we call SICSONT (SICS Ontology). SICSONT is publicly available at http://aseg.cs.concordia.ca/ontology. Compared to other code ontologies, SICSONT can formally represent not only code facts but also static code analysis rules. In addition, it is optimized for Internet-scale reasoning: we avoided most of the expensive OWL constructs to make it usable for real-time reasoning. Further details regarding the ontology population are given in the paper.
SE-CodeSearch uses a Semantic-Web-based static code analysis approach. First, it is a knowledge-based approach, with its own advantages and challenges. Second, the inference engine is responsible for the static code analysis task. Third, we use OWL-DL as the language: OWL is the de facto standard for knowledge representation, so our repository can be shared on the Internet, and OWL-DL is decidable, since it is based on Description Logic, so an answer is guaranteed in reasonable time. However, representing code analysis rules in DL is not as easy as in a rule-based approach, which is one of our challenges: some code analysis rules are much harder to express in DL than in SWRL or Datalog.
In addition, we do not need a compiler to extract facts, which lets us handle incomplete code (unlike current code search engines). As a major part of SE-CodeSearch, we created templates for the automatic translation of static code analysis rules into OWL-DL. The current version of SE-CodeSearch supports three types of analysis: (1) inheritance tree computation, (2) fully qualified name resolution, and (3) method call/return statement connection. A sample of such a template, for method call graph construction, is shown here.
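The second analysis type, fully qualified name resolution, can be illustrated with a small sketch. This is not the paper's OWL-DL template; it is a plain-Java approximation of the underlying rule: a simple name is matched against the compilation unit's imports, falling back to the unit's own package, and the result stays revisable as more code is crawled.

```java
import java.util.*;

// Sketch of fully qualified name resolution without a compiler:
// match a simple name against the import list; otherwise assume the
// declaring package. In an incomplete repository this answer is
// provisional and may be refined once more code is indexed.
public class NameResolverSketch {
    static String resolve(String simpleName, List<String> imports, String pkg) {
        for (String imp : imports)
            if (imp.endsWith("." + simpleName))
                return imp;                       // resolved via an import
        return pkg + "." + simpleName;            // same-package fallback
    }

    public static void main(String[] args) {
        List<String> imports = List.of("java.util.List", "org.example.LibraryType");
        System.out.println(resolve("LibraryType", imports, "com.app"));
        System.out.println(resolve("Helper", imports, "com.app"));
    }
}
```

In SE-CodeSearch the same rule is expressed declaratively in OWL-DL, so the reasoner (rather than hand-written code) performs the resolution at query time.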
Since scalability is one of the biggest challenges for knowledge-based applications, we ran scalability tests on the SE-CodeSearch implementation, which currently supports the Java language.
We used a regular desktop computer with 3 GB of RAM as the server.
Our data set is based on a very large Java code repository extracted from SourceForge, about 400 GB in total. We selected about 600,000 classes and removed the binaries to simulate incomplete data.
We selected the two major query types, TCQ and MCQ, and observed their response times as the repository grew. The graph shown here presents the test results: the response time was not affected by the repository size, which is a very positive observation for SE-CodeSearch.
In the following, some of the SE-CodeSearch highlights are discussed. Although OWL-DL and the Semantic Web languages offer many powerful constructs, we avoided most of them to remain scalable. The ontology population is optimized to use fewer hardware resources than usual. A disk-based inference engine is used, so only little memory is needed. The inference engine is backward-chaining; that is, we do not extract all possible facts from the repository up front, since the number of possible facts can be enormous. For example, in Java every class is a subclass of Object, so a forward-chaining reasoner (static code analysis) would have to infer, for each class definition, a fact stating that the observed class is a subclass of Object, roughly doubling the total number of facts in the repository. A backward-chaining reasoner, in contrast, does not infer any facts beforehand; it waits for the query and then applies the rules only to the subset of data related to the query and its input criteria.
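The backward-chaining idea can be made concrete with a minimal sketch, assuming only direct "extends" facts are stored (the repository never materializes the transitive closure, and no per-class subclass-of-Object fact exists):

```java
import java.util.*;

// Backward-chaining sketch: the store holds only direct "extends"
// facts; the transitive subclass relation is derived per query,
// never materialized in advance.
public class BackwardChainer {
    final Map<String, String> directSuper = new HashMap<>();

    void addExtendsFact(String cls, String sup) { directSuper.put(cls, sup); }

    // Walks the superclass chain on demand. Note that no stored fact
    // relates any class to java.lang.Object; that implicit Java rule
    // is applied only when a query actually asks about it.
    boolean isSubclassOf(String cls, String target) {
        for (String cur = cls; cur != null; cur = directSuper.get(cur))
            if (cur.equals(target)) return true;
        return "java.lang.Object".equals(target);
    }

    public static void main(String[] args) {
        BackwardChainer kb = new BackwardChainer();
        kb.addExtendsFact("C", "B");
        kb.addExtendsFact("B", "A");
        System.out.println(kb.isSubclassOf("C", "A"));                // true
        System.out.println(kb.isSubclassOf("C", "java.lang.Object")); // true
        System.out.println(kb.isSubclassOf("A", "C"));                // false
    }
}
```

A forward chainer would have stored C-extends-A, C-extends-Object, B-extends-Object, and A-extends-Object as explicit facts; here none of them exist, yet all are answerable, which is exactly the space saving the slide describes.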
The knowledge-based approach also increases our ability to parallelize processing and analysis, since: (1) we analyze each piece of code only once, (2) we support incomplete code without binaries, (3) we can apply static analysis independent of the parsing order, and (4) our repository can evolve incrementally.
Also note that our model supports the open world assumption by default, which is unavailable in most relational databases.
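The practical effect of the open world assumption can be shown with a small sketch (the class and method names are hypothetical): when a fact is absent because the relevant code was never indexed, the answer is "unknown" rather than the closed-world answer "no".

```java
import java.util.*;

// Open-world sketch: absence of a fact about an unindexed class yields
// UNKNOWN, not the closed-world "NO" a relational database would give.
public class OpenWorldSketch {
    enum Answer { YES, NO, UNKNOWN }

    final Set<String> indexedClasses = new HashSet<>();
    final Set<String> knownMethods = new HashSet<>();   // "Class#method"

    Answer hasMethod(String cls, String method) {
        if (knownMethods.contains(cls + "#" + method)) return Answer.YES;
        // If the class itself was never crawled, its content is unknown.
        return indexedClasses.contains(cls) ? Answer.NO : Answer.UNKNOWN;
    }

    public static void main(String[] args) {
        OpenWorldSketch kb = new OpenWorldSketch();
        kb.indexedClasses.add("ClassA");
        kb.knownMethods.add("ClassA#run");
        System.out.println(kb.hasMethod("ClassA", "run"));          // YES
        System.out.println(kb.hasMethod("ClassA", "missing"));      // NO
        System.out.println(kb.hasMethod("LibraryType", "methodA")); // UNKNOWN
    }
}
```

This three-valued behavior is what avoids the invalid results (AIQ) that closed-world engines return over incomplete repositories.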
SE-CodeSearch poster. Available at: http://aseg.cs.concordia.ca/codesearch
•Incomplete code (no binaries)
–The crawler is working 24/7
–Dependent code might be indexed in any order
•Very large repository
9/14/2010 ICSM 2010 ERA
•Creates a small ontology for each code part
•Static code analysis rules
•Saves them in the RDF repository
•Uses a backward-chaining reasoner to answer
•Not only structural queries
•But also all the other query types
(embedded code analysis at runtime)
• Source Code Ontology for Internet-scale Static Code Analysis
Static Code Analysis
• Knowledge-based approach
• Inference engine does the analysis
• Restricted to OWL-DL
– De facto standard for knowledge sharing
– Based on Description Logic
• More restricted than rule-based families
Static Code Analysis (Cont.)
• No compiler
• Possible analysis
– Inheritance tree computation
– Fully qualified name resolution
– Method call/return statement and type resolution
• Translation template for each analysis rule
1. Transitive closure-based
2. Method call
600,000 Java classes (no binaries)
from a very large dataset (~400 GB)
• 3 GB RAM
• 3.40 GHz CPU
•Avoid expensive knowledge representation constructs
•Optimized ontology population
–Works on minimum hardware
–One-pass code analysis
–Static code analysis on incomplete code
–Independent of parsing order
•First Package A then Package B
•First Package B then Package A
–Repository evolves incrementally
•Open World Reasoning (Not available in Relational DB)