Introduction: SE-CodeSearch is an Internet-scale Semantic-rich Code Search (SICS) engine. It uses Semantic Web knowledge representation and reasoning capabilities to apply static code analysis to incomplete code. The reasoner infers semantic-rich facts as the crawler incrementally indexes new source code.
Paper Title: SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure
Authors: I. Keivanloo, L. Roostapour, P. Schugerl, J. Rilling
For questions contact: email@example.com
In our daily lives we search across many different domains, but search engines are limited by data size and domain complexity. For example, a geographical search engine can answer “who lives in London?”, which is a simple query. But what about “who has relatives in London?” That is a rather complex query, especially when it must be answered over Internet-scale data.
A similar problem exists in Internet-scale source code search. Current engines can answer simple queries easily, but they cannot properly handle queries that require source code semantics to be taken into account.
Similar to the previous slide, the same restriction applies to source code search. For example, you may search for “the file that contains the class definition” (shown in lines 4 and 5). This type of query is quite easy to answer with a keyword search.
Finding a method call statement by specifying the receiver type, however, is less straightforward. This is shown in the example at line 22: search for and return the receiver of the methodB() call. To find the answer, the content of methodA() must be analyzed. Furthermore, methodA() could belong to another library, project, or package. In the worst case, its content might not have been indexed yet and will only become available after the current code segment has been indexed and analyzed. This is because the data source is the Internet, and data becomes available only by crawling different Internet resources. Therefore, an incremental static source code analysis technique is required that can handle Internet-scale and incomplete data.
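The situation above can be sketched in Java. All class and method names here are illustrative (not taken from any indexed corpus); the point is that the receiver type of methodB() is the return type of methodA(), which is declared in a class that may not have been crawled yet.

```java
public class Demo {
    // The query from the slide: what is the receiver type of the
    // methodB() call? Statically, it is the return type of methodA(),
    // declared in LibraryType -- which may belong to another project
    // that the crawler has not indexed yet.
    static String receiverOfMethodB(LibraryType helper) {
        helper.methodA().methodB();   // the call in question
        return helper.methodA().getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(receiverOfMethodB(new LibraryType()));
    }
}

// Content possibly crawled much later, or never:
class LibraryType { ReceiverType methodA() { return new ReceiverType(); } }
class ReceiverType { void methodB() { } }
```

With complete code a compiler resolves the receiver trivially; with incomplete code the resolution must be deferred until LibraryType is indexed, which is exactly the incremental analysis problem the slide describes.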
Current search engines cannot answer questions like the previous one. While such interconnected “semantic” questions are hard to answer in general, providing this service for source code seems feasible, since: the amount of data is smaller, the data is well structured, and the rules are fewer and simpler.
Important note about Internet-scale source code search data characteristics: the data is tera-scale (terabytes); the data can be incomplete, as in the above example where the content of methodA() is not available at first; and the repository evolves, as the content of methodA() may be indexed by the crawler later.
All types of code search queries gathered from the literature are classified here, based on the kind of analysis a search engine must support for each category. PSQ covers queries about code structure (aka structural queries), such as a class definition statement. MDQ covers queries about metadata, such as the code language or application domain. TCQ covers queries that require transitive closure computation, such as object-oriented inheritance trees. MCQ concerns method call statements, one of the most challenging query types, such as the example on the previous slide. AIQ is required when we deal with incomplete repositories, to avoid invalid results; it requires the open world assumption. MXQ emphasizes that a real query may be a mixture of the earlier query types, so the code search engine must allow users to pose such combined queries.
To establish the research objective, we defined a SICS as a ‘Semantic-rich Internet-scale Code Search’ engine, which must support all the query types introduced earlier and, at the same time, support search over tera-scale data. That is, the engine must be able to consider source code semantics while extracting code facts during the static code analysis phase.
We evaluated the code search engines available on the Internet to find out whether any qualifies as a SICS. The result shows two classes of engines, neither of which can be considered a SICS. The first class provides some fine-grained search but is limited to compilable code; that is, all code segments must be available at the fact extraction phase. Remember the methodA() and methodB() example discussed earlier: these engines cannot extract facts if the content of methodA() is unavailable, even if it becomes available later. The other class supports both complete and incomplete code but only coarse-grained queries.
So we found a gap to be bridged, which is our research motivation: a code search engine that supports fine-grained queries without being limited to compilable code.
Nevertheless, there are major challenges for a SICS implementation. First, some code might be indexed without the imported binaries (incomplete code). Second, the code repository evolves 24/7, meaning that some of the required binaries or code might be indexed later. The last challenge is the data size: we can no longer rely on traditional in-memory (RAM-based) code analysis.
Considering the discussed requirements (query types) and challenges, we designed SE-CodeSearch, a knowledge-based code search infrastructure. The overall architecture is shown on the right-hand side. In short, SE-CodeSearch creates a small ontology for each code part, with two main sections: the ontology contains facts as well as static code analysis rules for further fact extraction. These ontologies are saved step by step into the RDF repository. At run time, a backward-chaining reasoner applies the code analysis rules relevant to the given query to find the answer.
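The per-code-part fact extraction can be suggested with a minimal sketch. The triple vocabulary below (`sicsont:extends`, `sicsont:calls`, etc.) is hypothetical shorthand, not the published SICSONT terms; it only illustrates the idea that each crawled code part becomes a small set of RDF-style facts that can be appended to the repository independently.

```java
import java.util.*;

// Minimal sketch: one crawled class yields a handful of RDF-style
// triples (subject, predicate, object). Rules for further inference
// would be stored alongside these facts in the real system.
public class FactExtractorSketch {
    static List<String[]> extractFacts(String className, String superName,
                                       List<String> calledMethods) {
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[]{className, "rdf:type", "sicsont:Class"});
        if (superName != null)
            triples.add(new String[]{className, "sicsont:extends", superName});
        for (String callee : calledMethods)
            triples.add(new String[]{className, "sicsont:calls", callee});
        return triples;
    }

    public static void main(String[] args) {
        for (String[] t : extractFacts("ClassA", "Base",
                                       List.of("LibraryType.methodA")))
            System.out.println(String.join(" ", t));
    }
}
```

Because each code part produces its own self-contained triple set, parts can be indexed in any order and the repository can grow incrementally, as the architecture requires.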
The main part of SE-CodeSearch is its source code ontology, which we call SICSONT (SICS Ontology). SICSONT is publicly available at http://aseg.cs.concordia.ca/ontology. Compared to other code ontologies, SICSONT can formally represent not only code facts but also static code analysis rules. In addition, it is optimized for Internet-scale reasoning: we avoided most of the expensive OWL constructs to make it usable for real-time reasoning. Further details regarding the ontology population are given in the paper.
SE-CodeSearch uses a Semantic-Web-based static code analysis approach. First, it is a knowledge-based approach, with its own advantages and challenges. Second, the inference engine is responsible for the static code analysis task. Third, we use OWL-DL as the language: OWL is the de facto standard for knowledge representation, so our repository can be shared on the Internet, and OWL-DL is decidable, since it is based on Description Logic, so an answer is guaranteed in reasonable time. However, representing code analysis rules in DL is not as easy as in a rule-based approach, which is one of our challenges: some code analysis rules are much harder to express in DL than in SWRL or Datalog.
In addition, we do not need a compiler to extract facts, which lets us handle incomplete code (unlike current code search engines). As a major part of SE-CodeSearch, we created templates for the automatic translation of static code analysis rules into OWL-DL. The current version of SE-CodeSearch supports three types of analysis: (1) inheritance tree computation, (2) fully qualified name resolution, and (3) method call/return statement connection. A sample of such a template, for method call graph construction, is shown here.
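The second analysis type, fully qualified name resolution, can be illustrated with a small sketch. This is not the paper's OWL-DL template; it is a plain-Java approximation of the underlying rule: a simple name is matched against the compilation unit's imports, falling back to the unit's own package, and the result stays revisable as more code is crawled.

```java
import java.util.*;

// Sketch of fully qualified name resolution without a compiler:
// match a simple name against the import list; otherwise assume the
// declaring package. In an incomplete repository this answer is
// provisional and may be refined once more code is indexed.
public class NameResolverSketch {
    static String resolve(String simpleName, List<String> imports, String pkg) {
        for (String imp : imports)
            if (imp.endsWith("." + simpleName))
                return imp;                       // resolved via an import
        return pkg + "." + simpleName;            // same-package fallback
    }

    public static void main(String[] args) {
        List<String> imports = List.of("java.util.List", "org.example.LibraryType");
        System.out.println(resolve("LibraryType", imports, "com.app"));
        System.out.println(resolve("Helper", imports, "com.app"));
    }
}
```

In SE-CodeSearch the same rule is expressed declaratively in OWL-DL, so the reasoner (rather than hand-written code) performs the resolution at query time.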
Since scalability is one of the biggest challenges for knowledge-based applications, we ran scalability tests on the SE-CodeSearch implementation, which currently supports the Java language.
We used a regular desktop computer with 3 GB of RAM as the server.
Our data set is based on a very large Java code repository extracted from SourceForge, about 400 GB in total. We selected about 600,000 classes and removed the binaries to simulate incomplete data.
We selected the two major query types, TCQ and MCQ, and observed their response times as the repository grew. The graph shown here presents the test results: the response time was not affected by the repository size, which is a very positive observation for SE-CodeSearch.
In the following, some of the SE-CodeSearch highlights are discussed. Although OWL-DL and the Semantic Web languages offer many powerful constructs, we avoided most of them to remain scalable. The ontology population is optimized to use fewer hardware resources than usual. A disk-based inference engine is used, so only little memory is needed. The inference engine is backward-chaining; that is, we do not extract all possible facts from the repository up front, since the number of possible facts can be enormous. For example, in Java every class is a subclass of Object, so a forward-chaining reasoner (static code analysis) would have to infer, for each class definition, a fact stating that the observed class is a subclass of Object, roughly doubling the total number of facts in the repository. A backward-chaining reasoner, in contrast, does not infer any facts beforehand; it waits for the query and then applies the rules only to the subset of data related to the query and its input criteria.
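The backward-chaining idea can be made concrete with a minimal sketch, assuming only direct "extends" facts are stored (the repository never materializes the transitive closure, and no per-class subclass-of-Object fact exists):

```java
import java.util.*;

// Backward-chaining sketch: the store holds only direct "extends"
// facts; the transitive subclass relation is derived per query,
// never materialized in advance.
public class BackwardChainer {
    final Map<String, String> directSuper = new HashMap<>();

    void addExtendsFact(String cls, String sup) { directSuper.put(cls, sup); }

    // Walks the superclass chain on demand. Note that no stored fact
    // relates any class to java.lang.Object; that implicit Java rule
    // is applied only when a query actually asks about it.
    boolean isSubclassOf(String cls, String target) {
        for (String cur = cls; cur != null; cur = directSuper.get(cur))
            if (cur.equals(target)) return true;
        return "java.lang.Object".equals(target);
    }

    public static void main(String[] args) {
        BackwardChainer kb = new BackwardChainer();
        kb.addExtendsFact("C", "B");
        kb.addExtendsFact("B", "A");
        System.out.println(kb.isSubclassOf("C", "A"));                // true
        System.out.println(kb.isSubclassOf("C", "java.lang.Object")); // true
        System.out.println(kb.isSubclassOf("A", "C"));                // false
    }
}
```

A forward chainer would have stored C-extends-A, C-extends-Object, B-extends-Object, and A-extends-Object as explicit facts; here none of them exist, yet all are answerable, which is exactly the space saving the slide describes.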
The knowledge-based approach also increases our ability to parallelize processing and analysis, since: (1) we analyze each piece of code only once, (2) we support incomplete code without binaries, (3) we can apply static analysis independent of the parsing order, and (4) our repository can evolve incrementally.
Also note that our model supports the open world assumption by default, which is unavailable in most relational databases.
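The practical effect of the open world assumption can be shown with a small sketch (the class and method names are hypothetical): when a fact is absent because the relevant code was never indexed, the answer is "unknown" rather than the closed-world answer "no".

```java
import java.util.*;

// Open-world sketch: absence of a fact about an unindexed class yields
// UNKNOWN, not the closed-world "NO" a relational database would give.
public class OpenWorldSketch {
    enum Answer { YES, NO, UNKNOWN }

    final Set<String> indexedClasses = new HashSet<>();
    final Set<String> knownMethods = new HashSet<>();   // "Class#method"

    Answer hasMethod(String cls, String method) {
        if (knownMethods.contains(cls + "#" + method)) return Answer.YES;
        // If the class itself was never crawled, its content is unknown.
        return indexedClasses.contains(cls) ? Answer.NO : Answer.UNKNOWN;
    }

    public static void main(String[] args) {
        OpenWorldSketch kb = new OpenWorldSketch();
        kb.indexedClasses.add("ClassA");
        kb.knownMethods.add("ClassA#run");
        System.out.println(kb.hasMethod("ClassA", "run"));          // YES
        System.out.println(kb.hasMethod("ClassA", "missing"));      // NO
        System.out.println(kb.hasMethod("LibraryType", "methodA")); // UNKNOWN
    }
}
```

This three-valued behavior is what avoids the invalid results (AIQ) that closed-world engines return over incomplete repositories.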
SE-CodeSearch poster. Available at: http://aseg.cs.concordia.ca/codesearch
•Incomplete code (no binaries)
–The crawler is working 24/7
–Dependent code might be indexed in any order
•Very large repository
9/14/2010 ICSM 2010 ERA
•Creates a small ontology for each code part
•Static code analysis rules
•Saves them in the RDF repository
•Uses a backward-chaining reasoner to answer
•Not only structural queries
•But also all the other query types
(embedded code analysis at runtime)
• Source Code Ontology for Internet-scale Static Code Analysis
Static Code Analysis
• Knowledge-based approach
• Inference engine does the analysis
• Restricted to OWL-DL
– De facto standard for knowledge sharing
– Based on Description Logic
• More restricted than rule-based families
Static Code Analysis (Cont.)
• No compiler
• Possible analysis
– Inheritance tree computation
– Fully qualified name resolution
– Method call/return statement and type resolution
• Translation template for each analysis rule
1. Transitive closure-based
2. Method call
600,000 Java classes (no binaries)
from a very large dataset (~400 GB)
• 3 GB RAM
• 3.40 GHz CPU
•Avoid expensive knowledge representation constructs
•Optimized ontology population
–Works on minimum hardware
–One-pass code analysis
–Static code analysis on incomplete code
–Independent of parsing order
•First Package A then Package B
•First Package B then Package A
–Repository evolves incrementally
•Open World Reasoning (Not available in Relational DB)