I. Keivanloo, L. Roostapour, P. Schugerl, J. Rilling
Scalable Semantic Web-based Source Code Search Infrastructure
SE-CodeSearch
Search
Who lives in London?
Who has relatives in London?
ICSM 2010 ERA, 9/14/2010
Source code search
Where is it defined?
Where is it called?
Query types
Requirement-based classification
• Pure structural (PSQ)
• Metadata (MDQ)
• Transitive closure-based (TCQ)
• Method call (MCQ)
• Absent information (AIQ)
• Mixed queries (MXQ)
SICS
Semantic-rich Internet-scale Code Search
• Supports all query types
• Handles a tera-scale repository
Is there any SICS?
• No
Challenges
• Incomplete code (no binaries)
• Repository evolution
  – The crawler is working 24/7
  – Dependent code might be indexed in any order
• Very large repository (tera-scale)
SE-CodeSearch
• Creates a small ontology for each code part
  – Code facts
  – Static code analysis rules
• Saves them in the RDF repository
• Uses a backward-chaining reasoner to answer
  – Not only structural queries
  – But also all the other query types (embedded code analysis at runtime)
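The idea of storing code facts as triples and answering at query time with a backward-chaining (goal-driven) rule can be sketched roughly as follows. This is a minimal illustration in plain Python, not the SE-CodeSearch implementation: the `extends` predicate, the `FACTS` set, and both function names are assumptions made for this example, and a real deployment would use an RDF store and an OWL reasoner.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), RDF-style

# Illustrative code facts, as a parser might emit them one class at a time.
FACTS: Set[Triple] = {
    ("ArrayList", "extends", "AbstractList"),
    ("AbstractList", "extends", "AbstractCollection"),
    ("AbstractCollection", "extends", "Object"),
}


def direct_supertypes(facts: Set[Triple], cls: str) -> List[str]:
    """Look up the stored 'extends' facts for one class."""
    return sorted(o for s, p, o in facts if s == cls and p == "extends")


def is_subclass_of(facts: Set[Triple], sub: str, sup: str,
                   _seen: Optional[Set[str]] = None) -> bool:
    """Goal-driven proof of subClassOf(sub, sup).

    Rule applied backward from the goal (assumed for this sketch):
        subClassOf(X, Z) <- extends(X, Y) and (Y == Z or subClassOf(Y, Z))
    Nothing is precomputed; only the facts needed for this goal are visited.
    """
    seen = set() if _seen is None else _seen
    if sub in seen:  # guard against cyclic (malformed) facts
        return False
    seen.add(sub)
    for parent in direct_supertypes(facts, sub):
        if parent == sup or is_subclass_of(facts, parent, sup, seen):
            return True
    return False


if __name__ == "__main__":
    print(is_subclass_of(FACTS, "ArrayList", "Object"))
```

Because the rule is evaluated only when a query arrives, adding a new `extends` fact later requires no recomputation of previously derived results, which is one plausible reason to prefer backward chaining for a repository that keeps growing.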
SICSONT
• Source Code Ontology for Internet-scale Static Analysis
http://aseg.cs.concordia.ca/ontology#sicsont
Semantic Web-based Static Code Analysis
• Knowledge-based approach
• Inference engine does the analysis
• Restricted to OWL-DL
  – De facto standard for knowledge sharing
  – Based on Description Logic
    • Decidable
    • More restricted than rule-based families
Semantic Web-based Static Code Analysis (Cont.)
• No compiler
• Possible analysis
  – Inheritance tree computation
  – Fully qualified name resolution
  – Method call/return statement and type resolution
• Translation template for each analysis rule
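As an illustration of what fully qualified name resolution without a compiler might involve, the sketch below resolves a simple Java type name from a file's package and import statements. It is a simplified assumption-laden example, not the slides' translation template: `resolve_fqn`, its parameters, and the lookup order (a reduced version of Java's scoping rules) are all hypothetical, and it returns `None` when the repository does not yet contain enough information.

```python
from typing import List, Optional, Set


def resolve_fqn(simple_name: str, package: str, imports: List[str],
                known_types: Set[str]) -> Optional[str]:
    """Resolve a simple type name to a fully qualified name, or None.

    known_types holds the fully qualified names indexed so far; it may be
    incomplete because the crawler has not seen every dependency yet.
    """
    # 1. A single-type import names the type explicitly.
    for imp in imports:
        if imp.endswith("." + simple_name):
            return imp
    # 2. A type declared in the same package.
    candidate = f"{package}.{simple_name}"
    if candidate in known_types:
        return candidate
    # 3. On-demand imports (pkg.*), usable only if the type is already indexed.
    for imp in imports:
        if imp.endswith(".*"):
            candidate = imp[:-1] + simple_name  # "java.util.*" -> "java.util.List"
            if candidate in known_types:
                return candidate
    # 4. java.lang is imported implicitly.
    candidate = "java.lang." + simple_name
    if candidate in known_types:
        return candidate
    return None  # unresolved for now; may resolve once more code is indexed


if __name__ == "__main__":
    known = {"java.util.List", "java.lang.String", "com.example.Helper"}
    print(resolve_fqn("List", "com.example", ["java.util.*"], known))
```

Returning `None` rather than guessing keeps the result honest for partial code: the same call can succeed later, once the missing type has been indexed.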
Scalability Test
Queries:
1. Transitive closure-based
2. Method call
Dataset:
600,000 Java classes (no binaries) from a very large dataset (~400 GB)
http://www.ics.uci.edu/~lopes/datasets.
Hardware:
• 3 GB RAM
• 3.40 GHz CPU
SE-CodeSearch Highlights
• Avoid expensive knowledge modeling
• Optimized ontology population
• Backward-chaining reasoner
• Disk-based computation
  – Works on minimum hardware
SE-CodeSearch Highlights (Cont.)
• Parallelization
  – One pass code analysis
  – Static code analysis on
    • Complete code
    • Partial code
  – Independent of parsing order
    • First Package A then Package B
    • First Package B then Package A
  – Repository evolves incrementally
• Open World Reasoning (not available in relational DBs)
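The two properties on this slide, independence of parsing order and open-world reasoning, can be illustrated with a small sketch. Everything here is assumed for the example (`PACKAGE_A`, `PACKAGE_B`, the `extends` predicate, and the function names); it only contrasts an open-world answer of "unknown" with the "no" a closed-world lookup would give.

```python
from enum import Enum
from typing import Set, Tuple

Triple = Tuple[str, str, str]


class Answer(Enum):
    YES = "yes"
    UNKNOWN = "unknown"  # not provable from the facts indexed so far


# Hypothetical facts extracted from two packages that depend on each other.
PACKAGE_A: Set[Triple] = {("a.Widget", "extends", "b.Base")}
PACKAGE_B: Set[Triple] = {("b.Base", "extends", "b.Root")}


def index_in_order(*packages: Set[Triple]) -> Set[Triple]:
    """Add packages one by one; facts are only accumulated, never revised."""
    store: Set[Triple] = set()
    for package in packages:
        store |= package
    return store


def extends_transitively(triples: Set[Triple], sub: str, sup: str) -> bool:
    """Follow 'extends' facts from sub and report whether sup is reached."""
    stack, seen = [sub], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for s, p, o in triples:
            if s == current and p == "extends":
                if o == sup:
                    return True
                stack.append(o)
    return False


def closed_world_answer(triples: Set[Triple], sub: str, sup: str) -> bool:
    """Closed world: whatever cannot be proven is treated as false."""
    return extends_transitively(triples, sub, sup)


def open_world_answer(triples: Set[Triple], sub: str, sup: str) -> Answer:
    """Open world: absence of proof is reported as UNKNOWN, not as 'no'."""
    if extends_transitively(triples, sub, sup):
        return Answer.YES
    return Answer.UNKNOWN


if __name__ == "__main__":
    partial = index_in_order(PACKAGE_A)
    print(open_world_answer(partial, "a.Widget", "b.Root"))
    full = index_in_order(PACKAGE_B, PACKAGE_A)
    print(open_world_answer(full, "a.Widget", "b.Root"))
```

Since facts are only ever added and the query is evaluated over whatever is present, indexing Package A before Package B or the reverse yields the same store and the same answers.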
The poster
?
• SE-CodeSearch homepage: http://aseg.cs.concordia.ca/codesearch
• Source Code Ontology homepage: http://aseg.cs.concordia.ca/ontology
• ASEG Lab homepage: http://aseg.cs.concordia.ca
• Questions: keivanloo@ieee.org

Editor's Notes

  1. Introduction: SE-CodeSearch is a Semantic-rich Internet-scale Code Search (SICS) engine. It uses Semantic Web knowledge representation and reasoning capabilities to apply static code analysis to incomplete code. The reasoner infers semantic-rich facts as the crawler indexes new source code step by step. Paper title: SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. Contact: SE-CodeSearch homepage: http://aseg.cs.concordia.ca/codesearch; Source Code Ontology homepage: http://aseg.cs.concordia.ca/ontology; ASEG Lab homepage: http://aseg.cs.concordia.ca. Authors: I. Keivanloo, L. Roostapour, P. Schugerl, J. Rilling. For questions, contact keivanloo@ieee.org.
  2. General discussion: In daily life we search across many different domains, but search engines are limited by data size and domain complexity. For example, a geographical search engine can answer “Who lives in London?”, which is a simple query. But what about “Who has relatives in London?” That is a considerably more complex query, especially when it must be answered over Internet-scale data. A similar problem exists in Internet-scale source code search: current engines answer simple queries easily, but they cannot handle queries that require source code semantics to be taken into account.
  3. As on the previous slide, the same restriction applies to source code search. For example, you may search for “the file that contains the class definition” (shown in lines 4 and 5). This type of query is easy to answer with a keyword search. Finding a method call statement by specifying the receiver type, however, is less straightforward. This is shown in the example at line 22: what is the receiver of the methodB() call? To find the answer, the content of methodA() must be analyzed. Furthermore, methodA() could belong to another library, project or package. In the worst case, the content of methodA() has not been indexed yet and will only be indexed after the current code segment has been indexed and analyzed. This happens because the data source is the Internet, and data only becomes available as the crawler visits different Internet resources. Therefore, an incremental static source code analysis technique is required that can handle Internet-scale and incomplete data. Current search engines cannot answer questions like the previous one. While such interconnected “semantic” questions are not easy to answer in general, providing such a service for source code seems possible because the amount of data is smaller, the data is well structured, and the rules are fewer and simpler. Important note about the characteristics of Internet-scale source code search data: (1) It is tera-scale (terabytes). (2) The data can be incomplete, as in the sample above, where the content of methodA() is not available at first. (3) The data repository evolves, as in the sample above, where the content of methodA() is indexed by the crawler later.
  4. All types of code search queries gathered from the literature are classified here. The classification is based on the type of analysis a search engine has to support for each category. PSQ covers queries about code structure (also known as structural queries), such as a class definition statement. MDQ covers queries about metadata, such as the programming language or the application domain. TCQ covers queries that require a transitive closure computation, such as object-oriented inheritance trees. MCQ covers method call statements, which are among the most challenging queries, such as the one given on the previous slide. AIQ is required when we deal with incomplete repositories, to avoid invalid results; it requires the open world assumption to be considered. MXQ emphasizes the fact that a real query might be a mixture of the earlier query types, so the code search engine must let the user search with such queries.
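To make the TCQ category concrete, here is a minimal Python sketch of a transitive closure query over direct inheritance facts. It is only an illustration: SE-CodeSearch itself answers such queries with an OWL-DL reasoner, and the class names, the `DIRECT_EXTENDS` list and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical direct-inheritance facts: (subclass, superclass).
DIRECT_EXTENDS = [
    ("Polygon", "Shape"),
    ("Square", "Polygon"),
    ("Triangle", "Polygon"),
    ("Circle", "Shape"),
]


def build_index(facts):
    """Map each superclass to the set of its direct subclasses."""
    children = defaultdict(set)
    for sub, sup in facts:
        children[sup].add(sub)
    return children


def all_subclasses(cls, children):
    """Return every direct or indirect subclass of cls (transitive closure)."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for sub in children.get(current, ()):
            if sub not in found:
                found.add(sub)
                stack.append(sub)
    return found
```

A keyword search for "extends Shape" would only find `Polygon` and `Circle`; the closure also returns `Square` and `Triangle`, which is what separates a TCQ from a pure structural query.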
  5. To establish our research objective, we defined SICS as a ‘Semantic-rich Internet-scale Code Search’, which must support all query types introduced earlier and at the same time support search over tera-scale data. That is, the engine must be able to consider source code semantics while it extracts code facts during the static code analysis phase.
  6. We evaluated the code search engines available on the Internet to find out whether any of them qualifies as a SICS. The result shows that there are two classes, neither of which can be considered a SICS. The first class provides some fine-grained search but is limited to compilable code (that is, all the code segments must be available at the “fact extraction phase”). Remember the methodA() and methodB() example discussed earlier! These engines cannot extract facts if the content of methodA() is not available, even if it becomes available later. The other class supports both complete and incomplete code but only supports coarse-grained queries. So we found a gap to be bridged, which is our research motivation: we want a code search engine that provides fine-grained queries but is not limited to compilable code.
  7. Nevertheless, there are some major challenges for a SICS implementation. First, some code might be indexed without the imported binaries (incomplete code). Second, the code repository evolves 24/7, which means that some of the required binaries or code might be indexed later. The last challenge is the data size: we cannot rely on traditional in-memory (RAM-based) code analysis anymore.
  8. Considering the requirements (query types) and challenges discussed, we designed SE-CodeSearch, which is a knowledge-based code search infrastructure. The overall architecture is shown on the right-hand side. In short, it creates a small ontology for each code part, which includes two main sections: some code facts, and some static code analysis rules for further fact extraction. These ontologies are saved step by step into the RDF repository. At run-time, a backward-chaining reasoner applies the code analysis rules related to the given query to find the answer.
  9. The main part of SE-CodeSearch is its source code ontology, which we call SICSONT (SICS Ontology). SICSONT is publicly available at http://aseg.cs.concordia.ca/ontology Compared to other code ontologies, SICSONT can formally represent not only code facts but also static code analysis rules. In addition, SICSONT is optimized for Internet-scale reasoning (we avoided most of the expensive OWL constraints to keep it usable for real-time reasoning). Further details regarding the ontology population are given in the paper.
  10. SE-CodeSearch uses a Semantic Web-based static code analysis approach, which means the following. First, it is a knowledge-based approach, so it has its own advantages and challenges. Second, the inference engine is responsible for the static code analysis task. Third, we use OWL-DL as the language. OWL is the de facto standard for knowledge representation, so our repository can be shared on the Internet. OWL-DL is also decidable, since it is based on Description Logic, so every query is guaranteed to receive an answer in finite time. However, representing code analysis rules is not as easy as with a rule-based approach, which is one of our challenges: representing some code analysis rules in DL is much harder than in SWRL or Datalog.
  11. In addition, we do not need a compiler to extract facts, which lets us handle incomplete code (in contrast to current code search engines). As a major part of SE-CodeSearch, we created templates for the automatic translation of static code analysis rules into OWL-DL. The current version of SE-CodeSearch supports three types of analysis: (1) inheritance tree computation, (2) fully qualified name resolution, and (3) method call/return statement connection. A sample template for method call graph construction is shown here.
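As an illustration of the second analysis type, the sketch below resolves a simple class name to fully qualified name candidates without a compiler. It is plain Python, not the OWL-DL translation template used by SE-CodeSearch, and it follows a simplified version of Java's lookup order (single-type imports, then the same package, then on-demand imports). The function name, its parameters and the sample package names are hypothetical.

```python
def resolve_simple_name(simple_name, package, imports, known_classes):
    """Return candidate fully qualified names for simple_name.

    An empty list means "not resolvable yet", which can change once the
    crawler indexes more code.
    """
    # 1. A single-type import names the class directly, so no lookup in the
    #    repository is needed (the imported code may not be indexed yet).
    candidates = [imp for imp in imports if imp.endswith("." + simple_name)]
    if candidates:
        return candidates

    # 2. Same package first, then on-demand ("*") imports; these can only be
    #    confirmed against classes that have already been indexed.
    prefixes = [package] + [imp[:-2] for imp in imports if imp.endswith(".*")]
    for prefix in prefixes:
        fqn = prefix + "." + simple_name
        if fqn in known_classes:
            candidates.append(fqn)
    return candidates
```

For example, with the imports `java.util.List` and `org.example.util.*`, the name `List` resolves immediately, while `Helper` resolves only after `org.example.util.Helper` has been indexed.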
  12. Since one of the biggest challenges for knowledge-based applications is scalability, we ran scalability tests on the SE-CodeSearch implementation, which currently supports the Java language. We used a regular desktop computer with 3 gigabytes of RAM as the server. Our data set is based on a very large Java code repository extracted from SourceForge, which is about 400 gigabytes. We selected about 600,000 classes and removed the binaries to simulate incomplete data. We selected two major types of queries, TCQ and MCQ, and observed their response time as the repository grew. The graph shown here presents the test result: the response time was not affected by the repository size, which is a very positive observation for SE-CodeSearch.
  13. Here we discuss some of the SE-CodeSearch highlights. Although OWL-DL and the Semantic Web languages have many expressive features, we avoided them to remain scalable. The ontology population is optimized to use fewer hardware resources than usual. A disk-based inference engine is used, which means it needs only a little memory. The inference engine is backward-chaining; that is, we do not extract all the possible facts from the repository. The number of all possible facts could be huge. For example, in Java every class is a subclass of the Object class. This means that a forward-chaining reasoner (static code analysis) must infer, for each class definition, a fact saying that the observed class is a subclass of the Object class, which doubles the total number of facts in the repository! The backward-chaining reasoner, in contrast, does not infer any fact beforehand. It waits for the query and then applies the rules only to the subset of data that is related to the query and its input criteria.
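The difference between the two reasoning styles can be sketched in a few lines of Python. This is a toy model rather than the disk-based reasoner used by SE-CodeSearch; the `DECLARED` facts and all function names are hypothetical, and the sketch assumes that a class without a declared superclass extends `Object` directly.

```python
# Hypothetical extracted facts: class name -> declared superclass
# (None when the class has no `extends` clause).
DECLARED = {"Polygon": None, "Square": "Polygon", "Circle": None}


def parent_of(cls):
    """Declared superclass, falling back to Java's implicit root class."""
    declared = DECLARED.get(cls)
    return declared if declared is not None else "Object"


def forward_chain():
    """Materialize every subclass fact up front, before any query arrives."""
    facts = set()
    for cls in DECLARED:
        current = cls
        while current != "Object":
            current = parent_of(current)
            facts.add((cls, current))
    return facts


def is_subclass(cls, target):
    """Backward chaining: walk only the chain relevant to this one query."""
    current = cls
    while current != "Object":
        current = parent_of(current)
        if current == target:
            return True
    return False
```

In this toy repository, `forward_chain()` stores one extra `Object` fact per class definition, whereas `is_subclass("Square", "Object")` touches only the two classes on Square's chain and stores nothing.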
  14. The knowledge-based approach also makes it easier to parallelize the processing and analysis, since: (1) we analyze each piece of code only once, (2) we support incomplete code without binaries, (3) we can apply static analysis independently of the parsing order, and (4) our repository can evolve incrementally. Also note that our model supports the “open world assumption” by default, which is unavailable in most relational databases.
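The order independence described in this note can be sketched as follows, reusing the methodA()/methodB() example from note 3. The `Repository` class and its methods are hypothetical and only mimic the idea: facts are stored as they arrive and resolution is deferred to query time, so the indexing order does not change the answer. `None` stands for "not known yet", in the spirit of the open world assumption, rather than "does not exist".

```python
class Repository:
    """Toy fact store in which facts may arrive in any order."""

    def __init__(self):
        self.return_types = {}  # method name -> declared return type
        self.calls = []         # (producer method, called method) pairs

    def index_method(self, name, return_type):
        """Record a method declaration found by the crawler."""
        self.return_types[name] = return_type

    def index_chained_call(self, producer, called):
        """Record a call such as producer().called() without resolving it."""
        self.calls.append((producer, called))

    def receiver_types(self, called):
        """Resolve receiver types at query time; None means 'not known yet'."""
        return [
            self.return_types.get(producer)
            for producer, target in self.calls
            if target == called
        ]
```

Indexing the call `methodA().methodB()` before or after the declaration of `methodA` yields the same result once both facts are present; until then the query reports the receiver as unknown instead of failing.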
  15. SE-CodeSearch poster. Available at: http://aseg.cs.concordia.ca/codesearch
  16. SE-CodeSearch homepage: http://aseg.cs.concordia.ca/codesearch Source Code Ontology homepage: http://aseg.cs.concordia.ca/ontology ASEG Lab. homepage: http://aseg.cs.concordia.ca Any question: keivanloo@ieee.org