A Linked Data Platform for
         Mining Software Repositories


  Iman Keivanloo
  Christopher Forbes
  Aseel Hmood
  Mostafa Erfani
  Christopher Neal
  George Peristerakis
  Juergen Rilling



MSR 2012 June 2
SeCold is a “Wikipedia of source code
 related facts” produced from over
 1,000,000 open source projects.


SeCold main objectives:
 (1) establish the fundamental framework
 (2) perform data analysis


SeCold 2.0 is an ongoing research project
 (currently in its second year)
Software Analysis Story

Inputs: Issue Tracker, Source Code, Mailing List, Version Control, …

  Inputs → Some analysis → Some output

  Raw Data → Extraction Process → Structured Internal Data Representation → Analysis Process → Structured Output

  [Source Code Analysis: A Roadmap, FOSE’07]

Sharing

Inputs: Issue Tracker, Source Code, Mailing List, Version Control, …

  [Source Code Analysis: A Roadmap, FOSE’07]
  [Fostering synergies: how … ICSE-SUITE’10]

Integration

Inputs: Issue Tracker, Source Code, Mailing List, Version Control, …

  Four parallel pipelines, each: Internal Data → Analysis Process → Output
  Across pipelines: Alignment, Inter-dataset Analysis

How to align?

               The Challenge
   Dataset A               Dataset B




History of Data Sharing




Linked Data is about being …

  Online:      a URL for each fact!
  Standard:    uses HTTP, XML, HTML and …
  Open:        usable by both humans and machines
  NOT Static:  data and schema are editable
  Graph-based: graph of triples vs. XML (tree)
  Integrating: integrated/linked on the fly

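The "graph of triples" and "linked on the fly" points can be illustrated with a minimal sketch. It assumes a dataset is simply a set of (subject, predicate, object) triples; every URL and predicate name below is a hypothetical placeholder, not part of SeCold's vocabulary.

```python
# A triple is (subject, predicate, object); a dataset is a set of triples.
# All URLs and predicate names are hypothetical placeholders.
FILE = "http://example.org/file/Foo.java"

dataset_a = {
    (FILE, "http://example.org/vocab/hasLicense", "GPL 2"),
}
dataset_b = {
    (FILE, "http://example.org/vocab/mentionedInIssue", "http://example.org/issue/42"),
}


def merge(*datasets):
    """Integrate datasets by set union; shared URLs link the facts."""
    merged = set()
    for dataset in datasets:
        merged |= dataset
    return merged


def facts_about(graph, subject):
    """Return all (predicate, object) pairs known for one subject URL."""
    return sorted((p, o) for s, p, o in graph if s == subject)


graph = merge(dataset_a, dataset_b)
for predicate, obj in facts_about(graph, FILE):
    print(predicate, "->", obj)
```

Because both datasets name the file with the same URL, no schema mapping step is needed in this sketch: the union already connects the license fact to the issue fact.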
SeCold Project
A Linked Data Platform for Mining Software Repositories




1- Vocabulary Set
(aka Schema, Data Model, Ontology)


Source Code Ecosystem Ontology Family (SECON)
SOCON, VERON, METON, ISSUEON, LICENSON, CLON








2- URL/ID Generation Schema
A URL for each fact (e.g., a variable definition statement)
http://aseg.cs.concordia.ca/secold/page/type/java/DatasetChangeInfo

Integration Challenge
There are several ways to generate URLs (e.g., random)
REPRODUCIBLE IDENTIFIERS

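One way to read "reproducible identifiers" is that the URL is derived only from stable properties of the fact, so two independent extractions produce the same URL without coordination. The sketch below is an assumption about how such a scheme could look, not SeCold's actual scheme: the base URL, the chosen fields, and the use of SHA-1 are all hypothetical.

```python
import hashlib

BASE = "http://example.org/fact/"  # hypothetical base URL, not SeCold's


def fact_url(project, path, kind, name):
    """Derive a URL from stable properties of a fact, so any extractor
    that sees the same fact generates the same identifier."""
    # The unit-separator character keeps field boundaries unambiguous.
    key = "\x1f".join([project, path, kind, name])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return BASE + digest


first = fact_url("demo-project", "src/DatasetChangeInfo.java", "type", "DatasetChangeInfo")
second = fact_url("demo-project", "src/DatasetChangeInfo.java", "type", "DatasetChangeInfo")
other = fact_url("demo-project", "src/Other.java", "type", "Other")
print(first == second, first == other)
```

A random identifier would differ on every run, which is what makes later alignment between datasets hard; a derived identifier avoids that at the cost of having to agree on the input fields.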



3- Baseline Data Publication

  Dataset               Triples
  General Information   ~2,000,000
  Source Code           ~2,000,000,000
  Issue Tracker         ~30,000,000
  Version Control       ~700,000,000

  ~1 MILLION PROJECTS

SeCold in the Linked Data Cloud (LOD)

SeCold: among the 9 largest datasets in the cloud

  Regions labeled on the diagram: Media, Publication, Government, Life Science

  Circle size    Triple count
  Very large     >1B
  Large          1B-10M
  Medium         10M-500k
  Small          500k-10k
  Very small     <10k

  [Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]

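As a quick check of where SeCold falls in the legend, the sketch below sums the approximate triple counts from the "Baseline Data Publication" slide and maps the total to a circle size. How the legend treats values exactly on a boundary is not stated, so the inclusive lower bounds here are an assumption.

```python
# Approximate triple counts from the "Baseline Data Publication" slide.
BASELINE_TRIPLES = {
    "General Information": 2_000_000,
    "Source Code": 2_000_000_000,
    "Issue Tracker": 30_000_000,
    "Version Control": 700_000_000,
}


def size_class(triples):
    """Map a triple count to the LOD cloud legend's circle size.
    Boundary handling (inclusive lower bounds) is an assumption."""
    if triples > 1_000_000_000:
        return "Very large"
    if triples >= 10_000_000:
        return "Large"
    if triples >= 500_000:
        return "Medium"
    if triples >= 10_000:
        return "Small"
    return "Very small"


total = sum(BASELINE_TRIPLES.values())
print(f"{total:,} triples -> {size_class(total)}")
```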
secold.org




Showcase #1 (Similar Code Search)




Showcase #2, Part 1 (Copyright violation detection)

Input: source code of 25K projects

  SeClone [SeClone … ICPC’11 & WCRE’11]
    Internal Data → Analysis Process → Output: line-level fingerprints, clones (Type 1, 2 and 3)

  Ninka [A sentence-matching …, ASE’10]
    Internal Data → Analysis Process → Output: license per file

  Both outputs → Upload

Showcase #2, Part 2 (Copyright violation detection)

  SeClone [SeClone … ICPC’11 & WCRE’11] output: line-level fingerprints, clones (Type 1, 2 and 3)
  Ninka [A sentence-matching …, ASE’10] output: license per file

Copyright violation detection over the uploaded outputs:

  select ?fileA ?fileB where {
    ?fileA testxi ?fingerprint .
    ?fileB testxi ?fingerprint .
    ?fileA hasLicense ?la .
    ?fileB hasLicense ?lb .
    Filter (?la != ?lb) }

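The logic of the query above can be sketched in plain Python as a join on shared fingerprints followed by a license comparison. This is only an in-memory illustration under stated assumptions: the file names, fingerprints, and licenses are hypothetical sample data, and the function name is made up for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample facts standing in for the uploaded outputs.
fingerprints = [  # (file, line-level fingerprint)
    ("a/Util.java", "fp1"),
    ("b/Helper.java", "fp1"),
    ("c/Tool.java", "fp2"),
    ("d/Tool.java", "fp2"),
]
licenses = {  # file -> license
    "a/Util.java": "GPL 2",
    "b/Helper.java": "Apache 2",
    "c/Tool.java": "BSD",
    "d/Tool.java": "BSD",
}


def license_conflicts(fingerprints, licenses):
    """Return file pairs that share a fingerprint but differ in license."""
    files_by_fingerprint = defaultdict(set)
    for file, fingerprint in fingerprints:
        files_by_fingerprint[fingerprint].add(file)

    conflicts = set()
    for files in files_by_fingerprint.values():
        for file_a, file_b in combinations(sorted(files), 2):
            la, lb = licenses.get(file_a), licenses.get(file_b)
            # Files without a known license are skipped, as in the query,
            # where the hasLicense patterns must match.
            if la is not None and lb is not None and la != lb:
                conflicts.add((file_a, file_b))
    return sorted(conflicts)


print(license_conflicts(fingerprints, licenses))
```

Unlike the query, which would return both (A, B) and (B, A), this sketch reports each unordered pair once. A hit is a candidate for review rather than a confirmed violation, since some license combinations are compatible.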
Showcase #3 (Statistical Analysis)

  License               2009       2012
  No License            46%        42%
  All Rights Reserved   13%        14%
  GPL 2                 12%        17%
  Apache 2              9.70%      9%
  LGPL 2.1              8.80%      12%
  BSD                   3%         3%
  Mozilla PL 1.0        2.60%      1%
  MIT                   0.92%      0%
  Apache 1              0.65%      0%
  Other                 0.00568    1%
  BSD (second label)    0.27%
  Mozilla PL 1.1        0.13%      0%
  PHP                   0.08%      0%
  Sleepycat             0.06%      0%
  Artistic              0.02%      0%
  Nokos                 0.01%      0%
  Shareware             0.00%      0%
  Patented              0%         0%

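To make the 2009 to 2012 shift easier to read, the sketch below computes the percentage-point change for the larger slices, using the values as printed on the two charts. The 2012 chart shows whole percents only, so small differences may be rounding effects rather than real changes.

```python
# license: (2009 share in %, 2012 share in %), as printed on the charts.
SHARES = {
    "No License": (46.0, 42.0),
    "All Rights Reserved": (13.0, 14.0),
    "GPL 2": (12.0, 17.0),
    "Apache 2": (9.7, 9.0),
    "LGPL 2.1": (8.8, 12.0),
    "BSD": (3.0, 3.0),
    "Mozilla PL 1.0": (2.6, 1.0),
}


def changes(shares):
    """Percentage-point change from 2009 to 2012, largest gain first."""
    deltas = {
        name: round(y2012 - y2009, 2)
        for name, (y2009, y2012) in shares.items()
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


for name, delta in changes(SHARES):
    print(f"{name:<20} {delta:+.1f}")
```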
5th LF Energy Power Grid Model Meet-up Slides
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 

SeCold - A Linked Data Platform for Mining Software Repositories

  • 1. A Linked Data Platform for Mining Software Repositories. Iman Keivanloo, Christopher Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, Juergen Rilling. MSR 2012, June 2
  • 2. SeCold is a “Wikipedia of source code related facts” produced from over 1,000,000 open source projects. SeCold main objectives: (1) establish the fundamental framework; (2) perform data analysis. SeCold 2.0 is an ongoing research project (currently in its second year).
  • 3. Software Analysis Story: Issue Tracker, Source Code, Mailing List, Versioning Control, … → Some analysis → Some output
  • 4. Software Analysis Story: Issue Tracker, Source Code, Mailing List, Versioning Control, … → Some output. Raw Data → Extraction Process → Structured Internal Data Representation → Analysis Process → Structured Output [Source Code Analysis: A Roadmap, FOSE’07]
  • 5. Issue Tracker, Source Code, Mailing List, Versioning Control, …; Sharing [Source code analysis: a roadmap, FOSE’07] [Fostering synergies: how … ICSE-SUITE’10]
  • 6. Integration; Alignment; Inter-dataset Analysis. Sources: Issue Tracker, Source Code, Mailing List, Versioning Control, …; each pipeline is shown as Internal Data → Analysis Process → Output
  • 7. The Challenge: How to align? Dataset A, Dataset B
  • 8. History of Data Sharing
  • 9. Linked Data is about being …: Online (a URL for each fact!); Standard (uses HTTP, XML, HTML and …); Open (usable for both humans and machines); NOT Static (data and schema are editable); Graph-based (graph of triples vs. XML (tree)); Integrating (integrated/linked on the fly)
  • 10. SeCold Project: A Linked Data Platform for Mining Software Repositories. 1- Vocabulary Set (aka Schema, Data Model, Ontology): Source Code Ecosystem Ontology Family (SECON): SOCON, VERON, METON, ISSUEON, LICENSON, CLON
  • 11. SeCold Project: A Linked Data Platform for Mining Software Repositories. 2- URL/ID Generation Schema: a URL for each piece of fact (e.g. a variable definition statement), such as http://aseg.cs.concordia.ca/secold/page/type/java/DatasetChangeInfo. Integration Challenge: there are several ways to generate URLs (e.g. random); SeCold uses REPRODUCIBLE IDENTIFIERS
  • 12. SeCold Project: A Linked Data Platform for Mining Software Repositories. 3- Baseline Data Publication: General Information (~2,000,000 triples); Source Code (~2,000,000,000 triples); Issue Tracker (~30,000,000 triples); Version Control (~700,000,000 triples); ~1 MILLION PROJECTS
  • 13. SeCold LinkedData Cloud (LOD). SeCold: among the 9 largest datasets in the cloud. Domain labels visible in the diagram: Media, Publication, Government, Life Science. Legend (circle size by triple count): Very large >1B; Large 1B-10M; Medium 10M-500k; Small 500k-10k; Very small <10k. [Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]
  • 15. Showcase #1 (Similar Code Search)
  • 16. Showcase #2 – Part 1 (Copyright violation detection): Source Code of 25K projects (Upload); SeClone [SeClone … ICPC’11 & WCRE’11]: Line level fingerprints, Clone (Type 1, 2 and 3); Ninka [A sentence-matching …, ASE’10]: License per file; each tool is shown as Internal Data → Analysis Process → Output
  • 17. Showcase #2 – Part 2 (Copyright violation detection): SeClone [SeClone … ICPC’11 & WCRE’11]: Line level fingerprints, Clone (Type 1, 2 and 3); Ninka [A sentence-matching …, ASE’10]: License per file. Copyright violation detection: select ?fileA ?fileB where { ?fileA testxi ?fingerprint . ?fileB testxi ?fingerprint . ?fileA hasLicense ?la . ?fileB hasLicense ?lb . Filter (?la != ?lb) }
  • 18. Showcase #3 (Statistical Analysis). License distribution pie charts. 2009: No License, 46%; All Rights Reserved, 13%; GPL 2, 12%; Apache 2, 9.70%; LGPL 2.1, 8.80%; BSD, 3%; Mozilla PL 1.0, 2.60%; MIT, 0.92%; Apache 1, 0.65%; BSD, 0.27%; Mozilla PL 1.1, 0.13%; PHP, 0.08%; Sleepycat, 0.06%; Artistic, 0.02%; Nokos, 0.01%; Shareware, 0.00%; Patented, 0%; Other, 0.00568. 2012: No License, 42%; GPL 2, 17%; All Rights Reserved, 14%; LGPL 2.1, 12%; Apache 2, 9%; BSD, 3%; all remaining licenses (Mozilla PL 1.0, Mozilla PL 1.1, MIT, Other, Apache 1, PHP, Sleepycat, Artistic, Nokos, Shareware, Patented) at 1% or 0% each
  • 19. MSR 2012 19
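
Slide 11 contrasts random URLs with reproducible identifiers but does not show how SeCold derives them. The following is a minimal Python sketch of the general idea only, assuming a hash over stable properties of a fact; the namespace `http://example.org/secold/resource/`, the function names, and the choice of input fields are all hypothetical and are not taken from the SeCold implementation.

```python
import hashlib
import uuid

# Hypothetical namespace, used only for illustration (not SeCold's real one).
BASE = "http://example.org/secold/resource/"


def random_url() -> str:
    """Random identifier: two extractions of the same fact get different URLs."""
    return BASE + uuid.uuid4().hex


def reproducible_url(project: str, version: str, path: str, entity: str) -> str:
    """Deterministic identifier: the same inputs always yield the same URL.

    The input fields are an assumption; any stable description of the fact
    would serve. A separator that cannot occur in the fields avoids
    accidental collisions such as ("ab", "c") vs. ("a", "bc").
    """
    key = "\x1f".join([project, version, path, entity])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return BASE + digest


if __name__ == "__main__":
    print(reproducible_url("demo-project", "1.0", "src/A.java", "field:count"))
    print(random_url())
```

With a deterministic scheme, two groups that extract the same fact independently arrive at the same URL, so their datasets can be linked without a separate alignment step.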

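The copyright violation query on slide 17 joins files that share a line-level fingerprint and keeps the pairs whose licenses differ. A rough Python equivalent over in-memory triples is sketched below. The predicate names `testxi` and `hasLicense` come from the slide; the file names, fingerprints, and license values are invented sample data. Unlike the SPARQL query, which returns each match in both orders, this sketch reports each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations

# Invented sample triples (subject, predicate, object) for illustration only.
TRIPLES = [
    ("fileA.java", "testxi", "fp1"),
    ("fileB.java", "testxi", "fp1"),
    ("fileC.java", "testxi", "fp2"),
    ("fileD.java", "testxi", "fp1"),
    ("fileA.java", "hasLicense", "GPL 2"),
    ("fileB.java", "hasLicense", "Apache 2"),
    ("fileC.java", "hasLicense", "GPL 2"),
    ("fileD.java", "hasLicense", "GPL 2"),
]


def license_conflicts(triples):
    """Return sorted pairs of files that share a fingerprint but differ in license."""
    files_by_fingerprint = defaultdict(set)
    licenses_by_file = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "testxi":
            files_by_fingerprint[obj].add(subject)
        elif predicate == "hasLicense":
            licenses_by_file[subject].add(obj)

    conflicts = set()
    for files in files_by_fingerprint.values():
        for a, b in combinations(sorted(files), 2):
            # Mirrors FILTER (?la != ?lb): some license of a differs from some license of b.
            if any(la != lb for la in licenses_by_file[a] for lb in licenses_by_file[b]):
                conflicts.add((a, b))
    return sorted(conflicts)


if __name__ == "__main__":
    for pair in license_conflicts(TRIPLES):
        print(pair)
```

Files without any license triple never match, which is also how the query on the slide behaves.
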
Editor's Notes

  1. abstraction
  2. abstraction
  3. The idea is sharing: to avoid repetition, speed up the analysis process, decrease cost, and ease research. But the first question is: sharing what?!
  4. The idea is sharing: to avoid repetition, speed up the analysis process, decrease cost, and ease research. But the first question is: sharing what?!
  5. http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Ftitle%0D%0AWHERE+{%0D%0A++++%3Fgame+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3AFirst-person_shooters%3E+.%0D%0A++++%3Fgame+foaf%3Aname+%3Ftitle+.%0D%0A}%0D%0Alimit+3&debug=on&timeout=&format=text%2Fhtml&save=display&fname=
  6. http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Ftitle%0D%0AWHERE+{%0D%0A++++%3Fgame+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3AFirst-person_shooters%3E+.%0D%0A++++%3Fgame+foaf%3Aname+%3Ftitle+.%0D%0A}%0D%0Alimit+3&debug=on&timeout=&format=text%2Fhtml&save=display&fname=
  7. What does it have to offer? How is it different from XML, DBs, …
  8. What does it have to offer? How is it different from XML, DBs, …
  9. What does it have to offer? How is it different from XML, DBs, …
  10. What does it have to offer? How is it different from XML, DBs, …
  11. What does it have to offer? How is it different from XML, DBs, …
  12. The idea is sharing: to avoid repetition, speed up the analysis process, decrease cost, and ease research. But the first question is: sharing what?!
  13. The idea is sharing: to avoid repetition, speed up the analysis process, decrease cost, and ease research. But the first question is: sharing what?!
  14. The idea is sharing: to avoid repetition, speed up the analysis process, decrease cost, and ease research. But the first question is: sharing what?!
  15. The idea is sharing: to avoid repetition, speed up the analysis process, decrease cost, and ease research. But the first question is: sharing what?!
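
Notes 5 and 6 carry a URL-encoded request to the DBpedia SPARQL endpoint. As a small sketch, the Python standard library can recover the readable query from that URL without contacting the endpoint. The URL below is the one from the notes, with the HTML entity `&amp;` read as a plain `&`; the names `NOTE_URL` and `sparql_from_url` are introduced here for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# URL from editor's notes 5 and 6, split across lines for readability.
NOTE_URL = (
    "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org"
    "&query=SELECT+%3Ftitle%0D%0AWHERE+{%0D%0A++++%3Fgame+"
    "%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+"
    "%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3AFirst-person_shooters%3E"
    "+.%0D%0A++++%3Fgame+foaf%3Aname+%3Ftitle+.%0D%0A}%0D%0Alimit+3"
    "&debug=on&timeout=&format=text%2Fhtml&save=display&fname="
)


def sparql_from_url(url: str) -> str:
    """Extract and decode the 'query' parameter of a SPARQL endpoint URL."""
    params = parse_qs(urlsplit(url).query)
    return params["query"][0]


if __name__ == "__main__":
    # Prints the decoded SPARQL text, one clause per line.
    for line in sparql_from_url(NOTE_URL).splitlines():
        print(line)
```

The decoded query selects the names of three resources in DBpedia's first-person shooters category, which illustrates the "a URL for each fact" point from slide 9.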