SeCold - A Linked Data Platform for Mining Software Repositories

Iman Keivanloo
Christopher Forbes
Aseel Hmood
Mostafa Erfani
Christopher Neal
George Peristerakis
Juergen Rilling

MSR 2012 June 2

SeCold is a “Wikipedia of source code
related facts” produced from over
1,000,000 open source projects.

SeCold main objectives:
(1) establish the fundamental framework
(2) perform data analysis

SeCold 2.0 is an ongoing research project
(currently in its second year)

Software Analysis Story

Inputs: Issue Tracker, Source Code, Mailing List, Versioning Control, …
Flow: inputs → Some analysis → Some output


Software Analysis Story

Inputs: Issue Tracker, Source Code, Mailing List, Versioning Control, …
Flow: Raw Data → Extraction Process → Structured Internal Data Representation → Analysis Process → Structured Output
Result: Some output

[Source Code Analysis: A Roadmap, FOSE’07]


Issue Tracker
Source Code
Mailing List
Versioning Control
…

Sharing

[Source code analysis: a roadmap, FOSE’07]
[Fostering synergies: how … ICSE-SUITE’10]


Integration

Inputs: Issue Tracker, Source Code, Mailing List, Versioning Control, …
Per dataset: Internal Data → Analysis Process → Output
Across datasets: Alignment, Inter-dataset Analysis


How to align?

The Challenge
Dataset A Dataset B


History of Data Sharing


Linked Data is about being …

Online: a URL for each fact!
Standard: uses HTTP, XML, HTML and …
Open: usable by both humans and machines
NOT Static: data and schema are editable
Graph-based: graph of triples vs. XML (tree)
Integrating: integrated/linked on the fly
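
The "graph of triples" and "linked on the fly" points can be illustrated with a minimal Python sketch. It is only an illustration: the example.org URLs and the predicate names partOf and hasIssue are hypothetical placeholders, not SeCold vocabulary, and plain tuples stand in for a real triple store.

```python
# Minimal sketch: facts as (subject, predicate, object) triples.
# All URLs and predicate names are hypothetical placeholders.

dataset_a = {
    ("http://example.org/file/Foo.java", "hasLicense", "GPL 2"),
    ("http://example.org/file/Foo.java", "partOf", "http://example.org/project/p1"),
}
dataset_b = {
    ("http://example.org/project/p1", "hasIssue", "http://example.org/issue/42"),
}

# Both datasets name the project with the same URL, so a plain set union
# already links them; no separate schema-mapping step is required here.
merged = dataset_a | dataset_b


def objects(graph, subject, predicate):
    """Return every object reachable from `subject` via `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def issues_for_file(graph, file_url):
    """Follow file -> project -> issue edges in the given graph."""
    found = set()
    for project in objects(graph, file_url, "partOf"):
        found |= objects(graph, project, "hasIssue")
    return found
```

Calling issues_for_file on dataset_a alone finds nothing, while the same call on merged reaches the issue published in the other dataset, which is the integration effect the slide describes.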


SeCold Project

1- Vocabulary Set
(aka Schema, Data Model, Ontology)

Source Code Ecosystem Ontology Family (SECON)
SOCON, VERON, METON, ISSUEON, LICENSON, CLON


SeCold Project

2- URL/ID Generation Schema
A URL for each fact (e.g. a variable definition statement)
http://aseg.cs.concordia.ca/secold/page/type/java/DatasetChangeInfo

Integration Challenge
Several ways to generate URLs (e.g. random)
REPRODUCIBLE IDENTIFIERS
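
One way to picture a reproducible identifier, as opposed to a random one, is to derive the URL from stable properties of the fact itself. The sketch below is an assumption for illustration only: the base URL, the chosen fields, and the use of SHA-1 are hypothetical and do not describe SeCold's actual generation scheme.

```python
import hashlib

# Hypothetical base URL, not the real SeCold namespace.
BASE = "http://example.org/secold-sketch/"


def fact_url(project, version, path, fact_kind, signature):
    """Build a URL from stable properties of the fact (assumed fields).

    The same inputs always yield the same URL, so two parties that
    extract the same fact independently arrive at the same identifier.
    """
    key = "\n".join([project, version, path, fact_kind, signature])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"{BASE}{fact_kind}/{digest}"
```

With a random generator, two extractions of the same fact would receive different URLs and would need an explicit alignment step; with a deterministic function such as this one they coincide without coordination.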


SeCold Project

3- Baseline Data Publication
General Information: ~2,000,000 triples
Source Code: ~2,000,000,000 triples
Issue Tracker: ~30,000,000 triples
Version Control: ~700,000,000 triples

~1 MILLION PROJECTS


SeCold
Linked Data Cloud (LOD)

SeCold: among the 9 largest datasets in the cloud

[Figure: Linking Open Data cloud diagram. Domain labels visible: Media, Publication, Government, Life Science.]

Circle size (triple count):
Very large: >1B
Large: 1B-10M
Medium: 10M-500k
Small: 500k-10k
Very small: <10k

[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]


Showcase #1 (Similar Code Search)


Showcase #2 – Part 1 (Copyright violation detection)

Source Code of 25K projects → Upload

Data Process: SeClone [SeClone … ICPC’11 & WCRE’11]
Line level fingerprints
Clone (Type 1, 2 and 3)

Data Process: Ninka [A sentence-matching …, ASE’10]
License per file


Showcase #2 – Part 2 (Copyright violation detection)

SeClone [SeClone … ICPC’11 & WCRE’11]
Output: Line level fingerprints; Clone (Type 1, 2 and 3)

Ninka [A sentence-matching …, ASE’10]
Output: License per file

Copyright violation detection:

select ?fileA ?fileB where {
  ?fileA testxi ?fingerprint .
  ?fileB testxi ?fingerprint .
  ?fileA hasLicense ?la .
  ?fileB hasLicense ?lb .
  Filter (?la != ?lb) }
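
The logic of the query can be mirrored in plain Python as a rough sketch: join files on a shared fingerprint, then keep the pairs whose licenses differ. The file names, fingerprints, and licenses below are made-up sample data, and the function name license_conflicts is hypothetical. Unlike a SPARQL result, the sketch returns a set, so repeated pairs are collapsed. A reported pair is only a candidate for review, not proof of a violation.

```python
from collections import defaultdict

# Made-up sample facts: (file, fingerprint) pairs and a license per file.
fingerprints = [
    ("a/Util.java", "fp1"),
    ("b/Helper.java", "fp1"),
    ("c/Other.java", "fp2"),
    ("d/Copy.java", "fp2"),
]
licenses = {
    "a/Util.java": "GPL 2",
    "b/Helper.java": "Apache 2",
    "c/Other.java": "MIT",
    "d/Copy.java": "MIT",
}


def license_conflicts(fingerprints, licenses):
    """Return (fileA, fileB) pairs sharing a fingerprint under different licenses."""
    files_by_fp = defaultdict(set)
    for file_name, fp in fingerprints:
        files_by_fp[fp].add(file_name)

    pairs = set()
    for files in files_by_fp.values():
        for file_a in files:
            for file_b in files:
                # Files without a known license never match, as in the query.
                if file_a not in licenses or file_b not in licenses:
                    continue
                if licenses[file_a] != licenses[file_b]:
                    pairs.add((file_a, file_b))
    return pairs
```

As in the query, each conflicting pair appears in both orders, (A, B) and (B, A); a caller who wants each pair once could keep only those with file_a < file_b.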


Showcase #3 (Statistical Analysis)

2009:
Apache 2, 9.70%
GPL 2, 12%
LGPL 2.1, 8.80%
BSD, 3%
PHP, 0.08%
Sleepycat, 0.06%
Mozilla PL 1.1, 0.13%
Mozilla PL 1.0, 2.60%
Artistic, 0.02%
All Rights Reserved, 13%
MIT, 0.92%
Nokos, 0.01%
Shareware, 0.00%
Apache 1, 0.65%
No License, 46%
Other, 0.00568
Patented, 0%
BSD, 0.27%

2012:
Apache 2, 9%
All Rights Reserved, 14%
LGPL 2.1, 12%
GPL 2, 17%
BSD, 3%
Mozilla PL 1.0, 1%
Other, 1%
No License, 42%
Mozilla PL 1.1, 0%
Nokos, 0%
PHP, 0%
Apache 1, 0%
Sleepycat, 0%
MIT, 0%
Artistic, 0%
Shareware, 0%
Patented, 0%