A Linked Data Platform for
         Mining Software Repositories


  Iman Keivanloo
  Christopher Forbes
  Aseel Hmood
  Mostafa Erfani
  Christopher Neal
  George Peristerakis
  Juergen Rilling



MSR 2012 June 2
SeCold is a “Wikipedia of source code
 related facts” produced from over
 1,000,000 open source projects.


SeCold main objectives:
 (1) establish the fundamental framework
 (2) perform data analysis


SeCold 2.0 is an ongoing research project
 (currently in its second year)
Software Analysis Story

Issue Tracker
Source Code
Mailing List
Versioning Control
…

[Diagram: repositories -> some analysis -> some output]

Software Analysis Story

Issue Tracker
Source Code
Mailing List
Versioning Control
…

[Diagram: Raw Data -> Extraction Process -> Structured Internal Data Representation -> Analysis Process -> Structured Output (some output)]

[Source Code Analysis: A Roadmap, FOSE’07]

Issue Tracker
Source Code
Mailing List
Versioning Control
…

                                                                  Sharing




                     [Source code analysis: a roadmap, FOSE’07]
                     [Fostering synergies: how … ICSE-SUITE’10]

Integration

Issue Tracker
Source Code
Mailing List
Versioning Control
…

[Diagram: four separate pipelines, each Internal Data -> Analysis Process -> Output; the outputs feed Alignment and Inter-dataset Analysis]

How to align?

               The Challenge
   Dataset A               Dataset B




History of Data Sharing




Linked Data is about being …

 Online: a URL for each fact!
 Standard: uses HTTP, XML, HTML and …
 Open: usable by both humans and machines
 NOT Static: data and schema are editable
 Graph-based: graph of triples vs. XML (tree)
 Integrating: integrated/linked on the fly

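The "graph-based" and "integrating" points can be illustrated with a minimal Python sketch. It assumes facts are stored as (subject, predicate, object) triples whose subjects are URLs. The example.org URLs and the predicate names are hypothetical and are not taken from SeCold.

```python
# Hypothetical triples from two independently produced datasets.
# All URLs and predicate names below are illustrative assumptions.
FILE = "http://example.org/project/foo/File.java"

dataset_a = {
    (FILE, "hasLicense", "GPL 2"),
    (FILE, "definedIn", "http://example.org/project/foo"),
}
dataset_b = {
    (FILE, "hasIssue", "http://example.org/issue/42"),
}


def describe(graph, subject):
    """Return every (predicate, object) pair known about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# Both datasets use the same URL for the same file, so integrating them
# is a plain set union: no tree restructuring is required.
merged = dataset_a | dataset_b

facts = describe(merged, FILE)
for predicate, obj in facts:
    print(predicate, obj)
```

The union only links the two datasets because they agree on the URL of the file, which is why identifier generation matters for integration.
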
SeCold Project: A Linked Data Platform for Mining Software Repositories




1- Vocabulary Set
(aka Schema, Data Model, Ontology)


Source Code Ecosystem Ontology Family (SECON)
SOCON, VERON, METON, ISSUEON, LICENSON, CLON




SeCold Project: A Linked Data Platform for Mining Software Repositories




2- URL/ID Generation Schema
A URL for each fact (e.g. a variable definition statement)
http://aseg.cs.concordia.ca/secold/page/type/java/DatasetChangeInfo

Integration Challenge
Several ways to generate URLs (e.g. random)
REPRODUCIBLE IDENTIFIERS
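
The contrast between random and reproducible identifiers can be sketched in Python. The slide does not say how SeCold derives its URLs, so hashing a fact's stable attributes is only one assumed approach; the base URL, the function names, and the attribute choice (project, path, kind, name) are all hypothetical.

```python
import hashlib
import uuid

BASE = "http://example.org/secold-sketch/"  # hypothetical base URL


def random_url():
    """Random identifier: differs on every run, so reruns cannot be aligned."""
    return BASE + uuid.uuid4().hex


def reproducible_url(project, path, kind, name):
    """Deterministic identifier derived only from stable attributes of a fact."""
    key = "|".join([project, path, kind, name])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return BASE + digest


# Two independent runs over the same fact produce the same URL.
a = reproducible_url("foo", "src/Main.java", "variable", "count")
b = reproducible_url("foo", "src/Main.java", "variable", "count")
print(a == b)
```

With a deterministic scheme, two parties that extract the same fact independently arrive at the same URL, so their datasets link up without a separate alignment step.
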



SeCold Project: A Linked Data Platform for Mining Software Repositories



3- Baseline Data Publication
General Information (    ~2,000,000 triples)
Source Code         (~2,000,000,000 triples)
Issue Tracker       ( ~30,000,000 triples)
Version Control     ( ~700,000,000 triples)



                      ~1 MILLION PROJECTS

SeCold
Linked Data Cloud (LOD)

SeCold: among the 9 largest datasets in the cloud

[Figure: Linking Open Data cloud diagram; labelled domains include Media, Publication, Government, Life Science]

Circle size   Triple count
Very large    >1B
Large         1B-10M
Medium        10M-500k
Small         500k-10k
Very small    <10k

[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]

secold.org




Showcase #1 (Similar Code Search)




Showcase #2 – Part 1 (Copyright violation detection)

Input: source code of 25K projects

SeClone [SeClone … ICPC’11 & WCRE’11]
  Internal Data -> Analysis Process -> Output -> Upload
  Output: line-level fingerprints, clones (Type 1, 2 and 3)

Ninka [A sentence-matching …, ASE’10]
  Internal Data -> Analysis Process -> Output -> Upload
  Output: license per file

Showcase #2 – Part 2 (Copyright violation detection)

Uploaded outputs:
  SeClone [… ICPC’11 & WCRE’11]: line-level fingerprints, clones (Type 1, 2 and 3)
  Ninka [… -matching …, ASE’10]: license per file

Copyright violation detection:

select ?fileA ?fileB where {
  ?fileA testxi ?fingerprint .
  ?fileB testxi ?fingerprint .
  ?fileA hasLicense ?la .
  ?fileB hasLicense ?lb .
  Filter (?la != ?lb) }

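The join expressed by the copyright violation query can be sketched in plain Python over in-memory triples. This is only an illustration of the query's logic, not SeCold's implementation: the file names, fingerprint values, and licenses are made up, and only the predicate names `testxi` and `hasLicense` come from the slide.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical triples; file names, fingerprints and licenses are made up.
triples = [
    ("a/Util.java", "testxi", "fp1"),
    ("b/Helper.java", "testxi", "fp1"),
    ("c/Other.java", "testxi", "fp2"),
    ("a/Util.java", "hasLicense", "GPL 2"),
    ("b/Helper.java", "hasLicense", "Apache 2"),
    ("c/Other.java", "hasLicense", "GPL 2"),
]


def find_violation_candidates(triples):
    """Pairs of files that share a fingerprint but carry different licenses."""
    files_by_fingerprint = defaultdict(set)
    licenses_by_file = defaultdict(set)
    for s, p, o in triples:
        if p == "testxi":
            files_by_fingerprint[o].add(s)
        elif p == "hasLicense":
            licenses_by_file[s].add(o)

    pairs = set()
    for files in files_by_fingerprint.values():
        for file_a, file_b in combinations(sorted(files), 2):
            # Mirrors Filter (?la != ?lb): any differing license pair matches.
            if any(la != lb
                   for la in licenses_by_file[file_a]
                   for lb in licenses_by_file[file_b]):
                pairs.add((file_a, file_b))
    return sorted(pairs)


candidates = find_violation_candidates(triples)
print(candidates)
```

Unlike the query, which would return each match in both orders (A, B) and (B, A), this sketch reports each unordered pair once. As in the query, files without a license triple never match.
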
Showcase #3 (Statistical Analysis)

License               2009       2012
No License            46%        42%
All Rights Reserved   13%        14%
GPL 2                 12%        17%
Apache 2              9.70%      9%
LGPL 2.1              8.80%      12%
BSD                   3%         3%
Mozilla PL 1.0        2.60%      1%
MIT                   0.92%      0%
Apache 1              0.65%      0%
Other                 0.00568    1%
BSD (second label)    0.27%      -
Mozilla PL 1.1        0.13%      0%
PHP                   0.08%      0%
Sleepycat             0.06%      0%
Artistic              0.02%      0%
Nokos                 0.01%      0%
Shareware             0.00%      0%
Patented              0%         0%


Editor's Notes

  • #4, #5 abstraction
  • #6, #7, #16 to #19 The idea is sharing: to avoid repeating work, to speed up the analysis process, decrease cost and ease the research. But the first question is: sharing what?
  • #8, #10 http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Ftitle%0D%0AWHERE+{%0D%0A++++%3Fgame+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3AFirst-person_shooters%3E+.%0D%0A++++%3Fgame+foaf%3Aname+%3Ftitle+.%0D%0A}%0D%0Alimit+3&debug=on&timeout=&format=text%2Fhtml&save=display&fname=
  • #11 to #15 What does it have to offer? How is it different from XML, DBs, …