Results of the FLOSSMetrics project, and some related tools. Presentation at the master's program on development and management of free software projects (Vigo, Spain).
FLOSSMETRICS: The main objective of FLOSSMETRICS is to construct, publish and analyse a large-scale database with information and metrics about libre software development coming from several thousands of software projects, using existing methodologies and tools already developed. The project will also provide a public platform for validation and industrial exploitation of results.
Reference Representation in Large Metamodel-based Datasets (Markus Scheidgen)
This document discusses different model representations for large meta-model based datasets. It compares object-by-object representation to fragmentation strategies. Fragmentation breaks models into multiple fragments stored separately. The document evaluates different fragmentation strategies through theoretical analysis and implementation tests. It also compares part-of-source and relational representations and discusses applications of model fragmentation including software engineering and scientific data analysis.
Model-based Analysis of Large Scale Software Repositories (Markus Scheidgen)
1) The document discusses a model-based framework for analyzing large scale software repositories. It involves reverse engineering software from version control systems to create abstract syntax tree models, applying transformations and queries to derive metrics and insights, and using Scala for flexible queries and transformations.
2) Two example analyses are described: calculating design structure matrices and propagation costs, and detecting cross-cutting concerns by analyzing co-changed methods within commits.
3) The goal is to enable scalable, language-independent analysis of ultra-large repositories through model-based techniques instead of analyzing raw code directly. This allows abstracting different languages and repositories with common models and analyses.
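The propagation cost mentioned in the second example has a compact formulation: it is the density of the transitive closure (the "visibility matrix") of the design structure matrix. Below is a minimal sketch, assuming a square 0/1 dependency matrix and counting self-visibility, as in MacCormack-style analyses; the presentation's exact conventions may differ.

```python
def propagation_cost(dsm):
    """Density of the transitive closure ("visibility matrix") of a
    dependency structure matrix; dsm[i][j] == 1 means module i depends
    on module j."""
    n = len(dsm)
    reach = [row[:] for row in dsm]
    for i in range(n):
        reach[i][i] = 1  # assumption: each module is visible to itself
    # Warshall's algorithm for the transitive closure
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1
    return sum(map(sum, reach)) / (n * n)
```

For a three-module dependency chain 0 -> 1 -> 2, six of the nine module pairs (including self-pairs) are visible, giving a propagation cost of 6/9.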
This document describes a C++ sample program called Dictls32.exe that utilizes a File I/O API to display dictionary information from SPSS data files. It provides instructions on building the sample program using Microsoft Visual C++ and includes details on the constituent source code files that make up the program.
Creating and Analyzing Source Code Repository Models - A Model-based Approach... (Markus Scheidgen)
With mining software repositories (MSR), we analyze the rich data created during the whole evolution of one or more software projects. One major obstacle in MSR is the heterogeneity and complexity of source code as a data source. With model-based technology in general and reverse engineering in particular, we can use abstraction to overcome this obstacle. But, this raises a new question: can we apply existing reverse engineering frameworks that were designed to create models from a single revision of a software system to analyze all revisions of such a system at once? This paper presents a framework that uses a combination of EMF, the reverse engineering framework Modisco, a NoSQL-based model persistence framework, and OCL-like expressions to create and analyze fully resolved AST-level model representations of whole source code repositories. We evaluated the feasibility of this approach with a series of experiments on the Eclipse code-base.
Metamodeling vs Metaprogramming, A Case Study on Developing Client Libraries ... (Markus Scheidgen)
This document discusses metamodeling and metaprogramming approaches for developing client libraries for REST APIs. It presents a metamodel for describing REST APIs and shows how active annotations in the Xtend language can be used as an alternative internal DSL (domain-specific language) to generate code for a REST client library from descriptions of API requests and resource types. Active annotations are processed by custom compilers and can generate platform-specific code while providing static type safety. Both metamodeling and metaprogramming through active annotations are presented as model-driven approaches for developing web applications and REST client code.
A MEDIA SHARING PLATFORM BUILT WITH OPEN SOURCE SOFTWARE (vrt-medialab)
Sharing and handling media files in a professional context often requires expensive software packages. For the EBU P/SCAIE project, a platform was required that could handle an abundance of professional file formats and arbitrarily large file sizes, and that did not impose restrictions on the metadata format used. As no such software was available, we decided to build a custom web-based platform based on loosely coupled open source components.
This paper explains the architecture of the resulting platform. With a minimum of custom code, we have created a powerful platform that meets our requirements. This integration, described in this paper, is of use to organizations wishing to build their own media platform using open source components.
Open Source project failure often stems from not setting clear objectives or having a shared vision from the start. That said, there are many success stories, including two well-known statistical examples: Demetra and the Eurostat SDMX tools (SDMX-RI). However, in all these examples there was at first a founding organisation or entity that created the right environment for a successful path into a new paradigm. In the context of my presentation, this is the Statistical Information System Collaboration Community (SIS-CC, http://siscc.oecd.org).
Presented at the International Marketing and Output DataBase Conference, Gozd Martuljek, September 18 - 22, 2016.
The document discusses ideas for new integrated development environments (IDEs) that actively include software change history and evolution analysis techniques. It describes harvesting data from change histories, release histories, dependencies and other sources to provide visualizations of developer networks, code evolution metrics, and other insights. The document argues that modern IDEs could integrate these kinds of historical analyses and visualizations to help developers, project managers, and newcomers better understand software structure and changes over time.
The document discusses some of the promises and perils of mining software repositories like Git and GitHub for research purposes. It notes that while these sources contain rich data on software development, there are also challenges to consider. For example, decentralized version control systems like Git allow private collaboration that may be missed. And most GitHub projects are personal and inactive, while it is also used for storage and hosting. The document recommends researchers approach these data sources carefully and provides lessons on how to properly analyze and interpret the data from repositories like Git and GitHub.
The document discusses forensic tools for Mac OS and Linux operating systems. It compares tools like Network Miner and NMAP to Hex Workshop and validates them through software verification and validation. It also demonstrates exploring different file formats like Excel, Word, JPEG, GIF and MOV in the hex editor HxD.
Marios Chatziangelou presents the EGI applications database | OSFair2017 Workshop
Workshop overview:
This collaborative workshop comes in the context of coordinating EOSC related activities across large European infrastructures at European and national level. The workshop will offer an opportunity for cross-pollination on issues ranging from open scholarship to technical service provision, training, community engagement and support. OpenAIRE NOADs, EGI NGIs, GEANT NRENs and other national e-Infrastructure representatives will discuss gaps, synergies, coordination and service integration opportunities.
DAY 3 - PARALLEL SESSION 6 & 7
Jagrat Mankad has over 15 years of experience as a software developer and principal. He has expertise in C#, C/C++, Java, Python and SQL programming languages. Some of the applications he has worked on include an idea tracking tool called IdeaWorks, a test automation framework in Python, and several tools to aid in testing central systems. He is skilled in all phases of the software development life cycle from requirements gathering to delivery. He has a bachelor's degree in computer engineering and holds a US patent.
This document provides an overview of the technologies used to develop the graphical user interface for the Tell Me Quality tool. It describes Node.js, D3.js, Git, JSON, Mustache, REST, and SHACL which were used for the frontend and backend. The frontend allows users to upload files and configuration data, view available measurements, visualize data, and see detailed results. The backend performs tasks like uploading files, configuring shape files, selecting measurements, and exporting results. An example usage is also provided.
Find Out What's New With WhiteSource May 2018 - A WhiteSource Webinar (WhiteSource)
In our latest webinar, we covered our latest product updates here at WhiteSource. We unveiled our new, revolutionary technology and highlighted other cool releases and enhancements.
Community based software development: The GRASS GIS project (Markus Neteler)
The document summarizes the GRASS GIS open source project. It discusses the project's objectives of developing free GIS software and algorithms. It describes the international development team and communication structures used, including mailing lists, wikis and bug trackers. Legal aspects of code contributions and licensing are also briefly covered.
Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery; Steve Hughes, NASA; Data Publication Repositories
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011 Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
The global need to securely derive (instant) insights has motivated data architectures ranging from distributed storage to data lakes, data warehouses and lakehouses. In this talk we describe Tag.bio, a next-generation data mesh platform that embeds vital elements such as domain centricity and ownership, data as products, and a self-serve architecture with a federated computational layer. Tag.bio combines data sets, smart APIs, and statistical and machine learning algorithms into decentralized data products that let users discover insights following the FAIR Principles. Researchers can use its point-and-click (no-code) system to instantly perform analyses and share versioned, reproducible results. The platform combines a dynamic cohort builder with analysis protocols and applications (low-code) to drive complex analysis workflows. Applications within data products are fully customizable via R and Python plugins (pro-code), and the platform supports notebook-based developer environments with individual workspaces.
Join us for a talk and demo session on the Tag.bio data mesh platform and learn how major pharmaceutical companies and university health systems are using this technology to promote value-based healthcare and precision healthcare, find cures for diseases, and foster collaboration (without explicitly moving data around). The talk also outlines Tag.bio's secure data exchange features for real-world evidence datasets, privacy-centric data products (confidential computing), and integration with cloud services.
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source (RCCSRENKEI)
The Extreme-scale Scientific Software Stack (E4S) provides a curated collection of open source software for exascale computing. It is based on the Spack package manager and delivers pre-built binaries and container images for popular ECP software like MPI, OpenMP, math libraries, I/O libraries and more. The E4S aims to ensure the interoperability of these software packages and their portability across different computer architectures. It provides a standardized way to install and deploy exascale software and helps application developers integrate these tools into their own projects.
The Microsoft Biology Foundation (MBF) is an open-source library of bioinformatics algorithms and services built on .NET. MBF provides modular and reusable code for tasks like genomics, sequencing, and analysis. It leverages existing Microsoft technologies and allows distribution of computations across platforms from local to cloud. The first version was released in June 2010. MBF is developed openly on CodePlex and aims to benefit both commercial and non-commercial users.
Chipster is user-friendly microarray analysis software that provides an intuitive graphical interface and a comprehensive selection of analysis tools and visualizations. It uses a distributed system architecture, implemented with Java and messaging, to connect to external analysis tools via its graphical user interface. Chipster combines a data-centric and a workflow-oriented paradigm by allowing users to record and save analyses as reusable scripts.
The document provides an overview of open source software, its history and uses in libraries. It discusses evaluating open source solutions and factors to consider such as community support, total cost of ownership, and technical requirements. Resources for finding and evaluating open source software are also listed.
Free/Open Source Software for Science & Engineering (Kinshuk Sunil)
This document discusses free and open source software applications that are useful for science and engineering. It provides information on the GNU operating system, popular GNU/Linux distributions, benefits of open source software including increased quality and accelerated development. It then describes several key free software applications and libraries useful for scientific computation, data analysis and visualization, including Python, NumPy/SciPy, R, LaTeX, GSL, Octave, OpenDX, and SciDAVis.
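As a small, self-contained taste of the NumPy-based data analysis mentioned above (the data here are invented for illustration), the following sketch fits a linear trend to a series of measurements:

```python
import numpy as np

# Toy measurements that follow y = 2x + 1 exactly.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Least-squares fit of a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(x, y, 1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On this exact data the fit recovers a slope of 2.00 and an intercept of 1.00.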
The document provides an overview of visual programming environments for science and business, using KNIME and Pipeline Pilot as examples. It defines visual programming as dragging functional components onto a canvas to create a program by configuring components and connecting them to route data. The document demonstrates creating a report of compounds registered in January using these environments and discusses the types of components, applications, and benefits of server deployment that they enable.
Software Analytics: Towards Software Mining that Matters (2014) (Tao Xie)
This document discusses software analytics and summarizes several related papers and projects. It introduces Software Analytics, which aims to enable software practitioners to perform data exploration and analysis to obtain useful insights. It then summarizes papers on techniques for performance debugging by mining stack traces, scalable code clone analysis, incident management for online services, and using games to teach programming.
Software Analytics: Data Analytics for Software Engineering (Tao Xie)
This document summarizes a presentation on software analytics and its achievements and opportunities. It begins by noting that both software itself and the way it is built and operated are changing, with data becoming more pervasive and development more distributed. It then defines software analytics as enabling analysis of software data to obtain insights and make informed decisions. It outlines research topics covering different areas of the software domain throughout the development cycle, describes its target audience of software practitioners, and describes its outputs of insightful and actionable information. Selected projects demonstrating software analytics are then summarized, including StackMine for performance debugging at scale, XIAO for scalable code clone analysis, and others.
This document describes a generic open source framework for automatically generating data manipulation commands. The framework allows for data manipulation commands to be generated for a variety of web servers, web browsers, and database management systems. Key aspects of the framework include:
- Storing configuration information like the backend database, technology, and validation details in XML files.
- Parsing the XML files to dynamically generate code, setup runtime environments, and create mapping files.
- Supporting technologies like JSP, Servlets, PHP, databases like MySQL and MS Access, and servers like Tomcat and JBoss.
- Automatically generating insert, update, delete commands by mapping HTML form controls to database columns.
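To make the last bullet concrete, here is a minimal, hypothetical sketch of the mapping idea: parse an XML description that ties HTML form controls to database columns, then emit a parameterized INSERT statement. All element and attribute names are invented for illustration; the framework's actual XML schema is not given in the summary.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: each <field> maps an HTML form
# control to a database column of the target table.
CONFIG = """
<form table="users">
  <field control="txtName" column="name"/>
  <field control="txtEmail" column="email"/>
</form>
"""

def generate_insert(config_xml):
    """Build a parameterized INSERT statement from the control-to-column mapping."""
    form = ET.fromstring(config_xml)
    columns = [field.get("column") for field in form.findall("field")]
    placeholders = ", ".join(["%s"] * len(columns))
    return (f"INSERT INTO {form.get('table')} "
            f"({', '.join(columns)}) VALUES ({placeholders})")
```

Here `generate_insert(CONFIG)` yields `INSERT INTO users (name, email) VALUES (%s, %s)`, which a data-access layer could then execute with the submitted form values.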
Similar to Results of the FLOSSMetrics project (20)
1. FLOSSMetrics: data about libre software development
Jesus M. Gonzalez-Barahona
(GSyC/LibreSoft, URJC)
http://identi.ca/jgbarah http://twitter.com/jgbarah
jgb@gsyc.es
Master’s Program on Development and
Management of Free Software Projects
(Open Session) Vigo, Spain, March 17th 2011
3.
© 2006-2011 GSyC/LibreSoft
Some rights reserved. This document is distributed under the
Creative Commons Attribution-ShareAlike 3.0 licence, available
at http://creativecommons.org/licenses/by-sa/3.0/
© GSyC/LibreSoft
4. FLOSSMetrics: base ideas
Libre software development:
Lots of opinions, few known facts
Researcher-friendly: public data, reproducibility, validation of
results, large samples
Interest by volunteers and companies
Main questions:
Can libre software development be improved?
Can software engineering learn from libre software?
Can projects better understand their processes and products?
http://flossmetrics.org
© GSyC/LibreSoft FLOSSMetrics: data about libre software development
5. FLOSSMETRICS goals
Retrieval of data from (thousands of) libre software projects
Analysis of actors, artefacts and processes involved in development
Higher-level studies: software evolution, human resources, effort estimation, productivity, quality, etc.
Database available to other researchers, developers, etc.
Providing tools for development follow-up
Involvement with the libre software community
http://flossmetrics.org
6. Main results
Huge database with factual details about libre software development (accessible to everyone)
Higher-level analyses and studies
Sustainable platform for benchmarking and analysis
Targeted reports (SMEs, industry, etc.)
Focus on providing data and information that others can use
for research, evaluation, follow-up
http://flossmetrics.org
7. Partners
Universidad Rey Juan Carlos (ES)
University of Maastricht (NL)
Wirtschaftsuniversität Wien (AT)
Aristotle University of Thessaloniki (GR)
Conecta s.r.l (IT)
ZEA Partners (BE)
Philips Medical Systems (NL)
Project funded by the European Commission
(FP6-IST programme)
8. Final results (2011)
Full MySQL dumps for about 2,800 projects:
• CVS, Subversion, git commit records (~2,000)
• Metrics (size, complexity) for source code (~900)
• Mailing lists main headers (~400)
• Issue tracking systems (bug reports, etc.) (~750)
Focused report on SMEs, and focused studies
Aggregated database (all projects together, separate dumps for SCM, mailing lists, issue tracking systems)
Melquiades: http://melquiades.flossmetrics.org
API for accessing the data
9. Final results (2011) (cont.)
10. Final results (2011) (cont.)
11. Final results (2011) (cont.)
12. Tools
LibreSoft Tools suite
• CVSAnalY already works with CVS, SVN, git
• CVSAnalY produces complexity metrics for each release of each file (C, C++, Java, Python)
• Bicho: bug reports from SourceForge (Bugzilla coming soon)
• MLStats: mailing lists, hiding real email addresses
All of them integrated in Melquiades
http://melquiades.flossmetrics.org
http://forge.morfeo-project.org/projects/libresoft-tools/
13. Kinds of studies performed
Evolution of projects (code, communities, responsiveness, etc.)
Detection of deviations (disruptions in bug fixing practices)
Human resources (what happens when the core team changes?)
Effort and value estimation (how much time was put into a project?)
Parameters to characterize the status of a project
Sustainability conditions
But the imagination is the limit!
http://flossmetrics.org
14. CVSAnalY 2.0 features
Browses an SCM repository producing a database with:
• All metainformation (commit records, etc.)
• Metrics for each release of each file
Also produces some tables suitable for specific analysis
Multiple SCMs: CVS, svn, git
Whole history in the database: it is possible to rebuild the file tree for any revision
Tags and branches support
Option to save the log to a file while parsing
Extensions system, incremental capabilities
Multiple database system support (MySQL and SQLite)
15. CVSAnalY 2.0 Extensions
Extension: a “plugin” for CVSAnalY
Adds information to the database, based on the information already in the database and possibly the repository
Usually: new tables for specific studies
Simple example: commits per month per committer
Extensions add one or more tables to the database, but they never modify the existing ones
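The “commits per month per committer” example above can be sketched end to end with SQLite (one of the database backends CVSAnalY supports). The schema here is a deliberately simplified stand-in for CVSAnalY's real scmlog table, not its actual layout:

```python
# Toy sketch of the "commits per month per committer" extension idea:
# derive a new table from scmlog without modifying the existing tables.
# The scmlog schema below is a simplified stand-in, not CVSAnalY's real one.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scmlog (id INTEGER, committer_id INTEGER, date TEXT)")
db.executemany("INSERT INTO scmlog VALUES (?, ?, ?)", [
    (1, 1, "2009-01-10"), (2, 1, "2009-01-20"),
    (3, 2, "2009-01-05"), (4, 1, "2009-02-01"),
])

# The extension adds a new table; it never alters scmlog itself.
db.execute("""CREATE TABLE commits_month AS
              SELECT strftime('%Y-%m', date) AS month,
                     committer_id, COUNT(id) AS commits
              FROM scmlog GROUP BY month, committer_id""")

rows = db.execute(
    "SELECT month, committer_id, commits FROM commits_month "
    "ORDER BY month, committer_id").fetchall()
print(rows)
```

The same pattern, run incrementally, is what keeps extension tables in sync as new commits are fetched.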
16. Some CVSAnalY 2.0 Extensions
FileTypes: adds a table containing information about the type of every file in the database (code, documentation, i18n, etc.)
Metrics: analyzes every revision of every file, calculating metrics such as SLOC and complexity metrics (McCabe, Halstead). It currently supports C/C++, Python, Java and Ada.
CommitsLOC: adds a new table with the total lines added/removed for every commit
17. MailingListStats
Parses mbox information (RFC 822)
Stores results (headers, body) in a MySQL database:
• Sender, CCs, etc.
• Time / Date
• Subject
• ...
Incremental
Can store multiple projects in a single database
Also provides some stats at the end
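The core of this parsing can be sketched with the Python standard library alone; the table and column names below are illustrative, not MailingListStats' actual schema:

```python
# Minimal sketch of mbox (RFC 822) parsing in the spirit of MailingListStats.
# The messages table and its columns are illustrative, not the tool's schema.
import mailbox
import os
import sqlite3
import tempfile

# Build a tiny mbox file with one RFC 822 message.
mbox_path = os.path.join(tempfile.mkdtemp(), "list.mbox")
with open(mbox_path, "w") as f:
    f.write("From dev@example.org Thu Mar 17 10:00:00 2011\n"
            "From: dev@example.org\n"
            "Subject: [patch] fix build\n"
            "Date: Thu, 17 Mar 2011 10:00:00 +0100\n"
            "\n"
            "Body of the message.\n")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (sender TEXT, subject TEXT, date TEXT)")

# mailbox.mbox handles the envelope "From " lines and header parsing.
for msg in mailbox.mbox(mbox_path):
    db.execute("INSERT INTO messages VALUES (?, ?, ?)",
               (msg["From"], msg["Subject"], msg["Date"]))

rows = db.execute("SELECT sender, subject FROM messages").fetchall()
print(rows)
```

Because every message carries its own headers, re-running the parser over a grown mbox and skipping already-stored messages is what makes the tool incremental.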
18. Bicho
Aimed at parsing issue tracking systems.
Results stored in a MySQL database.
Information about each issue (bug report), and its modifications.
Currently supports:
• SourceForge (HTML parsing, old interface, adapting to the new one)
• Bugzilla: GNOME, KDE, adapting to others
19. Demo: queries on the aggregated SCM database
Aggregated SCM database of June 2nd 2009
518 projects
2,987,961 file revisions
2,440,127 commit records
20,583 committers
zcat fm3_aggregatedb_scm_20090602T12:06:00.sql.gz |
mysql --user=jgb --password=XXXXX fm3_aggregatedb_scm
SELECT COUNT(*) FROM projects
SELECT COUNT(*) FROM files
SELECT count(*) FROM scmlog
SELECT count(*) FROM people
20. Demo: some queries
Total number of SLOC for all revisions of all files
Total number of SLOC and file revisions for C files
Total number of file revisions and files for C files
SELECT SUM(sloc) FROM metrics
SELECT SUM(sloc), COUNT(file_id) FROM metrics WHERE lang="ansic"
SELECT COUNT(file_id), COUNT(DISTINCT(file_id)) FROM metrics
WHERE lang="ansic"
21. Demo: some queries (2)
Number of commits by committer and quarter
Same, for those committers with more than 10 commits per quarter, ordered by time
select year(date), quarter(date), committer_id, count(id)
from scmlog group by year(date), quarter(date), committer_id
select year(date), quarter(date), committer_id, count(id)
from scmlog group by year(date), quarter(date), committer_id
having count(id) > 10
order by year(date), quarter(date), count(id) desc
22. Demo: some queries (3)
Commits by committer and quarter for Evolution
Commits per quarter for Konqueror
select year(date), quarter(date), committer_id, count(id)
from scmlog
where project_id = 119
group by year(date), quarter(date), committer_id
order by year(date), quarter(date), count(id) desc
select year(date), quarter(date), count(id)
from scmlog
where project_id = 237
group by year(date), quarter(date)
order by year(date), quarter(date) desc
23. Demo: some queries (4)
Create a “history” view for Evolution
Commits per committer and quarter
create view history (year, quarter, committer_id, commits)
as select year(date), quarter(date), committer_id, count(id)
from scmlog
where project_id = 119
group by year(date), quarter(date), committer_id
order by year(date), quarter(date), count(id) desc
select * from history
24. Demo: some queries (5)
On the “history” view for Evolution
Commits per committer and quarter for committers active in 2001
Commits per quarter for committers active in 2001
select * from history
where committer_id in
(select committer_id from history where year=2001)
order by year, quarter, commits
select year, quarter, sum(commits) from history
where committer_id in
(select committer_id from history where year=2001)
group by year, quarter
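The view-plus-subquery pattern from this demo can be reproduced on a toy scmlog table with SQLite; the rows and dates below are invented for illustration, and project id 119 is kept only to mirror the slides:

```python
# Sketch of the "history" view pattern from the demo, on invented toy data.
# The scmlog schema is a simplified stand-in for the aggregated database's.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scmlog (id INTEGER, project_id INTEGER,"
           " committer_id INTEGER, date TEXT)")
db.executemany("INSERT INTO scmlog VALUES (?, ?, ?, ?)", [
    (1, 119, 1, "2001-02-01"), (2, 119, 1, "2001-05-01"),
    (3, 119, 2, "2002-03-01"), (4, 119, 1, "2002-03-02"),
])

# View: commits per committer and quarter for one project.
# SQLite has no QUARTER(), so the quarter is derived from the month.
db.execute("""CREATE VIEW history AS
              SELECT CAST(strftime('%Y', date) AS INTEGER) AS year,
                     (CAST(strftime('%m', date) AS INTEGER) + 2) / 3 AS quarter,
                     committer_id, COUNT(id) AS commits
              FROM scmlog WHERE project_id = 119
              GROUP BY year, quarter, committer_id""")

# Commits per quarter for committers already active in 2001.
rows = db.execute("""SELECT year, quarter, SUM(commits) FROM history
                     WHERE committer_id IN
                       (SELECT committer_id FROM history WHERE year = 2001)
                     GROUP BY year, quarter
                     ORDER BY year, quarter""").fetchall()
print(rows)
```

Defining the view once keeps the later per-cohort queries short, exactly as in the MySQL demo.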
25. Retrieving information: general problems
Diversity:
Kinds of forges: difficult to automate
Kinds of projects: not all projects in SourceForge are relevant
Sources for the same project: forge(s), distributions...
Missing information:
Hidden information (e.g., mail headers)
Lost information (e.g., transitions from CVS to SVN)
Bugs and errors (e.g., old locks in SCM)
Stress on projects' infrastructure!
26. Retrieving information: SCM problems
Different systems (CVS, Subversion, git, Bazaar, Mercurial, etc.)
Different models (file-based, commit-based, distributed)
Bots performing commits
Large transitions don’t preserve information
Performance issues (systems poorly designed for massive retrieval)
But at least we have facilities for incremental retrieval
27. Retrieving information: BTS problems
Different systems (Bugzilla, SourceForge, GForge, trac, Launchpad, etc.)
Different models (bug cycle, bug report parameters, etc.)
Different uses (issue tracker, only bugs, scheduler, etc.)
Bots acting on bug reports
Lack of facilities for incremental retrieval
Performance issues (systems not really designed for massive retrieval)
28. Retrieving information: Mailing lists problems
Different systems (usually accessible only through HTML)
Partial information (missing headers)
Bots sending email (e.g., commit messages)
Spam (mixed with real messages)
But email messages are pretty uniform in format
29. Retrieving information: All together
How to track actors and products:
• Different repositories of the same project
• Different projects
SourceForge helps a bit (?)
Massive information (when dealing with thousands of projects)
Exchange formats (for third parties and reproduction)
Tracking information (where did this commit record come from?):
• Repositories change
• Retrieval tools change
• Errors do occur
30. Interested?
Detailed description of work available from the website
All the software used is libre software
Tell us about your pet project; we can analyze it
Interested in knowing how this is useful for you: provide feedback about your interests and needs
Open to contributions: plug-ins for new analysis tools
Willing to collaborate with projects
Very interested in feedback from researchers!!
http://flossmetrics.org
http://forge.morfeo-project.org/projects/libresoft-tools