The document discusses a study examining whether core teams of GitHub projects follow the Pareto principle, which states that 80% of consequences come from 20% of causes. The study collected and analyzed data from over 8.5 million GitHub repositories to identify core team members and their activities. It found that more than half of projects did not follow the Pareto principle and most projects had 15 or fewer core developers. There were no major differences found between the activities of core and non-core developers.
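As a rough illustration of the check at the heart of the study, the sketch below (with made-up commit counts for a hypothetical project) tests whether the top 20% of committers author at least 80% of the commits:

```python
# Pareto-principle check: do the top 20% of committers account for
# at least 80% of a project's commits? (Illustrative sketch; the
# commit counts below are made up.)

def follows_pareto(commits_per_dev, dev_share=0.2, work_share=0.8):
    counts = sorted(commits_per_dev.values(), reverse=True)
    n_core = max(1, round(len(counts) * dev_share))
    core_commits = sum(counts[:n_core])
    return core_commits / sum(counts) >= work_share

repo = {"alice": 420, "bob": 310, "carol": 45, "dave": 30, "erin": 12}
print(follows_pareto(repo))  # False: the top 20% author ~51%, not 80%
```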
This document discusses Agile, DevOps, and their implementation at USPTO. It provides background on Agile being a lightweight framework based on the Agile Manifesto. DevOps aims to improve collaboration between development and operations teams through practices like automation. USPTO adopted DevOps to enable continuous rapid development through continuous rapid deployment, overcoming barriers of legacy production processes. The document outlines USPTO's DevOps journey, including adopting practices like a deployment pipeline and production monitoring. It also discusses top challenges to DevOps adoption like fear of failure and bureaucracy, and how to start small and show value to gain support.
Getting Started With Selenium at Shutterstock (Sauce Labs)
The document discusses getting started with Selenium automation at Shutterstock. It outlines Shutterstock's goals of building an enterprise-strength automation platform quickly to catch bugs earlier. Shutterstock chose Selenium due to its open source nature and partnered with an automation company to meet its goals rapidly during a period of growth. After several months, Shutterstock has developed over 600 automated test cases that run in under 2.5 hours and aims to continue expanding its automation program.
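For readers unfamiliar with Selenium, a minimal WebDriver test in Python might look like the following (Selenium 4 API; the page and assertion are illustrative, not Shutterstock's actual test code):

```python
# Minimal Selenium WebDriver check (Selenium 4 API).
# The page URL and assertion below are illustrative only.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires a local Chrome installation
try:
    driver.get("https://example.com")
    heading = driver.find_element(By.TAG_NAME, "h1")
    assert "Example Domain" in heading.text, "unexpected page heading"
finally:
    driver.quit()
```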
The document summarizes the top takeaways from the AGILE2017 conference. It discusses trends seen at the conference around topics like leadership, expanding agile practices beyond engineering, whole team involvement in UX, containerized microservices enabling NoOps, the value of ATDD/BDD, and re-teaming of teams. It also covers sessions on estimating time/cost using statistical techniques and the ongoing debate around #NoEstimates.
Community and Code: Lessons from NESCent Hackathons (Arlin Stoltzfus)
Hackathons are an explosive trend, but why? What makes them work? What do they accomplish? How do I organize a hackathon for maximum effectiveness? In spite of the popularity of hackathons, there has been very little systematic research into what makes them valued and successful. This slide deck provides an overview of conclusions drawn from studying a series of well-documented hackathons sponsored by the National Evolutionary Synthesis Center from 2006 to 2015. For more online resources, see https://nescent.github.io/community-and-code/.
This document discusses choosing an agile methodology for software development projects. It provides an overview of various agile methodologies like Scrum, Kanban, Scrumban and SAFe. It emphasizes that there is no single best methodology and that factors like team size, requirements uncertainty, backlog size and maintenance needs should be considered. The document recommends establishing a project evaluation committee to help organizations select the most appropriate methodology based on these factors to improve project success rates.
The document discusses using supercomputers like the FX10 to mine large software datasets in mining software repositories (MSR) research. It describes a case study in which code clone detection was performed on the Apache CXF project using two desktop computers and the FX10 supercomputer. The FX10 was significantly faster, finishing the task in 42 seconds compared to over 2 hours for the desktops. Challenges of scaling MSR analysis to very large datasets, such as the entire UCI repository, are also discussed.
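To see why this workload suits a supercomputer, note that clone detection shards cleanly by file. The toy sketch below hashes k-line windows of normalized code and reports repeats; real clone detectors are far more sophisticated, but the per-file indexing step parallelizes the same way:

```python
# Toy clone detector: hash k-line windows of normalized code and
# report windows that occur more than once. Real tools (token-based
# detectors, etc.) are far more sophisticated.
from collections import defaultdict

def find_clones(files, k=5):
    index = defaultdict(list)  # window hash -> [(path, start_line)]
    for path, text in files.items():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        for i in range(len(lines) - k + 1):
            key = hash("\n".join(lines[i:i + k]))
            index[key].append((path, i + 1))
    return [locs for locs in index.values() if len(locs) > 1]

# Each file is indexed independently, so the work can be spread
# across the many cores of a machine like the FX10.
```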
This document discusses research on improving bug prediction models by accounting for development effort. It finds that process metrics remain better predictors than product metrics in effort-aware models. While previous research found package-level predictions more effective than file-level ones, this study finds that file-level predictions are more effective once effort is accounted for. Three approaches to producing package-level predictions were compared: directly using package metrics, lifting file metrics to the package level, and lifting file predictions to the package level; lifting file predictions to the package level yielded the best performance.
A Study of the Quality-Impacting Practices of Modern Code Review at Sony Mobile (SAIL_QU)
Sony Mobile uses code review tools like Gerrit to facilitate code reviews of commits. The study found that components with a higher ratio of third-party code and those where developers frequently self-approved or self-verified their own code without peer review were more defect-prone. Additionally, components with high rates of code patches after initial approval tended to be less defect-prone. Qualitative interviews with developers validated these findings and indicated that external code takes more time and effort to understand, third-party bias may impact self-reviews, and in-person communication improves code quality over tools alone. Sony Mobile is now discouraging self-verification, encouraging passive reviewers to participate more, and focusing QA testing on external code coverage.
Defect Prediction: Accomplishments and Future Challenges (Yasutaka Kamei)
The document discusses the accomplishments and future challenges of defect prediction in software engineering. It provides an overview of defect prediction, including leveraging data from repositories to measure source code metrics and build prediction models. Major accomplishments include increased data availability and openness, the ability to extract various metric types, and improved modeling performance. However, challenges remain such as keeping up with fast development paces and making models more accessible. The document argues that future areas of focus include defect prediction for mobile apps and integrating just-in-time models into continuous integration processes.
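A minimal sketch of the modeling pipeline described above, assuming a hypothetical CSV of mined per-module process metrics (churn, number of authors, prior fixes) with a buggy label:

```python
# Sketch: train and evaluate a defect prediction model on
# hypothetical per-module process metrics mined from a repository.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("module_metrics.csv")  # hypothetical mined dataset
features = ["churn", "num_authors", "prior_fixes"]  # process metrics
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["buggy"], test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC = {auc:.2f}")  # a common evaluation metric in defect prediction
```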
This study analyzed 10 large open source projects to understand build system maintenance effort. It found that build systems accounted for around 9% of total files on average. Build code evolved at a similar rate to source code, with some projects experiencing higher build churn. Changes to build and source code were often logically coupled, with some work items affecting both. Responsibility for build maintenance was usually distributed across developers rather than concentrated in a small team. The findings suggest build systems require significant effort to maintain and that tool support could help address this.
An Automated Approach for Recommending When to Stop Performance Tests (SAIL_QU)
Performance issues are often the cause of failures in today's large-scale software systems. These issues make performance testing essential during software maintenance. However, performance testing faces many challenges. One challenge is determining how long a performance test must run. Although performance tests often run for hours or days to uncover performance issues (e.g., memory leaks), much of the data that is generated during a performance test is repetitive. Performance analysts can stop their performance tests (to reduce the time to market and the costs of performance testing) if they know that continuing the test will not provide any new information about the system's performance. To assist performance analysts in deciding when to stop a performance test, we propose an automated approach that measures how much of the data generated during a performance test is repetitive. Our approach then provides a recommendation to stop the test when the data becomes highly repetitive and the repetitiveness has stabilized (i.e., little new information about the system's performance is being generated).
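One way to make "repetitiveness" concrete is to compare each new window of a performance counter against the history collected so far and recommend stopping once the distance stays small. The sketch below illustrates the idea only; it is not the paper's exact algorithm, and the window size and threshold are arbitrary:

```python
# Sketch: recommend stopping a performance test once new windows of a
# performance counter (e.g., response time) stop adding information.
# Illustrative only; not the paper's exact algorithm.
from scipy.stats import wasserstein_distance

def should_stop(samples, window=100, threshold=0.05, stable_for=3):
    if len(samples) < 2 * window:
        return False  # not enough data to judge repetitiveness yet
    distances = []
    for end in range(2 * window, len(samples) + 1, window):
        seen, new = samples[:end - window], samples[end - window:end]
        distances.append(wasserstein_distance(seen, new))
    # Stop when the last few windows are all close to the history.
    recent = distances[-stable_for:]
    return len(recent) == stable_for and all(d < threshold for d in recent)
```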
A Holistic Approach to Evolving Software Systems (Michele Lanza)
The document discusses a holistic approach to evolving software systems over time. It addresses how systems will scale to much larger sizes in 10-15 years, how they will be structured and evolved, and how to ensure their dependability, security, safety and reliability as they grow increasingly complex. The author advocates considering all aspects of a system's development and usage when planning how it will adapt to future changes and challenges.
An Empirical Study of Goto in C Code from GitHub Repositories (SAIL_QU)
Developers still use goto statements in practice despite arguments against them. This study analyzed over 11,000 GitHub projects and found goto statements in around 11% of C files. Goto statements were primarily used for error handling and cleanup. The study also analyzed commit histories of 6 projects and found that developers rarely remove or modify goto statements, even when fixing post-release bugs. This suggests that while goto statements have drawbacks, developers still find them useful for certain tasks like error handling.
JVM JIT compilation overview by Vladimir Ivanov (ZeroTurnaround)
The document provides an overview of JVM JIT-compilers, including:
- JIT-compilers in the HotSpot JVM dynamically compile bytecode to native machine code during program execution for improved performance compared to interpretation alone.
- JIT-compilers use profiling information gathered during execution to perform aggressive optimizations like inlining and devirtualization.
- The monitoring and debugging of JIT-compilers in the HotSpot JVM can be done using options like -XX:+PrintCompilation, -XX:+PrintInlining, and -XX:+PrintAssembly.
This document summarizes a presentation about balancing speed and quality in DevOps. It includes an agenda for guest speakers from Forrester Research and Quali discussing building a DevOps operating model, challenges of DevOps in enterprises, and Quali's cloud sandbox approach. The presentation covers topics like DevOps challenges of adopting new technologies while maintaining quality, the need for speed in development while minimizing risk, and how cloud sandboxes can help provide configurable test environments to move fast but reduce risks.
It's all about feedback - code review as a great tool in the agile toolbox (Stefan Lay)
This document discusses how code review can be a valuable tool for agile teams. It provides arguments for how code review complements pair programming by allowing for asynchronous feedback from multiple reviewers. It also describes best practices for code review, such as keeping changes small and focused. The document advocates for using Git and Gerrit to facilitate code review at scale across large projects and multiple teams. Standardization of infrastructure and processes like contributor guides are highlighted as important for collaboration.
Using Github Insight as metric for the Developer collaboration and work metri... (Najib Radzuan)
Companies usually use a Jira board to gauge sprint or team performance, but GitHub Insights can also show whether a developer is doing a great job or the opposite. During the COVID-19 pandemic, with most developers working from home, GitHub Insights made it easier to see all the activity and collaboration between developers.
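The contribution data behind GitHub Insights is also available programmatically. A small sketch using the GitHub REST API (the repository is illustrative; pass an auth token for real use):

```python
# Sketch: list per-contributor commit totals via the GitHub REST API.
# The repository is illustrative; add an auth token for real use.
import requests

url = "https://api.github.com/repos/octocat/hello-world/stats/contributors"
resp = requests.get(url, headers={"Accept": "application/vnd.github+json"})
resp.raise_for_status()

if resp.status_code == 202:  # GitHub is still computing the statistics
    print("stats not ready yet; retry shortly")
else:
    for contributor in resp.json():
        print(contributor["author"]["login"], contributor["total"], "commits")
```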
The document introduces using GitHub for team collaboration on projects. It outlines topics like organizing team members and permissions, code merging workflows using the fork and pull request model, managing issues, and conducting code reviews. The goal is for teams of 3-4 people to simultaneously contribute to a shared code repository while using GitHub features for project management and version control.
[DSC Croatia 22] How we create and leverage data services in GitLab - Radovan... (DataScienceConferenc1)
A walkthrough of the process of creating, improving, and scaling a data product using a modern DevOps stack. It exposes the details of a use case showing how we embrace the open-source philosophy to achieve faster time to market, and discusses how we use the advantage of internal products to be more agile in the daily job of creating great data products.
This document is a slide deck presentation about enabling agility through DevOps. It discusses why DevOps is needed, defines DevOps, and outlines how practices like continuous integration, continuous delivery, and infrastructure as code can help enable faster delivery, higher quality, and more stable environments. It also provides recommendations for adopting DevOps and getting started with a DevOps transformation.
Code review is one of the crucial software activities where developers and stakeholders collaborate with each other in order to assess software changes. Since code review processes act as a final gate for new software changes to be integrated into the software product, intense collaboration is necessary in order to prevent defects and produce high-quality software products. Recently, code review analytics has been implemented in projects (for example, StackAnalytics of the OpenStack project) to monitor the collaboration activities between developers and stakeholders in the code review processes. Yet, due to the large volume of software data, code review analytics can only report a static summary (e.g., counting), while neither insights nor instant suggestions are provided. Hence, to better gain valuable insights from software data and help software projects make better decisions, we conduct an empirical investigation using statistical approaches. In particular, we use the large-scale data of 196,712 reviews spread across the Android, Qt, and OpenStack open source projects to train a prediction model in order to uncover the relationship between the characteristics of software changes and the likelihood of having poor code review collaborations. We extract 20 patch characteristics which are grouped along five dimensions, i.e., software change properties, review participation history, past involvement of a code author, past involvement of reviewers, and review environment. To validate our findings, we use the bootstrap technique, which repeats the experiment 1,000 times. Due to the large volume of studied data and the intensive computation of characteristic extraction and finding validation, the use of High-Performance Computing (HPC) resources is mandatory to expedite the analysis and generate insights in a timely manner. Through our case study, we find that the amount of review participation in the past and the description length of software changes are significant indicators that new software changes will suffer from poor code review collaborations [2017]. Moreover, we find that the purpose of introducing new features can increase the likelihood that new software changes will receive late collaboration from reviewers. Our findings highlight the need for software change submission policies that monitor these characteristics in order to help software projects improve the quality of their code review processes. Moreover, based on our findings, future work should develop real-time code review analytics implemented on HPC resources in order to instantly provide insights and suggestions to software projects.
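A condensed sketch of the bootstrap validation described above, using out-of-sample rows as the test set on each iteration (the column names are hypothetical, and 100 repetitions stand in for the study's 1,000):

```python
# Sketch: bootstrap validation of a model linking patch characteristics
# to poor code review collaboration. Column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

data = pd.read_csv("review_patches.csv")  # hypothetical mined dataset
features = ["description_length", "past_participation", "is_new_feature"]
aucs = []
for seed in range(100):  # the study uses 1,000 repetitions
    boot = resample(data, replace=True, random_state=seed)
    test = data.loc[data.index.difference(boot.index)]  # out-of-sample rows
    model = LogisticRegression(max_iter=1000).fit(
        boot[features], boot["poor_collaboration"])
    aucs.append(roc_auc_score(
        test["poor_collaboration"], model.predict_proba(test[features])[:, 1]))
print(f"median AUC = {np.median(aucs):.2f}")
```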
DevOps is a combination of cultural philosophies, practices, and tools that increases an organization's ability to deliver applications and services at high velocity. The DevOps lifecycle includes seven phases: continuous development, continuous integration, continuous testing, continuous delivery, continuous deployment, continuous monitoring, and continuous feedback. Continuous integration involves committing code changes frequently and building and testing the code continuously to identify problems early.
An Ultimate Guide To Hire Python Developer (RishiVardhaniM)
Finding a Python developer is not as easy as it sounds. There are many factors that come into play when hiring a developer. This guide will help you find the best Python developer for your project.
https://www.hackerearth.com/recruit/resources/e-books/hire-python-developer/
Governance for AEM/CMS Projects
- Document a best-practice project framework
- Demonstrate a successful implementation
- List key lessons learned and gotchas
- Help answer questions to avoid pitfalls and reduce the learning curve
- Bring together a community of professionals
- Develop a better understanding of running projects efficiently
- Enable a collaborative development process
Open Source Contribution Policies That Don't Suck (Tobie Langel)
Open source contribution policies are long, boring, overlooked documents that generally suck. They're designed to protect the company at all costs, but in the process they end up hurting engineering productivity and morale. Sometimes they even unknowingly put corporate IP at risk.
But that's not inevitable.
It's possible to write open source contribution policies that make engineers' lives easier, boost morale and productivity, reduce attrition, and attract new talent. And it's possible to do so while reducing the company's IP risk, not increasing it.
In this talk, we'll look at the general structure of contribution policies, examples in the wild, and tactics to make them suck less.
We'll also look at how to turn these policies into self-service software, preventing the tedious email back and forth between engineering and legal in most cases and making open source contribution a breeze.
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017 (Caserta)
Over the past eight or nine years, applying DevOps practices to various areas of technology within business has grown in popularity and produced demonstrable results. These principles are particularly fruitful when applied to a data analytics environment. Bob Eilbacher explains how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains. Bob also outlines why DevOps and disruption management go hand in hand.
Topics include:
- The benefits of a DevOps approach, with an emphasis on improving quality and efficiency of data analytics
- Why the push for a DevOps practice needs to come from the C-suite and how it can be integrated into all levels of business
- An overview of the best tools for developers, data analysts, and everyone in between, based on the business’s existing data ecosystem
- The challenges that come with transforming into an analytics-driven company and how to overcome them
- Practical use cases from Caserta clients
This presentation was originally given by Bob at the 2017 Strata Data Conference in New York City.
Jay Lyman (451 Research), Brent Beer (GitHub), and Steven Anderson (Sendachi) talk about these topics:
- Cloud, DevOps, agile development capability, and adoption of containers are all important in both perception and reality.
- Enterprise adoption of cloud computing, DevOps, agile development, and containers is growing, including in production use.
- Modernizing applications to SaaS and migrating them to the cloud are just as important as net-new, so-called ‘cloud-native’ applications.
- Advantages and benefits of these technologies and methodologies center on flexibility and speed, cost reduction, improvements in resiliency and reliability, and fitness for new/emerging applications.
- Barriers center on lack of internal skills, immaturity, lack of familiarity, satisfaction with current technology, cost, and security.
'Open source contribution policies that don’t suck!' (Shane Coughlan)
This document discusses open source contribution policies and provides guidance on creating a policy that is effective for both legal and engineering teams. It notes that having no policy does not mean having no rules, and that a policy can be too restrictive. An ideal policy is permissive, explicit, informative, frictionless, and minimizes risk while being consistently followed. The document outlines different considerations for using open source, contributing outside of work, contributing at work, patching, and releasing open source. It recommends treating the policy like an app to streamline the process and use data to promote open source activity.
Automatic Identification of Informative Code in Stack Overflow Posts (Preetha Chatterjee)
Despite Stack Overflow’s popularity as a resource for solving coding problems, identifying relevant information from an individual post remains a challenge. The overload of information in a post can make it difficult for developers to identify specific and targeted code fixes. In this paper, we aim to help users identify informative code segments, once they have narrowed down their search to a post relevant to their task. Specifically, we explore natural language-based approaches to extract problematic and suggested code pairs from a post. The goal of the study is to investigate the potential of designing a browser extension to draw the readers’ attention to relevant code segments, and thus improve the experience of software engineers seeking help on Stack Overflow.
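The raw material for such approaches is straightforward to obtain. A small sketch that pulls code segments out of a post's HTML body (deciding which segments form problematic/suggested pairs is the paper's contribution and is not shown here):

```python
# Sketch: extract candidate code segments from a Stack Overflow post's
# HTML body. Pairing them as problematic/suggested code would happen
# downstream and is not shown here.
from bs4 import BeautifulSoup

def extract_code_segments(post_html: str) -> list[str]:
    soup = BeautifulSoup(post_html, "html.parser")
    # Stack Overflow renders code in <code> (multi-line inside <pre>).
    return [block.get_text() for block in soup.find_all("code")]

html = "<p>Try this:</p><pre><code>for x in xs:\n    print(x)</code></pre>"
print(extract_code_segments(html))  # ['for x in xs:\n    print(x)']
```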
PMI Thailand: DevOps / Roles of Project Manager (20-May-2020) (Gonzague PATINIER)
DevOps seems to be the latest ‘buzzword’ and trend in the IT industry. This is driven by business needs for ever-faster deployment of new functionality and frustrations with the time and effort it takes to get new systems into operations. It is no longer a question of ‘should we adopt DevOps’, but ‘when and how’?
DevOps represents a significant cultural and behavioral change and many organizations fail to address this in their adoption. Gartner defines DevOps as a change in IT culture, focusing on rapid IT service delivery through the adoption of agile, lean practices in the context of a system-oriented approach. These culture changes include organization changes, impacting structure, roles and responsibilities.
What is the role of the project manager in organizations that have transitioned to DevOps, and where does it fit? Join us as we discuss DevOps and answer your questions, followed by an informative discussion.
The document discusses how GitLab.com builds its data services and products. It describes how GitLab.com uses its own DevOps platform to build an Enterprise Data Platform that analyzes data from GitLab.com. The data team faces challenges around scaling, visibility, and speed. To address these, the team takes actions like open sourcing tools, adopting DevOps practices, and establishing roles, processes, and technologies to build a trusted data model and framework. The key takeaways emphasize continuous iteration, discipline, automation, and living the company values.
Are you a:
- University student or fresh graduate wishing to pursue a career in DevOps and want to prepare for it?
- Software Engineer (developer, tester, etc.) who is curious about DevOps?
- Software Engineer (developer, tester, etc.) wishing to switch from his/her current role to a DevOps related role?
This session is just for you!
Check out the video on YouTube at https://www.youtube.com/watch?v=yYWEOdORH40
A practical success story of building a DevOps culture from scratch in a product company with a classical development team: growing T-shaped skills, the knowledge-sharing practices used, and the tools for building an efficient delivery ecosystem.
https://xpdays.com.ua/programs/devops-applied-survival-guide/
Similar to Revisiting the Applicability of the Pareto Principle to Core Development Teams in Open Source Software Projects
Studying the Integration Practices and the Evolution of Ad Libraries in the G... (SAIL_QU)
In-app advertisements have become a major source of revenue for app developers in the mobile app economy. Ad libraries play an integral part in this ecosystem, as app developers integrate these libraries into their apps to display ads. However, little is known about how app developers integrate these libraries with their apps and how these libraries have evolved over time.
In this thesis, we study the ad library integration practices and the evolution of such libraries. To understand the integration practices of ad libraries, we manually study apps and derive a set of rules to automatically identify four strategies for integrating multiple ad libraries. We observe that integrating multiple ad libraries commonly occurs in apps with a large number of downloads and in categories with a high percentage of apps that display ads. We also observe that app developers prefer to manage their own integrations instead of using the off-the-shelf features that ad libraries provide for integrating multiple ad libraries.
To study the evolution of ad libraries, we conduct a longitudinal study of the 8 most popular ad libraries. In particular, we look at their evolution in terms of size, the main drivers for releasing a new ad library version, and their architecture. We observe that ad libraries are continuously evolving with a median release interval of 34 days. Some ad libraries have grown exponentially in size (e.g., Facebook Audience Network ad library), while other libraries have worked to reduce their size. To study the main drivers for releasing an ad library version, we manually study the release notes of the eight studied ad libraries. We observe that ad library developers continuously update their ad libraries to support a wider range of Android versions (i.e., to ensure that more devices can use the libraries without errors). Finally, we derive a reference architecture for ad libraries and study how the studied ad libraries diverged from this architecture during our study period.
Our findings can assist ad library developers to understand the challenges for developing ad libraries and the desired features of these libraries.
Improving the testing efficiency of selenium-based load tests (SAIL_QU)
Slides for a paper published at AST 2019:
Shahnaz M. Shariff, Heng Li, Cor-Paul Bezemer, Ahmed E. Hassan, Thanh H. D. Nguyen, and Parminder Flora. 2019. Improving the testing efficiency of selenium-based load tests. In Proceedings of the 14th International Workshop on Automation of Software Test (AST '19). IEEE Press, Piscataway, NJ, USA, 14-20. DOI: https://doi.org/10.1109/AST.2019.00008
Studying User-Developer Interactions Through the Distribution and Reviewing M... (SAIL_QU)
This document discusses studying user-developer interactions through the distribution and reviewing mechanisms of the Google Play Store. It analyzes emergency updates made by developers to fix issues, the dialogue between users and developers through reviews and responses, and how the reviewing mechanism can help identify good and bad updates. The study found that responding to reviews is six times more likely to increase an app's rating, with 84% of rating increases going to four or five stars. Three common patterns of developer responses were identified: responding to negative or long reviews, only negative reviews, and reviews shortly after an update.
Studying online distribution platforms for games through the mining of data f... (SAIL_QU)
Our studies of Steam platform data provided insights into online game distribution:
1) Urgent game updates were used to fix crashes, balance issues, and functionality; frequent updaters released more 0-day patches.
2) The Early Access model attracted indie developers and increased game participation; reviews were more positive during Early Access.
3) Game reviews were typically short and in English; sales increased review volume more than new updates; negative reviews came after longer play.
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi... (SAIL_QU)
This study analyzed factors that impact the speed of questions receiving accepted answers on four popular Stack Exchange websites: Stack Overflow, Mathematics, Ask Ubuntu, and Super User. The researchers examined question, answerer, asker, and answer factors from over 150,000 questions. They built classification models and found that key factors for fast answers included the past speed of answerers, length of the question, and past speed of answers for the question's tags. The models achieved AUCs of 0.85-0.95. Fast answers relied heavily on answerers, especially frequent answerers. The study suggests improving incentives for non-frequent and more difficult questions to attract diverse answerers.
Investigating the Challenges in Selenium Usage and Improving the Testing Effi... (SAIL_QU)
Selenium is a popular tool for browser-based automation testing. The author analyzes the challenges of using Selenium by mining Selenium questions on Stack Overflow. Programming language-related questions, especially for Java and Python, are the most common and are growing the fastest. Less than half of the questions receive accepted answers, and questions about browsers and components take the longest. In the second part, the author develops an approach to improve the efficiency of Selenium-based load testing by sharing browsers among user instances. This increases the number of error-free users by 20-22% while reducing memory usage.
Mining Development Knowledge to Understand and Support Software Logging Pract... (SAIL_QU)
This document summarizes Heng Li's PhD thesis on mining development knowledge to understand and support software logging practices. It discusses how logging code is used to record runtime information but can be difficult for developers to maintain. The thesis aims to understand current logging practices and develop tools by mining change history, source code, issue reports, and other development knowledge. It presents research that analyzes logging-related issues to identify developers' logging concerns, uses code topics and structure to predict where logging statements should be added, leverages code changes to suggest when logging code needs updating, and applies machine learning models to recommend appropriate log levels.
Which Log Level Should Developers Choose For a New Logging Statement? (SAIL_QU)
The document discusses choosing an appropriate log level when adding a new logging statement. It finds that an ordinal regression model can effectively model log levels, achieving an AUC of 0.76-0.81 in within-project evaluation and 0.71-0.8 in cross-project evaluation. The most influential factors for determining log levels vary between projects and include metrics related to the logging statement, containing code block, and file as well as code change and historical change metrics.
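Because log levels are ordered (e.g., trace < debug < info < warn < error), an ordinal model is a natural fit. A sketch using statsmodels' OrderedModel on hypothetical metric columns (not the study's exact feature set):

```python
# Sketch: ordinal regression over ordered log levels using statsmodels
# (>= 0.13). The CSV file and metric columns are hypothetical.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

data = pd.read_csv("logging_statements.csv")  # hypothetical mined dataset
levels = ["trace", "debug", "info", "warn", "error"]
y = data["log_level"].astype(
    pd.CategoricalDtype(categories=levels, ordered=True))

model = OrderedModel(y, data[["block_depth", "file_churn"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())  # coefficients show each metric's influence
```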
Towards Just-in-Time Suggestions for Log Changes (SAIL_QU)
The document presents a study on providing just-in-time suggestions for log changes when developers make code changes. The researchers analyzed over 32,000 log changes from 4 systems. They found 20 reasons for log changes that fall into 4 categories: block changes, log improvements, dependence-driven changes, and logging issues. A random forest classifier using 25 software metrics related to code changes, history, and complexity achieved 0.84-0.91 AUC in predicting whether a log change is needed. Change metrics and product metrics were the most influential factors. The study aims to help developers make better logging decisions for failure diagnosis.
The Impact of Task Granularity on Co-evolution Analyses (SAIL_QU)
The document discusses how task granularity at different levels (e.g., commits, pull requests, work items) can impact analyses of co-evolution in software projects. It finds that commit-level analysis can overlook relationships between tasks that span multiple commits. Work-item-level analysis is recommended to provide a more complete view of co-evolution: a median of 29% of work items consist of multiple commits, and analyzing at the commit level would miss 24% of co-changed files and fail to group 83% of related commits.
How are Discussions Associated with Bug Reworking? An Empirical Study on Open... (SAIL_QU)
1) Initial bug fix discussions with more comments and more developers participating are more likely to experience later bug reworking through re-opening or re-patching of the bug.
2) Manual analysis found that defective initial fixes and failure to reach consensus in discussions contributed to later reworking.
3) For re-opened bugs, initial discussions focused on addressing a particular problem through a burst of comments, while re-patched bugs lacked thorough code review and testing during the initial fix period.
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q... (SAIL_QU)
This study examined the relationship between mobile device attributes and user-perceived quality of Android apps. The researchers analyzed 150,373 star ratings from Google Play across 30 devices and 280 apps. They found that the perceived quality of apps varies across devices, and having better characteristics of an attribute does not necessarily correlate with higher quality. Device OS version, resolution, and CPU showed significant relationships with ratings, as did some app attributes like lines of code and number of inputs. However, some device attributes had stronger relationships than app attributes.
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C... (SAIL_QU)
This document presents the results of a large-scale study on the impact of feature selection techniques on defect classification models. The study used expanded scopes including multiple datasets from NASA and PROMISE with different feature types, more classification techniques from different paradigms, and additional feature selection techniques. The results show that correlation-based feature subset selection techniques like FS1 and FS2 consistently appear in the top ranks across most of the datasets, projects within the datasets, and classification techniques. The document concludes that future defect classification studies should consider applying correlation-based feature selection techniques.
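Correlation-based feature subset selection rewards features that correlate with the defect label while penalizing redundancy with already-selected features. The greedy sketch below captures that intuition in simplified form; the studied CFS technique uses a related but more elaborate merit formula:

```python
# Sketch: greedy correlation-based feature selection. It favors features
# that correlate with the defect label but not with already-chosen
# features -- a simplified take on the CFS merit heuristic.
import pandas as pd

def select_features(X: pd.DataFrame, y: pd.Series, k: int = 5) -> list[str]:
    relevance = X.corrwith(y).abs()  # feature-to-label correlation
    chosen: list[str] = []
    while len(chosen) < k and len(chosen) < X.shape[1]:
        best, best_score = None, float("-inf")
        for col in X.columns.difference(chosen):
            # Penalize features that duplicate what we already selected.
            redundancy = (
                X[chosen].corrwith(X[col]).abs().mean() if chosen else 0.0)
            score = relevance[col] - redundancy
            if score > best_score:
                best, best_score = col, score
        chosen.append(best)
    return chosen
```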
Studying the Dialogue Between Users and Developers of Free Apps in the Google... (SAIL_QU)
The study analyzes user-developer interactions through reviews and responses on the Google Play Store. It finds that responding to reviews has a significant positive impact, with 84% of rating increases due to the developer addressing the issue or providing guidance. Three common response patterns were identified: only negative reviews, negative or longer reviews, and reviews shortly after an update. Developers most often thank the user, ask for details, provide guidance, or ask for an endorsement. Guidance responses can address common issues through FAQs. The analysis considered over 2,000 apps, 355,000 review changes, 128,000 responses, and 4 million reviews.
What Do Programmers Know about Software Energy Consumption? (SAIL_QU)
This document summarizes the results of a survey of 122 programmers about their knowledge of software energy consumption. The survey found that programmers have limited awareness of energy consumption and how to reduce it. They were unaware of the main causes of high energy usage. Programmers lacked knowledge about how to properly rank the energy consumption of different hardware components and were unfamiliar with strategies to improve efficiency, such as minimizing I/O and avoiding polling. The study concludes that programmers would benefit from more education on software energy usage and its causes.
Revisiting the Experimental Design Choices for Approaches for the Automated R... (SAIL_QU)
Prior research on automated duplicate issue report retrieval focused on improving performance metrics like recall rate. The author revisits experimental design choices from four perspectives: needed effort, data changes, data filtration, and evaluation process.
The thesis contributions are: 1) Showing the importance of considering needed effort in performance measurement. 2) Proposing a "realistic evaluation" approach and analyzing prior findings with it. 3) Developing a genetic algorithm to filter old issue reports and improve performance. 4) Highlighting the impact of "just-in-time" features on evaluation. The findings help better understand benefits and limitations of prior work in this area.
Measuring Program Comprehension: A Large-Scale Field Study with Professionals (SAIL_QU)
The document summarizes a large-scale field study that tracked the program comprehension activities of 78 professional developers over 3,148 hours. The study found that:
1) Program comprehension accounted for approximately 58% of developers' time on average, with navigation and editing making up the remaining portions.
2) Developers frequently used web browsers and document editors to aid comprehension beyond just IDEs.
3) Interviews and observations revealed that insufficient documentation, unclear code, and complex inheritance hierarchies contributed to long comprehension sessions.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to help students learn programming -- could variable roles also help deep neural models perform coding tasks? We conduct an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
Revisiting the Applicability of the Pareto Principle to Core Development Teams in Open Source Software Projects
1. Is the Pareto Principle Applicable to the Core Teams of GitHub Projects?
Kazuhiro Yamashita, Yasutaka Kamei, Shane McIntosh, Naoyasu Ubayashi, Ahmed E. Hassan
2. Core developers play a critical role in software development
Core developers are responsible for guiding and coordinating the development of an OSS project [Nakakoji].
The most productive developers, who have made roughly 80% of the total contributions [Mockus].
3. In fact, some argue that core developers in OSS projects follow the Pareto Principle
[Figure: 20% of the effort produces 80% of the results.]
4. Pareto Principle in Software Development
[Figure: within a project, 20% of developers produce 80% of artifacts.]
5. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle
[Figure: prior studies split across Pareto, Non-Pareto, and Other findings: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), and Dinh-Trong (TSE).]
The results depend on a small number of case study systems.
6. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle
[Figure: the same studies grouped by reported core team size: fewer than 10 or 15 developers (e.g., Mockus, TOSEM) versus other sizes (e.g., Dinh-Trong, TSE); also Goeminne (IWSQM), Robles (RAMSS), Geldenhuys (ECSEAA), and Koch (ISJ).]
7. Overview of our study of core teams on GitHub
Applicability of the Pareto Principle
Number of Core Developers
8. Overview of our study of core teams on GitHub
Core and Non-Core Developers' Activities
Applicability of the Pareto Principle
Number of Core Developers
9. Collecting and analyzing GitHub data to study core team activity
[Pipeline figure: Projects -> Filter -> Heuristics identify Core/Non-Core developers -> Calc Prop (Core Team Size) and Classify Commits (Activity).]
10. Collecting and analyzing GitHub data to study core team activity
[Same pipeline figure, highlighting the project-filtering step.]
11. Preprocessing GitHub data to handle forks and duplicates, and to remove immature projects
8,510,504 repositories -> 2,496 repositories
12. Collecting and analyzing GitHub data to study core team activity
[Same pipeline figure, highlighting the core-developer heuristics step.]
13. Using heuristics to identify core team members
Three heuristics identify core developers: Commit-based, LOC-based, and Access-based.
14. Our commit-based core contributor heuristic
[Figure: an example project with four developers, A, B, C, and D, and the number of commits each made.]
16. Step 2: Compute the proportion of commits made by each contributor
[Figure: after sorting by commit count, the commit ratios are A: 60%, C: 20%, B: 10%, D: 10%.]
17. Step 3: Core contributors are the developers below the 0.8 cumulative contribution cutoff
[Figure: the cumulative ratio reaches 0.6 at A, 0.8 at C, and 1.0 at D; A and C fall within the cutoff.]
Pct. CoreDev = 2/4 * 100 = 50%; Num CoreDev = 2.
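To make Steps 1-3 concrete, here is a minimal Python sketch of the commit-based heuristic; the function and variable names are ours, not from the study's tooling, and the LOC-based variant would simply substitute changed LOC for commit counts.

def identify_core(commit_counts, cutoff=0.8):
    # Commit-based core-team heuristic (Steps 1-3 above):
    # sort contributors by commit count, accumulate their commit
    # proportions, and stop once the cumulative ratio reaches the cutoff.
    total = sum(commit_counts.values())
    core, cumulative = [], 0.0
    for dev, n in sorted(commit_counts.items(), key=lambda kv: -kv[1]):
        core.append(dev)
        cumulative += n / total        # Step 2: proportion of commits
        if cumulative >= cutoff:       # Step 3: 0.8 cumulative cutoff
            break
    return core

counts = {"A": 6, "B": 1, "C": 2, "D": 1}  # the worked example
core = identify_core(counts)               # ['A', 'C']
pct_core = len(core) / len(counts) * 100   # 50.0
print(core, pct_core)

On the worked example (A: 6 commits, C: 2, B: 1, D: 1), this returns A and C as core developers, i.e., 50% of the team, matching the slide.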
18. Collecting and analyzing GitHub data to study core team activity
[Same pipeline figure: core and non-core developers have now been identified.]
19. Overview of our study of core teams on GitHub
Core and Non-Core Developers' Activities
Applicability of the Pareto Principle
Number of Core Developers
20. Overview of our study of core teams on GitHub
[Same overview slide, highlighting Part 1: the applicability of the Pareto Principle and the number of core developers.]
21. Collecting and analyzing GitHub data to study core team activity
[Same pipeline figure, highlighting the core-team-size analysis (Calc Prop).]
22. Our approach to studying core team size
A project complies with the Pareto Principle if its percentage of core developers falls within the 10%-30% thresholds.
We stratify projects along the confounding factors (LOC, total authors, and project age), each split into Small, Medium, and Large strata.
The example project (50% core developers) does not follow the Pareto Principle.
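As a small illustration of this compliance check, using the 10%-30% thresholds defined above (the helper name is ours):

def follows_pareto(pct_core_devs, lo=10.0, hi=30.0):
    # A project "follows" the Pareto Principle when its core-team
    # percentage lies within the 10%-30% band used in the study.
    return lo <= pct_core_devs <= hi

print(follows_pareto(50.0))  # the example project -> False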
24. Often, there are fewer than 15 core developers in a project
[Figure: number of core developers in projects; 88% (Commit-Based), 98% (LOC-Based), and 96% (Access-Based) of projects have 15 or fewer core developers.]
25. Overview of our study of core teams on GitHub
Core and Non-Core Developers' Activities
Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto Principle.
Number of Core Developers: most projects have 15 or fewer core developers.
26. Overview of our study of core teams on GitHub
[Same overview slide, now highlighting Part 2: core and non-core developers' activities.]
Findings so far: more than half of the projects do not follow the Pareto Principle; most projects have 15 or fewer core developers.
27. Collecting and analyzing GitHub data to study core team activity
[Same pipeline figure, highlighting the activity analysis (Classify Commits).]
28. Our approach to studying activity
We classify commits using keywords in their commit messages.
Development Activity Type            | Keywords
Forward Engineering                  | implement, add, request
Reengineering (Maintenance)          | optimiz, adjust
Corrective Engineering (Maintenance) | bug, fix, issue, error
Management (Maintenance)             | license, formatting, TODO
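To make the classification concrete, here is a minimal Python sketch of keyword-based commit classification in the spirit of Hattori and Lanza's method; the keyword lists are abbreviated to the examples in the table above, the Unknown/Empty fallbacks follow the speaker notes, and the function and dictionary names are ours.

CATEGORIES = {
    "Forward Engineering":    ["implement", "add", "request"],
    "Reengineering":          ["optimiz", "adjust"],
    "Corrective Engineering": ["bug", "fix", "issue", "error"],
    "Management":             ["license", "formatting", "todo"],
}

def classify_commit(message):
    # Return "Empty" for missing comments, the first category whose
    # keyword occurs in the message, or "Unknown" if nothing matches.
    if not message or not message.strip():
        return "Empty"
    msg = message.lower()
    for category, keywords in CATEGORIES.items():
        if any(kw in msg for kw in keywords):
            return category
    return "Unknown"

print(classify_commit("Fix NULL pointer bug in parser"))  # Corrective Engineering
print(classify_commit("Implement OAuth login"))           # Forward Engineering

Note that naive substring matching is deliberate here; partial keywords such as "optimiz" catch both "optimize" and "optimizing".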
29. No big differences in the proportions of development activities
[Figure: activity-type proportions for core vs. non-core developers under the Commit-Based, LOC-Based, and Access-Based heuristics.]
30. Overview of our study of core teams on GitHub
Core and Non-Core Developers' Activities: there are no big differences between core and non-core activities.
Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto Principle.
Number of Core Developers: most projects have 15 or fewer core developers.
31. Overview of our study of core teams on GitHub
[Same summary slide, repeated.]
32. Extremely large core teams may be interesting
Number of projects by core team size:
Heuristic    | <=15  | 16-20 | 21-50 | 51-100 | 101+
Commit-Based | 2,197 |    98 |   137 |     17 |   47
LOC-Based    | 2,454 |    15 |    13 |      4 |   10
Access-Based | 1,164 |    24 |    24 |      0 |    0
33. Many projects face a bus-factor risk
Projects with fewer than 5 core developers: 43% (Commit-Based; Core=1: 8%), 81% (LOC-Based; Core=1: 24%), 54% (Access-Based; Core=1: 21%).
In fact, most projects have fewer than 5 core developers.
44. Fork
One of the features of GitHub.
[Figure: a fork (clone) copies an Original Repository into a Fork Repository; changes flow back to the original via a Pull Request.]
45. Data Extraction
(1) Filter projects using GHTorrent: remove forked repositories and repositories with fewer than 10 developers.
46. Data Extraction
(1) Filter projects using GHTorrent: remove forked repositories, repositories with fewer than 10 developers, and repositories developed outside of GitHub.
47. Data Extraction
(1) Filter projects using GHTorrent: remove forked repositories, repositories with fewer than 10 developers, and repositories developed outside of GitHub.
8,510,504 repositories -> 4,618 repositories
56. Data Extraction
(5) Filter projects by metrics: remove repositories with fewer than 10 developers and repositories with fewer than 1,000 LOC.
4,618 repositories -> 2,496 repositories
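As a rough illustration of this extraction pipeline, here is a hedged pandas sketch; the file name and column names (is_fork, num_devs, developed_on_github, loc) are hypothetical placeholders, not the actual GHTorrent schema.

import pandas as pd

# Hedged sketch of the extraction pipeline; all names are hypothetical.
repos = pd.read_csv("ghtorrent_projects.csv")

# (1) Filter with GHTorrent metadata: drop forks, small teams, and
#     repositories developed outside of GitHub
#     (8,510,504 -> 4,618 repositories in the study).
repos = repos[~repos["is_fork"]]
repos = repos[repos["num_devs"] >= 10]
repos = repos[repos["developed_on_github"]]

# (5) Filter by metrics: re-check team size and drop tiny codebases
#     (4,618 -> 2,496 repositories in the study).
repos = repos[repos["num_devs"] >= 10]
repos = repos[repos["loc"] >= 1000]
print(len(repos))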
Editor's Notes
I’m Kazuhiro Yamashita, a PhD student at Kyushu University, Japan.
Today, I would like to talk about my research.
The slide title is “Is the Pareto principle applicable to core teams of github projects?”
This is a collaborative work between Kyushu University and Queen's University.
In this study, we focus on core developers and the Pareto principle.
Core developers are developers who play important roles in software development projects.
For example, Nakakoji et al. state that core developers are responsible for guiding and coordinating the development of an OSS project.
On the other hand, Mockus et al. define core developers as the most productive developers who have made roughly 80% of the total contributions.
The definitions differ slightly, but both say that core developers are important.
From these facts, core developers are key to the success of OSS projects.
Hence, there are papers which focus on core developers.
This is the agenda of the talk.
First we look at the definitions of core developers and the pareto principle.
Next, we show the previous results. Then, we show our research questions derived from previous results.
After our research questions, we describe our case study. Finally, we conclude this study.
Therefore, there are papers that focus on core developers.
And some papers claim that the proportion of core developers in a successful project follows the Pareto Principle.
Some of the papers argue that the proportions of core developers in OSS projects follow the Pareto principle.
The Pareto Principle is also known as the 80-20 rule; it states that roughly 80% of the results come from 20% of the causes, as in this figure.
The principle originates in economics, but it has been applied to many other fields, including software engineering.
Such papers claim that 20% of developers produce 80% of artifacts in software development context.
As we described, there are papers which claim that the proportion of core developers in a successful project follows the Pareto Principle.
On the other hand, there are papers which claim that the proportion of core developers does not follow the Pareto Principle.
In other words, prior studies have arrived at mixed conclusions about core teams and the Pareto principle.
We assume that such mixed conclusions are obtained because the results depend on a small number of case study systems.
In fact, the prior studies used at most 9 OSS projects.
In addition to the Pareto Principle, prior studies have also arrived at mixed conclusions about the number of core developers.
Mockus et al. claim that the number of core developers is less than 10 or 15, but some papers report otherwise.
For instance, Dinh-Trong et al. showed that 27 to 42 developers contribute more than 80% of the contributions in the FreeBSD project.
On the other hand, there is a paper which claims that the proportion of core developers does not follow the Pareto Principle.
In addition to the Pareto Principle, some papers report exact numbers of core developers.
But the reported numbers differ from paper to paper.
When we consider why such discrepancies arise, we find that all of these results depend on a small number of case study systems.
From the previous work, we derive research question 1 and the motivation.
In RQ1, we would like to generalize the previous results; in other words, does the proportion of core developers follow the Pareto Principle?
Additionally, we also would like to know the general number of core developers.
Therefore, we formulate the research question.
In addition to the size of core teams, Mockus et al. claim that a group larger than the core team by an order of magnitude will repair defects.
From that statement, we assume that non-core developers work more on bug fixing than on implementing new functions.
Therefore, we formulate research question 2 according to the assumption.
The motivation of RQ2 is that we would like to know the proportions of activities of core and non-core developers.
By measuring the proportions of activities, we would like to confirm our assumption.
The second research question is that …
From these points, we derived the first part of our study.
In this part, we focus on core team size and study the applicability of the Pareto Principle to core developers using GitHub projects.
Prior studies argue not only about proportions but also about absolute numbers of core developers.
Therefore, we also study the numbers of core developers in this part.
In the second part of our study, we focus on the activities of core and non-core developers.
This part is also derived from a prior study.
In that study, Mockus states that a group larger than the core team by an order of magnitude will repair defects.
From that statement, we assume that non-core developers work more on fixing bugs than on implementing new functionality.
Hence, we study the activities of core and non-core developers in the second part.
This is an overview of our study.
Now we show the steps for collecting and analyzing github data to study core team activity.
As the common part of both studies, we perform two steps to collect data and identify core developers.
After the two steps, we perform both studies.
In the study of core team size, we calculate the proportion and number of core developers of each project, then determine whether the proportion follows the Pareto Principle.
In the study of activity, we extract the commits of both types of developers, then classify the commits and compare their activities.
We explain each step of our study.
First, we show how to filter projects.
In this study, we used GitHub projects as our dataset, which initially includes 8.5 million repositories.
However, the dataset also includes fork repositories, duplicates, and immature projects.
To remove such repositories, we preprocess the dataset.
After the preprocessing, 2,496 repositories remain.
We conduct our case study on the 2,496 repositories.
Next, we show heuristics that we use to identify core developers.
In this study, we used three heuristics to identify core developers.
In the commit-based heuristic, we identify core developers using the number of commits of each developer.
In the LOC-based heuristic, we identify core developers using the amount of LOC changed by each developer.
In the access-based heuristic, we identify core developers using access rights.
With the access-based heuristic, we can identify core developers by whether a developer has access rights to the repository.
However, for the commit- and LOC-based heuristics, we need a way to separate core from non-core developers.
We show the steps to identify core developers in the commit-based heuristic using this example project.
In this project, there are 4 developers and they made some commits.
As first step, we sort developers by their number of commits in descending order.
After sorting, we calculate the proportions of commits of each developer.
For example, developer A made 6 commits out of 10 commits. Hence, the proportion of developer A is 60%
Finally, we calculate cumulative proportion and identify developers who are below the 0.8 cumulative cutoff as core developers.
In this example, developers A and C are core developers, and B and D are non-core developers.
The percentage of core developers, in this case, is 50% and the number of core developers is 2.
The LOC-based heuristic follows the same steps as the commit-based heuristic, but uses LOC instead of the number of commits.
We identified core and non-core developers in each project.
Now we show the answers to our questions.
These are our two questions.
First we show the results about core team size.
The questions that we address are: Is the Pareto Principle applicable? And what is the typical number of core developers?
This part is highlighted in the figure.
This slide shows our concrete approach to studying core team size.
To check the applicability of the Pareto Principle, we need to define thresholds.
In this study, we use the range between 10% and 30% as the thresholds.
Therefore, the example project that we used to explain our heuristic does not follow the Pareto Principle, because it has 50% core developers.
In addition to checking applicability, we stratify projects along confounding factors to find trends.
We assume that three factors, LOC, total authors, and project age, may affect the size of the core team.
For example, a project with few total authors tends to have a higher proportion of core developers.
Since the results for all heuristics and confounding factors show similar trends, we show only the results for the commit-based heuristic stratified by LOC.
Make this clear on the slide.
These figures show the results of the commit-based heuristic, stratified by LOC.
The x-axis shows the percentage of core developers and the y-axis shows the number of projects.
From left to right, the figures show the distributions for small, medium, and large LOC projects, respectively.
In each figure, the dotted lines are the thresholds of the Pareto Principle.
From the figures, we find that the proportions of core developers are widespread.
In fact, more than half of the projects fall outside the range of the Pareto Principle.
Therefore, we conclude that the proportions of core developers do not follow the Pareto Principle.
When we check the number of core developers, roughly 90% or more of the projects have 15 or fewer core developers.
From the study of core team size, we obtained these results.
Next, we address the second question.
In this study, we focus on the activities of core and non-core developers.
This part of the study is highlighted in the figure.
To compare the activities, we need to classify the commits.
We first explain the method that we used for this study, then show the results.
To identify developer activities, we use the method proposed by Hattori and Lanza.
The method classifies commits into four categories using the commit comments.
This table shows the four categories and example keywords.
The Forward Engineering category covers implementing new functionality; a representative keyword is "implement".
The Reengineering category covers modifying existing code; a keyword is "optimize".
The Corrective Engineering category covers bug-fixing activities; a keyword is "bug".
The Management category covers activities that control the project; a keyword is "TODO".
If no keyword appears in the commit comment, the commit is classified into the Unknown category.
Also, if there is no comment, the commit is classified into the Empty category.
This figure shows the proportions of the categories for each type of developer.
For example, blue bars show the proportions of the Forward Engineering category and yellow bars show Corrective Engineering.
Under our assumption, the proportion of corrective engineering activity among non-core developers should be large.
However, the figure shows no big differences in the proportions of corrective engineering.
Furthermore, the other three activities have similar proportions.
This gives us the conclusion of this part of the study.
Finally, we obtained these results from our study.
Now we discuss some points that we can obtain from our results.
First, we think extremely large core teams may be interesting.
We think it is natural that the proportions of core developers are widespread.
But there are projects where core developers make up more than 50% of the team, and even projects with more than 50 core developers.
It may be interesting to find out how such a large number of core developers is coordinated and how it impacts project quality.
Replace the figure: use absolute numbers of developers instead of percentages.
Next, we think many projects face a bus-factor risk.
We showed that many projects have 15 or fewer core developers.
In fact, many projects have fewer than 5 core developers.
For example, under the LOC-based heuristic, 81% of projects have fewer than 5 core developers and 24% of projects have only one core developer.
From this, we conclude that many projects face a bus-factor risk.
Now we conclude the talk.
First, we showed prior studies and our two questions which are derived from prior studies.
Then, we showed our case study design to address the two questions.
From the case study, we found that core team proportions are widespread and there are no big differences in proportions of development activity between core and non-core developers.
That’s all. Thank you.