This document presents an algorithm to extract data from open source software project repositories on GitHub for building duration estimation models. The algorithm extracts data on contributors, commits, lines of code added and removed, and active days for each contributor within a release period. This data is then used to build linear regression models to estimate project duration based on the number of commits by contributor type (full-time, part-time, occasional). The algorithm is tested on data extracted from 21 releases of the WordPress project hosted on GitHub.
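As a rough sketch of the modelling step this abstract describes (not the paper's actual implementation; the activity thresholds, feature layout, and use of scikit-learn are assumptions for illustration), contributors in a release period could be bucketed by active days and a linear regression fitted on the per-bucket commit counts:

```python
# Hypothetical sketch: bucket contributors by activity, then fit a duration model.
# Thresholds and feature choices are illustrative assumptions, not the paper's values.
import numpy as np
from sklearn.linear_model import LinearRegression

def contributor_type(active_days, period_days):
    ratio = active_days / period_days
    if ratio >= 0.5:
        return "full_time"
    if ratio >= 0.1:
        return "part_time"
    return "occasional"

def release_row(contributors, period_days):
    # contributors: list of dicts with 'active_days' and 'commits' for one release period
    commits = {"full_time": 0, "part_time": 0, "occasional": 0}
    for c in contributors:
        commits[contributor_type(c["active_days"], period_days)] += c["commits"]
    return [commits["full_time"], commits["part_time"], commits["occasional"]]

# One row per release: commits by contributor type; target is the release duration in days.
X = np.array([
    release_row([{"active_days": 60, "commits": 300},
                 {"active_days": 12, "commits": 40},
                 {"active_days": 2,  "commits": 5}], period_days=90),
    release_row([{"active_days": 80, "commits": 420},
                 {"active_days": 20, "commits": 70}], period_days=120),
    release_row([{"active_days": 30, "commits": 150},
                 {"active_days": 3,  "commits": 10}], period_days=60),
])
y = np.array([90, 120, 60])  # observed durations (days)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```

In practice the feature rows would come from the extracted repository data (commits, lines changed, active days per contributor) rather than the hard-coded values above.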
This document compares two solutions for filtering hierarchical data sets: Solution A uses MySQL and Python, while Solution B uses MongoDB and C++. Both solutions were tested on a 2011 MeSH data set using various filtering methods and thresholds. Solution A generally had faster execution times at lower thresholds, while Solution B scaled better to higher thresholds. However, the document concludes that neither solution is clearly superior, and further study is needed to evaluate their performance for real-world human users.
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
The document discusses automatic data unit annotation in search results. It proposes a method that clusters data units on result pages into groups containing semantically similar units. Then, multiple annotators are used to predict annotation labels for each group based on features of the units. An annotation wrapper is constructed for each website to annotate new result pages from that site. The method aims to improve search response by providing meaningful annotations of data units within results. It is evaluated based on precision and recall for the alignment of data units and text nodes during the annotation process.
The document describes an automated process for bug triage that uses text classification and data reduction techniques. It proposes using Naive Bayes classifiers to predict the appropriate developers to assign bugs to by applying stopword removal, stemming, keyword selection, and instance selection on bug reports. This reduces the data size and improves quality. It predicts developers based on their history and profiles while tracking bug status. The goal is to more efficiently handle software bugs compared to traditional manual triage processes.
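A minimal sketch of the kind of pipeline the abstract outlines, using scikit-learn's text tooling as a stand-in (the report texts, developer labels, and parameter choices below are invented; stemming and instance selection are omitted):

```python
# Illustrative bug-triage sketch: stop-word removal and keyword selection via the
# vectorizer, then a Naive Bayes classifier mapping report text to a developer.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = [
    "crash when saving large file to disk",
    "button misaligned on settings dialog",
    "null pointer exception while saving project",
    "font rendering broken in dark theme dialog",
]
assignees = ["dev_io", "dev_ui", "dev_io", "dev_ui"]  # developers who fixed them

triage = make_pipeline(
    CountVectorizer(stop_words="english", max_features=1000),  # stopword removal + keyword selection
    MultinomialNB(),
)
triage.fit(reports, assignees)

print(triage.predict(["exception raised when saving settings file"]))
```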
IRJET - Determining Document Relevance using Keyword Extraction (IRJET Journal)
This document describes a system that aims to search for and retrieve relevant documents from a large collection based on a user's query. It does this through three main components: keyword extraction, document searching, and a question answering bot. Keyword extraction is done using the TF-IDF algorithm to identify important words in documents. These keywords are stored in a database along with their TF-IDF weights. When a user submits a query, the system searches for documents containing keywords from the query and returns relevant results. It also includes a feedback mechanism for users to improve search accuracy over time. The goal is to deliver accurate search results quickly from large document collections.
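For illustration, TF-IDF weights of the kind the system stores could be computed with scikit-learn and a query scored by summing the weights of its terms in each document (a simplified sketch; the database layer, question answering bot, and feedback mechanism are not shown):

```python
# Simplified TF-IDF keyword extraction and query matching.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "open source repositories store commit history and contributor data",
    "keyword extraction identifies the most important words in a document",
    "search engines rank documents by relevance to the user query",
]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)               # documents x vocabulary matrix
terms = np.array(vec.get_feature_names_out())

# Top keywords per document: the highest-weighted terms in each row.
for i, row in enumerate(tfidf.toarray()):
    top = terms[row.argsort()[::-1][:3]]
    print(f"doc {i} keywords: {list(top)}")

# Query handling: score each document by the summed TF-IDF weight of the query terms.
query_vec = vec.transform(["keyword relevance search"])
scores = (tfidf @ query_vec.T).toarray().ravel()
print("ranking:", scores.argsort()[::-1])
```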
A Conceptual Dependency Graph Based Keyword Extraction Model for Source Code... (Nakul Sharma)
The document presents a conceptual dependency graph based model for mapping source code to API documentation. It proposes using a weighted graph to model dependencies between code elements and filter candidate sets for contextual similarity computation. The methodology involves preprocessing source code and documents, building a dependency graph, and computing contextual similarity. Results show the proposed method achieves higher accuracy than existing techniques. Future work includes applying the approach to larger open source projects.
This document presents an approach to populating a release history database from version control and bug tracking systems. It combines data from CVS version control and Bugzilla bug tracking for the Mozilla project to analyze software evolution. The paper describes related work, outlines the data import process, and evaluates the approach by examining timescales, release history, and coupling for the Mozilla project. It concludes that this approach provides insights into a project's evolutionary processes but that more formal integration with version control could improve the analysis.
Searching Repositories of Web Application Models (Marco Brambilla)
Project repositories are a central asset in software development, as they preserve the technical knowledge gathered in past development activities. However, locating relevant information in a vast project repository is problematic, because it requires manually tagging projects with accurate metadata, an activity which is time consuming and prone to errors and omissions. This paper investigates the use of classical Information Retrieval techniques for easing the discovery of useful information from past projects. Differently from approaches based on textual search over the source code of applications or on querying structured metadata, we propose to index and search the models of applications, which are available in companies applying Model-Driven Engineering practices. We contrast alternative index structures and result presentations, and evaluate a prototype implementation on real-world experimental data.
This document discusses data reduction techniques for improving bug triage in software projects. It proposes combining instance selection and feature selection to simultaneously reduce the scale of bug data on both the bug dimension and word dimension, while also improving the accuracy of bug triage. Historical bug data is used to build a predictive model to determine the optimal order of applying instance selection and feature selection for a new bug data set. The techniques are empirically evaluated on 600,000 bug reports from the Eclipse and Mozilla open source projects, showing the approach can effectively reduce data scale and improve triage accuracy.
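A toy sketch of the two reductions in sequence, using chi-square feature selection on the word dimension and a naive near-duplicate filter as a stand-in for instance selection on the bug dimension (the paper's own algorithms and its model for choosing the application order are not reproduced; all data here is invented):

```python
# Toy data reduction: feature selection (word dimension) then instance selection (bug dimension).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

reports = [
    "ui freeze when opening preferences window",
    "editor crashes on paste of large text",
    "preferences window ignores theme setting",
    "crash while pasting image into editor",
]
labels = np.array([0, 1, 0, 1])  # 0 = dev_ui, 1 = dev_editor

X = CountVectorizer().fit_transform(reports)

# Feature selection: keep the k words most associated with the assignee label.
X_fs = SelectKBest(chi2, k=5).fit_transform(X, labels)

# Instance selection (illustrative): drop near-duplicate reports within a class,
# keeping one representative per group of very similar rows.
dense = X_fs.toarray().astype(float)
keep = []
for i, row in enumerate(dense):
    dup = any(labels[j] == labels[i] and np.linalg.norm(row - dense[j]) < 0.5 for j in keep)
    if not dup:
        keep.append(i)

print("reduced shape:", dense[keep].shape)
```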
This document provides a listing and brief descriptions of working papers from 2000. It includes 12 papers with titles and short 1-2 paragraph summaries of each paper's topic or focus. The papers cover a range of topics related to text mining, machine learning, data compression, knowledge discovery, and user interfaces for developing classifiers.
TOWARDS EFFECTIVE BUG TRIAGE WITH SOFTWARE DATA REDUCTION TECHNIQUES (Shakas Technologies)
This document summarizes an approach for data reduction in software bug triage. It combines instance selection and feature selection techniques to simultaneously reduce the number of bug reports (instances) and words (features) in bug datasets. This aims to create smaller, higher quality datasets that improve the accuracy of automatic bug triage while reducing labor costs. It evaluates different instance selection, feature selection, and their combination methods on large bug datasets from Eclipse and Mozilla projects. The results show the proposed data reduction approach can effectively shrink dataset sizes and boost bug triage accuracy.
TOWARDS CROSS-BROWSER INCOMPATIBILITIES DETECTION: A SYSTEMATIC LITERATURE REVIEW (ijseajournal)
The term Cross Browser Incompatibilities (XBI) stands for compatibility issues which can be observed while rendering a web application in different browsers, such as Microsoft Edge, Mozilla Firefox, and Google Chrome, among others. In order to detect Cross-Browser Incompatibilities (XBIs), Web developers rely on manual inspection of every web page of their applications, rendering them in various configuration environments (considering multiple OS platforms and browser versions). These manual inspections are error-prone and increase the cost of the Web engineering process. This paper has the goal of identifying techniques for automatic XBI detection, conducting a Systematic Literature Review (SLR) with a focus on studies which report XBI automatic detection strategies. The selected studies were compared according to the different technological solutions they implemented and were used to elaborate an XBI detection framework which can organize and classify the state-of-the-art techniques and may be used to guide future research activities about this topic.
This document summarizes a research paper that combines software structure analysis and revision history analysis to measure architecture quality and identify risks. The researchers conducted a case study on an agile project with 300,000 lines of code. They measured attributes of individual files and file pairs like size, dependencies, and change frequency. Exploration of the data revealed outlier files and distant change pairs that indicated architecture problems. Validation showed the measures could accurately predict faulty files. Investigating outliers uncovered key interface files with many faults violating intended encapsulation. The visual models and metrics helped prioritize restructuring tasks and justify architecture renovations.
Findability through Traceability - A Realistic Application of Candidate Tr... (Markus Borg)
Conference presentation from ENASE 2012 in Wroclaw, Poland.
Abstract: Since software development is of dynamic nature, the impact analysis is an inevitable work task. Traceability is known as one factor that supports this task, and several researchers have proposed traceability recovery tools to propose trace links in an existing system. However, these semi-automatic tools have not yet proven useful in industrial applications. Based on an established automation model, we analyzed the potential value of such a tool. We based our analysis on a pilot case study of an impact analysis process in a safety-critical development context, and argue that traceability recovery should be considered an investment in findability. Moreover, several risks involved in an increased level of impact analysis automation are already plaguing the state-of-practice workflow. Consequently, deploying a traceability recovery tool involves a lower degree of change than has previously been acknowledged.
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
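The re-ranking idea can be illustrated by counting how often domain-ontology concepts occur in each retrieved document and sorting by that score; the concept list and snippets below are invented, and the paper's ontology and ranking-error evaluation are not reproduced:

```python
# Re-rank retrieved documents by frequency of domain ontology concepts.
import re

ontology_concepts = {"payment", "shopping cart", "checkout", "invoice", "customer"}

retrieved = [
    "Our platform streamlines checkout and payment for every customer order.",
    "This article reviews generic web design trends for 2011.",
    "Invoice generation and shopping cart recovery boost customer conversion.",
]

def concept_score(text):
    text = text.lower()
    # Count non-overlapping occurrences of each single- or multi-word concept.
    return sum(len(re.findall(re.escape(c), text)) for c in ontology_concepts)

reranked = sorted(retrieved, key=concept_score, reverse=True)
for doc in reranked:
    print(concept_score(doc), doc)
```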
The document describes a proposed system for classifying and retrieving reusable software components from a repository. It begins with an introduction to software reuse and existing component classification schemes, such as free text, enumerated, attribute-value, and faceted classification. The proposed system uses an integrated classification scheme that combines attribute-value and faceted classification. Components are classified using attributes like operating system, language, keywords, inputs, outputs, domain, and version. The system includes algorithms for inserting new components and searching for relevant components matching given attributes. Matching components are returned along with a download option. The goal is to take advantage of different classification schemes to improve search and retrieval of reusable software components.
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS... (IJCI JOURNAL)
In this paper we describe a new unsupervised algorithm for automatic documents clustering with the aid of Wikipedia. Contrary to other related algorithms in the field, our algorithm utilizes only two aspects of Wikipedia, namely its categories network and articles titles. We do not utilize the inner content of the articles in Wikipedia or their inner or inter links. The implemented algorithm was evaluated in an experiment for documents clustering. The findings we obtained indicate that the utilized features from Wikipedia in our framework can give competing results, especially when compared against other models in the literature which employ the inner content of Wikipedia articles.
Real Time Competitive Marketing Intelligence (feiwin)
The document describes a system for real-time competitive market intelligence that analyzes unstructured text from news articles about companies. It crawls the web in real-time to collect articles about competitors. Text analysis techniques convert the documents to numerical format for machine learning methods to determine word patterns that distinguish companies. The system applies a lightweight rule induction method to generate rules with meaningful word conjunctions and disjunctions that characterize each company. An example output highlights distinguishing words between news articles about IBM and Microsoft.
The document discusses the history and challenges of software component search and retrieval. It covers early information retrieval systems from the 1940s-1950s and issues with unstructured, semi-structured and structured documents. The document also discusses the goals of software component retrieval, such as exact and approximate reuse. It analyzes different component search tools and approaches developed from the 1990s-2000s, including facet classification, spreading activation techniques, and context-aware recommendation systems.
Multi Similarity Measure based Result Merging Strategies in Meta Search Engine (IDES Editor)
In a Meta Search Engine, result merging is the key component. Meta Search Engines provide a uniform query interface for Internet users to search for information. Depending on users' needs, they select relevant sources, map user queries into the target search engines, and subsequently merge the results. The effectiveness of a Meta Search Engine is closely related to the result merging algorithm it employs. In this paper, we propose a Meta Search Engine with two distinct steps: (1) searching through surface and deep search engines, and (2) ranking the results with the designed ranking algorithm. Initially, the query given by the user is passed to the deep and surface search engines. The proposed method uses two distinct algorithms for ranking the search results: a concept similarity based method and a cosine similarity based method. Once the results from the various search engines are ranked, the proposed Meta Search Engine merges them into a single ranked list. Finally, experiments demonstrate the efficiency of the proposed visible and invisible web-based Meta Search Engine in merging the relevant pages, using TSAP as the evaluation criterion.
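A compact sketch of the cosine-similarity leg of such a ranker, followed by a naive score-based merge of the two result lists (the concept-similarity algorithm and TSAP evaluation mentioned in the abstract are not shown; all inputs are illustrative):

```python
# Rank results from the surface and deep engines by cosine similarity to the query,
# then merge the two ranked lists by score. A simplified stand-in for the abstract's method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "software effort estimation models"
surface_results = [
    "A survey of software cost and effort estimation models",
    "Travel blog: estimating the effort of a hiking trip",
]
deep_results = [
    "Effort estimation with regression models on repository data",
    "Database indexing fundamentals",
]

all_docs = surface_results + deep_results
vec = TfidfVectorizer().fit(all_docs + [query])   # one shared vocabulary keeps scores comparable

def rank(results):
    sims = cosine_similarity(vec.transform([query]), vec.transform(results)).ravel()
    return list(zip(results, sims))

merged = sorted(rank(surface_results) + rank(deep_results), key=lambda p: p[1], reverse=True)
for doc, score in merged:
    print(f"{score:.3f}  {doc}")
```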
Comparison of data recovery techniques on master file table between Aho-Coras... (TELKOMNIKA JOURNAL)
This document compares two data recovery techniques - Aho-Corasick and logical data recovery - based on efficiency using three datasets. The logical data recovery algorithm was found to be faster than Aho-Corasick according to processing speed, recovering files at speeds of up to 12.8 MB/s compared to 11.9 MB/s for Aho-Corasick. However, both algorithms achieved the same recovery accuracy rates of 94-100% depending on the dataset. The document concludes that a hybrid algorithm combining the strengths of both - using logical recovery for speed and Aho-Corasick for targeted file searches - could achieve even higher efficiency for data recovery.
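For readers unfamiliar with the string-matching side of the comparison, a compact Aho-Corasick automaton (multi-pattern search in a single pass over the input) can be built as follows; this is a generic textbook construction, not the recovery tool evaluated in the paper:

```python
# Generic Aho-Corasick automaton: builds a trie with failure links, then scans the
# input once, reporting every (start_index, pattern) occurrence.
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]          # one entry per trie node
    for pat in patterns:                          # insert every pattern into the trie
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    queue = deque(goto[0].values())               # depth-1 nodes keep fail = root
    while queue:                                  # BFS to set failure links
        r = queue.popleft()
        for ch, u in goto[r].items():
            queue.append(u)
            v = fail[r]
            while v and ch not in goto[v]:
                v = fail[v]
            fail[u] = goto[v].get(ch, 0)
            out[u] |= out[fail[u]]                # inherit matches ending at the fallback node
    return goto, fail, out

def search(text, goto, fail, out):
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

goto, fail, out = build_automaton(["header", "ader", "file"])
print(sorted(search("master file table header", goto, fail, out)))
```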
The document introduces Tag.bio as a low-code analytics application platform built from interconnected data products in a data mesh architecture. It consists of data, algorithms, and analysis apps contributed by different groups - data engineers, data scientists, and domain experts. The platform can integrate various data sources and enable collaboration between groups. It then provides demos of the Tag.bio developer studio and data portal. Key capabilities discussed include integration with AWS services like AI/ML and HealthLake, as well as security features like confidential computing. Example use cases presented are for clinical trials, healthcare, life sciences, and universities.
Data Mining and the Web_Past_Present and Future (feiwin)
The document discusses past, present and future of data mining and the web. It outlines four problems with current web search tools: abundance of irrelevant results, limited coverage, limited queries, and lack of customization. It then describes several data mining techniques like association rules, classification, clustering and how they can be applied to web mining problems such as analyzing link structure, improving search customization and extracting information from web documents. Future research directions include better mining of web structure and content.
TUW-ASE Summer 2015 - Quality of Result-aware data analytics (Hong-Linh Truong)
This is a lecture from the advanced service engineering course from the Vienna University of Technology. See http://dsg.tuwien.ac.at/teaching/courses/ase
The document discusses the process of qualitative data analysis using the software tool Ethnograph V5.0. It describes the stages of qualitative data collection and analysis including coding, grouping, establishing relationships and theory generation. It then explains the various steps that can be taken using Ethnograph V5.0 to facilitate the analytic process, such as creating projects, entering data, coding data, conducting searches, creating memos and search filters using face sheets and identifier sheets. Pros and cons of using the software are also mentioned.
In the present paper, the applicability and capability of AI techniques for effort estimation prediction have been investigated. It is seen that neuro-fuzzy models are very robust, characterized by fast computation, and capable of handling distorted data. Due to the presence of data non-linearity, they are an efficient quantitative tool for predicting effort. A one-hidden-layer network, named OHLANFIS, has been developed in the MATLAB simulation environment. The initial parameters of the OHLANFIS are identified using the subtractive clustering method. Parameters of the Gaussian membership function are optimally determined using the hybrid learning algorithm. The analysis shows that the effort estimation prediction model developed using the OHLANFIS technique performs well compared with the normal ANFIS model.
Mining developer communication data streams (csandit)
This paper explores the concepts of modelling a software development project as a process that results in the creation of a continuous stream of data. In terms of the Jazz repository used in this research, one aspect of that stream of data would be developer communication. Such data can be used to create an evolving social network characterized by a range of metrics. This paper presents the application of data stream mining techniques to identify the most useful metrics for predicting build outcomes. Results are presented from applying the Hoeffding Tree classification method used in conjunction with the Adaptive Sliding Window (ADWIN) method for detecting concept drift. The results indicate that only a small number of the available metrics considered have any significance for predicting the outcome of a build.
The article proposes a new model for optimizing software effort and cost estimation based on code reusability. The model compares new projects to previously completed, similar projects stored in a code repository. By searching for and retrieving reusable code, functions, and methods from old projects, the model aims to reduce effort and cost estimates for new software development. The model is described as being based on the concept of estimation by analogy and using innovative search and retrieval techniques to achieve code reuse and thus decreased cost and effort estimates.
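Estimation by analogy, which the article builds on, can be sketched as a nearest-neighbour lookup over a small repository of past projects; the features, normalisation, and reuse adjustment below are illustrative assumptions rather than the article's model:

```python
# Analogy-based estimate: find the most similar past project and adapt its effort.
import numpy as np

# Past projects: (size_kloc, team_size, reused_fraction) -> effort in person-months.
past = [
    {"features": np.array([12.0, 4, 0.10]), "effort": 30.0},
    {"features": np.array([45.0, 9, 0.05]), "effort": 140.0},
    {"features": np.array([20.0, 5, 0.40]), "effort": 38.0},
]

def estimate(new_features, reuse_discount=0.5):
    feats = np.array([p["features"] for p in past])
    # Normalise each feature to [0, 1] so no single scale dominates the distance.
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    dists = np.linalg.norm((feats - lo) / span - (new_features - lo) / span, axis=1)
    nearest = past[int(dists.argmin())]
    # Assumed adjustment: effort shrinks with the fraction of code expected to be reused.
    return nearest["effort"] * (1 - reuse_discount * new_features[2])

print(estimate(np.array([18.0, 5, 0.30])))
```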
A Survey on Design Pattern Detection Approaches (CSCJournals)
Design patterns play a key role in software development process. The interest in extracting design pattern instances from object-oriented software has increased tremendously in the last two decades. Design patterns enhance program understanding, help to document the systems and capture design trade-offs.
This paper provides the current state of the art in design patterns detection. The selected approaches cover the whole spectrum of the research in design patterns detection. We noticed diverse accuracy values extracted by different detection approaches. The lessons learned are listed at the end of this paper, which can be used for future research directions and guidelines in the area of design patterns detection.
This document summarizes a research paper on developing a feature-based product recommendation system. It begins by introducing recommender systems and their importance for e-commerce. It then describes how the proposed system takes basic product descriptions as input, recognizes features using association rule mining and k-nearest neighbor algorithms, and outputs recommended additional features to improve the product profile. The paper evaluates the system's performance on recommending antivirus software features.
1. Software project estimation involves decomposing a project into smaller problems like major functions and activities. Estimates can be based on similar past projects, decomposition techniques, or empirical models.
2. Accurate estimates depend on properly estimating the size of the software product using techniques like lines of code, function points, or standard components. Baseline metrics from past projects are then applied to the size estimates.
3. Decomposition techniques involve estimating the effort needed for each task or function and combining them. Process-based estimation decomposes the software process into tasks, while problem-based estimation decomposes the problem itself; a small worked sketch of this idea follows below.
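As a worked illustration of the decomposition idea summarized in point 3 (not taken from the document; task names and numbers are invented), each sub-task gets a three-point estimate that is rolled up with the standard PERT weighting:

```python
# Three-point (PERT-style) roll-up of per-task effort estimates.
tasks = {
    "requirements analysis": (2, 3, 6),    # (optimistic, most likely, pessimistic) person-weeks
    "data extraction module": (4, 6, 12),
    "regression modelling":   (2, 4, 8),
    "testing and validation": (3, 5, 10),
}

def pert(opt, likely, pess):
    return (opt + 4 * likely + pess) / 6.0

total = sum(pert(*t) for t in tasks.values())
for name, t in tasks.items():
    print(f"{name:25s} {pert(*t):.1f} person-weeks")
print(f"{'total':25s} {total:.1f} person-weeks")
```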
Decision Making Framework in e-Business Cloud Environment Using Software Metr... (ijitjournal)
Cloud computing is a key technology in the IT industry, enabling providers to offer access to their systems and application services on a paid, on-demand basis. As a result, several enterprises, including Facebook, Microsoft, Google, and Amazon, have started offering such services to their clients. Software quality is a decisive factor in market competition. This paper presents a hybrid framework based on the goal/question/metric paradigm to evaluate the quality and effectiveness of previous software products at the project, product, and organization levels in a cloud computing environment. The approach supports decision making at these levels using neural networks and three angular metrics: project metrics, product metrics, and organization metrics.
This document provides an overview of data mining techniques for software engineering. It discusses how software engineering data like code bases, execution traces, and bug databases contain valuable information about a project. Data mining can help with tasks like managing projects, improving quality, and understanding large systems. The document outlines challenges in mining software engineering data and how techniques must be customized. It also summarizes successful case studies and available data mining tools to help researchers apply these techniques.
SIZE METRICS FOR SERVICE-ORIENTED ARCHITECTURE (mathsjournal)
Determining the size, effort and cost of Service-oriented Architecture (SOA) systems is important to facilitate project planning and eventually successful implementation of a software project. The SOA approach is one of the recent software architectures that allow integration of applications within and outside the organization regardless of heterogeneous technology over a distributed network. A number of research studies have been done to measure SOA size. However, these studies are not based on software metrics, rendering them ineffective in their estimation. This study defined a set of SOA size metrics based on the Unified Modelling Language (UML). The study employed Briand's theoretical validation to test the validity of the designed SOA size metrics. Findings from this study will provide metrics to measure SOA size more accurately and form a basis for future software engineering researchers to develop more effective and more accurate size metrics and effort estimation methods.
Size Metrics for Service-Oriented Architecture (ijseajournal)
Determining the size, effort and cost of Service-oriented Architecture (SOA) systems is important to facilitate project planning and eventually successful implementation of a software project. The SOA approach is one of the recent software architectures that allow integration of applications within and outside the organization regardless of heterogeneous technology over a distributed network. A number of research studies have been done to measure SOA size. However, these studies are not based on software metrics, rendering them ineffective in their estimation. This study defined a set of SOA size metrics based on the Unified Modelling Language (UML). The study employed Briand's theoretical validation to test the validity of the designed SOA size metrics. Findings from this study will provide metrics to measure SOA size more accurately and form a basis for future software engineering researchers to develop more effective and more accurate size metrics and effort estimation methods.
This document summarizes a technical report on an empirical study of design pattern evolution in 39 open source Java projects. The study analyzed a total of 428 software releases from the projects to identify 10 common design patterns. A total of 27,855 instances of the patterns were found. The data collected includes the number of each pattern identified in each project release. This dataset will be further analyzed to understand how design patterns evolve over the lifetime of a software system.
The document discusses some of the promises and perils of mining software repositories like Git and GitHub for research purposes. It notes that while these sources contain rich data on software development, there are also challenges to consider. For example, decentralized version control systems like Git allow private collaboration that may be missed. And most GitHub projects are personal and inactive, while it is also used for storage and hosting. The document recommends researchers approach these data sources carefully and provides lessons on how to properly analyze and interpret the data from repositories like Git and GitHub.
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust... (IOSR Journals)
This document describes a study that uses fuzzy clustering and software metrics to predict faults in large industrial software systems. The study uses fuzzy c-means clustering to group software components into faulty and fault-free clusters based on various software metrics. The study applies this method to the open-source JEdit software project, calculating metrics for 274 classes and identifying faults using repository data. The results show 88.49% accuracy in predicting faulty classes, demonstrating that fuzzy clustering can be an effective technique for fault prediction in large software systems.
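A bare-bones fuzzy c-means loop of the kind such studies rely on is sketched below, clustering class-level metrics into two fuzzy groups; the metric values are invented, and the study's actual pipeline (metric extraction from JEdit, fault labelling, accuracy evaluation) is not reproduced:

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns membership matrix U (n x c) and cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)           # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))             # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers

# Rows: classes described by (lines of code, coupling, past changes) -- illustrative values.
metrics = np.array([
    [120.0,  3, 1],
    [980.0, 14, 9],
    [150.0,  4, 2],
    [1100.0, 16, 11],
])
U, centers = fuzzy_cmeans(metrics)
# Treat the cluster with the larger metric centroid as the "fault-prone" one.
fault_cluster = centers.sum(axis=1).argmax()
print("fault-prone membership per class:", np.round(U[:, fault_cluster], 2))
```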
AGILE SOFTWARE ARCHITECTURE IN GLOBAL SOFTWARE DEVELOPMENT ENVIRONMENT: SYSTEMA... (ijseajournal)
In recent years, software development companies started to adopt Global Software Development (GSD) to explore the benefits of this approach, mainly cost reduction. However, the GSD environment also brings more complexity and challenges. Some challenges are related to communication aspects like cultural differences, time zone, and language. This paper is the first step in an extensive study to understand if the software architecture can ease communication in GSD environments. We conducted a Systematic Literature Mapping (SLM) to catalog relevant studies about software architecture and GSD teams and identify potential practices for use in the software industry. This paper's findings contribute to the GSD body of knowledge by exploring the impact of software architecture strategy on the GSD environment. It presents hypotheses regarding the relationship between software architecture and GSD challenges, which will guide future research.
APPLYING REQUIREMENT BASED COMPLEXITY FOR THE ESTIMATION OF SOFTWARE DEVELOPM... (cscpconf)
The ability to compute software complexity in the requirement analysis phase of the software development life cycle (SDLC) would be an enormous benefit for estimating the development and testing effort required for yet-to-be-developed software. A relationship between source code and the difficulty of developing it is also examined in order to estimate the complexity of the proposed software for cost estimation, manpower build-up, and code and developer evaluation. This paper therefore presents a systematic and integrated approach for estimating software development and testing effort on the basis of the improved requirement based complexity (IRBC) of the proposed software, obtained from its software requirement specification (SRS). The IRBC measure serves as the basis for estimating these development activities, enabling developers and practitioners to predict critical information about development intricacies early on. For validation purposes, the proposed measures are compared with various established and prevalent practices proposed in the past. Finally, the results obtained validate the claim that the approaches discussed in this paper for estimating software development and testing effort in the early phases of the SDLC are robust, comprehensive, early alarming, and compare well with other measures proposed in the past.
Software Defect Trend Forecasting In Open Source Projects using A Univariate ... (CSCJournals)
Our objective in this research is to provide a framework that allows project managers, business owners, and developers to forecast the trend in software defects within a software project in real time. By providing these stakeholders with a mechanism for forecasting defects, they can allocate the necessary resources at the right time to remove defects before they accumulate to the point of causing software failure. In our research, we not only show general trends in several open-source projects but also show trends in daily, monthly, and yearly activity. Our research shows that this forecasting method can be used up to 6 months out with an MSE of only 0.019. In this paper, we present our technique and methodologies for developing the inputs for the proposed model and the results of testing on seven open source projects. Further, we discuss the prediction models, their performance, and the implementation using the FBProphet framework and the ARIMA model.
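A minimal forecasting sketch in the spirit of the paper, assuming the prophet Python package and a synthetic daily defect-count series (the paper's data, its ARIMA comparison, and the reported MSE are not reproduced):

```python
# Forecast a daily defect-count series roughly six months ahead with Prophet.
import pandas as pd
from prophet import Prophet  # assumed package name; older releases were published as fbprophet

# Synthetic history: one row per day with the columns Prophet expects ('ds', 'y').
dates = pd.date_range("2022-01-01", periods=365, freq="D")
defects = [5 + (i % 7) + (i // 60) for i in range(365)]  # weekly ripple plus a slow upward trend
history = pd.DataFrame({"ds": dates, "y": defects})

model = Prophet()
model.fit(history)

future = model.make_future_dataframe(periods=180)  # roughly 6 months of future dates
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```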
A Systematic Mapping Study on Analysis of Code Repositories.pdf (Alicia Edwards)
This document presents a systematic mapping study that analyzed 236 documents on code repository analysis published between 2012 and 2019. The study identified the main types of information used as inputs for code repository analysis (commit data and source code) and the most common analysis methods and techniques used (empirical/experimental and automatic). The study aims to help developers better understand how to analyze code repositories to improve software maintenance and evolution.
Improved Presentation and Facade Layer Operations for Software Engineering Pr... (Dr. Amarjeet Singh)
Nowadays, one of the most challenging situations for software developers is the mismatch between relational database systems and programming code. In the literature, this problem is known as "impedance mismatch". This study develops a framework built on innovations based on the existing Object Relational Mapping technique to solve these problems. In the study, users can perform operations for three different database systems: MsSQL, MySql and Oracle. In addition, these operations can be done within the framework for the C# and Java programming languages. In this framework, developers can define database tables in the interface automatically and create relations between tables by defining a foreign key. When the system performs these operations, it creates tables, views, and stored procedures automatically. In addition, entity classes in C# and Java for tables and views, and operation classes for stored procedures, are created automatically. A summary of the transactions can be exported as a PDF file by the framework. The project can also automatically create Windows Communication Foundation classes to facilitate the handling of the created database elements and the interfacing operations. This framework, which supports distributed systems, can be downloaded at this link.
IRJET - Computational model for the processing of documents and support to the ... (IRJET Journal)
This document proposes a computational model for processing documents and supporting decision making in information retrieval systems. The model includes five main components: 1) a tracking and indexing component to crawl the web and store document metadata, 2) an information processing component to categorize documents and define user profiles, 3) a decision support component to analyze stored information and generate statistical reports, 4) a display component to provide search interfaces and visualization tools, and 5) specialized roles to administer the system. The goal of the model is to provide a framework for developing large-scale search engines.
The document provides an overview of software project estimation techniques. It discusses that estimation involves determining the money, effort, resources and time required to build a software system. The key steps are: describing product scope, decomposing problems, estimating sub-problems using historical data and experience, and considering complexity and risks. It also covers decomposition techniques, empirical estimation models like COCOMO II, and factors considered in estimation like resources, feasibility and risks.
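As a concrete example of the empirical-model family mentioned above, the basic COCOMO relationships (a simpler ancestor of COCOMO II, shown here with the published organic-mode constants) turn an estimated size into effort and schedule:

```python
# Basic COCOMO (organic mode): effort and schedule from estimated size in KLOC.
def basic_cocomo(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    effort_pm = a * kloc ** b          # effort in person-months
    duration_m = c * effort_pm ** d    # development time in months
    return effort_pm, duration_m

effort, months = basic_cocomo(32)      # e.g. a project estimated at 32 KLOC
print(f"effort ~ {effort:.1f} person-months, schedule ~ {months:.1f} months, "
      f"average staff ~ {effort / months:.1f}")
```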
Similar to A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FOR BUILDING DURATION ESTIMATION MODELS: CASE STUDY OF GITHUB (20)
Supermarket Management System Project Report.pdf (Kamal Acharya)
Supermarket management is a stand-alone J2EE program developed using Eclipse Juno. This project contains all the information required to maintain the supermarket billing system. The core idea of this project is to minimize paperwork and centralize the data. All communication is handled in a secure manner: in this application the information is stored on the client itself, and for further security the database is stored in a back-end Oracle instance so that no intruders can access it.
Open Channel Flow: fluid flow with a free surface (Indrajeet sahu)
Open Channel Flow: This topic focuses on fluid flow with a free surface, such as in rivers, canals, and drainage ditches. Key concepts include the classification of flow types (steady vs. unsteady, uniform vs. non-uniform), hydraulic radius, flow resistance, Manning's equation, critical flow conditions, and energy and momentum principles. It also covers flow measurement techniques, gradually varied flow analysis, and the design of open channels. Understanding these principles is vital for effective water resource management and engineering applications.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Build the Next Generation of Apps with the Einstein 1 Platform.
Rejoignez Philippe Ozil pour une session de workshops qui vous guidera à travers les détails de la plateforme Einstein 1, l'importance des données pour la création d'applications d'intelligence artificielle et les différents outils et technologies que Salesforce propose pour vous apporter tous les bénéfices de l'IA.
Generative AI Use cases applications solutions and implementation.pdfmahaffeycheryld
Generative AI solutions encompass a range of capabilities from content creation to complex problem-solving across industries. Implementing generative AI involves identifying specific business needs, developing tailored AI models using techniques like GANs and VAEs, and integrating these models into existing workflows. Data quality and continuous model refinement are crucial for effective implementation. Businesses must also consider ethical implications and ensure transparency in AI decision-making. Generative AI's implementation aims to enhance efficiency, creativity, and innovation by leveraging autonomous generation and sophisticated learning algorithms to meet diverse business challenges.
https://www.leewayhertz.com/generative-ai-use-cases-and-applications/
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...PriyankaKilaniya
Energy efficiency has been important since the latter part of the last century. The main object of this survey is to determine the energy efficiency knowledge among consumers. Two separate districts in Bangladesh are selected to conduct the survey on households and showrooms about the energy and seller also. The survey uses the data to find some regression equations from which it is easy to predict energy efficiency knowledge. The data is analyzed and calculated based on five important criteria. The initial target was to find some factors that help predict a person's energy efficiency knowledge. From the survey, it is found that the energy efficiency awareness among the people of our country is very low. Relationships between household energy use behaviors are estimated using a unique dataset of about 40 households and 20 showrooms in Bangladesh's Chapainawabganj and Bagerhat districts. Knowledge of energy consumption and energy efficiency technology options is found to be associated with household use of energy conservation practices. Household characteristics also influence household energy use behavior. Younger household cohorts are more likely to adopt energy-efficient technologies and energy conservation practices and place primary importance on energy saving for environmental reasons. Education also influences attitudes toward energy conservation in Bangladesh. Low-education households indicate they primarily save electricity for the environment while high-education households indicate they are motivated by environmental concerns.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Software Engineering and Project Management - Software Testing + Agile Method...Prakhyath Rai
Software Testing: A Strategic Approach to Software Testing, Strategic Issues, Test Strategies for Conventional Software, Test Strategies for Object -Oriented Software, Validation Testing, System Testing, The Art of Debugging.
Agile Methodology: Before Agile – Waterfall, Agile Development.
Home security is of paramount importance in today's world, where we rely more on technology, home
security is crucial. Using technology to make homes safer and easier to control from anywhere is
important. Home security is important for the occupant’s safety. In this paper, we came up with a low cost,
AI based model home security system. The system has a user-friendly interface, allowing users to start
model training and face detection with simple keyboard commands. Our goal is to introduce an innovative
home security system using facial recognition technology. Unlike traditional systems, this system trains
and saves images of friends and family members. The system scans this folder to recognize familiar faces
and provides real-time monitoring. If an unfamiliar face is detected, it promptly sends an email alert,
ensuring a proactive response to potential security threats.
Accident detection system project report.pdfKamal Acharya
The Rapid growth of technology and infrastructure has made our lives easier. The
advent of technology has also increased the traffic hazards and the road accidents take place
frequently which causes huge loss of life and property because of the poor emergency facilities.
Many lives could have been saved if emergency service could get accident information and
reach in time. Our project will provide an optimum solution to this draw back. A piezo electric
sensor can be used as a crash or rollover detector of the vehicle during and after a crash. With
signals from a piezo electric sensor, a severe accident can be recognized. According to this
project when a vehicle meets with an accident immediately piezo electric sensor will detect the
signal or if a car rolls over. Then with the help of GSM module and GPS module, the location
will be sent to the emergency contact. Then after conforming the location necessary action will
be taken. If the person meets with a small accident or if there is no serious threat to anyone’s
life, then the alert message can be terminated by the driver by a switch provided in order to
avoid wasting the valuable time of the medical rescue team.
Null Bangalore | Pentesters Approach to AWS IAMDivyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
-Allows a user to pass a specific IAM role to an AWS service (ec2), typically used for service access delegation. Then exploit PassRole Misconfiguration granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FOR BUILDING DURATION ESTIMATION MODELS: CASE STUDY OF GITHUB
International Journal of Software Engineering & Applications (IJSEA), Vol.11, No.6, November 2020. DOI: 10.5121/ijsea.2020.11603
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FOR BUILDING DURATION ESTIMATION MODELS: CASE STUDY OF GITHUB
Donatien K. Moulla¹,², Alain Abran³ and Kolyang⁴
¹ Faculty of Mines and Petroleum Industries, University of Maroua, Cameroon
² LaRI Lab, University of Maroua, Cameroon
³ Department of Software Engineering and Information Technology, École de Technologie Supérieure, 1100, rue Notre-Dame Ouest, Montréal, Canada
⁴ The Higher Teachers’ Training College, University of Maroua, Cameroon
ABSTRACT
Software project estimation is important for allocating resources and planning a reasonable work
schedule. Estimation models are typically built using data from completed projects. While organizations
have their historical data repositories, it is difficult to obtain their collaboration due to privacy and
competitive concerns. To overcome the issue of public access to private data repositories this study
proposes an algorithm to extract sufficient data from the GitHub repository for building duration
estimation models. More specifically, this study extracts and analyses historical data on WordPress
projects to estimate OSS project duration using commits as an independent variable as well as an
improved classification of contributors based on the number of active days for each contributor within a
release period. The results indicate that duration estimation models using data from OSS repositories
perform well and partially solve the problem of lack of data encountered in empirical research in software
engineering.
KEYWORDS
Effort and duration estimation, software project estimation, project data, data extraction algorithm,
GitHub repository
1. INTRODUCTION
Software project estimation has always been a challenge for software engineering communities
[1, 2]. One of the factors that influence the development of software project estimation models is
project data, which is often incomplete and at times inconsistent. The performance of an
estimation model depends on the characteristics of the software project data such as size, missing
values and outliers, as well as the validation techniques used (leave-one-out cross validation,
holdout, n-fold cross validation) and evaluation criteria [1]. Some organizations such as the
International Software Benchmarking Standard Group (ISBSG) and PROMISE provide
worldwide repositories of software projects [3, 4]. These repositories provide data useful for
multiple purposes, including project productivity comparison and estimation of the effort
required to build productivity models [5]. However, different software projects have their own
characteristics, such as application domain, type of system, development platform, type of
language, etc. In addition, data collected may present quality problems due to noise, outliers,
incompleteness, even in data collected by mature organizations [6, 7, 8, 9]. Moreover, it is
considerably more difficult to get both large datasets and diverse datasets. The amount of data
available for building estimation models affects the performance of the generated models,
especially with small datasets where results may not be generalizable.
An OSS source code management repository is an online community platform that makes project
history data available to IT practitioners and researchers by enabling developers to host and
review code, manage projects, share knowledge and work together to build OSS. Such OSS code
repositories do not hold direct information on the total duration and effort of any specific project.
Due to the quantity and diversity of the raw data (such as commits, contributors, LOC, etc.),
extracting and processing the data for either duration or effort estimation is a challenge. These
OSS repositories usually contain the following information:
• Delivered software source code with corresponding dates.
• Activities of the contributors who submitted their codes to a project (commits).
• Meta-data (commit ID, date of commitment, release tag, etc.).
• Contributor performing the commit.
All of this information can be used directly as independent variables, or through derived
variables, to develop duration estimation models.
In this paper, we focus on the design of an algorithm to extract data from the GitHub source code
management repository to build duration estimation models. Four variables are explored as
candidate independent variables:
(1) The number of changes to source code (commits) per contributor in each release.
(2) The number of active days per contributor in each release, i.e., the days in which a
contributor performs at least one commit to the code base.
(3) The number of lines of code added per contributor in each release. We define LOC as the
total number of lines of code, not including blank lines and comments.
(4) Type of contributors, where the type of contributors is an indirect variable defined within
our research methodology.
Data from the WordPress project, extracted from the source code management repository
GitHub, were used to test our algorithm and design the duration estimation models. The rationale
behind selecting the WordPress project is related to its OSS economic model, data availability
and its significance within the Open Source community: WordPress is one of the most widely used
open source content management systems (CMS) in the world, and its economic model is
based on services and hosting.
The paper is structured as follows. Section 2 presents related work on duration and effort
estimation with a focus on project datasets. Section 3 describes the proposed algorithm. Section 4
details the test results. Section 5 concludes with a summary of the key findings, the study
limitations and directions for future work.
2. RELATED WORK
This section presents related work on empirical software engineering datasets and duration and
effort estimation with a focus on OSS projects.
Qi et al. [10] provided a method to collect sufficient data from GitHub for solving the problem of
lack of training data when building an effort estimation model. Although it addresses this problem,
their method is subject to certain threats to validity: they used only the Java open source projects
of GitHub and calculated effort as a derived variable. To identify an active contributor within their
study, they adopted the following definitions:
• Active contributor: in a period of consecutive active days (with a default of 30 days), an
active contributor commits on average at least one record per active day, and their total
contributions exceed the average contributions of the project;
• Inactive contributor: otherwise.
Badashian et al. [11] tracked a variety of metrics, such as commits, pull requests, projects watched,
issue comments, and followers, to study the activities of developers.
Bosu and MacDonell [12] provided a transparent and consistent means of collection and evaluation of
data that could lead to the use of higher quality data in software engineering experiments. They
assessed the quality of 13 empirical software engineering datasets with the aim of benchmarking
them against the data quality taxonomy proposed by Bosu and MacDonell [13] based on three
groups: accuracy, relevance and provenance.
According to Gencel et al. [14], inconsistent results are obtained when software effort estimation
models are built using benchmark repositories for two reasons:
- the lack of common standards and vocabulary;
- the differences in definitions and categories of attributes of the different repositories.
Cheikhi and Abran [15] surveyed the PROMISE and ISBSG repositories with the objective of
making it easier for researchers to understand the data they contain and thus use that data more
readily.
Kocaguneli et al. [16] proposed a solution for importing relevant data from other organizations to
build effort estimation models. They explored cross versus within data “sources”, where the data
is divided by the values of any single feature: within studies are localized to one subset, while a
cross study trains on some subsets and tests on others. The datasets used in their research were the
COCOMO81, nasa93, Desharnais, Finnish, Kemerer and Maxwell datasets. They found that the
cross-source performance results are no worse than the within-source results.
Shihab et al. [17] performed an empirical study on four OSS projects (namely JBoss, Spring,
Hibernate and Mule) and compared the use of LOC to other code variables (in particular code
complexity). To obtain code complexity, they required the source code of the changed files. They
used the list of file names to download the corresponding revisions of files from the JIRA
database. They used correlation tests and linear regression models to investigate the categories of
variables as a substitute for effort. Since effort was available in their dataset, they investigated the
used code complexity and LOC as a substitute measure of effort, a unique feature of this study.
However, in their dataset the effort data was at the “issues” level but the variables used for
statistical analysis were at another level (e.g., at the “files” level). A number of assumptions had
to be made to assign effort across levels (e.g., from issues to files). Doing so, of course,
introduces a number of unknown distortions without the ability quantify them and analyze their
impact on the reported findings. While the authors report that there are correlations between the
indirect effort measures with actual effort, these results are quite low and provide little support
for the relevance of the indirect measures of effort in other related works for OSS effort and
duration estimation purposes.
In summary, of the studies on empirical software engineering datasets and effort and duration
estimation, very few concern OSS projects. The size and choice of the datasets present a challenge for
research on OSS project estimation. The key findings of these specific studies show that there
are datasets that have been used extensively in research on software effort estimation, but most of
them do not contain sufficient information to enable researchers to evaluate their accuracy
(outliers, inconsistency, incompleteness, etc.), relevance (amount of data, heterogeneity and
timeliness) and provenance (trustworthiness, accessibility and commercial sensitivity). In
addition, related OSS work suffers from a lack of data on effort itself: effort must be approximated
indirectly through other variables without knowing their variability and the impact of such
variability on the estimation models themselves. To date, OSS studies for estimating effort are
based on substitute variables that do not rest on reliable methodological bases [17]. As long as
reliable substitutes for effort measures are not available, it is not wise to build estimation models
based on these very weak substitutes. Duration, however, is directly calculable from project
calendar dates, and our research relates exclusively to duration estimation models.
3. PROPOSED ALGORITHM
Data extraction selects, within a project, the number of changes made by each contributor within
a specific period. These changes are related to the total number of commits, LOC added, LOC
removed and active days. The algorithm to generate data from GitHub is presented next.
Algorithm: Extraction of data
Input: Project P
Outputs: Contributors, Commits, LOC added, LOC removed, Active days
Begin
{
1.  Write the range of dates over which data extraction must be done
2.  Read (beginning date, end date)
3.  For this range of dates:
4.      List all contributors who made a change
5.      For each contributor:
6.      {   commit <- total number of changes to source code
7.          insert <- total number of LOC added
8.          delete <- total number of LOC removed
9.          day    <- total number of active days
10.         Write to output (contributor \t commit \t LOC added \t LOC removed \t active days \n)
11.     }
12. End for
}
End
The input of the algorithm is the project P and the outputs are the total number of contributors per
release, their commits, LOC added, LOC removed and active days. The first step is to enter the
beginning date and end date of the release. The second step is to generate the list of contributors
for this release. A temporary file is created for this purpose. Finally, for each contributor, the
algorithm calculates the total number of LOC added, LOC removed, commits and active days.
The algorithm then writes the results to a file. The implemented script for this algorithm is
presented in Appendix A.
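The Appendix A script is not reproduced in this section. As a rough illustration only (the command options, data structures and parsing logic below are our own assumptions, not the authors' code), the following C sketch shows one way such an extraction could be built around git log when run from inside a local clone of the repository:

/*
 * Illustrative sketch: for a given date range, accumulate each contributor's
 * commits, LOC added, LOC removed and active days by parsing the output of
 * `git log --numstat`. Assumes git is on PATH and that the log is emitted in
 * reverse chronological order, which is git's default.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CONTRIB 1024

typedef struct {
    char name[128];
    long commits, loc_added, loc_removed, active_days;
    char last_day[16];               /* last active date seen for this contributor */
} Contrib;

static Contrib tab[MAX_CONTRIB];
static int ntab = 0;

static Contrib *lookup(const char *name)
{
    for (int i = 0; i < ntab; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    if (ntab == MAX_CONTRIB) { fprintf(stderr, "too many contributors\n"); exit(1); }
    strncpy(tab[ntab].name, name, sizeof tab[ntab].name - 1);
    return &tab[ntab++];
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <begin-date> <end-date>\n", argv[0]); return 1; }

    /* One header line per commit ("COMMIT|author|YYYY-MM-DD"), followed by
       numstat lines of the form "added<TAB>removed<TAB>file". */
    char cmd[512];
    snprintf(cmd, sizeof cmd,
             "git log --since=%s --until=%s --date=short "
             "--pretty=format:'COMMIT|%%an|%%ad' --numstat", argv[1], argv[2]);

    FILE *p = popen(cmd, "r");
    if (!p) { perror("popen"); return 1; }

    char line[1024];
    Contrib *cur = NULL;
    while (fgets(line, sizeof line, p)) {
        if (strncmp(line, "COMMIT|", 7) == 0) {
            char *author = strtok(line + 7, "|");
            char *date   = strtok(NULL, "|\n");
            cur = lookup(author);
            cur->commits++;
            if (strcmp(cur->last_day, date) != 0) {        /* a new active day */
                cur->active_days++;
                strncpy(cur->last_day, date, sizeof cur->last_day - 1);
            }
        } else if (cur) {
            long add, del;                                 /* binary files ("-") are skipped */
            if (sscanf(line, "%ld %ld", &add, &del) == 2) {
                cur->loc_added   += add;
                cur->loc_removed += del;
            }
        }
    }
    pclose(p);

    printf("contributor\tcommits\tLOC added\tLOC removed\tactive days\n");
    for (int i = 0; i < ntab; i++)
        printf("%s\t%ld\t%ld\t%ld\t%ld\n", tab[i].name, tab[i].commits,
               tab[i].loc_added, tab[i].loc_removed, tab[i].active_days);
    return 0;
}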
4. TEST RESULTS
4.1. Data Extraction
The proposed algorithm was implemented in C, including GitHub commands.
Table 1 shows a statistical analysis of the historical data on 21 releases of the WordPress project,
from February 2015 to May 2019 based on data extracted on June 23, 2019 from the GitHub
repository using our algorithm. In Table 1, |LOC net| = |LOC added - LOC removed|.
Table 1. WordPress project: descriptive statistics of the 21 releases
Release    # Commits    # LOC added    # LOC removed    |LOC net|
4.2.4 227 7984 3384 4618
4.2.5 1319 28696 20281 8415
4.2.6 225 5435 3062 4409
4.2.7 589 12018 5835 8989
4.2.8 913 23886 6488 17398
4.3 876 12044 6446 5944
4.3.1 988 42575 35829 6746
4.3.2 1100 20861 8760 12825
4.3.3 1599 32842 6207 26635
4.3.4 559 29788 26392 3686
4.4.1 408 4100 2041 2257
4.4.2 518 16154 4823 11335
4.4.3 806 17548 5990 11558
4.5 487 5823 2995 3142
4.6 855 54932 37178 22192
4.7 1108 73845 35945 48848
4.8 718 42282 22272 41862
4.9 1001 126472 47171 126100
5.0 854 125568 105561 124716
5.1 586 921228 668282 920642
5.2 421 145968 81814 145547
4.2. Classification of OSS Contributors
We observed that contributions to OSS projects were not equally distributed across contributors.
A small number of contributors were responsible for the majority of the commits, while most
contributed only a few [18, 19].
improved approach using the total number of active days for each contributor in the release
duration for classifying the contributors as full-time, part-time and occasional.
For the purpose of estimating OSS project duration, we defined the following:
i) Active day for a contributor = a day in which the contributor performs at least one commit
to the code base.
ii) Release duration is calculated from the time required to develop each release, that is,
duration = (end date of the release) – (beginning date of the release).
iii) Approach for classifying contributors:
a) Occasional: any contributor who has been active at most one third (1/3) of the release
duration.
b) Part-time: any contributor who has been active more than one third (1/3) and at most two
thirds (2/3) of the release duration.
c) Full-time: any contributor who has been active more than two thirds (2/3) of the release
duration.
After identifying the different contributors per category in each release according to the above
definitions, the total number of commits per category of contributors per release is obtained by
adding all the commits of the different contributors for each category. For example, when a
category of full-time contributors consists of three contributors who made 10, 20
and 30 commits over 8, 13 and 16 active days respectively, the maximum duration for this
category of contributors is 16 days and the total number of commits for this category of
contributors in this release is 60. In summary, the maximum duration by category of contributors
for this release corresponds to the largest number of active days.
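As a small illustration of these rules (a sketch with assumed names, not code from the paper), the classification thresholds and the per-category aggregation from the example above can be expressed as follows:

/* Illustrative sketch: the 1/3 and 2/3 classification thresholds defined above,
 * plus the per-category aggregation used in the worked example (10, 20 and 30
 * commits over 8, 13 and 16 active days give 60 commits and a 16-day duration). */
#include <stdio.h>

typedef enum { OCCASIONAL, PART_TIME, FULL_TIME } ContributorType;

/* active_days: days with at least one commit; release_days: release duration */
ContributorType classify(int active_days, int release_days)
{
    double ratio = (double)active_days / release_days;
    if (ratio <= 1.0 / 3.0) return OCCASIONAL;     /* at most one third of the release */
    if (ratio <= 2.0 / 3.0) return PART_TIME;      /* more than 1/3 and at most 2/3 */
    return FULL_TIME;                              /* more than two thirds */
}

int main(void)
{
    int commits[]     = {10, 20, 30};              /* values from the worked example */
    int active_days[] = {8, 13, 16};
    int total_commits = 0, category_duration = 0;

    for (int i = 0; i < 3; i++) {
        total_commits += commits[i];
        if (active_days[i] > category_duration)
            category_duration = active_days[i];    /* largest number of active days */
    }
    printf("commits = %d, duration = %d active days\n", total_commits, category_duration);
    return 0;
}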
4.3. Estimation Models with Commits by Categories of Contributors
We investigated the contribution of the number of commits in the estimation of the duration of
OSS releases using linear regression since linear regression models are widely used in the
software estimation literature [20, 21, 22, 23]. In addition, linear regression is well suited to
models that use a single independent variable, and its findings are easy to represent graphically
and for management to understand.
Three linear regression models were constructed using the total number of commits by category
of contributors for each release as the independent variable. The three models correspond to the
duration estimation models for full-time, part-time and occasional contributors of the WordPress
project.
4.3.1. Data Preparation and Descriptive Statistics
Three major steps are recommended for building estimation models based on historical data: data
preparation, application of statistical tools and data analysis. Models built using simple
parametric regression techniques require [3]:
• A normal distribution of the input variables (for both the dependent variable (e.g., project
duration) and independent variable (e.g., number of commits));
• No outlier that may unduly influence the model;
• A large enough dataset (e.g., typically 20 to 30 data points for each independent variable
included in the model).
Here, we focus on the quality of the data, not on its quantity. In order to build good models, we
must have good input values: that is the rationale to remove outliers for all categories of
contributors to avoid potentially biasing the findings (see the discussion on outliers in chapter 5
of [3]). The total number of commits per category of contributors per release (excluding outliers)
and the maximum duration per category of contributors per release, where duration is expressed
in calendar days (workdays), are given in the appendix. There were full-time contributors in 10
releases, part-time contributors in 15 releases and occasional contributors in 21 releases.
According to the definitions in the research approach section, there were:
• No full-time contributors in releases 4.2.6, 4.2.8, 4.3.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1 and
5.2.
• No part-time contributors in releases 4.6, 4.8, 4.9, 5.0, 5.1 and 5.2.
To analyze for the presence of outliers by category of contributors in these new subsets (commits
per category of contributors and duration per category of contributors), we applied the Grubbs
test on both the independent variable ‘total commits per category of contributors per release’ and
the dependent variable ‘maximum duration per category of contributors per release’:
• There were no statistical outliers for total commits per category of full-time, part-time and
occasional contributors.
• There was one statistical outlier for maximum duration per category of occasional
contributors in release 4.8.
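For reference, the Grubbs test flags the observation farthest from the sample mean, measured in units of the sample standard deviation, as a potential outlier. The sketch below is illustrative only; the duration values and the tabulated critical value are examples we supply, not data from the study:

/* Illustrative sketch: compute the Grubbs statistic G = max|x_i - mean| / s and
 * compare it with a critical value taken from standard Grubbs-test tables. */
#include <stdio.h>
#include <math.h>

/* Returns G; *suspect is set to the index of the most extreme observation. */
double grubbs_statistic(const double *x, int n, int *suspect)
{
    double mean = 0.0, s = 0.0, gmax = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) s += (x[i] - mean) * (x[i] - mean);
    s = sqrt(s / (n - 1));                         /* sample standard deviation */
    *suspect = 0;
    for (int i = 0; i < n; i++) {
        double g = fabs(x[i] - mean) / s;
        if (g > gmax) { gmax = g; *suspect = i; }
    }
    return gmax;
}

int main(void)
{
    /* Hypothetical maximum durations (workdays); for illustration only. */
    double duration[] = {30, 28, 35, 31, 90, 29, 33};
    int n = 7, idx;
    double g = grubbs_statistic(duration, n, &idx);
    /* The two-sided 5% critical value for n = 7 is roughly 2.02 (from tables). */
    printf("G = %.3f, most extreme point = %.0f %s\n", g, duration[idx],
           g > 2.02 ? "(flagged as an outlier)" : "(no outlier)");
    return 0;
}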
4.3.2. Estimation Models
Table 2 from [24] presents the duration estimation models built on commits per category of
contributors for each release and the performance criteria for each model, respectively, for full-
time, part-time and occasional contributors:
• The MMRE for full-time, part-time and occasional contributors were 23%, 20% and 49%,
respectively.
• In terms of R², MMRE and PRED (25%), the duration estimation model for part-time
contributors was better compared to the models for full-time and occasional contributors.
For instance, the PRED value obtained for part-time contributors is excellent (PRED (25%)
= 93%), which means that the MRE was within ±25% for 93% of the data points.
It is to be noted that the ‘x’ variable in the equations of Table 2 represents the number of
commits. When extrapolations are made from 1/3 and 2/3 of the estimated duration, respectively,
for the models for occasional and part-time contributors, the estimates are good compared to the
model for full-time contributors [24].
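To make the construction of these models concrete, the sketch below fits duration = a * commits + b by ordinary least squares and then computes MMRE and PRED(25%), the criteria reported above; the function names and the (commits, duration) pairs are illustrative assumptions, not the study's data or tooling:

/* Illustrative sketch: ordinary least squares fit of duration on commits,
 * evaluated with MMRE and PRED(25%). */
#include <stdio.h>
#include <math.h>

void fit_ols(const double *x, const double *y, int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i]; }
    *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);    /* slope */
    *b = (sy - *a * sx) / n;                           /* intercept */
}

void evaluate(const double *x, const double *y, int n, double a, double b,
              double *mmre, double *pred25)
{
    double sum_mre = 0; int within = 0;
    for (int i = 0; i < n; i++) {
        double est = a * x[i] + b;
        double mre = fabs(y[i] - est) / y[i];          /* magnitude of relative error */
        sum_mre += mre;
        if (mre <= 0.25) within++;                     /* estimate within +/- 25% */
    }
    *mmre = sum_mre / n;
    *pred25 = (double)within / n;
}

int main(void)
{
    /* Hypothetical (commits, duration in workdays) pairs, for illustration only. */
    double commits[]  = {200, 400, 600, 800, 1000, 1200};
    double duration[] = {18, 22, 23, 27, 29, 32};
    int n = 6;
    double a, b, mmre, pred25;

    fit_ols(commits, duration, n, &a, &b);
    evaluate(commits, duration, n, a, b, &mmre, &pred25);
    printf("duration = %.4f * commits + %.2f, MMRE = %.0f%%, PRED(25%%) = %.0f%%\n",
           a, b, 100 * mmre, 100 * pred25);
    return 0;
}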
4.4. Discussion of the Estimation Models based on ‘Commits’ and Industrial Datasets
Table 3 presents the industrial datasets used to compare the duration estimation models with
regression techniques. The rationale behind selecting these datasets is related to their significance
within the empirical software engineering research: the COCOMO81, Desharnais and Kemerer
datasets have been the most widely used in software effort estimation [25].
Table 3. Descriptive statistics of the WordPress and industrial datasets
Dataset      Number of records   Size (independent variable)   Measure unit   Attribute measured (dependent variable)
Desharnais   81                  Function points (Adjusted)    Month          Duration
Kemerer      15                  KSLOC                         Month          Duration
COCOMO81     63                  KLOC                          Month          Duration
WordPress    21                  Commits                       Workday        Duration
To date, studies estimating effort based on substitute variables do not rest on reliable
methodological bases [17] and, as long as reliable substitutes for effort measures are not
available, it is not wise to build estimation models based on these very weak substitutes.
However, for duration estimation this variable is directly calculable from direct data of calendar
dates, and our research relates exclusively to duration estimation models.
Table 4 summarizes the performance criteria of the duration estimation models built with the
Desharnais [26], Kemerer [27] and COCOMO81 [28] datasets and of the duration estimation
models built with the WordPress dataset.
Table 4. Performance of duration estimation models based on WordPress and industrial datasets
Dataset      Number of records    Model equation           MMRE (%)   PRED (25%)   R²
Desharnais   81                   y = 0.0293x + 2.8687     49         47           0.51
Kemerer      15                   y = 0.014x + 11.648      53         40           0.06
COCOMO81     63                   y = 0.0696x + 14.73      55         37           0.51
WordPress    10 (full-time)       y = 0.0151x + 14         23         70           0.36
WordPress    15 (part-time)       y = 0.0223x + 7          20         93           0.77
WordPress    20 (occasional)      y = 0.0392x + 2          49         20           0.60
In Table 4 the highlighted numbers indicate the most accurate estimates among the four types of
datasets. The duration estimation models with the WordPress OSS datasets using ‘commits’ as
the independent variable gave better estimates than duration models built from Desharnais,
Kemerer and COCOMO81 datasets, with lower MMRE and higher PRED(25%) and R².
5. CONCLUSION
One of the factors which influence software project estimation is project data, which is often
incomplete and inconsistent. This paper proposes an algorithm to extract relevant data from the
GitHub repository for building duration estimation models. Our algorithm written in C includes
GitHub commands. Data from WordPress projects, extracted from the source code management
repository GitHub, were used to test our algorithm and design the duration estimation models.
Statistical regression techniques were used to construct duration estimation models using as the
independent variable the number of commits per categories of contributors. We proposed an
improved approach which uses the total number of active days for each contributor in the release
duration for classifying the contributors as full-time, part-time and occasional. The duration
estimation models with the WordPress OSS datasets using ‘commits’ as the independent variable
gave better estimates than duration estimation models built from Desharnais, Kemerer and
COCOMO81 datasets. The results show that the duration estimation models built from data
collected from Open Source repositories perform as well as those built with the datasets most
widely used in software duration estimation. All our duration estimation models can be
considered white-box models since all the detailed data used to build them are documented and
available for independent replication studies.
All empirical studies, such as those reported here, are subject to certain threats to validity. The
recommended size of the dataset sample required for meaningful statistical regression results is
between 20 to 30 data points. In this study, the sample size used to construct the duration
estimation models per category of contributors, with commits as the independent variable, was
relatively small (21 data points). We proposed criteria for an objective classification to categorize
contributors as full-time, part-time or occasional. However, some assumptions have been made in
the design of our duration estimation models. For instance, we assumed that the time spent by
each contributor in a release corresponds to the number of days in which the contributor performs
at least one commit to the code base.
We plan to build duration estimation models using different statistical techniques and analyze
additional OSS projects. In future work, we plan to use some additional factors as criteria to
collect data, such as application domain, type of system, development platform, type of language,
etc. We also plan to take into account outliers for a more in-depth analysis. Subsequent research
could also explore other variables for the estimation of project duration such as algorithm
complexity, team experience and code quality, but all these variables together will only help to
predict the part of the duration not yet explained by the commits.
REFERENCES
[1] A. Idri, M. Hosni, A. Abran, “Systematic Literature Review of Ensemble Effort Estimation”, Journal
of Systems & Software, doi: 10.1016/j.jss.2016.05.016, 2016.
[2] S.K. Sehra, Y.S. Brar, N. Kaur, S.S. Sehra, “Research Patterns and Trends in Software Effort
Estimation”, Information and Software Technology, doi: 10.1016/j.infsof.2017.06.002, 2017.
[3] https://www.isbsg.org/
[4] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, “The promise repository
of empirical software engineering data,” 2012. [Online]. Available:
http://promise.site.uottawa.ca/SERepository/
[5] A. Abran, Software Project Estimation: The Fundamentals for Providing High Quality Information to
Decision Makers. John Wiley & Sons Inc., Hoboken, New Jersey, 2015.
[6] M. Fernández-Diego, F.G.-L. De-Guevara, “Potential and limitations of the ISBSG dataset in
enhancing software engineering research: A mapping review”, Information and Software Technology,
56 (6), 527–544. doi:10.1016/j.infsof.2014.01.003, 2014.
[7] D. Gray, D. Bowes, N. Davey, Y. Sun, and B. Christianson, “Reflections on the NASA MDP data
sets”, IET Software, 6(6), 549–558. DOI:10.1049/iet-sen.2011.0132, 2012.
[8] M.J. Shepperd, Q. Song, Z. Sun, and C. Mair, “Data quality: Some comments on the NASA software
defect datasets”, Software Engineering. IEEE Trans. Softw. Eng. 39, 9, 1208–1215.
DOI:10.1109/TSE.2013.11, 2013.
[9] K. Deng and S.G. MacDonell, “Maximising data retention from the ISBSG repository”, 12th
International Conference on Evaluation and Assessment in Software Engineering. Italy. 2008.
[10] F. Qi, X.Y. Jing, X. Zhu, et al., “Software effort estimation based on open source projects: Case
study of Github”, Information and Software Technology, 92: 145-157, 2017.
[11] A.S. Badashian, A. Esteki, A. Gholipour, A. Hindle, and E. Stroulia, “Involvement, contribution and
influence in github and stackoverflow”, 24th Annual International Conference on Computer Science
and Software Engineering. Markham, Ontario, 2014.
[12] M.F. Bosu and S.G. MacDonell, “Experience: Quality Benchmarking of Datasets Used in Software
Effort Estimation”, Journal of Data and Information Quality, vol. 11, no. 4, Article 19, 38 pages, 2019.
https://doi.org/10.1145/3328746
[13] M.F. Bosu and S.G. MacDonell, “A taxonomy of data quality challenges in empirical software
engineering”, 22nd Australian Conference on Software Engineering, 97–106, 2013,
DOI:10.1109/ASWEC.2013.21.
[14] C. Gencel, L. Buglione, and A. Abran, “Improvement opportunities and suggestions for
benchmarking”, Software Process and Product Measurement, Springer, Berlin (Germany), 144–156,
2009.
[15] L. Cheikhi and A. Abran, “Promise and ISBSG software engineering data repositories: A survey”, In
Joint Conference of the 23rd International Workshop on Software Measurement and the 8th
International Conference on Software Process and Product Measurement, 17–24, 2013,
DOI:10.1109/IWSM-Mensura.2013.13
[16] E. Kocaguneli, T. Menzies, “How to find relevant data for effort estimation?”, International
Symposium on Empirical Software Engineering and Measurement, Banff, Canada, 2011, pp. 255–
264. doi:10.1109/ESEM.2011.34.
[17] E. Shihab, Y. Kamei, B. Adams, et al., “Is lines of code a good measure of effort in effort-aware
models?”, Information and Software Technology, 55(11): 1981-1993, 2013.
[18] S. Koch, “Effort Modeling and Programmer Participation in Open Source Software
Projects”, Information Economics and Policy, 20(4), pp.345-355, 2008.
[19] G. Robles, J.M. González-Barahona, C. Cervigón, A. Capiluppi, D. Izquierdo-Cortázar, “Estimating
Development Effort in Free/Open Source Software Projects by Mining Software Repositories: A
Case Study of OpenStack”, 11th Working Conference on Mining Software Repositories, May 31 –
June 01 2014, pp.222-231.
[20] S. Chatterjee and A.S. Hadi, Regression analysis by example, John Wiley & Sons, 2015.
[21] J. Fernandez-Ramil, D. Izquierdo-Cortazar, and T. Mens, “What does it take to develop a million
lines of Open Source code?”, In: Boldyreff C., Crowston K., Lundell B., Wasserman A.I. (eds) Open
Source Ecosystems: Diverse Communities Interacting. OSS 2009. IFIP Advances in Information and
Communication Technology, Springer, Berlin, Heidelberg, vol 299, pp. 170-184, 2009.
[22] M. Jorgensen, “Regression models of software development effort estimation accuracy and bias”,
Empirical Software Engineering, vol. 9, pp. 297–314, 2004.
[23] D.C. Montgomery, E. A. Peck, G. G. Vining, Introduction to Linear Regression Analysis, 5th ed.
Wiley, Hoboken, NJ. 2012.
[24] D.K. Moulla, A. Abran and Kolyang, “Duration Estimation Models for Open Source Software
Projects”, International Journal of Information Technology and Computer Science (IJITCS), Vol.12,
No.5, In production.
[25] C. Mair, M.J. Shepperd and M. Jørgensen, “An analysis of data sets used to train and validate cost
prediction systems”, Workshop on Predictor Models in Software Engineering (PROMISE’05).1–6,
2005. DOI:10.1145/1083165.1083166.
[26] J.M. Desharnais, Analyse statistique de la productivité des projets informatiques à partir de la
technique des points de fonction, Master's thesis, Université de Montréal (Canada), 1989.
[27] C.F. Kemerer, “An empirical validation of software cost estimation models”, Communications of
the ACM, Vol.30, No.5, pp.416–429, 1987.
[28] J. Sayyad Shirabad, T.J. Menzies, “PROMISE Repository of Software Engineering Databases”,
School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online].
Available: http://promise.site.uottawa.ca/SERepository.
AUTHORS
Dr. Donatien K. Moulla is currently a Lecturer in computer science at the Faculty of
Mines and Petroleum Industries, University of Maroua, Cameroon. He received his PhD
in Computer Science from the University of Ngaoundéré, Cameroon. He earned his
Master’s and Bachelor’s degrees from the Department of Mathematics and Computer
Science in the same University. He has more than eight years of teaching and research
experience in information systems development and software engineering. He is also a
member of the COSMIC International Advisory Council. His research interests include
software estimation, software measurement, software quality, automation of COSMIC functional size
measurement, and Open Source Software projects.
Alain Abran, PhD, is a Professor at École de technologie supérieure, Université du
Québec, Canada. He is also Chairman of the Common Software Measurement
International Consortium. He was the international secretary for ISO/IEC JTC1 SC7. Dr.
Abran has over 20 years of industry experience in information systems development and
software engineering.
Prof. Dr.-Ing. habil. Kolyang is a Professor at the Higher Teachers’ Training College,
University of Maroua, Cameroon. His research areas include Software Engineering, E-learning,
and ICT for Development.