The document discusses Stephen W. Thomas's research on using information retrieval (IR) models to mine software repositories. It lists several of his publications that (1) propose using IR models for software engineering tasks like test case prioritization and analyzing code-mail interactions, (2) evaluate topic evolution models on code histories, (3) combine multiple IR models, (4) address the data duplication problem, and (5) analyze the effects of preprocessing and IR model parameters.
The document proposes using semantic similarity between API usage patterns and code under editing to improve automatic code completion. It discusses extracting API usage patterns and representing them and code in a semantic model. An experimental design is presented that would use semantic queries to find similar patterns for code completion suggestions, evaluated based on precision, recall and F-score. The expected results are that using semantic features rather than just syntax would provide more accurate and context-aware suggestions.
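The proposed precision/recall/F-score evaluation can be sketched by treating the completion suggestions and the ground-truth relevant items as sets; a minimal illustration (the function and the example API names are invented, not from the proposal):

```python
def precision_recall_f1(suggested, relevant):
    """Evaluate a set of completion suggestions against the relevant ones."""
    suggested, relevant = set(suggested), set(relevant)
    tp = len(suggested & relevant)  # suggestions that were actually relevant
    precision = tp / len(suggested) if suggested else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical run: 3 of 4 suggestions match the 6 relevant API calls.
p, r, f = precision_recall_f1(["open", "read", "close", "seek"],
                              ["open", "read", "close", "write", "flush", "tell"])
```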
Pg. 03 Question One, Assignment 1 - Deadline Monday (ssuser562afc1)

This document contains instructions for a network management assignment with 3 questions. Question 1 asks to differentiate between modern telecommunication and data communication networks in 6-8 lines. Question 2 provides a scenario and asks to write ASN.1 descriptions for different record values. Question 3 asks to provide the Object Identifiers for various nodes like interfaces and snmp according to the given Internet MIB-II group diagram. The assignment is due on February 15, 2021.
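For Question 3, the MIB-II group nodes all sit under the mib-2 subtree defined in RFC 1213, so the requested Object Identifiers follow a fixed pattern; a small lookup sketch (the helper function itself is illustrative):

```python
# Standard MIB-II group OIDs under mib-2 = 1.3.6.1.2.1 (RFC 1213).
MIB2 = "1.3.6.1.2.1"
GROUPS = {
    "system": 1, "interfaces": 2, "at": 3, "ip": 4, "icmp": 5,
    "tcp": 6, "udp": 7, "egp": 8, "transmission": 10, "snmp": 11,
}

def oid(group):
    """Return the full dotted OID for a MIB-II group node."""
    return f"{MIB2}.{GROUPS[group]}"
```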
‘CodeAliker’ - Plagiarism Detection on the Cloud (acijjournal)
Plagiarism is a pressing problem that academics face at every level of the educational system, and the advent of digital content has amplified the challenge of ensuring the integrity of academic work. This paper works toward a precise definition of plagiarized computer code, surveys the available solutions for detecting plagiarism, and describes the construction of a cloud platform for plagiarism detection.
‘CodeAliker’, the application developed in this work, automates the submission of assignments and the associated review process for both essay text and computer code. It has been released under the GNU General Public License as Free and Open Source Software.
This document proposes a new metric for measuring code readability and compares it to existing metrics. It describes collecting rules for readability from software engineers and developing a formula that incorporates these rules. A prototype application was created to apply the new metric and existing metrics (ARI, FOG, SMOG) to code samples. The results of the new metric were compared to readability percentages provided by 50 software engineers for the same samples, and were found to closely match. The new metric provides an automated way to measure code readability.
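The existing prose-readability baselines the prototype applies have published formulas; for instance, ARI is 4.71*(characters/words) + 0.5*(words/sentences) - 21.43. A rough sketch with naive word and sentence splitting (the paper's own new metric formula is not reproduced here):

```python
import re

def ari(text):
    """Automated Readability Index over plain text:
    4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w.strip(".,!?;:")) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```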
This is the slideshow I used to present my M.S. thesis proposal, which is tentatively titled "Planning Messages in Sequence Diagrams and Analyzing the Consistency of Use Cases and Class Diagrams Automatically using Design by Contract."
IRJET - QUEZARD: Question Wizard using Machine Learning and Artificial Intell... (IRJET Journal)
The document describes a proposed system called QUEZARD that uses machine learning and artificial intelligence to generate questions from documents. It consists of an Android application to scan documents and extract text, a machine learning platform to analyze the text and generate possible questions, and a voice assistant to answer user questions. The system aims to help both students by providing practice questions and teachers by suggesting new questions to ask. It extracts key elements from sentences using part-of-speech tagging to form question-answer pairs from documents.
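A toy stand-in for the described extraction of question-answer pairs, using a single "X is Y" pattern rather than full part-of-speech tagging (the function and example sentence are hypothetical, not from the paper):

```python
import re

def make_question(sentence):
    """Turn a simple 'X is Y.' declarative into a (question, answer) pair.
    A minimal sketch of the sentence-to-question idea described above."""
    m = re.match(r"(?P<subj>.+?)\s+(?P<verb>is|are|was|were)\s+(?P<rest>.+?)\.?$",
                 sentence.strip())
    if not m:
        return None  # sentence does not fit the toy pattern
    question = f"What {m['verb']} {m['subj']}?"
    return question, m['rest']
```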
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS (IJNSA Journal)
Email is the most common Internet service for communication and sending documents, used not only from computers but also from many other electronic devices such as tablets and smartphones. Email can also be used for criminal activities. Email forensics refers to the study of email details and content as evidence, to identify the actual sender and recipient of a message, the date and time of transmission, a detailed record of the email transaction, the intent of the sender, and so on. It involves investigating metadata, keyword searching, port scanning, and generating reports based on the investigator's needs. Many tools are available for investigations that involve email forensics, and investigators must be careful not to violate users' privacy; to this end, they should run keyword searches that reveal only the relevant emails. Knowledge of a tool's features, and of its search features in particular, is therefore necessary for tool selection. In this research, we experimentally compare the performance of several email forensics tools, aiming to help investigators with the tool selection task. We evaluate the tools in terms of their keyword search, report generation, and other features such as supported email formats, accepted file sizes, whether they work online or offline, and report formats. We use the Enron email dataset for our experiment.
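The privacy-preserving keyword search the paper recommends can be sketched as a simple filter over subject and body; the mailbox entries below are invented (the real experiment used the Enron dataset):

```python
def keyword_search(emails, keywords):
    """Return only the emails whose subject or body contains any keyword,
    so irrelevant mail never reaches the investigator."""
    keywords = [k.lower() for k in keywords]
    return [m for m in emails
            if any(k in (m["subject"] + " " + m["body"]).lower()
                   for k in keywords)]

# Hypothetical mailbox entries for illustration only.
mailbox = [
    {"subject": "Lunch plans", "body": "Noon at the cafe?"},
    {"subject": "Q3 forecast", "body": "Revised trading numbers attached."},
]
hits = keyword_search(mailbox, ["trading", "forecast"])
```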
This presentation summarizes a simple phone book program using linked lists and file handling data structures. Key points include:
- Linked lists and files were used to store and manage contact data in a dynamic way without pre-allocating memory.
- Files allow data to be stored in a non-volatile, reusable form that is portable between systems.
- The program includes functions for loading data from a file into a linked list, validating user input, adding/finding/modifying/deleting records, and writing the linked list data back to the file.
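The structure described above can be sketched compactly in Python rather than the presentation's original language; the class names, file format, and operations shown are assumptions for illustration:

```python
class Node:
    """One contact record in the singly linked list."""
    def __init__(self, name, phone):
        self.name, self.phone, self.next = name, phone, None

class PhoneBook:
    """Linked list of contacts, persisted to a plain-text file."""
    def __init__(self):
        self.head = None

    def add(self, name, phone):
        # Insert at the front: O(1), no pre-allocated array needed.
        node = Node(name, phone)
        node.next, self.head = self.head, node

    def find(self, name):
        cur = self.head
        while cur:
            if cur.name == name:
                return cur.phone
            cur = cur.next
        return None

    def save(self, path):
        # Write the linked list back to the file, one record per line.
        with open(path, "w") as f:
            cur = self.head
            while cur:
                f.write(f"{cur.name},{cur.phone}\n")
                cur = cur.next

    def load(self, path):
        # Rebuild the linked list from the file.
        with open(path) as f:
            for line in f:
                name, phone = line.rstrip("\n").split(",", 1)
                self.add(name, phone)
```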
Mineograph is a comprehensive online mining software that aims to empower and optimize mining operations through increased productivity, cost reduction, and improved planning. The software provides modules for inventory management, human resource management, purchase management, project management, operations management, and fleet management. It utilizes telematics data from GPS, IoT, and OBD devices to provide real-time tracking of assets, production, and operations. The dashboard and business intelligence features allow users to analyze performance, costs, and deviations to better manage the business.
Mining the Modern Code Review Repositories: A Dataset of People, Process and Product (Norihiro Yoshida)
Slides for the data paper "Mining the Modern Code Review Repositories: A Dataset of People, Process and Product" in the proceedings of the 13th International Conference on Mining Software Repositories (MSR 2016), Austin, TX, May 2016.
This document compares several open source tools that can be used for data science. It provides background on key concepts in data science like data mining, machine learning, predictive analytics and business intelligence. It also discusses techniques commonly used by data scientists like clustering, classification, regression etc. The document then reviews popular open source data science tools like Orange, RapidMiner, KNIME, Weka and R and compares their key features based on techniques covered in the EMC Data Science Associate certification. It finds that these tools provide capabilities for common data science techniques at no cost, making them suitable alternatives to expensive proprietary software, especially for small organizations.
Mining Software Archives to Support Software Development (Thomas Zimmermann)
1. The document discusses mining software archives and repositories to help guide software developers and predict defects.
2. It describes the eROSE tool which mines past associations between code changes to suggest related files and locations to developers.
3. The BugCache model predicts future defects based on the hypothesis that defects are temporally local, with a cache that loads elements likely to have defects.
The document discusses model comparison approaches for delta-compression. It describes comparing models at the element level by matching elements between models and identifying differences. It also discusses representation of differences for compression purposes and experiments comparing EMF Compare and EMF Compress on reverse engineered models from Git repositories.
An Empirical Study of Goto in C Code from GitHub Repositories (SAIL_QU)
Developers still use goto statements in practice despite arguments against them. This study analyzed over 11,000 GitHub projects and found goto statements in around 11% of C files. Goto statements were primarily used for error handling and cleanup. The study also analyzed commit histories of 6 projects and found that developers rarely remove or modify goto statements, even when fixing post-release bugs. This suggests that while goto statements have drawbacks, developers still find them useful for certain tasks like error handling.
[How We Use Data] Using Data to Improve Online Services - 김진영, Data Scientist at Microsoft (Dylan Ko)
Slides for the second talk of 'How We Use Data' (우리가 데이터를 쓰는 법), a startup data seminar hosted by Gonnector CEO 고영혁.
Seminar: How We Use Data
Date: Tuesday, April 12, 2016, 10:00-18:00
Venue: Maru180, B1 Think Hall
Title: Using Data to Improve Online Services
Speaker: 김진영, Data Scientist, Microsoft
Mining public datasets using opensource tools: Zeppelin, Spark and Juju (seoul_engineer)
Plenty of public datasets are available, and their number keeps growing. This talk showcases a few of the most recent and useful tools in the Big Data ecosystem: Apache Zeppelin (incubating), Apache Spark, and Juju.
Software Defect Prediction on Unlabeled Datasets (Sung Kim)
The document describes techniques for software defect prediction when labeled training data is unavailable. It proposes Transfer Defect Learning (TCA+) to improve cross-project defect prediction by normalizing data distributions between source and target projects. For projects with heterogeneous metrics, it introduces Heterogeneous Defect Prediction (HDP) which matches similar metrics between source and target to build cross-project prediction models. It also discusses CLAMI for defect prediction using only unlabeled data without human effort. The techniques are evaluated on open source projects to demonstrate their effectiveness compared to traditional cross-project and within-project prediction.
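TCA+ selects among several options for normalizing source- and target-project data onto a common scale; the simplest option, column-wise z-score normalization, can be sketched as follows (a sketch of one option only, not the paper's selection rules):

```python
def zscore(rows):
    """Column-wise z-score normalization of a metric matrix."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Guard against zero variance by falling back to a divisor of 1.0.
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]

# Hypothetical source-project metrics (e.g., LOC and churn per module),
# normalized before training a cross-project defect model.
source = zscore([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
```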
The document summarizes a dissertation defense about adaptive bug prediction by analyzing project history. It discusses the motivation for leveraging project history and software configuration management data for bug prediction. It also describes creating a corpus by identifying bug-fix changes and bug-introducing changes from commits. The dissertation proposes using a "bug cache" to predict likely locations of future bugs based on past bug occurrences.
The document discusses mining software repositories, including what it is, common repositories, conferences and journals of interest, tools for mining repositories, and datasets available for mining. It defines mining software repositories as analyzing rich data in repositories to uncover interesting information about software systems and projects. Popular repositories include version control systems and bug tracking systems. Conferences of interest are MSR, ICSE, and ICSM, and journals include IEEE TSE, EMSE, and JSS. Tools and datasets discussed include those from Libresoft, the University of Waterloo, Debian, Eclipse, and PROMISE.
Introduce Deep Learning & A.I. Applications (Mario Cho)
This document provides an overview of deep machine learning and its applications. It introduces Mario Cho and his experience with image recognition, medical data processing, and open source software development. The document then discusses neural networks, deep learning techniques, and how deep learning can be applied to areas like image recognition, natural language processing, and autonomous vehicles. Examples of deep learning applications include image captioning, machine translation, and generating synthetic images from sketches.
This document summarizes several papers related to applying design patterns to improve software performance. It discusses how patterns can be used to remove antipatterns that negatively impact performance, such as "empty semi-trucks" where there are an excessive number of requests. The document also covers case studies on how different patterns like facade and command impact the performance of web applications. Additional papers discuss automated tools for detecting antipatterns in models and refactoring software designs to improve performance.
Studying Software Quality Using Topic Models (SAIL_QU)
The document discusses using topic models to study software quality by capturing concerns as topics. It finds that only a few topics are defect-prone, that topics can explain defects better than static metrics, and that less-tested topics are more defect-prone; poorly tested, highly defect-prone topics can be predicted accurately, which helps allocate testing resources.
The document presents a software bug prediction model. It aims to build a resilient bug prediction model through simulation on open source issue trackers such as Jira and Bugzilla, and to conduct a comparative study of the new model against existing competitive models. The model will draw on data from software repositories, bug reports, and code artifacts to predict bugs. Open source projects such as Eclipse, Mozilla, and Android will be used for the simulations, and data mining tools such as WEKA and RapidMiner will be used extensively. The model also aims to facilitate code refactoring to improve software maintenance activities such as modification and enhancement. Literature in the areas of bug prediction and code refactoring will be surveyed. The research will be conducted in
With the rise of software systems ranging from personal assistants to national infrastructure, software defects have become an increasingly critical concern, as they can cost millions of dollars and impact human lives. Yet at the breakneck pace of rapid software development settings (such as the DevOps paradigm), today's Quality Assurance (QA) practices remain time-consuming. Continuous analytics for software quality (i.e., defect prediction models) can help development teams prioritize their QA resources and chart a better quality improvement plan, avoiding the past pitfalls that lead to future software defects. Because specialists are needed to design and configure a large number of options (e.g., data quality, data preprocessing, classification techniques, interpretation techniques), a set of practical guidelines for developing accurate and interpretable defect models has not yet been well developed.
The ultimate goal of my research is to (1) provide practical guidelines on how to develop accurate and interpretable defect models for non-specialists; (2) develop an intelligible defect model that offers suggestions on how to improve both software quality and processes; and (3) integrate defect models into the real-world practice of rapid development cycles such as CI/CD settings. The project is expected to deliver significant benefits, including reduced software defects and operating costs, while accelerating development productivity for building software systems in many of Australia's critical domains such as Smart Cities and e-Health.
TUW - 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced... (Hong-Linh Truong)
This presentation is part of the course "184.742 Advanced Services Engineering" at The Vienna University of Technology, in Winter Semester 2012. Check the course at: http://www.infosys.tuwien.ac.at/teaching/courses/ase/
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING (IJCI JOURNAL)
This document summarizes a research paper that proposes using a combination of Natural Language Processing and statistical models to match features between different datasets. Specifically, it uses BERT (Bidirectional Encoder Representations from Transformers), a pretrained NLP model, in parallel with Jaccard similarity to measure similarity between feature lists. The hybrid approach reduces time required for manual feature matching compared to previous methods. The paper describes preprocessing data, generating embeddings with BERT, calculating similarity scores with BERT and Jaccard, and outputting top matches above a threshold. It provides example results matching house sales and movie metadata features. The hybrid approach leverages strengths of BERT's semantic understanding and Jaccard's flexibility for special characters.
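The Jaccard half of the hybrid approach can be sketched directly by tokenizing feature names and comparing token sets; the tokenization rule below (split on non-alphanumerics) is an assumption, and the BERT half is omitted because it needs a pretrained model:

```python
import re

def jaccard(a, b):
    """Jaccard similarity between two feature names, compared as
    lowercase token sets split on non-alphanumeric characters."""
    ta = set(re.split(r"[^0-9a-z]+", a.lower())) - {""}
    tb = set(re.split(r"[^0-9a-z]+", b.lower())) - {""}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```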
(Structural) Feature Interactions for Variability-Intensive Systems Testing (Gilles Perrouin)
Presentation given in the "short talks" session in the Dagstuhl seminar 14281 on "Feature Interactions - the Next Generation" , Schloss Dagstuhl, Germany, July 2014.
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo... (IRJET Journal)
This document proposes a methodology to automatically assign topics to unlabeled datasets using topic modeling techniques. It applies latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) with term frequency-inverse document frequency (TF-IDF) weighting to product reviews to generate topics. Word similarities are used to cluster words for each topic. Sentiment analysis and word clouds are also used to gain insights. The methodology successfully converts unlabeled to labeled data and provides automatic topic labeling to facilitate further research and opportunity discovery.
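The TF-IDF weighting step in that pipeline can be sketched in plain Python, hand-rolled rather than as a library call; the review tokens are invented for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a pre-tokenized corpus; the resulting
    weighted vectors are what LDA/NMF would factorize into topics."""
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    n = len(docs)
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return out

# Toy product reviews, already tokenized.
reviews = [["great", "battery", "life"],
           ["battery", "died", "fast"],
           ["great", "screen"]]
weights = tfidf(reviews)
```

Terms that appear in every document get an IDF of log(1) = 0, so corpus-wide filler words drop out automatically.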
May 2024: Top 10 Read Articles in Software Engineering & Applications Interna... (sebastianku31)
Call for papers: International Journal of Software Engineering & Applications (IJSEA)
ISSN: 0975-3834 [Online]; 0975-4679 [Print]
ERA Indexed, H-Index 31
Web page URL: https://airccse.org/journal/ijsea/ijsea.html
Submission URL: https://airccse.com/submissioncs/home.html
Contact us: ijseajournal@airccse.org or ijsea@aircconline.com
May 2024 Top 10 Read Articles posted at: https://www.academia.edu/119977684/April_2024_Top_10_Read_Articles_in_Software_Engineering_and_Applications_International_Journal_of_Software_Engineering_and_Applications_IJSEA_ERA_Indexed
This document provides a 50-hour roadmap for building large language model (LLM) applications. It introduces key concepts like text-based and image-based generative AI models, encoder-decoder models, attention mechanisms, and transformers. It then covers topics like intro to image generation, generative AI applications, embeddings, attention mechanisms, transformers, vector databases, semantic search, prompt engineering, fine-tuning foundation models, orchestration frameworks, autonomous agents, bias and fairness, and recommended LLM application projects. The document recommends several hands-on exercises and lists upcoming bootcamp dates and locations for learning to build LLM applications.
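The semantic-search topic in the roadmap reduces to cosine similarity over embedding vectors; a self-contained sketch with toy three-dimensional vectors (real embeddings would come from an embedding model, and a real vector database would index them for scale):

```python
def cosine(u, v):
    """Cosine similarity, the core ranking operation behind
    embedding-based semantic search."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_search(query_vec, index, top_k=2):
    """Rank stored (id, vector) pairs by similarity to the query embedding."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy "embeddings": doc_a and doc_b point in nearly the same direction.
index = [("doc_a", [1.0, 0.0, 0.0]),
         ("doc_b", [0.9, 0.1, 0.0]),
         ("doc_c", [0.0, 1.0, 0.0])]
```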
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED... (IJCSEA Journal)
This document summarizes a research paper that proposes a method for dynamically measuring coupling in distributed object-oriented software systems. The method involves three steps: instrumentation of the Java Virtual Machine to trace method calls, post-processing of the trace files to merge information, and calculation of coupling metrics based on the dynamic traces. The implementation results show that the proposed approach can effectively measure coupling metrics dynamically by accounting for polymorphism and dynamic binding, overcoming limitations of traditional static coupling analysis.
Software metrics are increasingly playing a central role in the planning and control of software development projects. Coupling measures have important applications in software development and maintenance. Existing literature on software metrics is mainly focused on centralized systems, while work in the area of distributed systems, particularly service-oriented systems, is scarce. Distributed systems with service-oriented components operate in an even more heterogeneous networking and execution environment. Traditional coupling measures take into account only “static” couplings. They do not account for “dynamic” couplings due to polymorphism and may significantly underestimate the complexity of software and misjudge the need for code inspection, testing, and debugging. This is expected to result in poor predictive accuracy of quality models in distributed object-oriented systems that rely on static coupling measurements. To overcome these issues, we propose a hybrid model for measuring coupling dynamically in distributed object-oriented software. The proposed method has three steps: instrumentation, post-processing, and coupling measurement. First, the instrumentation step is performed using a JVM that has been modified to trace method calls; during this step, three trace files are created (.prf, .clp, and .svp). In the second step, the information in these files is merged. At the end of this step, the merged detailed trace of each JVM contains pointers to the merged trace files of the other JVMs, so that the path of every remote call from client to server can be uniquely identified. Finally, the coupling metrics are measured dynamically. The implementation results show that the proposed system effectively measures the coupling metrics dynamically.
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago
Join us as Tao Xie, Professor and Willett Faculty Scholar in the Department of Computer Science at the University of Illinois at Urbana-Champaign and ACM Distinguished Speaker, talks about Intelligent Software Engineering: Synergy between AI and Software Engineering. This is a joint meeting hosted by Chicago Chapter ACM / Loyola University Computer Science Department.
Intelligent Software Engineering: Synergy between AI and Software EngineeringTao Xie
This document discusses the synergy between artificial intelligence and software engineering. It begins with an overview of intelligent software engineering and how AI techniques can be applied to software engineering problems. Specific examples discussed include using dynamic symbolic execution for automated test generation for binary code, .NET code, and mobile app code. The document also discusses using machine learning for software analytics, testing, and natural language interfacing for IDEs. Open challenges in the field of intelligent software engineering are mentioned at the end.
Tim Menzies, directions in Data ScienceCS, NcState
- Over the past 10 years, the PROMISE repository has collected hundreds of software engineering datasets and enabled thousands of papers and journal special issues on data mining in software engineering.
- Issues raised by work with the PROMISE data include learning for novel problems, understanding locality and trust of models, enabling privacy while exploiting data locality, and developing goal-oriented analysis and human skill amplifiers for data science.
- Lessons learned indicate that variance in results is due more to the analysts doing the work than choices around learners, data, or features, highlighting the importance of understanding human factors in data science.
This thesis explores how non-functional requirements (NFRs) can drive software architecture design. It proposes a model-driven development framework that fully integrates NFRs. Empirical studies show that architects consider NFRs just as important as functional requirements when making decisions. The thesis also presents Arteon, an ontology for architectural knowledge; Quark, a method for decision-making based on NFRs; and ArchiTech, a tool that implements Arteon and Quark.
Not Only Statements: The Role of Textual Analysis in Software QualityRocco Oliveto
My keynote at the 2012 Workshop on Mining Unstructured Data (co-located with the 10th Working Conference on Reverse Engineering - WCRE'12). Kingston, Ontario, Canada. October 17th, 2012.
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIRJET Journal
This document describes a study on converting images to text to speech using machine learning. The researchers developed a system that uses optical character recognition to extract text from images, then converts the text to speech. They achieved over 99% accuracy on their test dataset of over 1 million images. Their integrated system was able to accurately extract and convert text from various real-world images like street signs and menus. The system has potential to improve accessibility for people with visual impairments by allowing printed information to be converted to audio. Future work includes handling lower quality images and expanding the system to support additional languages and applications.
Similar to Mining Unstructured Software Repositories Using IR Models (20)
Studying the Integration Practices and the Evolution of Ad Libraries in the G...SAIL_QU
In-app advertisements have become a major source of revenue for app developers in the mobile app economy. Ad libraries play an integral part in this ecosystem, as app developers integrate these libraries into their apps to display ads. However, little is known about how app developers integrate these libraries into their apps and how these libraries have evolved over time.
In this thesis, we study the ad library integration practices and the evolution of such libraries. To understand the integration practices of ad libraries, we manually study apps and derive a set of rules to automatically identify four strategies for integrating
multiple ad libraries. We observe that integrating multiple ad libraries commonly occurs in apps with a large number of downloads and ones in categories with a high percentage of apps that display ads. We also observe that app developers prefer to manage their own integrations instead of using off the shelf features of ad libraries for integrating multiple ad libraries.
To study the evolution of ad libraries, we conduct a longitudinal study of the 8 most popular ad libraries. In particular, we look at their evolution in terms of size, the main drivers for releasing a new ad library version, and their architecture. We observe that ad libraries are continuously evolving with a median release interval of 34 days. Some ad libraries have grown exponentially in size (e.g., Facebook Audience Network ad library), while other libraries have worked to reduce their size. To study the main drivers for releasing an ad library version, we manually study the release notes of the eight studied ad libraries. We observe that ad library developers continuously update their ad libraries to support a wider range of Android versions (i.e., to ensure that more devices can use the libraries without errors). Finally, we derive a reference architecture for ad libraries and study how the studied ad libraries diverged from this architecture during our study period.
Our findings can assist ad library developers to understand the challenges for developing ad libraries and the desired features of these libraries.
Improving the testing efficiency of selenium-based load testsSAIL_QU
Slides for a paper published at AST 2019:
Shahnaz M. Shariff, Heng Li, Cor-Paul Bezemer, Ahmed E. Hassan, Thanh H. D. Nguyen, and Parminder Flora. 2019. Improving the testing efficiency of selenium-based load tests. In Proceedings of the 14th International Workshop on Automation of Software Test (AST '19). IEEE Press, Piscataway, NJ, USA, 14-20. DOI: https://doi.org/10.1109/AST.2019.00008
Studying User-Developer Interactions Through the Distribution and Reviewing M...SAIL_QU
This document discusses studying user-developer interactions through the distribution and reviewing mechanisms of the Google Play Store. It analyzes emergency updates made by developers to fix issues, the dialogue between users and developers through reviews and responses, and how the reviewing mechanism can help identify good and bad updates. The study found that responding to reviews is six times more likely to increase an app's rating, with 84% of rating increases going to four or five stars. Three common patterns of developer responses were identified: responding to negative or long reviews, only negative reviews, and reviews shortly after an update.
Studying online distribution platforms for games through the mining of data f...SAIL_QU
Our studies of Steam platform data provided insights into online game distribution:
1) Urgent game updates were used to fix crashes, balance issues, and functionality; frequent updaters released more 0-day patches.
2) The Early Access model attracted indie developers and increased game participation; reviews were more positive during Early Access.
3) Game reviews were typically short and in English; sales increased review volume more than new updates; negative reviews came after longer play.
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...SAIL_QU
This study analyzed factors that impact the speed of questions receiving accepted answers on four popular Stack Exchange websites: Stack Overflow, Mathematics, Ask Ubuntu, and Super User. The researchers examined question, answerer, asker, and answer factors from over 150,000 questions. They built classification models and found that key factors for fast answers included the past speed of answerers, length of the question, and past speed of answers for the question's tags. The models achieved AUCs of 0.85-0.95. Fast answers relied heavily on answerers, especially frequent answerers. The study suggests improving incentives for non-frequent and more difficult questions to attract diverse answerers.
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...SAIL_QU
Selenium is a popular tool for browser-based automation testing. The author analyzes challenges in using Selenium by mining Selenium questions on Stack Overflow. Programming language-related questions, especially for Java and Python, are most common and growing fastest. Less than half of questions receive accepted answers, and questions about browsers and components take longest. In the second part, the author develops an approach to improve efficiency of Selenium-based load testing by sharing browsers among user instances. This increases the number of error-free users by 20-22% while reducing memory usage.
Mining Development Knowledge to Understand and Support Software Logging Pract...SAIL_QU
This document summarizes Heng Li's PhD thesis on mining development knowledge to understand and support software logging practices. It discusses how logging code is used to record runtime information but can be difficult for developers to maintain. The thesis aims to understand current logging practices and develop tools by mining change history, source code, issue reports, and other development knowledge. It presents research that analyzes logging-related issues to identify developers' logging concerns, uses code topics and structure to predict where logging statements should be added, leverages code changes to suggest when logging code needs updating, and applies machine learning models to recommend appropriate log levels.
Which Log Level Should Developers Choose For a New Logging Statement?SAIL_QU
The document discusses choosing an appropriate log level when adding a new logging statement. It finds that an ordinal regression model can effectively model log levels, achieving an AUC of 0.76-0.81 in within-project evaluation and 0.71-0.8 in cross-project evaluation. The most influential factors for determining log levels vary between projects and include metrics related to the logging statement, containing code block, and file as well as code change and historical change metrics.
Towards Just-in-Time Suggestions for Log ChangesSAIL_QU
The document presents a study on providing just-in-time suggestions for log changes when developers make code changes. The researchers analyzed over 32,000 log changes from 4 systems. They found 20 reasons for log changes that fall into 4 categories: block changes, log improvements, dependence-driven changes, and logging issues. A random forest classifier using 25 software metrics related to code changes, history, and complexity achieved 0.84-0.91 AUC in predicting whether a log change is needed. Change metrics and product metrics were the most influential factors. The study aims to help developers make better logging decisions for failure diagnosis.
The Impact of Task Granularity on Co-evolution AnalysesSAIL_QU
The document discusses how task granularity at different levels (e.g. commits, pull requests, work items) can impact analyses of co-evolution in software projects. It finds that analyzing at the commit level can overlook relationships between tasks that span multiple commits. Work-item-level analysis is recommended to provide a more complete view of co-evolution: a median of 29% of work items consist of multiple commits, and analyzing at the commit level would miss 24% of co-changed files and fail to group 83% of related commits.
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...SAIL_QU
1) Initial bug fix discussions with more comments and more developers participating are more likely to experience later bug reworking through re-opening or re-patching of the bug.
2) Manual analysis found that defective initial fixes and failure to reach consensus in discussions contributed to later reworking.
3) For re-opened bugs, initial discussions focused on addressing a particular problem through a burst of comments, while re-patched bugs lacked thorough code review and testing during the initial fix period.
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...SAIL_QU
This study examined the relationship between mobile device attributes and user-perceived quality of Android apps. The researchers analyzed 150,373 star ratings from Google Play across 30 devices and 280 apps. They found that the perceived quality of apps varies across devices, and having better characteristics of an attribute does not necessarily correlate with higher quality. Device OS version, resolution, and CPU showed significant relationships with ratings, as did some app attributes like lines of code and number of inputs. However, some device attributes had stronger relationships than app attributes.
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...SAIL_QU
This document presents the results of a large-scale study on the impact of feature selection techniques on defect classification models. The study used expanded scopes including multiple datasets from NASA and PROMISE with different feature types, more classification techniques from different paradigms, and additional feature selection techniques. The results show that correlation-based feature subset selection techniques like FS1 and FS2 consistently appear in the top ranks across most of the datasets, projects within the datasets, and classification techniques. The document concludes that future defect classification studies should consider applying correlation-based feature selection techniques.
Studying the Dialogue Between Users and Developers of Free Apps in the Google...SAIL_QU
The study analyzes user-developer interactions through reviews and responses on the Google Play Store. It finds that responding to reviews has a significant positive impact, with 84% of rating increases due to the developer addressing the issue or providing guidance. Three common response patterns were identified: only negative reviews, negative or longer reviews, and reviews shortly after an update. Developers most often thank the user, ask for details, provide guidance, or ask for an endorsement. Guidance responses can address common issues through FAQs. The analysis considered over 2,000 apps, 355,000 review changes, 128,000 responses, and 4 million reviews.
What Do Programmers Know about Software Energy Consumption?SAIL_QU
This document summarizes the results of a survey of 122 programmers about their knowledge of software energy consumption. The survey found that programmers have limited awareness of energy consumption and how to reduce it. They were unaware of the main causes of high energy usage. Programmers lacked knowledge about how to properly rank the energy consumption of different hardware components and were unfamiliar with strategies to improve efficiency, such as minimizing I/O and avoiding polling. The study concludes that programmers would benefit from more education on software energy usage and its causes.
Revisiting the Experimental Design Choices for Approaches for the Automated R...SAIL_QU
Prior research on automated duplicate issue report retrieval focused on improving performance metrics like recall rate. The author revisits experimental design choices from four perspectives: needed effort, data changes, data filtration, and evaluation process.
The thesis contributions are: 1) Showing the importance of considering needed effort in performance measurement. 2) Proposing a "realistic evaluation" approach and analyzing prior findings with it. 3) Developing a genetic algorithm to filter old issue reports and improve performance. 4) Highlighting the impact of "just-in-time" features on evaluation. The findings help better understand benefits and limitations of prior work in this area.
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsSAIL_QU
The document summarizes a large-scale field study that tracked the program comprehension activities of 78 professional developers over 3,148 hours. The study found that:
1) Program comprehension accounted for approximately 58% of developers' time on average, with navigation and editing making up the remaining portions.
2) Developers frequently used web browsers and document editors to aid comprehension beyond just IDEs.
3) Interviews and observations revealed that insufficient documentation, unclear code, and complex inheritance hierarchies contributed to long comprehension sessions.
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppGoogle
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
Hand Rolled Applicative User ValidationCode KataPhilip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and ready exercise to reinforce your muscle-memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
What is Augmented Reality Image Trackingpavan998932
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Looking for a reliable mobile app development company in Noida? Look no further than Drona Infotech. We specialize in creating customized apps for your business needs.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows seven steps methods for delivering their services to their customers. They called it the Software development life cycle process (SDLC).
Requirement — collecting the requirements is the first phase of the SDLC process.
Feasibility Study — after the requirements are collected, a feasibility study is carried out.
Design — in this phase, they start designing the software.
Coding — when the design is complete, the developers start coding the software.
Testing — once the coding of the software is done, the testing team starts testing.
Installation — after testing is complete, the application is deployed to the live server and launched!
Maintenance — after delivery, the software is maintained while customers use it.
Workshop - Innovating with Generative AI and Knowledge GraphsNeo4j
Go beyond the hype around AI and discover practical techniques for using AI responsibly with your organization's data. Explore how knowledge graphs can be used to increase accuracy, transparency, and explainability in generative AI systems. You will leave with hands-on experience combining data relationships and LLMs to bring domain-specific context and improve reasoning.
Bring your laptop and we will guide you through setting up your own generative AI stack, with practical, coded examples to get you started in minutes.
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Crescat
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
2. 2
Stephen W. Thomas
Mining Software Repositories with Topic Models.
ICSE 2011
Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein
Static Test Case Prioritization Using Topic Models.
Empirical Software Engineering, 2012
Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein
Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity.
Empirical Software Engineering, 2nd round
Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein
The Impact of Classifier Configuration and Classifier Combination on Bug Localization.
IEEE Transactions on Software Engineering, 2nd round
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Validating the Use of Topic Models for Software Evolution.
SCAM 2010
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Modeling the Evolution of Topics in Source Code Histories.
MSR 2011
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Studying Software Evolution Using Topic Models.
Science of Computer Programming, 2012
9. 9
The research and practice of using IR models to mine software repositories can be improved by
(i) considering additional software engineering tasks, such as prioritizing test cases;
(ii) using advanced IR techniques, such as combining multiple IR models; and
(iii) better understanding the assumptions and parameters of IR models.
10. Test Case Prioritization
Part 1 [EMSE 2012]
(Slide diagram: test cases, represented by their identifiers, comments, and string literals, are compared by similarity; less similar test cases receive higher priority. The slide contrasts structural-based and IR-based prioritization.)
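The IR-based prioritization idea on this slide can be sketched in a few lines of Python. This is only a rough illustration, not the technique from the EMSE 2012 paper: it treats each test case's identifiers, comments, and string literals as a bag of words and greedily orders tests so the least similar one comes next. The test names and bodies are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Bag-of-words over a test case's identifiers, comments, and string literals."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prioritize(tests):
    """Greedy diversity ordering: repeatedly pick the test that is least
    similar to the tests already selected."""
    bags = {name: bag_of_words(body) for name, body in tests.items()}
    ordered = [next(iter(bags))]  # seed with an arbitrary (first) test
    remaining = set(bags) - set(ordered)
    while remaining:
        pick = min(remaining,
                   key=lambda t: max(cosine(bags[t], bags[s]) for s in ordered))
        ordered.append(pick)
        remaining.remove(pick)
    return ordered

tests = {"t1": "parse xml document node",
         "t2": "parse xml element tree",
         "t3": "open gui window button"}
print(prioritize(tests))  # ['t1', 't3', 't2'] -- the GUI test jumps ahead
```

A real implementation would use a proper tokenizer and an IR model such as VSM or LDA rather than raw whitespace splitting.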
11. Source Code ↔ Email Interaction
Part 1 [EMSE 20XX]
(Slide diagram: mail and code in XML form are cleaned and preprocessed into identifiers, comments, and string literals, then related over time through shared topics such as XML, printing, installation, and GUI. Observed activity patterns include monitoring project status, software explanation, and training and documentation.)
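One crude way to picture this mail-code linking: connect an email to a code document when their preprocessed word sets overlap. The Jaccard threshold and every name below are illustrative stand-ins; the actual work relates discussions and code through IR topic models, not raw word overlap.

```python
def jaccard(a, b):
    """Word-set overlap between two documents."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link_mail_to_code(emails, code_docs, threshold=0.1):
    """Link an email to a code document when their preprocessed word
    sets (identifiers, comments, string literals) overlap enough."""
    links = []
    for mail_id, mail_words in emails.items():
        for code_id, code_words in code_docs.items():
            if jaccard(mail_words, code_words) >= threshold:
                links.append((mail_id, code_id))
    return links

emails = {"m1": ["xml", "parser", "bug"]}
code = {"Parser.java": ["xml", "parser", "node"],
        "Gui.java": ["button", "window"]}
print(link_mail_to_code(emails, code))  # [('m1', 'Parser.java')]
```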
13. Combining Multiple IR Models
Part 2 [TSE 20XX]
(Slide diagram: a bug report's title and description are matched by similarity against the code's identifiers, comments, and string literals. Randomly chosen subsets of IR models, combined, had improved performance over the best individual IR model; the slide reports the median improvement.)
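A minimal sketch of combining IR models at the score level: min-max normalize each model's per-document similarity scores, then sum them. This is one simple fusion scheme, not necessarily the framework from the TSE paper, and the model names and scores below are made up.

```python
def combine_rankings(model_scores, weights=None):
    """Fuse per-document similarity scores from several IR models:
    min-max normalize each model's scores, then take a weighted sum."""
    combined = {}
    for model, scores in model_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0          # avoid division by zero
        w = (weights or {}).get(model, 1.0)
        for doc, score in scores.items():
            combined[doc] = combined.get(doc, 0.0) + w * (score - lo) / span
    return sorted(combined, key=combined.get, reverse=True)  # best first

vsm = {"a.java": 0.9, "b.java": 0.2, "c.java": 0.1}  # e.g. Vector Space Model
lsi = {"a.java": 0.3, "b.java": 0.8, "c.java": 0.1}  # e.g. LSI
print(combine_rankings({"VSM": vsm, "LSI": lsi}))  # ['a.java', 'b.java', 'c.java']
```

The normalization step matters: without it, a model that emits larger raw scores would dominate the combination.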
14. Concept Evolution Models
Part 2 [SCP 2012] [SCAM 2010]
(Slide diagram: the popularity of concepts such as XML, Swing, and encryption, extracted from identifiers, comments, and string literals, is plotted over time; the accuracy of the resulting topic evolutions is evaluated.)
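The popularity curves on this slide can be approximated by tracking how often a concept's words appear in each version of the code. This is a crude frequency proxy; the cited papers use topic evolution models, and the version data below is invented.

```python
from collections import Counter

def concept_popularity(versions, concept_words):
    """For each version's token list, return the fraction of tokens
    that belong to the concept's word set."""
    history = []
    for tokens in versions:
        counts = Counter(tokens)
        total = sum(counts.values()) or 1   # guard against empty versions
        hits = sum(counts[w] for w in concept_words)
        history.append(hits / total)
    return history

versions = [["xml", "parse", "xml", "gui"],    # version 1
            ["gui", "gui", "xml", "button"]]   # version 2
print(concept_popularity(versions, {"xml"}))  # [0.5, 0.25] -- XML fades
```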
17. Preprocessing and Parameter Effects
Part 3 [TSE 20XX]
Code representation: identifiers? comments? past bug reports?
Bug report representation: title? description?
Preprocessing: split identifiers? remove stop words? word stemming?
IR model parameters: term weighting? number of topics? similarity measure? number of iterations?
Configuration matters! (Slide chart: worst, mean, and best performance across configurations.)
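The preprocessing choices listed on this slide (identifier splitting, stop-word removal, stemming) can be sketched as a tiny pipeline. The stop-word list and suffix-stripping stemmer here are deliberately crude placeholders; a real configuration would use, for example, a Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def split_identifier(token):
    """Split camelCase and snake_case identifiers into words:
    'parseXmlFile' -> ['parse', 'xml', 'file']."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return spaced.lower().split()

def naive_stem(word):
    """Crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(source_text):
    """One possible configuration: tokenize, split identifiers,
    drop stop words, then stem."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text)
    words = [w for tok in tokens for w in split_identifier(tok)]
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(preprocess("parseXmlFile the readers"))  # ['parse', 'xml', 'file', 'read']
```

Each of these steps is a configuration knob, which is exactly the slide's point: turning any of them on or off changes what the IR model sees.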
18. New!
1
2
3
18
Part
Part
Part
Proposed and evaluated a technique to prioritize test cases
Proposed and evaluated a technique to analyze the interaction of source code and mailing lists
Described and evaluated a technique to analyze code histories using topic evolution models
Proposed and evaluated a frameworkforcombining the results of disparate IR models
Overcame the data duplication problem in large source code histories
Analyzed the sensitivity of IR models to data preprocessing and IR model parameters
Editor's Notes
This diagram describes the field of Mining Software Repositories. The overall goal is to take software repositories (which are readily-available datasets about a software project, such as [list a few]), apply data mining and machine learning techniques, and come out with actionable knowledge that will help developers in some way. For example: bug prediction, traceability linking, feature location, …
In current research, the majority of the repositories that are mined are structured: call graphs, parse trees, execution logs;
However, there are also many repositories that are unstructured: [name them]
In fact, research has shown that about 80% of the content in software repositories is unstructured, meaning that we need to consider this data if we want to take full advantage of the software repositories.
However, unstructured data brings with it many challenges. Consider these two seemingly-innocent bug reports from one of my case studies.
Here we see many difficulties, such as undefined acronyms; spelling errors and typos; inconsistent usages; no labels, vague wording.
These problems exist because most unstructured data comes in the form of natural language text written by humans, which is notoriously difficult for a computer to deal with.
In an attempt to deal with unstructured software repositories, researchers have begun to use IR. IR models come from the NLP community and are a good fit for our problem because they were designed to handle many of the problems of unstructured data. IR models help you search, organize, and provide structure for your unstructured data.
IR models use a simplifying assumption of the data, called the “bag of words” approach. This means that word order is not considered in IR models. By ignoring word order, analysis is simpler and faster, and the techniques can scale to large datasets. And we demonstrate that despite this simplifying assumption, IR models actually perform quite well in many scenarios.
Initial successes: concept location; document clustering; new code metrics; code search engines; traceability linking
To understand how IR models have been used in MSR, I did a thorough literature review of all papers that use IR models to mine unstructured data. In all, there are about 67 papers. I analyzed the trends and common usages, and found three shortcomings of the state-of-the-art, i.e., some areas where we could improve. My thesis is the proposal of solutions to each of these three shortcomings.
First shortcoming: most papers that use IR models only perform one of two software engineering tasks: concept location, and traceability linking. There’s nothing wrong with these applications, but I propose that we can go beyond these two tasks and use IR models to perform new SE tasks, and help software developers even further.
Second shortcoming: most papers use only the most basic IR models, such as the Vector Space Model (1975, 37 years ago). I propose that we use some of the more advanced, more powerful IR techniques, which may bring better results and new capabilities to software developers.
Third shortcoming: most papers use IR models as off-the-shelf black boxes, without fully understanding how their parameters work, what input is required, and what the output means. I propose that we develop a better understanding of how IR models work, which will allow us to take full advantage of their potential, and improve results for software developers.
My thesis statement has a parallel structure: [read]
In TCP, the goal is to take an unordered set of test cases and provide an ordering such that more bugs are detected earlier in the testing process. By doing so, if the test suite must be stopped early, you can rest assured that you have detected as many bugs as possible.
Typically, TCP is tackled by using some sort of structural code coverage metric, that says: hey, how much code does this test case execute? If it executes a lot of code, then let’s give it a high priority. Otherwise, let’s give it a low priority. This is how it’s traditionally done.
However, I propose that we can use IR models to solve the same problem, only with the additional advantage of not having to run the test case to collect the execution information. Here’s how.
First, we extract the unstructured information from the source code: identifier names, comments, and string literals.
Then, we compute the IR similarity between each pair of test cases. This will tell us if the test cases are textually similar or not.
Then, if a test case is not very similar to other test cases, we give it a higher priority. The thought here is: if two test cases are exactly the same, then they will find the same bugs, so we don’t need to execute both. So we’re looking for test cases that are highly unlike any other test case, because it will detect unique bugs.
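The prioritization idea above can be sketched as follows (a minimal illustration with invented test-case text; this is not the thesis's actual implementation):

```python
from collections import Counter
import math

def tokens(text):
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prioritize(test_cases):
    """Order test cases so the least textually similar ones come first."""
    bags = {name: tokens(text) for name, text in test_cases.items()}
    # Score each test case by its maximum similarity to any other one;
    # a low score means it is textually unlike the rest of the suite.
    def max_sim(name):
        return max((cosine(bags[name], bags[o]) for o in bags if o != name),
                   default=0.0)
    return sorted(test_cases, key=max_sim)

# Hypothetical text extracted from identifiers, comments, string literals.
suite = {
    "testLoginA": "login user password session check",
    "testLoginB": "login user password session verify",
    "testExport": "export csv report file writer",
}
print(prioritize(suite))  # testExport comes first: most unlike the others
```

Note that no test case needs to be executed to compute this ordering, which is the advantage over coverage-based prioritization.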
We did a case study on 5 real-world systems, and found that our IR-based approach was as good as or better than existing approaches for prioritizing test cases.
The first advanced technique I propose is that of combining multiple IR models.
Let me explain this in the context of bug localization. […]
A simple way to combine models is to just add the scores of each file from the various IR models. That way, if a file gets a high score in several models, it will shoot up to the top in the combined model. Another way is expert voting, where only the rank of each file is used, as opposed to the score. Either way, the end goal is to utilize the “expertise” of each model.
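Both combination schemes can be sketched in a few lines (illustrative only; the file names and scores are invented):

```python
def combine_by_score(model_scores):
    """Sum each file's similarity scores across all IR models."""
    combined = {}
    for scores in model_scores:
        for f, s in scores.items():
            combined[f] = combined.get(f, 0.0) + s
    return sorted(combined, key=combined.get, reverse=True)

def combine_by_rank(model_scores):
    """Expert voting: use only each model's ranking, not its raw scores."""
    votes = {}
    for scores in model_scores:
        ranking = sorted(scores, key=scores.get, reverse=True)
        for rank, f in enumerate(ranking):
            votes[f] = votes.get(f, 0) + rank  # lower total rank = better
    return sorted(votes, key=votes.get)

# Hypothetical scores from two IR models for three source files.
vsm = {"Parser.java": 0.9, "Lexer.java": 0.4, "Util.java": 0.1}
lsi = {"Parser.java": 0.7, "Lexer.java": 0.8, "Util.java": 0.2}
print(combine_by_score([vsm, lsi]))  # Parser.java ranked first
```

A file scoring highly in several models rises to the top of the combined ranking, which is the "expertise of each model" intuition.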
If a manager or developer had a dashboard that magically told them what developers were working on, and when, at a high level, they would be very happy. This would keep them informed, allow them to perform retrospective analysis, and maybe even be part of a preemptive maintenance solution that automatically monitored the “health” of the source code over time.
To achieve this goal, we use an advanced IR model called a topic evolution model. It works by [explain]
We input these versions into an advanced IR model, called a topic evolution model, which gives us exactly what we’re looking for.
A case study found that a large majority of the discovered evolutions were in sync with how developers described the project, and since the technique is automatic, it would be well suited to an automatic dashboard setting.
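As a rough sketch of what a topic evolution model reports (a toy with pre-assigned topic labels; a real topic evolution model would first infer the topics from the text of each version):

```python
# Toy sketch: track each topic's popularity over a series of versions.
def topic_popularity(versions):
    """For each version, compute the fraction of documents on each topic."""
    history = []
    for docs in versions:  # docs: list of topic labels, one per document
        counts = {}
        for t in docs:
            counts[t] = counts.get(t, 0) + 1
        history.append({t: c / len(docs) for t, c in counts.items()})
    return history

# Hypothetical data: the "XML" concept grows over three versions.
versions = [
    ["GUI", "GUI", "XML", "install"],
    ["GUI", "XML", "XML", "install"],
    ["XML", "XML", "XML", "GUI"],
]
for i, pop in enumerate(topic_popularity(versions), 1):
    print(f"v{i} XML popularity: {pop.get('XML', 0.0):.2f}")
```

Plotting such a popularity series per topic is what yields the concept evolution curves shown on the slide.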
During my research, I came across an issue which I now call the “data duplication problem”.
When I tried to analyze the evolution of long-lived systems with many different versions, I found that the IR model was producing unusual and unexpected results. Things just didn’t make sense: the topics were weird, and something was off, but I didn’t know what.
Upon further analysis, I learned that the cause of this problem was that in source code, hardly any of the words change between versions. A new version typically contains some bug fixes and some new features, but these only affect at most 1% of the lines of code, meaning that 99% of the data is exactly the same. It’s identical. This was throwing the IR models out of balance, and causing the problems that we experienced.
The reason is, IR models weren’t originally designed for source code. They were designed for newspaper articles or books. So version 1 here might contain all the newspaper articles in January, and version 2 contains all the newspaper articles in February. Sure, there might be some overlap, but in general we do not expect that 99% of the articles in February are exact duplicates from January. I believe that someone would be fired from the newspaper if this happened.
So I proposed a model that better handles the data duplication inherent to source code. Basically, it inputs only the differences between versions into the IR model. This keeps everything in balance because it meets the implicit assumptions made by the IR model. Our case studies showed that results are better when the duplication is removed.
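The diff-based idea can be sketched with Python's standard difflib (a toy illustration; the code lines are invented):

```python
import difflib

def added_lines(old_version, new_version):
    """Keep only lines added since the previous version, discarding the
    (often ~99%) duplicated content before it reaches the IR model."""
    diff = difflib.ndiff(old_version, new_version)
    return [line[2:] for line in diff if line.startswith("+ ")]

v1 = ["int parse(String xml)", "void render()", "void close()"]
v2 = ["int parse(String xml)", "void render()", "void close()",
      "void encrypt(byte[] data)"]  # one new method in the new version
print(added_lines(v1, v2))  # only the new line is fed to the IR model
```

Feeding only these deltas keeps the corpus consistent with the IR model's implicit assumption that documents are not near-duplicates of one another.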
Another way to better understand IR models is to understand their parameters and configurations. IR models have a lot of dials, knobs, and switches that you can tweak. For example, …
Currently, researchers don’t focus on these parameters, and just seem to randomly choose settings without fully understanding the associated consequences.
To better understand the parameters, we ran a large empirical case study. We had 8,000 bug reports, and we ran each of them through 3,168 IR model configurations. We found that there is a HUGE difference in performance between the various configurations. For example, the worst IR model configuration could achieve only 1% accuracy; the best could get as high as 55%; and the mean was 23%. So the range was quite big, as was the variance.
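A configuration sweep of this kind can be sketched as follows (a hypothetical, much smaller parameter grid than the study's 3,168 configurations):

```python
import itertools

# Hypothetical parameter grid, in the spirit of the study's configurations.
grid = {
    "term_weighting": ["tf", "tf-idf", "boolean"],
    "stemming": [True, False],
    "split_identifiers": [True, False],
    "num_topics": [32, 64, 128, 256],
}

def configurations(grid):
    """Enumerate every combination of IR model parameter settings."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(grid))
print(len(configs))  # 3 * 2 * 2 * 4 = 48 configurations to evaluate
```

Each configuration would then be evaluated against the bug reports, and the accuracy distribution across configurations reveals how much the choice of settings matters.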
In addition, in this study we were able to determine which configurations were best, so that researchers, tool vendors, and developers could use these when building their own IR-based solutions.
Let me conclude by summarizing the main contributions of this thesis.
First, I proposed new applications of IR models in SE: TCP, and measuring the interaction of email and source code.
I also proposed that we start using more advanced IR techniques in our work, such as topic evolution models and model combination.
Finally, I proposed that if we increase our understanding of IR models, we can further improve results. The two studies have shown that by looking into the details of IR models, instead of treating them as black boxes, we can improve our techniques and get better results.
My broader research vision is to provide better tools, techniques, and insights for software development teams, so that they can build better software at lower costs and have happier customers. In this thesis, I have taken a step towards that vision by proposing and evaluating ways to better utilize the unstructured elements of software repositories, which in turn provide new and better capabilities for software developers.