Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis - Nirav Raje
This was a research project for an undergraduate academic seminar. It analyzed the impact of various text preprocessing techniques, feature weighting schemes (FF, FP, TF-IDF), feature selection methods (filters, wrappers, embedded), lemmatization, and tokenization (unigram, bigram, and 1-to-3-gram) on three open Twitter datasets.
These are the slides from a presentation Terry T. Um gave at Kookmin University on 22 June 2014. Feel free to share them, and please let me know if you find any misconceptions or errors.
(http://t-robotics.blogspot.com)
(http://terryum.io)
In this PDF you will find the basics of Turbo Prolog 2.0, with some good example programs and their output. The second part is coming in the next week or month.
For any query: sohupatel8828@gmail.com
For programs: https://github.com/UltraHopeful/Turbo-Prolog-2.0
NLP techniques used for spell checking: detect errors in a written word and suggest a relevant replacement.
Algorithms: Jaccard coefficient, Levenshtein distance
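A minimal plain-Python sketch of both measures for spell-check candidate ranking (the word list and bigram choice are illustrative assumptions, not the project's actual setup):

```python
def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient over character n-grams (bigrams by default)."""
    A = {a[i:i + n] for i in range(len(a) - n + 1)}
    B = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(A & B) / len(A | B) if A | B else 1.0

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Suggest the dictionary word closest to a misspelling (toy word list).
dictionary = ["receive", "believe", "separate", "definitely"]
word = "recieve"
print(max(dictionary, key=lambda w: jaccard(word, w)))      # receive
print(min(dictionary, key=lambda w: levenshtein(word, w)))  # receive
```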
Research on character-level language modelling using LSTMs for semi-supervised learning. The objective is to learn the inner-layer representations of the language model and transfer them to a classification model.
Generalizing NLP pipelines by using bidirectional LSTMs to learn character (byte)-level embeddings of financial news headlines, up to 8 bits per symbol (values 0 to 2**8 - 1, i.e. UTF-8 code units), in order to study the relationships between character vectors and transfer the learned representations into classification models. Many traditional NLP steps (lemmatization, POS tagging, NER, stemming, ...) are skipped when working at the byte level, making the process universal in scope rather than task-specific.
Build an LLM-powered application using LangChain.pdf - AnastasiaSteele10
LangChain is a framework for building language-model-powered applications. It provides tools, components, and interfaces that simplify LLM-based development: managing interactions with language models, chaining components together, and integrating resources such as APIs and databases. The platform includes a set of APIs that can be embedded in applications, letting developers add language-processing capabilities without starting from scratch.
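A minimal sketch of that pattern, assuming the classic LangChain API (PromptTemplate + LLMChain) and an OPENAI_API_KEY in the environment; module paths vary across LangChain versions, so treat the imports as indicative:

```python
# Hedged sketch: import paths differ between LangChain releases.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest one catchy name for a company that makes {product}.",
)
llm = OpenAI(temperature=0.7)             # reads OPENAI_API_KEY from the env
chain = LLMChain(llm=llm, prompt=prompt)  # chains prompt -> model
print(chain.run("artisanal coffee"))
```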
A Beginner's Guide to Machine Learning with Scikit-Learn - Sarah Guido
Given at the PyData NYC 2013 conference (http://vimeo.com/79517341), and will be given at PyTennessee 2014.
Scikit-learn is one of the most well-known machine learning Python modules in existence. But how does it work, and what, for that matter, is machine learning? For those with programming experience but who are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.
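A minimal example of the supervised workflow such a talk typically walks through, using scikit-learn's bundled iris data (the model choice is illustrative):

```python
# Train/test split, fit a classifier, and measure accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```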
Knowledge Based Reasoning: Agents, Facets of Knowledge. Logic and Inferences: Formal Logic, Propositional and First Order Logic, Resolution in Propositional and First Order Logic, Deductive Retrieval, Backward Chaining, Second Order Logic. Knowledge Representation: Conceptual Dependency, Frames, Semantic Nets.
Introduction to Pandas and Time Series Analysis [PyCon DE] - Alexander Hendorf
Most data is allocated to a period or to some point in time. We can gain a lot of insight by analyzing what happened when. The better the quality and accuracy of our data, the better our predictions can become.
Unfortunately, the data we have to deal with is often aggregated, for example on a monthly basis. But not all months are the same: they may have 28 or 31 days, or four or five weekends. The data is made to fit our calendar, which was made to fit the earth's orbit around the sun, not to please data scientists.
Dealing with periodic data can be a challenge. This talk shows how you can handle it with pandas.
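A short illustration of the kind of period handling pandas provides: months really are unequal, and normalizing a monthly aggregate by days-in-month makes values comparable (the data here is synthetic; frequency aliases follow the classic pandas convention):

```python
import numpy as np
import pandas as pd

# Months differ in length; pandas knows by how much.
months = pd.period_range("2024-01", periods=4, freq="M")
print(months.days_in_month.tolist())   # [31, 29, 31, 30]

# Daily observations summed per month, then converted to per-day rates.
days = pd.date_range("2024-01-01", "2024-04-30", freq="D")
s = pd.Series(np.random.default_rng(0).poisson(100, len(days)), index=days)
monthly = s.resample("M").sum()
print(monthly / monthly.index.days_in_month)  # comparable across months
```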
Intro to Machine Learning with H2O and AWS - Sri Ambati
Navdeep Gill @ Galvanize Seattle - May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Methods of Optimization in Machine Learning - Knoldus Inc.
In this session we discuss various methods to optimize a machine learning model and how to adjust the hyper-parameters to minimize the cost function.
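As a concrete illustration of one common approach (not necessarily the session's exact method), here is a cross-validated grid search over SVM hyper-parameters with scikit-learn:

```python
# Exhaustive search over a small hyper-parameter grid with 5-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
    cv=5, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```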
** Data Science Certification using R: https://www.edureka.co/data-science **
This Edureka PPT on "Predictive Analytics Using R" will help you learn how predictive analytics works and how it can be implemented using R to solve real-world problems. Topics covered in this module:
What is Predictive Analytics?
Stages of Predictive Analytics
Predictive Analytics Using R
Predictive Analytics Use case
Demo
Blog Series: http://bit.ly/data-science-blogs
Data Science Training Playlist: http://bit.ly/data-science-playlist
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Overview of the course. Introduction to image sciences, image processing, and computer vision. Basics of machine learning, terminologies, paradigms. No-free-lunch theorem. Supervised versus unsupervised learning. Clustering and K-Means. Classification and regression. Linear least squares and polynomial curve fitting. Model complexity and overfitting. Curse of dimensionality. Dimensionality reduction and principal component analysis. Image representation, semantic gap, image features, and classical computer vision pipelines.
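For a flavor of two of the listed topics, here is a short, illustrative scikit-learn snippet combining PCA for dimensionality reduction with K-Means clustering (the digits dataset is an arbitrary stand-in for the course's material):

```python
# Reduce 64-dimensional digit images to 2D, then cluster into 10 groups.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)     # 64 dims -> 2 dims
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X2)
print(X2.shape, labels[:10])
```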
Evaluating LLM Models for Production Systems: Methods and Practices - alopatenko
This webinar is designed to offer a comprehensive understanding of the evaluation processes for LLMs, particularly in the context of preparing these models for deployment in production environments.
Key Highlights of the Seminar:
In-Depth Analysis of LLM Evaluation Methods: Gain insights into a variety of methods to evaluate LLM models, understanding their strengths and weaknesses.
End-to-End Evaluation Techniques: Explore how LLM augmented systems are assessed from a holistic perspective.
Pragmatic Approach to System Deployment: Learn practical strategies for applying these evaluation techniques to systems intended for real-world application.
Focused Overview on Critical LLM Aspects: Receive an overview of various evaluation techniques that are essential for assessing the most crucial elements of modern LLM systems.
Simplifying the Evaluation Process: Understand how to streamline the evaluation process, making the work of LLM scientists more efficient and productive.
Dr. Andrei Lopatenko is a seasoned expert and executive leader with over 15 years of experience in the tech industry, focusing on search engines, recommendation systems, and large-scale AI, ML, and NLP applications. He has contributed significantly to major companies like Google, Apple, Walmart, eBay, and Zillow, benefiting billions of customers. Dr. Lopatenko earned his PhD in Computer Science from the University of Manchester. He played a key role in developing Google's search engine, initiating Apple Maps, co-founding a Conversational AI startup acquired by Facebook/Meta, and leading Search, LLM, and Generative AI at Zillow.
This is a deep learning presentation based on deep neural networks. It reviews the deep learning concept, related work, and specific application areas. It describes a use-case scenario of deep learning and highlights the current trends and research issues in deep learning.
Partitioning Composite Code Changes to Facilitate Code Review (MSR2015) - Sung Kim
Yida's presentation at MSR 2015!
Abstract—Developers expend significant effort on reviewing source code changes, hence the comprehensibility of code changes directly affects development productivity. Our prior study has suggested that composite code changes, which mix multiple development issues together, are typically difficult to review. Unfortunately, our manual inspection of 453 open source code changes reveals a non-trivial occurrence (up to 29%) of such composite changes.
In this paper, we propose a heuristic-based approach to automatically partition composite changes, such that each sub-change in the partition is more cohesive and self-contained. Our quantitative and qualitative evaluation results are promising in demonstrating the potential benefits of our approach for facilitating code review of composite code changes.
Heuristic design of experiments w meta gradient search - Greg Makowski
Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* How to strategically search over the model training parameters - use a gradient descent approach (a toy sketch follows this list)
* One way to describe an arbitrarily complex predictive system is by using sensitivity analysis
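A toy sketch of the gradient-style search idea from the bullets above, assuming nothing about the talk's actual tooling: hill-climb one training parameter at a time toward a better cross-validated score (the model and grid are illustrative):

```python
# Greedy coordinate search over hyper-parameters, one axis at a time.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 3, 4]}
current = {k: v[0] for k, v in grid.items()}

def score(params):
    model = GradientBoostingClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3).mean()

best = score(current)
improved = True
while improved:                       # stop when no single move helps
    improved = False
    for name, values in grid.items():
        for v in values:
            trial = {**current, name: v}
            s = score(trial)
            if s > best:
                best, current, improved = s, trial, True
print(current, round(best, 4))
```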
• Implemented a Path Oriented Decision Making (PODEM) algorithm for an Automatic Test Generator (ATG) for combinational logic circuits with re-convergent fan-out.
• The generated test vectors were verified using deductive fault simulation.
• The fault coverage after implementing Random Test Generator (RTG) was calculated and plotted.
• The ATG, RTG, and fault simulator were all written in more than 1,200 lines of Python.
Updated slides for my talk at the CHAQ meeting in Antwerp. I also added slides on some of my experiences on performing empirical studies with open source and industrial software systems.
Automation of building reliable models - Eszter Szabó
The volume and velocity of bioactivity data available in public or in-house sources represent an immense opportunity to be exploited in novel compound design. An ever wider array of targets with labelled data necessitates efficient solutions for building a large number of individual models. The velocity of data growth makes it possible to achieve higher accuracy through continuous re-training of the existing models. Automatic re-training maximizes the applicability domain and minimizes the risk of an accuracy drop as a project expands into novel chemical series.
Comparing Machine Learning Algorithms in Text Mining - Andrea Gigli
In this project I compare different machine learning algorithms on different text mining tasks.
ML algorithms: Naive Bayes, Support Vector Machines, Decision Trees, Random Forests, and Ordinal Regression.
Tasks considered: classifying positive and negative reviews, predicting review stars, quantifying sentiment over time, and detecting fake reviews. A compact example follows.
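A compact, hypothetical version of such a comparison: the same TF-IDF features fed to two of the listed algorithms on a tiny inline corpus (not the project's data):

```python
# Compare two classifiers on the review-polarity task with shared features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["great product, loved it", "terrible, broke in a day",
        "excellent value", "awful customer service"]
labels = [1, 0, 1, 0]                      # 1 = positive review

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf).fit(docs, labels)
    print(type(clf).__name__, model.predict(["loved the service"]))
```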
This paper advances the Domain Segmentation based on Uncertainty in the Surrogate (DSUS) framework, a novel approach to characterizing the uncertainty in surrogates. The leave-one-out cross-validation technique is adopted in the DSUS framework to measure local errors of a surrogate. A method is proposed to evaluate the performance of the leave-one-out cross-validation errors as local error measures. This method evaluates local errors by comparing (i) the leave-one-out cross-validation error with (ii) the actual local error estimated within a local hypercube for each training point. The comparison results show that the leave-one-out cross-validation strategy can capture the local errors of a surrogate. The DSUS framework is then applied to key aspects of wind resource assessment and wind farm cost modeling. The uncertainties in the wind farm cost and the wind power potential are successfully characterized, which gives designers/users more confidence when using these models.
MuVM: Higher Order Mutation Analysis Virtual Machine for C - Susumu Tokumoto
Mutation analysis is a method for evaluating the effectiveness of a test suite by seeding faults artificially and measuring the fraction of seeded faults detected by the test suite. The major limitation of mutation analysis is its lengthy execution time because it involves generating, compiling and running large numbers of mutated programs, called mutants. Our tool MuVM achieves a significant runtime improvement by performing higher order mutation analysis using four techniques, metamutation, mutation on virtual machine, higher order split-stream execution, and online adaptation technique. In order to obtain the same behavior as mutating the source code directly, metamutation preserves the mutation location information which may potentially be lost during bitcode compilation and optimization. Mutation on a virtual machine reduces the compilation and testing cost by compiling a program once and invoking a process once. Higher order split-stream execution also reduces the testing cost by executing common parts of the mutants together and splitting the execution at a seeded fault. Online adaptation technique reduces the number of generated mutants by omitting infeasible mutants. Our comparative experiments indicate that our tool is significantly superior to an existing tool, an existing technique (mutation schema generation), and no-split-stream execution in higher order mutation.
Dependability Benchmarking by Injecting Software Bugs - Roberto Natella
Benchmarks have been an established practice for performance evaluation in the computer industry for decades. Examples of successful benchmarking initiatives are the TPC (Transaction Processing Performance Council) and SPEC (Standard Performance Evaluation Corporation). More recently, the research community developed the notion of dependability benchmarking, which evaluates the quality of service (throughput, availability, etc.) of competing products in the presence of faults, using fault injection. The idea of dependability benchmarking has been applied in several domains including transaction processing, telecom, and automotive.
Given that software faults (bugs) are a major cause of failures, it becomes important to assess dependability against these faults. However, emulating software faults in a controlled fault injection experiment is a difficult problem, since bugs originate from human error. This presentation discusses the open challenges and recent advances in the field of emulating software bugs in a representative way.
Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bags-of-words and lack a deep understanding of the semantics of the query.
We propose DeepAPI, a deep learning based approach to generate API usage sequences for a given natural language query. Instead of a bag-of-words assumption, it learns the sequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches.
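Not the authors' code: a minimal PyTorch sketch of the RNN Encoder-Decoder pattern the abstract describes (a query word sequence is encoded into a fixed-length context vector, from which an API-token sequence is decoded); all vocabulary sizes and dimensions here are made up.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the word sequence into a fixed-length context vector.
        _, context = self.encoder(self.src_emb(src))
        # Condition the decoder on that context to emit API tokens.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)   # logits over the API vocabulary

model = EncoderDecoder(src_vocab=1000, tgt_vocab=500)
query = torch.randint(0, 1000, (1, 6))   # e.g. "how to parse xml files"
apis = torch.randint(0, 500, (1, 4))     # shifted gold API sequence
print(model(query, apis).shape)          # torch.Size([1, 4, 500])
```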
Defect, defect, defect: PROMISE 2012 Keynote - Sung Kim
Software prediction leveraging repositories has received a tremendous amount of attention within the software engineering community, including PROMISE. In this talk, I will first present great achievements in defect prediction research including new defect prediction features, promising algorithms, and interesting analysis results. However, there are still many challenges in defect prediction. I will talk about them and discuss potential solutions for them leveraging prediction 2.0.
Globus Compute with IRI Workflows - GlobusWorld 2024 - Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help schedule jobs and serve as a tool to connect compute at different facilities.
Providing Globus Services to Users of JASMIN for Environmental Data Analysis - Globus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... - Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but my extensions reached 63K downloads (powering possibly tens of thousands of websites).
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Paketo Buildpacks: the best way to build OCI images? DevopsDa... - Anthony Dahanne
Buildpacks have been around for more than 10 years! They were first used to detect and build an application before deploying it on certain PaaS platforms. Then, with their latest generation, the Cloud Native Buildpacks (a CNCF incubating project), we became able to build Docker (OCI) images. Are they a good alternative to Dockerfiles? What are the Paketo buildpacks? Which communities support them, and how?
Come find out in this ignite session.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Field Employee Tracking System | MiTrack App | Best Employee Tracking Solution | ... - informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... - Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
A Comprehensive Look at Generative AI in Retail App Testing.pdf - kalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Quarkus Hidden and Forbidden Extensions - Max Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
May Marketo Masterclass, London MUG May 22 2024.pdf - Adele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx - rickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... - Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... - Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
2. Software Defect Prediction
[Slide diagram: a model is trained on labeled instances from Project A and then predicts unlabeled instances. Legend: metric value; buggy-labeled instance; clean-labeled instance; ?: unlabeled instance.]
Related work: Munson@TSE`92, Basili@TSE`95, Menzies@TSE`07, Hassan@ICSE`09, Bird@FSE`11, D'Ambros@EMSE`12, Lee@FSE`11, ...
3. What if labeled instances do not exist?
[Slide diagram: Project X has only an unlabeled dataset (?: unlabeled instance; metric values), so no model can be trained. This situation arises in new projects and in projects lacking historical data.]
7. Key Idea
• Consistent defect-proneness tendency of metrics
  - Defect prediction metrics measure the complexity of software and its development process, e.g.:
    - the number of developers touching a source code file (Bird@FSE`11)
    - the number of methods in a class (D'Ambros@ESEJ`12)
    - the number of operands (Menzies@TSE`08)
  - More complexity implies more defect-proneness (Rahman@ICSE`13).
• Distributions between source and target should be the same to build a strong prediction model.
=> Match source and target metrics that have similar distributions.
9. Metric Selection
• Why? (Guyon@JMLR`03)
  - Select informative metrics: remove redundant and irrelevant metrics.
  - Decrease the complexity of the metric matching combination.
• Feature selection approaches (Gao@SPE`11, Shivaji@TSE`13):
  - Gain Ratio
  - Chi-square
  - Relief-F
  - Significance attribute evaluation
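Not from the paper: of the four approaches listed, chi-square is the simplest to demonstrate with scikit-learn; a hypothetical top-k selection over synthetic metric data (k=5 is an arbitrary choice, not the paper's setting):

```python
# Keep the k metrics most associated with the buggy/clean label.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((100, 20))            # 100 instances x 20 source metrics
y = rng.integers(0, 2, 100)          # buggy / clean labels

selector = SelectKBest(chi2, k=5).fit(X, y)   # chi2 needs non-negative X
print("selected metric indices:", selector.get_support(indices=True))
```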
10. Metric Matching
[Slide diagram: source metrics X1, X2 are matched to target metrics Y1, Y2 by matching score, e.g. X1-Y1 = 0.8 and X2-Y2 = 0.5.]
* We can apply different cutoff values to the matching score.
* It is possible that there is no matching at all.
11. Compute Matching Score
KSAnalyzer
• Uses the p-value of the Kolmogorov-Smirnov test (Massey@JASA`51).
Matching score M of the i-th source metric and the j-th target metric: M_ij = p_ij
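Not the paper's implementation: a small SciPy sketch of the KSAnalyzer idea, using the two-sample Kolmogorov-Smirnov p-value as the matching score on synthetic data; the greedy pairing below stands in for a proper maximum-weight bipartite matching.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
source = rng.normal(size=(200, 3))            # 3 source metrics
target = rng.normal(size=(150, 2))            # 2 target metrics

# M_ij = p-value of the KS test between source metric i and target metric j.
M = np.array([[ks_2samp(source[:, i], target[:, j]).pvalue
               for j in range(target.shape[1])]
              for i in range(source.shape[1])])

cutoff = 0.05                                  # drop poorly matched pairs
pairs = [(i, j) for i in range(M.shape[0]) for j in range(M.shape[1])
         if M[i, j] > cutoff]

# Greedily keep the highest-scoring disjoint pairs.
used_i, used_j, match = set(), set(), []
for i, j in sorted(pairs, key=lambda p: -M[p]):
    if i not in used_i and j not in used_j:
        match.append((i, j)); used_i.add(i); used_j.add(j)
print(match)
```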
14. Baselines
• WPDP: within-project defect prediction.
• CPDP-CM (Turhan@EMSE`09, Ma@IST`12, He@IST`14): cross-project defect prediction using only the common metrics between source and target datasets.
• CPDP-IFS (He@CoRR`14): cross-project defect prediction on an Imbalanced Feature Set (i.e., heterogeneous metric sets); 16 distributional characteristics of an instance's values serve as features (e.g., mean, std, maximum, ...).
15. Research Questions (RQs)
• RQ1: Is heterogeneous defect prediction comparable to WPDP?
• RQ2: Is heterogeneous defect prediction comparable to CPDP-CM?
• RQ3: Is heterogeneous defect prediction comparable to CPDP-IFS?
16. Benchmark Datasets

Group    Dataset       Instances  Buggy (%)     Metrics  Granularity
AEEEM    EQ            325        129 (39.7%)   61       Class
AEEEM    JDT           997        206 (20.7%)   61       Class
AEEEM    LC            399        64 (9.36%)    61       Class
AEEEM    ML            1862       245 (13.2%)   61       Class
AEEEM    PDE           1492       209 (14.0%)   61       Class
MORPH    ant-1.3       125        20 (16.0%)    20       Class
MORPH    arc           234        27 (11.5%)    20       Class
MORPH    camel-1.0     339        13 (3.8%)     20       Class
MORPH    poi-1.5       237        141 (75.0%)   20       Class
MORPH    redaktor      176        27 (15.3%)    20       Class
MORPH    skarbonka     45         9 (20.0%)     20       Class
MORPH    tomcat        858        77 (9.0%)     20       Class
MORPH    velocity-1.4  196        147 (75.0%)   20       Class
MORPH    xalan-2.4     723        110 (15.2%)   20       Class
MORPH    xerces-1.2    440        71 (16.1%)    20       Class
ReLink   Apache        194        98 (50.5%)    26       File
ReLink   Safe          56         22 (39.3%)    26       File
ReLink   ZXing         399        118 (29.6%)   26       File
NASA     cm1           327        42 (12.8%)    37       Function
NASA     mw1           253        27 (10.7%)    37       Function
NASA     pc1           705        61 (8.7%)     37       Function
NASA     pc3           1077       134 (12.4%)   37       Function
NASA     pc4           1458       178 (12.2%)   37       Function
SOFTLAB  ar1           121        9 (7.4%)      29       Function
SOFTLAB  ar3           63         8 (12.7%)     29       Function
SOFTLAB  ar4           107        20 (18.7%)    29       Function
SOFTLAB  ar5           36         8 (22.2%)     29       Function
SOFTLAB  ar6           101        15 (14.9%)    29       Function

600 prediction combinations in total!
17. Experimental Settings
• Logistic Regression
• HDP vs. WPDP, CPDP-CM, and CPDP-IFS
[Slide diagram: each dataset is split into a 50% training set and a 50% test set, repeated 1000 times, across projects 1..n; HDP is compared against WPDP, CPDP-CM, and CPDP-IFS.]
25. Different Feature Selections (median AUCs, Win/Tie/Loss)

Approach      vs. WPDP       vs. CPDP-CM    vs. CPDP-IFS   HDP
              AUC    Win%    AUC    Win%    AUC    Win%    AUC
Gain Ratio    0.657  63.7%   0.645  63.2%   0.536  80.2%   0.720
Chi-Square    0.657  64.7%   0.651  66.4%   0.556  82.3%   0.727
Significance  0.657  66.2%   0.636  66.2%   0.553  82.0%   0.724
Relief-F      0.670  57.0%   0.657  63.1%   0.543  80.5%   0.709
None          0.657  47.3%   0.624  50.3%   0.536  66.3%   0.663
26. Results in Different Cutoffs

Cutoff  vs. WPDP       vs. CPDP-CM    vs. CPDP-IFS   HDP      Target Coverage
        AUC    Win%    AUC    Win%    AUC    Win%    AUC
0.05    0.657  66.2%   0.636  66.2%   0.553  82.4%   0.724*   100%
0.90    0.657  100%    0.761  71.4%   0.624  100%    0.852*   21%
27. Conclusion
• HDP: potential for CPDP across datasets with different metric sets.
• Future work:
  - Filtering out noisy metric matching
  - Determining the best probability threshold
Here is Project A and some software entities. Let's say these entities are source code files.
I want to predict whether these files are buggy or clean.
To do this, we need a prediction model.
Since defect prediction models are trained by machine learning algorithms, we need labeled instances collected from previous releases.
This is a labeled instance. An instance consists of features and a label.
Various software metrics such as LoC, # of functions in a file, and # of authors touching a source file, are used as features for machine learning.
Software metrics measure complexity of software and its development process
Each instance can be labeled by past bug information.
Software metrics and past bug information can be collected from software archives such as version control systems and bug report systems.
With these labeled instances, we can build a prediction model and predict the unlabeled instances.
This prediction is conducted within the same project. So, we call this Within-project defect prediction (WPDP).
There are many studies on WPDP, and they showed good prediction performance (e.g., prediction accuracy around 0.7).
What if there are no labeled instances? This can happen in new projects and in projects lacking historical data.
New projects do not have past defect information to label instances.
Some projects also do not have defect information because they lack historical data in their software archives.
When I participated in an industrial project for Samsung Electronics, it was really difficult to generate labeled instances because their software archives were not well managed by developers.
So, in some real industrial projects, we may not be able to generate labeled instances to build a prediction model.
Without labeled instances, we cannot build a prediction model.
After experiencing this limitation in industry, I decided to address this problem.
There are existing solutions to build a prediction model for unlabeled datasets.
The first solution is cross-project defect prediction. We can reuse labeled instances from other projects.
Various feature selection approaches can be applied.
By doing that, we can investigate how higher matching scores impact defect prediction performance.
16 distribution characteristics: mode, median, mean, harmonic mean, minimum, maximum, range, variation ratio, first quartile, third quartile, interquartile range, variance, standard deviation, coefficient of variance, skewness, and kurtosis.
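Not from the paper's replication package: a rough sketch of how those 16 characteristics could be computed for one instance's metric values with NumPy/SciPy; exact definitions (e.g., of variation ratio, or harmonic mean on non-positive values) are my assumptions.

```python
import numpy as np
from scipy import stats

def distribution_features(v):
    v = np.asarray(v, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    mode = stats.mode(v, keepdims=False).mode
    return [mode, med, v.mean(),
            stats.hmean(np.abs(v) + 1e-9),    # assumption: shift to positives
            v.min(), v.max(), v.max() - v.min(),
            1 - (v == mode).mean(),           # variation ratio (assumed def.)
            q1, q3, q3 - q1, v.var(), v.std(),
            v.std() / v.mean(),               # coefficient of variance
            stats.skew(v), stats.kurtosis(v)]

print(len(distribution_features([1, 2, 2, 3, 10])))  # 16
```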
AEEEM: object-oriented (OO) metrics, previous-defect metrics, entropy metrics of change and code, and churn-of-source-code metrics [4].
MORPH: McCabe’s cyclomatic metrics, CK metrics, and other OO metrics [36].
ReLink: code complexity metrics
NASA: Halstead metrics and McCabe’s cyclomatic metrics, additional complexity metrics such as parameter count and percentage of comments
SOFTLAB: Halstead metrics and McCabe’s cyclomatic metrics
all 222 prediction combinations among the 600 predictions