Here are some key research questions around building adaptive agents:
- How to handle anonymization of private data when making data public
- How to handle large volumes of data, especially when working from raw project artifacts
- How to recognize different modes or situations within the data over time
- How to determine when new situations are truly new vs a repeat of old situations
- How to establish trust when data is crowd-sourced and the agents did not directly collect the data
- How to provide explanations for recommendations from complex models trained on large datasets
This presentation is about a lecture I gave within the "Software systems and services" immigration course at the Gran Sasso Science Institute, L'Aquila (Italy): http://cs.gssi.infn.it/.
http://www.ivanomalavolta.com
Qualitative Studies in Software Engineering - Interviews, Observation, Ground... (alessio_ferrari)
This lecture covers qualitative data collection methods and qualitative data analysis in software engineering. Topics covered are:
1. Sampling
2. Interviews
3. Observation and Participant Observation
4. Archival Data Collection
5. Grounded theory, Coding, Thematic Analysis
6. Threats to validity in qualitative studies
Find the videos at: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Lecture on case study design and reporting in empirical software engineering. The lecture touches on the topics of units of analysis, data collection, data analysis, validity procedures, and collaboration with industry.
Empirical Software Engineering for Software Environments - University of Cali... (Marco Aurelio Gerosa)
Second class of the Software Environment course. In this class, we discuss how to use Empirical Software Engineering techniques to support the construction and evaluation of software tools.
Theories in Empirical Software Engineering (Daniel Mendez)
Slides from the International Advanced School on Empirical Software Engineering 2015, held as part of the Empirical Software Engineering International Week in Beijing. The slides are posted with the permission of the main organiser Roel Wieringa.
Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity (alessio_ferrari)
Complete lecture on controlled experiments in software engineering. It explains practical guidelines on conducting controlled experiments and describes the concepts of dependent, independent, and control variables, significance, and p-value. It also explains how to select the appropriate statistical test for a hypothesis, and gives examples of data for different typical tests.
Finally, it discusses threats to validity in controlled experiments and gives indications for reporting.
Find the video lectures here: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Empirical Methods in Software Engineering - an Overview (alessio_ferrari)
A first introductory lecture on empirical methods in software engineering. It includes:
1) Motivation for empirical software engineering studies
2) How to define research questions
3) Measures and data collection methods
4) Formulating theories in software engineering
5) Software engineering research strategies
Find the videos at: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
This presentation is about a lecture I gave within the "Green Lab" course of the Computer Science master program, of the Vrije Universiteit Amsterdam.
http://www.ivanomalavolta.com
Selecting Empirical Methods for Software Engineering (Daniel Cukier)
Presentation on how to write good Master and PhD dissertations.
Empirical Methods, Software Engineering, science, computer science, software, methods, positivism, epistemology, ontology, constructivism, critical theory, pragmatism, case study, action research, ethnography
This presentation is about a lecture I gave within the "Software systems and services" immigration course at the Gran Sasso Science Institute, L'Aquila (Italy): http://cs.gssi.it/.
http://www.ivanomalavolta.com
Theory Building in RE - The NaPiRE Initiative (Daniel Mendez)
Talk I gave on the "Naming the Pain in Requirements Engineering" initiative (www.re-survey.org) at the Seminar on Forty Years of Requirements Engineering – Looking Forward and Looking Back (RE@40) in Kappel am Albis, Switzerland
Abstract:
Though in essence an engineering discipline, software engineering research has always been struggling to demonstrate impact. This is reflected in part by the funding challenges that the discipline faces in many countries, the difficulties we have to attract industrial participants to our conferences, and the scarcity of papers reporting industrial case studies.
There are clear historical reasons for this but we nevertheless need, as a community, to question our research paradigms and peer evaluation processes in order to improve the situation. From a personal standpoint, relevance and impact are concerns that I have been struggling with for a long time, which eventually led me to leave a comfortable academic position and a research chair to work in industry-driven research.
I will use some concrete research project examples to argue why we need more inductive research, that is, research working from specific observations in real settings to broader generalizations and theories. Among other things, the examples will show how a more thorough understanding of practice and closer interactions with practitioners can profoundly influence the definition of research problems, and the development and evaluation of solutions to these problems. Furthermore, these examples will illustrate why, to a large extent, useful research is necessarily multidisciplinary. I will also address issues regarding the implementation of such a research paradigm and show how our own bias as a research community worsens the situation and undermines our very own interests.
On a more humorous note, the title hints at the fact that being a scientist in software engineering and aiming at having impact on practice often entails leading two parallel careers and impersonating different roles to different peers and partners.
Bio:
Lionel Briand is heading the Certus center on software verification and validation at Simula Research Laboratory, where he is leading research projects with industrial partners. He is also a professor at the University of Oslo (Norway). Before that, he was on the faculty of the department of Systems and Computer Engineering, Carleton University, Ottawa, Canada, where he was full professor and held the Canada Research Chair (Tier I) in Software Quality Engineering. He is the coeditor-in-chief of Empirical Software Engineering (Springer) and is a member of the editorial boards of Systems and Software Modeling (Springer) and Software Testing, Verification, and Reliability (Wiley). He was on the board of IEEE Transactions on Software Engineering from 2000 to 2004. Lionel was elevated to the grade of IEEE Fellow for his work on the testing of object-oriented systems. His research interests include: model-driven development, testing and verification, search-based software engineering, and empirical software engineering.
The method of exploratory testing has gained significant attention in industry and research in recent years. However, as with many "buzzword" technologies, the introduction and application of exploratory testing is not straightforward. Exploratory testing is not only black or white - scripted or exploratory - but also all shades of grey in between. Within the EASE industrial excellence center, we have run an industrial workshop on exploratory testing that helps provide an understanding of how to choose feasible levels of exploration in exploratory testing. We will present the concepts of levels of exploration in exploratory testing and the outcomes of the workshop, along with relevant empirical research findings on exploratory testing.
Architecting a Platform for Enterprise Use - Strata London 2018 (mark madsen)
The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This session will discuss hidden design assumptions, review design principles to apply when building multi-use data infrastructure, and provide a reference architecture to use as you work to unify your analytics infrastructure.
The focus in our market has been on acquiring technology, and that ignores the more important part: the larger IT landscape within which this technology lives and the data architecture that lies at its core. If one expects longevity from a platform then it should be a designed rather than accidental architecture.
Architecture is more than just software. It starts from use and includes the data, technology, methods of building and maintaining, and organization of people. What are the design principles that lead to good design and a functional data architecture? What are the assumptions that limit older approaches? How can one integrate with, migrate from or modernize an existing data environment? How will this affect an organization's data management practices? This tutorial will help you answer these questions.
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Analytic workload characteristics and constraints
* Data architecture
* Functional architecture
* Tradeoffs between different classes of technology
* Technology planning assumptions and guidance
#strataconf
Pay no attention to the man behind the curtain - the unseen work behind data ... (mark madsen)
Goal: explain the nature of the work of an analytics team to a manager, and enable people on those teams to explain what a data science team needs to a manager.
It seems as if every organization wants to enable analytical decision-making and embed analytics into operational processes. What can you do with analytics? It looks like anything is possible. What can you really do? Probably a lot less than you expect. Why is this? Vendors promise easy-to-use analytics tools and services but they rarely deliver. The products may be easy but the work is still hard.
Using analytics to solve problems depends on many factors beyond the math: people, processes, the skills of the analyst, the technology used, the data. Technology is the easy part. Figuring out what to do and how to do it is a lot harder. Despite this, fancy new tools get all the attention and budget.
People and data are the truly hard parts. People, because many believe that data is absolute rather than relative, and that analytic models produce an answer rather than a range of answers with varying degrees of truth, accuracy and applicability. Data, because managing data for analytics is a nuanced, detail-oriented and seemingly dull task left to back-office IT.
If your goal is to build a repeatable analytics capability rather than a one-off analytics project then you will need to address the parts that are rarely mentioned. This talk will explain some of the unseen and little-discussed aspects involved when building and deploying analytics.
Discovery and Open Data: slides from #discopen session at JISC cross programme meeting in April 2012. Author: Amber Thomas, JISC. Discusses the data space around discovery issues in education and research, with a focus on open data. CC BY. Please see slide 2 for permissions.
Toward a System Building Agenda for Data Integration (and Dat.docx) (juliennehar)
Toward a System Building Agenda for Data Integration (and Data Science)
AnHai Doan, Pradap Konda, Paul Suganthan G.C., Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, Han Li, Philip Martinkus, Sidharth Mudgal, Erik Paulson, Haojun Zhang
University of Wisconsin-Madison
Abstract
We argue that the data integration (DI) community should devote far more effort to building systems, in order to truly advance the field. We discuss the limitations of current DI systems, and point out that there is already an existing popular DI "system" out there, which is PyData, the open-source ecosystem of 138,000+ interoperable Python packages. We argue that rather than building isolated monolithic DI systems, we should consider extending this PyData "system", by developing more Python packages that solve DI problems for the users of PyData. We discuss how extending PyData enables us to pursue an integrated agenda of research, system development, education, and outreach in DI, which in turn can position our community to become a key player in data science. Finally, we discuss ongoing work at Wisconsin, which suggests that this agenda is highly promising and raises many interesting challenges.
1 Introduction
In this paper we focus on data integration (DI), broadly interpreted as covering all major data preparation steps such as data extraction, exploration, profiling, cleaning, matching, and merging [10]. This topic is also known as data wrangling, munging, curation, unification, fusion, preparation, and more. Over the past few decades, DI has received much attention (e.g., [37, 29, 31, 20, 34, 33, 6, 17, 39, 22, 23, 5, 8, 36, 15, 35, 4, 25, 38, 26, 32, 19, 2, 12, 11, 16, 2, 3]). Today, as data science grows, DI is receiving even more attention. This is because many data science applications must first perform DI to combine the raw data from multiple sources, before analysis can be carried out to extract insights.
Yet despite all this attention, today we do not really know whether the field is making good progress. The vast majority of DI works (with the exception of efforts such as Tamr and Trifacta [36, 15]) have focused on developing algorithmic solutions. But we know very little about whether these (ever-more-complex) algorithms are indeed useful in practice. The field has also built mostly isolated system prototypes, which are hard to use and combine, and are often not powerful enough for real-world applications. This makes it difficult to decide what to teach in DI classes. Teaching complex DI algorithms and asking students to do projects using our prototype systems can train them well for doing DI research, but are not likely to train them well for solving real-world DI problems in later jobs. Similarly, outreach to real users (e.g., domain scientists) is difficult. Given that we have ...
From Lab to Factory: Or how to turn data into value (Peadar Coyle)
We've all heard of 'big data' and data science, but how do we convert these trends into actual business value? I share case studies and technology tips, and talk about the challenges of the data science process. This is all based on two years of in-the-field research on deploying models and going from prototypes to production.
These are slides from my talk at PyCon Ireland 2015
Data Science as a Service: Intersection of Cloud Computing and Data Science (Pouria Amirian)
Dr. Pouria Amirian explains data science and the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.
Data Science as a Service: Intersection of Cloud Computing and Data Science (Pouria Amirian)
Dr. Pouria Amirian from the University of Oxford explains Data Science and its relationship with Big Data and Cloud Computing. Then he illustrates using AzureML to perform a simple data science analytics.
The Black Box: Interpretability, Reproducibility, and Data Management (mark madsen)
The growing complexity of data science leads to black box solutions that few people in an organization understand. You often hear about the difficulty of interpretability—explaining how an analytic model works—and that you need it to deploy models. But people use many black boxes without understanding them…if they’re reliable. It’s when the black box becomes unreliable that people lose trust.
Mistrust is more likely to be created by the lack of reliability, and the lack of reliability is often the result of misunderstanding essential elements of analytics infrastructure and practice. The concept of reproducibility—the ability to get the same results given the same information—extends your view to include the environment and the data used to build and execute models.
Mark Madsen examines reproducibility and the areas that underlie production analytics and explores the most frequently ignored and yet most essential capability, data management. The industry needs to consider its practices so that systems are more transparent and reliable, improving trust and increasing the likelihood that your analytic solutions will succeed.
This talk treats the black boxes of ML the way management perceives them: as black boxes.
There is much work on explainable models, interpretability, etc. that is important to the task of reproducibility. Much of that is relevant to the practitioner, but the practitioner can become too focused on the part they are most familiar with. Reproducing the results needs more.
Architecting a Data Platform For Enterprise Use (Strata NY 2018) (mark madsen)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This tutorial covers design assumptions, design principles, and how to approach the architecture and planning for multi-use data infrastructure in IT.
Long description:
The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This session will discuss hidden design assumptions, review design principles to apply when building multi-use data infrastructure, and provide a reference architecture to use as you work to unify your analytics infrastructure.
The focus in our market has been on acquiring technology, and that ignores the more important part: the larger IT landscape within which this technology lives and the data architecture that lies at its core. If one expects longevity from a platform then it should be a designed rather than accidental architecture.
Architecture is more than just software. It starts from use and includes the data, technology, methods of building and maintaining, and organization of people. What are the design principles that lead to good design and a functional data architecture? What are the assumptions that limit older approaches? How can one integrate with, migrate from or modernize an existing data environment? How will this affect an organization's data management practices? This tutorial will help you answer these questions.
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Data architecture
* Functional architecture
* Technology planning assumptions and guidance
Operationalizing Machine Learning in the Enterprise (mark madsen)
TDWI Munich 2019
What does it take to operationalize machine learning and AI in an enterprise setting?
Machine learning in an enterprise setting is difficult, but it seems easy. All you need is some smart people, some tools, and some data. It’s a long way from the environment needed to build ML applications to the environment to run them in an enterprise.
Most of what we know about production ML and AI comes from the world of web and digital startups and consumer services, where ML is a core part of the services they provide. These companies have fewer constraints than most enterprises do.
This session describes the nature of ML and AI applications and the overall environment they operate in, explains some important concepts about production operations, and offers some observations and advice for anyone trying to build and deploy such systems.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.
Big data is a big part of the disruption hitting this market, but not in the way most people think. It's not replacing the data warehouse, but it is changing the technology stack. It doesn't eliminate data management, but it does redefine enterprise data architecture. Big data is and isn't many things. It's important to understand which information uses are well supported and which have yet to be addressed. Otherwise you risk replacing one set of problems with another. Come to this session to hear some observations on what big data is, isn't and aspires to be.
A video is available, starts at 1:03 into this Strata online event: http://www.youtube.com/watch?v=gLsHI1ZglKw
Cloudera Data Science Challenge 3 Solution (Doug Needham)
This is my solution for the Cloudera Data Science Challenge 3. I use Spark MLlib for problem 1 and Spark GraphX for problem 3. Problem 2 is "simple" streaming map-reduce.
Tom DeMarco states that “You can’t control what you can’t measure”, but how much can we change and control (with) what we measure? This talk investigates the opportunities and limits of data-driven software engineering, shows which opportunities lie ahead of us when we engage in mining and analyzing software engineering process data, but also highlights important factors that influence the success and adaptability of data-based improvement approaches.
Science has escaped the lab and is roaming free in the world. People use software to understand the world. What tools are needed to support that work?
GALE: Geometric active learning for Search-Based Software Engineering (CS, NcState)
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat... (CS, NcState)
Discussions about sharing:
- Too much fear
- Not enough about benefits
Can we learn more from sharing than hoarding?
- Yes (results from SE)
Three laws of trusted data sharing:
- For SE quality prediction...
- Better models from shared privatized data than from all raw data
Q: does this work for other kinds of data?
A: don't know... yet
172529main Ken and Tim: software assurance research at West Virginia (CS, NcState)
SA @ WV(software assurance research at West Virginia)
Kenneth McGill
NASA IV&V Facility Research Lead
304.367.8300
Kenneth.McGill@ivv.nasa.gov
Dr. Tim Menzies Ph.D. (WVU)
Software Engineering Research Chair
tim@menzies.us
Next Generation "Treatment Learning" (finding the diamonds in the dust) (CS, NcState)
Q: How have dummies (like me) managed to gain (some) control over a (seemingly) complex world?
A: The world is simpler than we think.
◆ Models contain clumps
◆ A few collar variables decide which clumps to use.
1. Tim Menzies, WVU, USA
Forrest Shull, Fraunhofer, USA
(with John Hoskings, UoA, NZ)
Jan 27, 2011
Empirical Software Engineering, Version 2.0
2. About us
Curators of large repositories of SE data; searched for conclusions.
Shull: NSF-funded CeBase 2001-2005 (no longer on-line).
Menzies: PROMISE 2006-2011 (if you publish, offer the data used in that pub): http://promisedata.org/data
[Figure: bar chart of repository data by type (effort estimation, defect, text-mining, model-based, general), axis 0 to 100]
Our question: what's next?
3. Summary
We need to do more "data mining":
- Not just on different projects
- But again and again on the same project
And by "data mining" we really mean automated agents that implement prediction, monitoring, diagnosis, planning: adaptive business intelligence.
4. Adaptive Business Intelligence
Learning, and re-learning, how to:
- Detect death-march projects
- Repair death-march projects
- Find the best sell/buy point for software artifacts
- Invest more (or less) in staff training/dev programs
- Prioritize software inspections
- Estimate development cost
- Change development costs
- etc.
5. This talk
- A plea for industrial partners to join in
- A roadmap for my next decade of research
- Many long-term questions
- A handful of new results
7. So many applications of data mining to SE
- Process data. Input: developer skills, platform stability. Output: effort estimation.
- Social data. Input: e.g. which tester do you most respect? Output: predictions of which bugs get fixed first.
- Product data. Input: static code descriptions. Output: defect predictors.
- Trace data. Input: what calls what? Output: call sequences that lead to a core dump.
- Usage data. Input: what is everyone using? Output: recommendations on where to browse next.
- Any textual form. Input: text of any artifact. Output: e.g. fault localization.
8. The State of the Art
- If data is collected, then usually forgotten.
- Dashboards: visualizations for feature extraction; intelligence left to the user.
- MapReduce, Hadoop et al.: systems support for massive, parallel execution (http://hadoop.apache.org). Implements the bus, but no bus drivers.
- Many SE data mining publications, e.g. Bird, Nagappan, Zimmermann (and the last slide).
- But no agents that recognize when old models are no longer relevant, or that repair old models using new data.
9. Of course, DM gets it wrong, sometimes
Heh, nobody's perfect. E.g. look at all the mistakes people make, per Wikipedia's list of cognitive biases:
- 38 decision-making biases
- 30 biases in probability
- 18 social biases
- 10 memory biases
At least with DM, we can repeat the analysis and audit the conclusion.
Partnership: create agent communities, each with novel insights and limitations. Data miners working with humans see more together than separately.
11. Ben Shneiderman, Mar '08
"The growth of the World Wide Web ... continues to reorder whole disciplines and industries. ... It is time for researchers in science to take network collaboration to the next phase and reap the potential intellectual and societal payoffs."
- B. Shneiderman. Science 2.0. Science, 319(7):1349-1350, March 2008
13. A proposal
Add roller skates to software engineering: always use DM (data mining) on SE data.
14. What's the difference?
SE research v1.0:
- Case studies (watch, don't touch)
- Experiments (vary a few conditions in a project)
- Simple analysis (a little ANOVA, regression, maybe a t-test)
SE research v2.0:
- Data generators: case studies, experiments
- Data analysis: 10,000s of possible data miners
- Crowd-sourcing: 10,000s of possible analysts
15. Value-added (to case-study-based research)
Case studies: powerful for defining problems, highlighting open issues.
- Have documented 100s of candidate methods for improving SE development, e.g. Kitchenham et al., IEEE TSE, 2007, Cross versus Within-Company Cost Estimation.
- Spawned a sub-culture of researchers checking if what works here also works there.
16. Case-Study-Based Research Has Limits
- Too slow: years to produce conclusions; meanwhile, the technology base changes.
- Too many candidate methods: no guidance on what methods to apply to particular projects.
- Little generality: Zimmermann et al., FSE 2009: 662 times "learn here, test there"; worked in 4% of pairs. Many similar no-generality results (Chpt. 1, Menzies & Shull).
17. Case-studies + DM = Better Research
Propose a partnership between case study research and data mining.
- Data mining is stupid: syntactic, no business knowledge.
- Case studies are too slow. And to check for generality? Even slower.
So: case study research (on one project) to raise questions; data mining (on many projects) to check the answers.
19. Need for adaptive agents
No general rules in SE (Zimmermann, FSE 2009), but general methods to find the local rules.
Issues:
- How quickly can we learn the local models?
- How to check when local models start failing?
- How to repair local models?
An adaptive agent watching a stream of data, learning and relearning as appropriate.
20. Agents for adaptive business intelligence
[Figure: quality (e.g. PRED(30)) plotted against data collected over time: learn mode #1, breakdown, learn mode #2, breakdown, learn mode #3]
21. Agents for adaptive business intelligence
[Figure: as before - quality (e.g. PRED(30)) over time, alternating learn/breakdown across modes #1-#3]
What is different here? Not "apply data mining to build a predictor", but add monitor and repair tools to recognize and handle the breakdown of old predictors.
Trust = data mining + monitor + repair
22. If crowd sourcing
[Figure: three quality-over-time plots, one per company: Company #1 moves through modes #1-#3, Company #2 through modes #4-#6, Company #3 through modes #7-#9]
With DM, we could recognize that e.g. 1 = 4 = 7
• i.e. when some "new" situation has happened before
• So we'd know what experience base to exploit
23. Research Questions. How to handle...
- Anonymization: make data public without revealing private data.
- Volume of data: especially if working from "raw" project artifacts.
- Noise: from bad data collection.
- Mode recognition: is new stuff genuinely new, or a repeat of old stuff?
- Trust: you did not collect the data, especially if crowd-sourcing.
- Explanation: of complex patterns.
Must surround the learners with assessment agents: anomaly detectors, repair.
24. Most of the technology required for this approach can be implemented via data mining.
31. Text mining
Key issue: dimensionality reduction. In some domains, this can be done in linear time: use standard data miners, applied to the top 100 terms in each corpus.
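To make that recipe concrete, here is a minimal sketch (assuming scikit-learn is available; the toy documents, labels, and the choice of TF-IDF weighting are illustrative assumptions, not the deck's actual pipeline): reduce the corpus to its top 100 terms, then hand the reduced matrix to a standard learner.

```python
# Minimal sketch: dimensionality reduction to the top-100 terms, then a
# standard learner. Documents and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["null pointer crash in the parser",
        "button label misaligned on save dialog",
        "crash when the parser reads an empty file",
        "wrong font color in the options dialog"]
labels = ["crash", "cosmetic", "crash", "cosmetic"]

# Keep only the 100 highest-scoring terms across the corpus.
vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
X = vectorizer.fit_transform(docs)

# Any standard data miner then applies to the reduced space.
model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["parser crash on start"])))
```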
33. Agents for adaptive business intelligence
[Figure: as before - quality over time, alternating learn/breakdown across modes #1-#3]
Q1: How to learn faster? Technology: active learning: reflect on examples to date to ask the most informative next question.
Q2: How to recognize breakdown? Technology: Bayesian anomaly detection, focusing on frequency counts of contrast sets.
34. Agents for adaptive business intelligence
[Figure: as before]
Q3: How to classify a mode? Recognize if you've arrived at a mode seen before. Technology: Bayes classifier.
Q4: How to make predictions? Using the norms of a mode, report expected behavior. Technology: table look-up of data inside the Bayes classifier.
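A minimal sketch of Q3's mode recognition (a Gaussian naive Bayes reconstruction from the slide, not the deck's implementation; the mode names, features, and the "new mode" threshold are all assumptions):

```python
import math

# Summarize each known mode by per-feature (mean, std); classify the current
# era by likelihood; a low score under every mode hints at a genuinely new one.

def summarize(rows):
    stats = []
    for col in zip(*rows):
        mu = sum(col) / len(col)
        sd = max(1e-6, (sum((x - mu) ** 2 for x in col) / len(col)) ** 0.5)
        stats.append((mu, sd))
    return stats

def log_likelihood(stats, row):
    ll = 0.0
    for (mu, sd), x in zip(stats, row):
        ll += -0.5 * math.log(2 * math.pi * sd * sd) - ((x - mu) ** 2) / (2 * sd * sd)
    return ll

# Illustrative modes: (defect rate, weekly churn) observed in past eras.
modes = {"mode #1": summarize([(0.10, 200), (0.12, 180), (0.09, 220)]),
         "mode #2": summarize([(0.40, 900), (0.35, 1100), (0.45, 950)])}

current = (0.11, 210)
scores = {name: log_likelihood(s, current) for name, s in modes.items()}
best = max(scores, key=scores.get)
print(best if scores[best] > -50 else "new mode?")  # threshold is an assumption
```

Q4's "table look-up" then amounts to reporting the stored norms (the per-feature means above) of whichever mode wins.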
35. Agents for adaptive business intelligence
[Figure: as before]
Q5: What went wrong? (diagnosis) Delta between the current and the prior, better, mode.
Q6: What to do? (planning) Delta between the current and some other, better, mode.
Technology: contrast set learning.
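A frequency-count sketch of that delta, in the spirit of the slides (an illustration of contrast-set learning, not the actual TAR3 code; the attributes and the lift threshold are invented):

```python
from collections import Counter

# Find discretized attribute=value pairs far more common in the better mode
# than in the current one; those deltas become the diagnosis/plan.

def contrasts(current, better, min_lift=2.0):
    n_cur, n_bet = len(current), len(better)
    f_cur = Counter(pair for row in current for pair in row.items())
    f_bet = Counter(pair for row in better for pair in row.items())
    plan = []
    for pair, k in f_bet.items():
        p_bet = k / n_bet
        p_cur = f_cur.get(pair, 0) / n_cur
        if p_bet >= min_lift * max(p_cur, 1 / n_cur):  # guard against zero
            plan.append((pair, round(p_bet - p_cur, 2)))
    return sorted(plan, key=lambda x: -x[1])

current = [{"reviews": "none", "tests": "few"}, {"reviews": "none", "tests": "some"}]
better = [{"reviews": "weekly", "tests": "many"}, {"reviews": "weekly", "tests": "some"}]
print(contrasts(current, better))  # e.g. [(('reviews', 'weekly'), 1.0)]
```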
36. Agents for adaptive business intelligence
[Figure: as before]
Q7: How to understand a mode? (explanation) Presentation of the essential features of a mode. Technology: contrast set learning.
40. Contrast Set Learning (10 years later)
No longer a post-processor to a decision tree learner. TAR3: branch pruning operators applied directly to discretized data.
Summer '09: shoot-'em-up at NASA Ames against a state-of-the-art numerical optimizer. TAR3 ran 40 times faster and generated better solutions. A powerful, succinct explanation tool.
41. Contrast Set Learning: Anomaly Detection
Recognize when old ideas are now out-dated.
SAWTOOTH: read data in "eras" of 100 instances; classify all examples as "seen it".
SAWTOOTH1: report the average likelihood that examples belong to "seen it"; alert if that likelihood drops.
SAWTOOTH2: back-end to TAR3; track the frequency of contrast sets. Some uniformity between contrast sets and anomaly detection.
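A SAWTOOTH1-flavored sketch (a loose reconstruction from the bullets above, not the published algorithm; the era size, Laplace smoothing, and alert threshold are assumptions):

```python
import math

# Stream data in "eras"; score each era's average log-likelihood against the
# "seen it" model built from earlier eras; alert when the score drops sharply.

class SeenIt:
    def __init__(self):
        self.n, self.counts = 0, {}

    def update(self, row):
        self.n += 1
        for pair in enumerate(row):
            self.counts[pair] = self.counts.get(pair, 0) + 1

    def log_like(self, row):  # naive-Bayes-style score with Laplace smoothing
        return sum(math.log((self.counts.get(pair, 0) + 1) / (self.n + 2))
                   for pair in enumerate(row))

def monitor(stream, era=100, drop=0.8):
    model, history, buf = SeenIt(), [], []
    for row in stream:
        buf.append(row)
        if len(buf) == era:
            if model.n:
                score = sum(model.log_like(r) for r in buf) / era
                history.append(score)
                baseline = sum(history) / len(history)
                if score < baseline / drop:  # scores are negative: ~25% worse
                    yield "alert: possible mode breakdown"
            for r in buf:
                model.update(r)
            buf = []

stream = [("linux", "gcc")] * 300 + [("win", "msvc")] * 200
print(list(monitor(stream)))  # one alert when the new mix first appears
```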
42. Contrast sets: noise management
CLIFF: post-processor to TAR3. A linear-time instance selector: it finds the attribute ranges that change classification, then deletes all instances that lack those "power ranges".
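An illustrative CLIFF-like selector (a sketch of the idea, not the published algorithm; the scoring rule and the fraction of rows kept are assumptions):

```python
from collections import Counter

# Score each row by how strongly its attribute values select for its class;
# keep only the rows carrying the "power" values, discard the rest.

def cliff(rows, classes, keep=0.5):
    by_class, val_class, val_all = Counter(classes), Counter(), Counter()
    for row, c in zip(rows, classes):
        for pair in enumerate(row):
            val_class[(pair, c)] += 1
            val_all[pair] += 1
    def power(row, c):  # sum of P(value|class)^2 / P(value) over the row
        return sum((val_class[(p, c)] / by_class[c]) ** 2 / (val_all[p] / len(rows))
                   for p in enumerate(row))
    ranked = sorted(zip(rows, classes), key=lambda rc: -power(*rc))
    return ranked[:max(1, int(keep * len(rows)))]

rows = [("none", "few"), ("none", "some"), ("weekly", "many"), ("weekly", "some")]
classes = ["trouble", "trouble", "ok", "ok"]
print(cliff(rows, classes))  # keeps the rows whose values best mark their class
```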
43. Contrast Sets + CLIFF: Active Learning
Many examples, too few labels: reduce the time required for business users to offer judgment on business cases.
Explore the reduced space generated by CLIFF: randomly sample the instances half-way between different classes. Fast (in the reduced space).
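A toy sketch of that query strategy (the labeled seed data is invented, and the "half-way" rule here is a simple numeric midpoint; the labeling callback a real system would need is assumed, not shown):

```python
import random

# Propose label queries at midpoints of cross-class pairs, where a business
# user's judgment is most informative; run this inside the CLIFF-reduced space.

def midpoints(labeled, k=5):
    out = []
    for _ in range(k):
        (a, ca), (b, cb) = random.sample(labeled, 2)
        if ca != cb:
            out.append(tuple((x + y) / 2 for x, y in zip(a, b)))
    return out

labeled = [((0.10, 200), "ok"), ((0.50, 900), "trouble"), ((0.12, 240), "ok")]
for q in midpoints(labeled):
    print("ask the user to label:", q)
```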
44. Contrast sets + CLIFF: Statistical databases
Anonymize the data while preserving its distributions. For KNN, that means keeping the boundaries between classes, which we get from CLIFF. Also, CLIFF empties out the instance space, leaving us room to synthesize new instances.
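A sketch of that last step (illustrative only; the jitter scale and the idea of sampling around the CLIFF-kept boundary rows are assumptions layered on the slide):

```python
import random

# Release synthetic rows jittered around the CLIFF-selected boundary
# instances: class regions are preserved, no original record is published.

def synthesize(kept, n=100, noise=0.05):
    out = []
    for _ in range(n):
        row, cls = random.choice(kept)
        fake = tuple(x * (1 + random.uniform(-noise, noise)) for x in row)
        out.append((fake, cls))
    return out

kept = [((0.10, 200), "ok"), ((0.50, 900), "trouble")]  # e.g. output of cliff()
public = synthesize(kept)
```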
46. We seek industrial partners
1. That will place textual versions of their products in a wiki
2. That will offer joins of those products to quality measures
3. That will suffer us interviewing their managers, from time to time, to learn the control points
(Note: 1 and 2 can be behind your firewalls.)
47. In return, we offer
Agents for automatic, adaptive business intelligence that tunes itself to your local domain.