This document summarizes a talk on data science for software engineering. It discusses how data science involves various fields like statistics, machine learning, and data mining. It notes that while "big data" is often discussed, software engineering data is typically small and sparse. Domain knowledge is important for data mining to avoid misinterpreting data. Data science with software engineering data requires understanding organizations and their willingness to share data given privacy concerns. The document outlines sharing data, models, and methods for learning across different organizations and discusses techniques for balancing privacy and utility when sharing data.
In this introductory lecture, titled "Conceptualising and measuring human anxiety on the Internet", the speaker explains what is new or interesting in the dissertation and how it connects to the field of human-computer interaction and to society in general.
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ... (Lauri Eloranta)
Third lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015 (http://blogs.helsinki.fi/computationalsocialscience/).
Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta
This workshop is a hands-on introduction to machine learning with R and was presented on December 8, 2017 at the University of South Carolina for the 2017 Computational Biology Symposium held by the International Society for Computational Biology Regional Student Group-Southeast USA.
Lessons from the Learning Sciences for Cyber Security Education. Cyber Security Education requires thinking about “how computing works.”
For programmers, why some practices create holes/opportunities.
For end-users, why some activities compromise security.
We need everyone to learn about cyber security.
What can learning sciences tell us about encouraging that kind of learning?
Lesson #1: Context matters.
The Story of Computing for All at Georgia Tech.
Lesson #2: Identity matters.
“Teaching” Graphics Designers who reject CS about CS.
Lesson #3: Structure matters.
Subgoal Labels can Dramatically Improve Learning
Data science has taken the world by storm, driven by the rising need for web crawling and data acquisition to enable unprecedented advances in business intelligence and related technologies. We have compiled a list of 20 renowned data scientists who have made major leaps in their fields and are changing how we see data on a day-to-day basis.
A Pragmatic Perspective on Software Visualization (Arie van Deursen)
Slides of the keynote presentation at the 5th International IEEE/ACM Symposium on Software Visualization, SoftVis 2010. Salt Lake City, USA, October 2010.
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro... (Lauri Eloranta)
Seventh lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015 (http://blogs.helsinki.fi/computationalsocialscience/).
Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta
Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807--1818, 2020.
Machine Learning for Non-technical People (indico data)
Machine learning is one of the most promising and most difficult to understand fields of the modern age. Here are the slides from Slater Victoroff's (CEO of indico) talk at General Assembly Boston for non-technical folks on how to separate the signal from the noise -- stay tuned for the next time he speaks:
https://generalassemb.ly/education/machine-learning-for-non-technical-people
Video at https://www.youtube.com/watch?v=kHZqgmrIg8k
Event page : https://www.meetup.com/Tech-Valley-Machine-Learning-and-AI/events/246094251/
Speaker: Dan Elton
Tech Valley Machine Learning Meetup, Troy New York, 12-28-17
Building off David Martuscello's intro talk at the last meetup, Dan Elton presented some pitfalls one can encounter with machine learning. It is important not to get caught up in the hype surrounding machine learning and to maintain a sense of what ML really is and what its limitations are. The pitfalls discussed included:
-- overfitting & underfitting
-- not cleaning & normalizing your data
-- trying to use machine learning for extrapolation
-- biased data sampling
-- not doing hyperparameter optimization carefully
-- not comparing your results to simple baselines
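The last pitfall in the list, not comparing results to simple baselines, can be sketched in a few lines. This is an illustrative example only, not taken from the talk: the function names `mean_absolute_error` and `beats_mean_baseline` are hypothetical, and the baseline is assumed to be "always predict the training mean".

```python
from statistics import mean
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def beats_mean_baseline(
    y_train: Sequence[float],
    y_test: Sequence[float],
    y_pred: Sequence[float],
) -> bool:
    """Return True only if the model's test error is lower than the error
    of a trivial baseline that always predicts the training-set mean."""
    baseline_value = mean(y_train)
    baseline_error = mean_absolute_error(y_test, [baseline_value] * len(y_test))
    model_error = mean_absolute_error(y_test, y_pred)
    return model_error < baseline_error


if __name__ == "__main__":
    y_train = [1, 2, 3]          # the baseline will always predict 2
    y_test = [2, 4]
    good_predictions = [2.5, 3.5]
    bad_predictions = [5, 0]
    print(beats_mean_baseline(y_train, y_test, good_predictions))
    print(beats_mean_baseline(y_train, y_test, bad_predictions))
```

A model that cannot beat such a trivial predictor on held-out data has probably not learned anything useful, however sophisticated it looks.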
Talk given to the Bellevue ML meetup
https://www.meetup.com/Bellevue-Machine-Learning-Artificial-Intelligence-Meetup/events/247110867/
This talk will go over some pitfalls every machine learning practitioner should know about. Several common technical pitfalls will be discussed: overfitting, failing to clean your data (the "Schenectady Problem"), not normalizing your data, and trying to use machine learning for extrapolation. Next, a few different meanings of the word "bias" will be discussed: the statistical meanings of the term, sampling bias, and social bias. Social bias occurs when machine learning models perpetuate racism, sexism, inequality, and other negative phenomena in society. Following the work of Kate Crawford, two harms from socially biased models can be distinguished: harms of allocation and harms of representation. Informative real-world stories will be used to illustrate each type of bias. The talk will end by stepping back and taking a "big picture" view of machine learning: what it is, what it's good for, and how it contrasts with scientific understanding of the world.
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ... (Amit Sheth)
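The extrapolation pitfall mentioned above can be illustrated with a toy model. The sketch below is an assumption-laden example, not material from the talk: it uses a hypothetical one-nearest-neighbour regressor trained on y = x² for x in 0..10, and shows that a query far outside the training range simply returns the value at the edge of that range.

```python
from typing import Sequence


def nearest_neighbor_predict(
    train_x: Sequence[float], train_y: Sequence[float], x: float
) -> float:
    """Predict by returning the target of the closest training input."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


# Training data covers only x = 0..10 for the true function y = x**2.
train_x = list(range(11))
train_y = [v * v for v in train_x]

if __name__ == "__main__":
    inside = nearest_neighbor_predict(train_x, train_y, 5)    # within the training range
    outside = nearest_neighbor_predict(train_x, train_y, 20)  # far outside it
    print(f"x=5:  predicted {inside}, true {5 * 5}")
    print(f"x=20: predicted {outside}, true {20 * 20}")
```

Inside the training range the prediction matches the true function; at x = 20 the model can only repeat what it saw at x = 10, so it is off by a factor of four. Other model families fail differently when extrapolating, but the underlying issue is the same.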
Keynote at Web Intelligence 2017: http://webintelligence2017.com/program/keynotes/
Video: https://youtu.be/EIbhcqakgvA Paper: http://knoesis.org/node/2698
Abstract: While Bill Gates, Stephen Hawking, Elon Musk, Peter Thiel, and others engage in OpenAI discussions of whether or not AI, robots, and machines will replace humans, proponents of human-centric computing continue to extend work in which humans and machines partner in contextualized and personalized processing of multimodal data to derive actionable information.
In this talk, we discuss how maturing towards the emerging paradigms of semantic computing (SC), cognitive computing (CC), and perceptual computing (PC) provides a continuum through which to exploit the ever-increasing and growing diversity of data that could enhance people’s daily lives. SC and CC sift through raw data to personalize it according to context and individual users, creating abstractions that move the data closer to what humans can readily understand and apply in decision-making. PC, which interacts with the surrounding environment to collect data that is relevant and useful in understanding the outside world, is characterized by interpretative and exploratory activities that are supported by the use of prior/background knowledge. Using the examples of personalized digital health and a smart city, we will demonstrate how the trio of these computing paradigms forms complementary capabilities that will enable the development of the next generation of intelligent systems. For background: http://bit.ly/PCSComputing
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist... (Leandro de Castro)
This set of slides briefly reviews the history of artificial intelligence from its origins in the early 1950s to the new trend of Big Data. It moves from AI through Machine Learning and Natural Computing to Big Data.
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat... (CS, NcState)
Discussions about sharing
- Too much fear
- Not enough about benefits
Can we learn more from sharing than hoarding?
- Yes (results from SE)
Three laws of trusted data sharing:
- For SE quality prediction..
- Better models from shared privatized data than from all raw data
Q: does this work for other kinds of data?
A: don’t know… yet
No Free Lunch: Metadata in the life sciences (Chris Dwan)
This presentation covers some challenges and makes suggestions to support the work of creating flexible, interoperable data systems for the life sciences.
Data Science at Scale - The DevOps Approach (Mihai Criveti)
DevOps Practices for Data Scientists and Engineers
1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
Science has escaped the lab and is roaming free in the world. People use software to understand the world. What tools are needed to support that work?
GALE: Geometric active learning for Search-Based Software Engineering (CS, NcState)
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
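The Pareto frontier that the abstract refers to is the set of solutions that no other solution beats on every objective. The sketch below is not GALE itself and makes no claim about how GALE is implemented; it is a brute-force illustration of Pareto dominance, assuming every objective is to be minimized. The names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is no worse on every objective and strictly
    better on at least one (all objectives are minimized here)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points, in input order.
    Brute force: quadratic in the number of points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    candidates = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
    print(pareto_front(candidates))
```

Here (3, 3) and (4, 4) are dropped because (2, 2) is better on both objectives, while (1, 5), (2, 2) and (5, 1) each represent a different trade-off. This all-pairs comparison does not address the number of model evaluations, which is the cost the abstract says GALE reduces.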
172529main ken and_tim_software_assurance_research_at_west_virginia (CS, NcState)
SA @ WV (software assurance research at West Virginia)
Kenneth McGill
NASA IV&V Facility Research Lead
304.367.8300
Kenneth.McGill@ivv.nasa.gov
Dr. Tim Menzies Ph.D. (WVU)
Software Engineering Research Chair
tim@menzies.us
Next Generation “Treatment Learning” (finding the diamonds in the dust) (CS, NcState)
Q: How have dummies (like me) managed to gain (some) control over a (seemingly) complex world?
A: The world is simpler than we think.
◆ Models contain clumps
◆ A few collar variables decide which clumps to use.
ICSE’14 Workshop Keynote Address: Emerging Trends in Software Metrics (WeTSOM’14).
Data about software projects is not stored in metric1, metric2, …,
but is shared between them in some shared, underlying shape.
Not every project has the same underlying simple shape; many projects have different,
albeit simple, shapes.
We can exploit that shape to great effect: for better local predictions; for transferring
lessons learned; for privacy-preserving data mining.
In the age of Big Data, what role for Software Engineers? (CS, NcState)
ABSTRACT:
Consider the premise of Big Data:
better conclusions = same algorithms + more data + more cpu
If this were always true, then there would be no role for human analysts
that reflected over the domain to offer insights that produce better solutions
(since all such insight is now automatically generated from the CPUs).
This talk proposes a marriage of sorts between Big Data and software
engineering. It reviews over a decade of work by the author in exploring
user goals using CPU-intensive methods. It will be shown that analyst insight was
useful for building “better” tools (where “better” means generating
more succinct recommendations, running faster, and scaling to much larger problems).
The conclusion will be that in the age of big data, human analysis is still
useful and necessary. But a new kind of software engineering analyst is required: one
who knows how to take full advantage of the power of Big Data.
ABOUT THE AUTHOR:
Tim Menzies (Ph.D., UNSW) is a Professor in CS at WVU; the author of
over 230 refereed publications; and is one of the 50 most cited
authors in software engineering (out of 50,000+ researchers, see
http://goo.gl/wqpQl). At WVU, he has been a lead researcher on
projects for NSF, NIJ, DoD, NASA, USDA, as well as joint research work
with private companies. He teaches data mining and artificial
intelligence and programming languages.
Prof. Menzies is the co-founder of the PROMISE conference series
devoted to reproducible experiments in software engineering (see
http://promisedata.googlecode.com). He is an associate editor of IEEE
Transactions on Software Engineering, Empirical Software Engineering
and the Automated Software Engineering Journal. In 2012, he served as
co-chair of the program committee for the IEEE Automated Software
Engineering conference. In 2015, he will serve as co-chair for the
ICSE'15 NIER track. For more information, see his web site
http://menzies.us or his vita at http://goo.gl/8eNhY or his list of
pubs at http://goo.gl/0SWJ2p.
Scalable Product Line Configuration:
A Straw to Break the Camel’s Back
Abdel Salam Sayyad
Joseph Ingram
Tim Menzies
Hany Ammar
IEEE Automated SE,
Palo Alto, CA
Nov 2013
Class Level Fault Prediction using Software Clustering
for
IEEE ASE 2013
by
Giuseppe Scanniello (1) Carmine Gravino (2) Andrian Marcus (3) Tim Menzies (4)
from
1 University of Basilicata, Italy
2 University of Salerno, Italy
3 Wayne State University, USA
4 West Virginia University, USA
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Enhancing Performance with Globus and the Science DMZ (Globus)
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious about our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Expect practical tips and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Dm sei-tutorial-v7
1. Data Science for Software
Engineering (short version)
Tim Menzies, West Virginia University
Fayola Peters, West Virginia University
SEI, August, 2013
SEI http://goo.gl/w4Acsi
ICSE’13 http://goo.gl/29YTMu
0
2. This talk: reflections on data science
and software analytics
1
Two recent special issues of IEEE Software: July’13; Sept’13.
Editors: Menzies & Zimmermann
3. • Statistics
• Operations research
• Machine Learning
• Data mining
• Predictive Analytics
• Business Intelligence
• Data Science
• Smart data
• Big Data
2
Insert
buzzword
here
4. Big data: not-so-successful stories
• Community medicine
– Additional manual
collection required for
their queries
• Software engineering
– Much product data
• examples of source code
– Little process data
• costs, quality measures
We go mining with the data we have,
not the data we want. Get used to it.
5. But what isn’t being said in all of the
above about data mining + SE?
1. It’s not all about algorithms (people matter)
2. Data mining is a technical and a sociological problem
– No point in talking about how to learn lessons from many
organizations…
– …unless those organizations let you access their data
– The problem of privacy
3. When we learn from each other
– There is more to sharing than just “you give me your
model”
• Local learning, ensembles, filtering…
6. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
5
8. What can we share?
• Two software project
managers meet
– What can they learn
from each other?
• They can share
1. Data
2. Models
3. Methods
• techniques for turning
data into models
4. Insight into the domain
• The standard mistake
– It is generally assumed that
models can be shared
without modification.
– Yeah, right…
7
9. SE research = sparse sample of a
very diverse set of activities
8
Microsoft research,
Redmond, Building 99
Other studios,
many other projects
And they are all different.
10. Models may not move
(effort estimation)
• 20 random samples, each 66% of the
data from NASA
• Linear regression on
each sample to learn
$\mathit{effort} = a \cdot \mathit{LOC}^{b} \cdot \sum_i \beta_i x_i$
• Back-select to remove
useless $x_i$
• Result?
– Wide variance in the learned $\beta_i$
* T. Menzies, A.Butcher, D.Cok, A.Marcus, L.Layman, F.Shull, B.Turhan, T.Zimmermann, "Local vs. Global Lessons for Defect
Prediction and Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
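The resampling experiment on this slide can be sketched in a few lines. This is only an illustration, not the study's code: the project data is synthetic, the model is simplified to the one-variable form effort = a * LOC^b fitted in log space, and the names `fit_loglinear` and `coefficient_spread` are hypothetical.

```python
import math
import random

def fit_loglinear(rows):
    """Least-squares fit of log(effort) = log(a) + b*log(loc); returns (a, b)."""
    xs = [math.log(loc) for loc, _ in rows]
    ys = [math.log(effort) for _, effort in rows]
    n = len(rows)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def coefficient_spread(rows, repeats=20, fraction=0.66, seed=1):
    """Refit on `repeats` random subsamples; return the learned b values."""
    rng = random.Random(seed)
    k = max(2, int(len(rows) * fraction))
    return [fit_loglinear(rng.sample(rows, k))[1] for _ in range(repeats)]

# Synthetic (hypothetical) projects: (KLOC, effort) with multiplicative noise.
_rng = random.Random(0)
projects = []
for _ in range(30):
    loc = _rng.uniform(5, 200)
    projects.append((loc, 3.0 * loc ** 1.1 * math.exp(_rng.gauss(0, 0.5))))

bs = coefficient_spread(projects)
print(f"b ranges from {min(bs):.2f} to {max(bs):.2f} over {len(bs)} samples")
```

If the learned exponent moves a lot between samples of the same data, a model learned on one sample should not be expected to transfer unchanged to another project.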
11. Models may not move
(defect prediction)
* T. Menzies, A.Butcher, D.Cok, A.Marcus, L.Layman, F.Shull, B.Turhan, T.Zimmermann, "Local vs. Global Lessons for Defect Prediction and
Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
12. Oh woe is me
• No generality in SE?
• Nothing we can learn
from each other?
• Forever doomed to never
make a conclusion?
– Always, laboriously,
tediously, slowly, learning
specific lessons that hold
only for specific projects?
• No: 3 things we might
want to share
– Models, methods, data
• If no general models, then
– Share methods
• general methods for
quickly turning local data
into local models.
– Share data
• Find and transfer relevant
data from other projects to
us
11
13. The rest of this tutorial
• Data science
– How to share data
– How to share methods
• Maybe one day, in the future,
– after we’ve shared enough data and methods
– We’ll be able to report general models
• But first,
– Some general notes on data mining
12
15. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
–Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
14
16. Case Study : NASA
• NASA’s Software Engineering Lab, 1990s
– Gave free access to all comers to their data
– But you had to come to get it (to learn the domain)
– Otherwise: mistakes
• E.g. one class of software module had far more errors than
anything else.
– Dumb data mining algorithms might learn that this kind of module is
inherently more error prone
• Smart data scientists might ask “what kind of
programmer works on that module?”
– A: we always give that stuff to our beginners as a learning exercise
* F. Shull, M. Mendonça, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira,
"Knowledge-Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.
17. So algorithms are
only part of the story
• Drew Conway, The Data Science Venn Diagram, 2009,
• http://www.dataists.com/2010/09/the-data-science-venn-diagram/
• Dumb data miners miss important
domain semantics
• An ounce of domain knowledge is
worth a ton of algorithms.
• Math and statistics only get you
machine learning.
• Science is about discovery and building
knowledge, which requires some
motivating questions about the world
• The culture of academia does not
reward researchers for understanding
domains.
19. Management
misconceptions of Big Data
• All our data analysis problems will be solved
– Once we boot a CPU farm
– Once we bring up Hadoop and Map-reduce
• If your first question is “what tools to buy?”
– Then you are asking the wrong question
18
20. • Deploy data scientists before deploying tools
Tools can augment, but
not replace, human insight
Source: http://goo.gl/CCMZo
21. The great myth
• Wouldn’t it be
wonderful if we did not
have to listen to them?
– The dream of
olde-worlde machine
learning
• Circa 1980s
– Dispense with live
experts and resurrect
dead ones.
• But any successful
learner needs biases
– Ways to know what’s
important
• What’s dull
• What can be ignored
– No bias? Can’t ignore
anything
• No summarization
• No generalization
• No way to predict the future
20
Lesson:
TALK TO
THE USERS!
22. The Inductive
Engineering Manifesto
• Users before algorithms:
– Mining algorithms are only useful in industry if
users fund their use in real-world applications.
• Data science
– Understanding user goals to inductively generate
the models that most matter to the user.
21
• T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaganeli.
The inductive software engineering manifesto. (MALETS '11).
23. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
–Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
24. Do it again, and again,
and again, and …
23
In any industrial
application, data science
is repeated multiple
times, either to answer an
extra user question, to
make some
enhancement and/or
bug fix to the method,
or to deploy it to a
different set of users.
25. Thou shalt not click
• For serious data science studies,
– to ensure repeatability,
– the entire analysis should be automated
– using some high level scripting language;
• e.g. R-script, Matlab, Bash, ….
24
29. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
–How to prune data, simpler &
smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
28
30. How to Prune Data,
Simpler and Smarter
29
Data is the new
oil
32. Data for Industry / Active Learning
[Figure: median MRE (y-axis) against instances sorted in decreasing popularity (x-axis)]
• Picking random training instances is
not a good idea
• More popular instances
in the active pool
decrease error
• One of the stopping-point
conditions fires
33. Data for Industry / Active Learning
At most 31% of all
the cells
On median 10%
Intrinsic dimensionality: There is a consensus in
the high-dimensional data analysis community
that the only reason any methods work in very
high dimensions is that, in fact, the data is not
truly high-dimensional*
* E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in
Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
34. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
–How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
33
35. Is Data Sharing Worth the Risk to
Individual Privacy?
What would William Weld say?
• Former Governor of Massachusetts.
• Victim of a re-identification privacy breach.
• The breach led to sensitive attribute disclosure of his medical records.
34
36. Is Data Sharing Worth the Risk to
Individual Privacy?
What about NASA contractors?
Subject to competitive bidding
every 2 years.
Unwilling to share data
that would lead to
sensitive attribute disclosure.
e.g. actual software
development times
35
37. When To Share – How To Share
So far we cannot guarantee
100% privacy.
What we have is a directive
as to whether data is private
and useful enough to share...
We have a lot of privacy
algorithms geared toward
minimizing risk.
Old School
K-anonymity
L-diversity
T-closeness
But What About Maximizing Benefits (Utility)?
The degree of risk to the
data sharing entity must
not exceed the benefits of
sharing.
36
39. Balancing Privacy and Utility
or...
Minimize risk of privacy disclosure while maximizing utility.
Instance Selection with CLIFF
Small random moves with MORPH
= CLIFF + MORPH
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
38
41. CLIFF
Don't share all the data.
step1:
for each class find ranks
of all values
E.g. "a=r1" is
powerful for selecting
class=yes because it is
more common in "yes"
than "no"
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
40
42. CLIFF
step2:
multiply ranks of each
row
43. CLIFF
Don't share all the data.
step3: select the most powerful
rows of each class
Scalability
• Note: linear time
• Can reduce N rows to 0.1N
• So an $O(N^2)$ NUN (nearest unlike neighbor) algorithm
now takes time $O(0.01N^2)$
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
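The CLIFF steps (rank the values per class, multiply the ranks along each row, keep the most powerful rows of each class) can be sketched as below. This is a minimal sketch, not the authors' implementation. The slides do not give the ranking formula, so the rule used here, like^2 / (like + other), is an assumption, as are the function name `cliff` and the `keep` parameter.

```python
from collections import Counter, defaultdict

def cliff(rows, labels, keep=0.1):
    """Return sorted indices of the most 'powerful' rows of each class.

    rows   : list of tuples of discrete attribute values
    labels : class label of each row
    keep   : fraction of each class to retain (at least one row per class)
    """
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # counts[label][(column, value)] = rows of that class holding that value
    counts = {c: Counter() for c in by_class}
    for row, label in zip(rows, labels):
        for col, value in enumerate(row):
            counts[label][(col, value)] += 1

    def power(col, value, label):
        # Assumed scoring rule: frequent in this class, rare in the others.
        n_here = len(by_class[label])
        n_rest = len(rows) - n_here
        like = counts[label][(col, value)] / n_here
        seen_rest = sum(counts[c][(col, value)] for c in by_class if c != label)
        other = seen_rest / n_rest if n_rest else 0.0
        return like * like / (like + other) if like + other else 0.0

    selected = []
    for label, members in by_class.items():
        def row_score(i):
            # Step 2: multiply the ranks of every value in the row.
            score = 1.0
            for col, value in enumerate(rows[i]):
                score *= power(col, value, label)
            return score
        # Step 3: keep the highest-scoring rows of this class.
        ranked = sorted(members, key=row_score, reverse=True)
        selected.extend(ranked[:max(1, int(len(members) * keep))])
    return sorted(selected)
```

Each row is visited a fixed number of times, which is consistent with the linear-time note on the slide.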
42
44. MORPH
Push the CLIFF data from their original position.
y = x ± (x − z) ∗ r
• x ∈ D: the original instance
• z ∈ D: the NUN (nearest unlike neighbor) of x
• y: the resulting MORPHed instance
F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Software Engineering (ICSE), 2012 34th
International Conference on, june 2012, pp. 189 –199.
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction,"
IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
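The MORPH move y = x ± (x − z) ∗ r can be sketched as follows. This is an illustration only. The slide does not say how r is chosen, so drawing it uniformly from a small range (the `r_lo` and `r_hi` defaults) is an assumption, as are the use of Euclidean distance and the function names.

```python
import math
import random

def nearest_unlike_neighbor(i, rows, labels):
    """Return the row closest to rows[i] that carries a different label."""
    best, best_d = None, math.inf
    for j, other in enumerate(rows):
        if labels[j] != labels[i]:
            d = math.dist(rows[i], other)
            if d < best_d:
                best, best_d = other, d
    return best

def morph(rows, labels, r_lo=0.15, r_hi=0.35, seed=1):
    """Move every numeric row a small random amount along the line to its NUN."""
    rng = random.Random(seed)
    out = []
    for i, x in enumerate(rows):
        z = nearest_unlike_neighbor(i, rows, labels)
        if z is None:                      # only one class present: leave as is
            out.append(tuple(x))
            continue
        r = rng.uniform(r_lo, r_hi)        # assumed range for r
        sign = rng.choice((-1, 1))         # the "±" in the formula
        out.append(tuple(xi + sign * (xi - zi) * r for xi, zi in zip(x, z)))
    return out
```

Because r stays well below 1, a morphed row moves only part of the way toward (or away from) its nearest unlike neighbor.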
43
45. Case Study: Cross-Company Defect Prediction (CCDP)
Sharing Required.
Zimmermann et al.
Local data not always
available
• companies too small
• product in first release, so
no past data.
Kitchenham et al.
• no time for collection
• new technology can make all
data irrelevant
T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.”
in ESEC/SIGSOFT FSE’09,2009
B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,”
IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007
- Company B has little or no data to build a defect model;
- Company B uses data from Company A to build defect models;
44
46. Measuring the Risk
IPR = Increased Privacy Ratio
Query | Original | Privatized | Privacy breach
Q1    | 0        | 0          | yes
Q2    | 0        | 1          | no
Q3    | 1        | 1          | yes
Breaches (yes) = 2/3
IPR = 1 − 2/3 = 0.33
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
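The IPR arithmetic in the table can be checked with a short function. This sketch assumes, as the table suggests, that a query counts as a breach when it returns the same answer on the original and the privatized data. The function name is hypothetical.

```python
def increased_privacy_ratio(original, privatized):
    """IPR = 1 - (fraction of queries whose answer is unchanged by privatization)."""
    if len(original) != len(privatized) or not original:
        raise ValueError("need two non-empty answer lists of equal length")
    breaches = sum(o == p for o, p in zip(original, privatized))
    return 1 - breaches / len(original)

# Answers to Q1..Q3 from the slide's table.
ipr = increased_privacy_ratio([0, 0, 1], [0, 1, 1])
print(f"IPR = {ipr:.2f}")
```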
45
47. Measuring the Utility
The g-measure combines:
• Probability of detection (pd)
• Probability of false alarm (pf)
              | Actual yes | Actual no
Predicted yes | TP         | FP
Predicted no  | FN         | TN
pd = TP/(TP+FN)
pf = FP/(FP+TN)
g-measure = 2*pd*(1-pf)/(pd+(1-pf))
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
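As a worked example of the g-measure, the sketch below computes pd, pf and g from the four confusion-matrix counts. The function name and the choice to return 0.0 when a denominator is zero are assumptions made for the illustration.

```python
def g_measure(tp, fp, fn, tn):
    """Harmonic mean of pd (recall) and 1 - pf (specificity)."""
    pd = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    denom = pd + (1 - pf)
    return 2 * pd * (1 - pf) / denom if denom else 0.0

# Example: 8 defects found, 2 missed, 1 false alarm, 9 correct rejections.
# pd = 0.8 and pf = 0.1, so g = 2 * 0.8 * 0.9 / 1.7.
print(round(g_measure(tp=8, fp=1, fn=2, tn=9), 3))
```

Like any harmonic mean, g is high only when both pd is high and pf is low.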
46
48. Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity
Data Swapping (s10, s20, s40)
A standard perturbation
technique used for privacy
To implement...
• For each NSA, a certain percent
of the values are swapped with
any other value in that NSA.
• For our experiments, these
percentages are 10, 20 and 40.
k-anonymity (k2, k4)
The Datafly Algorithm.
To implement...
• Make a generalization hierarchy.
• Replace values in the
NSA according to the hierarchy.
• Continue until there are k or
fewer distinct instances and
suppress them.
K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European
conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211.
L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.
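The data swapping baseline described on this slide can be sketched for a single attribute column. This is one reading of the slide text, not the code used in the experiments: the function name, the rounding of the percentage, and picking the swap partner uniformly at random are all assumptions.

```python
import random

def swap_column(values, percent, seed=1):
    """Swap roughly `percent`% of a column's entries with other entries of that column."""
    rng = random.Random(seed)
    out = list(values)
    n = len(out)
    if n < 2:
        return out
    k = round(n * percent / 100)
    for i in rng.sample(range(n), k):
        j = rng.choice([p for p in range(n) if p != i])  # any other position
        out[i], out[j] = out[j], out[i]
    return out

column = ["low", "high", "high", "nominal", "low", "very_high", "nominal", "low", "high", "low"]
for pct in (10, 20, 40):                  # the s10, s20, s40 settings
    print(pct, swap_column(column, pct))
```

Swapping keeps the overall distribution of values in the column intact while breaking the link between a value and its row.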
47
49. Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
48
53. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
–Envy-based learning
– Ensembles
52
54. • Seek the fence
where the grass
is greener on the
other side.
• Learn from
there
• Test on here
• Cluster to find
“here” and
“there”
53
Envy =
The WisDOM Of
the COWs
55. 54
@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real
@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
….
DATA = MULTI-DIMENSIONAL VECTORS
56. CAUTION: data may not divide neatly
on raw dimensions
• The best description for SE projects may be
synthesized dimensions extracted from the raw
dimensions
55
57. Fastmap
56
Fastmap: Faloutsos [1995]
O(2N) generation of an axis of large variability
• Pick any point W;
• Find X furthest from W;
• Find Y furthest from X.
c = dist(X,Y)
All points have distances a, b to (X, Y)
• $x = (a^2 + c^2 - b^2)/(2c)$
• $y = \sqrt{a^2 - x^2}$
Find median(x), median(y)
Recurse on four quadrants
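The projection step above can be sketched as follows. This is a minimal sketch that assumes numeric points and Euclidean distance. It covers only the mapping of each point to (x, y), not the recursion on quadrants, and the function name is hypothetical.

```python
import math
import random

def fastmap_project(points, seed=1):
    """Map each point to (x, y): x along the line between two distant pivots,
    y the point's distance from that line."""
    rng = random.Random(seed)
    w = rng.choice(points)                                    # any point W
    pivot_x = max(points, key=lambda p: math.dist(w, p))      # X: furthest from W
    pivot_y = max(points, key=lambda p: math.dist(pivot_x, p))  # Y: furthest from X
    c = math.dist(pivot_x, pivot_y)
    if c == 0:                                                # all points identical
        return [(0.0, 0.0) for _ in points]
    projected = []
    for p in points:
        a, b = math.dist(p, pivot_x), math.dist(p, pivot_y)
        x = (a * a + c * c - b * b) / (2 * c)                 # cosine rule
        y = math.sqrt(max(0.0, a * a - x * x))                # clamp rounding error
        projected.append((x, y))
    return projected

print(fastmap_project([(0, 0), (10, 0), (5, 3)]))
```

Each call makes two passes over the data to find the pivots, which is where the O(2N) figure comes from.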
58. Hierarchical partitioning
Grow
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
59. Q: why cluster via FASTMAP?
• A1: Circular methods (e.g. k-means)
assume round clusters.
• But density-based clustering allows
clusters to be any shape
• A2: No need to pre-set the number of
clusters
• A3: Because other methods
(e.g. PCA) are much slower
• Fastmap is O(2N)
• Unoptimized Python:
58
62. Hierarchical partitioning
Grow
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
Where is grass greenest?
• This cluster envies its neighbor with
better score and max
abs(score(this) - score(neighbor))
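The envy rule (pick the neighbor with a better score and the largest absolute score difference) can be sketched as below. The cluster names and scores are made up for illustration, the function name is hypothetical, and "better" is assumed to mean lower, as it would for effort or defect counts.

```python
def envied_neighbor(this, neighbors, scores, lower_is_better=True):
    """Return the neighboring cluster that `this` envies most, or None."""
    def is_better(n):
        if lower_is_better:
            return scores[n] < scores[this]
        return scores[n] > scores[this]

    candidates = [n for n in neighbors if is_better(n)]
    if not candidates:
        return None                        # nothing nearby is greener
    return max(candidates, key=lambda n: abs(scores[this] - scores[n]))

# Hypothetical median effort per cluster.
scores = {"c1": 120, "c2": 80, "c3": 40, "c4": 200}
print(envied_neighbor("c1", ["c2", "c3", "c4"], scores))
```

Rules are then learned from the envied cluster ("there") and tested on this one ("here").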
63. Q: How to learn rules from
neighboring clusters
• A: it doesn’t really matter
– Many competent rule learners
• But to evaluate global vs local rules:
– Use the same rule learner for local vs global rule learning
• This study uses WHICH (Menzies [2010])
– Customizable scoring operator
– Faster termination
– Generates very small rules (good for explanation)
62
64. Data from
http://promisedata.org/data
• Effort reduction =
{NasaCoc, China}:
COCOMO or function points
• Defect reduction =
{lucene, xalan, jedit, synapse, etc.}:
CK metrics (OO)
• Clusters have an untreated class
distribution.
• Rules select a subset of the
examples:
– generate a treated class
distribution
• Distributions have percentiles: 25th, 50th, 75th, 100th
[Figure: percentile bars on a 0 to 100 scale for three distributions]
– untreated
– global: treated with rules learned from all data
– local: treated with rules learned from neighboring cluster
65. • Lower median efforts/defects (50th percentile)
• Greater stability (75th – 25th percentile)
• Decreased worst case (100th percentile)
By any measure,
Local BETTER THAN GLOBAL
64
66. Rules learned in each cluster
• What works best “here” does not work “there”
– Misguided to try and tame conclusion instability
– Inherent in the data
• Can’t tame conclusion instability.
• Instead, you can exploit it
• Learn local lessons that do better than overly generalized global theories
65
67. OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
–Ensembles
66
68. Managing Dataset Shift
• Types of shift: covariate shift, prior probability shift, sampling, imbalanced data, domain shift, source component shift
• Techniques: outlier detection, relevancy filtering, instance weighting, stratification, cost curves, mixture models
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
69. Solutions to SE Model Problems/
Ensembles of Learning Machines*
Sets of learning machines grouped together.
Aim: to improve predictive performance.
Base learners B1, B2, ..., BN produce estimation1, estimation2, ..., estimationN.
E.g.: ensemble estimation = $\sum_i w_i \cdot \mathit{estimation}_i$
* T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in
Multiple Classifier Systems. 2000.
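The weighted-sum combination can be illustrated with a short function. This is a sketch only. Dividing by the total weight, so the weights need not sum to one, is a choice made here and is not stated on the slide. The function name is hypothetical.

```python
def ensemble_estimate(estimates, weights=None):
    """Combine base-learner estimates as a weighted average."""
    if not estimates:
        raise ValueError("need at least one base estimate")
    if weights is None:
        weights = [1.0] * len(estimates)   # default: plain average
    if len(weights) != len(estimates):
        raise ValueError("one weight per estimate is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Three hypothetical effort estimates (person-months) from base learners B1..B3.
print(ensemble_estimate([100, 200, 300]))       # equal weights
print(ensemble_estimate([100, 200], [3, 1]))    # trust B1 three times more
```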
68
70. Solutions to SE Model Problems/
Ensembles of Learning Machines
One of the keys:
Diverse* ensemble: “base learners” make different
errors on the same instances.
* G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of
Information Fusion 6(1): 5-20, 2005.
69
71. Solutions to SE Model Problems/
Dynamic Adaptive Ensembles
Dynamic Cross-company Learning (DCL)
DCL uses new completed projects that arrive with time.
DCL determines when CC data is useful.
DCL adapts to changes by using CC data.
Predicting effort for a single company from ISBSG based on its projects and other companies' projects.
* L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings
of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012.
http://dx.doi.org/10.1145/2365324.2365334.
70