The document discusses adversarial learning, robust learning, and scalable learning. It explores how machine learning algorithms can be exploited by adversaries and how to develop robust and scalable algorithms. Key contributions include showing vulnerabilities in classifiers to exploratory and causative attacks, developing models for multi-labeler learning from unreliable sources, and algorithms for online and client-server learning from noisy data streams. The goal is to learn effectively from unfaithful or incomplete training data and enable learning at large scales required for real-time applications.
This document presents a research thesis that investigates using Monte Carlo simulation as an asset allocation tool for real estate portfolios during different economic periods. The thesis acknowledges those who provided guidance and support. It includes an introduction outlining the research problem and question, as well as the methodology, literature review, research design, data analysis, conclusion and references. The research aims to determine if altering real estate asset allocation based on Monte Carlo simulation during growth, decline and stable periods can enhance portfolio performance compared to a static allocation. Portfolio performance is measured using the Sharpe ratio to assess risk-adjusted returns.
The document discusses various stability patterns and antipatterns related to system architecture and integration. It describes antipatterns like integration points, blocked threads, cascading failures, and attacks of self-denial that can cause system failures. It then presents patterns like circuit breakers, bulkheads, test harnesses, and decoupling middleware that can be used to detect failures early and prevent cascading failures from spreading. The document emphasizes the importance of monitoring dependencies, avoiding tight couplings, and using timeouts and fail-fast mechanisms to isolate failures and stop errors from propagating.
Apache Kafka is a distributed streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix and LinkedIn. In this talk, Matt gave a technical overview of Apache Kafka, discussed practical use cases of Kafka for IoT data and demonstrated how to ingest data from an MQTT server using Kafka Connect.
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat... (Alok Singh)
Alok Singh is a Principal Engineer at IBM CODAIT who has built multiple analytical frameworks and machine learning algorithms. The presentation provides an overview of building predictive models for imbalanced datasets using scikit-learn and XGBoost. It discusses challenges with imbalanced data, evaluation metrics like confusion matrix and ROC curves, and techniques for imbalanced learning including weighted classes, oversampling minorities and undersampling majorities, and SMOTE. The presentation concludes with a hands-on tutorial demonstrating these techniques on an imbalanced bank marketing dataset.
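The class-weighting and oversampling ideas summarized above can be sketched without any ML library. A minimal illustration: the `scale_pos_weight` ratio follows XGBoost's documented convention (negative count over positive count), while the oversampler is a generic random-duplication sketch, not SMOTE.

```python
import random

def scale_pos_weight(labels):
    # XGBoost's recommended setting for imbalanced binary data:
    # weight the positive class by n_negative / n_positive.
    pos = sum(1 for y in labels if y == 1)
    return (len(labels) - pos) / pos

def oversample_minority(samples, seed=0):
    # samples: list of (features, label) pairs with binary labels 0/1.
    # Randomly duplicate minority-class samples until classes balance.
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for s in samples:
        by_class[s[1]].append(s)
    minority, majority = sorted(by_class.values(), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra
```

With 10 positives and 90 negatives, `scale_pos_weight` returns 9.0 and the oversampler yields a 90/90 split.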
This document discusses various techniques for machine learning when labeled training data is limited, including semi-supervised learning approaches that make use of unlabeled data. It describes assumptions like the clustering assumption, low density assumption, and manifold assumption that allow algorithms to learn from unlabeled data. Specific techniques covered include clustering algorithms, mixture models, self-training, and semi-supervised support vector machines.
This document discusses adversarial machine learning and how machine learning algorithms can be attacked. It gives examples of how naive Bayes, k-means clustering, and SVM algorithms can be subverted by manipulating input data or model parameters: a naive Bayes filter's accuracy can be degraded by adding benign words to spam messages, k-means clustering's false negative rate can be raised by injecting outlier points, and an SVM's decision boundary, and hence its predictions, can be steered by an attacker. The document advocates defenses such as ensembling multiple models and using robust learning methods.
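The benign-word attack on naive Bayes described above can be illustrated with a toy filter. The word scores below are invented log-likelihood ratios, not values from any real spam filter.

```python
# Hypothetical per-word evidence scores, log P(w|spam) - log P(w|ham),
# of the kind a trained naive Bayes spam filter accumulates.
WORD_SCORES = {"viagra": 3.0, "winner": 2.0, "free": 1.5,
               "meeting": -1.0, "report": -1.5, "thanks": -2.0}

def spam_score(words):
    # Naive Bayes sums per-word evidence; a positive total means "spam".
    return sum(WORD_SCORES.get(w, 0.0) for w in words)

spam = ["viagra", "winner", "free"]
assert spam_score(spam) > 0  # correctly blocked

# Good-word attack: pad the same message with benign vocabulary until
# the summed evidence falls below the decision threshold.
evasive = spam + ["meeting", "report", "thanks", "thanks", "thanks"]
assert spam_score(evasive) < 0  # now slips through
```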
Patrick Hall, Professor, AI Risk Management, The George Washington University
H2O Open Source GenAI World SF 2023
Language models are incredible engineering breakthroughs but require auditing and risk management before productization. These systems raise concerns about toxicity, transparency and reproducibility, intellectual property licensing and ownership, disinformation and misinformation, supply chains, and more. How can your organization leverage these new tools without taking on undue or unknown risks? While language models and associated risk management are in their infancy, a small number of best practices in governance and risk are starting to emerge. If you have a language model use case in mind, want to understand your risks, and do something about them, this presentation is for you!
This document discusses adversarial learning and the adversarial classification reverse engineering (ACRE) problem. The ACRE problem aims to efficiently learn enough about a classifier to construct adversarial attacks using a limited number of queries. The document presents algorithms for reverse engineering linear classifiers with continuous and boolean features. It shows ACRE learning is possible within a factor of 1+ε for continuous features and 2 for boolean features. Experimental results demonstrate the algorithm's effectiveness on spam filtering tasks. Future work directions are discussed around different classifier and cost function types.
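The query-based setting that ACRE formalizes can be illustrated with a toy black-box linear classifier: a binary search along one continuous feature locates the decision boundary to arbitrary precision using only membership queries. The weights below are invented, and this is a sketch of the setting, not the paper's algorithm.

```python
# Black-box linear classifier the attacker can only query.
W, B = [2.0, -1.0], 0.5  # hidden weights and bias

def query(x):
    return sum(w * xi for w, xi in zip(W, x)) + B > 0

def boundary_on_feature(x, i, lo=-100.0, hi=100.0, iters=60):
    # Binary-search the value of feature i at which the black-box label
    # flips, holding all other features fixed. Assumes the labels at lo
    # and hi differ, which holds when a linear boundary crosses the range.
    probe = x[:]
    probe[i] = lo
    label_lo = query(probe)
    probe[i] = hi
    assert query(probe) != label_lo
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        probe[i] = mid
        if query(probe) == label_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With the hidden weights above and `x = [0.0, 0.0]`, the search on feature 0 converges to the boundary value -0.25, where 2·x₀ + 0.5 changes sign.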
The document discusses various machine learning concepts like model overfitting, underfitting, missing values, stratification, feature selection, and incremental model building. It also discusses techniques for dealing with overfitting and underfitting like adding regularization. Feature engineering techniques like feature selection and creation are important preprocessing steps. Evaluation metrics like precision, recall, F1 score and NDCG are discussed for classification and ranking problems. The document emphasizes the importance of feature engineering and proper model evaluation.
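The evaluation metrics mentioned above are easy to state concretely; a minimal sketch of precision, recall, and F1 computed from confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn):
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn)
    # F1: harmonic mean, punishing imbalance between the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = classification_metrics(tp=8, fp=2, fn=8)
# precision = 0.8, recall = 0.5, F1 = 8/13 ≈ 0.615
```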
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides (QuestionPro)
This document provides an overview of MaxDiff, a technique for evaluating preferences that asks respondents to choose best and worst options from sets. It notes limitations of traditional rating scales like scale bias and lack of constraints. MaxDiff forces trade-offs and provides richer data than ratings. Questions present lists and ask for most/least important. Results can be analyzed simply by counting choices or more advanced techniques can provide respondent-level utilities. The document provides examples and tips for effective MaxDiff surveys.
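The simple counting analysis mentioned above reduces to best-minus-worst tallies across respondents' choice tasks; a minimal sketch (the item names are invented):

```python
from collections import Counter

def maxdiff_counts(tasks):
    # tasks: one (best, worst) choice pair per MaxDiff question.
    # Counting analysis scores each item by how often it was picked
    # best minus how often it was picked worst (a fuller analysis
    # would also normalize by how often each item was shown).
    best = Counter(b for b, _ in tasks)
    worst = Counter(w for _, w in tasks)
    items = set(best) | set(worst)
    return {i: best[i] - worst[i] for i in items}

scores = maxdiff_counts([("price", "brand"), ("price", "support"),
                         ("quality", "brand")])
# price: +2, quality: +1, support: -1, brand: -2
```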
This document discusses key concepts in sampling design and procedures. It covers reasons for sampling such as pragmatic and cost reasons. It also discusses defining the target population and sampling frame. The document contrasts probability sampling techniques like simple random sampling, systematic sampling, and stratified sampling with non-probability techniques like convenience sampling and snowball sampling. It discusses factors to consider in determining sample size such as variance, desired confidence level and interval. Sample size formulas for estimating means and proportions are also provided.
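The standard sample-size formula for estimating a proportion can be written out directly; a sketch of the textbook formula n = z²·p(1-p)/e²:

```python
import math

def sample_size_for_proportion(z, p, e):
    # n = z^2 * p * (1 - p) / e^2, rounded up to a whole respondent.
    # z: z-value for the desired confidence level (1.96 for 95%),
    # p: anticipated proportion (0.5 is the conservative worst case),
    # e: desired margin of error (half-width of the interval).
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

n = sample_size_for_proportion(z=1.96, p=0.5, e=0.05)
# 385 respondents for a ±5% margin at 95% confidence
```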
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019 (anant90)
Machine learning is itself just another tool, susceptible to adversarial attacks. These can have huge implications, especially in a world with self-driving cars and other automation. In this talk, we will look at recent developments in adversarial attacks on AI systems, and how far we have come in mitigating these attacks.
Connections b/w active learning and model extraction (Anmol Dwivedi)
Codes on https://github.com/anmold-07/Model-Extraction-with-RL
https://www.usenix.org/conference/usenixsecurity20/presentation/chandrasekaran
This paper formalizes model extraction and discusses possible defense strategies by drawing parallels between model extraction and an established area of active learning. In particular, the authors show that recent advancements in the active learning domain can be used to implement powerful model extraction attacks and investigate possible defense strategies.
Correlation, causation and incrementally recommendation problems at netflix ... (Roelof van Zwol)
Within Netflix, personalization is a key differentiator, helping members quickly discover new content that matches their taste. Done well, it creates an immersive user experience; when a recommendation is out of tune, however, it is immediately noticed by our members. During this presentation I will cover some of the personalization and recommendation tasks that jointly define the Netflix user experience, which entertains more than 130M members worldwide. In particular, I will focus on several of the algorithmic challenges related to the launch of new Netflix originals in the service, and go over concepts such as causality, incrementality, and explore-exploit strategies.
The research presented in this talk represents the collaborative efforts of a team of research scientists and engineers at Netflix on our journey to create best in class user experiences.
- Machine learning models are negatively impacted by noisy or inconsistent labels in training data. This is a challenge for tasks like bug severity classification where labels can be subjective.
- A new evaluation metric called Krippendorff's alpha is proposed to measure agreement between labels while accounting for inconsistencies. It is shown to better reflect performance than accuracy when labels are inconsistent.
- Making "big data thick" by improving quality is an important future direction, but challenging at scale. Lightweight methods are needed to reduce noise without extensive manual labelling. Performance measures also need to account for noise inherent in some real-world problems.
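Krippendorff's alpha, mentioned above, can be computed from pairwise coincidences between labels. A minimal sketch under the standard nominal-distance formulation, treating each item's ratings as interchangeable (this is an illustrative implementation, not the authors' code):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    # units: one list of labels per rated item (e.g. per bug report),
    # containing however many ratings that item received.
    o_disagree = 0.0          # weighted count of disagreeing ordered pairs
    margins = Counter()       # how often each label value occurs overall
    for labels in units:
        m = len(labels)
        if m < 2:
            continue          # a single rating has nothing to agree with
        for a, b in permutations(labels, 2):
            if a != b:
                o_disagree += 1.0 / (m - 1)
        margins.update(labels)
    n = sum(margins.values())
    d_obs = o_disagree / n
    # Expected disagreement if labels were assigned at random with the
    # same overall frequencies.
    d_exp = (n * n - sum(c * c for c in margins.values())) / (n * (n - 1))
    return 1.0 - d_obs / d_exp
```

Consistent ratings give alpha near 1; chance-level labeling gives 0; systematic disagreement goes negative.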
Dr Murari Mandal from NUS presented, as part of the three-day OpenPOWER Industry Summit, on robustness in deep learning. He covered AI breakthroughs, performance improvements in AI models, adversarial attacks, attacks on semantic segmentation, attacks on object detectors, defending against adversarial attacks, and many other areas.
AI and ML Skills for the Testing World Tutorial (Tariq King)
Software continues to revolutionize the world, impacting nearly every aspect of our work, family, and personal life. Artificial intelligence (AI) and machine learning (ML) are playing key roles in this revolution through improvements in search results, recommendations, forecasts, and other predictions. AI and ML technologies are being used in platforms for digital assistants, home entertainment, medical diagnosis, customer support, and autonomous vehicles. Testing practitioners are recognizing the potential for advances in AI and ML to be leveraged for automated testing—an area that still requires significant manual effort. Tariq King and Jason Arbon introduce you to the world of AI for software testing. Learn the fundamentals behind autonomous and intelligent agents, ML approaches including Bayesian networks, decision tree learning, neural networks, and reinforcement learning. Discover how to apply these techniques to common testing tasks such as identifying testable features, generating test flows, and detecting erroneous states.
Machine Learning Experimentation at Sift Science (Sift Science)
Alex Paino, a Software Engineer at Sift Science, discusses how we use machine learning to prevent several types of abusive user behavior for thousands of customers. Measuring the accuracy of the thousands of classifiers used in a manner that correctly represents the value provided to customers is a huge challenge for us. Alex describes how we think about this problem and what we have done to address it. This includes an overview of the various tools and methodologies we employ that allow us to quickly summarize the results of an experiment, break ties in mixed result experiments, and drill into specific models and samples.
The importance of model fairness and interpretability in AI systems (Francesca Lazzeri, PhD)
Machine learning model fairness and interpretability are critical for data scientists, researchers and developers to explain their models and understand the value and accuracy of their findings. Interpretability is also important to debug machine learning models and make informed decisions about how to improve them.
In this session, Francesca will go over a few methods and tools that enable you to "unpack" machine learning models, gain insights into how and why they produce specific results, assess your AI systems' fairness, and mitigate any observed fairness issues.
Using open-source fairness and interpretability packages, attendees will learn how to:
- Explain model prediction by generating feature importance values for the entire model and/or individual data points.
- Achieve model interpretability on real-world datasets at scale, during training and inference.
- Use an interactive visualization dashboard to discover patterns in data and explanations at training time.
- Leverage additional interactive visualizations to assess which groups of users might be negatively impacted by a model and compare multiple models in terms of their fairness and performance.
Machine learning has become a must for improving insight, quality, and time to market. But it has also been called the 'high-interest credit card of technical debt', with challenges in managing both how it is applied and how its results are consumed.
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A... (Kishor Datta Gupta)
This presentation discusses robust filtering schemes to defend machine learning systems against adversarial attacks. It outlines three main defense schemes: input filtering, output filtering, and an end-to-end protection scheme. The input filtering scheme uses a genetic algorithm to determine an optimal sequence of filters to detect adversarial examples. The output filtering scheme formulates the detection of adversarial inputs as an outlier detection problem. The end-to-end scheme integrates components for adversarial detection, filtering, and classification into a unified framework for protection. Experimental results show the proposed approaches can effectively detect various adversarial attack types while maintaining high classification accuracy.
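The output-filtering idea above, detecting adversarial inputs as outliers, can be illustrated with the simplest possible detector: a z-score test against statistics gathered on clean inputs. The scores and threshold below are invented; the presentation's actual scheme is more elaborate.

```python
import statistics

def fit_outlier_filter(clean_scores, k=3.0):
    # Flag any score more than k standard deviations from the mean
    # of scores observed on clean (non-adversarial) inputs.
    mu = statistics.mean(clean_scores)
    sigma = statistics.stdev(clean_scores)
    def is_adversarial(score):
        return abs(score - mu) > k * sigma
    return is_adversarial

# Hypothetical model confidences on known-clean inputs.
clean = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.90, 0.91]
flag = fit_outlier_filter(clean)
# a wildly out-of-distribution confidence gets flagged; typical ones pass
```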
Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint L... (MLAI2)
While existing federated learning approaches mostly require that clients have fully-labeled data to train on, in realistic settings, data obtained at the client-side often comes without any accompanying labels. Such deficiency of labels may result from either high labeling cost, or difficulty of annotation due to the requirement of expert knowledge. Thus the private data at each client may be either partly labeled, or completely unlabeled with labeled data being available only at the server, which leads us to a new practical federated learning problem, namely Federated Semi-Supervised Learning (FSSL). In this work, we study two essential scenarios of FSSL based on the location of the labeled data. The first scenario considers a conventional case where clients have both labeled and unlabeled data (labels-at-client), and the second scenario considers a more challenging case, where the labeled data is only available at the server (labels-at-server). We then propose a novel method to tackle the problems, which we refer to as Federated Matching (FedMatch). FedMatch improves upon naive combinations of federated learning and semi-supervised learning approaches with a new inter-client consistency loss and decomposition of the parameters for disjoint learning on labeled and unlabeled data. Through extensive experimental validation of our method in the two different scenarios, we show that our method outperforms both local semi-supervised learning and baselines which naively combine federated learning with semi-supervised learning.
This document compares regression, support vector machines (SVM), and deep learning. It defines each as a supervised learning model that maps input data to output data using labeled training data. Regression uses weighted parameters and least squares error, SVM uses quadratic programming with constraints, and deep learning uses layered connection weights between nodes. The document provides recommendations for when each model is better suited, such as SVM for high-dimensional data, regression for small datasets or numerical prediction, and deep learning for tasks like image colorization or machine translation.
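The contrast between regression's least-squares objective and SVM's quadratic program is easiest to see in one dimension, where least squares has a closed form; a minimal sketch:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b in one dimension:
    # a = cov(x, y) / var(x), b = mean(y) - a * mean(x).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# recovers y = 2x + 1 exactly: a == 2.0, b == 1.0
```

No iterative optimization or constraint solving is needed, which is part of why regression suits small datasets and numerical prediction.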
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ... (Francesca Lazzeri, PhD)
Machine learning model fairness and interpretability are critical for data scientists, researchers, and developers to explain their models and understand the value and accuracy of their findings. Interpretability is also important to debug machine learning models and make informed decisions about how to improve them. In this session, Francesca will go over a few methods and tools that enable you to "unpack" machine learning models, gain insights into how and why they produce specific results, assess your AI systems' fairness, and mitigate any observed fairness issues.
Using open source fairness and interpretability packages, attendees will learn how to:
- Explain model prediction by generating feature importance values for the entire model and/or individual datapoints.
- Achieve model interpretability on real-world datasets at scale, during training and inference.
- Use an interactive visualization dashboard to discover patterns in data and explanations at training time.
- Leverage additional interactive visualizations to assess which groups of users might be negatively impacted by a model and compare multiple models in terms of their fairness and performance.
The document discusses explainability and bias in machine learning/AI models. It covers several topics:
1. Why explainability of models is important, including for laypeople using models and potential legal needs for explanations of decisions.
2. Methods for explainability including using interpretable models directly and post-hoc explainability methods like LIME and SHAP which provide feature attributions.
3. Issues with bias in machine learning models and different definitions of fairness. It also discusses techniques for measuring and mitigating bias, such as reweighting data or using adversarial learning.
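The feature-attribution idea behind post-hoc methods like LIME and SHAP can be illustrated with the simplest occlusion variant. This is not LIME or SHAP themselves, and the model below is a hypothetical linear scorer chosen so the attributions are easy to check by hand.

```python
def feature_attributions(model, x, baseline):
    # Occlusion-style attribution: replace one feature at a time with a
    # baseline value and record the change in the model's score.
    base_score = model(x)
    attributions = []
    for i in range(len(x)):
        occluded = x[:i] + [baseline[i]] + x[i + 1:]
        attributions.append(base_score - model(occluded))
    return attributions

# Hypothetical linear scorer: attribution recovers each weighted input.
model = lambda x: 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]
attr = feature_attributions(model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# attr == [2.0, -1.0, 0.5]
```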
The document discusses various machine learning concepts like model overfitting, underfitting, missing values, stratification, feature selection, and incremental model building. It also discusses techniques for dealing with overfitting and underfitting like adding regularization. Feature engineering techniques like feature selection and creation are important preprocessing steps. Evaluation metrics like precision, recall, F1 score and NDCG are discussed for classification and ranking problems. The document emphasizes the importance of feature engineering and proper model evaluation.
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesQuestionPro
This document provides an overview of MaxDiff, a technique for evaluating preferences that asks respondents to choose best and worst options from sets. It notes limitations of traditional rating scales like scale bias and lack of constraints. MaxDiff forces trade-offs and provides richer data than ratings. Questions present lists and ask for most/least important. Results can be analyzed simply by counting choices or more advanced techniques can provide respondent-level utilities. The document provides examples and tips for effective MaxDiff surveys.
This document discusses key concepts in sampling design and procedures. It covers reasons for sampling such as pragmatic and cost reasons. It also discusses defining the target population and sampling frame. The document contrasts probability sampling techniques like simple random sampling, systematic sampling, and stratified sampling with non-probability techniques like convenience sampling and snowball sampling. It discusses factors to consider in determining sample size such as variance, desired confidence level and interval. Sample size formulas for estimating means and proportions are also provided.
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019anant90
Machine Learning is itself just another tool, susceptible to adversarial attacks. These can have huge implications, especially in a world with self-driving cars and other automation. In this talk, we will look at recent developments in the world of adversarial attacks on the A.I. systems, and how far we have come in mitigating these attacks.
Connections b/w active learning and model extractionAnmol Dwivedi
Codes on https://github.com/anmold-07/Model-Extraction-with-RL
https://www.usenix.org/conference/usenixsecurity20/presentation/chandrasekaran
This paper formalizes model extraction and discusses possible defense strategies by drawing parallels between model extraction and an established area of active learning. In particular, the authors show that recent advancements in the active learning domain can be used to implement powerful model extraction attacks and investigate possible defense strategies.
Correlation, causation and incrementally recommendation problems at netflix ...Roelof van Zwol
Within Netflix, personalization is a key differentiator, helping members to quickly discover new content that matches their taste. Done well, it creates an immersive user experience, however when the recommendation is out of tune, it is immediately noticed by our members. During this presentation I will cover some of the personalization and recommendation tasks that jointly define the Netflix user experience that entertains more that 130M members world wide. In particular, I will focus on several of the algorithmic challenges related to the launch of new Netflix originals in the service, and go over concepts such as causality, incrementality and explore-exploit strategies.
The research presented in this talk represents the collaborative efforts of a team of research scientists and engineers at Netflix on our journey to create best in class user experiences.
- Machine learning models are negatively impacted by noisy or inconsistent labels in training data. This is a challenge for tasks like bug severity classification where labels can be subjective.
- A new evaluation metric called Krippendorff's alpha is proposed to measure agreement between labels while accounting for inconsistencies. It is shown to better reflect performance than accuracy when labels are inconsistent.
- Making "big data thick" by improving quality is an important future direction, but challenging at scale. Lightweight methods are needed to reduce noise without extensive manual labelling. Performance measures also need to account for noise inherent in some real-world problems.
Dr Murari Mandal from NUS presented as part of 3 days OpenPOWER Industry summit about Robustness in Deep learning where he talked about AI Breakthroughs , Performance improments in AI models , Adversarial attacks , Attacks on semantic segmentation , Attacs on object detector , Defending Against adversarial attacks and many other areas.
AI and ML Skills for the Testing World TutorialTariq King
Software continues to revolutionize the world, impacting nearly every aspect of our work, family, and personal life. Artificial intelligence (AI) and machine learning (ML) are playing key roles in this revolution through improvements in search results, recommendations, forecasts, and other predictions. AI and ML technologies are being used in platforms for digital assistants, home entertainment, medical diagnosis, customer support, and autonomous vehicles. Testing practitioners are recognizing the potential for advances in AI and ML to be leveraged for automated testing—an area that still requires significant manual effort. Tariq King and Jason Arbon introduce you to the world of AI for software testing. Learn the fundamentals behind autonomous and intelligent agents, ML approaches including Bayesian networks, decision tree learning, neural networks, and reinforcement learning. Discover how to apply these techniques to common testing tasks such as identifying testable features, generating test flows, and detecting erroneous states.
Machine Learning Experimentation at Sift ScienceSift Science
Alex Paino, a Software Engineer at Sift Science, discusses how we use machine learning to prevent several types of abusive user behavior for thousands of customers. Measuring the accuracy of the thousands of classifiers used in a manner that correctly represents the value provided to customers is a huge challenge for us. Alex describes how we think about this problem and what we have done to address it. This includes an overview of the various tools and methodologies we employ that allow us to quickly summarize the results of an experiment, break ties in mixed result experiments, and drill into specific models and samples.
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
Machine learning model fairness and interpretability are critical for data scientists, researchers and developers to explain their models and understand the value and accuracy of their findings. Interpretability is also important to debug machine learning models and make informed decisions about how to improve them.
In this session, Francesca will go over a few methods and tools that enable you to "unpack” machine learning models, gain insights into how and why they produce specific results, assess your AI systems fairness and mitigate any observed fairness issues.
Using open-source fairness and interpretability packages, attendees will learn how to:
- Explain model prediction by generating feature importance values for the entire model and/or individual data points.
- Achieve model interpretability on real-world datasets at scale, during training and inference.
- Use an interactive visualization dashboard to discover patterns in data and explanations at training time.
- Leverage additional interactive visualizations to assess which groups of users might be negatively impacted by a model and compare multiple models in terms of their fairness and performance.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Kishor Datta Gupta
This presentation discusses robust filtering schemes to defend machine learning systems against adversarial attacks. It outlines three main defense schemes: input filtering, output filtering, and an end-to-end protection scheme. The input filtering scheme uses a genetic algorithm to determine an optimal sequence of filters to detect adversarial examples. The output filtering scheme formulates the detection of adversarial inputs as an outlier detection problem. The end-to-end scheme integrates components for adversarial detection, filtering, and classification into a unified framework for protection. Experimental results show the proposed approaches can effectively detect various adversarial attack types while maintaining high classification accuracy.
Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint L...MLAI2
While existing federated learning approaches mostly require that clients have fully-labeled data to train on, in realistic settings, data obtained at the client-side often comes without any accompanying labels. Such deficiency of labels may result from either high labeling cost, or difficulty of annotation due to the requirement of expert knowledge. Thus the private data at each client may be either partly labeled, or completely unlabeled with labeled data being available only at the server, which leads us to a new practical federated learning problem, namely Federated Semi-Supervised Learning (FSSL). In this work, we study two essential scenarios of FSSL based on the location of the labeled data. The first scenario considers a conventional case where clients have both labeled and unlabeled data (labels-at-client), and the second scenario considers a more challenging case, where the labeled data is only available at the server (labels-at-server). We then propose a novel method to tackle the problems, which we refer to as Federated Matching (FedMatch). FedMatch improves upon naive combinations of federated learning and semi-supervised learning approaches with a new inter-client consistency loss and decomposition of the parameters for disjoint learning on labeled and unlabeled data. Through extensive experimental validation of our method in the two different scenarios, we show that our method outperforms both local semi-supervised learning and baselines which naively combine federated learning with semi-supervised learning.
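As a rough illustration of why inter-client consistency helps, the sketch below accepts a pseudo-label for an unlabeled point only when helper clients' models agree with a confident local prediction. This is a crude NumPy stand-in, not the FedMatch algorithm (which uses a consistency loss and parameter decomposition); the function name and probabilities are made up:

```python
import numpy as np

def agreement_pseudo_labels(prob_local, prob_helpers, conf=0.8):
    """Keep a pseudo-label only when the local model is confident AND every
    helper client's model predicts the same class (consistency filter)."""
    local_label = prob_local.argmax(axis=1)
    confident = prob_local.max(axis=1) >= conf
    agree = np.ones(len(local_label), dtype=bool)
    for p in prob_helpers:
        agree &= (p.argmax(axis=1) == local_label)
    mask = confident & agree               # which pseudo-labels to train on
    return local_label, mask

# Two unlabeled points: the helpers agree on the first, disagree on the second.
p_local = np.array([[0.9, 0.1], [0.85, 0.15]])
helpers = [np.array([[0.8, 0.2], [0.3, 0.7]]),
           np.array([[0.7, 0.3], [0.9, 0.1]])]
labels, mask = agreement_pseudo_labels(p_local, helpers)
# Only the first point keeps its pseudo-label.
```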
This document compares regression, support vector machines (SVM), and deep learning. It defines each as a supervised learning model that maps input data to output data using labeled training data. Regression uses weighted parameters and least squares error, SVM uses quadratic programming with constraints, and deep learning uses layered connection weights between nodes. The document provides recommendations for when each model is better suited, such as SVM for high-dimensional data, regression for small datasets or numerical prediction, and deep learning for tasks like image colorization or machine translation.
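The regression description — weighted parameters fit by least-squares error — can be made concrete. A minimal ordinary-least-squares fit via the normal equations, on made-up data:

```python
import numpy as np

def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations: w = (X^T X)^-1 X^T y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return w[:-1], w[-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + 0.01 * rng.normal(size=200)
w, b = fit_least_squares(X, y)
# Recovers roughly w = [3, -2], b = 0.5
```

This closed form is what makes regression attractive for small numerical-prediction problems, as the comparison notes; SVMs and deep networks trade it for quadratic programming and gradient descent respectively.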
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...Francesca Lazzeri, PhD
The document discusses explainability and bias in machine learning/AI models. It covers several topics:
1. Why model explainability is important, both for laypeople who use models and for potential legal requirements to explain decisions.
2. Methods for explainability including using interpretable models directly and post-hoc explainability methods like LIME and SHAP which provide feature attributions.
3. Issues with bias in machine learning models and different definitions of fairness. It also discusses techniques for measuring and mitigating bias, such as reweighting data or using adversarial learning.
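One of the mitigation techniques mentioned, reweighting the data, can be sketched as Kamiran-Calders-style reweighing: each (group, label) cell is weighted so that group membership and label become statistically independent under the weighted distribution. The function and toy data below are illustrative, not from the document:

```python
import numpy as np

def reweighing_weights(group, label):
    """Weight for each sample in cell (g, l): P(g) * P(l) / P(g, l).
    Under these weights the group attribute is independent of the label."""
    n = len(group)
    w = np.empty(n)
    for g in np.unique(group):
        for l in np.unique(label):
            cell = (group == g) & (label == l)
            p_cell = cell.mean()
            if p_cell > 0:
                w[cell] = (group == g).mean() * (label == l).mean() / p_cell
    return w

# Group 0 is favoured (3 of 4 positive); group 1 is not (1 of 4 positive).
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
label = np.array([1, 1, 1, 0, 1, 0, 0, 0])
w = reweighing_weights(group, label)

rates = []                       # weighted positive rate per group
for g in (0, 1):
    sel = group == g
    rates.append(np.sum(w[sel] * label[sel]) / np.sum(w[sel]))
# Both weighted positive rates come out equal.
```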
1. From Adversarial Learning to Robust and Scalable Learning
Ph.D. Presentation
Han Xiao (I20), xiaoh@in.tum.de
Advisor: Prof. Dr. Claudia Eckert
2. Introduction Adversarial Learning Robust Learning Scalable Learning
Motivation
Machine learning algorithms in real-world applications are vulnerable to adversaries.
• Spam filtering: a spammer may disguise the spam by adding images and "good words" to cheat the filter (explorative attack).
• Recommendation systems: spam users may give false ratings on tail items, leading to a biased recommendation system (causative attack).
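To make the explorative attack concrete: an attacker who can only query a deployed classifier can locate its decision boundary by binary search, then submit spam modified just enough to cross to the benign side. A minimal sketch, with a hypothetical linear filter standing in for a real one (not the thesis's convex-inducing-classifier construction):

```python
import numpy as np

def find_boundary(classify, x_neg, x_pos, tol=1e-6):
    """Query-only attack: binary-search the segment between a known benign
    point (classified 0) and a known spam point (classified 1) to locate
    the decision boundary."""
    lo, hi = x_neg, x_pos
    while np.linalg.norm(hi - lo) > tol:
        mid = (lo + hi) / 2
        if classify(mid) == 1:
            hi = mid
        else:
            lo = mid
    return lo                       # just on the benign side of the boundary

# Hidden linear filter: flags x as spam when x[0] + x[1] > 1.
classify = lambda x: int(x[0] + x[1] > 1.0)
evading = find_boundary(classify, np.array([0.0, 0.0]), np.array([2.0, 2.0]))
# evading sits near (0.5, 0.5): classified benign, arbitrarily close to spam.
```

Each query halves the search interval, so only a few dozen queries suffice — which is exactly why query access alone already constitutes a practical threat.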
4. Why shall we care?
"Know your enemies and yourself, you will not be imperiled in a hundred battles."
Traditional machine learning and data mining rarely focus on adversarial settings, yet accounting for adversaries enables:
• Robust anti-virus software
• High-quality recommendation systems
• Spam-free social network services
• Cost-effective crowdsourcing systems
5. Related work and research idea
Each of four related areas rests on an assumption that breaks down in an adversarial setting:
• Multi-labeler learning assumes data are labeled by multiple labelers; here, some labelers are adversaries.
• Semi-supervised learning assumes data are partially labeled; here, even those limited labels cannot be fully trusted.
• Active learning assumes an oracle provides labels; here, the oracle can provide wrong labels.
• Outlier detection assumes noisy data points do not fit the distribution; here, the noise does not follow any predefined distribution.
6. Roadmap of my dissertation
• Adversarial learning: how can adversaries exploit the vulnerabilities of learning algorithms? Contributions: showed that convex-inducing classifiers are vulnerable to explorative attacks, and that SVMs are vulnerable to causative label-flip attacks.
• Robust learning: how to learn from unfaithful training data? Contribution: developed a hierarchical Gaussian process model and a graph-based model for multi-labeler learning.
• Robust and scalable learning: are current algorithms fast enough for online learning, and how to learn from a noisy data stream for real-time applications? Contributions: developed an approximate Gaussian process for online regression, and an online algorithm that learns from partially labeled data in a client-server setting.
18. Learning from multiple yet unreliable labelers
• Each instance is labeled by several labelers.
• A labeler can be genuine or an adversary.
• The ground-truth label is unknown.
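The core difficulty on this slide — estimating an unknown ground truth while downweighting adversarial labelers — can be sketched with a simple iterative weighted vote. This is a crude stand-in for the thesis's probabilistic models, and the labeler data are made up:

```python
import numpy as np

def weighted_vote(labels, n_iter=10):
    """labels: (n_labelers, n_instances) in {0, 1}.  Alternate between
    estimating each labeler's reliability (agreement with the consensus)
    and recomputing a reliability-weighted consensus."""
    consensus = (labels.mean(axis=0) > 0.5).astype(int)   # majority-vote init
    for _ in range(n_iter):
        reliability = (labels == consensus).mean(axis=1)
        w = np.clip(reliability - 0.5, 1e-3, None)        # adversaries get ~zero weight
        score = w @ labels / w.sum()
        consensus = (score > 0.5).astype(int)
    return consensus, reliability

truth = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
good1, good3 = truth.copy(), truth.copy()
good2 = truth.copy(); good2[0] = 1 - good2[0]    # one honest mistake
adversary = 1 - truth                            # flips every label
labels = np.vstack([good1, good2, good3, adversary])
consensus, reliability = weighted_vote(labels)
# The consensus recovers the truth and the adversary's reliability drops to ~0.
```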
19. Latent space model for connecting the input space and the label space
20. Gaussian process for modeling the joint probability
• A latent-space GP model and a labeler GP model, combined via maximum a posteriori estimation.
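The slide's specific models are not reproduced here, but standard Gaussian-process regression, on which they build, has a closed-form posterior mean. A minimal sketch with an RBF kernel (the lengthscale, noise level, and data are arbitrary):

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """RBF (squared-exponential) kernel matrix between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior_mean(X, y, X_star, noise=0.1):
    """Standard GP regression mean: K_* (K + sigma^2 I)^-1 y."""
    K = rbf(X, X) + noise ** 2 * np.eye(len(X))
    K_star = rbf(X_star, X)
    return K_star @ np.linalg.solve(K, y)

X = np.linspace(0, 2 * np.pi, 30)[:, None]
y = np.sin(X[:, 0])
pred = gp_posterior_mean(X, y, np.array([[np.pi / 2]]))
# pred is close to sin(pi/2) = 1
```

The thesis's MAP estimation extends this by also inferring the latent "true" function behind multiple noisy labelers rather than fitting a single observed signal.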
21. Synthetic examples: recovering the ground truth from the responses of four observers
22. Synthetic examples: recovering the ground truth from the responses of four observers (continued)
23. A graph-based approach for the multi-labeler problem
Problem setting:
• Not all instances are labeled.
• A labeler labels only a subset of the instances.
• Some labelers are adversaries.
Goal:
• Compute the label and the uncertainty of each instance.
• Compute the confidence of each labeler.
Idea (joint smoothness on graphs):
• Instances that are similar in the item feature space should have similar labels.
• Labelers that are similar in the labeler feature space should have similar confidence.
24. Joint smoothness on the labeler graph and the instance graph
(Figure: a labeler similarity graph and an item similarity graph.)
• Instances that are close together should have similar predicted labels, unless their uncertainties are large.
• Predicted labels should be close to their assigned labels, unless the instance is uncertain or the corresponding labelers are not confident.
• Labelers that are close together should have similar confidence.
• The uncertainty of an instance/labeler should not be too large or too close to zero.
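A generic version of such a smoothness objective minimizes the graph Laplacian quadratic form plus a fit term on labeled nodes. The sketch below omits the per-instance uncertainties and labeler confidences from the slide; `mu` and the toy chain graph are illustrative:

```python
import numpy as np

def smooth_labels(W, y, labeled_mask, mu=1.0):
    """Minimise  f^T L f + mu * ||f - y||^2 over labeled nodes,
    where L = D - W is the graph Laplacian.  Closed form:
    f = (L + mu*M)^{-1} (mu * M y), with M = diag(labeled_mask)."""
    L = np.diag(W.sum(axis=1)) - W
    M = np.diag(labeled_mask.astype(float))
    return np.linalg.solve(L + mu * M, mu * M @ y)

# Chain graph 0-1-2-3-4 with labels only at the two ends.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
y = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
mask = np.array([True, False, False, False, True])
f = smooth_labels(W, y, mask, mu=10.0)
# f decreases smoothly from about +1 to about -1 along the chain.
```

The slide's model can be read as two coupled instances of this objective: one over the instance graph for labels, one over the labeler graph for confidences.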
30. Proposed method achieves better performance in less time
(Figure: accuracy, as root mean square error, and efficiency, as training and prediction time.)
31. Scalable robust learning in client-server settings
Goal: learn a good model under limited bandwidth, with the unlabeled data held by clients.
• Homogeneous clients: which instance should I query?
• Heterogeneous clients: which instance should I query, and whom should I ask for labeling?
32. Subset selection under a given budget
(Homogeneous) The client uploads only crucial data according to the selection policy.
Key steps: unlabeled data → candidate pool → selection policy → upload selections → two-learner model on the server → update selection policy.
Purpose:
• Select a small set of data from the candidate pool for uploading.
Requirements:
• The uploaded data should improve classification performance on the server.
• The selection procedure should be lightweight for the client.
• The selection policy should be lightweight for the network.
Method:
• Select by optimizing a function of two criteria: the utility of the instance (w.r.t. SCW) and its redundancy w.r.t. the candidate pool.
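A generic sketch of the utility-minus-redundancy idea (not the thesis's actual criterion, which measures utility w.r.t. SCW): greedily pick the instance whose utility, minus a penalty for similarity to already-selected points, is highest. The similarity kernel, utilities, and data are made up:

```python
import numpy as np

def select_subset(X, utility, k, lam=1.0):
    """Greedy selection: repeatedly pick the instance with the best
    utility minus lam * (similarity to the closest already-selected point)."""
    selected = []
    sim = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # RBF-style similarity
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            redundancy = max((sim(X[i], X[j]) for j in selected), default=0.0)
            score = utility[i] - lam * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Two tight clusters.  Utility alone would pick both points of cluster A,
# but the redundancy term pushes the second pick to cluster B.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
utility = np.array([1.0, 0.9, 0.8, 0.7])
picked = select_subset(X, utility, k=2)
# picked == [0, 2]
```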
33. Server employs a two-learner model to learn from the client's unlabeled data
Purpose:
• Incrementally learn a binary classifier from unlabeled data.
Requirements:
• Leverage neighbor information to exploit the unlabeled data.
• Learn in an online fashion.
• Be efficient enough to handle large volumes of data.
• Be easily parameterized as a selection policy.
Method:
• A two-learner structure combining a harmonic solution (HS) and soft confidence-weighted (SCW) learning.
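Of the two learners, the harmonic solution has a standard closed form (in the style of Zhu et al.'s harmonic function for semi-supervised learning): labeled nodes are clamped, and each unlabeled node takes the weighted average of its neighbors. A minimal batch sketch on a toy chain graph — the thesis's version is online, which this does not show:

```python
import numpy as np

def harmonic_solution(W, y_l, labeled):
    """Clamp labeled nodes to y_l and solve for the unlabeled nodes:
    f_u = (D_uu - W_uu)^{-1} W_ul y_l, i.e. each unlabeled node equals
    the weighted average of its neighbours."""
    labeled = np.asarray(labeled)
    unlabeled = np.setdiff1d(np.arange(W.shape[0]), labeled)
    L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian
    f_u = np.linalg.solve(L[np.ix_(unlabeled, unlabeled)],
                          W[np.ix_(unlabeled, labeled)] @ y_l)
    f = np.empty(W.shape[0])
    f[labeled], f[unlabeled] = y_l, f_u
    return f

# Chain 0-1-2-3 with node 0 labeled +1 and node 3 labeled -1.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
f = harmonic_solution(W, np.array([1.0, -1.0]), labeled=[0, 3])
# f = [1, 1/3, -1/3, -1]
```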
34. Proposed selection strategy reduces communication cost and gives high accuracy

  Selection policy   Labeling rate      Sampling rate            Accuracy
  on client          (human effort)     (communication cost)     (avg. over 10 data sets)
  Full               100%               20%                      92.16%
  All                2%                 100%                     86.32%
  Rand               2%                 20%                      86.38%
  Proposed           2%                 20%                      87.08%
35. Heterogeneous clients: ask the most confident client to label the most uncertain instance
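The routing rule on this slide reduces to two argmax operations; the confidence and uncertainty scores below are made up for illustration:

```python
import numpy as np

def pick_query(client_confidence, instance_uncertainty):
    """Route the most uncertain instance to the most confident client."""
    return int(np.argmax(client_confidence)), int(np.argmax(instance_uncertainty))

conf = np.array([0.6, 0.9, 0.7])        # per-client confidence estimates
unc = np.array([0.1, 0.8, 0.3, 0.5])    # per-instance uncertainty estimates
client, instance = pick_query(conf, unc)
# client 1 is asked to label instance 1
```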
36. From adversarial learning to robust and scalable learning
(Recap of the dissertation roadmap from slide 6: the topics, problems, and contributions in adversarial learning, robust learning, and robust and scalable learning.)
37. Conclusion
• Traditional machine learning algorithms are vulnerable to attacks.
• Though the labelers may include adversaries, robust learning can still be achieved.
• Multi-labeler learning (crowdsourcing) could find more and more applications in the next couple of years.