© Hee-Meng Foo, 2020
Machine Learning Testing:
Survey, Landscapes and
Horizons
The Cliff Notes
Hee-Meng Foo, Apr 2020
What this talk is about
● Essentially the “Cliff Notes” of this paper (37 pages long):
“Machine Learning Testing: Survey, Landscapes and Horizons”, Jie M. Zhang,
Mark Harman, Lei Ma, Yan Liu, arXiv:1906.10742v2 [cs.LG], 21 Dec 2019
● Please don’t shoot the messenger
● There are ~ 300 citations, I have not read most of them
● “[XX]” (e.g. [23]) in slides refers to a citation in the paper
● Slides will reference sections of the paper, e.g. “3. Machine Learning Testing”
Sections Skipped
● 2. Preliminaries of Machine Learning
● 4. Paper Collection and Review Schema
● 8. Application Scenarios
Structure of Talk
● The Lay of the Land
● The “Heavy” Parts (or Deep Dive)
● ML Testing Challenges and Opportunities
Machine Learning Testing Is New But Growing
Source: the survey paper, 1. Introduction
Timeline of Research
Source: the survey paper, 9. Analysis of Literature Review
Distribution of ML vs DNN Testing Research
Source: the survey paper, 9. Analysis of Literature Review
Distribution of Supervised, Unsupervised &
Reinforcement Learning Testing Research
Source: the survey paper, 9. Analysis of Literature Review
Research Distribution Among Testing Properties
Source: the survey paper, 9. Analysis of Literature Review
Number of Datasets vs Number of Papers
Source: the survey paper, 9. Analysis of Literature Review
Dataset Categorization
Source: the survey paper, 9. Analysis of Literature Review
Table 6. NLP
Table 7. Decision Making
Table 8. Others
Preliminaries
● The Oracle Problem
○ Testing involves examining the behavior of a system in order to discover potential faults. Given an input for a
system, the challenge of distinguishing the corresponding desired, correct behavior from potentially incorrect
behavior is called the “test oracle problem”.
○ See https://www.youtube.com/watch?v=cquyBmIh0e4
● Metamorphic Testing
○ TSP example - e.g. a relabeling of the cities must not change the optimal tour cost (one such relation is sketched after this slide)
● Differential Testing
○ Compare results from ≥ 2 implementations of the same functionality, e.g. compilers. DeepXplore [1]
● Adversarial Testing
○ Use perturbed data to test robustness
● MC/DC coverage
○ What is the minimum set of predicate combinations needed to cover all meaningful combinations
○ E.g. A ^ B ^ C v D (the full truth table has 2^4 rows)
○ See https://www.youtube.com/watch?v=HzmnCVaICQ4
Source: http://algorist.com/problems/Traveling_Salesman_Problem.html
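To make the metamorphic-testing idea concrete, here is a minimal sketch (my illustration, not from the paper): for a brute-force TSP solver, relabeling the cities is a metamorphic relation, because the optimal tour cost must be invariant under any permutation of the city labels. All names here are hypothetical.

```python
# Sketch of metamorphic testing on a toy TSP solver: permuting the city
# labels must not change the optimal tour cost, so no ground-truth oracle
# ("what is the optimal tour?") is needed.
import itertools
import random

def tour_cost(dist, tour):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def solve_tsp(dist):
    # Brute-force "system under test": returns a minimum-cost tour.
    return min(itertools.permutations(range(len(dist))),
               key=lambda t: tour_cost(dist, t))

def metamorphic_test(dist):
    n = len(dist)
    perm = random.sample(range(n), n)  # follow-up input: relabel the cities
    permuted = [[dist[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    # Metamorphic relation: optimal cost is invariant under relabeling.
    assert abs(tour_cost(dist, solve_tsp(dist))
               - tour_cost(permuted, solve_tsp(permuted))) < 1e-9

random.seed(0)
d = [[0.0 if i == j else random.uniform(1, 10) for j in range(6)] for i in range(6)]
d = [[max(d[i][j], d[j][i]) for j in range(6)] for i in range(6)]  # make symmetric
metamorphic_test(d)
print("metamorphic relation holds")
```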
What Makes ML Testing Hard
● Data driven
○ Decision logic is derived via the training procedure
○ Model behavior changes over time with new data
○ Testing must also factor in the data
● ML Oracle Problem
○ E.g. k-means clustering: how do you know this is the best result?
○ Often addressed by designing metamorphic relations
● ML is harder to modularize and hence to test in isolation
○ “Low accuracy/precision of a ML model … arising from a combination of behaviors of different
components such as training data, the learning program, and even the learning
framework/library”
● Errors may propagate and become amplified or suppressed
Source: the survey paper, 1. Introduction
Online vs Offline Testing
Source: the survey paper, 3.2 ML Testing Workflow
Online testing addresses the gap that offline testing relies on historical test data, which may not represent future data
Approaches to online testing:
● A/B testing
● MAB (Multi-Armed Bandit) approach (see the sketch after this slide)
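A minimal sketch of the MAB approach to online testing (my illustration with made-up reward numbers, not the paper's method): each deployed model variant is an arm, and an epsilon-greedy policy routes most traffic to the variant that performs best so far, unlike a fixed-split A/B test.

```python
# Sketch of online testing via a multi-armed bandit (epsilon-greedy).
# Each "arm" is a deployed model variant; reward = 1 if the prediction
# was judged correct (e.g. accepted by the user). Numbers are synthetic.
import random

def epsilon_greedy(true_quality, steps=10_000, eps=0.1):
    counts = [0] * len(true_quality)    # traffic routed to each variant
    values = [0.0] * len(true_quality)  # running mean reward per variant
    for _ in range(steps):
        if random.random() < eps:                     # explore
            arm = random.randrange(len(true_quality))
        else:                                         # exploit best-so-far
            arm = max(range(len(true_quality)), key=values.__getitem__)
        reward = 1.0 if random.random() < true_quality[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

random.seed(0)
counts, values = epsilon_greedy([0.52, 0.55, 0.61])  # hidden quality of 3 models
print(counts)  # most traffic ends up on the best variant (index 2)
```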
Idealized Workflow of ML Testing
Source: the survey paper, 3.2 ML Testing Workflow
ML Components (and where bugs reside)
Source: the survey paper, 3.3 ML Testing Components
ML Components (and where bugs reside)
● Bugs in Data
○ Completeness
○ Representativeness of future data
○ Noise
○ Bias
○ Data poisoning
● Bugs in the Learning Program
○ Algorithm designed, chosen, or configured improperly
○ Typos
● Bugs in the ML Framework
○ e.g. Keras, scikit-learn, etc.
Source: the survey paper, 3.3 ML Testing Components
ML Testing Properties (Functional)
● Correctness
○ True Correctness vs Empirical Correctness
● Model Relevance
○ VC-dimension, Rademacher Complexity for Classification
○ See https://www.youtube.com/watch?v=gR9Q8pS03ZE
See survey paper, 3.4 ML Testing Properties
ML Testing Properties (Non-Functional)
● Robustness
○ I.e. how resilient the ML system's correctness is in the presence of perturbations
○ Local vs global robustness, i.e. robustness w.r.t. one test input vs all test inputs
● Security
○ The ML system's resilience against potential harm, danger or loss via manipulation of, or illegal access to, ML components, e.g. adversarial attacks, data poisoning
● Data Privacy
○ Current research focuses on providing privacy-preserving ML, not on detecting privacy violations
● Efficiency
○ I.e. training or prediction speed; also a small footprint for mobile
● Fairness
○ Various sources of bias. Protected characteristics vs protected attributes vs sensitive attributes
● Interpretability
○ 2 aspects: transparency (how the model works) and post-hoc explanations (other info derived from the model)
○ Good e-book on ML Interpretability [70]
See survey paper, 3.4 ML Testing Properties
Software Testing vs ML Testing
Source: the survey paper, 3.5 Software Testing vs ML Testing
Studies of ML Bugs
● Thung et al [159]
○ Looked at bug reports from Apache Mahout, Lucene, OpenNLP
○ 22.6% bugs due to incorrect implementation
○ 15.6% bugs were non-functional
○ 5.6% data bugs
● Zhang et al [160]
○ Looked at 175 Tensorflow bugs
○ 18.9% Tensorflow API misuse
○ 13.7% unaligned tensor
○ 21.7% incorrect model parameter or structure
● Banerjee et al [161]
○ Looked at bugs in AV systems
○ ML and decision control accounted for 64% of disengagements
See survey paper, 5.5 Bug Report Analysis
The “Heavy” Parts
● Test Input Generation
● Test Oracles
● Test Adequacy
● Test Prioritization and Reduction
● Debug & Repair
● General Testing Framework & Tools
● ML Properties to be tested
● ML Testing Components
Test Input Generation
● 5.1.1 Domain Specific Test Input Synthesis
○ 2 categories: adversarial (perturbed data, for robustness testing) & natural (testing the application scenario)
○ DeepXplore [1] generates “real world” test data for neuron coverage
○ DeepTest [76] - realistic image transforms for testing AV
■ Detected > 1000 erroneous behaviors in CNNs/RNNs [77]
○ GANs [79] - driving scene based test generation with different weather conditions
○ DeepBillboard [81] - generate real-world adversarial billboards for testing AV systems
○ Audio based DNN [82] - transformations tailored to audio inputs for testing
○ Image classification [83] - generates images and uses metamorphic relations for testing
○ Machine translation [86] - uses mutation of words to generate test inputs
Test Input Generation
● 5.1.2 Fuzz and Search based Test Input Generation
○ What is Fuzzing [9] - see https://www.youtube.com/watch?v=pcEy-4eZF6g
○ Search based test generation - uses metaheuristic search to guide fuzz process [17][87][88]
○ TensorFuzz [89] - hill-climbing approach to explore valid input space for Tensorflow graphs
○ DLFuzz [90] - builds on ideas from DeepXplore (neuron coverage) to generate adversarial examples
○ DeepHunter [91] - metamorphic transformation based coverage guided fuzzing technique
○ Feature guided test generation [93] - Using Scale-Invariant Feature Transform (SIFT) to identify
features that represent an image with a Gaussian mixture model, then transforms problem of
finding adversarial examples into a 2-player stochastic game
○ Evaluation of reinforcement learning with adversarial example generation [94]
○ Fuzzing and metamorphic testing to test LiDAR obstacle detection [95]
○ [97] test input for text classification + fuzzing that considers grammar
○ [98][99] - mutated sentences in Natural Language Inference (NLI) to generate test inputs for
robustness testing
○ [101] - test input generation to highlight the discriminatory nature of a model
○ [102] - framework for testing AV systems that utilizes fuzz-based + search-based test generation
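The coverage-guided fuzz loop shared by tools like TensorFuzz and DLFuzz can be sketched as below; this is a toy reconstruction of the general idea, not their code, and the "signature" function is a stand-in for real neuron coverage.

```python
# Sketch of coverage-guided fuzzing for an ML model: mutate inputs from a
# corpus and keep mutants that exercise new "coverage", here a toy
# activation-bucket signature standing in for neuron coverage.
import random

def model_signature(x):
    # Stand-in for the model under test: bucketise fake "activations".
    return tuple(int(v * 4) for v in x)

def mutate(x, scale=0.05):
    return [min(1.0, max(0.0, v + random.uniform(-scale, scale))) for v in x]

def fuzz(seed_inputs, iterations=5_000):
    corpus = list(seed_inputs)
    seen = {model_signature(x) for x in corpus}
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        sig = model_signature(candidate)
        if sig not in seen:       # new coverage: keep for further mutation
            seen.add(sig)
            corpus.append(candidate)
    return corpus

random.seed(0)
corpus = fuzz([[0.5] * 8])
print(len(corpus), "inputs retained by the coverage criterion")
```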
Test Input Generation
● 5.1.3 Symbolic Execution based Test Input Generation
○ What is symbolic execution [105] - see https://youtu.be/FlzroEd4pnw
○ Concolic testing (or DSE, also see video) - auto-gen test inputs to achieve high coverage
○ However for ML, we need to test combination of code + data
○ [7] - symbolic analysis and statistic approach to generate effective tests
○ [108] outlines 3 challenges applying symbolic execution to ML:
■ No explicit branching, highly non-linear, scalability issues due to complexity of ML models
○ DeepCheck [108] - creates 1-pixel and 2-pixel attacks that cause image classification to fail
○ DeepConcolic [111] - DSE for DNNs adopting MC/DC criteria for coverage
● 5.1.4 Synthetic Data to Test Learning Program
○ [112] - generated data with repeating/missing values or categorical data
○ [45] - synthetic training data that adhere to schema constraints to trigger hidden assumptions
○ [54] - synthetic data with known distributions to test overfitting
○ [113] - generating datasets with predictable characteristics to be used as pseudo oracles
Test Oracles
● 5.2.1 Metamorphic Relations as Test Oracles
○ Widely studied. Many metamorphic relations based on transformations of data that are
expected to yield unchanged or certain expected changes in predicted output
○ Coarse Grained Data Transformations
■ [115] - describes 6 transformations: additive, multiplicative, permutative, invertive, inclusive, exclusive
■ [116] - proposes 11 metamorphic relations for image classification
■ [117] - function level metamorphic relations (evaluation of 9 ML applications)
○ Fine Grained Data Transformations
■ [118] - proposed 5 types of metamorphic relations specific to certain models for supervised classifiers
■ [120] - discusses the differences in the metamorphic relations between SVM and DNNs
■ [54] - proposed Perturbed Model Validation (PMV) that combines metamorphic relations and data
mutation to detect overfitting
■ [123] - studied metamorphic relations of Naive Bayes, KNN
■ METTLE [125] - 6 types of metamorphic relations for unsupervised learners
■ [113][126] - discussed possibility of using different grained metamorphic relations to find problems in
SVM and DNNs
Test Oracles
● 5.2.1 Metamorphic Relations as Test Oracles
○ Metamorphic Relations Between Datasets
■ [127] [45] - studied metamorphic relations between training and new data
■ [45] - studied metamorphic relations among different datasets close in time
○ Frameworks to Apply Metamorphic Relations
■ Amsterdam [128] - framework to automate process of using metamorphic relations to detect ML bugs
■ Corduroy [117] - extends Java Modelling Language to let developers specify metamorphic properties
and generate test cases for ML testing
Test Oracles
● 5.2.2 Cross-Referencing as Test Oracles
○ Differential testing and N-version Programming
■ [133] 5 - 27% of test oracles for DNN libraries use differential testing
■ N-version Programming ie. generate multiple functionally equivalent programs to compare against
○ [135] used differential testing to discover 16 faults from 7 Naive Bayes implementations & 13
faults from 19 KNN implementations
○ CRADLE [48] - an approach to finding and localizing bugs in DNN frameworks (CNTK,
Tensorflow, Theano), 11 datasets (eg. ImageNet, MNIST), 30 pre-trained models
○ DeepXplore [1], DLFuzz [90] - used differential testing to find effective test inputs
○ [136] uses “mirror” programs instead of different implementations for diff testing
● 5.2.3 Measurement Metrics for Designing Test Oracles
○ [137] - robustness metric
○ [138] [139] [140] - fairness metric
○ [65] [141] - interpretability metric
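A minimal sketch of the differential-testing idea above (my illustration, with made-up data): two independently written k-NN implementations serve as pseudo oracles for each other, and any disagreement flags a candidate bug in one of them.

```python
# Sketch of cross-referencing / differential testing: two independent
# k-NN implementations are run on the same random inputs; a disagreement
# is a candidate bug to investigate.
import random

def knn_a(train, labels, x, k=3):
    # Implementation A: sort indices by squared distance, then vote.
    nearest = sorted(range(len(train)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

def knn_b(train, labels, x, k=3):
    # Implementation B: independently written code path for the same spec.
    dists = sorted((sum((a - b) ** 2 for a, b in zip(p, x)), y)
                   for p, y in zip(train, labels))
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

random.seed(0)
train = [[random.random(), random.random()] for _ in range(50)]
labels = [int(p[0] + p[1] > 1) for p in train]
disagreements = 0
for _ in range(200):
    x = [random.random(), random.random()]
    if knn_a(train, labels, x) != knn_b(train, labels, x):
        disagreements += 1  # candidate bug: inspect both implementations
print(f"{disagreements} disagreements out of 200 random test inputs")
```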
Test Adequacy
● 5.3.1 Test Coverage
○ Unlike traditional software, where the decision logic is in the code, code coverage is not as demanding a criterion for ML testing, since the decision logic is derived from training
○ Other proposed coverage techniques:
■ Neuron coverage
● DeepXplore [1] and [92]
■ MC/DC coverage variants
● [145] MC/DC inspired DNN test coverage
■ Layer level coverage
● [92] [148] [149] [147]
■ State level coverage
● For RNNs [82]
○ Limitations of coverage criteria research
■ Most focus on DNNs
■ [147] [150] talks about limitations of these approaches
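The neuron-coverage criterion above can be sketched as follows (a toy reconstruction of the DeepXplore-style metric on a hand-rolled two-layer ReLU net, not the tool's actual code):

```python
# Sketch of neuron coverage: a neuron counts as "covered" if its activation
# exceeds a threshold for at least one test input.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 32)), rng.normal(size=(32, 16))  # toy ReLU net

def activations(x):
    h1 = np.maximum(0, x @ W1)
    h2 = np.maximum(0, h1 @ W2)
    return np.concatenate([h1, h2])

def neuron_coverage(test_inputs, threshold=0.5):
    covered = np.zeros(32 + 16, dtype=bool)
    for x in test_inputs:
        covered |= activations(x) > threshold
    return covered.mean()

tests = rng.normal(size=(100, 10))
print(f"neuron coverage: {neuron_coverage(tests):.2%}")
```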
Test Adequacy
● 5.3.2 Mutation Testing (or rather Mutation Score)
○ Mutation testing, i.e. fault injection (see the sketch after this slide)
○ Mutation score = ratio of detected faults to all injected faults
○ DeepMutation [152] - mutate DNNs at source or model level
○ [153] propose 5 mutation operators for DNNs
● 5.3.3 Surprise Adequacy
○ See [127] for thorough explanation
○ ie. test data should be “sufficiently but not overly surprising” compared with training data
○ “Surprise” - measured using (a) KDE or (b) distance between neuron activation vectors
● 5.3.4 Rule Based Checking of Test Adequacy
○ [154] 28 test aspects to consider + scoring system used by Google. 4 types: (a) tests for
Model, (b) tests for ML infrastructure, (c) tests for ML data and (d) tests if ML system works
well over time
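A minimal sketch of a mutation score for a trained model, in the spirit of DeepMutation [152] but with a made-up mutation operator (perturbing a single weight of a toy linear model):

```python
# Sketch of mutation testing for an ML model: inject small faults into the
# trained weights and count how many mutants the test set "kills", i.e.
# detects via a changed prediction.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(10,))                    # a "trained" linear model (toy)
predict = lambda weights, X: (X @ weights > 0).astype(int)

X_test = rng.normal(size=(200, 10))
y_ref = predict(w, X_test)                    # reference behaviour

def mutation_score(n_mutants=50, noise=0.5):
    killed = 0
    for _ in range(n_mutants):
        mutant = w.copy()
        i = rng.integers(len(w))
        mutant[i] += rng.normal(scale=noise)  # mutation operator: perturb one weight
        if (predict(mutant, X_test) != y_ref).any():
            killed += 1
    return killed / n_mutants                 # detected faults / injected faults

print(f"mutation score: {mutation_score():.2f}")
```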
Test Prioritization and Reduction
● 5.4 Test Prioritization and Reduction
○ [155] used DNN metrics, e.g. cross-entropy, surprisal and Bayesian uncertainty, to prioritize test inputs (see the sketch after this slide)
○ [156] adversarial test input prioritization
○ [157] used sampling technique guided by neurons of last hidden layer of DNN
○ [158] test selection metrics based on “model confidence”
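The uncertainty-based prioritization idea can be sketched as below (my illustration with synthetic softmax outputs): rank test inputs by predictive entropy and run the most uncertain ones first, since the model is more likely to be wrong on them.

```python
# Sketch of test prioritization by predictive uncertainty: compute the
# entropy of each softmax output and sort tests from most to least uncertain.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[1.0] * 5, size=100)        # softmax outputs, 100 tests

entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # uncertainty per test input
priority_order = np.argsort(-entropy)                   # most uncertain first

print("first 10 test inputs to run:", priority_order[:10])
```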
Debug & Repair
● Debugging Approaches (From 5.6 Debug and Repair)
○ Data Resampling
■ Generated test data used as training data for re-training - DeepXplore [1] 3% improvement, DeepTest [76] 46% improvement
■ “Faulty neurons” [162] - identified neurons responsible for misclassification
○ Debugging Frameworks
■ Storm [163] - program transformation framework to generate smaller programs that can support
debugging
■ tfdbg [164] - a debugger for ML models built on Tensorflow with 3 components:
● Analyzer, NodeStepper, RunStepper
■ MISTIQUE [165] - system to capture, store and query model intermediaries for debugging
■ PALM [166] - tool that explains complex model as 2-part surrogate model
● Repair Approaches
○ Fix Understanding
■ [167] - human-in-the-loop approach to simulate potential fixes
○ Program Repair
■ [168] - distribution guided inductive synthesis approach to repair decision making programs
General Testing Framework & Tools
● 5.7 General Testing Framework and Tools
○ [169] - framework to generate and validate test inputs for security testing
○ [170] - CNN testing framework
○ [171] - tool to help developers test and debug fairness bugs
○ [172] - testing framework covering different evaluation aspects, e.g. availability, achievability, robustness, avoidability, improvability
○ [173] - framework for designing ML algorithms that simplifies the regulation of undesired
behaviors
ML Properties To Be Tested
● 6.1 Correctness
○ Classical approaches: cross-validation & bootstrapping
○ Classical measures: accuracy, precision, recall, AUC etc
○ [177] - studied variability of training/test data when assessing correctness of ML classifier
○ [178] - studied different statistical methods for comparing AUC
○ [136] - “mirror” program as correctness oracle
○ [160] - survey of 175 Tensorflow bugs, 40 concern correctness
○ [179][180][181] - detecting data bugs
○ [116][120][121][123][130][154][162] - test input/test oracle design
○ [165][167][182] - test tool design
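As a concrete example of the classical approach named above, a cross-validation sketch using scikit-learn's standard API on a synthetic dataset:

```python
# Sketch of classical correctness assessment with 10-fold cross-validation;
# swapping scoring="roc_auc" or "precision" gives the other classical measures.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```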
ML Properties To Be Tested
● 6.2 Model Relevance
○ Model relevance evaluation detects mismatches between model and data
○ Poor model relevance is usually associated with over or under-fitting
○ [183] When a model is too complex for the data, even the noise in the training data is fitted by the model
○ [184][185][186] Overfitting can easily happen, especially when data is insufficient
○ [54] PMV injects noise into training data, re-trains the model, and uses the decrease in training accuracy to detect over/under-fitting (see the sketch after this slide). PMV performs better than 10-fold cross-validation
○ [42] overfitting detection by generating adversarial examples from test data
○ [187] repeated re-use of same test data will result in overfitting
○ [51] talks about training efficiency
○ [162] tries to address overfitting by re-sampling the training data
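A rough sketch of the PMV idea from [54] (my reconstruction, not the authors' code): flip a fraction of the training labels, retrain, and watch training accuracy; an over-capacity model keeps fitting the noise, so its training accuracy barely drops.

```python
# Sketch of Perturbed-Model-Validation-style overfitting detection:
# inject label noise, retrain, and compare training accuracy across
# noise levels for a modest-capacity vs an unconstrained model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

for depth in (3, None):                    # modest capacity vs unconstrained
    for noise in (0.0, 0.2, 0.4):
        y_noisy = y.copy()
        flip = rng.random(len(y)) < noise  # inject label noise
        y_noisy[flip] = 1 - y_noisy[flip]
        model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y_noisy)
        acc = model.score(X, y_noisy)      # accuracy on the (noisy) training set
        print(f"depth={depth}, noise={noise:.1f}: train acc={acc:.2f}")
# The unconstrained tree stays near 1.0 regardless of noise, a red flag
# for overfitting; the depth-3 tree's training accuracy drops with noise.
```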
ML Properties To Be Tested
● 6.3 Robustness & Security
○ Robustness Measurement
■ [137] correctness of the system in the presence of noise
■ DeepFool [188] computes perturbations to “fool” DNNs and quantify robustness
■ [189] 3 metrics: (a) pointwise (b) adversarial frequency (c) adversarial severity
■ [190] set of attacks to set upper bound for robustness
■ [191] upper and lower bounds based on test data
■ DeepSafe [192] data-driven approach to assessing DNN robustness
■ [193] - probabilistic robustness
■ [194] Bayesian Deep Learning to model propagation of errors
○ Perturbation Targeting Test Data
■ [190] adversarial example generation using distance metrics to measure similarity
■ [196][197] library to standardize adversarial example construction
○ Perturbation Targeting Whole System
■ AVFI [198] software fault injection to approx h/w errors for AV systems
■ Kayotee [199] systematically injects faults into s/w & h/w systems for AV systems
■ DriveFI [96] - mines situations and faults that maximally impact AV safety
■ [102] closed loop behavior of whole system to support adversarial example generation (also AV)
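A sketch of robustness probing with an FGSM-style signed-gradient perturbation (the generic technique behind much of the adversarial-example work above, applied here to a toy hand-rolled logistic model rather than a DNN):

```python
# Sketch of a fast-gradient-sign perturbation: take one step of size eps
# in the direction that increases the loss, then check whether the
# prediction flips within an L-infinity budget of eps.
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=(20,)), 0.0            # a "trained" linear classifier (toy)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

def fgsm(x, y, eps=0.1):
    # Gradient of the cross-entropy loss w.r.t. the input is (p - y) * w.
    grad = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad)            # one signed-gradient step

x = rng.normal(size=(20,))
y = int(sigmoid(w @ x + b) > 0.5)             # treat current prediction as label
x_adv = fgsm(x, y)
print("clean prediction:", y,
      "adversarial prediction:", int(sigmoid(w @ x_adv + b) > 0.5),
      "max perturbation:", np.abs(x_adv - x).max())
```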
ML Properties To Be Tested
● 6.4 Efficiency
○ Already covered elsewhere
● 6.6 Interpretability
○ Manual Assessment of Interpretability
■ [65] taxonomy of evaluation approaches for interpretability
■ [215] local and global interpretability
■ [215] Decision Trees and Logistic regression more locally interpretable than DNNs
○ Automatic Assessment of Interpretability
■ [46] metric to understand the behaviors of an ML model
■ [70] measure interpretability based on category of ML model
■ [216] Metamorphic Relation Patterns (MRP) to help users understand how ML system works
○ Evaluation of Interpretability Improvement Methods
■ [217] investigated several interpretability improving methods
● 6.7 Privacy
○ [218] treat programs as grey boxes and detect differential privacy violations via statistical tests
○ [219] proposed to estimate the ∊ parameter in differential privacy
ML Properties To Be Tested
● 6.5 Fairness
○ Causes of Unfairness
■ [201] 5 causes: skewed sample, tainted examples, limited features, sample size disparity, proxies
■ [171] fairness bugs can offend and harm users, cause embarrassment, mistrust, rev loss, legal action
○ Fairness Definitions and Measurement Metrics
■ [202][203][204][205] several definitions of fairness but no firm consensus
■ Fairness Through Unawareness (FTU)
● Algorithm is fair so long as protected attributes not explicitly used in decision making
● [206][202][207]
■ Group Fairness
● Demographic parity [208], Equalized odds, Equal opportunity [139]
■ Counter-factual Fairness
● [206] a model satisfies this if the output is same when protected attribute is flipped
■ Individual Fairness
● [138] task-specific similarity metric to describe pairs of individuals that should be regarded as similar
■ Analysis and Comparison of Fairness Metrics [202][203][62][204]
■ Support for Fairness Improvement
● RobinHood [209][211] - multiple user defined fairness definitions
● [75] fairness-aware programming, [212] fairness classification, [168] distribution-guided inductive synthesis
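One of the group-fairness metrics above, demographic parity [208], can be sketched on made-up data as follows (the bias in the toy model is deliberate):

```python
# Sketch of demographic parity: the rate of positive predictions should be
# (nearly) independent of the protected attribute.
import numpy as np

rng = np.random.default_rng(0)
protected = rng.integers(0, 2, size=1000)   # e.g. 0 = group A, 1 = group B
y_pred = rng.random(1000) < np.where(protected == 0, 0.40, 0.55)  # biased toy model

rate_a = y_pred[protected == 0].mean()
rate_b = y_pred[protected == 1].mean()
print(f"P(pos|A)={rate_a:.2f}  P(pos|B)={rate_b:.2f}  "
      f"demographic parity gap={abs(rate_a - rate_b):.2f}")
# A large gap fails demographic parity; equalized odds would additionally
# condition these rates on the true label.
```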
ML Properties To Be Tested
● 6.5 Fairness
○ Test Generation Techniques for Fairness Testing
■ Themis [5][213] generates tests randomly for group fairness
■ Aequitas [101] test generation to uncover discriminatory inputs
■ [109] symbolic execution together with local explainability to generate test inputs
■ [171] easily interpretable bug reports
■ [122] checking whether algorithm under test is sensitive to training data changes
ML Testing Components
● 7.1 Bug Detection in Data
○ Importance of Bug Detection in Data
■ [45] important to detect data bugs early as output of ML systems often used as input data downstream
■ [220] [8] data testing is challenging
○ Bug Detection in Training Data
■ Rule-Based Approach
● [179] data linter - (a) miscoded data, (b) outliers and scaling (c) packaging errors eg. dupes, empty examples
● [46] metrics to evaluate whether training data covered all scenarios
■ Performance Based Data Bug Detection
● MODE [162] identifies “faulty” neurons
○ Bug Detection in Test Data
■ [221] augment DNNs with small sub-network to detect adversarial perturbations
■ [222] DNN model mutation to expose adversarial examples
■ [224] compared 10 approaches to detect adversarial examples
○ Skew Detection in Training and Test Data
■ [127] using KDE and distance metric to detect skew
■ [45] studied skew in training and test data
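A sketch of training/test skew detection in the spirit of the work above (my illustration, not the cited methods): compare per-feature distributions of the training and test data with a two-sample Kolmogorov-Smirnov test and flag features that have drifted.

```python
# Sketch of skew detection between training and test data: a two-sample
# KS test per feature; a small p-value flags a drifted feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 3))
test = train[:1000].copy()
test[:, 2] += 0.5                     # simulate drift: feature 2 has shifted

for j in range(train.shape[1]):
    stat, p = ks_2samp(train[:, j], test[:, j])
    flag = "  <-- skew" if p < 0.01 else ""
    print(f"feature {j}: KS={stat:.3f}, p={p:.1e}{flag}")
```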
ML Testing Components
● 7.1 Bug Detection in Data
○ Frameworks in Detecting Data Bugs
■ [45] data validation system using constraints to find bugs in data (used as part of TFX)
■ ActiveClean [225][226] - iterative data cleaning
■ BoostClean [180] - to detect domain value violations
■ [181] automatic “unit testing” of large scale data sets
■ AlphaClean [228] - greedy tree search to auto tune parameters for data cleaning
■ [182] ML platform that includes data cleansing
■ [229] adopting traditional database/data warehousing data cleaning approaches
ML Testing Components
● 7.2 Bug Detection in Learning Programs
○ Unit Tests for ML Learning Program
■ [230] unit testing for Tensorflow
■ [231] collection of unit tests for stochastic optimization
○ Algorithm Configuration Examination
■ [232][233] identified OS, language and h/w compatibility issues
■ [232] focused on sklearn, Paddle, Caffe
■ [233] focused on Tensorflow, Theano, Torch
■ [160] - most common learning program bug due to change of Tensorflow API
■ [234] - testing algorithm parameters in DNN testing problems
○ Algorithm Selection Examination
■ [235] compared DNN with classical ML
○ Mutant Simulations of Learning Program Faults
■ [117][128] used mutations to simulate program-code errors
■ [237] static analysis of tensors in Tensorflow
ML Testing Components
● 7.3 Bug Detection in Framework
○ Study of Framework Bugs
■ [238] studied bugs from Tensorflow, Caffe, Torch. Common issues: crashes, non-termination, out of memory
■ [233] studied Tensorflow, Theano, Torch - compared runtime efficiency, training accuracy, robustness etc
■ [232] 10% of framework bugs are efficiency bugs
○ Implementation Testing of Frameworks
■ Challenges in Implementation Bug Detection
● [159] 22% of the 500 bugs studied were due to incorrect algorithm implementation
● [240] injected implementation bugs into classic ML code in Weka
■ Solutions for Implementation Bug Detection
● [135][48] used differential testing to identify faults (10 in Naive Bayes impl and 20 in KNN impl)
● [10][115] first to discuss possibility of using metamorphic relations in testing ML
● [118] - metamorphic relations to test supervised learning classifiers
● [121] - applied metamorphic relations to finding bugs in image classification
○ Study of Framework Test Oracles
■ [133] studied approximated oracles for Tensorflow, Theano, Pytorch, Keras
Structure of Talk
● The Lay of the Land
● The “Heavy” Parts (or Deep Dive)
● ML Testing Challenges and Opportunities
○ See 10 Challenges and Opportunities in survey paper
Challenges in ML Testing
● Challenges in Test Input Generation
○ Search-Based Software Testing (SBST) [87]
■ A good fit for ML as it searches large input spaces
○ Adversarial inputs for robustness
■ Criticism that the test data is not natural
○ Making test inputs as natural as possible
■ DeepTest [76], DeepHunter [92], DeepRoad [79] (AV) - however, some amount of “unnaturalness” remains
● Challenges in Test Assessment Criteria
○ [290] lack of systematic evaluation of how different assessment metrics correlate with bug-revealing success
● Challenges relating to Oracle Problem
○ Metamorphic relations require human ingenuity to discover/design
○ [128] flaky tests arise in metamorphic testing
○ Pseudo oracles may also lead to many false positives
● Challenges in Testing Cost Reduction
○ ML component testing requires retraining
○ Low footprint devices eg. mobile devices, IoT
Research Opportunities in ML Testing
● Testing More Application Scenarios
○ Most research is on supervised image classification; little on unsupervised learning, NLP, or reinforcement learning
● Testing More ML Categories and Tasks
○ Transfer learning testing is still underdeveloped
● Testing Other Properties
○ Most work focuses on correctness and robustness; little on efficiency, model relevance, or interpretability
● Presenting More Testing Benchmarks
○ See Table 5-8
● Covering More Testing Activities
○ No coverage of requirements analysis, little on online testing; data testing is also lacking
○ Regression testing, bug report analysis, bug triage for ML systems
● Mutating Investigation in Machine Learning System
○ How to better design mutation operators for ML code
Q&A

More Related Content

What's hot

Link prediction with the linkpred tool
Link prediction with the linkpred toolLink prediction with the linkpred tool
Link prediction with the linkpred toolRaf Guns
 
Catch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in ProductionCatch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in ProductionDatabricks
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Praxitelis Nikolaos Kouroupetroglou
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Krishnaram Kenthapadi
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersSungchul Kim
 
Final thesis presentation
Final thesis presentationFinal thesis presentation
Final thesis presentationPawan Singh
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overviewalessio_ferrari
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Continual learning: Survey
Continual learning: SurveyContinual learning: Survey
Continual learning: SurveyWonjun Jeong
 
Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)Krishnaram Kenthapadi
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDatabricks
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
MLOps with Azure DevOps
MLOps with Azure DevOpsMLOps with Azure DevOps
MLOps with Azure DevOpsMarco Parenzan
 
Workshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data ScienceWorkshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data ScienceNeo4j
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Neo4j
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 

What's hot (20)

Link prediction with the linkpred tool
Link prediction with the linkpred toolLink prediction with the linkpred tool
Link prediction with the linkpred tool
 
Catch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in ProductionCatch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in Production
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
 
Final thesis presentation
Final thesis presentationFinal thesis presentation
Final thesis presentation
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
 
Deep ar presentation
Deep ar presentationDeep ar presentation
Deep ar presentation
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Continual learning: Survey
Continual learning: SurveyContinual learning: Survey
Continual learning: Survey
 
MLOps in action
MLOps in actionMLOps in action
MLOps in action
 
Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
MLOps with Azure DevOps
MLOps with Azure DevOpsMLOps with Azure DevOps
MLOps with Azure DevOps
 
Workshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data ScienceWorkshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data Science
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 

Similar to Machine learning testing survey, landscapes and horizons, the Cliff Notes

Testing survey by_directions
Testing survey by_directionsTesting survey by_directions
Testing survey by_directionsTao He
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Praveen Penumathsa
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code OptimizationIRJET Journal
 
Generation of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksGeneration of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksMarkus Scheidgen
 
Testing of Object-Oriented Software
Testing of Object-Oriented SoftwareTesting of Object-Oriented Software
Testing of Object-Oriented SoftwarePraveen Penumathsa
 
cyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptx
cyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptxcyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptx
cyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptxSaiKiran101146
 
Model Drift Monitoring using Tensorflow Model Analysis
Model Drift Monitoring using Tensorflow Model AnalysisModel Drift Monitoring using Tensorflow Model Analysis
Model Drift Monitoring using Tensorflow Model AnalysisVivek Raja P S
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoMLArpitha Gurumurthy
 
2cee Master Cocomo20071
2cee Master Cocomo200712cee Master Cocomo20071
2cee Master Cocomo20071CS, NcState
 
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Lionel Briand
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNAVER Engineering
 
Validation and Verification of SYSML Activity Diagrams Using HOARE Logic
Validation and Verification of SYSML Activity Diagrams Using HOARE Logic Validation and Verification of SYSML Activity Diagrams Using HOARE Logic
Validation and Verification of SYSML Activity Diagrams Using HOARE Logic ijseajournal
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET Journal
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET Journal
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruptionjagan477830
 
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.Wolfgang Grieskamp
 
An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey
 An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey
An Adjacent Analysis of the Parallel Programming Model Perspective: A SurveyIRJET Journal
 

Similar to Machine learning testing survey, landscapes and horizons, the Cliff Notes (20)

Test for AI model
Test for AI modelTest for AI model
Test for AI model
 
Testing survey by_directions
Testing survey by_directionsTesting survey by_directions
Testing survey by_directions
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code Optimization
 
Generation of Random EMF Models for Benchmarks
Generation of Random EMF Models for BenchmarksGeneration of Random EMF Models for Benchmarks
Generation of Random EMF Models for Benchmarks
 
Testing of Object-Oriented Software
Testing of Object-Oriented SoftwareTesting of Object-Oriented Software
Testing of Object-Oriented Software
 
cyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptx
cyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptxcyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptx
cyberbullyingdetectionusingmachinelearning-11-220913143556-fec10e26.pptx
 
Model Drift Monitoring using Tensorflow Model Analysis
Model Drift Monitoring using Tensorflow Model AnalysisModel Drift Monitoring using Tensorflow Model Analysis
Model Drift Monitoring using Tensorflow Model Analysis
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
2cee Master Cocomo20071
2cee Master Cocomo200712cee Master Cocomo20071
2cee Master Cocomo20071
 
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...Making Model-Driven Verification Practical and Scalable: Experiences and Less...
Making Model-Driven Verification Practical and Scalable: Experiences and Less...
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
 
Validation and Verification of SYSML Activity Diagrams Using HOARE Logic
Validation and Verification of SYSML Activity Diagrams Using HOARE Logic Validation and Verification of SYSML Activity Diagrams Using HOARE Logic
Validation and Verification of SYSML Activity Diagrams Using HOARE Logic
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware Performance
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
Model-Based Testing: Theory and Practice. Keynote @ MoTiP (ISSRE) 2012.
 
An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey
 An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey
An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey
 

More from Heemeng Foo

Using ml to accelerate failure analysis
Using ml to accelerate failure analysisUsing ml to accelerate failure analysis
Using ml to accelerate failure analysisHeemeng Foo
 
Cyclomatic and cognitive complexity
Cyclomatic and cognitive complexityCyclomatic and cognitive complexity
Cyclomatic and cognitive complexityHeemeng Foo
 
Testing in digital agriculture
Testing in digital agricultureTesting in digital agriculture
Testing in digital agricultureHeemeng Foo
 
Device lab trials and tribulations
Device lab trials and tribulationsDevice lab trials and tribulations
Device lab trials and tribulationsHeemeng Foo
 
The application of Genetic Algorithms to the Berth Allocation Problem
The application of Genetic Algorithms to the Berth Allocation ProblemThe application of Genetic Algorithms to the Berth Allocation Problem
The application of Genetic Algorithms to the Berth Allocation ProblemHeemeng Foo
 
An introduction to AI in Test Engineering
An introduction to AI in Test EngineeringAn introduction to AI in Test Engineering
An introduction to AI in Test EngineeringHeemeng Foo
 

More from Heemeng Foo (6)

Using ml to accelerate failure analysis
Using ml to accelerate failure analysisUsing ml to accelerate failure analysis
Using ml to accelerate failure analysis
 
Cyclomatic and cognitive complexity
Cyclomatic and cognitive complexityCyclomatic and cognitive complexity
Cyclomatic and cognitive complexity
 
Testing in digital agriculture
Testing in digital agricultureTesting in digital agriculture
Testing in digital agriculture
 
Device lab trials and tribulations
Device lab trials and tribulationsDevice lab trials and tribulations
Device lab trials and tribulations
 
The application of Genetic Algorithms to the Berth Allocation Problem
The application of Genetic Algorithms to the Berth Allocation ProblemThe application of Genetic Algorithms to the Berth Allocation Problem
The application of Genetic Algorithms to the Berth Allocation Problem
 
An introduction to AI in Test Engineering
An introduction to AI in Test EngineeringAn introduction to AI in Test Engineering
An introduction to AI in Test Engineering
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Machine learning testing survey, landscapes and horizons, the Cliff Notes

  • 1. © Hee-Meng Foo, 2020 Machine Learning Testing: Survey, Landscapes and Horizons The Cliff Notes Hee-Meng Foo, Apr 2020
  • 2. © Hee-Meng Foo, 2020 What this talk is about ● Essentially the “Cliff Notes” of this paper (37 page long): “Machine Learning Testing: Survey, Landscapes and Horizons”, Jie M. Zhang, Mark Harman, Lei Ma, Yan Liu, arXiv:1906.10742v2 [cs.LG], 21 Dec 2019 ● Please don’t shoot the messenger ● There are ~ 300 citations, I have not read most of them ● “[XX]” (eg. [23]) in slides refer to a citation from the paper ● Slides will reference sections of paper eg. “3. Machine Learning Testing”
  • 3. © Hee-Meng Foo, 2020 Sections Skipped ● 2. Preliminaries of Machine Learning ● 4. Paper Collection and Review Schema ● 8. Application Scenarios
  • 4. © Hee-Meng Foo, 2020 Structure of Talk ● The Lay of the Land ● The “Heavy” Parts (or Deep Dive) ● ML Testing Challenges and Opportunities
  • 5. © Hee-Meng Foo, 2020 Machine Learning Testing Is New But Growing Source: the survey paper. Section: 1. Introduction
  • 6. © Hee-Meng Foo, 2020 Timeline of Research Source: the survey paper, 9. Analysis of Literature Review
  • 7. © Hee-Meng Foo, 2020 Distribution of ML vs DNN Testing Research Source: the survey paper, 9. Analysis of Literature Review
  • 8. © Hee-Meng Foo, 2020 Distribution of Supervised, Unsupervised & Reinforcement Learning Testing Research Source: the survey paper, 9. Analysis of Literature Review
  • 9. © Hee-Meng Foo, 2020 Research Distribution Among Testing Properties Source: the survey paper, 9. Analysis of Literature Review
  • 10. © Hee-Meng Foo, 2020 Number of Datasets vs Number of Papers Source: the survey paper, 9. Analysis of Literature Review
  • 11. © Hee-Meng Foo, 2020 Dataset Categorization Source: the survey paper, 9. Analysis of Literature Review Table 6. NLP Table 7. Decision Making Table 8. Others
  • 12. © Hee-Meng Foo, 2020 Preliminaries ● The Oracle Problem ○ Testing involves examining the behavior of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behavior from potentially incorrect behavior is called the “test oracle problem”. ○ See https://www.youtube.com/watch?v=cquyBmIh0e4 ● Metamorphic Testing ○ TSP example ● Differential Testing ○ Compare the result from > 2 other systems eg. compilers. DeepXplore [1] ● Adversarial Testing ○ Use perturbed data to test robustness ● MC/DC coverage ○ What is minimum combinations of a set of predicates to cover all possible meaningful combinations ○ Eg. A ^ B ^ C v D (truth table is 2^4) ○ See https://www.youtube.com/watch?v=HzmnCVaICQ4 Source: http://algorist.com/problems/Traveling_Salesman_Problem.html A X
  • 13. © Hee-Meng Foo, 2020 What Makes ML Testing Hard ● Data driven ○ Decision logic arrived via training procedure ○ Model behavior changes over time with new data ○ Need also to factor in data in testing ● ML Oracle Problem ○ Eg. K-means clustering - how do you know this is best result? ○ Involves designing metamorphic relations to address ● ML is less obvious to modularize and hence test in isolation ○ “Low accuracy/precision of a ML model … arising from a combination of behaviors of different components such as training data, the learning program, and even the learning framework/library” ● Errors may propagate and become amplified or suppressed Source: the survey paper, 1. Introduction
  • 14. © Hee-Meng Foo, 2020 Online vs Offline Testing Source: the survey paper, 3.2 ML Testing Workflow Online testing - to address the gap that offline testing relies on test data but does not represent future data Approaches to online testing: ● A/B testing ● MAB (Multi-Armed Bandit) approach
© Hee-Meng Foo, 2020
Idealized Workflow of ML Testing
Source: the survey paper, 3.2 ML Testing Workflow
© Hee-Meng Foo, 2020
ML Components (and where bugs reside)
Source: the survey paper, 3.3 ML Testing Components
© Hee-Meng Foo, 2020
ML Components (and where bugs reside)
Source: the survey paper, 3.3 ML Testing Components
● Bugs in Data
○ Completeness
○ Representative of future data
○ Noise
○ Bias
○ Data poisoning
● Bugs in Learning Program
○ Algorithm designed, chosen, or configured improperly
○ Typos
● Bugs in ML Framework
○ Bugs in the framework itself, eg. Keras, scikit-learn etc.
© Hee-Meng Foo, 2020
ML Testing Properties (Functional)
● Correctness
○ True Correctness vs Empirical Correctness
● Model Relevance
○ VC-dimension, Rademacher Complexity for classification
○ See https://www.youtube.com/watch?v=gR9Q8pS03ZE
See survey paper, 3.4 ML Testing Properties
© Hee-Meng Foo, 2020
ML Testing Properties (Non-Functional)
● Robustness
○ ie. how resilient the ML system’s correctness is in the presence of perturbations
○ Local vs global robustness, ie. robustness wrt one test input vs all test inputs
● Security
○ The ML system’s resilience against potential harm, danger or loss via manipulating or illegally accessing ML components, eg. adversarial attacks, data poisoning
● Data Privacy
○ Current research focuses on how to provide privacy-preserving ML, not on detecting privacy violations
● Efficiency
○ ie. training or prediction speed; also a small footprint for mobile
● Fairness
○ Various sources of bias. Protected characteristics vs protected attributes vs sensitive attributes
● Interpretability
○ 2 aspects: transparency (how the model works) and post hoc explanations (other info derived from the model)
○ Good e-book on ML interpretability [70]
See survey paper, 3.4 ML Testing Properties
© Hee-Meng Foo, 2020
Software Testing vs ML Testing
Source: the survey paper, 3.5 Software Testing vs ML Testing
© Hee-Meng Foo, 2020
Studies of ML Bugs
● Thung et al [159]
○ Looked at bug reports from Apache Mahout, Lucene, OpenNLP
○ 22.6% of bugs due to incorrect implementation
○ 15.6% of bugs were non-functional
○ 5.6% data bugs
● Zhang et al [160]
○ Looked at 175 Tensorflow bugs
○ 18.9% Tensorflow API misuse
○ 13.7% unaligned tensor
○ 21.7% incorrect model parameter or structure
● Banerjee et al [161]
○ Looked at bugs in AV (autonomous vehicle) systems
○ ML and decision control accounted for 64% of disengagements
See survey paper, 5.5 Bug Report Analysis
© Hee-Meng Foo, 2020
The “Heavy” Parts
● Test Input Generation
● Test Oracles
● Test Adequacy
● Test Prioritization and Reduction
● Debug & Repair
● General Testing Framework & Tools
● ML Properties to be tested
● ML Testing Components
© Hee-Meng Foo, 2020
Test Input Generation
● 5.1.1 Domain Specific Test Input Synthesis
○ 2 categories: adversarial (perturbed data, robustness testing) & natural (testing the application scenario; sketch below)
○ DeepXplore [1] - generates “real world” test data for neuron coverage
○ DeepTest [76] - realistic image transforms for testing AVs
■ Detected > 1000 erroneous behaviors in CNNs/RNNs [77]
○ GANs [79] - driving scene based test generation with different weather conditions
○ DeepBillboard [81] - generates real-world adversarial billboards for testing AV systems
○ Audio based DNNs [82] - transformations tailored to audio inputs for testing
○ Image classification [83] - generates images and uses metamorphic relations for testing
○ Machine translation [86] - uses mutation of words to generate test inputs
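As an illustration of the “natural” category, here is a small DeepTest-flavoured sketch (not the actual DeepTest code): apply simple image transformations to a seed input and flag any transform under which the model’s output changes. StubModel, the seed image and the two transforms are placeholder assumptions of mine.

import numpy as np

def adjust_brightness(img, delta):
    return np.clip(img + delta, 0.0, 1.0)        # brighten/darken, stay in [0, 1]

def shift_horizontal(img, pixels):
    return np.roll(img, pixels, axis=1)          # crude shift; wraps at the border

def synthesize_tests(seed_img):
    for delta in (-0.2, -0.1, 0.1, 0.2):
        yield f"brightness{delta:+.1f}", adjust_brightness(seed_img, delta)
    for px in (-4, -2, 2, 4):
        yield f"shift{px:+d}px", shift_horizontal(seed_img, px)

class StubModel:                                  # stand-in for a real image classifier
    def predict(self, img):
        return int(img.mean() > 0.5)

model = StubModel()
seed = np.random.rand(32, 32)                     # stand-in for a real driving-scene image
baseline = model.predict(seed)
for name, variant in synthesize_tests(seed):
    if model.predict(variant) != baseline:        # inconsistent behaviour = candidate bug
        print(f"behaviour changed under transform: {name}")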
© Hee-Meng Foo, 2020
Test Input Generation
● 5.1.2 Fuzz and Search based Test Input Generation (sketch below)
○ What is fuzzing [9] - see https://www.youtube.com/watch?v=pcEy-4eZF6g
○ Search based test generation - uses metaheuristic search to guide the fuzz process [17][87][88]
○ TensorFuzz [89] - hill-climbing approach to explore the valid input space of Tensorflow graphs
○ DLFuzz [90] - based on ideas from DeepXplore (neuron coverage) to generate adversarial examples
○ DeepHunter [91] - metamorphic transformation based, coverage guided fuzzing technique
○ Feature guided test generation [93] - uses Scale-Invariant Feature Transform (SIFT) to identify features that represent an image with a Gaussian mixture model, then transforms the problem of finding adversarial examples into a 2-player stochastic game
○ [94] - evaluation of reinforcement learning with adversarial example generation
○ [95] - fuzzing and metamorphic testing to test LiDAR obstacle detection
○ [97] - test input generation for text classification + fuzzing that considers grammar
○ [98][99] - mutated sentences in Natural Language Inference (NLI) to generate test inputs for robustness testing
○ [101] - test input generation to highlight the discriminatory nature of the model
○ [102] - framework for testing AV systems that utilizes fuzz based + search based test generation
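A bare-bones fuzzing loop, in the spirit of the tools above but not taken from any of them: randomly perturb seed inputs and keep any mutant that flips the (stub) model’s prediction. Real fuzzers such as TensorFuzz and DLFuzz guide mutation with coverage feedback rather than the pure random search shown here.

import numpy as np

rng = np.random.default_rng(0)

class StubModel:                               # stand-in for a real model under test
    def predict(self, x):
        return int(x.sum() > 0)                # trivially simple decision boundary

def mutate(x, scale=0.1):
    return x + rng.normal(0.0, scale, size=x.shape)   # random perturbation operator

model = StubModel()
seeds = [rng.normal(size=8) for _ in range(5)]
failures = []
for seed in seeds:
    original = model.predict(seed)
    for _ in range(200):                       # fuzzing budget per seed
        mutant = mutate(seed)
        if model.predict(mutant) != original:  # prediction flip = candidate failure
            failures.append(mutant)
            break
print(f"found {len(failures)} prediction-flipping mutants")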
© Hee-Meng Foo, 2020
Test Input Generation
● 5.1.3 Symbolic Execution based Test Input Generation
○ What is symbolic execution [105] - see https://youtu.be/FlzroEd4pnw
○ Concolic testing (or DSE, also see video) - auto-generates test inputs to achieve high coverage
○ However, for ML we need to test the combination of code + data
○ [7] - symbolic analysis and statistical approach to generate effective tests
○ [108] outlines 3 challenges in applying symbolic execution to ML:
■ No explicit branching, highly non-linear, scalability issues due to the complexity of ML models
○ DeepCheck [108] - creates 1-pixel, 2-pixel attacks that fail image classification (sketch below)
○ DeepConcolic [111] - DSE for DNNs adopting MC/DC criteria for coverage
● 5.1.4 Synthetic Data to Test Learning Program
○ [112] - generated data with repeating/missing values or categorical data
○ [45] - synthetic training data that adheres to schema constraints to trigger hidden assumptions
○ [54] - synthetic data with known distributions to test overfitting
○ [113] - generates datasets with predictable characteristics to be used as pseudo oracles
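DeepCheck finds 1-pixel attacks via symbolic analysis; as a loose stand-in, the sketch below simply brute-forces single-pixel changes against a stub linear classifier and reports any that flip its decision. It illustrates what a 1-pixel attack is, not the symbolic machinery used to find one efficiently.

import numpy as np

class StubModel:                               # stand-in for a real image classifier
    def __init__(self, w):
        self.w = w
    def predict(self, img):
        return int(img.flatten() @ self.w > 0)

rng = np.random.default_rng(1)
img = rng.random((8, 8))                       # toy 8x8 "image"
model = StubModel(rng.normal(size=64))
label = model.predict(img)

found = []
for i in range(8):
    for j in range(8):
        for value in (0.0, 1.0):               # try extreme values at each pixel
            attacked = img.copy()
            attacked[i, j] = value
            if model.predict(attacked) != label:
                found.append((i, j, value))
print(f"{len(found)} single-pixel attacks found, e.g. {found[:3]}")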
© Hee-Meng Foo, 2020
Test Oracles
● 5.2.1 Metamorphic Relations as Test Oracles (sketch below)
○ Widely studied. Many metamorphic relations are based on transformations of data that are expected to yield unchanged, or certain expected changes in, predicted output
○ Coarse Grained Data Transformations
■ [115] - describes 6 transformations: additive, multiplicative, permutative, invertive, inclusive, exclusive
■ [116] - proposes 11 metamorphic relations for image classification
■ [117] - function level metamorphic relations (evaluation of 9 ML applications)
○ Fine Grained Data Transformations
■ [118] - proposed 5 types of metamorphic relations specific to certain models for supervised classifiers
■ [120] - discusses the differences in metamorphic relations between SVMs and DNNs
■ [54] - proposed Perturbed Model Validation (PMV), which combines metamorphic relations and data mutation to detect overfitting
■ [123] - studied metamorphic relations of Naive Bayes, KNN
■ METTLE [125] - 6 types of metamorphic relations for unsupervised learners
■ [113][126] - discussed the possibility of using different grained metamorphic relations to find problems in SVMs and DNNs
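A minimal example of a metamorphic relation in action, assuming scikit-learn is available: for k-NN, permuting the order of the training samples should leave every prediction unchanged (a relation in the spirit of [118]). The data here is synthetic continuous data, chosen so there are no distance ties that could legitimately change tie-breaking.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))                  # continuous features: no distance ties
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
baseline = clf.predict(X)

perm = rng.permutation(len(X))                 # metamorphic transform: reorder training set
clf_perm = KNeighborsClassifier(n_neighbors=3).fit(X[perm], y[perm])
followup = clf_perm.predict(X)

# the relation: predictions must be identical; a violation signals an implementation bug
assert (baseline == followup).all(), "metamorphic relation violated: training order matters"
print("permutation relation holds on all", len(X), "inputs")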
© Hee-Meng Foo, 2020
Test Oracles
● 5.2.1 Metamorphic Relations as Test Oracles
○ Metamorphic Relations Between Datasets
■ [127][45] - studied metamorphic relations between training and new data
■ [45] - studied metamorphic relations among different datasets close in time
○ Frameworks to Apply Metamorphic Relations
■ Amsterdam [128] - framework to automate the process of using metamorphic relations to detect ML bugs
■ Corduroy [117] - extends the Java Modelling Language to let developers specify metamorphic properties and generate test cases for ML testing
© Hee-Meng Foo, 2020
Test Oracles
● 5.2.2 Cross-Referencing as Test Oracles (sketch below)
○ Differential testing and N-version programming
■ [133] - 5-27% of test oracles for DNN libraries use differential testing
■ N-version programming, ie. generate multiple functionally equivalent programs to compare against
○ [135] - used differential testing to discover 16 faults from 7 Naive Bayes implementations & 13 faults from 19 KNN implementations
○ CRADLE [48] - an approach to finding and localizing bugs in DNN frameworks (CNTK, Tensorflow, Theano); 11 datasets (eg. ImageNet, MNIST), 30 pre-trained models
○ DeepXplore [1], DLFuzz [90] - used differential testing to find effective test inputs
○ [136] - uses “mirror” programs instead of different implementations for differential testing
● 5.2.3 Measurement Metrics for Designing Test Oracles
○ [137] - robustness metric
○ [138][139][140] - fairness metrics
○ [65][141] - interpretability metrics
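A toy differential-testing sketch, assuming scikit-learn: run the same inputs through two independent Gaussian Naive Bayes implementations - scikit-learn’s and a hand-rolled reference - and count disagreements, each of which is a candidate bug (or a numerical-tolerance difference) in one of them. The data and the reference implementation are my own.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

def naive_bayes_predict(Xtr, ytr, Xte, eps=1e-9):
    # reference implementation: Gaussian class-conditional log-likelihoods
    classes = np.unique(ytr)
    stats = []
    for c in classes:
        Xc = Xtr[ytr == c]
        stats.append((np.log(len(Xc) / len(Xtr)), Xc.mean(axis=0), Xc.var(axis=0) + eps))
    preds = []
    for x in Xte:
        scores = [prior - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
                  for prior, mu, var in stats]
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)

sk_pred = GaussianNB().fit(X, y).predict(X)
ref_pred = naive_bayes_predict(X, y, X)
disagreements = np.flatnonzero(sk_pred != ref_pred)   # candidate faults to investigate
print(f"{len(disagreements)} disagreements out of {len(X)} inputs")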
© Hee-Meng Foo, 2020
Test Adequacy
● 5.3.1 Test Coverage
○ Unlike traditional software, where the decision logic is in the code, code coverage is not as demanding a criterion for ML testing, since the decision logic is derived from training
○ Other proposed coverage techniques:
■ Neuron coverage (sketch below)
● DeepXplore [1] and [92]
■ MC/DC coverage variants
● [145] - MC/DC inspired DNN test coverage
■ Layer level coverage
● [92][148][149][147]
■ State level coverage
● For RNNs [82]
○ Limitations of coverage criteria research
■ Most focus on DNNs
■ [147][150] discuss the limitations of these approaches
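A sketch of the neuron coverage metric on a tiny hand-written NumPy MLP (real tools hook into framework internals rather than a hand-rolled forward pass): a neuron counts as “covered” if its activation exceeds a threshold on at least one test input, and coverage is the covered fraction over all neurons.

import numpy as np

rng = np.random.default_rng(0)
# random 2-layer MLP: 10 -> 16 -> 8, ReLU activations
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)

def forward_activations(x):
    h1 = np.maximum(x @ W1 + b1, 0.0)
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    return [h1, h2]                            # per-layer neuron activations

def neuron_coverage(test_inputs, threshold=0.5):
    covered = [np.zeros(16, bool), np.zeros(8, bool)]
    for x in test_inputs:
        for layer, acts in enumerate(forward_activations(x)):
            covered[layer] |= acts > threshold # neuron fired above threshold at least once
    total = sum(c.size for c in covered)
    return sum(c.sum() for c in covered) / total

tests = rng.normal(size=(100, 10))
print(f"neuron coverage: {neuron_coverage(tests):.2%}")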
© Hee-Meng Foo, 2020
Test Adequacy
● 5.3.2 Mutation Testing (or rather, Mutation Score)
○ Mutation testing, ie. fault injection (sketch below)
○ Mutation score = ratio of detected faults to all injected faults
○ DeepMutation [152] - mutates DNNs at the source or model level
○ [153] - proposes 5 mutation operators for DNNs
● 5.3.3 Surprise Adequacy
○ See [127] for a thorough explanation
○ ie. test data should be “sufficiently but not overly surprising” compared with the training data
○ “Surprise” - measured using (a) KDE or (b) distance between neuron activation vectors
● 5.3.4 Rule Based Checking of Test Adequacy
○ [154] - 28 test aspects to consider + a scoring system used by Google. 4 types: (a) tests for the model, (b) tests for ML infrastructure, (c) tests for ML data and (d) tests that the ML system works well over time
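A loose sketch of model-level mutation scoring in the spirit of DeepMutation; the weight-fuzzing operator and the toy linear classifier are my own simplifications. A mutant is “killed” if at least one test input makes its prediction differ from the original model’s; a higher kill ratio suggests a stronger test set.

import numpy as np

rng = np.random.default_rng(7)
w = rng.normal(size=5)                         # weights of a toy linear classifier

def predict(weights, X):
    return (X @ weights > 0).astype(int)

X_test = rng.normal(size=(50, 5))
original = predict(w, X_test)

N_MUTANTS, killed = 100, 0
for _ in range(N_MUTANTS):
    mutant_w = w + rng.normal(0.0, 0.2, size=w.shape)   # weight-fuzzing mutation operator
    if (predict(mutant_w, X_test) != original).any():   # test set detects this mutant
        killed += 1

print(f"mutation score: {killed / N_MUTANTS:.2f}")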
© Hee-Meng Foo, 2020
Test Prioritization and Reduction
● 5.4 Test Prioritization and Reduction (sketch below)
○ [155] - used DNN metrics eg. cross-entropy, surprisal and Bayesian uncertainty to prioritize test inputs
○ [156] - adversarial test input prioritization
○ [157] - used a sampling technique guided by the neurons of the last hidden layer of a DNN
○ [158] - test selection metrics based on “model confidence”
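A sketch of uncertainty-based prioritization in the spirit of [155]: rank test inputs by the entropy of the model’s softmax output so the most uncertain (and often most fault-revealing) inputs are inspected or labelled first. The logits here are random stand-ins for real model outputs.

import numpy as np

rng = np.random.default_rng(3)
logits = rng.normal(size=(20, 4))              # stand-in model outputs: 20 inputs, 4 classes

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(logits)
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # predictive uncertainty per input
priority_order = np.argsort(-entropy)          # highest-entropy inputs first
print("inspect test inputs in this order:", priority_order[:10])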
© Hee-Meng Foo, 2020
Debug & Repair
● Debugging Approaches (from 5.6 Debug and Repair)
○ Data Resampling
■ Generated test data used as training data for re-training - DeepXplore [1] 3% improvement, DeepTest [76] 46% improvement
■ “Faulty neurons” [162] - identified the neurons responsible for misclassification
○ Debugging Frameworks
■ Storm [163] - program transformation framework to generate smaller programs that can support debugging
■ tfdbg [164] - a debugger for ML models built on Tensorflow, with 3 components: Analyzer, NodeStepper, RunStepper
■ MISTIQUE [165] - system to capture, store and query model intermediaries for debugging
■ PALM [166] - tool that explains a complex model as a 2-part surrogate model
● Repair Approaches
○ Fix Understanding
■ [167] - human-in-the-loop approach to simulate potential fixes
○ Program Repair
■ [168] - distribution guided inductive synthesis approach to repair decision making programs
© Hee-Meng Foo, 2020
General Testing Framework & Tools
● 5.7 General Testing Framework and Tools
○ [169] - framework to generate and validate test inputs for security testing
○ [170] - CNN testing framework
○ [171] - tool to help developers test and debug fairness bugs
○ [172] - testing framework covering different evaluation aspects eg. availability, achievability, robustness, avoidability, improvability
○ [173] - framework for designing ML algorithms that simplifies the regulation of undesired behaviors
© Hee-Meng Foo, 2020
ML Properties To Be Tested
● 6.1 Correctness
○ Classical approaches: cross-validation & bootstrapping (sketch below)
○ Classical measures: accuracy, precision, recall, AUC etc
○ [177] - studied the variability of training/test data when assessing the correctness of an ML classifier
○ [178] - studied different statistical methods for comparing AUC
○ [136] - “mirror” program as a correctness oracle
○ [160] - survey of 175 Tensorflow bugs, 40 of which concern correctness
○ [179][180][181] - detecting data bugs
○ [116][120][121][123][130][154][162] - test input/test oracle design
○ [165][167][182] - test tool design
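The classical workflow in code, assuming scikit-learn: estimate accuracy with k-fold cross-validation, then report precision and recall on a held-out split. The synthetic dataset and logistic regression are just convenient stand-ins for any classifier.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated accuracy
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# precision/recall on a held-out split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
pred = model.fit(X_tr, y_tr).predict(X_te)
print(f"precision: {precision_score(y_te, pred):.3f}, recall: {recall_score(y_te, pred):.3f}")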
© Hee-Meng Foo, 2020
ML Properties To Be Tested
● 6.2 Model Relevance
○ Model relevance evaluation detects mismatches between model and data
○ Poor model relevance is usually associated with over- or under-fitting
○ [183] - when a model is too complex for the data, even the noise in the training data is fitted by the model
○ [184][185][186] - overfitting can easily happen, especially when data is insufficient
○ [54] - PMV injects noise into training data, re-trains the model, and uses the decrease in training accuracy to detect over/under-fitting; PMV performed better than 10-fold cross-validation (sketch below)
○ [42] - overfitting detection by generating adversarial examples from test data
○ [187] - repeated re-use of the same test data will result in overfitting
○ [51] - discusses training efficiency
○ [162] - tries to address overfitting by re-sampling the training data
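A loose sketch of the intuition behind PMV (not the paper’s exact procedure), assuming scikit-learn: flip a fraction of training labels, retrain, and watch how training accuracy reacts. An over-capacity model (an unpruned decision tree) keeps fitting the injected noise, while a better-matched model’s training accuracy drops with the noise rate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
rng = np.random.default_rng(0)

for name, make_model in [("logreg", lambda: LogisticRegression(max_iter=1000)),
                         ("deep tree", lambda: DecisionTreeClassifier())]:
    for noise in (0.0, 0.2, 0.4):
        y_noisy = y.copy()
        flip = rng.random(len(y)) < noise      # inject label noise at the given rate
        y_noisy[flip] = 1 - y_noisy[flip]
        model = make_model().fit(X, y_noisy)
        acc = model.score(X, y_noisy)          # accuracy on the (noisy) training set
        print(f"{name:9s} noise={noise:.1f} train acc={acc:.3f}")
# the unpruned tree stays near 1.0 at every noise level - a model-relevance red flag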
© Hee-Meng Foo, 2020
ML Properties To Be Tested
● 6.3 Robustness & Security
○ Robustness Measurement (sketch below)
■ [137] - correctness of the system in the presence of noise
■ DeepFool [188] - computes perturbations that “fool” DNNs to quantify robustness
■ [189] - 3 metrics: (a) pointwise robustness, (b) adversarial frequency, (c) adversarial severity
■ [190] - set of attacks to establish an upper bound for robustness
■ [191] - upper and lower bounds based on test data
■ DeepSafe [192] - data-driven approach to assessing DNN robustness
■ [193] - probabilistic robustness
■ [194] - Bayesian Deep Learning to model the propagation of errors
○ Perturbation Targeting Test Data
■ [190] - adversarial example generation using distance metrics to measure similarity
■ [196][197] - library to standardize adversarial example construction
○ Perturbation Targeting the Whole System
■ AVFI [198] - software fault injection to approximate h/w errors in AV systems
■ Kayotee [199] - systematically injects faults into s/w & h/w for AV systems
■ DriveFI [96] - mines situations and faults that maximally impact AV safety
■ [102] - closed loop behavior of the whole system to support adversarial example generation (also AV)
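A compact FGSM-style robustness check on a hand-trained NumPy logistic regression - a simplification of gradient-based perturbation ideas, not any particular tool’s implementation: perturb each input by epsilon in the sign of the loss gradient and measure the accuracy drop.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

# train logistic regression with plain gradient descent
w = np.zeros(5)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

def accuracy(inputs):
    return (((inputs @ w) > 0).astype(float) == y).mean()

# FGSM: for this model's cross-entropy loss, dLoss/dx = (p - y) * w
p = 1.0 / (1.0 + np.exp(-(X @ w)))
grad_x = np.outer(p - y, w)
for eps in (0.0, 0.1, 0.3):
    X_adv = X + eps * np.sign(grad_x)          # worst-case step within an L-inf ball
    print(f"eps={eps:.1f} accuracy={accuracy(X_adv):.3f}")   # accuracy should drop as eps grows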
© Hee-Meng Foo, 2020
ML Properties To Be Tested
● 6.4 Efficiency
○ Already covered elsewhere
● 6.6 Interpretability
○ Manual Assessment of Interpretability
■ [65] - taxonomy of evaluation approaches for interpretability
■ [215] - local and global interpretability
■ [215] - Decision Trees and Logistic Regression are more locally interpretable than DNNs
○ Automatic Assessment of Interpretability
■ [46] - metric to understand the behaviors of an ML model
■ [70] - measures interpretability based on the category of ML model
■ [216] - Metamorphic Relation Patterns (MRPs) to help users understand how an ML system works
○ Evaluation of Interpretability Improvement Methods
■ [217] - investigated several interpretability improving methods
● 6.7 Privacy
○ [218] - treats programs as grey boxes and detects differential privacy violations via statistical tests
○ [219] - proposed to estimate the ε parameter in differential privacy
© Hee-Meng Foo, 2020
ML Properties To Be Tested
● 6.5 Fairness
○ Causes of Unfairness
■ [201] - 5 causes: skewed sample, tainted examples, limited features, sample size disparity, proxies
■ [171] - fairness bugs can offend and harm users, and cause embarrassment, mistrust, revenue loss and legal action
○ Fairness Definitions and Measurement Metrics (sketch of two group metrics below)
■ [202][203][204][205] - several definitions of fairness, but no firm consensus
■ Fairness Through Unawareness (FTU)
● An algorithm is fair so long as protected attributes are not explicitly used in decision making
● [206][202][207]
■ Group Fairness
● Demographic parity [208], equalized odds, equal opportunity [139]
■ Counterfactual Fairness
● [206] - a model satisfies this if the output is the same when the protected attribute is flipped
■ Individual Fairness
● [138] - task specific similarity metric to describe pairs of individuals that should be regarded as similar
■ Analysis and Comparison of Fairness Metrics [202][203][62][204]
■ Support for Fairness Improvement
● RobinHood [209][211] - multiple user defined fairness definitions
● [75] - fairness aware programming, [212] - fairness classification, [168] - distribution guided inductive synthesis
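Two of the group-fairness metrics above, computed directly from predictions on synthetic data (“group” stands in for a protected attribute): demographic parity compares positive-prediction rates across groups, equal opportunity compares true-positive rates.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, size=n)             # protected attribute (e.g. two demographics)
y_true = rng.integers(0, 2, size=n)
y_pred = rng.integers(0, 2, size=n)            # stand-in model predictions

def positive_rate(mask):
    return y_pred[mask].mean()                 # P(prediction = 1 | group)

def true_positive_rate(mask):
    m = mask & (y_true == 1)
    return y_pred[m].mean()                    # P(prediction = 1 | group, label = 1)

dp_gap = abs(positive_rate(group == 0) - positive_rate(group == 1))
eo_gap = abs(true_positive_rate(group == 0) - true_positive_rate(group == 1))
print(f"demographic parity gap: {dp_gap:.3f}") # 0 = parity across groups
print(f"equal opportunity gap:  {eo_gap:.3f}") # 0 = equal TPR across groups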
© Hee-Meng Foo, 2020
ML Properties To Be Tested
● 6.5 Fairness
○ Test Generation Techniques for Fairness Testing
■ Themis [5][213] - generates tests randomly for group fairness
■ Aequitas [101] - test generation to uncover discriminatory inputs
■ [109] - symbolic execution together with local explainability to generate test inputs
■ [171] - easily interpretable bug reports
■ [122] - checking whether the algorithm under test is sensitive to training data changes
© Hee-Meng Foo, 2020
ML Testing Components
● 7.1 Bug Detection in Data
○ Importance of Bug Detection in Data
■ [45] - important to detect data bugs early, as the output of ML systems is often used as input data downstream
■ [220][8] - data testing is challenging
○ Bug Detection in Training Data
■ Rule-Based Approach (sketch below)
● [179] data linter - (a) miscoded data, (b) outliers and scaling, (c) packaging errors eg. dupes, empty examples
● [46] - metrics to evaluate whether training data covers all scenarios
■ Performance Based Data Bug Detection
● MODE [162] - identifies “faulty” neurons
○ Bug Detection in Test Data
■ [221] - augments DNNs with a small sub-network to detect adversarial perturbations
■ [222] - DNN model mutation to expose adversarial examples
■ [224] - compared 10 approaches to detecting adversarial examples
○ Skew Detection in Training and Test Data
■ [127] - uses KDE and a distance metric to detect skew
■ [45] - studied skew in training and test data
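A rule-based sketch in the spirit of the data linter [179], assuming pandas is available; the DataFrame has planted bugs (a missing value, a duplicate row, a constant column, an implausible age) and the checks are deliberately crude.

import numpy as np
import pandas as pd

df = pd.DataFrame({                            # toy training data with planted bugs
    "age":    [25, 31, 31, 29, 27, 33, 26, 30, 28, 32, np.nan, 280],
    "income": [50000, 62000, 62000, 58000, 61000, 59000,
               57000, 60000, 63000, 55000, 56000, 54000],
    "flag":   [1] * 12,                        # constant column
})

report = []
if df.isna().any().any():                      # missing values
    report.append(f"missing values in columns: {list(df.columns[df.isna().any()])}")
if df.duplicated().any():                      # exact duplicate rows
    report.append(f"{df.duplicated().sum()} exact duplicate rows")
for col in df.select_dtypes(include=np.number):
    if df[col].nunique(dropna=True) <= 1:      # constant (uninformative) column
        report.append(f"constant column: {col}")
    z = (df[col] - df[col].mean()) / (df[col].std() + 1e-12)
    if (z.abs() > 2.5).any():                  # crude z-score outlier check
        report.append(f"possible outliers in {col}")

print("\n".join(report) or "no data lint findings")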
© Hee-Meng Foo, 2020
ML Testing Components
● 7.1 Bug Detection in Data
○ Frameworks for Detecting Data Bugs
■ [45] - data validation system using constraints to find bugs in data (used as part of TFX)
■ ActiveClean [225][226] - iterative data cleaning
■ BoostClean [180] - detects domain value violations
■ [181] - automatic “unit testing” of large scale datasets
■ AlphaClean [228] - greedy tree search to auto-tune parameters for data cleaning
■ [182] - ML platform that includes data cleansing
■ [229] - adopts traditional database/data warehousing data cleaning approaches
© Hee-Meng Foo, 2020
ML Testing Components
● 7.2 Bug Detection in Learning Programs
○ Unit Tests for the ML Learning Program
■ [230] - unit testing for Tensorflow
■ [231] - collection of unit tests for stochastic optimization
○ Algorithm Configuration Examination
■ [232][233] - identified OS, language and h/w compatibility issues
■ [232] - focused on sklearn, Paddle, Caffe
■ [233] - focused on Tensorflow, Theano, Torch
■ [160] - the most common learning program bugs are due to changes in the Tensorflow API
■ [234] - testing algorithm parameters in DNN testing problems
○ Algorithm Selection Examination
■ [235] - compared DNNs with classical ML
○ Mutant Simulations of Learning Program Faults
■ [117][128] - used mutations to simulate program-code errors
■ [237] - static analysis of tensors in Tensorflow
© Hee-Meng Foo, 2020
ML Testing Components
● 7.3 Bug Detection in Framework
○ Study of Framework Bugs
■ [238] - studied bugs from Tensorflow, Caffe, Torch; common issues: crash, non-termination, out of memory
■ [233] - studied Tensorflow, Theano, Torch - compared runtime efficiency, training accuracy, robustness etc
■ [232] - 10% of framework bugs are efficiency bugs
○ Implementation Testing of Frameworks
■ Challenges in Implementation Bug Detection
● [159] - 22% of the 500 bugs studied were due to incorrect algorithm implementation
● [240] - injected implementation bugs into classic ML code in Weka
■ Solutions for Implementation Bug Detection
● [135][48] - used differential testing to identify faults (10 in Naive Bayes implementations and 20 in KNN implementations)
● [10][115] - first to discuss the possibility of using metamorphic relations in testing ML
● [118] - metamorphic relations to test supervised learning classifiers
● [121] - applied metamorphic relations to finding bugs in image classification
○ Study of Framework Test Oracles
■ [133] - studied approximated oracles for Tensorflow, Theano, Pytorch, Keras
© Hee-Meng Foo, 2020
Structure of Talk
● The Lay of the Land
● The “Heavy” Parts (or Deep Dive)
● ML Testing Challenges and Opportunities
○ See 10. Challenges and Opportunities in the survey paper
© Hee-Meng Foo, 2020
Challenges in ML Testing
● Challenges in Test Input Generation
○ SBST (Search Based Software Test generation) [87]
■ A good fit for ML as it searches large input spaces
○ Adversarial inputs for robustness
■ Criticism that the test data is not natural
○ Making test inputs as natural as possible
■ DeepTest [76], DeepHunter [92], DeepRoad [79] (AV) - however, some “unnaturalness” remains
● Challenges in Test Assessment Criteria
○ [290] - lack of systematic evaluation of how different assessment metrics correlate with bug revealing success
● Challenges Relating to the Oracle Problem
○ Metamorphic relations require human ingenuity to discover/design
○ [128] - flaky tests arise in metamorphic testing
○ Pseudo oracles may also lead to many false positives
● Challenges in Testing Cost Reduction
○ ML component testing requires retraining
○ Low footprint devices eg. mobile devices, IoT
© Hee-Meng Foo, 2020
Research Opportunities in ML Testing
● Testing More Application Scenarios
○ Most research is on supervised image classification; little on unsupervised learning, NLP or reinforcement learning
● Testing More ML Categories and Tasks
○ Transfer learning testing is still underdeveloped
● Testing Other Properties
○ Most work focuses on correctness and robustness; little on efficiency, model relevance, interpretability
● Presenting More Testing Benchmarks
○ See Tables 5-8
● Covering More Testing Activities
○ No coverage of requirement analysis, few papers on online testing, and data testing is also lacking
○ Regression testing, bug report analysis and bug triage for ML systems
● Mutating Investigation in Machine Learning Systems
○ How to better design mutation operators for ML code
© Hee-Meng Foo, 2020
Q&A