July 11, 2007










    Presentation Transcript

    • An Approach to Software Testing of Machine Learning Applications. Chris Murphy, Gail Kaiser, Marta Arias (Columbia University)
    • Introduction
      • We are investigating the quality assurance of Machine Learning (ML) applications
      • Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
      • Machine Learning applications fall into a class for which it can be said that there is “no reliable oracle”
        • These are also known as “non-testable programs” and could fall into Davis and Weyuker’s class of “programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known.”
    • Introduction
      • We have developed an approach to creating test cases for Machine Learning applications:
          • Analyze the problem domain and real-world data sets
          • Analyze the algorithm as it is defined
          • Analyze an implementation’s runtime options
      • Our approach was designed for MartiRank and then generalized to other ranking algorithms such as Support Vector Machines (SVM)
    • Overview
      • Machine Learning Background
      • Testing Approach and Framework
      • Findings and Results
      • Evaluation and Observations
      • Future Work
    • Machine Learning Fundamentals
      • Data sets consist of a number of examples, each of which has attributes and a label
      • In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label
      • In the second phase, the model is applied to a previously-unseen data set (“testing” data) with unknown labels to produce a classification (or, in our case, a ranking)
        • This can be used for validation or for prediction
    • MartiRank and SVM
      • MartiRank was specifically designed for the device failure application
        • Seeks to find the combination of segmenting and sorting the data that produces the best result
      • SVM is typically a classification algorithm
        • Seeks to find a hyperplane that separates examples from different classes
        • Different “kernels” use different approaches
        • SVM-Light has a ranking mode based on the distance from the hyperplane
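To make "ranking based on the distance from the hyperplane" concrete, here is a minimal sketch. It assumes a linear kernel and a hand-picked weight vector `w` and bias `b` (both hypothetical), and it is not SVM-Light's code. It orders examples by the decision value `w·x + b`, which is proportional to the signed distance from the hyperplane and therefore gives the same ordering.

```python
from typing import List, Sequence


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Linear decision value w.x + b (proportional to the signed distance)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def rank_by_distance(examples: List[Sequence[float]],
                     w: Sequence[float], b: float) -> List[int]:
    """Return example indices ordered from highest to lowest decision value."""
    scores = [decision_value(w, x, b) for x in examples]
    return sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)


# Hypothetical model parameters and data, chosen only for illustration.
examples = [[1.0, 2.0], [3.0, 0.5], [0.0, 0.0]]
w = [1.0, -1.0]
b = 0.0
ranking = rank_by_distance(examples, w, b)
print(ranking)  # [1, 2, 0]
```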
    • Related Work
      • There has been much research into applying Machine Learning techniques to software testing, but not the other way around
      • Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness
    • Analyzing the Problem Domain
      • Consider properties of the real-world data sets
        • Data set size: Number of attributes and examples
        • Range of values: attributes and labels
        • Precision of floating-point numbers
        • Categorical data: how alphanumeric attributes are handled
      • Also, repeating or missing data values
    • Analyzing the Algorithm
      • Look for imprecisions in the specification, not necessarily bugs in the implementation
        • How to handle missing attribute values
        • How to handle negative labels
      • Consider how to construct a data set that could cause a “predictable” ranking
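One possible reading of a "predictable" data set is sketched below: a single attribute cleanly separates failures from non-failures, and the remaining attributes are random noise, so a reasonable ranker would be expected to place every failure first. The function name, value ranges, and 25% failure rate are illustrative assumptions, not details from the talk.

```python
import random


def make_predictable_dataset(n_examples, n_attributes, seed=0):
    """Build (attributes, label) rows where attribute 0 separates the labels."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_examples):
        label = 1 if i % 4 == 0 else 0  # assumption: 25% failures
        # Attribute 0 carries the signal: high for failures, low otherwise.
        signal = rng.uniform(0.6, 1.0) if label == 1 else rng.uniform(0.0, 0.4)
        noise = [rng.random() for _ in range(n_attributes - 1)]
        rows.append(([signal] + noise, label))
    rng.shuffle(rows)  # hide the construction order from the ranker
    return rows


data = make_predictable_dataset(20, 5)
# The expected ("predictable") ranking: sort by the signal attribute.
by_signal = sorted(data, key=lambda row: row[0][0], reverse=True)
labels_in_order = [label for _, label in by_signal]
print(labels_in_order)
```

If an implementation under test ranks this data set with failures mixed among non-failures, that is a cue to investigate, even without a full oracle.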
    • Analyzing the Runtime Options
      • Determine how the implementation may manipulate the input data
        • Permuting the input order
        • Reading the input in “chunks”
      • Consider configuration parameters
        • For example, disable anything probabilistic
      • Need to ensure that results are deterministic and repeatable
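A check for sensitivity to input order can be sketched as follows. The two rankers are hypothetical stand-ins, not MartiRank or SVM-Light: one breaks ties by example id, the other leaves ties in input order. The helper reruns a ranker on the reversed input plus several seeded shuffles and reports whether the output ever changes.

```python
import random


def rank_with_tiebreak(examples):
    """Order ids by descending score, breaking ties by id (order-independent)."""
    return [ex_id for ex_id, _ in sorted(examples, key=lambda e: (-e[1], e[0]))]


def rank_naive(examples):
    """Order ids by descending score only; ties keep their input order."""
    return [ex_id for ex_id, _ in sorted(examples, key=lambda e: -e[1])]


def is_order_invariant(ranker, examples, trials=10, seed=42):
    """Return True if the ranker gives the same output for permuted inputs."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    baseline = ranker(examples)
    permutations = [examples[::-1]]  # always include the reversed order
    for _ in range(trials):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        permutations.append(shuffled)
    return all(ranker(p) == baseline for p in permutations)


examples = [("a", 0.9), ("b", 0.5), ("c", 0.5), ("d", 0.1)]
invariant_with_tiebreak = is_order_invariant(rank_with_tiebreak, examples)
invariant_naive = is_order_invariant(rank_naive, examples)
print(invariant_with_tiebreak, invariant_naive)  # True False
```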
    • The Testing Framework
      • Data set generator: # of examples, # of attributes, % failures, % missing, any categorical data, repeat/no-repeat modes
      • Model comparison: specific to MartiRank
      • Ranking comparison: includes metrics like normalized equivalence and AUCs
      • Tracing options: for generating and comparing outputs of debugging statements
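The data set generator's parameters could be exposed roughly as in the sketch below. This is an assumed interface, not the authors' tool: the function name, argument names, and value ranges are hypothetical, and categorical attributes are left out for brevity.

```python
import random


def generate_dataset(n_examples, n_attributes, pct_failures, pct_missing,
                     repeat=True, seed=0):
    """Return a list of (attributes, label) rows; missing values are None."""
    rng = random.Random(seed)
    n_failures = round(n_examples * pct_failures / 100)
    labels = [1] * n_failures + [0] * (n_examples - n_failures)
    rng.shuffle(labels)

    columns = []
    for _ in range(n_attributes):
        if repeat:
            # Small value range, so repeated values are very likely.
            column = [rng.randint(0, 9) for _ in range(n_examples)]
        else:
            # Sampling without replacement: no value repeats within a column.
            column = rng.sample(range(10 * n_examples), n_examples)
        # Blank out roughly pct_missing percent of the cells.
        column = [None if rng.random() < pct_missing / 100 else v
                  for v in column]
        columns.append(column)

    return [([col[i] for col in columns], labels[i])
            for i in range(n_examples)]


data = generate_dataset(100, 4, pct_failures=10, pct_missing=0, repeat=False)
print(len(data), sum(label for _, label in data))  # 100 10
```

Seeding the generator keeps each generated data set reproducible, which matters when the same data is reused later as a regression test.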
    • Equivalence Classes
      • Data sizes of different orders of magnitude
      • Repeating vs. non-repeating attribute values
      • Missing vs. no-missing attribute values
      • Categorical vs. non-categorical data
      • 0/1 labels vs. non-negative integer labels
      • Predictable vs. non-predictable data sets
      • Used data set generator to parameterize test case selection criteria
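The equivalence classes above can be turned into generator configurations mechanically. The sketch below takes the full cross product of the classes; that is an assumption made for illustration (the talk does not say which combinations were actually run), and the key names and sizes are hypothetical.

```python
from itertools import product

# One entry per equivalence class listed above; values are illustrative.
equivalence_classes = {
    "n_examples": [10, 100, 1000],  # different orders of magnitude
    "repeat": [True, False],
    "missing": [True, False],
    "categorical": [True, False],
    "label_type": ["binary", "non_negative_int"],
    "predictable": [True, False],
}


def enumerate_configs(classes):
    """Return one dict per combination of equivalence-class choices."""
    names = list(classes)
    return [dict(zip(names, combo))
            for combo in product(*(classes[name] for name in names))]


configs = enumerate_configs(equivalence_classes)
print(len(configs))  # 96
print(configs[0])
```

When the full product is too large, a subset (for example, pairwise combinations) could be selected instead.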
    • Testing MartiRank
      • Produced a core dump on data sets with a large number of attributes (over 200)
      • Implementation does not correctly handle negative labels
      • Does not use a “stable” sorting algorithm
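Why sort stability can matter for a ranking is illustrated below. The example compares Python's built-in `sorted` (guaranteed stable) with a simple swap-based selection sort (not stable) on rows with tied keys. It is a generic illustration, not the sorting code used in MartiRank.

```python
def selection_sort(items, key):
    """Swap-based selection sort; not stable when keys are tied."""
    result = list(items)
    for i in range(len(result)):
        smallest = i
        for j in range(i + 1, len(result)):
            if key(result[j]) < key(result[smallest]):
                smallest = j
        # The swap can move an element past another with an equal key.
        result[i], result[smallest] = result[smallest], result[i]
    return result


rows = [("a", 2), ("b", 2), ("c", 1)]  # "a" and "b" are tied on the key
stable = sorted(rows, key=lambda r: r[1])
unstable = selection_sort(rows, key=lambda r: r[1])
print(stable)    # [('c', 1), ('a', 2), ('b', 2)]
print(unstable)  # [('c', 1), ('b', 2), ('a', 2)]
```

Both results are correctly sorted by key, yet the tied examples appear in different positions, which is enough to make two otherwise equivalent implementations produce different rankings.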
    • Regression Testing of MartiRank
      • Creation of a suite of testing data allowed us to use it for regression testing
      • Discovered that refactoring had introduced a bug into an important calculation
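Using a suite of generated data sets for regression testing amounts to recording outputs once and comparing later runs against them. The sketch below assumes outputs are JSON-serializable rankings stored one file per test; the function name and file layout are hypothetical.

```python
import json
import os
import tempfile


def check_regression(test_name, output, baseline_dir):
    """Record the output on first run; afterwards compare with the baseline."""
    path = os.path.join(baseline_dir, test_name + ".json")
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(output, f)
        return "recorded"
    with open(path) as f:
        expected = json.load(f)
    return "pass" if output == expected else "fail"


with tempfile.TemporaryDirectory() as baseline_dir:
    first = check_regression("small_binary", [3, 1, 2], baseline_dir)
    second = check_regression("small_binary", [3, 1, 2], baseline_dir)
    # Simulates a refactoring that changed the computed ranking.
    third = check_regression("small_binary", [1, 3, 2], baseline_dir)

print(first, second, third)  # recorded pass fail
```

A "fail" only signals a change in behavior; whether the old or the new output is correct still has to be judged separately, since there is no reliable oracle.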
    • Testing Multiple Implementations of MartiRank
      • We had three implementations developed by three different coders
      • Can be used as “pseudo-oracles” for each other
      • Used to discover a bug in the way one implementation was handling missing values
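Using several implementations as pseudo-oracles for each other can be sketched as a majority comparison. The three rankers below are hypothetical stand-ins, not the actual MartiRank implementations; the assumed specification is that examples with a missing value rank last, and the third ranker deliberately treats a missing value as 0.0.

```python
from collections import Counter


def rank_v1(examples):
    """Rank by descending value; examples with missing values go last."""
    present = [e for e in examples if e[1] is not None]
    missing = [e for e in examples if e[1] is None]
    ordered = sorted(present, key=lambda e: e[1], reverse=True) + missing
    return [ex_id for ex_id, _ in ordered]


def rank_v2(examples):
    """Same intended behavior as rank_v1, written with a composite sort key."""
    key = lambda e: (e[1] is None, -(e[1] or 0.0))
    return [ex_id for ex_id, _ in sorted(examples, key=key)]


def rank_buggy(examples):
    """Deviates from the assumed spec: treats a missing value as 0.0."""
    key = lambda e: -(e[1] if e[1] is not None else 0.0)
    return [ex_id for ex_id, _ in sorted(examples, key=key)]


def find_outliers(implementations, examples):
    """Return names of implementations that disagree with the majority."""
    outputs = {name: tuple(f(examples)) for name, f in implementations.items()}
    majority, _ = Counter(outputs.values()).most_common(1)[0]
    return [name for name, out in outputs.items() if out != majority]


implementations = {"v1": rank_v1, "v2": rank_v2, "buggy": rank_buggy}
examples = [("a", 1.5), ("b", None), ("c", -2.0)]
outliers = find_outliers(implementations, examples)
print(outliers)  # ['buggy']
```

Agreement with the majority is evidence, not proof: all implementations could share the same misunderstanding of the algorithm.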
    • Applying Approach to SVM-Light
      • Permuting the input data led to different models
        • Caused by “chunking” data for use by an approximating variant of the optimization algorithm
      • Introduction of noise in a data set in some cases caused it not to find a “predictable” ranking
      • Different kernels also caused different results with “predictable” rankings
    • Evaluation and Observations
      • Testing approach revealed bugs and imprecision in the implementations, as well as discrepancies from the stated algorithms
      • Inspection of the algorithms led to the creation of “predictable” data sets
      • What is “predictable” for one algorithm may not lead to a “predictable” ranking in another
      • An algorithm’s failure to address specific data set traits can lead to incorrect results (and/or inconsistent results across implementations)
      • The approach can be generalized to other Machine Learning ranking algorithms, as well as classification
    • Limitations and Future Work
      • Test suite adequacy for coverage not addressed
      • Can also include mutation testing for effectiveness of data sets
      • Should investigate creating large data sets that correlate to real-world data
      • Could also consider non-deterministic Machine Learning algorithms
    • Questions?