
Imbalanced classification problem: A remote sensing example


The video presenting the content of these slides, along with all related materials including the source code and sample data, can be downloaded from this link: http://amsantac.co/blog/en/2016/09/20/balanced-image-classification-r.html.

When conducting a supervised classification with machine learning algorithms such as Random Forests, one recommended practice is to work with a balanced training dataset. However, this recommendation is sometimes overlooked, either because its relevance goes unnoticed or because it is unclear how to deal with the imbalance.

In this presentation I first examine some of the consequences of working with an imbalanced dataset, using an image classification problem as an example. I then test and suggest some techniques for solving the problem.


  1. Imbalanced classification problem: A remote sensing example

     Ali Santacruz, R-Spatialist at amsantac.co
  2. Key ideas

     · Machine learning classifiers fail to cope with imbalanced training datasets
     · The performance of ML classifiers may become biased towards the majority class
     · Sampling methods are used to treat imbalanced datasets: undersampling, oversampling, synthetic data generation and cost-sensitive learning
     · Metrics such as precision, recall, or F-score are preferred over overall accuracy as performance measures when dealing with imbalanced datasets (see the sketch below)
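     To see why overall accuracy is a poor guide here, consider a toy two-class example (my own sketch, not from the slides): a model that always predicts the majority class looks highly accurate yet never detects the minority class.

         # Synthetic labels: 95% majority class, 5% minority class
         labels <- factor(c(rep("maj", 950), rep("min", 50)))
         # A "classifier" that always predicts the majority class
         preds <- factor(rep("maj", 1000), levels = levels(labels))

         mean(preds == labels)                  # overall accuracy: 0.95
         sum(preds == "min" & labels == "min") /
           sum(labels == "min")                 # minority-class recall: 0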
  3. Remote sensing example

     First let's import the image to be classified and the shapefile with the training data:

         library(rgdal)
         library(raster)
         library(caret)

         set.seed(123)

         img <- brick(stack(as.list(list.files("data/", "sr_band", full.names = TRUE))))
         names(img) <- c(paste0("B", 1:5), "B7")

         trainData <- shapefile("data/training_15.shp")
         responseCol <- "class"
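     A quick sanity check on the imported objects can catch path or band-order mistakes early (a minimal check, assuming the files above loaded correctly):

         nlayers(img)                     # expect 6 layers: B1-B5 and B7
         table(trainData[[responseCol]])  # training polygons per class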
  4. Extract data from image bands

         dfAll <- data.frame(matrix(vector(), nrow = 0, ncol = length(names(img)) + 1))

         for (i in 1:length(unique(trainData[[responseCol]]))){
           category <- unique(trainData[[responseCol]])[i]
           categorymap <- trainData[trainData[[responseCol]] == category, ]
           dataSet <- extract(img, categorymap)
           dataSet <- sapply(dataSet, function(x){cbind(x, class = rep(category, nrow(x)))})
           df <- do.call("rbind", dataSet)
           dfAll <- rbind(dfAll, df)
         }

         dim(dfAll)
         [1] 80943     7
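     The same pixel table can be built more compactly (a sketch, assuming the objects defined above): extract() with df = TRUE returns one row per pixel plus an ID column indexing the source polygon, which can then be mapped back to its class label.

         dfAll2 <- extract(img, trainData, df = TRUE)
         dfAll2$class <- trainData[[responseCol]][dfAll2$ID]  # polygon ID -> class
         dfAll2$ID <- NULL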
  5. Create partition for training and test sets

         inBuild <- createDataPartition(y = dfAll$class, p = 0.7, list = FALSE)
         training <- dfAll[inBuild, ]
         testing <- dfAll[-inBuild, ]

         dim(training)
         [1] 56662     7

         dim(testing)
         [1] 24281     7

         table(training$class)
             1     2     3     5     6     7
          4753 21626 14866  8093  3535  3789
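     createDataPartition samples within each class, so the training and test sets keep the original class proportions. Printing those proportions makes the imbalance explicit (a quick check on the objects above):

         round(prop.table(table(training$class)), 2)  # class 2 dominates, class 6 is rare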
  6. Model using imbalanced dataset

         training_imb <- training[sample(1:nrow(training), 2400), ]
         table(training_imb$class)
           1   2   3   5   6   7
         212 900 613 353 149 173

         mod1_imb <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf", data = training_imb)
         note: only 2 unique complexity parameters in default grid. Truncating the grid to 2.

         mod1_imb$results[, 1:2]
           mtry Accuracy
         1    2 0.979454
         2    3 0.977318
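     The truncation note appears because only three predictors are used, so caret's default grid yields just two distinct values of the random forest mtry parameter (2 and 3). Passing an explicit grid makes this choice visible and silences the note (a sketch using caret's standard tuneGrid argument):

         mod1_imb <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf",
                           data = training_imb, tuneGrid = data.frame(mtry = 2:3))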
  7. Balancing a dataset by undersampling

         undersample_ds <- function(x, classCol, nsamples_class){
           for (i in 1:length(unique(x[, classCol]))){
             class.i <- unique(x[, classCol])[i]
             if ((sum(x[, classCol] == class.i) - nsamples_class) != 0){
               x <- x[-sample(which(x[, classCol] == class.i),
                              sum(x[, classCol] == class.i) - nsamples_class), ]
             }
           }
           return(x)
         }
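     undersample_ds drops randomly chosen rows from each class until every class has exactly nsamples_class rows (it assumes each class starts with at least that many). For comparison, a minimal oversampling counterpart (my own sketch, not from the slides) upsamples small classes with replacement instead:

         oversample_ds <- function(x, classCol, nsamples_class){
           for (i in 1:length(unique(x[, classCol]))){
             class.i <- unique(x[, classCol])[i]
             n.i <- sum(x[, classCol] == class.i)
             if (n.i < nsamples_class){
               # draw row indices of this class with replacement and append them
               extra <- sample(which(x[, classCol] == class.i),
                               nsamples_class - n.i, replace = TRUE)
               x <- rbind(x, x[extra, ])
             }
           }
           return(x)
         }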
  8. Balance training dataset

         (nsamples_class <- 400)
         [1] 400

         training_bc <- undersample_ds(training, "class", nsamples_class)
         table(training_bc$class)
           1   2   3   5   6   7
         400 400 400 400 400 400
  9. Model using balanced dataset

         mod1_bc <- train(as.factor(class) ~ B3 + B4 + B5, method = "rf", data = training_bc)
         note: only 2 unique complexity parameters in default grid. Truncating the grid to 2.

         mod1_bc$results[, 1:2]
           mtry  Accuracy
         1    2 0.9797371
         2    3 0.9766507
  10. Evaluate accuracy of the two models using the testing set

         # Imbalanced data
         pred1_imb <- predict(mod1_imb, testing)
         confusionMatrix(pred1_imb, testing$class)$overall[1]
          Accuracy
         0.9829496

         # Balanced data
         pred1_bc <- predict(mod1_bc, testing)
         confusionMatrix(pred1_bc, testing$class)$overall[1]
          Accuracy
         0.9788312
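     Overall accuracy is nearly identical for the two models, which is precisely why it can mislead on imbalanced data. Kappa, which corrects for chance agreement and is widely used in remote sensing, comes from the same call (a quick check on the objects above):

         confusionMatrix(pred1_imb, testing$class)$overall["Kappa"]
         confusionMatrix(pred1_bc, testing$class)$overall["Kappa"]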
  11. Evaluate sensitivity in the two models using the testing set

         # Imbalanced data
         confusionMatrix(pred1_imb, testing$class)$byClass[, 1]
          Class: 1  Class: 2  Class: 3  Class: 5  Class: 6  Class: 7
         0.9951644 0.9794283 0.9806938 0.9945213 1.0000000 0.9809816

         # Balanced data
         confusionMatrix(pred1_bc, testing$class)$byClass[, 1]
          Class: 1  Class: 2  Class: 3  Class: 5  Class: 6  Class: 7
         0.9941973 0.9662191 0.9759849 0.9904844 1.0000000 0.9975460
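     byClass[, 1] is the per-class sensitivity (recall); note how the balanced model raises sensitivity for the small class 7 while giving up some on the large class 2. Recent caret versions also report per-class precision and F1 in the same matrix, the metrics recommended on slide 2 (a sketch, assuming a caret version that includes these columns):

         confusionMatrix(pred1_bc, testing$class)$byClass[, c("Precision", "F1")]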
  12. Further resources

     For a detailed explanation please see:
     · This post on my blog (includes the link for downloading the sample data and source code)
     · This video on my YouTube channel

     Also check out these useful resources:
     · Practical guide to deal with imbalanced classification problems in R
     · 8 tactics to combat imbalanced classes in your machine learning dataset
