Feature Selection with Imbalanced Data in Agriculture
Mohamed Adel Omar, Ph.D. Student, Agriculture Research Center
Student Member of the Scientific Research Group in Egypt (SRGE)
Advanced Intelligent Systems for Sustainable Development (AISSD 2021) 20-22 August 2021
Agenda
1. Problem Definition
2. Measuring
3. Approaches
4. Proposed Solution
5. Road Map
Problem definition
• What is the class imbalance problem?
• It is the problem that arises when the number of examples belonging to one class is significantly greater than that of the other classes.
• For example:
– In a financial fraud data set, the vast majority of transactions belong to the non-fraud class, while fraudulent transactions are rare.
– In cancer data, the number of patients who have cancer is much smaller than the number who don't.
• The ratio of the minority class to the majority class can be 1:100, 1:1000, 1:10000, or even more extreme.
• Many other domains have imbalanced data sets:
– Customer churn
– Credit approval
– Network intrusion detection
– Protein detection
– Oil spill detection, etc.
• Standard algorithms perform poorly on imbalanced data.
• They minimize the global error rate without taking the data distribution into consideration.
• This causes a performance bias: poor accuracy on the minority class and high accuracy on the majority class.
• Yet correctly classifying minority-class examples is often more important than correctly classifying majority-class examples.
• The costs of misclassification differ:
– Misclassifying fraud costs more than misclassifying non-fraud.
– Misclassifying a buyer costs more than misclassifying a non-buyer.
Therefore, rather than general-purpose algorithms, we need more sophisticated approaches to handle the class imbalance problem.
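The performance bias above can be seen in a few lines of Python; the 99:1 split and the always-majority "classifier" are toy stand-ins:

```python
# A "classifier" that always predicts the majority class (0).
def majority_classifier(_x):
    return 0

# Toy data set: 990 majority (class 0) and 10 minority (class 1) examples.
y_true = [0] * 990 + [1] * 10
y_pred = [majority_classifier(x) for x in y_true]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(
    t == p == 1 for t, p in zip(y_true, y_pred)
) / sum(t == 1 for t in y_true)
# accuracy is 0.99 even though not a single minority example is detected
```

This is why plain accuracy is a misleading metric here, which motivates the measures in the next section.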
Metrics that can provide better insight
• Confusion Matrix: a table showing correct predictions and types of incorrect
predictions.
• Precision: the number of true positives divided by all positive predictions.
Precision is also called Positive Predictive Value. It is a measure of a classifier’s
exactness. Low precision indicates a high number of false positives.
• Recall: the number of true positives divided by the number of positive values
in the test data. The recall is also called Sensitivity or the True Positive Rate. It
is a measure of a classifier’s completeness. Low recall indicates a high number
of false negatives.
• F1 Score: the harmonic mean of precision and recall.
• Area Under the ROC Curve (AUROC): the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one.
Confusion Matrix

                 Predicted: Yes    Predicted: No
Actual: Yes      TP                FN
Actual: No       FP                TN
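As a minimal sketch, the metrics above can be computed directly from the confusion-matrix cells; the counts below are made-up:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix cells."""
    precision = tp / (tp + fp)          # exactness: TP over all positive predictions
    recall = tp / (tp + fn)             # completeness: TP over all actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts: 8 minority hits, 2 false alarms, 4 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

Note how a low recall (many false negatives) drags F1 down even when precision is high.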
Approaches
Solutions to the data imbalance problem can be classified into five groups:
1. Data level
2. Algorithmic level
3. Ensemble level
4. Hybrid level
5. Feature selection level
Data level
• Data-level methods modify the class distribution of the data.
• The approaches include under-sampling and over-sampling.
• The Synthetic Minority Over-sampling Technique (SMOTE) is the state-of-the-art method. SMOTE generates synthetic examples in feature space: for each selected minority example, it finds the example's k nearest minority-class neighbors, randomly chooses one of them, and interpolates a new synthetic example between the two.
(Figure: class distributions of the original data, after under-sampling, and after over-sampling.)
Pros: can be applied with any learning algorithm, without modifying the algorithm itself.
Cons:
• Over-sampling can cause the model to over-fit.
• Over-sampling increases computational cost.
• Under-sampling can discard important information.
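A minimal SMOTE sketch in pure Python, assuming 2-D feature vectors and a toy minority set (in practice one would use a library such as imbalanced-learn on the full training data):

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority example
    and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x among the other minority examples
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Toy minority class; each synthetic point lies on a segment between
# a minority example and one of its neighbors.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=4)
```

Because every synthetic point is an interpolation, it stays inside the convex region spanned by the minority class, which is exactly what makes SMOTE gentler than naive random oversampling.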
1. Data Level Approaches

Oversampling:
• Random Oversampling
• SMOTE
• Borderline SMOTE
• SVM SMOTE
• k-Means SMOTE
• ADASYN

Undersampling:
• Random Undersampling
• Condensed Nearest Neighbor
• Tomek Links
• Edited Nearest Neighbors
• Neighborhood Cleaning Rule
• One Sided Selection

Hybrid:
• SMOTE and Random Under-sampling
• SMOTE and Tomek Links
• SMOTE and Edited Nearest Neighbors
2. Algorithmic Level Approaches

Cost-Sensitive:
• Logistic Regression
• Decision Trees
• Support Vector Machines
• Artificial Neural Networks
• Bagged Decision Trees
• Random Forest
• Stochastic Gradient Boosting

One-Class:
• One-Class Support Vector Machines
• Isolation Forests
• Minimum Covariance Determinant

Probability Tuning:
• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
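Probability tuning can be sketched as cost-sensitive threshold selection: given predicted probabilities, pick the cut-off that minimizes the total misclassification cost rather than using the default 0.5. The probabilities and the 10:1 cost ratio below are illustrative:

```python
def expected_cost(y_true, y_prob, threshold, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost at a given decision threshold,
    with a false negative (missed minority case) costing more
    than a false positive."""
    cost = 0.0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if truth == 1 and pred == 0:
            cost += cost_fn
        elif truth == 0 and pred == 1:
            cost += cost_fp
    return cost

def tune_threshold(y_true, y_prob, cost_fn=10.0, cost_fp=1.0):
    """Pick the threshold that minimizes expected cost instead of
    the default 0.5 cut-off."""
    candidates = sorted(set(y_prob))
    return min(candidates,
               key=lambda t: expected_cost(y_true, y_prob, t, cost_fn, cost_fp))

# Toy validation set: 8 majority (0) and 2 minority (1) examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.2, 0.3, 0.3, 0.35, 0.4, 0.45, 0.6]
best = tune_threshold(y_true, y_prob)
```

With the asymmetric costs, the tuned threshold drops below 0.5 so that both minority examples are caught; the default 0.5 cut-off would miss one of them at ten times the cost.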
3. Ensemble Approaches

Bagging:
• Bagged Decision Trees (canonical bagging)
• Random Forest
• Extra Trees

Boosting:
• AdaBoost (canonical boosting)
• Gradient Boosting Machines
• Stochastic Gradient Boosting (XGBoost and similar)

Stacking:
• Stacked Models (canonical stacking)
• Blending
• Super Ensemble
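One way ensembles help with imbalance is to give each base learner a balanced bootstrap sample (the idea behind balanced bagging). The single-feature toy data and the threshold "stump" base learner below are illustrative stand-ins:

```python
import random

def balanced_bootstrap(X, y, seed):
    """Bootstrap sample that undersamples the majority class so each
    base learner sees a balanced class distribution."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == 1]
    majority = [(x, t) for x, t in zip(X, y) if t == 0]
    n = len(minority)
    return ([rng.choice(minority) for _ in range(n)]
            + [rng.choice(majority) for _ in range(n)])

def train_stump(sample):
    """Toy base learner: threshold at the midpoint of the class means.
    Assumes the minority class has the larger feature values."""
    mean1 = sum(x for x, t in sample if t == 1) / sum(t == 1 for _, t in sample)
    mean0 = sum(x for x, t in sample if t == 0) / sum(t == 0 for _, t in sample)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x >= cut else 0

def bagged_predict(models, x):
    """Majority vote over the ensemble's base learners."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Imbalanced toy data: 6 majority examples near 1.0, 2 minority near 5.0.
X = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 5.0, 5.2]
y = [0, 0, 0, 0, 0, 0, 1, 1]
models = [train_stump(balanced_bootstrap(X, y, seed=s)) for s in range(5)]
```

Each stump is trained on a 50:50 resample, so the ensemble is not biased toward always predicting the majority class.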
4. Hybrid Approaches
• Cost-sensitive learning combined with SMOTE sampling
• PSO-based cost-sensitive neural networks
• SVM with asymmetric misclassification costs
5. Feature Selection Methods
Proposed Method
1. Input the reduct sets {R}.
2. Identify the classifier.
3. Construct a confusion matrix for each reduct.
4. Estimate the accuracy obtained for each reduct.
5. Terminate the process.
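The five steps above can be sketched as follows; the reduct sets, the toy data, and the nearest-centroid stand-in classifier are all illustrative assumptions, not the method's actual classifier:

```python
import math

def nearest_centroid_predict(train_X, train_y, x):
    """Toy stand-in classifier: predict the class whose centroid is closest."""
    centroids = {}
    for label in set(train_y):
        pts = [p for p, yl in zip(train_X, train_y) if yl == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], x))

def evaluate_reducts(X, y, reducts):
    """Steps 1-5: for each reduct (a feature subset), classify the data,
    build a confusion matrix, and record the accuracy obtained."""
    results = {}
    for name, features in reducts.items():
        Xr = [tuple(row[i] for i in features) for row in X]
        # Resubstitution predictions, for illustration only.
        preds = [nearest_centroid_predict(Xr, y, x) for x in Xr]
        cm = [[0, 0], [0, 0]]  # 2x2 confusion matrix for labels {0, 1}
        for truth, pred in zip(y, preds):
            cm[truth][pred] += 1
        accuracy = (cm[0][0] + cm[1][1]) / len(y)
        results[name] = (cm, accuracy)
    return results

# Toy data: two hypothetical reducts over a 3-feature data set.
X = [(0.0, 5.0, 1.0), (0.2, 4.8, 0.9), (1.0, 0.0, 5.0), (1.1, 0.2, 5.1)]
y = [0, 0, 1, 1]
reducts = {"R1": (0, 1), "R2": (2,)}
results = evaluate_reducts(X, y, reducts)
```

Comparing the per-reduct confusion matrices (rather than accuracy alone) is what lets the method favor reducts that classify the minority class well.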
Road Map
1. Select a Metric
2. Spot Check Algorithms
3. Spot Check Imbalanced Algorithms
4. Hyper-parameter Tuning
4. Hyper-parameter Tuning
There are three popular hyper-parameter tuning algorithms that you may choose from:
1. Random Search
2. Grid Search
3. Bayesian Optimization
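Grid search and random search can both be sketched in a few lines; the parameter space and the toy scoring function standing in for a cross-validated model score are assumptions:

```python
import itertools
import random

def grid_search(objective, grid):
    """Grid search: evaluate every combination in the parameter grid."""
    names = list(grid)
    return max(
        (dict(zip(names, combo)) for combo in itertools.product(*grid.values())),
        key=objective,
    )

def random_search(objective, space, n_iter=50, seed=0):
    """Random search: score n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=objective)

# Toy objective standing in for a cross-validated model score,
# peaking at k=5, threshold=0.3.
def score(params):
    return -(params["k"] - 5) ** 2 - (params["threshold"] - 0.3) ** 2

space = {"k": list(range(1, 11)), "threshold": [0.1, 0.2, 0.3, 0.4, 0.5]}
best_grid = grid_search(score, space)
best_rand = random_search(score, space, n_iter=100)
```

Grid search is exhaustive but its cost grows multiplicatively with each added parameter; random search trades completeness for a fixed budget, and Bayesian optimization spends that budget adaptively.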
Acknowledgment
