Feature Selection with Imbalanced Data in Agriculture
Mohamed Adel Omar, Ph.D. Student, Agriculture Research Center
Student Member of the Scientific Research Group in Egypt (SRGE)
Advanced Intelligent Systems for Sustainable Development (AISSD 2021) 20-22 August 2021
Agenda
1. Problem Definition
2. Measuring
3. Approaches
4. Proposed Solution
5. Road Map
Problem definition
• What is the class imbalance problem?
• It is the problem that arises when the number of examples belonging to one class is significantly greater than that of the other classes.
• For example:
– In a financial fraud data set, the vast majority of transactions belong to the non-fraud class, while fraudulent transactions are rare.
– In cancer data, the number of patients who have cancer is much smaller than the number who don't.
• The ratio of the minority class to the majority class can be 1:100, 1:1000, 1:10000, or even more extreme.
• Many other domains have imbalanced data sets:
– Customer churn
– Credit approval
– Network intrusion detection
– Protein detection
– Oil spill detection, etc.
• Standard algorithms perform poorly on imbalanced data.
• They minimize the global error rate without taking the data distribution into consideration.
• This causes a performance bias: poor accuracy on the minority class and high accuracy on the majority class.
• Yet correctly classifying minority-class examples is often more important than correctly classifying majority-class examples.
• The costs of misclassification differ:
– Misclassifying fraud costs more than misclassifying non-fraud.
– Misclassifying a buyer costs more than misclassifying a non-buyer.
Therefore, rather than general-purpose algorithms, we need more sophisticated approaches to handle the class imbalance problem.
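The performance bias above can be seen in a few lines of Python; the 99:1 split and the always-majority "classifier" are toy stand-ins:

```python
# A "classifier" that always predicts the majority class (0).
def majority_classifier(_x):
    return 0

# Toy data set: 990 majority (class 0) and 10 minority (class 1) examples.
y_true = [0] * 990 + [1] * 10
y_pred = [majority_classifier(x) for x in y_true]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(
    t == p == 1 for t, p in zip(y_true, y_pred)
) / sum(t == 1 for t in y_true)
# accuracy is 0.99 even though not a single minority example is detected
```

This is why plain accuracy is a misleading metric here, which motivates the measures in the next section.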
Metrics that can provide better insight
• Confusion Matrix: a table showing correct predictions and types of incorrect
predictions.
• Precision: the number of true positives divided by all positive predictions.
Precision is also called Positive Predictive Value. It is a measure of a classifier’s
exactness. Low precision indicates a high number of false positives.
• Recall: the number of true positives divided by the number of positive values
in the test data. The recall is also called Sensitivity or the True Positive Rate. It
is a measure of a classifier’s completeness. Low recall indicates a high number
of false negatives.
• F1 Score: the harmonic mean of precision and recall.
• Area Under the ROC Curve (AUROC): the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one.
Confusion Matrix

                 Predicted: Yes    Predicted: No
Actual: Yes      TP                FN
Actual: No       FP                TN
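As a minimal sketch, the metrics above can be computed directly from the confusion-matrix cells; the counts below are made-up:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix cells."""
    precision = tp / (tp + fp)          # exactness: TP over all positive predictions
    recall = tp / (tp + fn)             # completeness: TP over all actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts: 8 minority hits, 2 false alarms, 4 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

Note how a low recall (many false negatives) drags F1 down even when precision is high.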
Approaches
Solutions to the data imbalance problem can be classified into five groups:
1. Data level
2. Algorithmic level
3. Ensemble level
4. Hybrid level
5. Feature selection level
Data level
• Data-level methods modify the class distribution of the data.
• The approaches include under-sampling and over-sampling.
• The Synthetic Minority Over-sampling Technique (SMOTE) is the state-of-the-art method. SMOTE generates synthetic examples in feature space: for each selected minority example, it finds the example's k nearest minority-class neighbors, randomly chooses one of them, and interpolates a new synthetic example between the two.
(Figure: class distributions of the original data, after under-sampling, and after over-sampling.)
Pros: can be applied with any learning algorithm, without modifying the algorithm itself.
Cons:
• Over-sampling can cause the model to over-fit.
• Over-sampling increases computational cost.
• Under-sampling can discard important information.
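A minimal SMOTE sketch in pure Python, assuming 2-D feature vectors and a toy minority set (in practice one would use a library such as imbalanced-learn on the full training data):

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority example
    and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x among the other minority examples
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Toy minority class; each synthetic point lies on a segment between
# a minority example and one of its neighbors.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=4)
```

Because every synthetic point is an interpolation, it stays inside the convex region spanned by the minority class, which is exactly what makes SMOTE gentler than naive random oversampling.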
1. Data Level Approaches

Oversampling:
• Random Oversampling
• SMOTE
• Borderline SMOTE
• SVM SMOTE
• k-Means SMOTE
• ADASYN

Undersampling:
• Random Undersampling
• Condensed Nearest Neighbor
• Tomek Links
• Edited Nearest Neighbors
• Neighborhood Cleaning Rule
• One Sided Selection

Hybrid:
• SMOTE and Random Under-sampling
• SMOTE and Tomek Links
• SMOTE and Edited Nearest Neighbors
2. Algorithmic Level Approaches

Cost-Sensitive:
• Logistic Regression
• Decision Trees
• Support Vector Machines
• Artificial Neural Networks
• Bagged Decision Trees
• Random Forest
• Stochastic Gradient Boosting

One-Class:
• One-Class Support Vector Machines
• Isolation Forests
• Minimum Covariance Determinant

Probability Tuning:
• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
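Probability tuning can be sketched as cost-sensitive threshold selection: given predicted probabilities, pick the cut-off that minimizes the total misclassification cost rather than using the default 0.5. The probabilities and the 10:1 cost ratio below are illustrative:

```python
def expected_cost(y_true, y_prob, threshold, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost at a given decision threshold,
    with a false negative (missed minority case) costing more
    than a false positive."""
    cost = 0.0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if truth == 1 and pred == 0:
            cost += cost_fn
        elif truth == 0 and pred == 1:
            cost += cost_fp
    return cost

def tune_threshold(y_true, y_prob, cost_fn=10.0, cost_fp=1.0):
    """Pick the threshold that minimizes expected cost instead of
    the default 0.5 cut-off."""
    candidates = sorted(set(y_prob))
    return min(candidates,
               key=lambda t: expected_cost(y_true, y_prob, t, cost_fn, cost_fp))

# Toy validation set: 8 majority (0) and 2 minority (1) examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.2, 0.3, 0.3, 0.35, 0.4, 0.45, 0.6]
best = tune_threshold(y_true, y_prob)
```

With the asymmetric costs, the tuned threshold drops below 0.5 so that both minority examples are caught; the default 0.5 cut-off would miss one of them at ten times the cost.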
3. Ensemble Approaches

Bagging:
• Bagged Decision Trees (canonical bagging)
• Random Forest
• Extra Trees

Boosting:
• AdaBoost (canonical boosting)
• Gradient Boosting Machines
• Stochastic Gradient Boosting (XGBoost and similar)

Stacking:
• Stacked Models (canonical stacking)
• Blending
• Super Ensemble
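One way ensembles help with imbalance is to give each base learner a balanced bootstrap sample (the idea behind balanced bagging). The single-feature toy data and the threshold "stump" base learner below are illustrative stand-ins:

```python
import random

def balanced_bootstrap(X, y, seed):
    """Bootstrap sample that undersamples the majority class so each
    base learner sees a balanced class distribution."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == 1]
    majority = [(x, t) for x, t in zip(X, y) if t == 0]
    n = len(minority)
    return ([rng.choice(minority) for _ in range(n)]
            + [rng.choice(majority) for _ in range(n)])

def train_stump(sample):
    """Toy base learner: threshold at the midpoint of the class means.
    Assumes the minority class has the larger feature values."""
    mean1 = sum(x for x, t in sample if t == 1) / sum(t == 1 for _, t in sample)
    mean0 = sum(x for x, t in sample if t == 0) / sum(t == 0 for _, t in sample)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x >= cut else 0

def bagged_predict(models, x):
    """Majority vote over the ensemble's base learners."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Imbalanced toy data: 6 majority examples near 1.0, 2 minority near 5.0.
X = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 5.0, 5.2]
y = [0, 0, 0, 0, 0, 0, 1, 1]
models = [train_stump(balanced_bootstrap(X, y, seed=s)) for s in range(5)]
```

Each stump is trained on a 50:50 resample, so the ensemble is not biased toward always predicting the majority class.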
4. Hybrid Approaches
• Cost-sensitive learning combined with SMOTE sampling
• PSO-based cost-sensitive neural networks
• SVM with asymmetric misclassification costs
5. Feature Selection Methods
Proposed Method
1. Input the reduct sets {R}.
2. Identify the classifier.
3. Construct a confusion matrix for each reduct.
4. Estimate the accuracy obtained for each reduct.
5. Terminate the process.
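The five steps above can be sketched as follows; the reduct sets, the toy data, and the nearest-centroid stand-in classifier are all illustrative assumptions, not the method's actual classifier:

```python
import math

def nearest_centroid_predict(train_X, train_y, x):
    """Toy stand-in classifier: predict the class whose centroid is closest."""
    centroids = {}
    for label in set(train_y):
        pts = [p for p, yl in zip(train_X, train_y) if yl == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], x))

def evaluate_reducts(X, y, reducts):
    """Steps 1-5: for each reduct (a feature subset), classify the data,
    build a confusion matrix, and record the accuracy obtained."""
    results = {}
    for name, features in reducts.items():
        Xr = [tuple(row[i] for i in features) for row in X]
        # Resubstitution predictions, for illustration only.
        preds = [nearest_centroid_predict(Xr, y, x) for x in Xr]
        cm = [[0, 0], [0, 0]]  # 2x2 confusion matrix for labels {0, 1}
        for truth, pred in zip(y, preds):
            cm[truth][pred] += 1
        accuracy = (cm[0][0] + cm[1][1]) / len(y)
        results[name] = (cm, accuracy)
    return results

# Toy data: two hypothetical reducts over a 3-feature data set.
X = [(0.0, 5.0, 1.0), (0.2, 4.8, 0.9), (1.0, 0.0, 5.0), (1.1, 0.2, 5.1)]
y = [0, 0, 1, 1]
reducts = {"R1": (0, 1), "R2": (2,)}
results = evaluate_reducts(X, y, reducts)
```

Comparing the per-reduct confusion matrices (rather than accuracy alone) is what lets the method favor reducts that classify the minority class well.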
Road Map
1. Select a Metric
2. Spot Check Algorithms
3. Spot Check Imbalanced Algorithms
4. Hyper-parameter Tuning
4. Hyper-parameter Tuning
There are three popular hyper-parameter tuning algorithms that you may choose from:
1. Random Search
2. Grid Search
3. Bayesian Optimization
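Grid search and random search can both be sketched in a few lines; the parameter space and the toy scoring function standing in for a cross-validated model score are assumptions:

```python
import itertools
import random

def grid_search(objective, grid):
    """Grid search: evaluate every combination in the parameter grid."""
    names = list(grid)
    return max(
        (dict(zip(names, combo)) for combo in itertools.product(*grid.values())),
        key=objective,
    )

def random_search(objective, space, n_iter=50, seed=0):
    """Random search: score n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=objective)

# Toy objective standing in for a cross-validated model score,
# peaking at k=5, threshold=0.3.
def score(params):
    return -(params["k"] - 5) ** 2 - (params["threshold"] - 0.3) ** 2

space = {"k": list(range(1, 11)), "threshold": [0.1, 0.2, 0.3, 0.4, 0.5]}
best_grid = grid_search(score, space)
best_rand = random_search(score, space, n_iter=100)
```

Grid search is exhaustive but its cost grows multiplicatively with each added parameter; random search trades completeness for a fixed budget, and Bayesian optimization spends that budget adaptively.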
Acknowledgment
