Presentation at Advanced Intelligent Systems for Sustainable Development (AISSD 2021) 20-22 August 2021 organized by the scientific research group in Egypt with Collaboration with Faculty of Computers and AI, Cairo University and the Chinese University in Egypt
1. Feature Selection with Imbalanced Data in Agriculture
Mohamed Adel Omar, Ph.D. student, Agriculture Research Center
Student Member of Scientific Research Group in Egypt (SRGE)
2. Agenda
1 • Problem Definition
2 • Measuring
3 • Approaches
4 • Proposed Solution
5 • Road Map
3. Problem definition
• What is the class imbalance problem?
• It is the problem that arises when the number of examples belonging to one class is significantly greater than the number belonging to the others.
• For example:
– In a financial fraud data set, the majority of transactions belong to the non-fraud class and only a small fraction are fraudulent.
– In cancer data, the number of patients who have cancer is much smaller than the number who don't.
• The ratio of minority to majority classes can be 1:100, 1:1,000, 1:10,000, or even more extreme.
• There are many other domains that
have imbalanced data sets:
– Customer churn
– Credit approval
– Network intrusion detection
– Protein detection
– Oil spill detection, etc.
4. • Standard algorithms perform poorly on imbalanced data.
• They minimize the global error rate without taking the data distribution into consideration.
• This causes a performance bias: poor accuracy on the minority class and high accuracy on the majority class.
• Yet correctly classifying minority-class examples is often more important than correctly classifying majority-class examples.
• The costs of misclassification are different:
– E.g. misclassifying fraud costs more than misclassifying non-fraud.
– Misclassifying a buyer costs more than misclassifying a non-buyer.
Therefore, rather than general-purpose algorithms, we need more sophisticated approaches to handle the class imbalance problem.
Problem definition
6. Metrics that can provide better insight
• Confusion Matrix: a table showing correct predictions and types of incorrect
predictions.
• Precision: the number of true positives divided by all positive predictions.
Precision is also called Positive Predictive Value. It is a measure of a classifier’s
exactness. Low precision indicates a high number of false positives.
• Recall: the number of true positives divided by the number of positive values
in the test data. The recall is also called Sensitivity or the True Positive Rate. It
is a measure of a classifier’s completeness. Low recall indicates a high number
of false negatives.
• F1 Score: the harmonic mean of precision and recall.
• Area Under ROC Curve (AUROC): AUROC represents the likelihood of your
model distinguishing observations from two classes.
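The metrics above can be computed directly from raw predictions. A minimal sketch (the labels below are made-up toy data, not from the deck), treating class 1 as the positive/minority class:

```python
def confusion_counts(y_true, y_pred):
    # Counts of true positives, false positives, false negatives, true negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # exactness
    recall = tp / (tp + fn) if tp + fn else 0.0     # completeness
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean
    return precision, recall, f1

# Imbalanced toy data: 8 negatives, 2 positives; one FP, one FN, one TP.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)  # 0.5 0.5 0.5
```

Note that plain accuracy on this toy set is 80% even though only half the positives were found, which is exactly why precision and recall give better insight here.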
8. Approaches
The solutions addressing the data imbalance problem can be classified into five groups:
1. Data level
2. Algorithmic level
3. Ensemble level
4. Hybrid Level
5. Feature Selection Level
9. Data level
• Data level: modify class distribution in data.
• The approaches include under-sampling and over-sampling.
• Synthetic Minority Over-sampling Technique (SMOTE) is the state-of-the-art method. SMOTE generates synthetic examples in feature space: for each minority example it finds the k nearest minority neighbors, randomly chooses one of them, and creates a new synthetic example by interpolating between the two.
(Figure: class distribution of the original data, after under-sampling, and after over-sampling.)
Pros: can be applied to any learning algorithm without modifying the algorithm.
Cons:
Over-sampling can cause the model to over-fit.
Over-sampling increases computational cost.
Under-sampling can discard important information.
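The interpolation step of SMOTE can be sketched in a few lines. This is a minimal illustration of the idea, not a reference implementation (real use would go through a library such as imbalanced-learn), and the minority points below are invented:

```python
import math
import random

def smote(minority, k=2, n_new=4, seed=0):
    # minority: list of minority-class feature vectors.
    # For each synthetic point: pick a minority example, find its k nearest
    # minority neighbors, pick one at random, and interpolate between the two.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (v for v in minority if v is not base),
            key=lambda v: math.dist(base, v),
        )[:k]
        nb = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, nb)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
new_points = smote(minority, k=2, n_new=4)
```

Because each synthetic point lies on a segment between two real minority points, over-sampling this way stays inside the minority region rather than duplicating examples verbatim.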
10. 1. Data Level Approaches
Oversampling:
• Random Oversampling
• SMOTE
• Borderline SMOTE
• SVM SMOTE
• k-Means SMOTE
• ADASYN
Undersampling:
• Random Undersampling
• Condensed Nearest Neighbor
• Tomek Links
• Edited Nearest Neighbors
• Neighborhood Cleaning Rule
• One Sided Selection
Hybrid:
• SMOTE and Random Under-sampling
• SMOTE and Tomek Links
• SMOTE and Edited Nearest Neighbors
11. 2. Algorithmic Level Approaches
Cost-Sensitive:
• Logistic Regression
• Decision Trees
• Support Vector Machines
• Artificial Neural Networks
• Bagged Decision Trees
• Random Forest
• Stochastic Gradient Boosting
One-Class:
• One-Class Support Vector Machines
• Isolation Forests
• Minimum Covariance Determinant
Probability Tuning:
• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
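A common way to make any of the cost-sensitive algorithms above penalize minority errors more is the "balanced" class-weight heuristic, weight_c = n_samples / (n_classes * count_c). A minimal sketch (the 1:9 toy labels are made up):

```python
from collections import Counter

def balanced_class_weights(y):
    # "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
    # so misclassifying a minority example costs proportionally more.
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 1:9 imbalance: the minority class gets a 9x larger weight.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # {0: 0.5555..., 1: 5.0}
```

These weights are then used to scale each example's contribution to the loss (many libraries expose this directly as a class-weight parameter), which shifts the decision boundary toward the majority class.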
12. 3. Ensemble Approaches
Bagging:
• Bagged Decision Trees (canonical bagging)
• Random Forest
• Extra Trees
Boosting:
• AdaBoost (canonical boosting)
• Gradient Boosting Machines
• Stochastic Gradient Boosting (XGBoost and similar)
Stacking:
• Stacked Models (canonical stacking)
• Blending
• Super Ensemble
13. 4. Hybrid Approaches
• Cost-sensitive learning and sampling using SMOTE algorithm
• PSO-based cost sensitive neural network
• SVM with Asymmetrical Misclassifications Cost
20. 4. Hyper-parameter Tuning
There are three popular hyper-parameter tuning algorithms that you may choose from:
1. Random Search
2. Grid Search
3. Bayesian Optimization
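The difference between the first two tuning strategies can be shown on a toy objective: grid search evaluates a fixed list of candidate values, while random search draws candidates from a range. The objective below is a made-up stand-in for a validation score that peaks at learning rate 0.1:

```python
import random

def objective(lr):
    # Hypothetical validation score, peaking at lr = 0.1 (a toy stand-in for
    # training a model and measuring its accuracy).
    return -(lr - 0.1) ** 2

def grid_search(grid):
    # Evaluate every candidate in a fixed grid; return the best one.
    return max(grid, key=objective)

def random_search(low, high, n_trials=50, seed=0):
    # Draw candidates uniformly from the range; return the best one.
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(n_trials)]
    return max(candidates, key=objective)

best_grid = grid_search([0.001, 0.01, 0.1, 1.0])
best_rand = random_search(0.001, 1.0)
print(best_grid)  # 0.1
```

Bayesian optimization replaces the blind sampling with a surrogate model that proposes the next candidate based on results so far, which usually needs fewer trials; it is omitted here for brevity.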