Hybrid SVM-LR classifier for plant disease prediction
1. HYBRID SVM-LR CLASSIFIER FOR POWDERY MILDEW DISEASE PREDICTION IN TOMATO PLANT
PAPER ID-129
ANSHUL BHATIA
RESEARCH SCHOLAR, USIC&T
GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY
DWARKA SECTOR-16C
NEW DELHI (110078), INDIA
7th International Conference on Signal Processing and Integrated Networks (SPIN 2020)
27 - 28 February 2020, Amity University, Noida, India
2. CONTENTS
• Introduction
• Database Used
• Random Over (RO) Sampling
• Adaptive Sampling based noise reduction (ANR) method
• Support Vector Machine
• Logistic Regression
• Proposed Method
• Experimental Results
• Conclusion and Future Direction
3. INTRODUCTION
• Powdery mildew is a contagious disease caused by the fungus Leveillula taurica, which can severely reduce the quality and productivity of the tomato crop.
• Detecting and treating powdery mildew in tomatoes is therefore crucial to protecting the yield of tomato plants.
• Machine learning based classification algorithms can be used to develop forecasting models for plant disease prediction.
• This study proposes a hybrid SVM-LR classifier for the detection of powdery mildew disease.
• The hybrid SVM-LR classifier is implemented to obtain more accurate predictions of tomato powdery mildew than the previous study.
4. DATABASE USED
Tomato Powdery Mildew Disease (TPMD) Dataset
• Binary-class imbalanced dataset
• Includes statistics on the severity of powdery mildew disease based on weather conditions
• The dataset contains 244 data points described by 5 features
• Independent variables: GR (W/m²), LW (%), WS (km/h), RH (%), and T (°C)
• Dependent variable: Day Prediction (DP) (conducive or non-conducive)
5. RANDOM OVER (RO) SAMPLING
• Non-heuristic resampling technique
• Widely used for balancing imbalanced datasets
• Balances an imbalanced dataset by randomly duplicating existing samples of the minority class until the number of its data points in the train set matches that of the majority class
• The following table shows the class distribution of the TPMD dataset before and after RO sampling:

TPMD dataset
Class                    Before RO sampling   After RO sampling
Conducive                27                   217
Non-Conducive            217                  217
Total (no. of samples)   244                  434
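
The duplication step described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `random_oversample`, the fixed seed, and the placeholder rows are assumptions made for the example; only the class counts (27 and 217) come from the table.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Placeholder rows standing in for weather records (not the real TPMD data).
labels = ["conducive"] * 27 + ["non-conducive"] * 217
samples = [(i,) for i in range(len(labels))]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Because only existing rows are copied, no new weather measurements are invented; the minority class simply appears more often.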
6. ADAPTIVE SAMPLING BASED NOISE REDUCTION (ANR) METHOD
• Deals with data whose class labels are noisy
• Acts as a wrapper for various classifiers, such as LR, k-Nearest Neighbor (kNN), SVM, weighted kNN, and LDA
• Produces a noise-minimized train set by iteratively estimating the probability of class mislabeling using an adaptive sampling technique
• The improved train set reduces the risk of choosing mislabeled samples when training the model
• Hence, a more precise and better-generalized model can be obtained
7. ADAPTIVE SAMPLING BASED NOISE REDUCTION (ANR) METHOD (CONT...)
• In this study, the ANR method is used with an SVM classifier to reduce noisy (mislabeled) class labels in the train set obtained from the TPMD dataset.
• The ANR method with the SVM classifier provides the probability of the conducive and non-conducive labels based on the independent weather parameters.
• These probabilities are then used to build the noise-minimized train set.
• The following table shows a sample of the train set with the probability of each class label:

Weather parameters (independent variables)   Probabilities of class labels
T      RH   LW   WS   GR                     P       N
24.8   92   35   1    34                     0.998   0.001
21.4   82   30   3    32                     0.001   0.998
25.1   83   29   1    41                     0.987   0.012
24.3   65   16   2    40                     0.001   0.998
30.1   67   34   2    56                     0.000   0.999
In the table above:
P: Probability of the positive class, i.e. the conducive class
N: Probability of the negative class, i.e. the non-conducive class
9. ADAPTIVE SAMPLING BASED NOISE REDUCTION (ANR) METHOD (CONT...)
• Based on these probabilities, a modified train set is built using the predicted-class adjustment criteria shown in the following table:

Probabilities comparison   Adjusted class
P > N                      1 (Conducive)
N > P                      0 (Non-Conducive)

• Applying these criteria yields a noise-minimized dataset. A sample of the noise-minimized dataset is shown in the following table:

Weather parameters (independent variables)   Class (dependent variable)
T      RH   LW   WS   GR
24.8   92   35   1    34                     1
21.4   82   30   3    32                     0
25.1   83   29   1    41                     1
24.3   65   16   2    40                     0
30.1   67   34   2    56                     0
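
The adjustment rule on this slide is a simple comparison of the two class probabilities. The sketch below applies it to the five sample rows shown in the slides. The function name `adjust_class` is an assumption, and since the slides do not say what happens when P equals N, the sketch raises an error for that case rather than guessing.

```python
def adjust_class(p, n):
    """Return 1 (conducive) if P > N and 0 (non-conducive) if N > P."""
    if p > n:
        return 1
    if n > p:
        return 0
    # The stated criteria do not cover a tie, so refuse to guess.
    raise ValueError("P == N is not covered by the adjustment criteria")


# Columns: T, RH, LW, WS, GR, P, N (sample rows from the slides).
rows = [
    (24.8, 92, 35, 1, 34, 0.998, 0.001),
    (21.4, 82, 30, 3, 32, 0.001, 0.998),
    (25.1, 83, 29, 1, 41, 0.987, 0.012),
    (24.3, 65, 16, 2, 40, 0.001, 0.998),
    (30.1, 67, 34, 2, 56, 0.000, 0.999),
]

# Keep the five weather parameters and append the adjusted class label.
adjusted = [row[:5] + (adjust_class(row[5], row[6]),) for row in rows]
for row in adjusted:
    print(row)
```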
10. SUPPORT VECTOR MACHINE (SVM)
• Widely used supervised machine learning algorithm
• Each sample in the dataset is plotted as a point in an m-dimensional space, where m is the number of attributes in the dataset
• Each coordinate of a point corresponds to the value of one attribute.
• The SVM algorithm identifies the hyperplane that best separates the two labeled classes
• The hyperplane with the largest margin, i.e. the greatest distance to the nearest samples of either class, is considered the best hyperplane.
[Figure: SVM illustration with axes x and Y and elements labeled A, B, and C]
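
The "largest margin" criterion can be illustrated by comparing a few candidate hyperplanes on toy data. This is a hedged sketch only: the points, the three candidates (named A, B, and C purely for convenience), and the helper `margin` are made up for the example, and a real SVM finds the optimal hyperplane by solving an optimization problem rather than by comparing a fixed list.

```python
import math


def margin(w, b, points, labels):
    """Smallest signed distance from the hyperplane w.x + b = 0 to any sample.
    The value is negative if some sample lies on the wrong side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy 2-D samples with labels -1 / +1 (illustrative, not TPMD data).
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Hypothetical candidate hyperplanes given as (w, b).
candidates = {
    "A": ((1.0, 0.0), -3.0),  # the line x = 3
    "B": ((1.0, 1.0), -5.5),  # the line x + y = 5.5
    "C": ((0.0, 1.0), -1.5),  # the line y = 1.5
}

margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

All three candidates separate the two classes, but they leave different amounts of room around the nearest samples; the candidate with the largest margin is preferred.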
11. LOGISTIC REGRESSION (LR)
• Supervised machine learning algorithm
• Models the relationship between a categorical dependent variable and a set of predictor (independent) variables.
• The prediction obtained from an LR model provides the probabilities of the successful and unsuccessful events for a given set of independent variable values.
• If Class is the response variable and T, RH, LW, GR, and WS are the predictor variables, then the LR equation can be written as follows (Equation 1):
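$$\ln\left(\frac{p(\mathrm{Class})}{1 - p(\mathrm{Class})}\right) = \beta_0 + \beta_1 T + \beta_2 \mathit{RH} + \beta_3 \mathit{LW} + \beta_4 \mathit{GR} + \beta_5 \mathit{WS} \tag{1}$$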
12. LOGISTIC REGRESSION (LR) (CONT...)
• In Equation 1:
p(Class) / (1 − p(Class)): ratio of the probability of success to the probability of failure (the odds)
β0 to β5: regression coefficients
Class: response variable (0 or 1)
• The regression coefficients can be estimated using a popular approach known as Maximum Likelihood Estimation. Solving Equation 1 for p(Class) gives:
$$p(\mathrm{Class}) = \frac{e^{\beta_0 + \beta_1 T + \beta_2 \mathit{RH} + \beta_3 \mathit{LW} + \beta_4 \mathit{GR} + \beta_5 \mathit{WS}}}{1 + e^{\beta_0 + \beta_1 T + \beta_2 \mathit{RH} + \beta_3 \mathit{LW} + \beta_4 \mathit{GR} + \beta_5 \mathit{WS}}} \tag{2}$$
• Equation 2 gives probabilities in the range 0 to 1.
• If p is greater than 0.5, the response variable Class is 1; otherwise it is 0.
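
Equations 1 and 2 and the 0.5 threshold can be written out directly. In the sketch below the coefficient values in `beta` are invented purely for illustration (the slides do not report fitted coefficients), and the function names are assumptions; the weather values are one of the sample rows from the slides.

```python
import math


def log_odds(beta, t, rh, lw, gr, ws):
    """Right-hand side of Equation 1."""
    b0, b1, b2, b3, b4, b5 = beta
    return b0 + b1 * t + b2 * rh + b3 * lw + b4 * gr + b5 * ws


def p_class(beta, t, rh, lw, gr, ws):
    """Equation 2, evaluated in a form that avoids overflow for large |z|."""
    z = log_odds(beta, t, rh, lw, gr, ws)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(beta, t, rh, lw, gr, ws):
    """Class is 1 if p > 0.5, otherwise 0."""
    return 1 if p_class(beta, t, rh, lw, gr, ws) > 0.5 else 0


# Hypothetical coefficients beta0..beta5 (NOT fitted to the TPMD dataset).
beta = (-4.0, 0.05, 0.03, 0.02, -0.01, -0.1)

# One sample row from the slides: T=24.8, RH=92, LW=35, GR=34, WS=1.
z = log_odds(beta, 24.8, 92, 35, 34, 1)
p = p_class(beta, 24.8, 92, 35, 34, 1)
label = predict(beta, 24.8, 92, 35, 34, 1)
print(z, p, label)
```

Taking the logarithm of p / (1 − p) recovers z, which is a quick way to check that Equation 2 is indeed the inverse of Equation 1.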
13. PROPOSED METHOD
[Figure: flowchart of the proposed method. Box labels: TPMD data set (Imbalanced); Application of RO sampling; TPMD dataset (Balanced); Train set (70%); Test set (30%); Data Cleaning; Noise Reduction; ANR method; SVM classifier; Modified train set; LR classifier; 10-fold cross validation; Prediction Model; Performance Evaluation (Accuracy, AUC, and F1-score)]
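
The flowchart names a 70%/30% train/test split and 10-fold cross validation. The sketch below shows only these two index-splitting steps, under stated assumptions: it works on row indices of a 434-row balanced dataset, the helper names, the seed, and the rounding of the test-set size are choices made for the example, and it does not reproduce the authors' exact procedure or the order of steps in the flowchart.

```python
import random


def train_test_split(indices, test_fraction=0.3, seed=0):
    """Shuffle indices and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists; every index is used for
    validation exactly once across the k folds."""
    indices = list(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


# 434 rows, as in the balanced TPMD dataset after RO sampling.
train_idx, test_idx = train_test_split(range(434))
folds = list(k_folds(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

In a full pipeline, a classifier would be fitted on each fold's training indices and scored on its validation indices, with the held-out test set used only for the final evaluation.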
14. EXPERIMENTAL RESULTS
             Performance metrics
Classifier   Accuracy   AUC      F1-score
LR           87.02%     0.8777   0.8722
SVM          89.31%     0.8988   0.8923
SVM-LR       92.37%     0.9270   0.9264
[Figure: chart "Performance of SVM, LR and SVM-LR classifier"; x-axis: Classifiers (LR, SVM, SVM-LR); y-axis: Performance Metrics (0.84 to 0.94); series: Accuracy, AUC, F1-score]
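
For readers unfamiliar with the three reported metrics, the sketch below computes them on a tiny made-up example. It is an illustration under assumptions: the labels and scores are invented, F1 is computed for the positive class only (the slides do not state which F1 variant was used), and AUC is computed as the fraction of positive/negative pairs that are ranked correctly.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Probability that a random positive sample scores higher than a
    random negative sample (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities (not results from the study).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))
```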
15. CONCLUSION AND FUTURE DIRECTION
• This study presents a hybrid SVM-LR approach for improved prediction of powdery mildew disease in tomato plants.
• The proposed approach was implemented on the TPMD dataset and outperformed the standalone SVM and LR classifiers in predicting powdery mildew disease in terms of accuracy, AUC, and F1-score.
• The current work did not use any feature selection algorithm to identify the most important features for detecting powdery mildew disease in tomato plants.
• This work can therefore be extended with feature selection techniques to further improve the performance of the prediction models.
• Various meta-heuristic and optimization algorithms can also be explored for better results.
17. THANK YOU