1. Guide for reproducing the results of the Bioassay paper using Weka
2. Important points to remember before starting a run:
   All datasets must be in ARFF format; otherwise Weka will report an incompatible format during training and testing.
   Standard classifiers are used for confirmatory screen data because it is smaller and less imbalanced, whereas cost-sensitive classifiers are used with the primary and mixed datasets because they are more imbalanced.
   We have two goals: (1) to find the most robust and versatile classifier for imbalanced bioassay data, and (2) to find the optimal misclassification cost setting for a classifier.
   The misclassification cost for False Negatives has to be set so as to achieve the maximum number of True Positives with a False Positive rate of less than 20%.
   The datasets are randomly split into an 80% training and validation set and a 20% independent test set, so there should be two files for each dataset: one for training the classifier and one for testing the model built by that classifier.
   Use 5-fold cross-validation for the larger datasets (primary and mixed screens) and 10-fold cross-validation for the smaller datasets (confirmatory screens).
   CostSensitiveClassifier is used for the base classifiers Naïve Bayes, SMO (Sequential Minimal Optimization) and Random Forest, as it outperforms the other meta-learners.
   MetaCost with J48 produces better results than the other meta-learners.
   For Naïve Bayes and Random Forest, the default options are used. For SMO, the option BuildLogisticModels was set to true. For J48, the option Unpruned was set to true.
   For more details, please refer to the paper.
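The 80/20 split mentioned above can be illustrated with a short Python sketch. This is only a minimal illustration, not the procedure used in the paper: the function name `split_80_20`, the fixed seed and the list of 100 placeholder instances are all assumptions made for the example, and the sketch does not read or write ARFF files.

```python
import random


def split_80_20(instances, seed=42):
    """Shuffle a copy of `instances` and return (train, test) in an 80/20 ratio.

    The seed is fixed only so that the example is repeatable (assumption).
    """
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))  # first 80% goes to training/validation
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 100 placeholder instance identifiers.
rows = list(range(100))
train, test = split_80_20(rows)
print(len(train), len(test))
```

Each part would then be saved as its own ARFF file, one for training and one for testing, before starting the Weka run.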
3. Step-by-step guide to setting up a Weka run:
   1. Start the Weka Explorer.
   2. In the Preprocess tab, click Open file…
   3. Open a training file in ARFF format. Click Open.
   4. For example, AID1608red_train.arff.
   5. After opening, the file should look like this:
4. 6. Now click on the Classify tab in the menu bar.
   7. We will first train a model using the Naïve Bayes classifier. Because we are using the confirmatory screen AID1608, we first apply a standard classifier; if the False Positive rate is below 20%, a cost-sensitive classifier is then used.
   8. Click the Choose button to select a classifier. From the bayes folder, choose Naïve Bayes.
   9. Your window should appear as below, with Cross-validation selected with 10 folds:
5. 10. Now click the Start button; the model will start building.
   11. Because we have used 10-fold cross-validation, Weka will build models for 10 folds.
   Screenshot callouts: "Check status here"; "Run completed".
6. 12. Look at the output section and scroll to the bottom, as shown:
   13. This is the model generated by the Naïve Bayes classifier using the training set AID1608red_train.
   14. The next step is to test this model on the independent test set AID1608red_test.
   15. In the Test options section, select Supplied test set and click Set.
   16. Open the test file AID1608red_test.
7. 17. After the file has been read, close the Test Instances dialog by clicking Close.
   18. Now right-click your model in the Result list and choose Re-evaluate model on current test set.
   Screenshot callout: "Click here".
8. 19. Within a fraction of a second, the results are produced in the same output window.
   Screenshot callouts: False positive, True positive, False negative, True negative.
   20. We have obtained a False Positive rate of 14.5%, which is less than 20%, and a True Positive rate of 15.4%, which is very low. We will now set up a cost-sensitive classifier to improve the results.
   21. As mentioned on page 2 of this tutorial, for Naïve Bayes we will use Weka's CostSensitiveClassifier.
   22. The author used incremental costing, where the cost was increased in stages from 2 to 1000000 until a 20% False Positive rate was reached.
   23. So we will set up a cost matrix, starting with a misclassification cost of 2.
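To make the rates in step 20 concrete, the following Python sketch shows how a True Positive rate and a False Positive rate are derived from the four cells of a confusion matrix. The counts used here are hypothetical and are not taken from the AID1608 run; the helper names `rates` and `below_fp_limit` are also assumptions introduced for this example.

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of actual actives predicted as active
    fpr = fp / (fp + tn)  # share of actual inactives predicted as active
    return tpr, fpr


def below_fp_limit(fpr, limit=0.20):
    """True if the False Positive rate is under the 20% limit used in this tutorial."""
    return fpr < limit


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
tpr, fpr = rates(tp=30, fn=70, fp=150, tn=850)
print(f"TP rate = {tpr:.1%}, FP rate = {fpr:.1%}, within limit: {below_fp_limit(fpr)}")
```

A low True Positive rate combined with a False Positive rate under the limit is the situation in which raising the False Negative cost is worth trying.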
9. 24. Click the Choose button and select CostSensitiveClassifier from the meta folder.
   25. Click on the text box to open the GenericObjectEditor dialog box, as shown:
   Screenshot callout: "Click here and this dialog box will open up".
10. 26. In this dialog box, select Naïve Bayes under Choose classifier.
   27. Next, click on costMatrix to set up the misclassification cost.
   28. We have 2 classes in our dataset, i.e. actives and inactives, so we will set up a 2x2 matrix (for TP, FP, TN, FN). In Classes, enter 2. Click Resize to create a 2x2 matrix. Change the misclassification cost for false negatives to 2 (write 2 in place of 1). Then close the dialog box.
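The effect of raising the False Negative cost can be illustrated outside Weka with a small Python sketch. It shows the minimum expected cost rule, which is only one of the two ways a cost matrix can be applied (the other being reweighting of training instances), so it should be read as an illustration of the idea rather than as what Weka does with the default options. The class order (index 0 = active, index 1 = inactive), the probabilities and the function name are assumptions made for the example.

```python
def min_expected_cost_class(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs[i]   : estimated probability that the actual class is i
    cost[i][j] : cost of predicting class j when the actual class is i
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    best = min(range(n), key=lambda j: expected[j])
    return best, expected


# Assumed class order: 0 = active, 1 = inactive (hypothetical).
default_cost = [[0, 1],
                [1, 0]]
fn_cost_2 = [[0, 2],   # actual active predicted inactive (false negative) now costs 2
             [1, 0]]

probs = [0.4, 0.6]  # hypothetical class probabilities for one compound

print(min_expected_cost_class(probs, default_cost)[0])  # predicted class index with default costs
print(min_expected_cost_class(probs, fn_cost_2)[0])     # predicted class index with FN cost 2
```

With the default costs this compound is predicted inactive; with a False Negative cost of 2 the same probabilities lead to an active prediction, which is how a higher cost can recover more True Positives at the price of more False Positives.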
11. 29. Leave all other options at their defaults and close the GenericObjectEditor dialog by clicking OK.
   30. Click Start to begin building the cost-sensitive model.
   31. Repeat steps 13-19 as described above for testing.
   32. See the improved results: the number of True Positives has increased while the False Positive rate stays within the 20% limit.
   33. We stop here as we have achieved our goal.
   34. Similarly, you can build models using SMO, Random Forest and J48. Check their settings, as mentioned on page 2 of this tutorial, before starting the run.
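The incremental costing described in step 22 could be organised as a simple loop, sketched here in Python. This is a hedged illustration and not the author's script: `search_cost` is a hypothetical helper, the `evaluate` callback stands in for one manual Weka run (train with the given False Negative cost, then re-evaluate on the test set), and the cost schedule and the rates in `toy_results` are invented purely to show the control flow.

```python
def search_cost(costs, evaluate, fp_limit=0.20):
    """Try increasing False Negative costs; keep the last one under the FP limit.

    `evaluate(cost)` must return (true positive rate, false positive rate).
    Returns (cost, tpr, fpr) for the last acceptable cost, or None.
    """
    best = None
    for cost in costs:
        tpr, fpr = evaluate(cost)
        if fpr >= fp_limit:  # limit reached: stop raising the cost
            break
        best = (cost, tpr, fpr)
    return best


# Hypothetical results standing in for real Weka runs.
toy_results = {2: (0.30, 0.15), 5: (0.45, 0.18), 10: (0.60, 0.24)}

result = search_cost([2, 5, 10], lambda cost: toy_results[cost])
print(result)
```

In practice each call to `evaluate` corresponds to repeating steps 27-31 with a new value in the cost matrix and reading the rates from the output window.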