This document discusses common pitfalls to avoid when comparing machine learning classifiers and recommends a sounder approach. It notes that relying on small, public benchmark datasets can produce results that are statistical accidents and leaves little headroom for genuine improvement. When multiple algorithms are compared on multiple datasets, the risk of false positives grows rapidly because of multiple testing: at a significance level of 0.05, twenty independent comparisons carry roughly a 64% chance of at least one spurious "significant" difference (1 − 0.95^20 ≈ 0.64). The document recommends splitting each dataset into training, validation, and test sets for cross-validation; comparing algorithms by their average accuracy across cross-validation folds; and applying a Bonferroni correction to account for the multiple comparisons. The goal is to conduct statistically valid experiments and to avoid unsupported claims about algorithm performance or generalizability.
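
As a concrete illustration of that workflow, here is a minimal sketch assuming scikit-learn and SciPy are available. The dataset (load_breast_cancer), the three classifiers, the paired t-test across folds, and the 0.05 base significance level are illustrative assumptions; the document itself does not prescribe a particular dataset or statistical test.

```python
# Minimal sketch: compare classifiers by mean cross-validation accuracy,
# then apply a Bonferroni correction to the pairwise significance tests.
# Dataset, models, test choice, and alpha are illustrative assumptions.
from itertools import combinations

from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in public dataset

# Fixed random_state => every model is scored on the identical folds,
# so per-fold accuracies are paired across models.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
}

# Average accuracy across cross-validation folds, per model.
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="accuracy")
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: mean accuracy = {s.mean():.3f} (std = {s.std():.3f})")

# Pairwise paired t-tests on the fold scores, with a Bonferroni
# correction: divide the base significance level by the number of
# comparisons actually performed.
pairs = list(combinations(models, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold
for a, b in pairs:
    t, p = stats.ttest_rel(scores[a], scores[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at adjusted alpha = {alpha:.4f})")
```

Scoring every model on the same fixed folds is what makes the per-fold accuracies paired, so both the fold-averaged comparison and the paired test are meaningful; the Bonferroni step then simply divides the significance level by the number of pairwise comparisons, keeping the family-wise false-positive rate at the intended level.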