Over the last few years, the growing interest in machine learning has resulted in the design and development of many competitive learners. The performance of these new techniques is usually evaluated by comparing them to state-of-the-art methods over a collection of real-world problems.
In the early days, these comparisons followed no standard, and qualitative arguments were used to draw conclusions from the results. Although this type of analysis made it possible to highlight key points about the results, it also depended, to a certain extent, on the eye of the beholder. Therefore, the need arose for a sounder framework in which to analyze results. With this in mind, several researchers started outlining a methodology based on statistical tests, and in the last three years the first papers on the topic appeared. One of the first contributions is the paper “Statistical Comparisons of Classifiers over Multiple Data Sets” by Janez Demšar. Later on, several authors extended these first efforts to build a safe environment for the analysis of results.
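For instance, Demšar recommends the Wilcoxon signed-rank test for comparing two classifiers over multiple data sets. Below is a minimal sketch of what that comparison might look like in Python, assuming scipy is installed; the accuracy values are made-up placeholders, not real results.

```python
from scipy.stats import wilcoxon

# Hypothetical accuracies of classifiers A and B on the same 10 data sets
acc_a = [0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.90, 0.63, 0.79, 0.88]
acc_b = [0.78, 0.75, 0.93, 0.66, 0.82, 0.71, 0.87, 0.65, 0.76, 0.85]

# Wilcoxon signed-rank test on the paired per-data-set accuracies
stat, p_value = wilcoxon(acc_a, acc_b)
print(f"Wilcoxon statistic = {stat:.2f}, p-value = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the difference between the two
# classifiers across data sets is unlikely to be due to chance alone.
```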
Even more recently, Francisco Herrera and his research group gathered all these efforts into a tutorial, which is available here. The tutorial explains how the different tests work and outlines the different paths to take when applying a statistical analysis to your results.
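As a further illustration of the kind of test such a methodology relies on, here is a minimal sketch of the Friedman test for comparing several classifiers over the same data sets, again assuming scipy; the numbers are hypothetical.

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracies of three classifiers over the same 8 data sets
clf1 = [0.80, 0.75, 0.91, 0.66, 0.84, 0.72, 0.89, 0.62]
clf2 = [0.78, 0.74, 0.92, 0.64, 0.81, 0.70, 0.86, 0.64]
clf3 = [0.73, 0.70, 0.88, 0.60, 0.79, 0.68, 0.83, 0.58]

# Friedman test: null hypothesis is that all classifiers perform equally
stat, p_value = friedmanchisquare(clf1, clf2, clf3)
print(f"Friedman statistic = {stat:.2f}, p-value = {p_value:.4f}")
# If the null hypothesis is rejected, a post-hoc procedure (e.g. Nemenyi)
# can then identify which pairs of classifiers actually differ.
```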