What size test set gives good error rate estimates?

I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 20(1), pages 52--64, IEEE.
1998



We address the problem of determining what size test set guarantees statistically significant results in a character recognition task, as a function of the expected error rate. We provide a statistical analysis showing that if, for example, the expected character error rate is around 1%, then with a test set of at least 10,000 statistically independent handwritten characters (which could be obtained by taking 100 characters from each of 100 different writers), we guarantee, with 95% confidence, that: (1) the expected value of the character error rate is not worse than 1.25 E, where E is the empirical character error rate of the best recognizer, calculated on the test set; and (2) a difference of 0.3 E between the error rates of two recognizers is significant. We developed this framework with character recognition applications in mind, but it applies as well to speech recognition and to other pattern recognition problems.
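As a rough illustration of the kind of calculation involved (not the paper's own guaranteed-estimator derivation, which is more careful and typically more conservative), the sketch below sizes a test set under the i.i.d. assumption using a normal approximation to the binomial. The function names test_set_size and normal_quantile, and the example values, are ours and purely illustrative.

    import math

    def normal_quantile(q):
        """Standard normal quantile by bisection on the CDF (math.erf has no
        inverse in the standard library, so we search for it)."""
        lo, hi = -10.0, 10.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < q:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    def test_set_size(p, rel_precision, confidence):
        """Rough i.i.d. sizing: smallest N such that the one-sided
        normal-approximation margin z * sqrt(p * (1 - p) / N) is at most
        rel_precision * p, i.e. the empirical error rate stays within the
        requested relative precision of the true rate p with roughly the
        requested confidence."""
        z = normal_quantile(confidence)
        n = (z ** 2) * (1.0 - p) / ((rel_precision ** 2) * p)
        return math.ceil(n)

    if __name__ == "__main__":
        # Example: ~1% expected error rate, 25% relative precision, 95% confidence.
        print(test_set_size(p=0.01, rel_precision=0.25, confidence=0.95))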

When organizing benchmarks in pattern recognition, the problem often arises of determining what size test set will give statistically significant results. This is a chicken-and-egg problem, since the statistical significance cannot be determined before the recognizer performance is known. Nevertheless, since approximate error rates of particular recognizers on similar tasks are known, it is possible to estimate what a reasonable test set size should be. In this paper, we use fairly straightforward statistical arguments to address that problem. The method was designed to help prepare the data for the first UNIPEN benchmark, but the results are fairly general and we expect them to have broader applicability.

We tackle the problem from the point of view of the benchmark organizer. Thus, our approach differs from the classical ``hypothesis testing'' framework in that we do not test the statistical significance of the result of an actual experiment. Rather, we seek bounds on the minimum number of test examples that guarantee that our future benchmark provides: 1) a good estimate of the state-of-the-art error rate on the target task, and 2) good confidence that one system is better than another, for a relatively small difference in their error rates.
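The second goal can be pictured with a simple, merely indicative two-proportion check; it treats the two recognizers' errors as independent, which is conservative when both are scored on the same test set, and it is not the paired analysis developed in the paper. The function difference_is_significant is our own illustrative name.

    import math

    def difference_is_significant(e1, e2, n, z=1.645):
        """Crude check of whether the gap between two empirical error rates
        e1 and e2, each measured on n i.i.d. test examples, exceeds what
        chance alone would explain (z = 1.645 is the one-sided 95% normal
        quantile). Assumes the two recognizers' errors are independent."""
        pooled = 0.5 * (e1 + e2)
        se = math.sqrt(2.0 * pooled * (1.0 - pooled) / n)  # std. error of e1 - e2
        return abs(e1 - e2) > z * se

    if __name__ == "__main__":
        # Example: 1.0% vs. 1.3% error on 10,000 test characters.
        print(difference_is_significant(0.010, 0.013, 10_000))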

In this paper, we will address the following issues:

The principle of our method and the notion of guaranteed estimators.

Test set sizes, assuming that the errors are independent and identically distributed.

The problem of ``correlations'' between errors due, for instance, to many consecutive examples being provided by the same writer. We generalize the results to the case of multiple factors of correlation, including recording conditions and linguistic material (see the sketch after this list).

The problem of the statistical significance of the difference in performance between two recognizers.
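For the correlation issue above, a common back-of-the-envelope device (not necessarily the formula used in the paper) is the ``design effect'': with m test characters per writer and intra-writer correlation rho, a nominal test set of n characters behaves like an effective set of only n / (1 + (m - 1) rho) independent examples. A minimal sketch, with hypothetical names and values:

    def effective_test_set_size(n_writers, chars_per_writer, rho):
        """Effective number of independent test characters when errors within
        a writer are correlated with intra-writer correlation rho (standard
        design-effect formula n / (1 + (m - 1) * rho), m examples per writer).
        With rho = 0 this reduces to the nominal size."""
        n = n_writers * chars_per_writer
        m = chars_per_writer
        return n / (1.0 + (m - 1) * rho)

    if __name__ == "__main__":
        # Example: 100 writers x 100 characters, mild within-writer correlation.
        print(effective_test_set_size(100, 100, rho=0.05))  # about 1,680 effective samples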


