According to experiments using data from the NTCIR-12 STC task:
- Variance estimates for evaluation measures were generally accurate even when using as few as 25 topics, provided reasonably stable measures were used.
- When data from fewer teams were used to estimate variances, the estimates became less stable and less accurate, especially for less stable measures such as nG@1. Data from at least 7-9 teams appeared necessary to obtain reliable estimates.
- Variance estimates were more accurate when starting with 100 topics than when starting with only 10, and informational measures had tighter confidence intervals than navigational measures.
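
The findings above concern how a variance estimate behaves as topics and teams are removed. The following is a minimal sketch of that kind of experiment, not the procedure used in the experiments themselves. It assumes the variance in question is the residual variance from a two-way ANOVA without replication over a topic-by-system score matrix, which is a common choice in topic set size design but is not confirmed by the summary above. The function names `residual_variance` and `subsample`, the matrix layout, and the synthetic scores are all hypothetical and for illustration only.

```python
import random
from statistics import mean


def residual_variance(scores):
    """Residual variance of a two-way ANOVA without replication.

    scores[i][j] is the score of system j on topic i (assumed layout).
    """
    n_topics = len(scores)
    n_systems = len(scores[0])
    if n_topics < 2 or n_systems < 2:
        raise ValueError("need at least 2 topics and 2 systems")

    grand = mean(v for row in scores for v in row)
    topic_means = [mean(row) for row in scores]
    system_means = [mean(row[j] for row in scores) for j in range(n_systems)]

    # Sum of squared residuals after removing topic and system effects.
    ss_resid = sum(
        (scores[i][j] - topic_means[i] - system_means[j] + grand) ** 2
        for i in range(n_topics)
        for j in range(n_systems)
    )
    return ss_resid / ((n_topics - 1) * (n_systems - 1))


def subsample(scores, n_topics, n_systems, rng):
    """Keep a random subset of topics (rows) and systems (columns)."""
    topic_idx = rng.sample(range(len(scores)), n_topics)
    system_idx = rng.sample(range(len(scores[0])), n_systems)
    return [[scores[i][j] for j in system_idx] for i in topic_idx]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (NOT real NTCIR scores): 100 topics x 10 systems,
    # built as topic effect + system effect + noise.
    topic_fx = [rng.gauss(0.0, 0.1) for _ in range(100)]
    system_fx = [rng.gauss(0.0, 0.05) for _ in range(10)]
    full = [
        [0.5 + t + s + rng.gauss(0.0, 0.1) for s in system_fx]
        for t in topic_fx
    ]
    print("full matrix:", residual_variance(full))
    # Repeat the estimate on smaller team subsets to see how much it moves.
    for k in (9, 7, 5, 3):
        estimates = [
            residual_variance(subsample(full, 25, k, rng)) for _ in range(20)
        ]
        print(k, "systems:", min(estimates), max(estimates))
```

Comparing the spread of the subsampled estimates against the full-matrix estimate gives a rough picture of how stability degrades as teams are dropped. With real data, the rows and columns would come from per-topic scores of the submitted runs rather than from a random generator.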