This paper examines how statistical hypothesis testing can improve interleaved methods for estimating ranking function outcomes, particularly when the query distribution is long-tailed. It reproduces and evaluates previous work on interleaved methods, proposing two variations of the δab score estimator, stat-pruning and stat-weight, to improve evaluation accuracy. Experiments reveal the benefits and limitations of these methods across different query distributions and highlight how incomplete information makes the original findings difficult to reproduce.