The TBCP-funded global signals in genomic data project is developing methods and software to view and characterise large correlated signals in genomic data and, in the case of batch effects, to correct for them. In the last year, we have developed Python-based software that quickly identifies genomic signals; the next phase is the characterisation of these signals. In parallel, we have finished developing a method to identify and remove batch effects that outperforms existing methods. Although several bodies of work are in development, this talk focuses on the performance and importance of the new batch effect removal algorithm.
This new technique maximises the removal of the structured technical noise known as batch effects, subject to the constraint that the probability of overcorrection is kept to a fraction set by the end-user. This tunability gives control over overcorrection, defined as the removal of genuine biological variance along with batch noise. Overcorrection should be minimised because it artificially deflates within-group variances, which can lead to false positive results. Benchmarking on four datasets against ComBat, the leading technique in current use, we show that the new method is far superior at balancing the removal of batch noise against the preservation of biological signal. Additionally, the new method leaves largely unchanged the one dataset that has no significant batch effect, whereas ComBat reduces the variance of that dataset by over 45%. To assess noise removal, we use “guided-PCA”, a recently published quantifier of batch effects, to show the probability that batch effects remain in the data after correction. To assess signal preservation, we calculate, in each case, the proportion of the original variance that remains in the dataset after correction.
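
The two evaluation measures described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the project's software: the function names (`variance_retained`, `gpca_delta`, `gpca_p_value`) are hypothetical, data are assumed to be arranged as samples by features, and the guided-PCA statistic is written as it is commonly described (the variance along the first batch-guided component divided by the variance along the first ordinary principal component, with a permutation test on the batch labels). The exact settings used in the benchmarks are not given here.

```python
import numpy as np


def variance_retained(original, corrected):
    """Proportion of the original total variance left after correction.

    Both arrays are assumed to be samples x features.
    """
    before = np.var(original, axis=0, ddof=1).sum()
    after = np.var(corrected, axis=0, ddof=1).sum()
    return after / before


def gpca_delta(data, batch):
    """Guided-PCA delta statistic (hypothetical helper).

    Ratio of the variance along the first batch-guided direction to the
    variance along the first unguided principal component; lies in (0, 1].
    """
    batch = np.asarray(batch)
    centred = data - data.mean(axis=0)
    labels = np.unique(batch)
    # Batch indicator matrix Y: samples x batches.
    indicator = (batch[:, None] == labels[None, :]).astype(float)
    # Guided loadings come from the SVD of Y^T X, unguided ones from X.
    _, _, vt_guided = np.linalg.svd(indicator.T @ centred, full_matrices=False)
    _, _, vt_unguided = np.linalg.svd(centred, full_matrices=False)
    var_guided = np.var(centred @ vt_guided[0], ddof=1)
    var_unguided = np.var(centred @ vt_unguided[0], ddof=1)
    return var_guided / var_unguided


def gpca_p_value(data, batch, n_perm=200, seed=0):
    """Permutation p-value for delta, shuffling the batch labels."""
    rng = np.random.default_rng(seed)
    batch = np.asarray(batch)
    observed = gpca_delta(data, batch)
    permuted = np.array(
        [gpca_delta(data, rng.permutation(batch)) for _ in range(n_perm)]
    )
    p_value = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, p_value


# Synthetic illustration: 40 samples, 50 features, two batches with a shift.
rng = np.random.default_rng(1)
batch = np.array([0] * 20 + [1] * 20)
raw = rng.normal(size=(40, 50))
raw[batch == 1] += 3.0  # simulated batch effect

# Naive per-batch mean centring stands in for a real correction method.
corrected = raw.copy()
for label in np.unique(batch):
    corrected[batch == label] -= corrected[batch == label].mean(axis=0)

print("delta before:", gpca_delta(raw, batch))
print("delta after:", gpca_delta(corrected, batch))
print("variance retained:", variance_retained(raw, corrected))
```

The per-batch mean centring in the example is a deliberately crude stand-in used only to exercise the two measures; it is neither the new algorithm nor ComBat.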