Estimating the Number of Clusters in Big Data with the Aligned Box Criterion: Finding the number, k, of clusters in a dataset is a fundamental problem in unsupervised learning. It is also an important business problem, e.g., in market segmentation. Existing approaches include the silhouette measure, the gap statistic, and Dirichlet process clustering. For thirty years, SAS procedures have included the option of using the cubic clustering criterion (CCC) to estimate k. While CCC remains competitive, we propose a significant and original improvement, referred to herein as the aligned box criterion (ABC). Like CCC, ABC is based on a hypothesis-testing framework, but instead of a heuristic measure it uses data-adaptive reference distributions to generate more realistic null hypotheses in a scalable and easily parallelizable manner. We have implemented ABC using SAS’ High Performance Analytics platform, and we achieve state-of-the-art accuracy in the estimation of k.
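The abstract does not spell out how the reference distributions are built, so the following is only a minimal sketch of the general idea, not the SAS implementation. It assumes a gap-statistic-style test in which each null sample is drawn uniformly from the bounding box of the data aligned with its principal axes. The function names (`aligned_box_reference`, `kmeans_wss`, `estimate_k`), the small k-means routine, and the stopping rule (borrowed from the gap statistic) are all hypothetical choices made for illustration.

```python
import numpy as np


def aligned_box_reference(X, rng):
    """Uniform sample from the bounding box of X aligned with its principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T                          # data in principal-axis coordinates
    lo, hi = Z.min(axis=0), Z.max(axis=0)  # box limits along each principal axis
    U = rng.uniform(lo, hi, size=Z.shape)
    return U @ Vt + mean                   # rotate back to the original coordinates


def _sqdist(X, C):
    """Squared Euclidean distances between rows of X and rows of C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)


def kmeans_wss(X, k, rng, n_init=5, n_iter=100):
    """Best within-cluster sum of squares over several k-means++ restarts."""
    n = X.shape[0]
    best = np.inf
    for _ in range(n_init):
        # k-means++ seeding: later centers are drawn with probability
        # proportional to the squared distance to the nearest chosen center.
        C = X[rng.integers(n)][None, :]
        for _ in range(1, k):
            d2 = _sqdist(X, C).min(axis=1)
            C = np.vstack([C, X[rng.choice(n, p=d2 / d2.sum())]])
        # Lloyd iterations; an empty cluster keeps its previous center.
        for _ in range(n_iter):
            labels = _sqdist(X, C).argmin(axis=1)
            new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else C[j] for j in range(k)])
            if np.allclose(new_C, C):
                break
            C = new_C
        best = min(best, _sqdist(X, C).min(axis=1).sum())
    return best


def estimate_k(X, k_max=6, n_ref=10, seed=0):
    """Gap-style estimate of k using aligned-box reference samples (illustrative)."""
    rng = np.random.default_rng(seed)
    ks = range(1, k_max + 1)
    log_w = np.array([np.log(kmeans_wss(X, k, rng)) for k in ks])
    refs = [aligned_box_reference(X, rng) for _ in range(n_ref)]
    log_w_ref = np.array([[np.log(kmeans_wss(R, k, rng)) for k in ks] for R in refs])
    gap = log_w_ref.mean(axis=0) - log_w
    s = log_w_ref.std(axis=0) * np.sqrt(1 + 1 / n_ref)
    # Assumed stopping rule: smallest k with gap(k) >= gap(k+1) - s(k+1).
    for i in range(k_max - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return i + 1, gap
    return k_max, gap
```

Each reference sample and each value of k can be processed independently, which is one way to read the abstract's claim that the approach parallelizes easily. Calling `estimate_k(X)` on an `(n, d)` array returns the estimated k together with the gap curve, so the curve can be inspected rather than trusting the stopping rule alone.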