This document discusses performance limits for machine learning on single machines and on clusters, with an emphasis on the roofline design framework. It argues that new machine learning toolkits are needed: ones that exploit GPU performance, can be customized, and deploy easily for large-scale applications. Benchmarks on multiple machine learning tasks show that BIDMach significantly outperforms other systems in speed, cost, and energy efficiency.
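For readers unfamiliar with the roofline idea, the following is a minimal sketch of the standard roofline bound: attainable throughput is capped either by peak compute or by memory bandwidth times arithmetic intensity, whichever is lower. The function name and the hardware figures below are hypothetical placeholders chosen for illustration; they are not taken from this document or from any BIDMach benchmark.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gb_s: float,
                      intensity_flops_per_byte: float) -> float:
    """Standard roofline bound: the lower of the compute ceiling and
    the memory ceiling (bandwidth times arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)


# Hypothetical hardware figures, for illustration only.
PEAK_GFLOPS = 1000.0     # assumed peak compute throughput
BANDWIDTH_GB_S = 100.0   # assumed memory bandwidth

# Ridge point: the intensity at which the two ceilings meet.
ridge = PEAK_GFLOPS / BANDWIDTH_GB_S

for intensity in (0.25, 1.0, 10.0, 50.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"intensity={intensity:6.2f} flops/byte  "
          f"bound={bound:7.1f} GFLOP/s  ({regime})")
```

Under these assumed numbers, a kernel below the ridge point is limited by memory traffic, so raising its arithmetic intensity helps more than adding compute; above the ridge point the reverse holds.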