This document summarizes Microsoft's experience running YARN at a massive scale of over 40,000 machines to support its Cosmos big data platform. Some key points: - Cosmos processes over 500,000 jobs and 2 million containers per hour with high reliability and CPU utilization. - Scaling YARN to this level required optimizations like scheduler key pruning, time-based locality decay, and asynchronous logging to achieve sub-5 second allocation latencies. - A federated approach was used to partition the large cluster into independent YARN sub-clusters for improved scalability and maintenance. - Ongoing work involves tuning multi-cast policies, opportunistic container utilization, and log management to maximize scal