Maintaining large-scale distributed systems is a herculean task, and Hadoop is no exception. The scale and velocity at which we operate at Rocket Fuel present a unique challenge: we observed five-fold growth in our data, measured in petabytes, and a five-fold increase in our machine count, all in just a year's time. As Hadoop became critical infrastructure at Rocket Fuel, we had to ensure scalability and high availability so that our reporting, data mining, and machine learning could continue to excel. We also had to ensure business continuity with disaster recovery plans in the face of this drastic growth. In this presentation, we will discuss what worked well for us and what we learned (the hard way). Specifically, we will (a) describe how we automated installation and dynamic configuration using Puppet and InfraDB, (b) describe the performance tuning required to scale Hadoop, (c) talk about the good, the bad, and the ugly of scheduling and multi-tenancy, (d) detail some of our hard-fought issues, (e) outline our business continuity and disaster recovery plans, (f) touch upon how we monitor our monster Hadoop cluster, and finally, (g) share our experience running YARN at scale at Rocket Fuel.