At Stitch Fix, we maintain a distributed Kafka Connect cluster running several hundred connectors. Over the years, we've learned invaluable lessons about keeping our connectors going 24/7. As many conference-goers probably know, event-driven applications require a new way of thinking. With this new paradigm come unique operational considerations, which I will delve into. Specifically, this talk will be an overview of:

1) Our deployment model and use case (we have a large distributed Kafka Connect cluster that powers a self-service data integration platform tailored to the needs of our Data Scientists).
2) Our favorite operational tools that we have built to make things run smoothly (the jobs, alerts and dashboards we find most useful, plus a quick rundown of the admin service we wrote that sits on top of Kafka Connect).
3) Our approach to end-to-end integrity monitoring (the tracer bullet system we built to constantly monitor all our sources and sinks).
4) Lessons learned from production issues and painful migrations (why, oh why, did we not use schemas from the beginning? Pausing connectors doesn't do what you think it does... rebalancing is tricky... jar hell problems are a thing of the past: upgrade and use plugin.path!).
5) Future areas of improvement.

The target audience member is an engineer who is curious about Kafka Connect or currently maintains a small to medium-sized Kafka Connect cluster. They should walk away from the talk with increased confidence in using and maintaining a large Kafka Connect cluster, armed with the hard-won experience of our team. For the most part, we've been very happy with our Kafka Connect-powered data integration platform, and we'd love to share our lessons learned with the community in order to drive adoption.