At Cloudflare we are big Kafka adopters, and we run Kafka at a massive scale. We deploy our Kafka-based microservices on Kubernetes, and we have gained some interesting experience in keeping those services operational and avoiding downtime. To do so, we implemented our own smart health checks for microservices that use Kafka. This has made our services much more self-healing, so far less manual intervention is required. Previously, we used to get paged when applications got stuck, and this also led to several customer-impacting incidents. We implemented this in Go, using the Shopify/sarama package, but the same concepts can be applied in other programming languages.