How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Zillow
1. Zillow consolidated multiple messaging systems and data pipelines onto Kafka as a single streaming platform, unifying their data infrastructure.
2. They took a bottom-up approach to gain trust from teams by publishing service level objectives, onboarding non-critical streams quickly, and meeting developers where they were with tools like Terraform.
3. An important lesson was to treat the platform as a product by providing documentation, libraries, and blog posts to make it easy for developers to use.
Technology at Zillow
• AWS Shop
• Many AWS Accounts
• 1000s of Microservices
• 100s of Data Pipelines
• “Right tool for the job”
Notable streaming use cases
• Click Stream Analysis
• Property & Listings Data
• Personalization
• Data Lake ingestion
Kafka Ecosystem as All-in-One
● Better scale for both streaming & pub/sub scenarios
○ Especially for 1:many consumers
● All data is structured & documented through the Schema Registry
● One place for metadata - All data is “owned”
● Multi tenancy for cost efficiency
The Reality of a New Platform Team
● Top down approach is not healthy
● Trust needs to be earned
● Migration is an overhead/technical debt task
● Platform team is a bottleneck
Lesson 1: Gain Trust - Quickly!
● Published SLOs from day 1
○ Availability:
    Partitions under min-ISR = 0
    Successful canary read events
○ Latency:
    P50 & P99 produce
    P99 consume (without remote)
● Onboarded a large (~40 MB/s) non-critical stream
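The SLO math on this slide can be sketched in a few lines. This is a hypothetical illustration, not Zillow's actual monitoring code: latency SLIs as nearest-rank percentiles over produce/consume samples, and availability gated on canary reads succeeding and no partition sitting under min-ISR.

```python
# Hypothetical sketch of the published SLOs: P50/P99 latency percentiles,
# plus availability from canary reads and under-min-ISR partition counts.
# Function names and thresholds are illustrative assumptions.
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def availability(canary_reads_ok, canary_reads_total, partitions_under_min_isr):
    """Availability SLI: zero if any partition is under min-ISR,
    otherwise the fraction of canary read events that succeeded."""
    if partitions_under_min_isr > 0:
        return 0.0
    return canary_reads_ok / canary_reads_total


produce_latencies_ms = [3, 4, 4, 5, 5, 6, 7, 9, 12, 40]
print("P50 produce:", percentile(produce_latencies_ms, 50))  # 5
print("P99 produce:", percentile(produce_latencies_ms, 99))  # 40
print("availability:", availability(999, 1000, 0))           # 0.999
```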
Lesson 2: Meet Developers Where They Are
● For us it was Terraform
● Open-source Kafka Terraform providers exist, but...
● Balance self-service & control
○ Ticketing / Merge Requests?
○ Another approval Flow?
○ Guardrail proxy
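For a sense of what "meeting developers in Terraform" looks like, here is a topic declared with the community Kafka provider the slide alludes to. This is illustrative only; the topic name and settings are assumptions, not Zillow's configuration:

```hcl
# Illustrative topic definition using the open-source (Mongey) Kafka
# Terraform provider; names and limits are example values.
resource "kafka_topic" "clickstream_events" {
  name               = "clickstream.events"
  partitions         = 12
  replication_factor = 3

  config = {
    "retention.ms"   = "604800000" # 7 days
    "cleanup.policy" = "delete"
  }
}
```

The tension the slide raises is that a raw provider like this gives developers full control, so the platform team still needs a guardrail (approval flow or proxy) between the declaration and the cluster.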
Lesson 2: Meet Developers Where They Are
● Terraform is declarative
● K8s Custom Resource Definitions + K8s Operator
● Cluster protection
● Allowed configs
● Metadata
● Devs are in control
● Authz with Namespaces
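A custom resource in this design might look like the following. This is a hypothetical shape for illustration; the apiVersion, kind, and field names are assumptions, not Zillow's actual CRD:

```yaml
# Hypothetical KafkaTopic custom resource: the operator validates it against
# allowed configs, stamps ownership metadata, and uses the namespace as the
# authorization boundary. All identifiers here are illustrative.
apiVersion: streaming.example.com/v1
kind: KafkaTopic
metadata:
  name: clickstream-events
  namespace: team-homes        # namespace doubles as the authz boundary
spec:
  partitions: 12
  config:                      # only operator-allowed configs are accepted
    retention.ms: "604800000"
    cleanup.policy: "delete"
  metadata:
    owner: team-homes
    description: "Click stream events for personalization"
```

The design choice here is that declarative intent stays with developers (like Terraform), while cluster protection moves into the operator's admission logic instead of a manual approval flow.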
Lesson 3: Migration Comes from Value
● No one has time for migration now
● Engaging with the right customers first
○ Very high throughput
○ Many consumers
○ >1MB Message Support
○ Compaction
● Under-the-hood migrations
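The capabilities listed above map to standard Kafka topic settings. The values below are examples, not Zillow's actual configuration:

```python
# Standard Kafka topic-level configs behind two of the selling points above.
# Values are illustrative examples.
large_message_topic = {
    "max.message.bytes": str(2 * 1024 * 1024),  # allow messages >1 MB
}
compacted_topic = {
    "cleanup.policy": "compact",  # retain the latest value per key
}

print(large_message_topic, compacted_topic)
```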
Lesson 4: Platform Is a Product
● Customer (developer) is our North Star
● Documentation
● CLI
● Archetypes
● Abstracted Client Library
● Internal blog posts
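The "abstracted client library" idea can be sketched as a thin wrapper that hides broker and serialization details behind a small platform API. Everything here is a hypothetical illustration (an in-memory dict stands in for Kafka; a real implementation would delegate to an actual Kafka client):

```python
# Hypothetical sketch of an abstracted client library: teams call a small
# platform API while serialization and ownership metadata stay inside the
# wrapper. An in-memory dict stands in for Kafka for this illustration.
import json
from collections import defaultdict


class PlatformProducer:
    """Illustrative wrapper; a real one would delegate to a Kafka producer."""

    def __init__(self, owner: str, transport=None):
        self.owner = owner  # ownership metadata, stamped on every event
        self.transport = transport if transport is not None else defaultdict(list)

    def publish(self, topic: str, event: dict) -> None:
        record = {"owner": self.owner, "payload": event}
        self.transport[topic].append(json.dumps(record).encode())


producer = PlatformProducer(owner="team-homes")
producer.publish("clickstream.events", {"page": "/listing/123"})
print(len(producer.transport["clickstream.events"]))  # 1
```

Keeping ownership metadata inside the wrapper is what makes "all data is owned" enforceable without asking every team to remember it.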
Lesson 5: This Is Where We Failed - Schema Registry
● 1:1 Kafka → Schema Registry environments
● CI/CD for schemas with “RecordNameStrategy”
● Had to support multiple customer envs (e.g. dev/qa in pre-prod)
Can you guess what happened next?
Fail Reason 1 - Schema Sharing
● Shared schemas between customer environments
● An incompatible dev schema change failed QA deserialization
Fail Reason 2 - Porting Data
● Kafka Avro wire format does not have the Schema Registry “ID”
● And access is needed too
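Some context on why porting data is painful (this describes the standard Confluent wire format, an assumption about the setup here, not Zillow-specific detail): registry-aware clients frame each Avro message with a magic byte and a 4-byte schema ID, which is exactly what data ported from outside lacks, and resolving the ID requires access to the registry that issued it.

```python
# The standard Confluent Avro wire format: magic byte 0, a 4-byte big-endian
# Schema Registry ID, then the Avro payload. Plain Avro data ported from
# elsewhere has no such frame and must be re-encoded against the target
# registry.
import struct

MAGIC_BYTE = 0


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro payload with the Confluent wire-format header."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> tuple:
    """Split a framed message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not Confluent-framed data")
    return schema_id, message[5:]


framed = frame(42, b"\x02hi")
print(unframe(framed))  # (42, b'\x02hi')
```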
How We Look at Schema Registry Now
● Still support multiple customer envs
● Single global Schema Registry (think “Maven for Runtime”)
● TopicNameStrategy
● Customer-environment-aware CI/CD for schemas
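The strategy switch matters because of how Schema Registry names subjects (this is standard Confluent behavior, sketched here): RecordNameStrategy keys the subject by the Avro record's fully-qualified name, so every environment writing that record shares one compatibility lineage, while TopicNameStrategy keys it per topic, isolating customer environments from each other.

```python
# Standard Schema Registry subject naming, sketched as pure functions.
# The example record and topic names are illustrative.
def record_name_subject(record_full_name: str) -> str:
    """RecordNameStrategy: subject = the record's fully-qualified name,
    shared across every topic and environment that uses the record."""
    return record_full_name


def topic_name_subject(topic: str, is_key: bool = False) -> str:
    """TopicNameStrategy: subject = <topic>-key or <topic>-value,
    so per-environment topics get independent schema lineages."""
    return f"{topic}-{'key' if is_key else 'value'}"


print(record_name_subject("com.example.ClickEvent"))   # com.example.ClickEvent
print(topic_name_subject("dev.clickstream"))           # dev.clickstream-value
print(topic_name_subject("qa.clickstream"))            # qa.clickstream-value
```

With TopicNameStrategy, the dev and qa topics above evolve independently, which is what prevents the cross-environment breakage from Fail Reason 1.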
Summary
● Gain trust - quickly
● Meet developers where they are
● Bottom up migrations - show value
● Platform as a Product
● Stick to a single Schema Registry