1
How Zillow Unlocked Kafka to 50 Teams in 8 Months
2
Property Managers & Landlords · Buyers & Sellers · Renters · Homeowners · Real Estate Agents · Mortgage Providers
“Give people the power to unlock life’s next chapter”
The Living Database of All Homes
3
Technology at Zillow
• AWS Shop
• Many AWS Accounts
• 1000s of Microservices
• 100s of Data Pipelines
• “Right tool for the job”
Notable streaming use cases
• Click Stream Analysis
• Property & Listings Data
• Personalization
• Data Lake ingestion
4
Messaging and Streaming Data at Zillow: A 2017 Look
Microservice Communication
5
Messaging and Streaming Data at Zillow: A 2017 Look
Data Streaming Platform
6
Proliferation of Streaming Scenarios
Capacity sharing
7
Proliferation of Streaming Scenarios
Data Quality Issues
8
Proliferation of Streaming Scenarios
Long tail costs
9
Proliferation of Streaming Scenarios
Slow to innovate
10
Kafka Ecosystem as All-in-One
● Better scale for both streaming & pub/sub scenarios
○ Especially for 1:many consumers
● All data is structured & documented through the Schema Registry
● One place for metadata - All data is “owned”
● Multi-tenancy for cost efficiency
11
The Reality of a new Platform Team
● A top-down approach is not healthy
● Trust needs to be earned
● Migration is an overhead/technical debt task
● Platform team is a bottleneck
12
Lesson 1: Gain Trust - Quickly!
● Published SLOs from day 1
○ Availability:
  partitions under min-ISR = 0
  successful reads of canary events
○ Latency:
  P50 & P99 Produce
  P99 Consume (without Remote)
● Onboarded a large (~40 MB/s) non-critical stream (canary sketched below)
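A minimal sketch of the read-canary idea behind those SLOs, using the standard Apache Kafka Java clients. The `kafka-canary` topic, group id, broker address, and poll windows are illustrative assumptions, not Zillow's actual setup:

```java
// Produce one timestamped canary event, read it back, and report
// success + end-to-end latency. Run on a schedule, the success flag
// feeds the availability SLO and the delta feeds the latency histograms.
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReadCanary {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "canary-reader");
        consumerProps.put("auto.offset.reset", "latest");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (Producer<String, String> producer = new KafkaProducer<>(producerProps);
             Consumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("kafka-canary"));
            consumer.poll(Duration.ofSeconds(5)); // join the group before producing

            long sentAt = System.currentTimeMillis();
            producer.send(new ProducerRecord<>("kafka-canary", "canary",
                    Long.toString(sentAt)));
            producer.flush();

            ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofSeconds(10));
            // A read within the window counts toward availability;
            // the delta feeds the P50/P99 consume-latency metrics.
            boolean success = !records.isEmpty();
            long latencyMs = success ? System.currentTimeMillis() - sentAt : -1;
            System.out.printf("canary success=%b latencyMs=%d%n", success, latencyMs);
        }
    }
}
```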
13
Lesson 2: Meet Developers Where They Are
● For us it was Terraform
● An open-source Kafka Terraform provider exists, but...
● Balance self-service & control
○ Ticketing / Merge Requests?
○ Another approval flow?
○ Guardrail proxy
14
Lesson 2: Meet Developers Where They Are
● Terraform is declarative
● K8s Custom Resource Definitions + K8s Operator
● Cluster protection
● Allowed configs
● Metadata
● Devs are in control
● Authz with Namespaces (guardrail sketched below)
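A hedged sketch of the kind of guardrail such an operator could apply while reconciling a topic custom resource. The allow-list, partition cap, and namespace-prefix rule are invented for illustration, not Zillow's actual policy:

```java
// Validate a requested topic change against platform guardrails before
// the operator touches the cluster.
import java.util.Map;
import java.util.Set;

public class TopicGuardrail {
    // Only configs the platform team is willing to support are settable.
    private static final Set<String> ALLOWED_CONFIGS =
            Set.of("retention.ms", "cleanup.policy", "max.message.bytes");
    private static final int MAX_PARTITIONS = 100;

    public static void validate(String namespace, String topic,
                                int partitions, Map<String, String> configs) {
        // Authz via namespaces: a team may only manage topics under its prefix.
        if (!topic.startsWith(namespace + ".")) {
            throw new IllegalArgumentException(
                    "topic " + topic + " is outside namespace " + namespace);
        }
        // Cluster protection: cap partition counts per topic.
        if (partitions < 1 || partitions > MAX_PARTITIONS) {
            throw new IllegalArgumentException(
                    "partitions must be between 1 and " + MAX_PARTITIONS);
        }
        for (String key : configs.keySet()) {
            if (!ALLOWED_CONFIGS.contains(key)) {
                throw new IllegalArgumentException("config not allowed: " + key);
            }
        }
        // If validation passes, the operator would apply the change via
        // AdminClient (createTopics / incrementalAlterConfigs).
    }
}
```

Devs stay in control of their resources declaratively, while the operator enforces the cluster-protection rules centrally.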
15
Lesson 3: Migration comes from Value
● No one has time for migration now
● Engaging with the right customers first
○ Very high throughput
○ Many consumers
○ >1MB Message Support
○ Compaction
● Under-the-hood migrations
16
Lesson 4: Platform is a Product
● Customer (Developer) is our North Star
● Documentation
● CLI
● Archetypes
● Abstracted Client Library
● Internal blog posts
17
Lesson 5: This is Where we Failed - Schema Registry
● 1:1 Kafka → Schema Registry environments
● CI/CD for schemas with “RecordNameStrategy”
● Had to support multiple customer envs (e.g., dev/qa in pre-prod)
Can you guess what happened next?
18
Fail Reason 1 - Schema Sharing
● Shared schemas between customer environments
● An incompatible dev schema change broke QA deserialization (reproduced below)
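The failure mode can be reproduced with plain Apache Avro's compatibility checker; the `Listing` record and field change below are invented for illustration:

```java
// With RecordNameStrategy, dev and QA share one subject per record name,
// so an incompatible dev change lands in the schemas QA resolves against.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatCheck {
    public static void main(String[] args) {
        Schema qaReader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Listing\",\"fields\":["
              + "{\"name\":\"price\",\"type\":\"long\"}]}");
        // Incompatible dev change: price becomes a string.
        Schema devWriter = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Listing\",\"fields\":["
              + "{\"name\":\"price\",\"type\":\"string\"}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(qaReader, devWriter);
        // Prints INCOMPATIBLE: the check that, run per customer environment
        // in CI/CD, would have caught the dev change before it broke QA.
        System.out.println(result.getType());
    }
}
```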
19
Fail Reason 2 - Porting Data
● The Kafka Avro wire format embeds a schema ID, but not which Schema Registry issued it
● And access to the originating registry is needed too (wire format sketched below)
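For context, a minimal sketch of Confluent's Avro wire format, which is the crux of the porting problem: the payload carries a magic byte and a 4-byte schema ID, and nothing identifying the registry that issued the ID:

```java
// Extract the schema ID from a Confluent-framed Kafka Avro message.
import java.nio.ByteBuffer;

public class WireFormat {
    public static int schemaId(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        byte magic = buf.get();          // byte 0: magic byte, always 0x0
        if (magic != 0x0) {
            throw new IllegalArgumentException("not Confluent wire format");
        }
        return buf.getInt();             // bytes 1-4: registry-local schema ID
        // The remaining bytes are the Avro-encoded payload; decoding them
        // requires resolving this ID against the registry that assigned it.
    }
}
```

Ported between environments with per-environment registries, that ID may not exist, or may point at a different schema, in the destination registry.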
20
How we Look at Schema Registry Now
● Still support multiple customer envs
● Single global Schema Registry (think “Maven for Runtime”)
● TopicNameStrategy
● Customer-environment-aware CI/CD for schemas (config sketched below)
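A sketch of what producer configuration might look like under this setup, using Confluent's KafkaAvroSerializer; the registry URL is a placeholder:

```java
// One global registry plus TopicNameStrategy: subjects follow topics
// ("<topic>-value"), so each customer environment's topics get
// independent subjects instead of sharing one per record name.
import java.util.Properties;

public class ProducerConfigSketch {
    public static Properties props() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Single global registry, shared by every customer environment.
        p.put("schema.registry.url", "https://schema-registry.example.com");
        // Default strategy, stated explicitly for contrast with the
        // RecordNameStrategy setup that failed.
        p.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.TopicNameStrategy");
        return p;
    }
}
```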
21
Summary
● Gain trust - quickly
● Meet developers where they are
● Bottom up migrations - show value
● Platform as a Product
● Stick to a single Schema Registry