"Incorrect data produced into Kafka can be a poison pill that has the potential to disrupt businesses built upon Kafka. The “Semantic Validation” feature is designed to address the challenges posed by incorrect or unexpected data in Kafka’s data processing pipelines, with the goal of mitigating such disruptions. By allowing users to define robust field constraints directly within schemas, such as Avro, we aim to enhance data quality and minimize the downstream impacts of inaccurate data in Kafka.
Furthermore, this feature can be expanded to include offline data processing, in addition to Kafka and Flink real-time processing. By combining real-time processing, batch analytics, and AI data pipelines, a global semantic validation system can be built.
In our upcoming talk, we will delve into the use cases of this feature, discuss its architecture, provide examples of defining rules, and explain how we enforce these rules. Ultimately, we will demonstrate how this feature can significantly enhance reliability and trustworthiness in Uber’s data processing pipelines."
3. Uber | Kafka London Summit 2024
Agenda
● Uber Kafka & Data Lake architecture
● Motivation
● Semantic Validation
● Use cases in both Streaming and Data Lake
● Future work
4. Uber Streaming & Data Lake Architecture
[Architecture diagram] Ingestion: events, telemetry, and feeds from ~1000 services and online storage flow into Kafka and the Data Lake. Compute fabric: real-time analytics (in-memory storage, Pinot), stream processing, batch analytics, complex processing, and interactive ETL. Data platform & tools: data workflows (Piper, uWorc), BI tools (QueryBuilder, Dashbuilder), metadata platform (Databook, Quality, Lineage), security, and a global data warehouse.
6. Corrupted Data Is a Poison Pill
● Catastrophic impact to the business
● Difficult to detect in a timely manner
● The recovery process is costly
7. Semantic Validation
What is Semantic Validation?
It verifies the content of the data being transmitted through Kafka topics.
Example types of constraints (a hedged schema sketch follows below):
● Number constraints:
○ e.g., payment amount, age
● String constraints:
○ e.g., product name length, address format
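The talk does not show the exact constraint syntax on this slide. As a minimal sketch, constraints like these could be attached to Avro fields as custom schema properties, which Avro preserves and exposes via getProp. The property names (semantic.min, semantic.max, semantic.maxLength, semantic.pattern) are hypothetical illustrations, not Uber's actual naming.

```java
// Sketch: semantic constraints expressed as custom Avro field properties.
// Avro preserves non-reserved field attributes, so they can be read back
// with Schema.Field#getProp. Property names here are hypothetical.
import org.apache.avro.Schema;

public class ConstraintSchemaExample {
    // A payment record with a number constraint on "amount" and
    // string constraints on "productName" and "address".
    static final String SCHEMA_JSON = """
        {
          "type": "record",
          "name": "Payment",
          "fields": [
            {"name": "amount", "type": "double",
             "semantic.min": "0", "semantic.max": "100000"},
            {"name": "productName", "type": "string",
             "semantic.maxLength": "256"},
            {"name": "address", "type": "string",
             "semantic.pattern": "^[A-Za-z0-9 ,.#-]+$"}
          ]
        }""";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        for (Schema.Field f : schema.getFields()) {
            System.out.println(f.name()
                + " min=" + f.getProp("semantic.min")
                + " maxLength=" + f.getProp("semantic.maxLength")
                + " pattern=" + f.getProp("semantic.pattern"));
        }
    }
}
```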
8. Design Goals
● Platform integration & reusability
○ Consistent with the existing schema evolution flow.
○ Centralize validation flows.
● User customization
○ Give users the flexibility to customize validation behavior and configure alerting.
● Timely detection
○ Validate on the producer side before data enters Kafka.
9. Current Enforcement Limitations
● Limitations of current checks:
○ Relying on application-level code checks to verify data integrity can be insufficient.
○ Validations in code are often implemented downstream as reactive fixes after an outage.
● Absence of built-in support in Avro:
○ Avro lacks native mechanisms for expressing semantic constraints within schemas.
○ Custom validation outside Avro leads to inconsistency and complexity in data pipelines.
10. Architecture
- Teams can easily access their schemas and update constraints.
- Application services depend on the producer client to fetch the schema and validate records.
- The validator emits metrics for failed records, and the monitoring system sends out alerts (see the sketch below).
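A minimal sketch of the producer-side flow just described. SchemaServiceClient, SemanticValidator, and Metrics are hypothetical stand-ins for the components named on the slide; the actual Uber client internals were not shown.

```java
// Sketch of the producer-side validation flow (hypothetical helper types).
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class ValidatingProducer {
    interface SchemaServiceClient { Schema latestSchema(String topic); }
    interface Metrics { void increment(String name, String... tags); }
    interface SemanticValidator {
        record Violation(String fieldName, String reason) {}
        List<Violation> check(Schema schema, GenericRecord record);
    }

    private final SchemaServiceClient schemaService; // fetches schema + constraints
    private final SemanticValidator validator;       // interprets constraint rules
    private final Metrics metrics;                   // feeds monitoring/alerting

    ValidatingProducer(SchemaServiceClient s, SemanticValidator v, Metrics m) {
        this.schemaService = s; this.validator = v; this.metrics = m;
    }

    /** Validate a record against its topic's schema before handing it to Kafka. */
    void send(String topic, GenericRecord record) {
        Schema schema = schemaService.latestSchema(topic);
        for (SemanticValidator.Violation violation : validator.check(schema, record)) {
            // One metric per failed field; the monitoring system alerts on it.
            metrics.increment("semantic_validation_failure",
                "topic", topic, "field", violation.fieldName());
        }
        // ... hand the record to the underlying Kafka producer ...
    }
}
```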
11. UI & Schema Evolution
● Users create constraints on fields
● The frontend validates the constraint format
● A constraint change triggers a schema version change
13. Reusing Constraints
● Predefined constraints: the address regex, for example, is predefined in the schema backend (a sketch follows below).
● Object-level constraints (future plan): adding custom constraints for a shared object (e.g., BillingEntry) allows centralized validation of the same object across schemas. The object-level validation design is a work in progress.
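As an illustration of the predefined-constraint idea, a named constraint such as an address regex could live in a backend-maintained registry that fields reference by name. The registry and the name "address_format" below are hypothetical.

```java
// Sketch: a registry of reusable, named constraints kept in the schema
// backend; fields reference a constraint by name instead of inlining it.
import java.util.Map;
import java.util.regex.Pattern;

public class PredefinedConstraints {
    // Backend-maintained registry of named, reusable constraints (hypothetical).
    static final Map<String, Pattern> REGISTRY = Map.of(
        "address_format", Pattern.compile("^[A-Za-z0-9 ,.#-]+$"));

    /** Returns true if the value satisfies the referenced constraint. */
    static boolean check(String constraintRef, String value) {
        Pattern p = REGISTRY.get(constraintRef);
        return p == null || p.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(check("address_format", "123 Main St, Apt #4")); // true
        System.out.println(check("address_format", "bad|address"));         // false
    }
}
```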
14. Encoding and Validation
● Validation happens during encoding (a sketch follows below)
● Different rules apply to each data type
● A sampling mechanism bounds the overhead
● Per-record encoding P99 latency is ~130 μs with validation vs. ~100 μs without
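A minimal sketch of validation hooked into Avro encoding behind a sampling gate. The sampling knob, property names, and rule logic are assumptions; the slide only states that sampling exists and quotes the ~130 μs vs ~100 μs P99 numbers.

```java
// Sketch: validate a sampled fraction of records during Avro encoding.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SamplingEncoder {
    private final double sampleRate; // e.g. 0.1 => validate ~10% of records

    public SamplingEncoder(double sampleRate) { this.sampleRate = sampleRate; }

    public byte[] encode(Schema schema, GenericRecord record) throws IOException {
        // Validate only a sample of records to bound the added latency.
        if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
            validate(schema, record); // per-type rules (number, string, ...)
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    private void validate(Schema schema, GenericRecord record) {
        for (Schema.Field f : schema.getFields()) {
            Object value = record.get(f.pos());
            String min = f.getProp("semantic.min");          // hypothetical prop name
            if (min != null && value instanceof Number n
                    && n.doubleValue() < Double.parseDouble(min)) {
                // The real system would emit a metric/alert rather than throw.
                throw new IllegalArgumentException(
                    "Field " + f.name() + " below min: " + value);
            }
            String maxLen = f.getProp("semantic.maxLength"); // hypothetical prop name
            if (maxLen != null && value instanceof CharSequence s
                    && s.length() > Integer.parseInt(maxLen)) {
                throw new IllegalArgumentException(
                    "Field " + f.name() + " exceeds max length: " + s.length());
            }
        }
    }
}
```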
15. Open Questions #1
Should we drop the bad data directly?
Here are the trade-offs of each option:
○ Drop invalid data: prevents bad data, but causes data loss
○ Alert only: non-disruptive, but won't stop polluted data from flowing in
○ Set up a DLQ for the producer: increases maintenance cost
○ Insert a new header: delegates identifying polluted data to consumers
Decision: discarding data directly is an opt-in configuration; otherwise, for the first phase, we only create alerts (a sketch follows below).
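A minimal sketch of that decision as configuration. The policy enum and handler names are hypothetical illustrations, not Uber's actual configuration surface.

```java
// Sketch: opt-in drop behavior; alert-only is the phase-1 default.
public class ViolationHandler {
    public enum Policy { ALERT_ONLY, DROP }   // phase-1 default: ALERT_ONLY

    private final Policy policy;

    public ViolationHandler(Policy policy) { this.policy = policy; }

    /** Returns true if the record should be discarded instead of produced. */
    public boolean onViolation(String topic, String field) {
        emitFailureMetric(topic, field);      // alerting happens in every mode
        return policy == Policy.DROP;         // discard only if the user opted in
    }

    private void emitFailureMetric(String topic, String field) {
        System.err.printf("semantic_validation_failure topic=%s field=%s%n",
            topic, field);
    }
}
```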
16. Open Questions #2
Backward compatibility for constraint updates:
Day 0: the user sets a range constraint of (0-100).
Day 1: the user updates the constraint to (0-90).
Data with a value of 95 is no longer considered valid. Do we allow this change when the user updates the schema?
- If a topic has multiple producers, the one on the latest schema may start triggering more violation errors, causing inconsistency.
- We decided to allow this in the first phase, but warn users when they update the schema (a sketch follows below).
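A minimal sketch of that phase-1 behavior: the tightened range is accepted, but the user is warned. The Range type and message wording are illustrative.

```java
// Sketch: allow constraint tightening on schema update, but warn the user.
public class ConstraintCompatCheck {
    record Range(double min, double max) {}

    /** Allow the update, but warn when the new range is narrower than the old. */
    static void checkRangeUpdate(String field, Range oldR, Range newR) {
        if (newR.min() > oldR.min() || newR.max() < oldR.max()) {
            System.out.printf(
                "WARNING: tightening %s from [%s, %s] to [%s, %s]; existing data "
                + "and producers on the old schema may now trigger violations.%n",
                field, oldR.min(), oldR.max(), newR.min(), newR.max());
        }
        // The update itself is still accepted in phase 1.
    }

    public static void main(String[] args) {
        // The slide's example: (0-100) tightened to (0-90), so 95 becomes invalid.
        checkRangeUpdate("score", new Range(0, 100), new Range(0, 90));
    }
}
```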
17. Semantic Validation for both Online and Offline
- Offline paths can extend the validator logic at consumption time.
- This allows each consumer pipeline the flexibility to configure different behavior (a sketch follows below).
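A minimal sketch of reusing the validator on the consume side, with a hypothetical per-pipeline behavior switch.

```java
// Sketch: each consumer/offline pipeline picks its own handling of bad records.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class OfflineValidation {
    enum OnViolation { SKIP_RECORD, QUARANTINE, FAIL_JOB } // per-pipeline choice

    interface Validator { boolean isValid(Schema schema, GenericRecord record); }

    /** Run the shared validator over a batch with pipeline-specific behavior. */
    static void process(Validator validator, Schema schema,
                        Iterable<GenericRecord> batch, OnViolation behavior) {
        for (GenericRecord record : batch) {
            if (!validator.isValid(schema, record)) {
                switch (behavior) {
                    case SKIP_RECORD -> { /* drop and continue */ }
                    case QUARANTINE -> writeToQuarantine(record);
                    case FAIL_JOB -> throw new IllegalStateException("bad record");
                }
                continue;
            }
            // ... normal batch/offline processing ...
        }
    }

    static void writeToQuarantine(GenericRecord record) { /* e.g. side output */ }
}
```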
18. Limitations
Sampling cannot guarantee thorough validation. Mitigations:
● Backpressure based on real-time capacity, to maximize the sample size at low latency
● Progressive validation when error patterns start trending (a sketch follows below)
● An auditing service that consumes the topic and performs comprehensive validation
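A minimal sketch of the progressive idea: raise the sample rate when recent violations trend up, decay it when the stream looks clean. The thresholds and step sizes are illustrative assumptions.

```java
// Sketch: adapt the validation sample rate to the recent error rate.
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class ProgressiveSampler {
    private volatile double sampleRate = 0.05;         // baseline: 5%
    private final AtomicLong checked = new AtomicLong();
    private final AtomicLong failed = new AtomicLong();

    boolean shouldValidate() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }

    void recordResult(boolean valid) {
        long c = checked.incrementAndGet();
        if (!valid) failed.incrementAndGet();
        if (c >= 1_000) {                              // re-evaluate every 1k samples
            double errorRate = failed.getAndSet(0) / (double) c;
            checked.set(0);
            // Escalate toward full validation while errors trend up; decay otherwise.
            sampleRate = errorRate > 0.01
                ? Math.min(1.0, sampleRate * 2)
                : Math.max(0.05, sampleRate / 2);
        }
    }
}
```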
19. Future Work
● Productionize it
● Upstream to OSS
● Dynamic sampling
● Comprehensive auditing
● Reusable constraints, cross-field constraints
20. Q & A
Send questions to: shangxinli@apache.org